Upload
britton-arnold
View
216
Download
0
Tags:
Embed Size (px)
Citation preview
Data Hazard Solution 2: Data Forwarding Our naïve pipeline would experience many data stalls
Register isn’t written until completion of write-back stage Source operands are read from register file in decode stage
Values need to be in register file at start of stage Leads to many more stall cycles than necessary
Key observation The value we want is generated in execute or memory stage It is “available” 1-2 cycles before write-back
Trick: go get it! Pass value directly from stage of generating instruction to decode
stage Must be available before end of decode stage to avoid stall
Detecting Stall Condition
0x000: irmovl $10,%edx
1 2 3 4 5 6 7 8 9
F D E M W
0x006: irmovl $3,%eax F D E M W
0x00c: nop F D E M W
bubble
F
E M W
0x00e: addl %edx,%eax D D E M W
0x010: halt F D E M W
10# demo-h2.ys
F
F D E M W0x00d: nop
11
Cycle 6
W
D
•••
W_dstE = %eaxW_valE = 3
srcA = %edxsrcB = %eax
Data Forwarding Example
irmovl in write-back stage Destination value in W pipeline
register “Forward” as valB for decode stage addl instruction can proceed
without stalling When do we actually know
the values of %eax and %edx?
0x000: irmovl $10,%edx
1 2 3 4 5 6 7 8 9
F D E M WF D E M W
0x006: irmovl $3,%eax F D E M WF D E M W
0x00c: nop F D E M WF D E M W
0x00d: nop F D E M WF D E M W
0x00e: addl% edx ,%eax F D E M WF D E M W
0x010: halt F D E M WF D E M W
10# demo-h2.ys
Cycle 6
W
R[%eax] f 3
D
valA
f R[%edx] = 10valB
f W_ valE
= 3
•••
W_dstE= % eaxW_valE = 3
srcA= % edxsrcB= % eax
Data Forwarding Example #2
Register %edx Generated by ALU during previous
cycle Forward from memory as valA
Register %eax Value just generated by ALU Forward from execute as valB
Forwarding Hardware
Feedback paths from E, M, and W registers to decode stage
Logic blocks to select source for valA and valB in decode stage
Note: we either do forwarding or stall on data hazards
Forwarding has better performance, higher cost
M
F
D
Instructionmemory
Instructionmemory
PCincr.PC
incr.
Registerfile
Registerfile
CCCC ALUALU
Datamemory
Datamemory
SelectPC
rB
dstE dstM
ALUA
ALUB
Mem.control
Addr
srcA srcB
read
write
ALUfun.
Fetch
Decode
Execute
Memory
Writeback
data out
data in
BM
E
M_valA
W_valE
W_valM
W_valE
M_valA
W_valM
f_pc
PredictPC
Cndicode valE valA dstEdstM
E icode ifun valC valA valB dstE dstM srcA srcBstat
valC valPicodeifun rA
predPC
d_srcBd_srcA
e_Cnd
M_Cnd
Sel+FwdA
FwdB
W icode valE valM dstEdstM
m_valM
W_valM
M_valE
e_valE
stat
stat
dstE
stat
statimem_errorinstr_valid
dmem_error
e_dstE
m_stat
stat
stat
Stat PIPE
## Actions of “Sel+Fwd A” block ## Pick the correct A value## Order is important!int new_E_valA = [ # Use incremented PC
D_icode in {ICALL,IJXX} : D_valP; # Forward valE from execute
d_srcA == E_dstE : e_valE; # Forward valM from memory
d_srcA == M_dstM : m_valM; # Forward valE from memory
d_srcA == M_dstE : M_valE; # Forward valM from write back
d_srcA == W_dstM : W_valM; # Forward valE from write back
d_srcA == W_dstE : W_valE; # Use value read from register file 1 : d_rvalA;];
M
F
D
Instructionmemory
Instructionmemory
PCincr.PC
incr.
Registerfile
Registerfile
CCCC ALUALU
Datamemory
Datamemory
SelectPC
rB
dstE dstM
ALUA
ALUB
Mem.control
Addr
srcA srcB
read
write
ALUfun.
Fetch
Decode
Execute
Memory
Writeback
data out
data in
BM
E
M_valA
W_valE
W_valM
W_valE
M_valA
W_valM
f_pc
PredictPC
Cndicode valE valA dstEdstM
E icode ifun valC valA valB dstE dstM srcA srcBstat
valC valPicodeifun rA
predPC
d_srcBd_srcA
e_Cnd
M_Cnd
Sel+FwdA
FwdB
W icode valE valM dstEdstM
m_valM
W_valM
M_valE
e_valE
stat
stat
dstE
stat
statimem_errorinstr_valid
dmem_error
e_dstE
m_stat
stat
stat
Stat
Forwarding Control
PIPE
At clock cycle 4d_srcA =d_srcB =E_dstE =e_valE =M_dstE =M_valE =
What are the forwarding conditions?
Highlight the data path in D, E, M stages.
M
F
D
Instructionmemory
Instructionmemory
PCincr.PC
incr.
Registerfile
Registerfile
CCCC ALUALU
Datamemory
Datamemory
SelectPC
rB
dstE dstM
ALUA
ALUB
Mem.control
Addr
srcA srcB
read
write
ALUfun.
Fetch
Decode
Execute
Memory
Writeback
data out
data in
BM
E
M_valA
W_valE
W_valM
W_valE
M_valA
W_valM
f_pc
PredictPC
Cndicode valE valA dstEdstM
E icode ifun valC valA valB dstE dstM srcA srcBstat
valC valPicodeifun rA
predPC
d_srcBd_srcA
e_Cnd
M_Cnd
Sel+FwdA
FwdB
W icode valE valM dstEdstM
m_valM
W_valM
M_valE
e_valE
stat
stat
dstE
stat
statimem_errorinstr_valid
dmem_error
e_dstE
m_stat
stat
stat
Stat
Forwarding ExamplePIPE
At clock cycle 4d_srcA = ecxd_srcB = edxM_dstE = edxm_valE = 128E_dstE = ecxe_valE = 3
What are the forwarding conditions?
M
F
D
Instructionmemory
Instructionmemory
PCincr.PC
incr.
Registerfile
Registerfile
CCCC ALUALU
Datamemory
Datamemory
SelectPC
rB
dstE dstM
ALUA
ALUB
Mem.control
Addr
srcA srcB
read
write
ALUfun.
Fetch
Decode
Execute
Memory
Writeback
data out
data in
BM
E
M_valA
W_valE
W_valM
W_valE
M_valA
W_valM
f_pc
PredictPC
Cndicode valE valA dstEdstM
E icode ifun valC valA valB dstE dstM srcA srcBstat
valC valPicodeifun rA
predPC
d_srcBd_srcA
e_Cnd
M_Cnd
Sel+FwdA
FwdB
W icode valE valM dstEdstM
m_valM
W_valM
M_valE
e_valE
stat
stat
dstE
stat
statimem_errorinstr_valid
dmem_error
e_dstE
m_stat
stat
stat
Stat
Forwarding ExamplePIPE
Limitation of Forwarding
Load-use dependency Value needed by end of decode
stage in cycle 7 Value read from memory in
memory stage of cycle 8 Terminology
This is a load hazard Only solution is a load stall Preferred: compiler avoid
Load/Use Hazard: Desired Behavior
Best we can do in hardware Stall reading instruction for one cycle Then forward value from memory stage
Better yet Have compiler avoid in code it generates
Addressing Load/Use Hazard
M
F
D
Instructionmemory
Instructionmemory
PCincr.PC
incr.
Registerfile
Registerfile
CCCC ALUALU
Datamemory
Datamemory
SelectPC
rB
dstE dstM
ALUA
ALUB
Mem.control
Addr
srcA srcB
read
write
ALUfun.
Fetch
Decode
Execute
Memory
Writeback
data out
data in
BM
E
M_valA
W_valE
W_valM
W_valE
M_valA
W_valM
f_pc
PredictPC
Cndicode valE valA dstEdstM
E icode ifun valC valA valB dstE dstM srcA srcBstat
valC valPicodeifun rA
predPC
d_srcBd_srcA
e_Cnd
M_Cnd
Sel+FwdA
FwdB
W icode valE valM dstEdstM
m_valM
W_valM
M_valE
e_valE
stat
stat
dstE
stat
statimem_errorinstr_valid
dmem_error
e_dstE
m_stat
stat
stat
Stat
Detection Previous instr. is loading from
memory to src register dstM in E register matches
srcA or srcB (and not 0xF) Action
Stall instruction in decode
Interrupts and Exceptions Basic interrupt mechanism
…instr iinstr i+1instr i+2instr i+3instr i+4instr i+5instr i+6…
CPU runningcurrent process Event occurs that
needs attention
(e.g., Disc read finishes)HW asserts CPU
interrupt line
Control transferred to interrupt handler(think HW-induced
function call)
instr 1instr 2instr 3…
Handler
How is state of interrupted process saved?
How is location of handler determined?
Interrupt Handling Calling handler
Save return address (PC) on stack Address of next instruction to be executed for this process
– Depending on event, either current or next instruction PC usually passed through pipeline along with instruction Precise exception: all instructions to PC executed, none past (“Clean Break”)
Jump to handler address Usually obtained from table stored at fixed memory address Index to table entry determined by exception type/interrupt priority level Interrupt vector table written by software, accessed by hardware
Implementation Critical for real hardware Seldom implemented in simulators: no OS running to pass control to!
Exceptions Events occurring within processor under which pipeline cannot
continue normal operation Possible causes
Halt instruction executed (Current)
Bad address for instruction or data (Previous)
Invalid instruction (Previous)
Pipeline control error (Previous)
System calls, page faults, math errors (not in Y86) Desired action
Complete instructions to specific point Either current or previous (depends on exception type)
Discard instructions that follow Transfer control to exception handler in OS
Save return address, get handler address from table
Exception Examples Detect in fetch stage
irmovl $100,%eax rmmovl %eax,0x10000(%eax) # invalid address (for Y86 tools)
jmp $-1 # Invalid jump target
.byte 0xFF # Invalid instruction code
halt # Halt instruction
Detect in memory stage
Exceptions in Pipeline Processor (#1)
Desired behavior rmmovl should cause exception (1st in sequential machine) Tricky because invalid instruction code detected first
# demo-exc1.ys irmovl $100,%eax rmmovl %eax,0x10000(%eax) # Invalid address nop .byte 0xFF # Invalid instruction code
0x000: irmovl $100,%eax
1 2 3 4
F D E M
F D E0x006: rmmovl %eax,0x10000(%eax)
0x00c: nop
0x00d: .byte 0xFF
F D
F
W
5
M
E
D
Exception detected
Exception detected
Exceptions in Pipeline Processor (#2)
Desired behavior No exception should occur Must match behavior and results of sequential execution
# demo-exc2.ys 0x000: xorl %eax,%eax # Set condition codes 0x002: jne t # Not taken 0x007: irmovl $1,%eax 0x00d: irmovl $2,%edx 0x013: halt 0x014: t: .byte 0xFF # Target
0x000: xorl %eax,%eax
1 2 3
F D E
F D0x002: jne t
0x014: t: .byte 0xFF
0x???: (I’m lost!)
F
Exception detected
0x007: irmovl $1,%eax
4
M
E
F
D
W
5
M
D
F
E
M
E
W
7
W
M
8
W
9
E
D
M
6
W
Correct Exception Handling
Challenges: respond to exceptions in program order, and only those that “really” occur Motivation for exception status field (stat) in pipeline registers Fetch stage sets to either “AOK,” “ADR” (bad fetch address), or “INS”
(illegal instruction) Decode & execute stages pass values through Memory stage either passes through or sets to “ADR” CPU responds to exception only when instruction reaches write back
F predPC
W icode valE valM dstE dstMstat
M Cndicode valE valA dstE dstMstat
E icode ifun valC valA valB dstE dstM srcA srcBstat
D rB valC valPicode ifun rAstat
Avoiding Side Effects
Desired behavior rmmovl should cause exception No following instruction should change any state
Note special challenge of condition codes!
# demo-exc3.ys irmovl $100,%eax rmmovl %eax,0x10000(%eax) # invalid address addl %eax,%eax # Sets condition codes
0x000: irmovl $100,%eax
1 2 3 4
F D E M
F D E0x006: rmmovl %eax,0x10000(%eax)
0x00c: addl %eax,%eax F D
W
5
M
E
Exception detected
Condition code set
Avoiding Side Effects Exception should disable state update for following
instructions When exception detected in memory stage
Disable condition code setting in execute Must happen in same clock cycle
When exception passes to write-back stage Disable memory write in memory stage Disable condition code setting in execute stage
Let’s see how these are handled in PIPE processor
PIPE: Fetch Details Main points
Branch prediction Branch misprediction
recovery Return handling Stat initialization
F
D rB
M_icode
PredictPC
valC valPicode ifun rA
Instructionmemory
Instructionmemory
PCincr.PC
incr.
predPC
Needregids
NeedvalCInstr
valid
AlignAlignSplitSplit
Bytes 1-5Byte 0
SelectPC
M_CndM_valA
W_icode
W_valM
f_pc
stat
stat
imem_error
icode ifun
PIPE: Decode and Write-back
Main points Forwarding logic,
paths Forwarding priority valA and valP
merged
d_srcB
DrB
Eicode ifun valC valA valB dstE dstM srcA srcB
valC valPicode ifun rA
Registerfile
A B
M
E
dstE
dstM
srcA srcB
d_srcA
d_rvalA
Sel+FwdA
FwdB
W_v
alM
W_d
stE
e_dst
E
d_rvalB
e_va
lE
M_d
stE
M_v
alE
M_d
stM
m_v
alM
W_d
stM
W_v
alE
dstE dstM srcA srcB
stat
stat
PIPE: Execute
Main points CC update inhibited
by prior exceptions Values for
forwarding Special handling for
dstE (?)
E
M Cndicode valE valA dstE dstM
icode ifun valC valA valB dstE dstM srcA srcB
CCCC ALUALU
ALUA
ALUB
ALUfun.
SetCC
condcond
e_valEe_Cnde_dstE
stat
stat
dstE
m_statW_stat
PIPE: Memory and Write-back
Main points Values for forwarding Stat update logic Feedback for branch
misprediction recovery
M
W
Addr
icode
data in
M_valA
valE valM dstE dstM
Cndicode valE valA dstE dstM
Datamemory
Datamemory
Mem.read
read
write
data out
Mem.write
M_valE
W_dstM
W_valE
W_valM
W_dstE
W_icode
M_icode
M_dstM
M_dstE
m_valM
M_Cnd
stat
stat
statdmem_error
m_stat
stat
Stat
Pipeline Control: Register ModesRisingclockRisingclock_ _ Output = y
yy
RisingclockRisingclock_ _ Output = x
xx
xxnop
RisingclockRisingclock_ _ Output = nop
Output = xInput = y
stall = 0
bubble= 0
xxNormal
Output = xInput = y
stall = 1
bubble= 0
xxStall
Output = xInput = y
stall = 0
bubble= 1
Bubble
PIPE Control Logic
Handles special cases Handles ret, load/use hazards, misprediction recovery, exceptions Existing PIPE logic handles forwarding, branch prediction
E
M
W
F
D
CCCC
rB
srcAsrcB
icode valE valM dstE dstM
Cndicode valE valA dstE dstM
icode ifun valC valA valB dstE dstM srcA srcB
valC valPicode ifun rA
predPC
d_srcB
d_srcA
e_Cnd
D_icode
E_icode
M_icode
E_dstMPipecontrollogic
D_bubble
D_stall
E_bubble
F_stall
M_bubble
W_stall
set_cc
stat
stat
stat
stat
W_stat
stat
m_stat
PIPE: Actual ret handling
0x000: irmovl Stack,%edx
1 2 3 4 5 6 7 8 9
F D E M W0x006: call proc F D E M W
E M W
10# prog7
0x020: ret
bubble
F D E M W
D
E M W bubble D
E M W bubble D0x00b: irmovl $10,%edx # Return point F D E M W
11
F
F
F
0x000: irmovl Stack,%edx
1 2 3 4 5 6 7 8 9
F D E M W0x006: call proc F D E M W
FE M W
10# prog7
0x020: ret
0x021: rrmovl %edx,%ebx # Not executed
bubble
F D E M W
DF
E M W
0x021: rrmovl %edx,%ebx # Not executed
bubble DF
E M W
0x021: rrmovl %edx,%ebx # Not executed
bubble D0x00b: irmovl $10,%edx # Return point F D E M W
11
Simplified view
What hardwareactually does
PIPE: Actual exception handling Scenario: pushl uses bad memory address
Actions: disable CC, inject bubbles into memory stage, stall write-back
0x000: irmovl $1,%eax
1 2 3 4 5 6 7 8 9
F D E M W
0x006: xorl %esp,%esp #CC = 100 F D E M W
10# prog10
0x008: pushl %eax
0x00a: addl %eax,%eax F D
0x00c: irmovl $2, %eax
F D E M W
E
F D E
W W W
Cycle 6
M
mem_error = 1
E
New CC = 000
set_cc 0
Special Control Cases: Exceptions Detection
Action (on next cycle)
Also: disable setting of condition codes in execute in current cycle
Condition Trigger
Exception m_stat is in {SADR, SINS, SHLT} || W_stat is in {SADR, SINS, SHLT}
Condition F D E M W
Exception normal normal normal bubble stall
Special Control Cases: Non-exceptions Detection
Action (on next cycle)
Condition Trigger
Processing ret IRET in { D_icode, E_icode, M_icode }
Load/Use Hazard E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB }
Mispredicted Branch E_icode = IJXX & !e_Cnd
Condition F D E M W
Processing ret stall bubble normal normal normal
Load/Use Hazard stall stall bubble normal normal
Mispredicted Branch normal bubble bubble normal normal
Pipeline Control, rev. 1.0bool F_stall =
# Conditions for a load/use hazardE_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } ||# Stalling at fetch while ret passes through pipelineIRET in { D_icode, E_icode, M_icode };
bool D_stall = # Conditions for a load/use hazardE_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };
bool D_bubble =# Mispredicted branch(E_icode == IJXX && !e_Bch) ||# Stalling at fetch while ret passes through pipeline IRET in { D_icode, E_icode, M_icode };
bool E_bubble =# Mispredicted branch(E_icode == IJXX && !e_Bch) ||# Load/use hazardE_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB};
How do we know this works?
Analysis: Control Combinations
Special cases that can arise on same clock cycle Combination A
Not-taken branch ret instruction at branch target
Combination B Instruction that reads from memory to %esp Followed by ret instruction
Control Combination A
Should be handled as mispredicted branch Combination will also stall F pipeline register But PC selection logic will be using M_valA anyway Correct action taken!
JXXE
D
M
Mispredict
JXXE
D
M
Mispredict
E
retD
M
ret 1
E
retD
M
ret 1
E
retD
M
ret 1
Combination A
Condition F D E M W
Processing ret stall bubble normal normal normal
Mispredicted branch normal bubble bubble normal normal
Combination stall bubble bubble normal normal
Control Combination B
Would assert both bubble and stall for pipeline register D Should be signaled by processor as pipeline error
Combination not handled correctly in control code 1.0 But it passed many simulation tests; caught only with systematic analysis
LoadE
UseD
M
Load/use
E
retD
M
ret 1
E
retD
M
ret 1
E
retD
M
ret 1
Combination B
Condition F D E M W
Processing ret stall bubble normal normal normal
Load/use hazard stall stall bubble normal normal
Combination stall bubble + stall
bubble normal normal
Control Combination B: Correct Handling
LoadE
UseD
M
Load/use
E
retD
M
ret 1
E
retD
M
ret 1
E
retD
M
ret 1
Combination B
Condition F D E M W
Processing ret stall bubble normal normal normal
Load/use hazard stall stall bubble normal normal
Combination stall stall bubble normal normal
Load/use hazard should get priority ret instruction should be held in decode stage for additional cycle
Corrected Pipeline Control Logic
Load/use hazard should get priority ret instruction should be held in decode stage for additional cycle
Condition F D E M W
Processing ret stall bubble normal normal normal
Load/use hazard stall stall bubble normal normal
Combination stall stall bubble normal normal
bool D_bubble =# Mispredicted branch(E_icode == IJXX && !e_Bch) ||# Stalling at fetch while ret passes through pipeline IRET in { D_icode, E_icode, M_icode } # but not condition for a load/use hazard && !(E_icode in { IMRMOVL, IPOPL }
&& E_dstM in { d_srcA, d_srcB });New
Lesson Learned Extensive and thorough testing is good, but it can’t prove
a design correct Formal verification important, but field not mature
enough for large-scale designs Important and active research area
Performance Metrics Clock rate
Measured in Megahertz or Gigahertz Function of stage partitioning and circuit design
To increase: keep amount of work per stage small Rate at which instructions executed
CPI: cycles per instruction On average, how many clock cycles does each instruction require
(after completion of previous instruction)? CPI a function of pipeline design and the program
How frequently are branches mispredicted? How frequent are load stalls? How frequent are ret instructions?
CPI for PIPE Ideal CPI = 1.0
Fetch instruction each clock cycle Process new instruction every cycle
Although each individual instruction has latency of 5 cycles Actual CPI > 1.0
Due to pipeline stalls, branch mispredictions Computing CPI
C clock cycles I instructions executed to completion B bubbles injected (C = I + B)
CPI = C/I = (I+B)/I = 1.0 + B/I B/I represents average penalty (per instruction) due to bubbles
CPI for PIPE (Cont.)B/I = LP + MP + RP LP: Penalty due to load/use hazard stalling
Fraction of instructions that are loads0.25
Fraction of load instructions requiring stall0.20
Number of bubbles injected each time 1 LP = 0.25 * 0.20 * 1 = 0.05
MP: Penalty due to mispredicted branches Fraction of instructions that are cond. jumps
0.20 Fraction of cond. jumps mispredicted
0.40 Number of bubbles injected each time 2 MP = 0.20 * 0.40 * 2 = 0.16
RP: Penalty due to ret instructions Fraction of instructions that are returns
0.02 Number of bubbles injected each time 3 RP = 0.02 * 3 = 0.06
Net effect of penalties: 0.05 + 0.16 + 0.06 = 0.27 CPI = 1.27 (Not bad! Assumes perfect memories.)
Typical Values
State-of-the-Art Pipelining What have we ignored in our Y86 implementation?
Balancing delay in each stage Which stage is longest, how might we speed it up?
Multicycle instructions Realistic memory systems
Fetch Logic Revisited
During fetch cycle1. Select PC2. Read bytes from
instruction memory3. Examine icode to
determine instruction length
4. Increment PC Timing
Steps 2 & 4 require significant amount of time
F
D rB
M_icode
PredictPC
valC valPicode ifun rA
Instructionmemory
Instructionmemory
PCincr.PC
incr.
predPC
Needregids
NeedvalCInstr
valid
AlignAlignSplitSplit
Bytes 1-5Byte 0
SelectPC
M_CndM_valA
W_icode
W_valM
f_pc
stat
stat
imem_error
icode ifun
Standard Fetch Timing
Must perform everything in sequence: Can’t compute incremented PC until we know value to increment it
with Why is increment slow? How could we speed this up?
Select PC
Mem. Read Increment
need_regids, need_valC
1 clock cycle
A Fast PC Increment Circuit
3-bit adder
need_ValC
need_regids
0
29-bitincre-
menter
MUX
High-order 29 bits
Low-order 3 bits
High-order 29 bits Low-order 3 bits
0 1
PC
incrPC
Slow Fast
carry
1
carry
Modified Fetch Timing
29-bit incrementer Acts as soon as PC selected Output not needed until final MUX Works in parallel with memory read
Select PC
Mem. Read
Incrementer
need_regids, need_valC3-bit add
MUX
1 clock cycle
Standard cycle
State-of-the-Art Pipelining Other issues to consider
More complex instructions: consider FP divide, sqrt Take many cycles to execute Forwarding can’t resolve hazards: more data stalls Important for compiler to schedule code
Deeper pipelines to allow faster cycle times Increased penalty from misprediction, data hazard stalls, etc. Increased emphasis on branch prediction
Actual memory hierarchy issues (will increase CPI) Difficult to complete memory access in one cycle! Possibility of cache misses, TLB misses, page faults
Superscalar/VLIW: process multiple instructions/cycle Dynamic scheduling (discussed in Chapter 5)
Scheduling = determining instruction execution order Hardware decides based on data dependencies, resources
Pentium 4 Pipeline Very deep pipeline
Enables very high clock rates, but 20+ cycle branch penalty Slower than Pentium III for a given clock rate
1 2 3 4 5 6 7 8 9 10 11 12
TC Nxt IP TC Fetch Drive Alloc Rename Que Sch Sch Sch
13 14
Disp Disp
15 16 17 18 19 20
RF Ex Flgs Br Ck DriveRF
Multicycle FP Operations Multiple functional units: one approach
Special-purpose hardware for FP operations Increased latency causes more frequent stalls
can cause “structural” hazards
Single cycle integer unit
F D M W
Fully pipelined multiplier
Non-pipelined divider
Fully pipelined FP adder
Dynamic scheduling Out-of-order execution engine: one view (Pentium 4)
Image from www.xbitlabs. com
Fetching,decoding,translation
of x86 instrs to
uops
to support precise exceptions and recovery from mispredicted branches
Branch Prediction: Simplistic Branch history
Encode history about prior history of each individual branch instruction; store as hash table on instr. address
Use history to predict branch outcome
State machine stores history Each time branch taken, move to left Each time branch not taken, move to right In state Yes*, predict taken; in state No*, predict not taken Can be encoded using 2 bits per table entry
T T T
Yes! Yes? No? No!
NT
T
NT NT
NT
Branch Prediction: Realistic Alpha 21264 “tournament” predictor
Addr. of branch instr
12 bits
4K entries, each 2 bits,
to select predictor
GlobalPredictor
LocalPredictor
12 bit shift register of last branch
outcomes globally
10 bits of branch address
4K entries, standard 2-bit
branch predictors
1K entries, each 10 bits, history of behavior of this
branch
1K entries, each a 3-bit saturation
counter
Predictor size: 8K + 8K + 10K + 3K = 29K bits!
8K
8K
10K 3K
Discussion Questions Data hazards on register values can be dealt with by
stalling or by forwarding. Can hazards occur on condition codes? Can hazards occur on data memory accesses?
Can software be responsible for pipeline correctness? Schedule instructions, use nops DSP chips historically have had exposed pipelines
Relationship of hazards and dependencies Does every data dependence cause a data hazard? Is every data hazard caused by a data dependence?