Chapter 6: Pipelining
Instructor: Mozafar Bag Mohammadi
University of Ilam
Pipelining
Forecast
  Big Picture
  Datapath
  Control
  Data Hazards
    Stalls
    Forwarding
  Control Hazards
  Exceptions
Motivation
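Single cycle implementation
  CPI = 1
  Cycle = imem + RFrd + ALU + dmem + RFwr + muxes + control
  E.g. 500 + 250 + 500 + 500 + 250 + 0 + 0 = 2000 ps
  Time/program = P x 2 ns
$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$
  (code size) x (CPI) x (cycle time)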
Multicycle
Multicycle implementation:
Cycle:  1  2  3  4  5  6  7  8  9  10 11 12 13
Instr:
i       F  D  X  M  W
i+1                    F  D  X
i+2                             F  D  X  M
i+3                                         F
i+4
Multicycle
Multicycle implementation
  CPI = 3, 4, 5
  Cycle = max(memory, RF, ALU, mux, control) = max(500, 250, 500) = 500 ps
  Time/prog = P x 4 x 500 ps = P x 2000 ps = P x 2 ns (taking an average CPI of 4)
Would like:
  CPI = 1 + overhead from hazards (later)
  Cycle = 500 ps + overhead
  In practice, ~3x improvement
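The single-cycle, multicycle, and hoped-for pipelined figures all come from the same product of instruction count, CPI, and cycle time. A minimal sketch, assuming a hypothetical helper `exec_time_ps` and an illustrative instruction count `P` (neither appears in the slides):

```python
def exec_time_ps(instr_count, cpi, cycle_ps):
    """Time/program = instructions x CPI x cycle time (picoseconds)."""
    return instr_count * cpi * cycle_ps

P = 1_000_000  # illustrative instruction count (assumption)

single_cycle = exec_time_ps(P, cpi=1, cycle_ps=2000)
multicycle = exec_time_ps(P, cpi=4, cycle_ps=500)       # average CPI of 4
ideal_pipeline = exec_time_ps(P, cpi=1, cycle_ps=500)   # ignores hazards and latch overhead

print(single_cycle, multicycle, ideal_pipeline)
```

The ideal pipelined number ignores both overhead terms, which is why the slide quotes a smaller improvement in practice.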
Big Picture
Instruction latency = 5 cycles
Instruction throughput = 1/5 instr/cycle
CPI = 5 cycles per instruction
Instead
  Pipelining: process instructions like a lunch buffet
  ALL microprocessors use it
    E.g. Pentium 4, Athlon, PowerPC G5
Big Picture
Instruction latency = 5 cycles (same)
Instruction throughput = 1 instr/cycle
CPI = 1 cycle per instruction
CPI = cycles between instruction completions = 1
Ideal Pipelining
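Bandwidth increases linearly with pipeline depth
Latency increases by latch delays
[Figure: combinational logic of n gate delays followed by a latch (L), BW = ~(1/n); the same logic split into two stages of n/2 gate delays, each followed by a latch, BW = ~(2/n); split into three stages of n/3 gate delays, each followed by a latch, BW = ~(3/n)]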
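The two claims above can be checked with a small model. This is a sketch under stated assumptions: the logic splits evenly across stages, each stage ends in one latch, and the function names and the `latch_gates` parameter are hypothetical.

```python
def stage_delay(n_gates, depth, latch_gates):
    """Delay of one stage: an equal share of the logic plus one latch."""
    return n_gates / depth + latch_gates

def bandwidth(n_gates, depth, latch_gates):
    """Results per gate delay once the pipeline is full."""
    return 1.0 / stage_delay(n_gates, depth, latch_gates)

def latency(n_gates, depth, latch_gates):
    """Gate delays for one item to pass through every stage."""
    return depth * stage_delay(n_gates, depth, latch_gates)

for depth in (1, 2, 3):
    print(depth, bandwidth(60, depth, 2), latency(60, depth, 2))
```

With `latch_gates = 0` the bandwidth is exactly depth/n; any nonzero latch delay makes it slightly lower and adds `depth * latch_gates` to the latency.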
Ideal Pipelining
Cycle:  1  2  3  4  5  6  7  8  9  10 11 12 13
Instr:
i       F  D  X  M  W
i+1        F  D  X  M  W
i+2           F  D  X  M  W
i+3              F  D  X  M  W
i+4                 F  D  X  M  W
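The staircase in the diagram follows one rule: with no stalls, instruction k enters fetch one cycle after instruction k-1. A short sketch (the function names are illustrative, not from the slides):

```python
def total_cycles(n_instr, depth=5):
    """Cycles needed to finish n_instr instructions with no stalls."""
    return depth + n_instr - 1

def diagram(n_instr, stages="FDXMW"):
    """One text row per instruction; instruction k starts in cycle k + 1."""
    return ["   " * k + "  ".join(stages) for k in range(n_instr)]

for row in diagram(5):
    print(row)
print(total_cycles(5))
```

For large instruction counts `total_cycles(n) / n` approaches 1, which is the CPI = 1 claim on the Big Picture slide.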
Pipelining Idealisms
Uniform subcomputations
  Can pipeline into stages with equal delay
Identical computations
  Can fill pipeline with identical work
Independent computations
  No relationships between work units
Are these practical?
  No, but can get close enough to get significant speedup
Complications
Datapath
  Five (or more) instructions in flight
Control
  Must correspond to multiple instructions
Instructions may have data and control flow dependences
  I.e. units of work are not independent
  One may have to stall and wait for another
Datapath (Fig. 6.10)
Datapath (Fig. 6.11)
Control
Control
  Set by 5 different instructions
  Divide and conquer: carry IR down the pipe
MIPS ISA requires the appearance of sequential execution
  True of most general purpose ISAs
Program Dependences
i1: xxxx
i2: xxxx
i3: xxxx
[Figure: instructions i1, i2, i3 and the precedences among them]
The implied sequential precedences are an overspecification. Enforcing them is sufficient but not necessary to ensure program correctness.
A true dependence between two instructions may involve only one subcomputation of each instruction.
Program Data Dependences
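True dependence (RAW)
  j cannot execute until i produces its result
  $D(i) \cap R(j) \neq \emptyset$
Anti-dependence (WAR)
  j cannot write its result until i has read its sources
  $R(i) \cap D(j) \neq \emptyset$
Output dependence (WAW)
  j cannot write its result until i has written its result
  $D(i) \cap D(j) \neq \emptyset$
Here D(k) is the set of locations instruction k writes and R(k) is the set it reads; i precedes j in program order.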
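The three definitions are set intersections, so they translate directly into code. A minimal sketch, assuming each instruction is described by the set of registers it writes and the set it reads (the function name and argument layout are illustrative):

```python
def dependences(i_writes, i_reads, j_writes, j_reads):
    """Dependence types from earlier instruction i to later instruction j."""
    found = []
    if set(i_writes) & set(j_reads):
        found.append("RAW")   # j reads what i writes
    if set(i_reads) & set(j_writes):
        found.append("WAR")   # j writes what i reads
    if set(i_writes) & set(j_writes):
        found.append("WAW")   # both write the same location
    return found

# add $1, $2, $3  followed by  sub $4, $5, $1
print(dependences({"$1"}, {"$2", "$3"}, {"$4"}, {"$5", "$1"}))
```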
Control Dependences
Conditional branches
  Branch must execute to determine which instruction to fetch next
  Instructions following a conditional branch are control dependent on the branch instruction
Example (quicksort/MIPS)
# for (; (j < high) && (array[j] < array[low]); ++j);
# $10 = j
# $9 = high
# $6 = array
# $8 = low
      bge  $10, $9, done
      mul  $15, $10, 4
      addu $24, $6, $15
      lw   $25, 0($24)
      mul  $13, $8, 4
      addu $14, $6, $13
      lw   $15, 0($14)
      bge  $25, $15, done
cont: addu $10, $10, 1
      . . .
done: addu $11, $11, -1
Resolution of Pipeline Hazards
Pipeline hazards
  Potential violations of program dependences
  Must ensure program dependences are not violated
Hazard resolution
  Static: compiler/programmer guarantees correctness
  Dynamic: hardware performs checks at runtime
Pipeline interlock
  Hardware mechanism for dynamic hazard resolution
  Must detect and enforce dependences at runtime
Pipeline Hazards
Necessary conditions:
  WAR: write stage earlier than read stage
    Is this possible in IF-RD-EX-MEM-WB?
  WAW: write stage earlier than write stage
    Is this possible in IF-RD-EX-MEM-WB?
  RAW: read stage earlier than write stage
    Is this possible in IF-RD-EX-MEM-WB?
If conditions not met, no need to resolve
Check for both register and memory
Pipeline Hazard Analysis
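[Figure: pipeline stages IF, ID, RD, ALU, MEM, WB drawn around the register file and its ports (RAdd1, RAdd2, WAdd, RData1, RData2, WData, W/R), with source operands S1, S2 and destination D]
Memory hazards
  RAW: Yes/No?
  WAR: Yes/No?
  WAW: Yes/No?
Register hazards
  RAW: Yes/No?
  WAR: Yes/No?
  WAW: Yes/No?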
RAW Hazard
Earlier instruction produces a value used by a later instruction:
  add $1, $2, $3
  sub $4, $5, $1
Cycle:  1  2  3  4  5  6  7  8  9  10 11 12 13
Instr:
add     F  D  X  M  W
sub        F  D  X  M  W
RAW Hazard - Stall
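Detect dependence and stall:
  add $1, $2, $3
  sub $4, $5, $1
Cycle:  1  2  3  4  5  6  7  8  9  10 11 12 13
Instr:
add     F  D  X  M  W
sub        F  -  -  D  X  M  W
(- marks a stall cycle; sub reads $1 in the cycle add writes it, assuming the register file writes before it reads within a cycle)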
Control Dependence
One instruction affects which executes next
  sw $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8
Cycle:  1  2  3  4  5  6  7  8  9  10 11 12 13
Instr:
sw      F  D  X  M  W
bne        F  D  X  M  W
sub           F  D  X  M  W
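Control Dependence - Stall
Detect dependence and stall
  sw $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8
Cycle:  1  2  3  4  5  6  7  8  9  10 11 12 13
Instr:
sw      F  D  X  M  W
bne        F  D  X  M  W
sub           -  -  -  F  D  X  M  W
(- marks a stall cycle; assumes the branch outcome is known at the end of bne's M stage, a 3-cycle penalty)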
Pipelined Datapath
Start with single-cycle datapath (F6.10)
Pipelined execution
  Assume each instruction has its own datapath
  But each instruction uses a different part in every cycle
  Multiplex all on to one datapath
  Latches separate cycles (like multicycle)
Ignore hazards for now
  Data
  Control
Pipelined Datapath (Fig. 6.12)
[Figure 6.12: pipelined datapath. PC, instruction memory, and the PC + 4 adder feed IF/ID; the register file and the 16-to-32-bit sign extend feed ID/EX; the ALU, shift left 2, and branch-target adder feed EX/MEM; data memory feeds MEM/WB; a mux selects the register write data]
Pipelined Datapath
Instruction flow
  add and load
  Write of registers
  Pass register specifiers
Any info needed by a later stage gets passed down the pipeline
  E.g. store value through EX
Pipelined Control
IF and ID None
EX ALUop, ALUsrc, RegDst
MEM Branch, MemRead, MemWrite
WB MemtoReg, RegWrite
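The grouping above determines what each pipeline register has to carry: a signal is dropped once its stage has used it. A minimal sketch, assuming a hypothetical `carried_after` helper (the signal names come from the slide):

```python
# Control signals grouped by the stage that consumes them.
STAGE_SIGNALS = {
    "EX": ("ALUOp", "ALUSrc", "RegDst"),
    "MEM": ("Branch", "MemRead", "MemWrite"),
    "WB": ("MemtoReg", "RegWrite"),
}
ORDER = ("EX", "MEM", "WB")

def carried_after(stage):
    """Signal groups still held in the pipeline register that follows `stage`."""
    if stage == "ID":
        return ORDER                      # ID/EX holds everything decode produced
    return ORDER[ORDER.index(stage) + 1:]

for stage in ("ID", "EX", "MEM"):
    groups = carried_after(stage)
    print(stage, [sig for g in groups for sig in STAGE_SIGNALS[g]])
```

So ID/EX holds all eight signals, EX/MEM holds five, and MEM/WB holds only MemtoReg and RegWrite.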
Figure 6.25
[Figure 6.25: the pipelined datapath with its control signals identified: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, and MemtoReg, plus the ALU control block and the Instruction[15-0], Instruction[20-16], and Instruction[15-11] fields]
Figure 6.29
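[Figure 6.29: the Control unit decodes the instruction from IF/ID; ID/EX carries the EX, M, and WB control fields, EX/MEM carries M and WB, and MEM/WB carries WB]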
Figure 6.30
[Figure 6.30: the pipelined datapath with the Control unit attached; the EX, M, and WB control fields travel through ID/EX, EX/MEM, and MEM/WB to drive RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, MemtoReg, RegWrite, and PCSrc]
Pipelined Control
Controlled by different instructions
Decode instructions and pass the signals down the pipe
Control sequencing is embedded in the pipeline
  No explicit FSM
  Instead, distributed FSM
Pipelining
Not too complex yet
  Data hazards
  Control hazards
  Exceptions
RAW Hazards
Must first detect RAW hazards
  Pipeline analysis proves that WAR/WAW don’t occur
ID/EX.WriteRegister = IF/ID.ReadRegister1
ID/EX.WriteRegister = IF/ID.ReadRegister2
EX/MEM.WriteRegister = IF/ID.ReadRegister1
EX/MEM.WriteRegister = IF/ID.ReadRegister2
MEM/WB.WriteRegister = IF/ID.ReadRegister1
MEM/WB.WriteRegister = IF/ID.ReadRegister2
RAW Hazards
Not every match is a hazard, because
  WriteRegister not used (e.g. sw)
  ReadRegister not used (e.g. ReadRegister2 for addi, either one for jump)
Do something only if necessary
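The six comparisons and the "only if necessary" qualifiers combine into one check. This is a sketch under assumptions: the `Latch` and `Decode` records and their field names are hypothetical stand-ins for what the pipeline registers hold, not the slides' signal names.

```python
from collections import namedtuple

# Hypothetical view of the fields each pipeline latch exposes.
Latch = namedtuple("Latch", "write_reg reg_write")
Decode = namedtuple("Decode", "read_reg1 read_reg2 uses_reg1 uses_reg2")

def raw_hazard(decode, id_ex, ex_mem, mem_wb):
    """True if the instruction in ID reads a register that an older
    in-flight instruction has not yet written."""
    reads = []
    if decode.uses_reg1:
        reads.append(decode.read_reg1)
    if decode.uses_reg2:
        reads.append(decode.read_reg2)
    for latch in (id_ex, ex_mem, mem_wb):
        # Skip producers that never write a register (e.g. sw).
        if latch.reg_write and latch.write_reg in reads:
            return True
    return False

idle = Latch(write_reg=0, reg_write=False)
sub = Decode(read_reg1=5, read_reg2=1, uses_reg1=True, uses_reg2=True)
print(raw_hazard(sub, Latch(1, True), idle, idle))  # add $1 is one stage ahead
```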
RAW Hazards
Hazard Detection Unit
  Several 5-bit (or 6-bit) comparators
Response? Stall pipeline
  Instructions in IF and ID stay
  IF/ID pipeline latch not updated
  Send ‘nop’ down pipeline (called a bubble)
  PCWrite, IF/IDWrite, and nop mux
RAW Hazard Forwarding
A better response: forwarding
  Also called bypassing
Comparators ensure register is read after it is written
Instead of stalling until write occurs
  Use mux to select forwarded value rather than register value
  Control mux with hazard detection logic
Use temporary results, don’t wait for them to be written
  register file forwarding to handle read/write to same register
  ALU forwarding
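The mux control described above can be sketched as a priority check: prefer the newest in-flight result, then the older one, then the register file. The `Latch` record, the `forward_select` name, and the string labels for the mux inputs are assumptions made for illustration.

```python
from collections import namedtuple

Latch = namedtuple("Latch", "write_reg reg_write")  # hypothetical latch view

def forward_select(src_reg, ex_mem, mem_wb):
    """Choose the source of one ALU operand for the instruction in EX."""
    if ex_mem.reg_write and ex_mem.write_reg == src_reg:
        return "EX/MEM"   # result of the instruction one ahead
    if mem_wb.reg_write and mem_wb.write_reg == src_reg:
        return "MEM/WB"   # result of the instruction two ahead
    return "REG"          # no match: use the register file value

idle = Latch(0, False)
# sub $2, $1, $3 followed by and $12, $2, $5: sub is in EX/MEM when and is in EX.
print(forward_select(2, Latch(2, True), idle))
```

Checking EX/MEM first matters when two in-flight instructions write the same register: the younger result is the one program order requires.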
Forwarding
Program execution order (in instructions):
  sub $2, $1, $3
  and $12, $2, $5
  or $13, $6, $2
  add $14, $2, $2
  sw $15, 100($2)
Time (in clock cycles):  CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:    10    10    10    10    10/-20  -20   -20   -20   -20
Value of EX/MEM:         X     X     X     -20   X       X     X     X     X
Value of MEM/WB:         X     X     X     X     -20     X     X     X     X
Annotation on the figure: what if this $2 was $13?
[Figure: each instruction is drawn as IM, Reg, DM, Reg blocks, starting one clock cycle after the previous instruction]
Forwarding
[Figure: pipelined datapath with a forwarding unit. The Rs, Rt, and Rd fields (IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd) are carried into ID/EX; the forwarding unit compares them with EX/MEM.RegisterRd and MEM/WB.RegisterRd and controls the muxes at the ALU inputs]
Forwarding Paths (ALU instructions)
[Figure: ALU forwarding paths a, b, and c across stages IF, ID, RD, ALU, MEM, WB, where instruction i writes R1 and a later instruction reads R1.
  (i → i+1): forwarding via path a
  (i → i+2): forwarding via path b
  (i → i+3): i writes R1 before i+3 reads R1]
Write before Read RF
Register file design
  2-phase clocks common
  Write RF on first phase
  Read RF on second phase
Hence, same cycle:
  Write $1
  Read $1
No bypass needed
  If read before write or DFF-based, need bypass
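The difference between the two register file designs shows up when one cycle both writes and reads the same register. A behavioral sketch only; the class, its `write_first` flag, and the single `cycle` method are hypothetical modeling choices:

```python
class RegisterFile:
    """32 registers; one write and two reads per cycle."""

    def __init__(self, write_first=True):
        self.regs = [0] * 32
        self.write_first = write_first

    def cycle(self, write_reg, write_data, read_reg1, read_reg2):
        """Perform one cycle's write and reads; return the two read values."""
        if self.write_first:
            self.regs[write_reg] = write_data                    # first phase: write
            return self.regs[read_reg1], self.regs[read_reg2]    # second phase: read
        stale = (self.regs[read_reg1], self.regs[read_reg2])     # reads see old values
        self.regs[write_reg] = write_data
        return stale

print(RegisterFile(write_first=True).cycle(1, -20, 1, 2))
print(RegisterFile(write_first=False).cycle(1, -20, 1, 2))
```

In the second case the reader gets the stale value, which is the situation where a bypass around the register file is needed.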
Implementation of ALU Forwarding
[Figure: register file outputs and the forwarded ALU results pass through muxes (select 1/0) into the ALU inputs; four comparators (Comp) drive the mux selects]
Can't always forward
Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register.
Thus, we need a hazard detection unit to “stall” the instruction that follows the load
Program execution order (in instructions):
  lw $2, 20($1)
  and $4, $2, $5
  or $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
Time (in clock cycles): CC 1 to CC 9
[Figure: each instruction is drawn as IM, Reg, DM, Reg blocks, starting one clock cycle after the previous instruction]
Stalling
We can stall the pipeline by keeping an instruction in the same stage
Program execution order (in instructions):
  lw $2, 20($1)
  and $4, $2, $5
  or $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
Time (in clock cycles): CC 1 to CC 10
[Figure: the same sequence with a bubble inserted after the lw, so the last instruction finishes in CC 10]
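The stall condition for this sequence can be written as one comparison: a load is in EX and the instruction in ID reads the register being loaded. A sketch assuming full ALU forwarding, so that only load-use pairs cost a bubble; the function names and the tuple encoding of instructions are illustrative.

```python
def load_use_stall(id_ex_mem_read, id_ex_dest, if_id_rs, if_id_rt):
    """True if the instruction in ID must wait one cycle for a load in EX."""
    return id_ex_mem_read and id_ex_dest in (if_id_rs, if_id_rt)

def bubbles(program):
    """Count bubbles for a list of (op, dest, src1, src2) tuples."""
    count = 0
    for prev, cur in zip(program, program[1:]):
        if load_use_stall(prev[0] == "lw", prev[1], cur[2], cur[3]):
            count += 1
    return count

program = [
    ("lw", 2, 1, None),   # lw  $2, 20($1)
    ("and", 4, 2, 5),     # and $4, $2, $5
    ("or", 8, 2, 6),      # or  $8, $2, $6
    ("add", 9, 4, 2),     # add $9, $4, $2
    ("slt", 1, 6, 7),     # slt $1, $6, $7
]
print(bubbles(program))
```

Only the `and` directly after the `lw` stalls; by the time `or` and `add` need $2 the loaded value can be forwarded or read from the register file.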
Forwarding Paths(Load instructions)
[Figure: load forwarding paths d and e across stages IF, ID, RD, ALU, MEM, WB, where instruction i loads R1 from MEM[] and a later instruction reads R1.
  (i → i+1): stall i+1
  (i → i+1): forwarding via path d
  (i → i+2): i writes R1 before i+2 reads R1]
Implementation of Load Forwarding
[Figure: the ALU forwarding muxes and comparators extended with a load path from the D-Cache (address in, data out), and a Load signal that stalls IF, ID, and RD]
Control Flow Hazards
Control flow instructions
  branches, jumps, jals, returns
  Can’t fetch until branch outcome known
  Too late for next IF
Control Flow Hazards
What to do?
  Always stall
    Easy to implement
    Performs poorly
    1/6th instructions are branches, each branch takes 3 cycles
    CPI = 1 + 3 x 1/6 = 1.5 (lower bound)
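The CPI figure is the base CPI plus the average stall cycles per instruction. A small sketch using exact fractions; the function name and the `base_cpi` parameter are illustrative.

```python
from fractions import Fraction

def cpi_with_branch_stalls(branch_fraction, penalty_cycles, base_cpi=1):
    """CPI when every branch adds `penalty_cycles` stall cycles."""
    return base_cpi + penalty_cycles * branch_fraction

cpi = cpi_with_branch_stalls(Fraction(1, 6), 3)
print(float(cpi))
```

The same function shows how sensitive the result is to the penalty: a 1-cycle penalty with the same branch frequency gives 7/6.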
Control Flow Hazards
Predict branch not taken
  Send sequential instructions down pipeline
  Kill instructions later if incorrect
  Must stop memory accesses and RF writes
  Late flush of instructions on misprediction
    Complex
    Global signal (wire delay)
When we decide to branch, other instructions are in the pipeline!
We are predicting “branch not taken”
  need to add hardware for flushing instructions if we are wrong
Branch Hazards
Program execution order (in instructions):
  40 beq $1, $3, 7
  44 and $12, $2, $5
  48 or $13, $6, $2
  52 add $14, $2, $2
  72 lw $4, 50($7)
Time (in clock cycles): CC 1 to CC 9
[Figure: each instruction is drawn as IM, Reg, DM, Reg blocks, starting one clock cycle after the previous instruction]
Flushing Instructions
[Figure: pipelined datapath with a hazard detection unit, a forwarding unit, an IF.Flush signal into IF/ID, and a register comparison (=) with the shift left 2 for the branch target next to the register file]
Control Flow Hazards
Even better but more complex
  Predict taken
  Predict both (eager execution)
  Predict one or the other dynamically
    Adapt to program branch patterns
    Lots of chip real estate these days
      Pentium III, 4, Alpha 21264
    Current research topic
Improving Performance
Try and avoid stalls! E.g., reorder these instructions:
  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)
Add a “branch delay slot”
  Always execute following instruction
  “delay slot” (later example on MIPS pipeline)
  Put useful instruction there, otherwise ‘nop’
  rely on compiler to “fill” the slot with something useful
Superscalar: start more than one instruction in the same cycle
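For the reordering exercise above, one possible answer is to swap the two stores so that the store of $t2 no longer sits directly behind the load of $t2. A sketch that counts load-use pairs, assuming a store's data register counts as a use that cannot be forwarded in time; the tuple encoding and function name are illustrative.

```python
def load_use_stalls(program):
    """Count adjacent pairs where an instruction reads the register
    loaded by the instruction immediately before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        _, _, sources = cur
        if op == "lw" and dest in sources:
            stalls += 1
    return stalls

# (op, destination register or None, source registers)
original = [
    ("lw", "$t0", ("$t1",)),         # lw $t0, 0($t1)
    ("lw", "$t2", ("$t1",)),         # lw $t2, 4($t1)
    ("sw", None, ("$t2", "$t1")),    # sw $t2, 0($t1)
    ("sw", None, ("$t0", "$t1")),    # sw $t0, 4($t1)
]
# Swapping the stores is safe here: they write different addresses.
reordered = [original[0], original[1], original[3], original[2]]

print(load_use_stalls(original), load_use_stalls(reordered))
```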
Dynamic Scheduling
The hardware performs the “scheduling”
  hardware tries to find instructions to execute
  out of order execution is possible
  speculative execution and dynamic branch prediction
All modern processors are very complicated
  DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
  PowerPC and Pentium: branch history table
  Compiler technology important
This class has given you the background you need to learn more
Exceptions
Even worse: in one cycle
  I/O interrupt
  User trap to OS (EX)
  Illegal instruction (ID)
  Arithmetic overflow
  Hardware error
  Etc.
Interrupt priorities must be supported