Upload
arron-hardy
View
215
Download
0
Tags:
Embed Size (px)
Citation preview
1
COMP541COMP541
Pipelined MIPSPipelined MIPS
Montek SinghMontek Singh
Apr 9, 2012Apr 9, 2012
2
TopicsTopics Today’s topic: PipeliningToday’s topic: Pipelining
Can think of it as:Can think of it as:A way to parallelize, orA way to parallelize, orA way to make better utilization of the hardware.A way to make better utilization of the hardware.
Goal: Try to use all hardware every clock cycleGoal: Try to use all hardware every clock cycle
ReadingReading Section 7.5 of textbookSection 7.5 of textbook
ParallelismParallelism Two types of parallelism:Two types of parallelism:
Spatial parallelismSpatial parallelismduplicate hardware performs multiple tasks at onceduplicate hardware performs multiple tasks at once
Temporal parallelismTemporal parallelism task is broken into multiple stagestask is broken into multiple stages
– each stage operating on different parts of distinct instructionseach stage operating on different parts of distinct instructionsalso called pipeliningalso called pipelining
– example: an assembly lineexample: an assembly line
Parallelism DefinitionsParallelism Definitions Some definitions:Some definitions:
Token:Token: A group of inputs processed together to A group of inputs processed together to produce a group of outputsproduce a group of outputsa “bundle”a “bundle”
Latency:Latency: Time for one token to pass from start to end Time for one token to pass from start to end Throughput:Throughput: The number of tokens that can be The number of tokens that can be
processed per unit timeprocessed per unit time
Parallelism increases throughputParallelism increases throughput Often sacrificing latencyOften sacrificing latency
Parallelism ExampleParallelism Example Ben is baking cookies Ben is baking cookies
2-part task:2-part task: It takes 5 minutes to roll the cookies…It takes 5 minutes to roll the cookies…… … and 15 minutes to bake themand 15 minutes to bake them
After finishing one batch he immediately starts the After finishing one batch he immediately starts the next batch. What is the latency and throughput if Ben next batch. What is the latency and throughput if Ben does NOT use parallelism?does NOT use parallelism?
Latency = 5 + 15 = 20 min = 1/3 hourLatency = 5 + 15 = 20 min = 1/3 hour
Throughput = 1 tray/ 20 min = 3 Throughput = 1 tray/ 20 min = 3 trays/hourtrays/hour
Parallelism ExampleParallelism Example What is the latency and throughput if Ben uses What is the latency and throughput if Ben uses
parallelism?parallelism? Spatial parallelism: Spatial parallelism: Ben asks Allysa to help, using her Ben asks Allysa to help, using her
own ovenown oven Temporal parallelism:Temporal parallelism: Ben breaks the task into two Ben breaks the task into two
stages: roll and baking. He uses two trays. While the stages: roll and baking. He uses two trays. While the first batch is baking he rolls the second batch, and so first batch is baking he rolls the second batch, and so on.on.
Spatial ParallelismSpatial ParallelismS
pat
ial
Par
alle
lism
Roll
Bake
Ben 1 Ben 1
Alyssa 1 Alyssa 1
Ben 2 Ben 2
Alyssa 2 Alyssa 2
Time
0 5 10 15 20 25 30 35 40 45 50
Tray 1
Tray 2
Tray 3
Tray 4
Latency:time to
first tray
Legend
Latency = ?
Throughput = ?
Spatial ParallelismSpatial ParallelismS
pat
ial
Par
alle
lism
Roll
Bake
Ben 1 Ben 1
Alyssa 1 Alyssa 1
Ben 2 Ben 2
Alyssa 2 Alyssa 2
Time
0 5 10 15 20 25 30 35 40 45 50
Tray 1
Tray 2
Tray 3
Tray 4
Latency:time to
first tray
Legend
Latency = 5 + 15 = 20 min = 1/3 hour (same) Throughput = 2 trays/ 20 min = 6 trays/hour
(doubled)
Temporal ParallelismTemporal ParallelismT
emp
ora
lP
aral
leli
sm
Ben 1 Ben 1
Ben 2 Ben 2
Ben 3 Ben 3
Time
0 5 10 15 20 25 30 35 40 45 50
Latency:time to
first tray
Tray 1
Tray 2
Tray 3
Latency = ?
Throughput = ?
Temporal ParallelismTemporal ParallelismT
emp
ora
lP
aral
leli
sm
Ben 1 Ben 1
Ben 2 Ben 2
Ben 3 Ben 3
Time
0 5 10 15 20 25 30 35 40 45 50
Latency:time to
first tray
Tray 1
Tray 2
Tray 3
Latency = 5 + 15 = 20 min = 1/3 hourThroughput = 1 trays/ 15 min = 4 trays/hour
Using both techniques, the throughput would be 8 trays/hour
Pipelined MIPSPipelined MIPS Temporal parallelismTemporal parallelism Divide single-cycle processor into 5 stages:Divide single-cycle processor into 5 stages:
FetchFetch DecodeDecode ExecuteExecute MemoryMemory WritebackWriteback
Add pipeline registers between stagesAdd pipeline registers between stages
Single-Cycle vs. Pipelined Single-Cycle vs. Pipelined PerformancePerformance PipeliningPipelining
Break instruction into 5 stepsBreak instruction into 5 steps Each is 250 ps long (length of longest step)Each is 250 ps long (length of longest step)
Time (ps)Instr
FetchInstruction
DecodeRead Reg
ExecuteALU
MemoryRead / Write
WriteReg
1
2
0 100 200 300 400 500 600 700 800 900 1100 1200 1300 1400 1500 1600 1700 1800 19001000
Instr
1
2
3
FetchInstruction
DecodeRead Reg
ExecuteALU
MemoryRead / Write
WriteReg
FetchInstruction
DecodeRead Reg
ExecuteALU
MemoryRead/Write
WriteReg
FetchInstruction
DecodeRead Reg
ExecuteALU
MemoryRead/Write
WriteReg
FetchInstruction
DecodeRead Reg
ExecuteALU
MemoryRead/Write
WriteReg
Single-Cycle
Pipelined Write Reg and Read Reg occur during the first half and second half of the clock cycle, respectively. This allows more instr overlap!
Pipelining AbstractionPipelining Abstraction 5 stages of execution5 stages of execution
instruction memory (fetch)instruction memory (fetch) register file readregister file read ALU/executionALU/execution data memory accessdata memory access register file write (writeback)register file write (writeback)
Time (cycles)
lw $s2, 40($0) RF 40
$0RF
$s2+ DM
RF $t2
$t1RF
$s3+ DM
RF $s5
$s1RF
$s4- DM
RF $t6
$t5RF
$s5& DM
RF 20
$s1RF
$s6+ DM
RF $t4
$t3RF
$s7| DM
add $s3, $t1, $t2
sub $s4, $s1, $s5
and $s5, $t5, $t6
sw $s6, 20($s1)
or $s7, $t3, $t4
1 2 3 4 5 6 7 8 9 10
add
IM
IM
IM
IM
IM
IMlw
sub
and
sw
or
Single-Cycle and Pipelined Single-Cycle and Pipelined DatapathDatapath Key difference: insertion of pipeline registersKey difference: insertion of pipeline registers
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE0
1
PCF0
1PC' InstrD
25:21
20:16
15:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
ResultW
PCPlus4EPCPlus4F
ZeroM
CLK CLK
ALU
WriteRegE4:0
CLK
CLK
CLK
SignImm
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE0
1
PC0
1PC' Instr
25:21
20:16
15:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
Zero
CLK
ALU
Fetch Decode Execute Memory Writeback
Multi-Cycle and Pipelined Multi-Cycle and Pipelined DatapathDatapath Key difference: full-length registers for datapath Key difference: full-length registers for datapath and and
controlcontrol
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE0
1
PCF0
1PC' InstrD
25:21
20:16
15:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
ResultW
PCPlus4EPCPlus4F
ZeroM
CLK CLK
ALU
WriteRegE4:0
CLK
CLK
CLK
SignImm
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE0
1
PC0
1PC' Instr
25:21
20:16
15:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
Zero
CLK
ALU
Fetch Decode Execute Memory Writeback
SignImm
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
0
1
0
1 0
1
PC0
1
PC' Instr25:21
20:16
15:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
RegD
st
Mem
toRe
g ZeroCLK
ALU
WD
WE
CLK
Adr
0
1Data
CLK
CLK
A
B00
01
10
11
4
CLK
ENEN
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE0
1
PCF0
1PC' InstrD
25:21
20:16
15:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
ResultW
PCPlus4EPCPlus4F
ZeroM
CLK CLK
ALU
WriteRegE4:0
CLK
CLK
CLK
SignImm
CLK
A RD
InstructionMemory
+4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE0
1
PC0
1PC' Instr
25:21
20:16
15:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
Zero
CLK
ALU
Fetch Decode Execute Memory Writeback
One ProblemOne Problem There is a problem:There is a problem:
ResultW and WriteReg are out of stepResultW and WriteReg are out of step
Corrected Pipelined DatapathCorrected Pipelined Datapath Solution: One modification neededSolution: One modification needed
Send both ResultW and WriteReg through an equal Send both ResultW and WriteReg through an equal number of registersnumber of registers
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE0
1
PCF0
1PC' InstrD
25:21
20:16
15:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4EPCPlus4F
ZeroM
CLK CLK
WriteRegW4:0
ALU
WriteRegE4:0
CLK
CLK
CLK
Fetch Decode Execute Memory Writeback
Pipelined ControlPipelined Control What is similar/different w.r.t. single-cycle MIPS?What is similar/different w.r.t. single-cycle MIPS?
Same control signal valuesSame control signal values Values must be delayed and delivered in correct cyclesValues must be delayed and delivered in correct cycles
Control signals must go through registers as well!Control signals must go through registers as well!
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE0
1
PCF0
1PC' InstrD
25:21
20:16
15:0
5:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4EPCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
ZeroM
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
BranchE BranchM
RegDstE
ALUSrcE
WriteRegE4:0
Pipelining ChallengesPipelining Challenges ““Hazards”Hazards”
when an instruction depends on results from previous when an instruction depends on results from previous instruction that has not yet completedinstruction that has not yet completed
2 types of hazards:2 types of hazards:Data hazard:Data hazard: register value not written back to register file register value not written back to register file
yetyet– an instruction produces a result needed by next instructionan instruction produces a result needed by next instruction– new value will be stored during write-back stagenew value will be stored during write-back stage– new value needed by next instr during register readnew value needed by next instr during register read
following too close behind!following too close behind!
Control hazard:Control hazard: don don’’t know which is next instructiont know which is next instruction– next PC not decided yetnext PC not decided yet– cause by conditional branchescause by conditional branches– cannot fetch next instr if current branch is not yet decidedcannot fetch next instr if current branch is not yet decided
Data Hazard: ExampleData Hazard: Example First instruction computes a result ($s0) First instruction computes a result ($s0)
needed by the next 3 instructionsneeded by the next 3 instructions first two cause problemsfirst two cause problems third actually does not!third actually does not!
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2RF
$s0+ DM
RF $s1
$s0RF
$t0& DM
RF $s0
$s4RF
$t1| DM
RF $s5
$s0RF
$t2- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMadd
or
sub
Handling Data HazardsHandling Data Hazards Static/compiler approaches:Static/compiler approaches:
Insert Insert nop’nop’s (no operations) in code at compile times (no operations) in code at compile time Rearrange code at compile timeRearrange code at compile time
Dynamic/runtime approaches:Dynamic/runtime approaches: Forward data at run timeForward data at run time Stall the processor at run timeStall the processor at run time
Compile-Time Hazard EliminationCompile-Time Hazard Elimination Insert enough nops between dependent instrsInsert enough nops between dependent instrs Or re-arrange: move independent instructions Or re-arrange: move independent instructions
earlierearlier
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2RF
$s0+ DM
RF $s1
$s0RF
$t0& DM
RF $s0
$s4RF
$t1| DM
RF $s5
$s0RF
$t2- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMadd
or
sub
nop
nop
RF RFDMnopIM
RF RFDMnopIM
9 10
Dynamic Approach: Data Dynamic Approach: Data ForwardingForwarding Also known as Also known as bypassingbypassing
results are actually available even though not stored results are actually available even though not stored in RFin RF
grab a copy and send where needed!grab a copy and send where needed!
Note:Note: forwarding actually not needed for sub. Why?forwarding actually not needed for sub. Why?
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2RF
$s0+ DM
RF $s1
$s0RF
$t0& DM
RF $s0
$s4RF
$t1| DM
RF $s5
$s0RF
$t2- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMadd
or
sub
Data ForwardingData Forwarding
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE
1
0
PCF0
1PC' InstrD
25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
AL
U
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
For
wa
rdA
E
For
wa
rdB
E
20:16RtE
RsD
RdD
RtD
Reg
Wri
teM
Reg
Wri
teW
Hazard Unit
PCPlus4E
BranchE BranchM
ZeroM
Data ForwardingData Forwarding Forward to Execute stage from either:Forward to Execute stage from either:
Memory stage orMemory stage or Writeback stageWriteback stage
Forwarding logic for Forwarding logic for ForwardAEForwardAE::if ((if ((rsErsE != 0) AND ( != 0) AND (rsErsE == == WriteRegMWriteRegM) AND ) AND RegWriteMRegWriteM) then) then
ForwardAEForwardAE = 10 = 10
else if ((else if ((rsErsE != 0) AND ( != 0) AND (rsErsE == == WriteRegWWriteRegW) AND ) AND RegWriteWRegWriteW) then ) then ForwardAEForwardAE = 01 = 01
elseelse
ForwardAEForwardAE = 00 = 00
Forwarding logic for Forwarding logic for ForwardBEForwardBE same, but replace same, but replace rsErsE with with rtErtE
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE
1
0
PCF0
1PC' InstrD
25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
AL
U
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
For
wa
rdA
E
For
wa
rdB
E
20:16RtE
RsD
RdD
RtD
Reg
Wri
teM
Reg
Wri
teW
Hazard Unit
PCPlus4E
BranchE BranchM
ZeroM
Data ForwardingData Forwardingif ((if ((rsErsE != 0) AND ( != 0) AND (rsErsE == == WriteRegMWriteRegM) AND ) AND RegWriteMRegWriteM) then) then
ForwardAEForwardAE = 10 = 10
else if ((else if ((rsErsE != 0) AND ( != 0) AND (rsErsE == == WriteRegWWriteRegW) AND ) AND RegWriteWRegWriteW)) then )) then ForwardAEForwardAE = 01 = 01
elseelse
ForwardAEForwardAE = 00 = 00
26
Forwarding may not always Forwarding may not always work…work… Example: Load followed immediately by R-Example: Load followed immediately by R-
typetype loads are harder to deal with because they need mem loads are harder to deal with because they need mem
access!access!need one extra cycle, compared to R-type instructionsneed one extra cycle, compared to R-type instructions
Time (cycles)
lw $s0, 40($0) RF 40
$0RF
$s0+ DM
RF $s1
$s0RF
$t0& DM
RF $s0
$s4RF
$t1| DM
RF $s5
$s0RF
$t2- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMlw
or
sub
Trouble!
lw has a 2-cycle latency!
StallingStalling Stall for a cycle, then forwardStall for a cycle, then forward
solves the Load followed by R-type problemsolves the Load followed by R-type problem
Time (cycles)
lw $s0, 40($0) RF 40
$0RF
$s0+ DM
RF $s1
$s0RF
$t0& DM
RF $s0
$s4RF
$t1| DM
RF $s5
$s0RF
$t2- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMlw
or
sub
9
RF $s1
$s0
IMor
Stall
Stalling HardwareStalling Hardware
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE
1
0
PCF0
1PC' InstrD
25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
Sta
llF
Sta
llD
For
war
dAE
For
war
dBE
20:16RtE
RsD
RdD
RtD
Reg
Writ
eM
Reg
Writ
eW
Mem
toR
egE
Hazard Unit
Flu
shE
PCPlus4E
BranchE BranchM
ZeroM
EN
EN
CLR
Stalling ControlStalling Control Stalling logic:Stalling logic:
lwstalllwstall = (( = ((rsDrsD == == rtErtE) OR () OR (rtDrtD == == rtErtE)) AND )) AND MemtoRegEMemtoRegE
StallFStallF = = StallDStallD = = FlushEFlushE = = lwstalllwstall
Stalling ControlStalling Control
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE
1
0
PCF0
1PC' InstrD
25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
Sta
llF
Sta
llD
For
war
dAE
For
war
dBE
20:16RtE
RsD
RdD
RtD
Reg
Writ
eM
Reg
Writ
eW
Mem
toR
egE
Hazard Unit
Flu
shE
PCPlus4E
BranchE BranchM
ZeroM
EN
EN
CLR
lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
StallF = StallD = FlushE = lwstall
Control HazardsControl Hazards beqbeq: :
branch is not determined until the fourth stage of the pipelinebranch is not determined until the fourth stage of the pipeline Instructions after the branch are fetched before branch occursInstructions after the branch are fetched before branch occurs These instructions must be flushed if the branch is takenThese instructions must be flushed if the branch is taken
Effect & SolutionsEffect & Solutions Could always stall when branch decodedCould always stall when branch decoded
Expensive: 3 cycles lost per branch!Expensive: 3 cycles lost per branch!
Could predict banch outcome, and flush if Could predict banch outcome, and flush if wrongwrong Branch misprediction penaltyBranch misprediction penalty
Instructions flushed when branch is takenInstructions flushed when branch is takenMay be reduced by determining branch earlierMay be reduced by determining branch earlier
33
Control Hazards: FlushingControl Hazards: Flushing FlushingFlushing
turn instruction into a NOP (put all zeros)turn instruction into a NOP (put all zeros) renders it harmless!renders it harmless!
Time (cycles)
beq $t1, $t2, 40 RF $t2
$t1RF- DM
RF $s1
$s0RF& DM
RF $s0
$s4RF| DM
RF $s5
$s0RF- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMlw
or
sub
20
24
28
2C
30
...
...
9
Flushthese
instructions
64 slt $t3, $s2, $s3 RF $s3
$s2RF
$t3slt DMIM
slt
Control Hazards: Original Pipeline (for Control Hazards: Original Pipeline (for comparison)comparison)
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE
1
0
PCF0
1PC' InstrD
25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
AL
U
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
Sta
llF
Sta
llD
For
war
dAE
For
war
dBE
20:16RtE
RsD
RdD
RtD
Reg
Writ
eM
Reg
Writ
eW
Mem
toR
egE
Hazard Unit
Flu
shE
PCPlus4E
BranchE BranchM
ZeroM
EN
EN
CLR
Control Hazards: Early Branch ResolutionControl Hazards: Early Branch Resolution
EqualD
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE
1
0
PCF0
1PC' InstrD
25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchD
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcD
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
=
SignImmD
Sta
llF
Sta
llD
For
war
dAE
For
war
dBE
20:16RtE
RsD
RdE
RtD
Reg
Writ
eM
Reg
Writ
eW
Mem
toR
egE
Hazard Unit
Flu
shE
EN
EN
CLR
CLR
Introduced another data hazard in Decode stage (fix a few slides away)
Control Hazards with Early Branch Control Hazards with Early Branch ResolutionResolution
Time (cycles)
beq $t1, $t2, 40 RF $t2
$t1RF- DM
RF $s1
$s0RF& DMand $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
andIM
IMlw20
24
28
2C
30
...
...
9
Flushthis
instruction
64 slt $t3, $s2, $s3 RF $s3
$s2RF
$t3slt DMIM
slt
Penalty now only one lost cycle
Aside: Delayed BranchAside: Delayed Branch MIPS always executes instruction following a MIPS always executes instruction following a
branchbranch So branch So branch delayeddelayed
This allows us to avoid killing inst.This allows us to avoid killing inst. Compilers move instruction that has no conflict w/ Compilers move instruction that has no conflict w/
branch into delay slotbranch into delay slot
38
ExampleExample This sequenceThis sequence
add $4 $5 $6add $4 $5 $6
beq $1 $2 40beq $1 $2 40
reordered to thisreordered to thisbeq $1 $2 40beq $1 $2 40
add $4 $5 $6add $4 $5 $6
39
Handling the New HazardsHandling the New Hazards
EqualD
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE
1
0
PCF0
1PC' InstrD
25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchD
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcD
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
0
1
0
1
=
SignImmD
Sta
llF
Sta
llD
For
war
dAE
For
war
dBE
For
war
dAD
For
war
dBD
20:16RtE
RsD
RdD
RtD
Reg
Writ
eE
Reg
Writ
eM
Reg
Writ
eW
Mem
toR
egE
Bra
nchD
Hazard Unit
Flu
shE
EN
EN
CLR
CLR
Control Forwarding and Stalling Control Forwarding and Stalling HardwareHardware Forwarding logic:Forwarding logic:
ForwardADForwardAD = ( = (rsDrsD !=0) AND ( !=0) AND (rsDrsD == == WriteRegMWriteRegM) AND ) AND RegWriteMRegWriteM
ForwardBDForwardBD = ( = (rtDrtD !=0) AND ( !=0) AND (rtDrtD == WriteRegM) AND == WriteRegM) AND RegWriteMRegWriteM
Stalling logic:Stalling logic:branchstallbranchstall = = BranchDBranchD AND AND RegWriteERegWriteE AND AND
((WriteRegEWriteRegE == == rsDrsD OR OR WriteRegEWriteRegE == == rtDrtD) )
OR OR
BranchDBranchD AND AND MemtoRegMMemtoRegM AND AND
((WriteRegMWriteRegM == == rsDrsD OR OR WriteRegMWriteRegM == == rtDrtD))
StallFStallF = = StallDStallD = = FlushEFlushE = = lwstalllwstall OR OR branchstallbranchstall
Branch PredictionBranch Prediction Especially important if branch penalty > 1 Especially important if branch penalty > 1
cyclecycle Guess whether branch will be takenGuess whether branch will be taken
Backward branches are usually taken (loops)Backward branches are usually taken (loops) Perhaps consider history of whether branch was Perhaps consider history of whether branch was
previously taken to improve the guesspreviously taken to improve the guess
Good prediction reduces the fraction of Good prediction reduces the fraction of branches requiring a flush branches requiring a flush
Pipelined Performance ExamplePipelined Performance Example Ideally CPI = 1Ideally CPI = 1
But less due to: stalls (caused by loads and branches)But less due to: stalls (caused by loads and branches)
SPECINT2000 benchmark: SPECINT2000 benchmark: 25% loads25% loads 10% stores 10% stores 11% branches11% branches 2% jumps2% jumps 52% R-type52% R-type
Suppose:Suppose: 40% of loads used by next instruction40% of loads used by next instruction 25% of branches mispredicted25% of branches mispredicted All jumps flush next instructionAll jumps flush next instruction
What is the average CPI?What is the average CPI?
Pipelined Performance ExamplePipelined Performance Example SPECINT2000 benchmark: SPECINT2000 benchmark:
25% loads25% loads 10% stores 10% stores 11% branches11% branches 2% jumps2% jumps 52% R-type52% R-type
Suppose:Suppose: 40% of loads used by next instruction40% of loads used by next instruction 25% of branches mispredicted25% of branches mispredicted All jumps flush next instructionAll jumps flush next instruction
What is the average CPI?What is the average CPI? Load/Branch CPI = 1 when no stalling, 2 when stalling. Thus,Load/Branch CPI = 1 when no stalling, 2 when stalling. Thus,
CPIlw = 1(0.6) + 2(0.4) = 1.4CPIlw = 1(0.6) + 2(0.4) = 1.4 CPIbeq = 1(0.75) + 2(0.25) = 1.25CPIbeq = 1(0.75) + 2(0.25) = 1.25 Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) +
(0.52)(1) = 1.15(0.52)(1) = 1.15
Pipelined PerformancePipelined Performance Pipelined processor critical path:Pipelined processor critical path:
TTcc = max {= max {
ttpcqpcq + + ttmemmem + + ttsetup,setup,
2(2(ttRFreadRFread + + ttmuxmux + + tteq eq + + ttAND AND + + ttmuxmux + + ttsetup setup ),),
ttpcqpcq + + ttmux mux + + ttmuxmux + + ttALUALU + + ttsetup,setup,
ttpcqpcq + + ttmemwritememwrite + + ttsetup,setup,
2(2(ttpcqpcq + + ttmuxmux + + ttRFwriteRFwrite) }) }
EqualD
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE
1
0
PCF0
1PC' InstrD
25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchD
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcD
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
AL
U
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
0
1
0
1
=
SignImmD
Sta
llF
Sta
llD
For
war
dAE
For
war
dBE
For
war
dAD
For
war
dBD
20:16 RtE
RsD
RdD
RtD
Reg
Writ
eE
Reg
Writ
eM
Reg
Writ
eW
Mem
toR
egE
Bra
nchD
Hazard Unit
Flu
shE
EN
EN
CLR
CLR
Pipelined Performance ExamplePipelined Performance Example
Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup )
= 2[150 + 25 + 40 + 15 + 25 + 20] ps = 550 ps
Pipelined Performance ExamplePipelined Performance Example For a program with 100 billion instructions For a program with 100 billion instructions
executing on a pipelined MIPS processor,executing on a pipelined MIPS processor, CPI = 1.15CPI = 1.15TTcc = 550 ps = 550 ps
Execution Time = (# instructions) × CPI × TcExecution Time = (# instructions) × CPI × Tc
= (100 × 109)(1.15)(550 × 10-12)= (100 × 109)(1.15)(550 × 10-12)
= 63 seconds= 63 seconds
SummarySummary Pipelining benefitsPipelining benefits
use hardware more efficientlyuse hardware more efficiently throughput increasesthroughput increases
Pipelining challenges/drawbacksPipelining challenges/drawbacks latency increaseslatency increases hazards ensuehazards ensue energy/power consumption increasesenergy/power consumption increases
All modern processors are pipelinedAll modern processors are pipelined some have way more than 5 pipeline stagessome have way more than 5 pipeline stages
some Pentium’s have had 20-40 pipeline stages!some Pentium’s have had 20-40 pipeline stages!