Computer Structure 2012 – Pipeline1
Computer Structure
MIPS Pipeline
Lihu Rappoport and Adi Yoaz
Some of the slides were taken from: (1) Avi Mendelson (2) Randi Katz (3) Patterson
Computer Structure 2012 – Pipeline2
DataAccess
DataAccess
DataAccess
DataAccess
DataAccess
Pipelining Instructions
Ideal speedup is number of stages in the pipeline. Do we achieve this?
2 4 6 8 10 12 14 16 18
2 4 6 8 10 12 14
. ..
InstFetch Reg ALU Reg
InstFetch Reg ALU Reg
InstFetch
InstFetch Reg ALU Reg
InstFetch Reg ALU Reg
InstFetch Reg ALU Reg
2 ns
2 ns
2 ns 2 ns 2 ns 2 ns 2 ns
8 ns
8 ns
8 ns
TimeProgramexecutionorder
lw R1, 100(R0)
lw R2, 200(R0)
lw R3, 300(R0)
TimeProgramexecutionorder
lw R1, 100(R0)
lw R2, 200(R0)
lw R3, 300(R0)
Computer Structure 2012 – Pipeline3
Pipelined Car Assembly
chassis engine finish
1 hour 2 hours 1 hour
Car 1Car 2Car 3
Computer Structure 2012 – Pipeline4
Pipelining Pipelining does not reduce the latency of single task,
it increases the throughput of entire workload Potential speedup = Number of pipe stages
Pipeline rate is limited by the slowest pipeline stageÞ Partition the pipe to many pipe stagesÞ Make the longest pipe stage to be as short as possible Þ Balance the work in the pipe stages
Pipeline adds overhead (e.g., latches) Time to “fill” pipeline and time to “drain” it reduces speedup Stall for dependencies
Þ Too many pipe-stages start to loose performance IPC of an ideal pipelined machine is 1
Every clock one instruction finishes
Computer Structure 2012 – Pipeline5
Instruction Decode /
register fetch
Instruction fetch
Execute /address
calculation
Memoryaccess
Writeback
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
Pipelined CPU
Computer Structure 2012 – Pipeline6
Pipelined CPU with Control
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EXEX/MEM
MEM/WBIn
stru
ctio
n
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
Con
trol
WBMEM
WBMEM
WB
EXE
Instruction Decode /
register fetch
Instruction fetch
Execute /address
calculation
Memoryaccess
Writeback
Computer Structure 2012 – Pipeline7
Structural Hazard Attempt to use the same resource two different ways at the same time Register File:
Accessed in 2 stages: Read during stage 2 (ID) Write during stage 5 (WB)
Solution: 2 read ports, 1 write port Memory
Accessed in 2 stages: Instruction Fetch during stage 1 (IF) Data read/write during stage 4 (MEM)
Solution: separate instruction cache and data cache
Each functional unit can only be used once per instruction Each functional unit must be used at the same stage for all
instructions
IM Reg
IM Reg
IM Reg DM Reg
IM DM Reg
IM DM Reg
DM Reg
Reg
Reg
Reg
DM
Computer Structure 2012 – Pipeline8
Pipeline Example: cycle 1
0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction lwPC4
4
Computer Structure 2012 – Pipeline9
Pipeline Example: cycle 2
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
lw
PC8
8 4
sub[R1]
910
0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7
Computer Structure 2012 – Pipeline10
Pipeline Example: cycle 3
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
lw
PC12
12 8
and[R2]
[R1]+9
sub4
[R3]
1011 0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7
Computer Structure 2012 – Pipeline11
Pipeline Example: cycle 4
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
0
1
mux
0
1
mux
0
1
mux
1
0
mux
Instruction
lw
PC16
12 8
or[R4]
Data frommemoryaddress[R1]+9
sub4
[R5]
1112
and16
10
[R2]-[R3]
0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7
Computer Structure 2012 – Pipeline12
Problem with starting next instruction before first is finished dependencies that “go backward in time” are data
hazards
Dependencies: RAW Hazard
IM Reg
IM Reg
IM Reg DM Reg
IM DM Reg
IM DM Reg
DM Reg
Reg
Reg
Reg
DM
sub R2, R1, R3
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Programexecutionorder
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)
CC 7 CC 8 CC 910 10/–20Value of R2 10 10 10 -20 -20 -20 -20
Computer Structure 2012 – Pipeline13
IM bubble bubble bubble bubble
IM bubble bubble bubble bubble
IM bubble bubble bubble bubble
RAW Hazard: HW Solution 1 - Add Stalls
IM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)
CC 7 CC 8 CC 910 10/–20Value of R2
DM Reg
IM DM RegReg
IM Reg
IM Reg DM Reg
IM DM Reg
Reg
Reg
DM
sub R2, R1, R3
stall
stall
stall
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Programexecutionorder
10 10 10 -20 -20 -20 -20
• Have the hardware detect hazard and add stalls if needed
• Problem: this also slows us down!
Computer Structure 2012 – Pipeline14
Use temporary results, don’t wait for them to be written to the register file register file forwarding to handle read/write to same register ALU forwarding
RAW Hazard: HW Solution 2 - Forwarding
IM Reg
IM Reg
IM Reg DM Reg
IM DM Reg
IM DM Reg
DM Reg
Reg
Reg
Reg
X X X –20 X X X X XX X X X –20 X X X X
DM
sub R2, R1, R3
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Programexecutionorder
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)
CC 7 CC 8 CC 910 10/–20Value of R2 10 10 10 -20 -20 -20 -20
Value EX/MEMValue MEM/WB
Computer Structure 2012 – Pipeline15
Forwarding Hardware
Con
trol
EX
M
WB
ForwardingUnit
1mux
0
0mux
1
0muxB
1
2
0muxA
1
2
DataMemoryA
LU
Register File
Inst
ruct
ion
Mem
ory
PC
ID/EX EX/MEM MEM/WBIF/ID
Inst
ruct
ion
IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd
Rs
RtRtRd
WB
M WB
EX/MEM.Rd
MEM/WB.Rd
EX/M
EM.R
egW
rite
MEM
/WB
.Reg
Writ
e
Computer Structure 2012 – Pipeline16
Forwarding ControlEX Hazard:
if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg1)) ALUSelA = 1
if (EX/MEM.RegWrite and (EX/MEM.WriteReg = ID/EX.ReadReg2)) ALUSelB = 1
MEM Hazard: if (MEM/WB.RegWrite and ((not EX/MEM.RegWrite) or (EX/MEM.WriteReg
ID/EX.ReadReg1)) and (MEM/WB.WriteReg = ID/EX.ReadReg1)) ALUSelA = 2 if (MEM/WB.RegWrite and ((not EX/MEM.RegWrite) or (EX/MEM.WriteReg
ID/EX.ReadReg2)) and (MEM/WB.WriteReg = ID/EX.ReadReg2)) ALUSelB = 2
Computer Structure 2012 – Pipeline17
Forwarding Hardware Example: Bypassing From EX to Src1 and From WB to Src2
lw R11,9(R1) sub R10,R2, R3and R12,R10,R11
Con
trol
EX
M
WB
ForwardingUnit
1mux
0
0mux
1
0mux
1
2
0mux
1
2
DataMemoryA
LU
Register File
Inst
ruct
ion
Mem
ory
PC
ID/EX EX/MEM MEM/WBIF/ID
Inst
ruct
ion
IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd
Rs
RtRtRd
WB
M WB
EX/MEM.Rd
MEM/WB.Rd
lw[R10]
Data frommemoryaddress[R1]+9
sub
[R11]
1012
and
11
[R2]-[R3]
1110
Computer Structure 2012 – Pipeline18
Forwarding Hardware Example 2: Bypassing From WB to Src2
sub R10,R2, R3xxxand R12,R10,R11
Con
trol
EX
M
WB
ForwardingUnit
1mux
0
0mux
1
0mux
1
2
0mux
1
2
DataMemoryA
LU
Register File
Inst
ruct
ion
Mem
ory
PC
ID/EX EX/MEM MEM/WBIF/ID
Inst
ruct
ion
IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd
Rs
RtRtRd
WB
M WB
EX/MEM.Rd
MEM/WB.Rd
sub[R11]
xxx
[R10]
12
and
10
[R2]-[R3]
1110
Computer Structure 2012 – Pipeline19
Register File Split
sub R2, R1, R3
xxx
xxx
and R12,R2,R11
Register file is written during first half of the cycle Register file is read during second half of the cycleÞ Register file is written before it is read Þ returns the
correct data
IM Reg
IM Reg
IM DM Reg
IM DM Reg
DM Reg
Reg
DMReg
Reg
Computer Structure 2012 – Pipeline20
Load word can still causes a hazard: an instruction tries to read a register following a load
instruction that writes to the same register
Þ A hazard detection unit is needed to “stall” the load instruction
Can't always forward
Reg
IM
Reg
Reg
IM
IM Reg DM Reg
IM DM Reg
IM DM Reg
DM Reg
Reg
Reg
DM
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)
CC 7 CC 8 CC 9Programexecutionorder
lw R2, 30(R1)
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Computer Structure 2012 – Pipeline21
De-assert the enable to ID/EXE The dependant instruction (and) stays another cycle in IF/EXE
De-assert the enable to the IF/ID latch, and to the PC Freeze pipeline stages preceding the stalled instruction
Issue a NOP into the EXE/MEM latch (instead of the stalled inst.) Allow the stalling instruction (lw) to move on
Reg
IM
Reg
Reg
IM DM
IM Reg DM RegIM
IM DM Reg
IM DM Reg
DM Reg
RegReg
Reg
bubble
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)
CC 7 CC 8 CC 9Programexecutionorder
lw R2, 30(R1)
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Stalling
Computer Structure 2012 – Pipeline22
Hazard Detection (Stall) Logicif (ID/EX.RegWrite and (ID/EX.opcode = lw) and ( (ID/EX.WriteReg = IF/ID.ReadReg1) or (ID/EX.WriteReg = IF/ID.ReadReg2) ) then stall
Computer Structure 2012 – Pipeline23
Forwarding + Hazard Detection Unit
Con
trol
EX
M
WB
ForwardingUnit
1mux
0
0mux
1
0mux
1
2
0mux
1
2
DataMemoryA
LU
Register File
Inst
ruct
ion
Mem
ory
PC
ID/EX EX/MEM MEM/WBIF/ID
Inst
ruct
ion
IF/ID.RsIF/ID.RtIF/ID.RtIF/ID.Rd
Rs
RtRtRd
WB
M WB
EX/MEM.Rd
MEM/WB.Rd
HazardDetection
Unit
0
ID/EX.MemRead
PC W
rite
0mux
1
IF/ID
Writ
e
ID/EX.Rt
Computer Structure 2012 – Pipeline24
Example: code for (assume all variables are in memory):a = b + c;d = e – f;
Slow codeLW Rb,bLW Rc,c
Stall ADD Ra,Rb,RcSW a,Ra LW Re,e LW Rf,f
Stall SUB Rd,Re,RfSW d,Rd
Instruction order can be changed as long as the correctness is kept
Software Scheduling to Avoid Load Hazards
Fast codeLW Rb,bLW Rc,cLW Re,e ADD Ra,Rb,RcLW Rf,fSW a,Ra SUB Rd,Re,RfSW d,Rd
Computer Structure 2012 – Pipeline25
Control Hazards
Computer Structure 2012 – Pipeline26
Executing a BEQ Instruction (i)BEQ R4, R5, 27 ; if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4
0 or 4 beq R4, R5, 27 8 and12 sw16 sub
Calculate branch condition
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
R4R5 -27
812
12
beqand
Calculate branch target
Computer Structure 2012 – Pipeline27
Executing a BEQ Instruction (ii)BEQ R4, R5, 27 ; if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4
0 or 4 beq R4, R5, 27 8 and12 sw16 sub
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction R4-R5=0-
8+SignExt(27)*4
1216
16
beqandsw
Computer Structure 2012 – Pipeline28
Executing a BEQ Instruction (iii)BEQ R4, R5, 27 ; if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4
0 or 4 beq R4, R5, 27 8 and12 sw16 sub
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
8+SignExt(27)*4
16
20 or 116
beqandsw
20
sub
Computer Structure 2012 – Pipeline29
Control Hazard on Branches
And
Beq
sub
sw
The 3 instructions following the branch get into the pipe even if the branch is taken
Inst from target
IM Reg DM Reg
PC
IM Reg DM Reg
PC
IM Reg DM Reg
PC
IM Reg DM Reg
PC
IM Reg DM Reg
PC
Computer Structure 2012 – Pipeline30
Control Hazard: Stall
Stall pipe when branch is encountered until resolved Stall impact: assumptions
CPI = 1 20% of instructions are branches Stall 3 cycles on every branch
Þ CPI new = 1 + 0.2 × 3 = 1.6 (CPI new = CPI Ideal + avg. stall cycles / instr.)
We loose 60% of the performance
Computer Structure 2012 – Pipeline31
Control Hazard: Predict Not Taken
Execute instructions from the fall-through (not-taken) path As if there is no branch If the branch is not-taken (~50%), no penalty is paid
If branch actually taken Flush the fall-through path instructions before they
change the machine state (memory / registers) Fetch the instructions from the correct (taken) path
Assuming ~50% branches not taken on average CPI new = 1 + (0.2 × 0.5) × 3 = 1.3
Computer Structure 2012 – Pipeline32
Dynamic Branch Prediction
Look up
PC of inst in fetch
?= Branch predicted taken or not taken
No:Inst is not pred to be branch
Yes:Inst is pred to be branch
Branch PC Target PC History
Predicted Target
Add a Branch Target Buffer (BTB) the predicts (at fetch) Instruction is a branch Branch taken / not-taken Taken branch target
Computer Structure 2012 – Pipeline33
BTB Allocation
Allocate instructions identified as branches (after decode) Both conditional and unconditional branches are allocated
Not taken branches need not be allocated BTB miss implicitly predicts not-taken
Prediction BTB lookup is done parallel to IC lookup BTB provides
Indication that the instruction is a branch (BTB hits) Branch predicted target Branch predicted direction Branch predicted type (e.g., conditional, unconditional)
Update (when branch outcome is known) Branch target Branch history (taken / not-taken)
Computer Structure 2012 – Pipeline34
BTB (cont.) Wrong prediction
Predict not-taken, actual taken Predict taken, actual not-taken, or actual taken but wrong
target In case of wrong prediction – flush the pipeline
Reset latches (same as making all instructions to be NOPs) Select the PC source to be from the correct path
Need get the fall-through with the branch Start fetching instruction from correct path
Assuming P% correct prediction rate CPI new = 1 + (0.2 × (1-P)) × 3 For example, if P=0.7
CPI new = 1 + (0.2 × 0.3) × 3 = 1.18
Computer Structure 2012 – Pipeline35
Adding a BTB to the Pipeline
Shift left 2
ALUSrc
6
ALUresultZero
+
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EXEX/MEM MEM
/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4+
IF/ID
PC
0
1
mux
0
1
mux
0mux
1
0
mux
Inst.Memory
Address
Instruction
BTB
12
pred targetpred dir
PC+4 (Not-taken target)taken target
3
Mis-predict
DetectionUnit
Flush
predicted targetPC+4 (Not-taken target)
predicted direction
−4
address
targetdirection
alloc/updt
Computer Structure 2012 – Pipeline36
Using The BTBPC moves to next instruction
Inst Mem gets PCand fetches new inst
BTB gets PCand looks it up
IF/ID latch loadedwith new inst
BTB Hit ?
Br taken ?
PC PC + 4PC perd addr
IF
IDIF/ID latch loaded
with pred instIF/ID latch loaded
with seq. instBranch ?
yes no
noyes
noyesEXE
Computer Structure 2012 – Pipeline37
Using The BTB (cont.)ID
EXE
MEM
WB
Branch ?
Calculate brcond & trgt
Flush pipe &update PC
Corect pred ?
yes no
IF/ID latch loadedwith correct inst
continue
Update BTB
yes no
continue
Computer Structure 2012 – Pipeline38
Backup
Computer Structure 2012 – Pipeline39
· R-type (register insts)
· I-type (Load, Store, Branch, inst’s w/imm data)· J-type (Jump)
op: operation of the instructionrs, rt, rd: the source and destination register specifiersshamt: shift amountfunct: selects the variant of the operation in the “op” fieldaddress / immediate: address offset or immediate valuetarget address: target address of the jump instruction
op target address02631
6 bits 26 bits
op rs rt rd shamt funct061116212631
6 bits 6 bits5 bits5 bits5 bits5 bits
op rs rt immediate016212631
6 bits 16 bits5 bits5 bits
MIPS Instruction Formats
Computer Structure 2012 – Pipeline40
Each memory location is 8 bit = 1 byte wide has an address
We assume 32 byte address An address space of 232 bytes
Memory stores both instructions and data Each instruction is 32 bit wide Þ
stored in 4 consecutive bytes in memory
Various data types have different width
The Memory Space1 byte
00000000000000010000000200000003000000040000000500000006
FFFFFFFAFFFFFFFBFFFFFFFCFFFFFFFDFFFFFFFEFFFFFFFF
Computer Structure 2012 – Pipeline41
Register FileRegWrite
Readreg 1
Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
RegisterFile
32
32
32
5
5
5
The Register File holds 32 registers Each register is 32 bit wide The RF supports parallel
reading any two registers and writing any register
Inputs Read reg 1/2: #register whose value
will be output on Read data 1/2 RegWrite: write enable
Write reg (relevant when RegWrite=1) #register to which the value in Write data is written to
Write data (relevant when RegWrite=1) data written to Write reg
Outputs Read data 1/2: data read from Read reg 1/2
Computer Structure 2012 – Pipeline42
Memory Components Inputs
Address: address of the memory location we wish to access
Read: read data from location Write: write data into location Write data (relevant when Write=1)
data to be written into specified location Outputs
Read data (relevant when Read=1) data read from specified location
Write
Address
WriteData
ReadDataMemory
Read
32
32
32
Cache Memory components are slow relative to the CPU
A cache is a fast memory which contains only small part of the memory Instruction cache stores parts of the memory space which hold code Data Cache stores parts of the memory space which hold data
Computer Structure 2012 – Pipeline43
The Program Counter (PC) Holds the address (in memory) of the next instruction to be
executed After each instruction, advanced to point to the next
instruction If the current instruction is not a taken branch, the next instruction resides right after the current instruction PC PC + 4 If the current instruction is a taken branch, the next instruction resides at the branch target PC target (absolute jump) PC PC + 4 + offset×4 (relative jump)
Computer Structure 2012 – Pipeline44
Fetch Fetch instruction pointed by PC from I-Cache
Decode Decode instruction (generate control signals) Fetch operands from register file
Execute For a memory access: calculate effective address For an ALU operation: execute operation in ALU For a branch: calculate condition and target
Memory Access For load: read data from memory For store: write data into memory
Write Back Write result back to register file update program counter
Instruction Execution Stages
InstructionFetch
InstructionDecode
Execute
Memory
ResultStore
Computer Structure 2012 – Pipeline45
The MIPS CPUInstruction
Decode /register fetch
Instruction fetch
Execute /address
calculation
Memoryaccess
Writeback
ALUSrcALU
ALU resZero
AddShift left 2
0
1
mux
ALUControl
ALUOp
Signextend
16 32
Instruction
4Add
PC
InstructionCache
Address
Instruction
0
1
mux
MemRead
MemWrite
Address
WriteData
ReadData
DataCache
Branch
PCSrc
MemtoReg
1
0
mux
RegDst
RegWrite
Readreg 1
Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r Fi
le [20-16]
[25-21]
0
1
mux [15-11]
[15-0]
[5-0]
Con
trol
Computer Structure 2012 – Pipeline46
Executing an Add Instruction
ALUSrc=0ALU
ALU resZero
AddShift left 2
0
1
mux
ALUControl
ALUOp
Signextend
16 32
Instruction
4Add
PC
InstructionMemory
Address
Instruction
0
1
mux
MemRead=0
MemWrite=0
Address
WriteData
ReadData
DataMemory
Branch=0
PCSrc=0
MemtoReg=01
0
mux
RegDst=1
RegWrite=1Readreg 1
Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r Fi
le [20-16]
[25-21]
0
1
mux [15-11]
[15-0]
[5-0]
R3
R5
35
R3 + R5
+
[PC]+4
2
op rs rt rd shamt funct 0611162126313 5 2 0 0 = AddALU
Add R2, R3, R5 ; R2 R3+R5
ADD
Computer Structure 2012 – Pipeline47
Executing a Load Instruction
op rs rt immediate 0162126312 1 30LW
LW R1, (30)R2 ; R1 Mem[R2+30]
ALUSrc=1ALU
ALU resZero
AddShift left 2
0
1
mux
ALUControl
ALUOp
Signextend
16 32
Instruction
4Add
PC
InstructionMemory
Address
Instruction
0
1
mux
MemRead=1
MemWrite=0
Address
WriteData
ReadData
DataMemory
Branch=0
PCSrc=0
MemtoReg=11
0
mux
RegDst=0
RegWrite=1Readreg 1
Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r Fi
le [20-16]
[25-21]
0
1
mux [15-11]
[15-0]
[5-0]
R2
30
2R2+30+
[PC]+4
1
Mem[R2+30]
LW
Computer Structure 2012 – Pipeline48
Executing a Store Instruction
op rs rt immediate 0162126312 1 30SW
SW R1, (30)R2 ; Mem[R2+30] R1
ALUSrc = 1ALU
ALU resZero
AddShift left 2
0
1
mux
ALUControl
ALUOp
Signextend
16 32
Instruction
4Add
PC
InstructionMemory
Address
Instruction
0
1
mux
MemRead
MemWrite = 1
Address
WriteData
ReadData
DataMemory
Branch=0
PCSrc = 0
MemtoReg =1
0
mux
RegDst =
RegWrite = 0Readreg 1
Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r Fi
le [20-16]
[25-21]
0
1
mux [15-11]
[15-0]
[5-0]
R2
30
2R2+30+
[PC]+4
1SW
R1
Computer Structure 2012 – Pipeline49
Executing a BEQ InstructionBEQ R4, R5, 27 ; if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4
op rs rt immediate 0162126314 5 27BEQ
ALUSrc=0ALU
ALU resZero
AddShift left 2
0
1
mux
ALUControl
ALUOp
Signextend
16 32
Instruction
4Add
PC
InstructionMemory
Address
Instruction
0
1
mux
MemRead=0
MemWrite=0
Address
WriteData
ReadData
DataMemory
Branch=1
PCSrc= (R4 – R5 == 0)
MemtoReg= 1
0
mux
RegDst=
RegWrite=0Readreg 1
Readreg 2
WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r Fi
le [20-16]
[25-21]
0
1
mux [15-11]
[15-0]
[5-0]
R4
R5
45
R4-R5-
PC+4
27
ADD
PC+4orPC+4+SignExt(27)*4
Computer Structure 2012 – Pipeline50
Control Signals
add sub ori lw sw beq jumpRegDstALUSrcMemtoRegRegWriteMemWriteBranchJumpALUctr<2:0>
1001000
Add
1001000
Subtract
0101000
Or
0111000
Add
x1x0100
Add
x0x0010
Subtract
xxx00x1
xxx
funcop 00 0000 00 0000 00 1101 10 0011 10 1011 00 0100 00 0010
10 0000 10 0010 Don’t Care
Computer Structure 2012 – Pipeline51
Pipelined CPU: Load (cycle 1 – Fetch)
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction lw
op rs rt immediate 0162126312 1 30LW
LW R1, (30)R2 ; R1 Mem[R2+30]
PC+4
Computer Structure 2012 – Pipeline52
Pipelined CPU: Load (cycle 2 – Dec)
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
op rs rt immediate 0162126312 1 30LW
LW R1, (30)R2 ; R1 Mem[R2+30]
PC+4
R2
30
Computer Structure 2012 – Pipeline53
Pipelined CPU: Load (cycle 3 – Exe)
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
op rs rt immediate 0162126312 1 30LW
LW R1, (30)R2 ; R1 Mem[R2+30]
R2+30
Computer Structure 2012 – Pipeline54
Pipelined CPU: Load (cycle 4 – Mem)
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
op rs rt immediate 0162126312 1 30LW
LW R1, (30)R2 ; R1 Mem[R2+30]
D
Computer Structure 2012 – Pipeline55
Pipelined CPU: Load (cycle 5 – WB)
ALUSrc
6
ALUresultZero
Add result
AddShift left 2
ALUControl
ALUOp
RegDst
RegWrite
Readreg 1Readreg 2WriteregWritedata
Readdata 1
Readdata 2
Reg
iste
r File
[15-0]
[20-16]
[15-11]
Signextend
16 32
ID/EX EX/MEM MEM/WB
Inst
ruct
ion
MemRead
MemWrite
Address
WriteData
ReadData
DataMemory
Branch
PCSrc
MemtoReg
4
InstructionMemory
Address
Add
IF/ID
PC
0
1
mux
0
1
mux
0
1
mux
0
1
mux
Instruction
op rs rt immediate 0162126312 1 30LW
LW R1, (30)R2 ; R1 Mem[R2+30]
D
Computer Structure 2012 – Pipeline56
PC
Instructionmemory
Inst
ruct
ion
Add
Instruction[20– 16]
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
16 32Instruction[15– 0]
0
0
Mux
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux1
ALUresult
Zero
Writedata
Readdata
Mux
1
ALUcontrol
Shiftleft 2
Reg
Writ
e
MemRead
Control
ALU
Instruction[15– 11]
6
EX
M
WB
M
WB
WBIF/ID
PCSrc
ID/EX
EX/MEM
MEM/WB
Mux
0
1
Mem
Writ
e
AddressData
memory
Address
Datapath with Control
Computer Structure 2012 – Pipeline57
Multi-Cycle Control Pass control signals along just like the data
Execution/Address Calculation stage control lines
Memory access stage control lines
Write-back stage control
lines
InstructionReg Dst
ALU Op1
ALU Op0
ALU Src Branch
Mem Read
Mem Write
Reg write
Mem to Reg
R-format 1 1 0 0 0 0 0 1 0lw 0 0 0 1 0 1 0 1 1sw X 0 0 1 0 0 1 0 Xbeq X 0 1 0 1 0 0 0 X
Control
EX
M
WB
M
WB
WB
IF/ID ID/EX EX/MEM MEM/WB
Instruction
Computer Structure 2012 – Pipeline58
Five Execution Steps Instruction Fetch
Use PC to get instruction and put it in the Instruction Register. Increment the PC by 4 and put the result back in the PC.
IR = Memory[PC];PC = PC + 4;
Instruction Decode and Register Fetch Read registers rs and rt Compute the branch address A = Reg[IR[25-21]];
B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2);
We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)
Computer Structure 2012 – Pipeline59
Five Execution Steps (cont.)• Execution ALU is performing one of three functions, based on instruction type:
Memory Reference: effective address calculation.ALUOut = A + sign-extend(IR[15-0]);
R-type:ALUOut = A op B;
Branch:if (A==B) PC = ALUOut;
• Memory Access or R-type instruction completion
• Write-back step
Computer Structure 2012 – Pipeline60
The Store Instruction
sw rt, rs, imm16
mem[PC] Fetch the instruction from memory
Addr <- R[rs] + SignExt(imm16) Calculate the memory addressMem[Addr] <- R[rt] Store the register into memoryPC <- PC + 4 Calculate the next instruction’s
address
op rs rt immediate016212631
6 bits 16 bits5 bits5 bits
Mem[Rs + SignExt[imm16]] <- Rt Example: sw rt, rs, imm16
Computer Structure 2012 – Pipeline61
RAW Hazard: SW Solution
IM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)
CC 7 CC 8 CC 910 10/–20Value of R2
DM Reg
IM DM RegReg
IM Reg
IM Reg DM Reg
IM DM Reg
Reg
Reg
DM
sub R2, R1, R3
NOP
NOP
NOP
and R12,R2, R5
or R13,R6, R2
add R14,R2, R2
sw R15,100(R2)
Programexecutionorder
10 10 10 -20 -20 -20 -20
IM Reg
IM Reg DM Reg
IM DM Reg
Reg
Reg
DM
• Have compiler avoid hazards by adding NOP instructions
• Problem: this really slows us down!
Computer Structure 2012 – Pipeline62
Delayed Branch Define branch to take place AFTER n following instruction
HW executes n instructions following the branch regardless of branch is taken or not
SW puts in the n slots following the branch instructions that need to be executed regardless of branch resolution Instructions that are before the branch instruction, or Instructions from the converged path after the branch
If cannot find independent instructions, put NOP
Original Coder3 = 23R4 = R3+R5If (r1==r2) goto xR1 = R4 + R5X: R7 = R1
New CodeIf (r1==r2) goto x r3 = 23R4 = R3 +R5NOPR1 = R4 + R5X: R7 = R1
Þ
Computer Structure 2012 – Pipeline63
Delayed Branch Performance Filling 1 delay slot is easy, 2 is hard, 3 is harder Assuming we can effectively fill d% of the delayed slots
CPInew = 1 + 0.2 × (3 × (1-d))
For example, for d=0.5, we get CPInew = 1.3 Mixing architecture with micro-arch
New generations requires more delay slots Cause computability issues between generations