View
233
Download
0
Category
Preview:
Citation preview
8/12/2019 pipelined processor design.ppt
1/72
112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Pipelined Processor Design
8/12/2019 pipelined processor design.ppt
2/72
212-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Presentation Outline
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
8/12/2019 pipelined processor design.ppt
3/72
312-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Laundry Example: Three Stages
1. Wash dirty load of clothes
2. Dry wet clothes
3. Fold and put clothes into drawers
Each stage takes 30 minutes to complete
Four loads of clothes to wash, dry, and fold
A B
C D
Pipelining Example
8/12/2019 pipelined processor design.ppt
4/72
412-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Sequential laundry takes 6 hoursfor 4 loads
Intuitively, we can use pipeliningto speed up laundry
Sequential Laundry
Time
6 PM
A
30 30 30
7 8 9 10 11 12 AM
30 30 30
B
30 30 30
C
30 30 30
D
8/12/2019 pipelined processor design.ppt
5/72
512-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Pipelined laundry takes3 hoursfor 4 loads
Speedup factor is 2for4 loads
Time to wash, dry, andfold one load is still thesame (90 minutes)
Pipelined Laundry: Start LoadASAP
Time
6 PM
A
30
7 8 9 PM
B
30
30
C
30
30
30
D
30
30
30
30
30 30
8/12/2019 pipelined processor design.ppt
6/72
612-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Serial Execution versus Pipelining
Consider a task that can be divided into ksubtasks
The ksubtasksare executed on kdifferent stages
Each subtask requires one time unit
The total execution time of the task is k time units
Pipelining is to overlap the execution The kstages work in parallel on kdifferent tasks
Tasks enter/leave pipeline at the rate of one task per time unit
1 2 k1 2 k
1 2 k
1 2 k1 2 k
1 2 k
Without Pipelining
One completion every k time units
With Pipelining
One completion every 1time unit
8/12/2019 pipelined processor design.ppt
7/72712-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Synchronous Pipeline
Uses clocked registersbetween stages
Upon arrival of a clock edge
All registers hold the results of previous stages simultaneously
The pipeline stages are combinational logiccircuits
It is desirable to have balancedstages
Approximately equal delay in all stages
Clock period is determined by the maximum stage delay
S1 S2 SkRegister
Register
Register
Register
Input
Clock
Output
8/12/2019 pipelined processor design.ppt
8/72812-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Let ti= time delay in stage S
i
Clock cycle t= max(ti) is the maximum stage delay
Clock frequencyf = 1/t = 1/max(ti)
A pipeline can process ntasks in k + n1 cycles kcycles are needed to complete the first task
n1 cycles are needed to complete the remaining n1 tasks
Ideal speedup of a
k-stage pipeline over serial execution
Pipeline Performance
k + n1Pipelined execution in cycles
Serial execution in cycles== Sk k for largennkSk
8/12/2019 pipelined processor design.ppt
9/72912-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
MIPS Processor Pipeline
Five stages, one cycle per stage
1. IF: Instruction Fetch from instruction memory
2. ID: Instruction Decode, register read, and J/Br address
3. EX: Executeoperation or calculate load/store address
4. MEM: Memory accessfor load and store
5. WB: Write Backresult to register
8/12/2019 pipelined processor design.ppt
10/721012-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Single-Cycle vs PipelinedPerformance
Consider a 5-stage instruction execution in which
Instruction fetch = ALU operation = Data memory access = 200 ps
Register read = register write = 150 ps
What is the clock cycle of the single-cycle processor?
What is the clock cycle of the pipelined processor?
What is the speedup factor of pipelined execution?
Solution
Single-Cycle Clock = 200+150+200+200+150 = 900 ps
Reg ALU MEMIF
900 ps
Reg
Reg ALU MEMIF
900 ps
Reg
8/12/2019 pipelined processor design.ppt
11/721112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Single-Cycle versus Pipelined contd
Pipelined clock cycle =
CPI for pipelined execution =
One instruction completes each cycle (ignoring pipeline fill)
Speedup of pipelined execution =
Instruction count and CPI are equal in both cases
Speedup factor is less than 5 (number of pipeline stage)
Because the pipeline stages are not balanced
900 ps / 200 ps = 4.5
1
max(200, 150) = 200 ps
200
IF Reg MEMALU Reg
IF Reg MEM RegALU
IF Reg MEMALU Reg200
200 200 200 200 200
8/12/2019 pipelined processor design.ppt
12/721212-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Pipeline Performance Summary
Pipelining doesnt improve latencyof a single instruction
However, it improves throughputof entire workload
Instructions are initiated and completed at a higher rate
In ak-stagepipeline, kinstructions operate in parallel
Overlapped execution using multiple hardware resources
Potential speedup = number of pipeline stagesk
Unbalanced lengths of pipeline stages reduces speedup
Pipeline rate is limited by slowestpipeline stage
Unbalanced lengths of pipeline stages reduces speedup
Also, time to filland drainpipeline reduces speedup
8/12/2019 pipelined processor design.ppt
13/721312-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Next . . .
Pipelining versus Serial Execution Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
8/12/2019 pipelined processor design.ppt
14/721412-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
ID = Decode &
Register Read
Single-Cycle Datapath
Shown below is the single-cycle datapath
How to pipeline this single-cycle datapath?
Answer:Introduce pipeline register at end of each stage
Next
PC
zeroPCSrc
ALUCtrl
RegWrite
ExtOp
RegDst
ALUSrc
Data
Memory
Address
Data_in
Data_out
32
32ALU
32
Registers
RA
RB
BusA
BusB
RW
5
BusW
32
Address
Instruction
Instruction
Memory
PC
00
30
Rs
5
RdE
Imm16
Rt
0
1
0
1
32
Imm26
32
ALU result
32
0
1
clk
+1
0
1
30
Jump or Branch Target Address
MemRead
MemWrite
MemtoReg
EX = ExecuteIF = Instruction Fetch MEM = Memory
Access
WB =
Write
Back
Bne
Beq
J
8/12/2019 pipelined processor design.ppt
15/72
1512-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
zero
Pipelined Datapath
Pipeline registers are shown in green, including the PC
Same clock edge updates all pipeline registers, registerfile, and data memory (for store instruction)
clk
32
A
LU
32
5
Address
Instruction
Instruction
Memory Rs
5Rt1
0
ALU result
32
0
1
Data
Memory
Address
Data_in
Data_out
ID = Decode &
Register Read
EX = ExecuteIF = Instruction Fetch MEM = Memory
Access
WB=Wri
teBack
32
Register
File
RA
RB
BusA
BusB
RWBusW
32
E
Rd
0
1PC
AL
Uout
D
WBData
32
Imm16
Imm26
Next
PC
A
B
Imm
NPC2
Instruc
tion
NPC
+1
0
1
8/12/2019 pipelined processor design.ppt
16/72
1612-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Problem with Register Destination
Is there a problem with the register destination address?
Instruction in the ID stage different from the one in the WB stage
Instruction in the WB stage is not writing to its destination registerbut to the destination of a different instruction in the ID stage
zero
clk
32
ALU
32
5
Address
Instruction
Instruction
MemoryRs
5Rt1
0
ALU result
32
0
1
Data
Memory
Address
Data_in
Data_out
ID = Decode &
Register Read EX = ExecuteIF = Instruction Fetch MEM =
Memory Access
WB=Write
Back
32
RegisterF
ile
RA
RB
BusA
BusB
RWBusW
32
E
Rd
0
1PC
ALU
out
D
WBData
32
Imm16
Imm26
Next
PC
A
B
Imm
NPC2
Instruction
NPC
+1
0
1
8/12/2019 pipelined processor design.ppt
17/72
1712-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Pipelining the Destination Register Destination Register number should be pipelined
Destination register number is passed from ID to WB stage
The WB stage writes back data knowing the destination register
zero
clk
32
ALU
32
5
Address
Instruction
Instruction
Memory Rs
5Rt1
0
ALU result
32
0
1
Data
Memory
Address
Data_in
Data_out
ID EXIF MEM WB
32
RegisterFile
RA
RB
BusA
BusB
RWBusW
32
E
PC
ALUout
D
W
BData
32
Imm16
Imm26
Next
PC
A
B
Imm
NPC2
Ins
truction
NPC
+1
0
1Rd
Rd20
1 Rd3
Rd4
8/12/2019 pipelined processor design.ppt
18/72
1812-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Graphically Representing Pipelines
Multiple instruction execution over multiple clock cycles
Instructions are listed in execution order from top to bottom
Clock cycles move from left to right
Figure shows the use of resources at each stage and each cycle
Time (in cycles)
ProgramE
xecutionOrder
add $s1, $s2, $s3
CC2Reg
IM
DM
Reg
sub $t5, $s2, $t3
CC4
ALU
IM
sw $s2, 10($t3)
DM
Reg
CC5Reg
ALU
IM
DM
Reg
CC6
Reg
ALU DM
CC7
Reg
ALU
CC8
Reg
DM
lw $t6, 8($s5) IM
CC1
Reg
ori $s4, $t3, 7
ALU
CC3
IM
8/12/2019 pipelined processor design.ppt
19/72
1912-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Instruction-Time Diagram shows:
Which instruction occupying what stage at each clock cycle
Instruction flow is pipelined over the 5 stages
Instruction-Time Diagram
IF
WB
EX
ID
WB
EX
WB
MEM
ID
IF
EX
ID
IF
TimeCC1 CC4 CC5 CC6 CC7 CC8 CC9CC2 CC3
MEM
EX
ID
IF
WB
MEM
EX
ID
IF
lw $t7, 8($s3)
lw $t6, 8($s5)
ori $t4, $s3, 7sub $s5, $s2, $t3
sw $s2, 10($s3)InstructionOrder
Up to five instructions can be in the
pipeline during the same cycleInstruction Level Parallelism (ILP)
ALU instructions skipthe MEM stage.
Store instructionsskip the WB stage
8/12/2019 pipelined processor design.ppt
20/72
2012-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Control Signals
zero
clk
32
ALU
32
5
Address
Instruction
Instruction
MemoryRs
5Rt1
0
ALU result
32
0
1
Data
Memory
Address
Data_in
Data_out
32
RegisterF
ile
RA
RB
BusA
BusB
RWBusW
32
E
PC
ALU
out
D
WBData
32
Imm16
Imm26
Next
PC
A
B
Imm
NPC2
Instruction
NPC
+1
0
1Rd
Rd20
1 Rd3
Rd4
Same control signals used in the single-cycle datapath
ALUCtrl
RegWrite
RegDst
ALUSrc
MemWrite
MemtoReg
MemRead
ExtOp
PCSrc
ID EXIF MEM WB
Bne
Beq
J
8/12/2019 pipelined processor design.ppt
21/72
2112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Pipelined Control
zero
clk
32
AL
U
32
5
Address
Instruction
Instruction
Memory Rs
5Rt1
0
ALU result
32
0
1
Data
Memory
Address
Data_in
Data_out
32
RegisterFile
RA
RB
BusA
BusB
RWBusW
32
EPC
ALUout
D
WB
Data
32
Imm16
Imm26
Next
PC
A
B
Imm
NPC
2
Instruction
NPC
+1
0
1Rd
Rd20
1 Rd3
Rd4
PCSrc
Bne
Beq
J
Op
RegDst
EX
ALUSrc
ALUCtrl
ExtOp
JBeqBne
MEM
MemWrite
MemRead
MemtoReg
WB
RegWrite
Pass control
signals along
pipeline just
like the data
Main& ALUControl
func
8/12/2019 pipelined processor design.ppt
22/72
2212-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Pipelined Control Cont'd
ID stage generates all the control signals
Pipeline the control signals as the instruction moves
Extend the pipeline registers to include the control signals
Each stage uses some of the control signals
Instruction Decode and Register Read
Control signals are generated
RegDst is used in this stage
Execution Stage => ExtOp, ALUSrc, andALUCtrl
Next PC uses J,Beq,Bne, and zerosignals for branchcontrol
Memory Stage => MemRead,MemWrite,and MemtoReg
Write Back Stage => RegWriteis used in this stage
8/12/2019 pipelined processor design.ppt
23/72
2312-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Op
DecodeStage
Execute Stage
Control Signals
Memory Stage
Control Signals
Write
Back
RegDst ALUSrc ExtOp J Beq Bne ALUCtrl MemRd MemWr MemReg RegWrite
R-Type 1=Rd 0=Reg x 0 0 0 func 0 0 0 1
addi 0=Rt 1=Imm 1=sign 0 0 0 ADD 0 0 0 1
slti 0=Rt 1=Imm 1=sign 0 0 0 SLT 0 0 0 1
andi 0=Rt 1=Imm 0=zero 0 0 0 AND 0 0 0 1
ori 0=Rt 1=Imm 0=zero 0 0 0 OR 0 0 0 1
lw 0=Rt 1=Imm 1=sign 0 0 0 ADD 1 0 1 1
sw x 1=Imm 1=sign 0 0 0 ADD 0 1 x 0
beq x 0=Reg x 0 1 0 SUB 0 0 x 0
bne x 0=Reg x 0 0 1 SUB 0 0 x 0
j x x x 1 0 0 x 0 0 x 0
Control Signals Summary
8/12/2019 pipelined processor design.ppt
24/72
2412-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
8/12/2019 pipelined processor design.ppt
25/72
2512-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Hazards:situations that would cause incorrect execution
If next instruction were launched during its designated clock cycle
1. Structural hazards
Caused by resource contention
Using same resource by two instructions during the same cycle
2. Data hazards
An instruction may compute a result needed by next instruction
Hardware can detect dependencies between instructions
3. Control hazards
Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control
Hazards complicate pipeline control and limit performance
Pipeline Hazards
8/12/2019 pipelined processor design.ppt
26/72
2612-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Structural Hazards
Problem
Attempt to use the same hardware resource by two different
instructions during the same cycle
Example
Writing back ALU result in stage 4
Conflict with writing load data in stage 5
WB
WBEX
ID
WB
EX MEM
IF ID
IF
TimeCC1 CC4 CC5 CC6 CC7 CC8 CC9CC2 CC3
EX
IDIF
MEM
EXID
IF
lw $t6, 8($s5)
ori $t4, $s3, 7sub $t5, $s2, $s3
sw $s2, 10($s3)Instruct
ions
Structural Hazard
Two instructions areattempting to write
the register fileduring same cycle
8/12/2019 pipelined processor design.ppt
27/72
2712-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Resolving Structural Hazards Serious Hazard:
Hazard cannot be ignored
Solution 1: Delay Access to Resource
Must have mechanism to delay instruction access to resource
Delay all write backs to the register file to stage 5
ALU instructions bypass stage 4 (memory) without doing anything Solution 2: Add more hardware resources (more costly)
Add more hardware to eliminate the structural hazard
Redesign the register file to have two write ports
First write port can be used to write back ALU results in stage 4
Second write port can be used to write back load data in stage 5
8/12/2019 pipelined processor design.ppt
28/72
2812-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
8/12/2019 pipelined processor design.ppt
29/72
2912-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Dependency between instructions causes a data hazard
The dependent instructions are close to each other
Pipelined execution might change the order of operand access
Read After WriteRAW Hazard
Given two instructions I and J, where I comes before J
Instruction J should read an operand after it is written by I
Called a data dependencein compiler terminology
I: add $s1,$s2, $s3 # $s1 is writtenJ: sub $s4, $s1, $s3 # $s1 is read
Hazard occurs when Jreads the operand before I writes it
Data Hazards
8/12/2019 pipelined processor design.ppt
30/72
3012-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
DMReg
IM
Reg
ALU
IM
DM
Reg
Reg
ALU
IM
DM
Reg
Reg
ALU DM
Reg
ALU
Reg
DM
IM
Reg
ALU
IM
Time (cycles)
ProgramEx
ecutionOrder
value of $s2
sub $s2, $t1, $t3
CC110
CC2
add $s4, $s2, $t5
10
CC3
or $s6, $t3, $s2
10
CC4
and $s7, $t4, $s2
10
CC620
CC720
CC820
CC5
sw $t8, 10($s2)
10
Example of a RAW Data Hazard
Result of subis needed by add, or, and, & swinstructions
Instructions add& orwill read old valueof $s2from reg file
During CC5, $s2is written at end of cycle, old valueis read
8/12/2019 pipelined processor design.ppt
31/72
3112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
RegReg
Solution 1: Stalling the Pipeline
Three stall cycles during CC3thru CC5(wasting 3 cycles)
Stall cycles delay execution of add & fetching of orinstruction
The addinstruction cannot read $s2until beginning of CC6
The addinstruction remains in the Instruction register until CC6
ThePC register is not modifieduntil beginning of CC6
DM
Reg
RegReg
Time (in cycles)
In
structionOrder value of $s2
CC110
CC210
CC310
CC410
CC620
CC720
CC820
CC510
add $s4, $s2, $t5 IM
or $s6, $t3, $s2 IM ALU
ALU Reg
sub $s2, $t1, $t3 IM Reg ALU DM Reg
CC920
stall stall
DM
stall
8/12/2019 pipelined processor design.ppt
32/72
3212-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
DM
Reg
Reg
Reg
Reg
Reg
Time (cycles)
ProgramE
xecutionOrder
value of $s2sub $s2, $t1, $t3 IM
CC110
CC2
add $s4, $s2, $t5 IM
10CC3
or $s6, $t3, $s2
ALU
IM
10CC4
and $s7, $s6, $s2
ALU
IM
10CC6
Reg
DM
ALU
20
CC7
Reg
DM
ALU
20
CC8
Reg
DM
20
CC5
sw $t8, 10($s2)
Reg
DM
ALU
IM
10
Solution 2: Forwarding ALU Result
TheALU resultis forwarded(fed back) to theALU input
No bubbles are inserted into the pipeline and no cycles are wasted
ALU result is forwarded fromALU, MEM, andWBstages
8/12/2019 pipelined processor design.ppt
33/72
3312-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Implementing Forwarding
0123
01
23
Result
3232
clk
Rd
32
Rs
Instruction
0
1
ALU result
32
0
1
Data
Memory
Address
Data_in
Data_out
32
Rd4
ALUE
Imm16Imm26
1
0
Rd3
Rd2
A
BWData
D
Im26
32
RegisterFile
RB
BusA
BusB
RW
BusW
RARt
Two multiplexers added at the inputs of A & B registers
Data fromALU stage, MEM stage, andWB stageis fed back
Two signals: ForwardAand ForwardBcontrol forwarding
ForwardA
ForwardB
8/12/2019 pipelined processor design.ppt
34/72
3412-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Forwarding Control Signals
Signal ExplanationForwardA = 0 First ALU operand comes from register file = Value of (Rs)
ForwardA = 1 Forward result of previous instruction to A (from ALU stage)
ForwardA = 2 Forward result of 2nd
previous instruction to A (from MEM stage)
ForwardA = 3 Forward result of 3rdprevious instruction to A (from WB stage)
ForwardB = 0 Second ALU operand comes from register file = Value of (Rt)
ForwardB = 1 Forward result of previous instruction to B (from ALU stage)
ForwardB = 2 Forward result of 2ndprevious instruction to B (from MEM stage)
ForwardB = 3 Forward result of 3rdprevious instruction to B (from WB stage)
8/12/2019 pipelined processor design.ppt
35/72
3512-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Forwarding Example
Result
3232
clk
Rd
32
Rs
Instruction
0
1
ALU result
32
0
1
Data
Memory
Address
Data_in
Data_out
32
Rd4
ALU
extImm16
Imm26
1
0
Rd3
Rd2
A
B
0123
0123
W
Data
D
Imm
32
RegisterFile
RB
BusA
BusB
RWBusW
RARt
Instruction sequence:
lw $t4, 4($t0)ori $t7, $t1, 2
sub $t3, $t4, $t7
When subinstruction is fetched
oriwill be in the ALU stagelw will be in the MEM stage
ForwardA = 2from MEM stage ForwardB = 1from ALU stage
lw $t4,4($t0)ori $t7,$t1,2sub $t3,$t4,$t7
2
1
8/12/2019 pipelined processor design.ppt
36/72
3612-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
RAW Hazard Detection
Currentinstruction being decoded is in Decodestage
Previousinstruction is in the Executestage
Second previousinstruction is in the Memorystage
Third previousinstruction in the Write Backstage
If ((Rs != 0) and (Rs == Rd2) and (EX.RegWrite)) ForwardA 1
Else if ((Rs != 0) and (Rs == Rd3) and (MEM.RegWrite)) ForwardA 2
Else if ((Rs != 0) and (Rs == Rd4) and (WB.RegWrite)) ForwardA 3
Else ForwardA 0
If ((Rt != 0) and (Rt == Rd2) and (EX.RegWrite)) ForwardB 1
Else if ((Rt != 0) and (Rt == Rd3) and (MEM.RegWrite)) ForwardB 2
Else if ((Rt != 0) and (Rt == Rd4) and (WB.RegWrite)) ForwardB 3
Else ForwardB 0
8/12/2019 pipelined processor design.ppt
37/72
8/12/2019 pipelined processor design.ppt
38/72
3812-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Pipeline Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
8/12/2019 pipelined processor design.ppt
39/72
3912-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Reg
Reg
Reg
Time (cycles)
Program
Order
CC2
add $s4, $s2, $t5
Reg
IF
CC3
or $t6, $t3, $s2
ALU
IF
CC6
Reg
DM
ALU
CC7
Reg
Reg
DM
CC8
Reg
lw $s2, 20($t1) IF
CC1 CC4
and $t7, $s2, $t4
DM
ALU
IF
CC5
DM
ALU
Load Delay
Unfortunately, not all data hazards can be forwarded
Loadhas a delay that cannot be eliminated by forwarding
In the example shown below
The LWinstruction does not read data until end of CC4
Cannot forward data to ADDat end of CC3 - NOT possible
However, load canforward data to
2nd next and later
instructions
8/12/2019 pipelined processor design.ppt
40/72
4012-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Detecting RAW Hazard after Load
Detecting a RAW hazard after a Load instruction:
The loadinstruction will be in the EXstage
Instruction that depends on the load data is in the decode stage
Condition for stalling the pipeline
if ((EX.MemRead == 1) // Detect Load in EX stage
and (ForwardA==1 or ForwardB==1)) Stall // RAW Hazard
Insert a bubbleinto the EX stage after a load instruction
Bubble is a no-opthat wastes one clock cycle
Delays the dependent instruction after load by once cycle
Because of RAW hazard
8/12/2019 pipelined processor design.ppt
41/72
4112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Regor $t6, $s3, $s2 IM DM RegALU
RegALU DMReg
add $s4, $s2, $t5 IM
Reglw $s2, 20($s1) IM
stall
ALU
bubble bubble bubble
DM Reg
Stall the Pipeline for one Cycle
ADDinstruction depends on LWstall at CC3
Allow Loadinstruction in ALUstage to proceed
Freeze PCand Instructionregisters (NO instruction is fetched)
Introduce a bubbleinto the ALUstage (bubble is a NO-OP)
Loadcan forward data to next instruction after delaying itTime (cycles)
Program
Order
CC2 CC3 CC6 CC7 CC8CC1 CC4 CC5
8/12/2019 pipelined processor design.ppt
42/72
8/12/2019 pipelined processor design.ppt
43/72
4312-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Control Signals
Bubble= 0
0
1
Hazard Detect, Forward, and Stall
0123
0123
Result
3232
clk
Rd
32
Rs
0
1
ALU result
32
0
1
Data
Memory
Address
Data_in
Data_out
32
Rd4
ALU
E
Imm26
1
0
Rd3
Rd2
A
BWData
D
Im26
32
RegisterFile
RB
BusA
BusB
RWBusW
RARt
Instruction
ForwardB ForwardA
Hazard Detect
Forward, & Stall
func
RegDst
Main & ALUControl
Op
MEME
X
WB
RegWriteRegWrite
RegWrite
MemReadStall
DisablePC
PC
8/12/2019 pipelined processor design.ppt
44/72
4412-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
CPI for a Pipeline Processor
CPIPipeline= CPIIdeal+ Structural Stalls + RAW Stalls + WAR Stalls + WAW
Stalls + Branch Stalls CPIIdealis 1 for a simple Pipeline processor
CPIIdealis 0.5 for 2-Way Super Scalar processor
CPIIdealis 0.25 for 4-Way Super Scalar processor, and so on . . .
There are no Structural Hazards in the pipeline displayed in previous slides.
WAR and WAW hazards also do not appear in the current pipeline
Branch Hazards will be discussed later.
RAW Hazards (Caused due to data dependencies)
Majority of RAW Hazards are resolved by forwarding
Load RAW Hazards are resolved by Stalling
Stalls due to Load is --- One Cycles in the current Pipeline
8/12/2019 pipelined processor design.ppt
45/72
4512-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
CPI for a Pipeline Processor
Assume 37% Load instructions
40% of the Load Instructions Cause Stalls
How many cycle stalls for load instructions, in current pipeline?
One Cycle
CPIPipeline= CPIIdeal+ Structural Stalls + RAW Stalls + WAR Stalls +WAW Stalls + Branch Stalls
CPIPipeline= 1 + 0 + RAW Stalls + 0 + 0 + 0
CPIPipeline= 1 + RAW Stalls
CPIPipeline= 1 + Load Instructions x Load Instructions that CauseStalls x Number of Cycles of Stalls
CPIPipeline= 1 + 0.37 x 0.4 x 1
CPIPipeline= 1 + 0.148 = 1.148
8/12/2019 pipelined processor design.ppt
46/72
4612-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Code Scheduling to Avoid Stalls
Compilers reorder code in a way to avoid load stalls
Consider the translation of the following statements:
A = B + C; D = E F; // A thru F are in Memory
Slow code:
lw $t0, 4($s0) # &B = 4($s0)lw $t1, 8($s0) # &C = 8($s0)
add $t2, $t0, $t1 # stall cycle
sw $t2, 0($s0) # &A = 0($s0)
lw $t3, 16($s0) # &E = 16($s0)lw $t4, 20($s0) # &F = 20($s0)
sub $t5, $t3, $t4 # stall cycle
sw $t5, 12($0) # &D = 12($0)
Fast code: No Stalls
lw $t0, 4($s0)lw $t1, 8($s0)
lw $t3, 16($s0)
lw $t4, 20($s0)
add $t2, $t0, $t1sw $t2, 0($s0)
sub $t5, $t3, $t4
sw $t5, 12($s0)
Name Dependence: Write After
8/12/2019 pipelined processor design.ppt
47/72
4712-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Instruction Jshould write its result after it is read by I
Called anti-dependenceby compiler writers
I: sub $t4, $t1, $t3 # $t1 is read
J: add $t1,$t2, $t3 # $t1 is written
Results from reuse of the name $t1
NOT a data hazard in the 5-stage pipeline because:
Reads are always in stage 2
Writes are always in stage 5, and
Instructions are processed in order
Anti-dependence can be eliminated by renaming
Use a different destination register for add (eg, $t5)
Name Dependence: Write AfterRead
Name Dependence: Write After
8/12/2019 pipelined processor design.ppt
48/72
4812-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Name Dependence: Write AfterWrite
Same destination register is written by two instructions
Called output-dependence in compiler terminology
I: sub $t1, $t4, $t3 # $t1 is written
J: add $t1, $t2, $t3 # $t1 is written
again
Not a data hazard in the 5-stage pipeline because:
All writes are ordered and always take place in stage 5
However, can be a hazard in more complex pipelines
If instructions are allowed to complete out of order, and
Instruction Jcompletes and writes $t1before instruction I
Output dependence can be eliminated by renaming $t1
Read After Read is NOT a name dependence
8/12/2019 pipelined processor design.ppt
49/72
4912-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
8/12/2019 pipelined processor design.ppt
50/72
5012-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Control Hazards
Jump and Branch can cause great performance loss
Jump instruction needs only thejump target address
Branch instruction needs two things:
Branch Result Taken or Not Taken
Branch Target Address
PC + 4 If Branch is NOT taken
PC + 4 + 4 immediate If Branch is Taken
Jump and Branch targets are computed in the ID stageAt which point a new instruction is already being fetched
Jump Instruction: 2-cycle delay
Branch: 2-cycle delay for branch result (taken or not taken)
8/12/2019 pipelined processor design.ppt
51/72
5112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
2-Cycle Branch Delay
Control logic detects a Branchinstruction in the 2ndStage
ALU computes the Branch outcome in the 3rdStage
Next1 andNext2instructions will be fetched anyway
Convert Next1and Next2into bubbles if branch is taken
Beq $t1,$t2,L1 IF
cc1
Next1
cc2
Reg
IF
Next2
cc4 cc5 cc6 cc7
IF Reg DMALU
BubbleBubble Bubble
BubbleBubble BubbleBubble
L1: target instruction
cc3
BranchTargetAddr
ALU
Reg
IF
8/12/2019 pipelined processor design.ppt
52/72
5212-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Implementing Jump and Branch
zero
clk
32
ALU
32
5
Address
Instruction
Instruction
Memory Rs
5Rt1
0
32
Regis
terFile
RA
RB
BusA
BusB
RWBusW
EPC
ALUout
D
Imm16
Imm26
A
B
Im26
NPC
2
In
struction
NPC
+1
0
1Rd
Rd20
1 Rd3
PCSrc
Bne
BeqJ
Op
RegDst
EX
J, Beq, Bne
MEM
Main & ALUControl
fun
c
Next
PC
0
2
3
1
0
2
3
1
Control Signals0
1Bubble = 0
Branch Delay = 2 cycles
Branch target & outcomeare computed in ALU stage
J
umporBranchTarge
t
8/12/2019 pipelined processor design.ppt
53/72
5312-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
CPI for a Pipeline Processor Assume 18% Branch instructions, and 10% Branch Instructions
Assume to stall until branch condition and target is resolved
Assume conditional 54% of conditional branches are true (i.e., these branches areTAKEN)
Number of Stalls
Branch Stalls incase of Taken Branch = 2
Branch Stalls incase of Not-Taken = 2
Jump Stalls incase of Jump = 2
How many cycle stalls for load instructions, in current pipeline?
One Cycle
CPIPipeline= CPIIdeal+ Structural Stalls + RAW Stalls + WAR Stalls + WAW Stalls +Branch Stalls
CPIPipeline= CPIIdeal+ RAW Stalls + Branch Stalls
Ignoring RAW Stalls, as information is not provided
CPIPipeline= 1 + Branch Stalls
CPIPipeline= 1 + Branch Stalls
CPIPipeline= 1 + Branch Taken Stalls + Branch Not-Taken Stalls + Jump Stalls
CPIPipeline= 1 + 0.18 x 0.54 x 2 + 0.18 x 0.36 x 2 + 0.10 x 2
8/12/2019 pipelined processor design.ppt
54/72
5412-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Predict Branch NOT Taken
Branches can be predicted to be NOT taken
If branch outcome is NOT taken then
Next1 andNext2instructions can be executed
Do not convert Next1& Next2into bubbles
No wasted cycles
Beq $t1,$t2,L1 IF
cc1
Next1
cc2
Reg
IF
Next2
cc3
NOT TakenALU
Reg
IF Reg
cc4 cc5 cc6 cc7
ALU DM
ALU DM
Reg
Reg
8/12/2019 pipelined processor design.ppt
55/72
5512-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
CPI for a Pipeline Processor Assume 18% Branch instructions, and 10% Branch Instructions
Assume to Predict Not-Taken Branch policy
Assume conditional 54% of conditional branches are true (i.e., these branches are TAKEN)
Number of Stalls
Branch Stalls incase of Taken Branch = 2
Branch Stalls incase of Not-Taken = 0
If the condition is not true then the correct instructions are already in pipeline
Jump Stalls incase of Jump = 2
How many cycle stalls for load instructions, in current pipeline?
One Cycle
CPIPipeline= CPIIdeal+ Structural Stalls + RAW Stalls + WAR Stalls + WAW Stalls + Branch Stalls
CPIPipeline= CPIIdeal+ RAW Stalls + Branch Stalls
Ignoring RAW Stalls, as information is not provided
CPIPipeline= 1 + Branch Stalls
CPIPipeline= 1 + Branch Stalls CPIPipeline= 1 + Branch Taken Stalls + Branch Not-Taken Stalls + Jump Stalls
CPIPipeline= 1 + 0.18 x 0.54 x 2 + 0.18 x 0.36 x 0 + 0.10 x 2
CPIPipeline= 1 + 0.18 x 0.54 x 2 + 0.10 x 2
8/12/2019 pipelined processor design.ppt
56/72
5612-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Reducing the Delay of Branches
Branch delay can be reduced from 2 cycles tojust 1 cycle
Branches can be determined earlier in the Decode stage
A comparator is used in the decode stage to determine branchdecision, whether the branch is taken or not
Because of forwarding the delay in the second stage will beincreased and this will also increase the clock cycle
Only one instructionthat follows the branch is fetched
If the branch is taken then only one instruction is flushed
We should insert a bubble after jump or taken branch
This will convert the next instruction into a NOP
8/12/2019 pipelined processor design.ppt
57/72
5712-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
J, Beq, Bne
Reducing Branch Delay to 1 Cycle
clk
32
ALU
32
5
Address
Instruction
Instruction
Memory Rs
5Rt1
0
32
Regis
terFile
RA
RB
BusA
BusB
RWBusW
E
PC
ALUout
D
Imm16
A
B
Im16
In
struction
+1
0
1Rd
Rd20
1 Rd3
PCSrc
Bne
BeqJ
Op
RegDst
EX
MEM
Main & ALUControl
fun
c
0
2
3
1
0
2
3
1
Control Signals0
1Bubble = 0
J
umporBranchTarget
Reset signal convertsnext instruction afterjump or taken branchinto a bubble
Zero=
ALUCtrl
Reset
NextPC
Longer Cycle
Data forwardedthen compared
8/12/2019 pipelined processor design.ppt
58/72
5812-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
CPI for a Pipeline Processor Assume 18% Branch instructions, and 10% Branch Instructions
Assume to Predict Not-Taken Branch policy
Assume conditional 54% of conditional branches are true (i.e., these branches are TAKEN)
Number of Stalls
Branch Stalls incase of Taken Branch = 1
Branch Stalls incase of Not-Taken = 0
If the condition is not true then the correct instructions are already in pipeline
Jump Stalls incase of Jump = 1
How many cycle stalls for load instructions, in current pipeline?
One Cycle
CPIPipeline= CPIIdeal+ Structural Stalls + RAW Stalls + WAR Stalls + WAW Stalls + Branch Stalls
CPIPipeline= CPIIdeal+ RAW Stalls + Branch Stalls
Ignoring RAW Stalls, as information is not provided
CPIPipeline= 1 + Branch Stalls
CPIPipeline= 1 + Branch Stalls CPIPipeline= 1 + Branch Taken Stalls + Branch Not-Taken Stalls + Jump Stalls
CPIPipeline= 1 + 0.18 x 0.54 x 1 + 0.18 x 0.36 x 0 + 0.10 x 1
CPIPipeline= 1 + 0.18 x 0.54 x 1 + 0.10 x 1
8/12/2019 pipelined processor design.ppt
59/72
5912-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
8/12/2019 pipelined processor design.ppt
60/72
6012-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Branch Hazard Alternatives
Predict Branch Not Taken (previously discussed)
Successor instruction is already fetched
Do NOT Flush instruction after branch if branch is NOT taken
Flush only instructions appearing after Jump or taken branch
Delayed Branch Define branch to take placeAFTERthe next instruction
Compiler/assembler fills the branch delay slot (for 1 delay cycle)
Dynamic Branch Prediction
Loop branches are taken most of time
Must reduce branch delay to 0, but how?
How to predict branch behavior at runtime?
l d h
8/12/2019 pipelined processor design.ppt
61/72
6112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Define branch to take place afterthe next instruction
For a 1-cycle branch delay, we have onedelay slot
branch instruction
branch delay slot (next instruction)
branch target (if branch taken)
Compiler fills the branch delay slot
By selecting an independent instruction
From before the branch
If no independent instruction is found
Compiler fills delay slot with a NO-OP
Delayed Branch
label:
. . .
add $t2,$t3,$t4
beq $s1,$s0,label
Delay Slot
label:
. . .
beq $s1,$s0,label
add $t2,$t3,$t4
b k f l d h
8/12/2019 pipelined processor design.ppt
62/72
6212-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
New meaning for branch instruction
Branching takes place after next instruction (Not immediately!)
Impacts software and compiler
Compiler is responsible to fill the branch delay slot
For a 1-cycle branch delay Onebranch delay slot
However, modern processors and deeply pipelined
Branch penalty is multiple cycles in deeper pipelines
Multiple delay slots are difficult to fill with useful instructions
MIPS used delayed branching in earlier pipelines
However, delayed branching is not useful in recent processors
Drawback of Delayed Branching
CPI f Pi li P
8/12/2019 pipelined processor design.ppt
63/72
6312-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
CPI for a Pipeline Processor Assume 18% Branch instructions, and 10% Branch Instructions
Assume to Delay Branch policy Assume conditional 54% of conditional branches are true (i.e., these
branches are TAKEN)
Assume 75% of delay slots of branch instructions are filled with a validinstruction (i.e., no NO-OP)
Assume 85% of delay slots of jump instructions are filled with a valid
instruction.
How many cycle stalls for load instructions, in current pipeline?
One Cycle
CPIPipeline= CPIIdeal+ Structural Stalls + RAW Stalls + WAR Stalls + WAWStalls + Branch Stalls
CPIPipeline= CPIIdeal+ RAW Stalls + Branch Stalls
Ignoring RAW Stalls, as information is not provided
CPIPipeline= 1 + Branch Stalls + Jump Stalls
CPIPipeline= 1 + 0.18 x 0.25 x 1 + 0.10 x 0.15 x 1
Z D l d B hi
8/12/2019 pipelined processor design.ppt
64/72
6412-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Zero-Delayed Branching
How to achieve zero delay for a jump or a taken branch?
Jump or branch target address is computed in the ID stage
Next instruction has already been fetched in the IF stage
Solution
Introduce a Branch Target Buffer (BTB) in the IF stage Store the target address of recent branch and jump instructions
Use the lower bits of the PC to index the BTB
Each BTB entry stores Branch/Jump address & Target Address Check the PC to see if the instruction being fetched is a branch
Update the PC using the target address stored in the BTB
B h T t B ff
8/12/2019 pipelined processor design.ppt
65/72
6512-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Branch Target Buffer
The branch target bufferis implemented as a small cache
Stores the target address of recent branches and jumps
We must also have prediction bits
To predict whether branches are taken or not taken
The prediction bits are dynamically determined by the hardware
mux
PC
Branch Target & Prediction Buffer
Addresses of
Recent Branches
Target
Addresses
low-order bits
used as index
Predict
BitsInc
=predict_taken
D i B h P di ti
8/12/2019 pipelined processor design.ppt
66/72
6612-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Dynamic Branch Prediction
Prediction of branches at runtime using prediction bits
Prediction bits are associated with each entry in the BTB
Prediction bits reflect the recent history of a branch instruction
Typically few prediction bits (1 or 2) are used per entry
We dont know if the prediction is correct or not If correct prediction
Continue normal executionno wasted cycles
If incorrect prediction (misprediction) Flush the instructions that were incorrectly fetchedwasted cycles
Update prediction bits and target address for future use
Dynamic Branch Prediction
8/12/2019 pipelined processor design.ppt
67/72
6712-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Correct PredictionNo stall cycles
YesNo
Dynamic Branch Prediction Contd
Use PC to address InstructionMemory and Branch Target Buffer
FoundBTB entry with predict
taken?
Increment PC PC = target address
Mispredicted Jump/branchEnter jump/branch address, target
address, and set prediction in BTB entry.Flush fetched instructions
Restart PC at target address
Mispredicted branchBranch not taken
Update prediction bitsFlush fetched instructionsRestart PC after branch
NormalExecution
YesNo
Jumpor takenbranch?
Jumpor takenbranch?
No Yes
IF
ID
EX
1 bit P di ti S h
8/12/2019 pipelined processor design.ppt
68/72
6812-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Prediction is just a hint that is assumed to be correct
If incorrect then fetched instructions are flushed
1-bit prediction scheme is simplest to implement
1 bit per branch instruction (associated with BTB entry)
Record last outcome of a branch instruction (Taken/Not taken)
Use last outcome to predict future behavior of a branch
1-bit Prediction Scheme
Predict
Not Taken
TakenPredict
Taken
NotTaken
Not Taken
Taken
1 Bit P di t Sh t i
8/12/2019 pipelined processor design.ppt
69/72
6912-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
1-Bit Predictor: Shortcoming
Inner loop branch mispredicted twice!
Mispredict as taken on last iteration of inner loop
Then mispredict as not taken on first iteration of innerloop next time around
outer:
inner: bne , , innerbne , , outer
2 bit P di ti S h
8/12/2019 pipelined processor design.ppt
70/72
7012-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
1-bit prediction scheme has a performance shortcoming
2-bit prediction scheme works better and is often used
4 states: strong and weak predict taken / predict not taken
Implemented as a saturating counter
Counter is incremented to max=3 when branch outcome is taken
Counter is decremented to min=0 when branch is not taken
2-bit Prediction Scheme
Not Taken
Taken
Not Taken
TakenStrong
Predict
Not Taken
Taken
Weak
Predict
Taken
Not Taken
Weak
Predict
Not TakenNot Taken
TakenStrong
Predict
Taken
Fallacies and Pitfalls
8/12/2019 pipelined processor design.ppt
71/72
7112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad
Fallacies and Pitfalls Pipelining is easy!
The basic idea is easy
The devil is in the details
Detecting data hazards and stalling pipeline
Poor ISA design can make pipelining harder Complex instruction sets (Intel IA-32)
Significant overhead to make pipelining work
IA-32 micro-op approach
Complex addressing modes
Register update side effects, memory indirection
Pipeline Ha a ds S mma
8/12/2019 pipelined processor design.ppt
72/72
Pipeline Hazards Summary
Three types of pipeline hazards
Structural hazards: conflicts using a resource during same cycle
Data hazards: due to data dependencies between instructions
Control hazards: due to branch and jump instructions
Hazards limit the performance and complicate the design Structural hazards: eliminated by careful design or more hardware
Data hazards are eliminated by forwarding
However, load delay cannot be eliminated and stalls the pipeline
Delayed branching can be a solution when branch delay = 1 cycle
BTB with branch prediction can reduce branch delay to zero
Branch misprediction should flush the wrongly fetched instructions
Recommended