ELEC 5200/6200
Computer Architecture and Design
Spring 2017 Lecture 5: Pipelining
2/20/2017 ELEC 5200-001/6200-001 Lecture 5 1
Ujjwal Guin, Assistant Professor
Department of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
http://www.auburn.edu/~uzg0005/
Adapted from Dr. Chen-Huan Chiang (Intel) and Prof. Vishwani D. Agrawal (Auburn University)
[Adapted from Computer Organization and Design, Patterson & Hennessy, 2014]
ILP: Instruction Level Parallelism
Single-cycle and multi-cycle datapaths execute one
instruction at a time.
How can we get better performance?
Answer: Execute multiple instructions at the same
time.
– Pipelining – Enhance a multi-cycle datapath to fetch one
instruction every cycle.
– Parallelism – Fetch multiple instructions every cycle.
Automobile Team Assembly
1 car assembled every four hours
6 cars per day
180 cars per month
2,160 cars per year
[Figure: a team performs four 1-hour tasks on one car at a time]
Automobile Assembly Line
[Figure: assembly line with four stations, Task 1 to Task 4, 1 hour each]
First car assembled in 4 hours (pipeline latency)
1 car completed per hour thereafter
21 cars on first day, thereafter 24 cars per day
717 cars per month
8,637 cars per year
What gives 4X increase?
Tasks: Mechanical, Electrical, Painting, Testing
Throughput: Team Assembly
[Figure: timeline of two cars built one after the other, each passing through Mechanical, Electrical, Painting, Testing]
Time of assembling one car = n hours
where n is the number of nearly equal subtasks,
each requiring 1 unit of time
Throughput = 1/n cars per unit time
[Timeline labels: red car started, red car completed, blue car started, blue car completed]
Throughput: Assembly Line
Time to complete first car = n time units (latency)
Cars completed in time T = T – n + 1
Throughput = 1 – (n – 1)/ T cars per unit time
\[
\frac{\text{Throughput (assembly line)}}{\text{Throughput (team assembly)}}
= \frac{1 - (n-1)/T}{1/n}
= n - \frac{n(n-1)}{T} \;\to\; n \quad \text{as } T \to \infty
\]
[Figure: Cars 1 to 4 each pass through Mechanical, Electrical, Painting, Testing, each car starting one time unit after the previous one; Car 1 completes after n time units and Car 2 one time unit later]
Key idea: overlap execution
of multiple tasks
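The throughput formulas above can be checked numerically. The Python sketch below is an illustration only and is not part of the original slides. The function names are made up, and the 30-day month and 360-day year are assumptions chosen because they reproduce the 717 and 8,637 figures quoted for the assembly line.

```python
def cars_completed(n, T):
    """Cars finished by an n-stage assembly line in T time units (T >= n)."""
    return T - n + 1

def speedup(n, T):
    """Assembly-line throughput divided by team-assembly throughput (1/n)."""
    line_throughput = cars_completed(n, T) / T
    return line_throughput / (1 / n)

n = 4  # four one-hour tasks
for label, hours in [("day", 24), ("month", 30 * 24), ("year", 360 * 24)]:
    print(label, cars_completed(n, hours), round(speedup(n, hours), 3))
```

The ratio stays below n for any finite T and approaches n = 4 as T grows, which is the limit stated in the formula.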
Some Features of Assembly Line
[Figure: assembly line with four 1-hour tasks: Mechanical, Electrical, Painting, Testing]
Electrical parts delivered just in time (JIT)
Defect found: stall the assembly line to fix the cause of the defect
3 cars in the assembly line are suspects, to be removed (flush pipeline)
Pros and Cons
Advantages: Efficient use of labor.
Specialists can do better job.
Just in time (JIT) methodology eliminates warehouse cost.
Disadvantages: Penalty of defect latency.
Lack of flexibility in production.
Assembly line work is monotonous and boring.
https://www.youtube.com/watch?v=IjarLbD9r30
https://www.youtube.com/watch?v=ANXGJe6i3G8
https://www.youtube.com/watch?v=5lp4EbfPAtI
Pipelining a Digital System
Key idea: break big computation up into pieces
Separate each piece with a pipeline register
[Figure: a 1 ns computation split into five 200 ps pieces separated by pipeline registers]
Pipelining a Digital System
Why do this? Because it's faster for repeated
computations
Non-pipelined (one 1 ns block): 1 operation finishes every 1 ns
Pipelined (five 200 ps stages): 1 operation finishes every 200 ps
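A small sketch of why pipelining helps only for repeated computations. It is not from the slides: the function names are hypothetical, and it assumes the pipeline registers add no delay of their own.

```python
def non_pipelined_time_ps(operations, total_delay_ps=1000):
    """Each operation occupies the whole 1 ns block before the next can start."""
    return operations * total_delay_ps

def pipelined_time_ps(operations, stages=5, stage_delay_ps=200):
    """The first result needs `stages` cycles; each later result needs one more cycle."""
    return (stages + operations - 1) * stage_delay_ps

for count in (1, 10, 1000):
    print(count, non_pipelined_time_ps(count), pipelined_time_ps(count))
```

For a single operation both versions take 1000 ps, so latency is unchanged. The gain appears only as the number of operations grows.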
Pipelining a Processor
Recall the 5 steps in instruction execution:
1. Instruction Fetch (IF)
2. Instruction Decode and Register Read (ID)
3. Execute operation or calculate address (ALU or EX)
4. Memory access (MEM)
5. Write result into register (WB)
Review: Single-Cycle Processor
– All 5 steps done in a single clock cycle
– Dedicated hardware required for each step
What happens if we break execution into multiple cycles, and add extra hardware?
– Recall that in multi-cycle, datapath hardware differs from single-cycle
Review - Single-Cycle Processor
[Figure: single-cycle MIPS datapath (PC, instruction memory, register file, sign extend, ALU, data memory, adders, multiplexers), divided into five steps: IF (Instruction Fetch), ID (Instruction Decode), EX (Execute/Address Calc.), MEM (Memory Access), WB (Write Back)]
Pipelining - Key Idea
Question: What happens if we break execution into
multiple cycles, and add the extra hardware?
Answer: in the best case, we can start executing a
new instruction on each clock cycle
– this is pipelining
Pipelining stages:
– IF - Instruction Fetch
– ID - Instruction Decode
– EX - Execute / Address Calculation
– MEM - Memory Access (read / write)
– WB - Write Back (results into register file)
Project Summary
A RISC CPU is to be designed in the VHDL modeling language, verified via the Mentor Graphics "ModelSim" or Aldec "Active-HDL" simulator, and implemented on the Altera DE2 FPGA board using Altera's Quartus II software.
The project consists of six parts. Due dates will be listed above as the semester progresses. You should read the problem definitions of all six parts before actually starting with Part 1, i.e., Instruction Set Architecture (ISA).
Please submit only the List Format (do not submit wave format) of the simulation results in part 3, part 4, and part 5. Always annotate your simulation results. Maintain a single folder for submitting the project parts. When submitting a later part, all the previous parts need to be in the folder.
Instruction Set Architecture Classes
[Figure: four instruction set architecture classes, each showing processor, ALU and memory: a) Stack b) Accumulator c) Register-Memory d) Register-Register]
Hennessy, John L., and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.
Basic Pipelined Processor
[Figure: the single-cycle datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB inserted between the stages]
Single-Cycle vs. Pipelined Execution
Non-pipelined (time axis 0 to 1800 ps): lw $1, 100($0), lw $2, 200($0) and lw $3, 300($0) run back to back. Each passes through Instruction Fetch, REG RD, ALU, MEM and REG WR and takes 800 ps, so a new instruction starts every 800 ps.
Pipelined (time axis 0 to 1600 ps): the same three instructions overlap. Each stage takes 200 ps and a new instruction starts every 200 ps.
Note: REG RD is at the end of a stage but REG WR is at the beginning of a stage
Single-Cycle vs. Pipelined Execution (cont.)
Time taken in pipeline stages is limited by the slowest operation
– Either ALU operation or memory access
Time taken in ALU stage (i.e. EX) is used as pipeline clock cycle in the following discussion
– If most memory accesses are cache accesses, MEM < ALU
Assumptions (Fig 4.27 on p.276)
– Write to the register/memory occurs in the first half of the clock cycle
– Read from register/memory occurs in the second half of the clock cycle
– Without this assumption, Clock Cycle 5 of the following example (Executing Multiple Instructions) has a problem: the register file is used by 2 instructions at different stages (ID and WB)
– How to design such an assumption?
[Figure: pipelined timing diagram for lw $1, 100($0), lw $2, 200($0) and lw $3, 300($0), with 200 ps stages]
Comments about Pipelining
The good news
– Multiple instructions are being processed at the same time
– This works because stages are isolated by registers
– Best case speedup equals the number of stages
The bad news
– Instructions interfere with each other: hazards
   Structural hazard: different instructions may need the same piece of hardware (e.g., memory) in the same clock cycle
   Control hazard: the next instruction to fetch (IF) is not known until EX of the branch instruction
   Data hazard: an instruction may require a result produced by an earlier instruction that is not yet complete
– Worst case: must suspend execution (stall)
Example - Executing Multiple Instructions
Consider the following instruction sequence
lw $r0, 10($r1)
sw $r3, 20($r4)
add $r5, $r6, $r7
sub $r8, $r9, $r10
Executing Multiple Instructions
[Figures: the pipelined datapath, one per clock cycle, showing which instruction occupies each stage]
Clock Cycle 1: LW in IF
Clock Cycle 2: SW in IF, LW in ID
Clock Cycle 3: ADD in IF, SW in ID, LW in EX
Clock Cycle 4: SUB in IF, ADD in ID, SW in EX, LW in MEM
Clock Cycle 5: SUB in ID, ADD in EX, SW in MEM, LW in WB
Clock Cycle 6: SUB in EX, ADD in MEM, SW in WB
Clock Cycle 7: SUB in MEM, ADD in WB
Clock Cycle 8: SUB in WB
Compact View
                     CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8
lw  $r0, 10($r1)     IM    REG   ALU   DM    REG
sw  $r3, 20($r4)           IM    REG   ALU   DM    REG
add $r5, $r6, $r7                IM    REG   ALU   DM    REG
sub $r8, $r9, $r10                     IM    REG   ALU   DM    REG
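The compact view can be generated mechanically: instruction i occupies stage s in clock cycle i + s + 1. The sketch below is illustrative only, assumes an ideal pipeline with no hazards or stalls, and uses made-up names (`pipeline_rows`, `STAGES`).

```python
STAGES = ["IM", "REG", "ALU", "DM", "REG"]

def pipeline_rows(instructions):
    """One row per instruction; row[c] is the stage occupied in clock cycle c + 1."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i in range(len(instructions)):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # each instruction starts one cycle after the previous one
        rows.append(row)
    return rows

program = ["lw  $r0, 10($r1)", "sw  $r3, 20($r4)",
           "add $r5, $r6, $r7", "sub $r8, $r9, $r10"]
rows = pipeline_rows(program)
header = "".join(f"CC{c + 1:<3}" for c in range(len(rows[0])))
print(f"{'':20}{header}")
for text, row in zip(program, rows):
    print(f"{text:20}" + "".join(f"{cell:<5}" for cell in row))
```

Four instructions in a five-stage pipeline need 4 + 5 - 1 = 8 clock cycles, matching CC 1 to CC 8 in the diagram.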
Pipeline Hazards
Where one instruction cannot immediately follow
another
Types of hazards
– Structural hazards - attempt to use same resource twice
– Control hazards - attempt to make decision before
condition is evaluated
– Data hazards - attempt to use data before it is ready
We can always resolve hazards by waiting
– i.e. stall
Structural Hazards
Attempt to use same resource twice at same time
Example: A single memory for both instructions and data
– Accessed by IF stage
– Accessed at same time by MEM stage
Solutions
– Delay second access by one clock cycle, OR
– Provide separate memories for instructions and data (IM and DM)
   This is what MIPS does
   Recall "Harvard Architecture"
   Real pipelined processors have separate caches
Structural Hazard - Single Memory
[Figure: four instructions, each passing through IF ID EX MEM WB and starting one cycle apart (time axis 0 to 14); a memory conflict is marked where one instruction's MEM stage coincides with a later instruction's IF stage]
Control Hazards
Attempt to make a decision before condition is evaluated
Example: beq $s0, $s1, offset
– Must begin fetching the instruction following the branch on the very next clock cycle
– But the pipeline does not know which instruction is next, since it has only just received the branch instruction from memory
– Possible solutions: stall, predict, or delayed decision
If we add hardware to second stage to:
– Compare fetched registers for equality
– Compute branch target and update PC
– This allows branch to be taken at end of second clock cycle
   This may not be possible for longer pipelines, where the branch may not be resolved in the 2nd stage, giving a larger slowdown
– Must make sure that the additional hardware does not increase pipeline clock cycle.
Control Hazard Solutions
Stall - Stop loading instructions until result is available
Predict - Assume an outcome and continue fetching (undo if prediction is wrong)
– Always assuming branch untaken
– Or assuming half of the branches taken and half untaken
Delayed branch (used in MIPS)
– Always executes the next SAFE instruction in the sequence
   A safe instruction is an instruction which is not affected by the branch
– MIPS software will place such a safe instruction immediately after the delayed branch
   This step is hidden from the MIPS assembly programmer
– If the branch is taken, it changes the address of the instruction that follows the safe instruction
Control Hazard – Stall
All following discussions assume the extra hardware at the 2nd stage
[Figure (time axis 0 to 18): add $r4,$r5,$r6 and beq $r0,$r1,tgt each pass through IF ID EX MEM WB. beq writes the PC using the extra hardware. A STALL (one bubble) follows beq, and the new PC is then used to fetch tgt: sw $s4,200($t5).]
Control Hazard - Correct Prediction
[Figure (time axis 0 to 18): add $r4,$r5,$r6, beq $r0,$r1,tgt and tgt: sw $s4,200($t5) each pass through IF ID EX MEM WB in consecutive cycles. The sw is fetched assuming the branch is taken, so no bubble is inserted.]
Control Hazard - Incorrect Prediction
[Figure (time axis 0 to 18): add $r4,$r5,$r6 and beq $r0,$r1,tgt each pass through IF ID EX MEM WB. The next instruction, or $r8,$r8,$r9, is fetched (IF) and then "squashed" into bubbles because the prediction was incorrect (STALL). tgt: sw $s4,200($t5) is fetched one cycle later.]
Control Hazard - Delayed Branch
[Figure (time axis 0 to 18): add $r4,$r5,$r6, beq $r0,$r1,tgt, the branch slot instruction and $r6,$r6,$r7, and tgt: sw $s4,200($t5) each pass through IF ID EX MEM WB in consecutive cycles. The branch slot instruction always executes, and the correct PC is available in time to fetch the sw.]
Or re-arrange the code to execute the previous "add" in the branch slot
Summary - Control Hazard Solutions
Stall - stop fetching instruction until result is available
– Significant performance penalty
– Hardware required to stall
Predict - assume an outcome and continue fetching
(undo if prediction is wrong)
– Performance penalty only when guess wrong
– Hardware required to "squash" instructions
Delayed branch - specify in architecture that following
instruction is always executed
– Compiler re-orders instructions into delay slot
– Insert "NOP" (no-op) operations when can't use (~50%)
– This is how original MIPS worked
Example: Delayed branch
Loop: lw   $8, 100($7)
      addi $7, $7, 4
      beq  $7, $4, Loop
addi is not a "safe" instruction to be placed at the branch slot (i.e. the instruction after beq)
– Because of the dependence on $7 between addi and beq.
lw seems a safe instruction candidate but its location does not allow it to be moved to the branch slot
– Because "addi $7, $7, 4" is after "lw $8, 100($7)"; i.e., if lw is moved to the branch slot, the value of $7 is off by 4.
Example: delayed branch (cont.)
Changes made to the MIPS code
– Swapping addi and lw location
– Changing offset from 100 to 100-4=96
In order to keep the results of the two programs identical
– The value of $7 at the new location should be the value prior to "addi $7,$7,4"
Loop: addi $7, $7, 4
      lw   $8, 96($7)
      beq  $7, $4, Loop
After the above swapping and changing of the offset, lw can be safely moved to the delay slot
Loop: addi $7, $7, 4
      beq  $7, $4, Loop
      lw   $8, 96($7)    # delay slot
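The offset change can be sanity-checked by modelling one loop iteration in both orders. This is only a sketch: it tracks the load address and the final value of $7, ignores memory contents and the branch itself, and the function names are invented for illustration.

```python
def original_iteration(r7):
    address = r7 + 100   # lw   $8, 100($7)
    r7 = r7 + 4          # addi $7, $7, 4
    return address, r7

def rescheduled_iteration(r7):
    r7 = r7 + 4          # addi $7, $7, 4
    address = r7 + 96    # lw   $8, 96($7), now after the addi (delay slot)
    return address, r7

for start in (0, 4, 1000):
    assert original_iteration(start) == rescheduled_iteration(start)
    print(start, original_iteration(start))
```

Both orders load from the same address and leave $7 with the same value, because (r7 + 4) + 96 = r7 + 100.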
Data Hazards
Attempt to use data before it is ready
Solutions
– Stalling - wait until result is available
– Forwarding (Bypassing) - make data available inside datapath
– Re-ordering instructions - use compiler to avoid hazards
Examples:
add $s0, $t0, $t1 ; $s0 = $t0+$t1
sub $t2, $s0, $t3 ; $t2 = $s0-$t3
lw $s0, 0($t0) ; $s0 = MEM[$t0]
sub $t2, $s0, $t3 ; $t2 = $s0-$t3
Data Hazard - Stalling
[Figure (time axis 0 to 18): add $s0,$t0,$t1 passes through IF ID EX MEM WB and writes $s0 in WB (Ws0). sub $t2,$s0,$t3 is fetched (IF), then two STALL cycles (bubbles) are inserted so that it reads $s0 (Rs0) in the cycle in which add writes it, before continuing with EX and MEM.]
May need one more STALL, i.e. a 3rd one, to be absolutely free of data hazards, if a register file that is written before it is read within the same cycle cannot be designed
Data Hazards - Forwarding
Key idea: connect new value directly to next stage
Still read $s0, but ignore it in favor of the new result
Forwarding is valid only if the destination stage is later in time than the source stage
– Problem: what about load instructions?
   If the "add" is replaced by "lw", data won't be available until the MEM stage.
Data Hazards - Forwarding
STALL still required for LOAD instruction
– Because data is available only after MEM
MIPS architecture calls this delayed load; initial implementations required the compiler to deal with this
[Figure (time axis 0 to 18): lw $s0,20($t1) passes through IF ID EX MEM WB (Ws0). sub $t2,$s0,$t3 is fetched, one STALL (bubble) is inserted, and the new value of $s0 is forwarded from lw to the EX stage of sub (Rs0).]
Data Hazards - Reordering Instructions
What are the hazards in this code?
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
Data forwarding resolves the data hazards but still requires a STALL
Reorder instructions to remove the hazard without any STALL when using data forwarding:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)
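The effect of the reordering can be checked by counting load-use pairs. The sketch below assumes forwarding is present, so the only remaining stall is a load immediately followed by an instruction that reads the loaded register. The tuple encoding (op, destination, sources) and the function name are hypothetical.

```python
def load_use_stalls(program):
    """Count stalls caused by a lw whose result is read by the very next instruction."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "lw" and prev_dest in curr[2]:
            stalls += 1
    return stalls

# (op, destination register, source registers)
original = [("lw", "$t0", ["$t1"]), ("lw", "$t2", ["$t1"]),
            ("sw", None, ["$t2", "$t1"]), ("sw", None, ["$t0", "$t1"])]
reordered = [("lw", "$t0", ["$t1"]), ("lw", "$t2", ["$t1"]),
             ("sw", None, ["$t0", "$t1"]), ("sw", None, ["$t2", "$t1"])]

print(load_use_stalls(original), load_use_stalls(reordered))
```

In the original order, sw $t2 directly follows lw $t2 and costs one stall. After swapping the two stores, sw $t0 sits between them and no stall is needed.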
Summary - Pipelining Overview
Pipelining increases throughput (but does not reduce latency)
Hazards limit performance
– Structural hazards
– Control hazards
– Data hazards
Summary: Hazards
Structural hazards
– Cause: resource conflict
– Remedies: (i) hardware resources, (ii) stall (bubble)
Data hazards
– Cause: data unavailability
– Remedies: (i) forwarding, (ii) stall (bubble), (iii) code reordering
Control hazards
– Cause: out-of-sequence execution (branch or jump)
– Remedies: (i) stall (bubble), (ii) branch prediction/pipeline flush, (iii) delayed branch/pipeline flush
Control Unit for Pipelined MIPS
Single-Cycle Control Logic
Inputs: opcode (instruction bits 31-26). Outputs: control signals.
Instr. type  Opcode (31-26)  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0  Jump
R            000000          1       0       0         1         0        0         0       1       0       0
lw           100011          0       1       1         1         1        0         0       0       0       0
sw           101011          X       1       X         0         0        1         0       0       0       0
beq          000100          X       0       X         0         0        0         1       0       1       0
J            000010          X       X       X         0         X        0         X       X       X       1
Single-Cycle Control Circuit
[Figure: control circuit. Inputs Op5 to Op0 are decoded into one line per instruction type (R, lw, sw, beq, J); these lines are combined to produce the outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0 and Jump]
ALU Control Logic
Inputs: ALUOp1, ALUOp0 from CU; funct. code F5-F0 from IR (bits 0-5). Outputs to ALU: 3-bit code.
Instr. type  ALUOp1  ALUOp0  F5  F4  F3  F2  F1  F0  3-bit code  Operation
lw, sw       0       0       X   X   X   X   X   X   010         Add
B            0       1       X   X   X   X   X   X   110         Subtract
R            1       X       X   X   0   0   0   0   010         Add
R            1       X       X   X   0   0   1   0   110         Subtract
R            1       X       X   X   0   1   0   0   000         AND
R            1       X       X   X   0   1   0   1   001         OR
R            1       X       X   X   1   0   1   0   111         slt
ALU Control
[Figure: ALU control block with inputs F3 to F0 (function code) and ALUOp1, ALUOp0 (from control circuit); it drives the 3-bit operation select of the ALU, which outputs zero, result and overflow]
Operation select  ALU function
000               AND
001               OR
010               Add
110               Subtract
111               Set on less than
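The ALU control truth table can be written as a small lookup. This is a behavioural sketch in Python, not the gate-level circuit. The function name is invented, and the funct values used in the usage lines (0x20 add, 0x22 sub, 0x24 and, 0x25 or, 0x2A slt) are the standard MIPS encodings rather than something stated on the slides.

```python
def alu_control(alu_op, funct):
    """alu_op: 2-bit value (ALUOp1 ALUOp0); funct: 6-bit function code F5..F0."""
    if alu_op == 0b00:
        return 0b010          # lw, sw: add to form the address
    if alu_op == 0b01:
        return 0b110          # branch: subtract to compare
    # R-type (ALUOp1 = 1, ALUOp0 is a don't care): decode F3..F0 only
    r_type = {0b0000: 0b010,  # add
              0b0010: 0b110,  # subtract
              0b0100: 0b000,  # AND
              0b0101: 0b001,  # OR
              0b1010: 0b111}  # slt
    return r_type[funct & 0b1111]

print(format(alu_control(0b10, 0x22), "03b"))  # sub
print(format(alu_control(0b00, 0x00), "03b"))  # lw or sw
```

A funct code outside the five table rows raises KeyError here; real hardware would simply produce whatever the simplified logic yields for those don't-care inputs.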
Returning to Pipelined Control
Opcode input to control is supplied by the pipeline register IF/ID in the ID (instruction decode) cycle.
Nine control signals are generated in the ID cycle, but none is used. They are saved in the pipeline register ID/EX.
ALUSrc, RegDst and ALUOp (2 bits) are used in the EX (execute) cycle. The remaining 5 control signals are saved in the pipeline register EX/MEM.
Branch, MemWrite and MemRead are used in the MEM (memory access) cycle. The remaining 2 control signals are saved in the pipeline register MEM/WB.
MemtoReg and RegWrite are used in the WB (write back) cycle.
Pipelined control is shown without Jump.
Pipelined Datapath with Control Signals
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB, annotated with the control signals RegDst, ALUSrc, ALUOp, ALU control, Branch, MemRead, MemWrite, PCSrc, MemtoReg and RegWrite]
Control
Basic approach:
– Based on single-cycle control
– Place control unit in ID stage
– Pass control signals to following stages
Later: extra features to deal with:
– Data forwarding
– Stalls
– Exceptions
Control for Pipelined Datapath
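[Figure: the Control block in the ID stage generates three groups of signals that travel through the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]
EX: RegDst, ALUOp[1:0], ALUSrc (stored in ID/EX)
M: MemRead, MemWrite, Branch (stored in ID/EX and EX/MEM)
WB: RegWrite, MemtoReg (stored in ID/EX, EX/MEM and MEM/WB)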
Control for Pipelined Datapath
Control lines grouped by stage: Execution/Address Calculation (RegDst, ALUOp1, ALUOp0, ALUSrc), Memory access (Branch, MemRead, MemWrite), Write-back (RegWrite, MemtoReg)
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format     1       1       0       0       0       0        0         1         0
lw           0       0       0       1       0       1        0         1         1
sw           X       0       0       1       0       0        1         0         X
beq          X       0       1       0       1       0        0         0         X
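The way control signals are consumed stage by stage can be sketched with a lookup table. This Python fragment is illustrative only: the names `CONTROL`, `SIGNALS` and `pipeline_register_contents` are invented, and don't-care entries (X) are represented as None.

```python
X = None  # don't care

SIGNALS = ["RegDst", "ALUOp1", "ALUOp0", "ALUSrc",   # used in EX
           "Branch", "MemRead", "MemWrite",           # used in MEM
           "RegWrite", "MemtoReg"]                    # used in WB

CONTROL = {
    "R-format": [1, 1, 0, 0, 0, 0, 0, 1, 0],
    "lw":       [0, 0, 0, 1, 0, 1, 0, 1, 1],
    "sw":       [X, 0, 0, 1, 0, 0, 1, 0, X],
    "beq":      [X, 0, 1, 0, 1, 0, 0, 0, X],
}

def pipeline_register_contents(instr):
    """Control signals still carried for `instr` in each pipeline register."""
    values = dict(zip(SIGNALS, CONTROL[instr]))
    ex, mem, wb = SIGNALS[:4], SIGNALS[4:7], SIGNALS[7:]
    return {
        "ID/EX":  {s: values[s] for s in ex + mem + wb},
        "EX/MEM": {s: values[s] for s in mem + wb},
        "MEM/WB": {s: values[s] for s in wb},
    }

regs = pipeline_register_contents("lw")
print({name: len(signals) for name, signals in regs.items()})
```

The counts 9, 5 and 2 correspond to the nine signals generated in ID, the five remaining after EX and the two remaining after MEM.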
Datapath and Control Unit
[Figure: pipelined datapath together with the control unit]
Tracking Control Signals - Cycles 1 to 5
[Figures: the pipelined datapath with control values annotated, one per clock cycle]
Cycle 1: LW in IF
Cycle 2: SW in IF, LW in ID
Cycle 3: ADD in IF, SW in ID, LW in EX
Cycle 4: SUB in IF, ADD in ID, SW in EX, LW in MEM
Cycle 5: SUB in ID, ADD in EX, SW in MEM, LW in WB
Data Hazards Revisited…
Data hazards occur when data is used before it is
stored
– RAW (read after write).
[Figure: pipeline diagram over CC 1 to CC 9 (time in clock cycles), each instruction passing through IM, Reg, ALU, DM, Reg, for the following instructions in program execution order]
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2: 10 in CC 1 to CC 4, 10/–20 in CC 5, –20 in CC 6 to CC 9
Data Hazards Revisited… (cont.)
Data hazards can be classified into 3 types, depending on the order of read and write accesses in the instructions.
Consider two instructions i and j, with i occurring before j
– RAW (read after write)
   j tries to read a source before i writes it
   So j incorrectly gets the old value
– WAR (write after read)
   j tries to write a destination before it is read by i
   So i incorrectly gets the new value
   WAR never happens in MIPS because all READs are early in ID stage and all WRITEs are later in WB stage
   WAR can occur in pipelines where some instructions write results early (for example, auto-increment addressing) and other instructions read a source late in the pipeline
– WAW (write after write)
   j tries to write an operand before it is written by i
   The writes end up performed in the wrong order, leaving the value written by i rather than the value written by j in the destination
   MIPS pipeline writes a register only in WB stage and avoids WAW
   WAW only occurs in pipelines that write in more than one pipeline stage, or allow an instruction to proceed even when a previous instruction is stalled
Can RAR (read after read) be a data hazard?
Data Hazard Solution: Forwarding
Key idea: connect data internally before it's stored
[Figure: pipeline diagram marking the forwarding paths for the EX hazard and the MEM hazard]
Data Hazard Solution: Forwarding
Add hardware to feed back ALU and MEM results to
both ALU inputs
Forwarding Unit
Controlling Forwarding
Data hazard at "EX" stage: (EX Hazard)
– EX/MEM - test whether the instruction in EX/MEM writes the register file and examine its rd register
– ID/EX - test whether the instruction in ID/EX reads the rs or rt register and whether it matches the rd register in EX/MEM
Data hazard at "MEM" stage: (MEM Hazard)
– MEM/WB - test whether the instruction in MEM/WB writes the register file and examine its rd (or rt) register
– ID/EX - test whether the instruction in ID/EX reads the rs or rt register and whether it matches the rd (or rt) register in MEM/WB
Forwarding Unit Detail - EX Hazard
if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
ForwardA = 10
if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
ForwardB = 10
Forwarding Unit Detail - MEM Hazard
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01
MEM Hazard Complication
One complication is potential data hazards between
the result of the instruction in WB stage, the result
of the instruction in MEM stage and the source
operand of the instruction in ALU stage.
Example: What if a register is changed more than once?
– add $1, $1, $2;
– add $1, $1, $3;
– add $1, $1, $4;
Answer: forward most recent result (in MEM stage)
Forwarding Unit Detail - MEM Hazard Revised
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01
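The EX and MEM hazard conditions can be combined into one selection function per ALU operand. The sketch below is a behavioural model, not the slides' exact logic: it gives the EX hazard priority by testing it first, which is one way to read "forward most recent result". The dictionary layout and the name `forward_select` are assumptions.

```python
def forward_select(ex_mem, mem_wb, src_reg):
    """Return the 2-bit Forward control for one ALU operand.

    ex_mem, mem_wb: dicts with 'RegWrite' (bool) and 'RegisterRd' (int).
    src_reg: ID/EX.RegisterRs for ForwardA, ID/EX.RegisterRt for ForwardB.
    """
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0 \
            and ex_mem["RegisterRd"] == src_reg:
        return 0b10  # EX hazard: take the ALU result from EX/MEM
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0 \
            and mem_wb["RegisterRd"] == src_reg:
        return 0b01  # MEM hazard: take the value from MEM/WB
    return 0b00      # no hazard: use the register file value

# Third add of: add $1,$1,$2; add $1,$1,$3; add $1,$1,$4
ex_mem = {"RegWrite": True, "RegisterRd": 1}
mem_wb = {"RegWrite": True, "RegisterRd": 1}
print(format(forward_select(ex_mem, mem_wb, 1), "02b"))
```

Both pipeline registers hold a pending write to $1 here, and the function picks the newer one in EX/MEM. Register 0 is never forwarded because it always reads as zero.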
Hazard Detection Unit - Control Detail
if (ID/EX.MemRead and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt)))
stall
Pipelined Processor with Hazard Detection
[Figure: pipelined datapath with forwarding unit and hazard detection unit. The hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt as inputs and drives PCWrite, IF/IDWrite and the select of a multiplexer that replaces the control signals with 0. The forwarding unit takes Rs and Rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd. The multiplexer that zeros the control signals is marked: this is how "stall" is implemented.]
Hazard Detection Unit
How “stall” is implemented
MUX zeros out control signals for instruction in ID
– "squashes” the instruction
– “no-op” propagates through following stages
IF/ID holds stalled instruction until next clock cycle
PC holds current value until next clock cycle (re-loads first instruction)
Control (Branch) Hazards
Just stalling for each branch is not practical
Common assumption: branch not taken
When assumption fails: flush three instructions
– Note that the following figure does not assume the extra hardware in the ID stage that reduces the control hazard.
Reducing Branch Delay
Key idea: move branch logic to ID stage of pipeline
– New adder calculates branch target (PC + 4 + (extend(IMM) << 2))
– New hardware tests rs == rt immediately after register read
– Add flush signal to squash instruction in IF/ID register
Reduced penalty (1 cycle) when branch taken
Example on the next slide: Figure 4.62, p. 320
– Assume that branch is taken (i.e., $1==$3)
One bubble
– i.e., one instruction is flushed
36 sub $10, $4, $8
40 beq $1, $3, 7 # PC-relative branch 40+4+7*4 = 72
44 and $12, $2, $5
......
72 lw $4, 50($7)
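The branch target arithmetic in the example can be reproduced with a few lines of Python. This is a sketch with invented helper names; it assumes a 16-bit immediate that is sign-extended and then shifted left by 2, as in the formula above.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def branch_target(pc, imm16):
    """PC + 4 + (sign-extended immediate << 2)."""
    return pc + 4 + (sign_extend16(imm16) << 2)

print(branch_target(40, 7))       # the beq at address 40 with offset 7
print(branch_target(40, 0xFFFF))  # offset -1 branches back to the beq itself
```

The first call gives 72, the address of the lw in the example. The second shows why sign extension matters for backward branches such as loops.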
A couple of details are ignored:
(i) IF.Flush comes from the control unit;
(ii) the output of the equivalence check of rs and rt should be fed into the control unit, which then determines the branch control for the MUX in front of the PC
Branch Prediction
Key idea: instead of always assuming branch not
taken, use a prediction based on previous history
– Branch history table: a small memory
Indexed by lower bits of the address of the branch instruction
Using one bit to save the history of “what happened” on last
execution
– branch taken (‘1’)
– branch not taken (‘0’)
– Use history to make prediction
Branch Prediction
Useful for program loops.
A one-bit prediction scheme: a one-bit buffer carries a “history bit” that tells what happened on the last branch instruction
History bit = 1, branch was taken
History bit = 0, branch was not taken
[State diagram: state 0 predicts branch not taken and state 1 predicts branch taken. A taken branch moves to (or stays in) state 1; a not-taken branch moves to (or stays in) state 0.]
Branch Prediction
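[Figure: branch history table indexed by the low-order bits of the PC. Each entry holds the address of a recent branch instruction, its target address and the history bit(s). The stored branch address is compared (=) with the PC, and the prediction logic selects either PC+4 or the target address as the next PC.]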
Branch Prediction for a Loop
Execution of Instruction d
Execution seq.  Old hist. bit  Next instr. (Pred.)  I   Next instr. (Act.)  New hist. bit  Prediction
1               0              e                    1   b                   1              Bad
2               1              b                    2   b                   1              Good
3               1              b                    3   b                   1              Good
4               1              b                    4   b                   1              Good
5               1              b                    5   b                   1              Good
6               1              b                    6   b                   1              Good
7               1              b                    7   b                   1              Good
8               1              b                    8   b                   1              Good
9               1              b                    9   b                   1              Good
10              1              b                    10  e                   0              Bad
[Flowchart with boxes labeled a to e: I = 0; X = X + R(I); I = I + 1; test I – 10 = 0? (instruction d; N loops back to b, Y continues to e); Store X in memory]
h.bit = 0 branch not taken, h.bit = 1 branch taken.
Prediction Accuracy
One-bit predictor: 2 errors out of 10 predictions
Prediction accuracy = 80%
To improve prediction accuracy, use two-bit
predictor: A prediction must be wrong twice before it is changed
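The 2-out-of-10 error count for the one-bit scheme can be reproduced with a short simulation. This is a sketch with an invented function name; outcomes are encoded as 1 for taken and 0 for not taken, and the history bit starts at 0 as in the loop table.

```python
def one_bit_mispredictions(outcomes, history=0):
    """Count wrong predictions when the prediction is simply the last outcome."""
    wrong = 0
    for taken in outcomes:
        if history != taken:
            wrong += 1
        history = taken  # the history bit always records the latest outcome
    return wrong

loop_branch = [1] * 9 + [0]  # taken nine times, then falls through
errors = one_bit_mispredictions(loop_branch)
print(errors, 1 - errors / len(loop_branch))
```

The two errors are the first iteration (history bit still 0) and the loop exit, giving the 80% accuracy quoted above.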
Two-Bit Prediction Buffer
Implemented as a two-bit counter.
Can improve correct prediction statistics.
[State diagram: four states. 00 and 01 predict branch not taken; 10 and 11 predict branch taken. Edges are labeled "taken" and "Not taken": a taken branch moves the counter toward 11 and a not-taken branch moves it toward 00.]
Branch Prediction for a Loop
Execution of Instruction 4
Execution seq.  Old pred. buf  Next instr. (Pred.)  I   Next instr. (Act.)  New pred. buf  Prediction
1               10             2                    1   2                   11             Good
2               11             2                    2   2                   11             Good
3               11             2                    3   2                   11             Good
4               11             2                    4   2                   11             Good
5               11             2                    5   2                   11             Good
6               11             2                    6   2                   11             Good
7               11             2                    7   2                   11             Good
8               11             2                    8   2                   11             Good
9               11             2                    9   2                   11             Good
10              11             2                    10  5                   10             Bad
[Flowchart with boxes labeled 1 to 5: I = 0; X = X + R(I); I = I + 1; test I – 10 = 0? (instruction 4; N loops back to 2, Y continues to 5); Store X in memory]
Performance Comparison
*More on performance evaluation will be discussed in a future lecture.
Single-Cycle Performance
Assume 200 ps for memory access
100 ps for ALU operation
50 ps for register file read or write
Cycle time set according to longest instruction:
lw ≡ IF + ID/RegRead + ALU + MEM + RegWrite
= 200 + 50 +100 + 200 + 50
= 600 ps
Cycles Per Instruction (CPI) = 1
Av. instruction execution time = clock cycle time
= 600 ps
Multicycle Performance
Consider SPECINT2000* instruction mix:
25% lw, 5 cycles
10% sw, 4 cycles
11% branch, 3 cycles
2% jump, 3 cycles
52% ALU instr., 4 cycles
Av. CPI = 0.25×5 + 0.10×4 + 0.11×3 + 0.02×3 + 0.52×4 = 4.12
Clock cycle time determined from longest operation (memory access) = 200 ps
Av. instruction execution time = 4.12×200 = 824 ps
Pipeline Performance
Neglect initial latency (reasonable for long programs).
One instruction completed every clock cycle unless delayed by hazard.
Average CPI:
lw: 2 cycles in 50% of cases due to hazard, 1.5 cycles on average
sw: 1 cycle
ALU: 1 cycle
branch: 2 cycles in 25% of cases due to hazard, 1.25 cycles on average
jump: 2 cycles
For SPECINT2000
Av. CPI = 0.25×1.5 + 0.10×1 + 0.11×1.25 + 0.02×2.0 + 0.52×1 = 1.17
Clock cycle time (longest operation: memory access) = 200 ps
Av. instruction execution time = 1.17×200 = 234 ps
Comparing Alternatives
Type of datapath and control  Clock cycle time  Average CPI  Av. instruction execution time
Single-cycle                  600 ps            1.00         600 ps
Multicycle                    200 ps            4.12         824 ps
Pipelined                     200 ps            1.17         234 ps
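The comparison can be recomputed from the instruction mix and per-class cycle counts given in the preceding slides. The sketch below uses invented variable names. Note that the unrounded pipelined CPI is 1.1725, so the unrounded execution time is 234.5 ps; the slides round the CPI to 1.17 first and report 234 ps.

```python
MIX = {"lw": 0.25, "sw": 0.10, "branch": 0.11, "jump": 0.02, "alu": 0.52}
MULTICYCLE_CYCLES = {"lw": 5, "sw": 4, "branch": 3, "jump": 3, "alu": 4}
PIPELINED_CYCLES = {"lw": 1.5, "sw": 1, "branch": 1.25, "jump": 2, "alu": 1}

def average_cpi(cycles):
    """Weighted average of cycles per instruction over the instruction mix."""
    return sum(MIX[kind] * cycles[kind] for kind in MIX)

# Single-cycle: the clock must fit lw = IF + RegRead + ALU + MEM + RegWrite
single_cycle_ps = 200 + 50 + 100 + 200 + 50
multicycle_cpi = average_cpi(MULTICYCLE_CYCLES)
pipelined_cpi = average_cpi(PIPELINED_CYCLES)

print("single-cycle", single_cycle_ps)
print("multicycle", round(multicycle_cpi, 2), round(multicycle_cpi * 200))
print("pipelined", round(pipelined_cpi, 4), round(pipelined_cpi * 200, 1))
```

With these figures the pipelined design is roughly 2.6 times faster than the single-cycle design, well short of the ideal factor of 5 because the clock is set by the slowest stage and hazards raise the CPI above 1.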
Exceptions
A typical exception occurs when ALU produces an overflow signal.
Control asserts the following actions on exception:
– Change the PC address to 4000 0040hex. This is the location of the exception routine. This is done by adding an additional input to the PC input multiplexer.
– Overflow is detected in the EX cycle. Similar to data hazard and pipeline flush, set IF/ID to 0 (nop).
– Generate ID.Flush and EX.Flush signals to set all control signals to 0 in ID/EX and EX/MEM registers. This also prevents the ALU result (presumed contaminated) from being written in the WB cycle.
Next Class: Memory Organization