Pipelining Datapath
Adapted from the lecture notes of Dr. John Kubiatowicz (UC Berkeley)
and Hank Walker (TAMU)
Pipelining is Natural!
• Laundry Example
• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
Sequential Laundry
• Sequential laundry takes 6 hours for 4 loads
[Figure: task order (loads A to D) vs. time, 6 PM to midnight; the loads run back to back, each taking 30 + 40 + 20 minutes.]
Pipelined Laundry: Start work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads
[Figure: task order (loads A to D) vs. time, starting at 6 PM; the loads overlap, with successive intervals of 30, 40, 40, 40, 40 and 20 minutes.]
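The two laundry totals can be checked with a short calculation. The sketch below is only an illustration: the function names are made up here, and it assumes a finished load can wait between machines until the next machine is free.

```python
# Stage times from the laundry example: washer, dryer, "folder" (minutes).
STAGES = [30, 40, 20]


def sequential_minutes(stages, loads):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(stages, loads):
    """Each load enters a stage as soon as both the load and the stage are free."""
    stage_free_at = [0] * len(stages)  # time at which each machine becomes free
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, duration in enumerate(stages):
            start = max(ready, stage_free_at[s])
            stage_free_at[s] = start + duration
            ready = stage_free_at[s]
    return stage_free_at[-1]


if __name__ == "__main__":
    print(sequential_minutes(STAGES, 4) / 60)  # 6.0 hours
    print(pipelined_minutes(STAGES, 4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: once the pipeline is full, one load leaves the dryer every 40 minutes.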
Pipelining Lessons
• Latency vs. Throughput
• Question
– What is the latency in both cases?
– What is the throughput in both cases?
Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
Pipelining Lessons [contd…]
• Question
– What is the fastest operation in the example?
– What is the slowest operation in the example?
Pipeline rate limited by slowest pipeline stage
Pipelining Lessons [contd…]
Multiple tasks operating simultaneously using different resources
Pipelining Lessons [contd…]
• Question
– Would the speedup increase if we had more steps?
Potential Speedup = Number of pipe stages
Pipelining Lessons [contd…]
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
• Question
– Would it make a difference if the “Folder” also took 40 minutes?
Unbalanced lengths of pipe stages reduce speedup
Pipelining Lessons [contd…]
Time to “fill” pipeline and time to “drain” it reduces speedup
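The last few lessons (slowest stage, stage balance, fill and drain) can be seen together in one formula. The sketch below is an illustration, not part of the original notes: it assumes every task has identical stage times, so that n tasks take sum(stages) + (n - 1) x max(stages) in the pipeline. The function name is hypothetical.

```python
def speedup(stages, tasks):
    """Sequential time divided by pipelined time for identical tasks."""
    sequential = tasks * sum(stages)
    # The first task fills the pipeline; after that, one task completes
    # every max(stages) time units (the slowest stage sets the rate).
    pipelined = sum(stages) + (tasks - 1) * max(stages)
    return sequential / pipelined


if __name__ == "__main__":
    for stages in ([30, 40, 20], [30, 40, 40], [30, 30, 30]):
        few = speedup(stages, 4)
        many = speedup(stages, 1000)
        print(stages, f"4 tasks: {few:.2f}", f"1000 tasks: {many:.2f}")
```

With only 4 loads the fill and drain time keeps the speedup well below the number of stages. With many loads the speedup approaches sum(stages) / max(stages), which equals the number of stages only when the stages are balanced.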
Five Stages of an Instruction
• Ifetch: Instruction Fetch
– Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file

        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Load    Ifetch   Reg/Dec  Exec     Mem      Wr
Conventional Pipelined Execution Representation
(Time runs to the right; program flow runs downward.)

IFetch  Dcd     Exec    Mem     WB
        IFetch  Dcd     Exec    Mem     WB
                IFetch  Dcd     Exec    Mem     WB
                        IFetch  Dcd     Exec    Mem     WB
                                IFetch  Dcd     Exec    Mem     WB
                                        IFetch  Dcd     Exec    Mem     WB
Example [contd…]
• $\text{Time}_{\text{pipeline}} = \text{Time}_{\text{non-pipeline}} / \text{number of pipe stages}$
– Assumptions
  • Stages are perfectly balanced
  • Ideal conditions
• Ideally, time per instruction = 8 ns / 5 = 1.6 ns (a speedup of 5)
• Most cases are not ideal!
Example [contd…]
• Speedup in this case = 24/14 = 1.7
• Let's add 1000 more instructions
– Time (non-pipelined) = 1000 x 8 + 24 ns = 8024 ns
– Time (pipelined) = 1000 x 2 + 14 ns = 2014 ns
– Speedup = 8024 / 2014 = 3.98 ≈ 4 = 8/2
Instruction throughput is the important metric (as opposed to the latency of an individual instruction), because real programs execute billions of instructions.
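The arithmetic in this example can be reproduced with a few lines. This is a sketch under the example's own numbers (8 ns per non-pipelined instruction, a 2 ns pipelined clock, 5 stages); the function names are illustrative.

```python
NON_PIPELINED_NS = 8  # time for one whole instruction without pipelining
CLOCK_NS = 2          # pipelined clock period, set by the slowest stage
STAGES = 5


def nonpipelined_ns(instructions):
    return instructions * NON_PIPELINED_NS


def pipelined_ns(instructions):
    # The first instruction needs all STAGES cycles; each later one
    # completes one cycle after its predecessor.
    return (STAGES + instructions - 1) * CLOCK_NS


if __name__ == "__main__":
    for n in (3, 1003):
        ratio = nonpipelined_ns(n) / pipelined_ns(n)
        print(n, nonpipelined_ns(n), pipelined_ns(n), f"{ratio:.2f}")
```

For 3 instructions this gives 24 ns against 14 ns; for 1003 instructions it gives 8024 ns against 2014 ns, a ratio close to 8/2 = 4.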
Pipeline Hazards
• Structural Hazard
[Figure: six overlapped instructions, each passing through IFetch, Dcd, Exec, Mem, WB one cycle after the previous one; program flow runs downward.]
Summary Pipelining Lessons
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously using different resources
• Potential speedup = Number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” pipeline and time to “drain” it reduces speedup
• Stall for Dependences
Summary of Pipeline Hazards
• Structural Hazards
– Hardware design
• Control Hazard
– Decision based on results
• Data Hazard
– Data Dependency
Example
10 lw r1, r2(35)
14 addI r2, r2, 3
20 sub r3, r4, r5
24 beq r6, r7, 100
30 ori r8, r9, 17
34 add r10, r11, r12
100 and r13, r14, 15
Start: Fetch 10
[Datapath figure: PC = 10; lw r1, r2(35) is in IF; the four later stages are empty (n).]
Fetch 14, Decode 10
[Datapath figure: PC = 14; addI r2, r2, 3 is in IF; lw r1, r2(35) is in ID with rs = 2; the three later stages are empty (n).]
Fetch 20, Decode 14, Exec 10
[Datapath figure: PC = 20; sub r3, r4, r5 is in IF; addI r2, r2, 3 is in ID with rs = 2; lw r1 is in EX with A = r2 and immediate 35; the two later stages are empty (n).]
Fetch 24, Decode 20, Exec 14, Mem 10
[Datapath figure: PC = 24; beq r6, r7, 100 is in IF; sub r3, r4, r5 is in ID with rs = 4, rt = 5; addI r2, r2, 3 is in EX with A = r2 and immediate 3; lw r1 is in MEM with S = r2+35; the WB stage is empty (n).]
Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
[Datapath figure: PC = 30; ori r8, r9, 17 is in IF; beq r6, r7, 100 is in ID with rs = 6, rt = 7; sub r3 is in EX with A = r4, B = r5; addI r2 is in MEM with S = r2+3; lw r1 is in WB with M = M[r2+35].]
Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14
[Datapath figure: PC = 100; and r13, r14, 15 is in IF; ori r8, r9, 17 is in ID with rs = 9; beq is in EX with A = r6, B = r7 and target 100; sub r3 is in MEM with S = r4-r5; addI r2 is in WB with r2+3; r1 = M[r2+35] has been written to the register file.]
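The six snapshots above follow one rule: every cycle, each instruction moves one stage to the right and a new one is fetched. The sketch below reproduces the stage occupancy from the fetch order used in the walk-through (10, 14, 20, 24, 30, then 100). It is an illustration only: the names are hypothetical, and it models no stalls or squashed instructions.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# Instruction addresses in the order the walk-through fetches them.
FETCH_ORDER = [10, 14, 20, 24, 30, 100]


def occupancy(fetch_order, cycle):
    """Return {stage: address or None} for a 0-based cycle number."""
    row = {}
    for depth, stage in enumerate(STAGES):
        index = cycle - depth  # the instruction fetched `depth` cycles ago
        in_range = 0 <= index < len(fetch_order)
        row[stage] = fetch_order[index] if in_range else None
    return row


if __name__ == "__main__":
    for cycle in range(len(FETCH_ORDER)):
        print(cycle, occupancy(FETCH_ORDER, cycle))
```

Cycle 4 of this trace corresponds to "Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10", and cycle 5 to "Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14".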
Pipelining Load Instruction
• The five independent functional units in the pipeline datapath are:
– Instruction Memory for the Ifetch stage
– Register File’s Read ports (busA and busB) for the Reg/Dec stage
– ALU for the Exec stage
– Data Memory for the Mem stage
– Register File’s Write port (bus W) for the Wr stage
Clock    Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw   Ifetch   Reg/Dec  Exec     Mem      Wr
2nd lw            Ifetch   Reg/Dec  Exec     Mem      Wr
3rd lw                     Ifetch   Reg/Dec  Exec     Mem      Wr
Pipelining the R Instruction
• Ifetch: Instruction Fetch
– Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec:
– ALU operates on the two register operands
– Update PC
• Wr: Write the ALU output back to the register file
         Cycle 1  Cycle 2  Cycle 3  Cycle 4
R-type   Ifetch   Reg/Dec  Exec     Wr
Pipelining Both Load and R-type
• We have a pipeline conflict or structural hazard:
– Two instructions try to write to the register file at the same time!
– Only one write port

         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type   Ifetch   Reg/Dec  Exec     Wr
R-type            Ifetch   Reg/Dec  Exec     Wr
Load                       Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                              Ifetch   Reg/Dec  Exec     Wr
R-type                                       Ifetch   Reg/Dec  Exec     Wr

Oops! We have a problem! (The Load and the R-type that follows it both write in Cycle 7.)
Important Observations
• Each functional unit can only be used once per instruction
• Each functional unit must be used at the same stage for all instructions:
– Load uses Register File’s Write Port during its 5th stage
– R-type uses Register File’s Write Port during its 4th stage

Stage    1        2        3     4     5
Load     Ifetch   Reg/Dec  Exec  Mem   Wr
R-type   Ifetch   Reg/Dec  Exec  Wr
Solution
• Delay R-type’s register write by one cycle:
– Now R-type instructions also use Reg File’s write port at Stage 5
– Mem stage is a NOOP stage: nothing is being done.

Stage    1        2        3     4     5
R-type   Ifetch   Reg/Dec  Exec  Mem   Wr

         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type   Ifetch   Reg/Dec  Exec     Mem      Wr
R-type            Ifetch   Reg/Dec  Exec     Mem      Wr
Load                       Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                              Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                                       Ifetch   Reg/Dec  Exec     Mem      Wr
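The write-port conflict and its fix can be checked mechanically. The sketch below is illustrative (the names are hypothetical): it issues one instruction per cycle, computes the cycle in which each instruction uses the register file's write port, and reports any cycle with more than one writer.

```python
from collections import defaultdict

# The instruction mix from the timing diagram, one issued per cycle.
PROGRAM = ["R", "R", "Load", "R", "R"]


def write_conflicts(program, write_stage):
    """Map each cycle with two or more register writes to its writers.

    write_stage gives, per instruction kind, the stage number (1-based)
    in which that kind uses the register file's single write port.
    """
    writers = defaultdict(list)
    for issue_cycle, kind in enumerate(program, start=1):
        write_cycle = issue_cycle + write_stage[kind] - 1
        writers[write_cycle].append((issue_cycle, kind))
    return {cycle: w for cycle, w in writers.items() if len(w) > 1}


if __name__ == "__main__":
    # R-type writes in stage 4, Load in stage 5: conflict.
    print(write_conflicts(PROGRAM, {"R": 4, "Load": 5}))
    # R-type padded with a NOOP Mem stage: every write is in stage 5.
    print(write_conflicts(PROGRAM, {"R": 5, "Load": 5}))
```

With the 4-stage R-type, the Load issued in cycle 3 and the R-type issued in cycle 4 both write in cycle 7. Once every instruction writes in stage 5, each cycle has at most one writer.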
Datapath (Without Pipeline)
All:      IR <- Mem[PC]; PC <- PC + 4;
          A <- R[rs]; B <- R[rt]
R-type:   S <- A + B;    R[rd] <- S;
Load:     S <- A + SX;   M <- Mem[S];   R[rd] <- M;
ori:      S <- A or ZX;  R[rt] <- S;
Store:    S <- A + SX;   Mem[S] <- B
Branch:   if Cond: PC <- PC + SX;
[Figure: datapath with PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Equal, Mem Access, Data Mem, M, Reg File.]
Datapath (With Pipeline)
All:      IR <- Mem[PC]; PC <- PC + 4;
          A <- R[rs]; B <- R[rt]
R-type:   S <- A + B;    M <- S;        R[rd] <- M;
Load:     S <- A + SX;   M <- Mem[S];   R[rd] <- M;
ori:      S <- A or ZX;  M <- S;        R[rt] <- M;
Store:    S <- A + SX;   Mem[S] <- B
Branch:   if Cond: PC <- PC + SX;
[Figure: datapath with PC, Next PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Equal, Mem Access, Data Mem, M, Reg File.]
Structural Hazard and Solution
[Figure: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) vs. time (clock cycles); each instruction passes through Mem, Reg, ALU, Mem, Reg, one cycle after the previous one.]
Control Hazard - #1 Stall
• Stall: wait until decision is clear
• Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow
[Figure: instruction order (Add, Beq, Load) vs. time (clock cycles); each instruction passes through Mem, Reg, ALU, Mem, Reg; the Load after the branch starts late, and the gap is marked “Lost potential”.]
Control Hazard – #2 Predict
• Predict: guess one direction, then back up if wrong
• Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right 50% of time)
• More dynamic scheme: history of 1 branch
[Figure: instruction order (Add, Beq, Load) vs. time (clock cycles); each instruction passes through Mem, Reg, ALU, Mem, Reg.]
Control Hazard - #3 Delayed Branch
• Delayed Branch: Redefine branch behavior (takes place after next instruction)
• Impact: 0 clock cycles per branch instruction if we can find an instruction to put in the “slot” (about 50% of the time)
[Figure: instruction order (Add, Beq, Misc, Load) vs. time (clock cycles); each instruction passes through Mem, Reg, ALU, Mem, Reg; Misc occupies the slot after Beq.]
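The three ways of handling a control hazard can be compared by their expected lost cycles per branch. The sketch below uses the figures quoted on these slides (2 cycles for a stall, 1 cycle for a wrong prediction made 50% of the time, a delay slot filled 50% of the time). One value is an assumption not stated in the notes: an unfilled delay slot is taken to cost 1 cycle. The names are hypothetical.

```python
def expected_lost_cycles(penalty, probability):
    """Average cycles lost per branch: the penalty weighted by how often it is paid."""
    return penalty * probability


SCHEMES = {
    # Stall: the penalty is paid on every branch.
    "stall": expected_lost_cycles(2, 1.0),
    # Predict: 1 cycle lost only when the guess is wrong (50% of the time).
    "predict": expected_lost_cycles(1, 0.5),
    # Delayed branch: ASSUMPTION, an unfilled slot costs 1 cycle (50% of the time).
    "delayed branch": expected_lost_cycles(1, 0.5),
}

if __name__ == "__main__":
    for name, lost in SCHEMES.items():
        print(f"{name}: {lost} lost cycles per branch on average")
```

Under these numbers, prediction and delayed branches both cut the average cost from 2 cycles to half a cycle per branch.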
Data Hazards (RAW)
• Dependencies backwards in time are hazards
Instr. order:
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
[Figure: instruction order vs. time (clock cycles); the five instructions pass through IF, ID/RF, EX, MEM, WB (Im, Reg, ALU, Dm, Reg), one cycle after another.]
Data Hazards [contd…]
• “Forward” result from one stage to another
Instr. order:
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
[Figure: instruction order vs. time (clock cycles); the five instructions pass through IF, ID/RF, EX, MEM, WB (Im, Reg, ALU, Dm, Reg), one cycle after another.]
Data Hazards [contd…]
• Dependencies backwards in time are hazards
• Can’t solve with forwarding
• Must delay/stall instruction dependent on loads
Instr. order:
lw r1,0(r2)
sub r4,r1,r3
[Figure: lw and sub vs. time (clock cycles), passing through IF, ID/RF, EX, MEM, WB (Im, Reg, ALU, Dm, Reg); sub is marked “Stall”.]
Hazard Detection
Structural Hazard:
I-Fetch DCD MemOpFetch OpFetch Exec Store
IFetch DCD ° ° °
Control Hazard:
I-Fetch DCD OpFetch Jump
IFetch DCD ° ° °
IF DCD EX Mem WB
IF DCD OF Ex Mem
RAW (read after write) Data Hazard
WAW Data Hazard (write after write)
IF DCD OF Ex RS WAR Data Hazard (write after read)
IF DCD EX Mem WB
IF DCD EX Mem WB
Hazard Detection
• Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.
• A RAW hazard exists on register $\rho$ if $\rho \in \mathrm{Rregs}(i) \cap \mathrm{Wregs}(j)$
• A WAW hazard exists on register $\rho$ if $\rho \in \mathrm{Wregs}(i) \cap \mathrm{Wregs}(j)$
• A WAR hazard exists on register $\rho$ if $\rho \in \mathrm{Wregs}(i) \cap \mathrm{Rregs}(j)$
[Figure: window on execution, showing instruction movement with a new instruction, Inst I and Inst J. Only pending instructions can cause hazards.]
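The three conditions are set intersections, so they translate directly into code. The sketch below is a minimal illustration: the `Instr` type and the `hazards` function are hypothetical names, and the read and write sets are written out by hand rather than decoded from instruction words.

```python
from typing import FrozenSet, NamedTuple


class Instr(NamedTuple):
    text: str
    reads: FrozenSet[str]   # Rregs: registers the instruction reads
    writes: FrozenSet[str]  # Wregs: registers the instruction writes


def hazards(i, j):
    """Hazards for instruction i about to issue, with earlier j still in the pipeline."""
    return {
        "RAW": i.reads & j.writes,   # i reads what j has yet to write
        "WAW": i.writes & j.writes,  # both write the same register
        "WAR": i.writes & j.reads,   # i writes what j has yet to read
    }


if __name__ == "__main__":
    j = Instr("add r1,r2,r3", frozenset({"r2", "r3"}), frozenset({"r1"}))
    i = Instr("sub r4,r1,r3", frozenset({"r1", "r3"}), frozenset({"r4"}))
    for kind, regs in hazards(i, j).items():
        print(kind, sorted(regs))
```

For the add/sub pair from the data hazard slides, only the RAW set is non-empty: it contains r1.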
Computing CPI
• Start with Base CPI
• Add stalls

$$CPI = CPI_{base} + CPI_{stall}$$
$$CPI_{stall} = STALL_{type1} \times freq_{type1} + STALL_{type2} \times freq_{type2}$$

• Suppose:
– $CPI_{base} = 1$
– $freq_{branch} = 20\%$, $freq_{load} = 30\%$
– Suppose branches always cause 1 cycle stall
– Loads cause a 2 cycle stall
• Then: CPI = 1 + (1 × 0.20) + (2 × 0.30) = 1.8
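The same calculation as a small function, useful for trying other instruction mixes. This is a sketch; the function name and the dictionary layout are illustrative choices.

```python
def cpi(base, stalls):
    """Base CPI plus, for each instruction type, stall cycles times frequency.

    stalls maps a type name to a (frequency, stall_cycles) pair.
    """
    return base + sum(freq * cycles for freq, cycles in stalls.values())


if __name__ == "__main__":
    mix = {
        "branch": (0.20, 1),  # 20% of instructions, 1 stall cycle each
        "load": (0.30, 2),    # 30% of instructions, 2 stall cycles each
    }
    print(f"CPI = {cpi(1.0, mix):.1f}")  # prints CPI = 1.8
```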