Upload
others
View
17
Download
0
Embed Size (px)
Citation preview
Processor Design � Pipelined Processor
Hung-Wei Tseng
Pipelining
7
Pipelining• Break up the logic with “pipeline registers” into
pipeline stages• Each stage can act on different instruction/data• States/Control signals of instructions are hold in
pipeline registers
8
latc
hpi
pelin
e re
g.
latc
hpi
pelin
e re
g.
pipe
line
reg.
pipe
line
reg.
pipe
line
reg.
pipe
line
reg.
Pipelining
9
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
cycle #1pi
pelin
e re
g
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
cycle #2
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
cycle #3
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
cycle #4
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
cycle #5
After the 5th cycle, the processor can do 5 instructions in parallel
Pipelining
10
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
cycle #6
cycle #7
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
cycle #8
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
cycle #9
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
cycle #10
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
pipe
line
reg
The processor can complete 1 instruction each cycleCPI == 1 if everything works perfectly!
Single-cycle v.s. pipeline
11
v.s.
Cycle time of a pipeline processor• Critical path is the longest possible delay between two
registers in a design.• The critical path sets the cycle time, since the cycle
time must be long enough for a signal to traverse the critical path.
• Lengthening or shortening non-critical paths does not change performance
• Ideally, all paths are about the same length
13
Designing a 5-stage pipeline processor for
MIPS
15
Basic steps of execution• Instruction fetch: where?
• Decode:• What’s the instruction?• Where are the operands?
• Execute• Memory access
• Where is my data?
• Write back• Where to put the result
• Determine the next PC16
Processor
PC
120007a30: 0f00bb27 ldah gp,15(t12) 120007a34: 509cbd23 lda gp,-25520(gp)120007a38: 00005d24 ldah t1,0(gp)120007a3c: 0000bd24 ldah t4,0(gp)120007a40: 2ca422a0 ldl t0,-23508(t1)120007a44: 130020e4 beq t0,120007a94120007a48: 00003d24 ldah t0,0(gp)120007a4c: 2ca4e2b3 stl zero,-23508(t1)in
stru
ctio
n m
emor
yda
ta m
emor
y800bf9000: 00c2e800 12773376800bf9004: 00000008 8800bf9008: 00c2f000 12775424800bf900c: 00000008 8800bf9010: 00c2f800 12777472800bf9014: 00000008 8800bf9018: 00c30000 12779520800bf901c: 00000008 8
R0
R1
R2
R31
......
..
registersALU
instruction memory
registersALUs
data memory
registers
Pipeline a MIPS processor• Instruction Fetch
• Read from instruction memory
• Decode• Figure out the incoming instruction?• Fetch the operands from the registers
• Execution• Perform ALU functions
• Memory access• Read/write data memory
• Write back results to registers• Write to the register file
17
Execution (EXE)
Instruction Fetch (IF)
Instruction Decode (ID)
Memory Access (MEM)
Write Back (WB)
From single-cycle to pipeline
18
ReadAddress
Instruc(onMemory
PC ALU
WriteData
4Add
ReadData1
ReadData2
ReadReg1
ReadReg2
WriteReg
Register
File
inst[25:21]
inst[20:16]
inst[15:11]inst[31:0]
mux
0
1
mux
0
1sign-extend 3216
DataMemory
Address ReadData
mux
1
0
WriteData
mux
1
0
AddShi>le>2
ALUSrcMemtoReg
MemRead
RegDst
RegWrite MemWrite
PCSrc
Zero
PCSrc = Branch & Zero
IF/ID ID/EX EX/MEM MEM/WB
Instruction Fetch Instruction Decode Execution MemoryAccess
WriteBack
Will this work?
ALUop
Control
inst[31:25],inst[5:0]
Pipelined processor
19
ReadAddress
Instruc(onMemory
PC ALU
WriteData
4
Add
ReadData1
ReadData2
ReadReg1
ReadReg2
WriteReg
Register
File
inst[25:21]
inst[20:16]
inst[15:11]inst[31:0]
mux
0
1
mux
0
1sign-extend 3216
DataMemory
Address ReadData
mux
1
0
WriteData
mux
1
0
AddShi>le>2
ALUSrcMemtoReg
MemRead
RegDst
RegWrite MemWrite
PCSrc
Zero
IF/ID ID/EX EX/MEM MEM/WB
add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)
ALUop
Control
inst[31:25],inst[5:0]
Pipelined processor
20
ReadAddress
Instruc(onMemory
PC ALU
WriteData
4
Add
ReadData1
ReadData2
ReadReg1
ReadReg2
WriteReg
Register
File
inst[25:21]
inst[20:16]
inst[15:11]inst[31:0]
mux
0
1
mux
0
1sign-extend 3216
DataMemory
Address ReadData
mux
1
0
WriteData
mux
1
0
AddShi>le>2
ALUSrcMemtoReg
MemRead
RegDst
RegWrite MemWrite
PCSrc
Zero
IF/ID ID/EX EX/MEM MEM/WB
add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)
ALUop
Control
inst[31:25],inst[5:0]
Pipelined processor
21
ReadAddress
Instruc(onMemory
PC ALU
WriteData
4
Add
ReadData1
ReadData2
ReadReg1
ReadReg2
WriteReg
Register
File
inst[25:21]
inst[20:16]
inst[15:11]inst[31:0]
mux
0
1
mux
0
1sign-extend 3216
DataMemory
Address ReadData
mux
1
0
WriteData
mux
1
0
AddShi>le>2
ALUSrcMemtoReg
MemRead
RegDst
RegWrite MemWrite
PCSrc
Zero
IF/ID ID/EX EX/MEM MEM/WB
add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)
ALUop
ControlWB
ME
EX
WB
ME
Where can I find these?
inst[31:25],inst[5:0]
Pipelined processor
22
ReadAddress
Instruc(onMemory
PC ALU
WriteData
4
Add
ReadData1
ReadData2
ReadReg1
ReadReg2
WriteReg
Register
File
inst[25:21]
inst[20:16]
inst[15:11]
inst[31:0]mux
0
1
mux
0
1sign-extend 3216
DataMemory
Address ReadData
mux
1
0
WriteData
mux
1
0
AddShi>le>2
ALUSrcMemtoReg
MemRead
RegDst
RegWrite MemWrite
PCSrc
Zero
IF/ID ID/EX EX/MEM MEM/WB
add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)
ALUop
ControlWB
ME
EX
WB
ME
WB
inst[31:25],inst[5:0]
RegDst
Pipelined processor
23
ReadAddress
Instruc(onMemory
PC ALU
WriteData
4
Add
ReadData1
ReadData2
ReadReg1
ReadReg2
WriteReg
Register
File
inst[25:21]
inst[20:16]
inst[15:11]
inst[31:0]mux
0
1
mux
0
1sign-extend 3216
DataMemory
Address ReadData
mux
1
0
WriteData
mux
1
0
AddShi>le>2
ALUSrcMemtoReg
MemRead
RegWrite MemWrite
PCSrc
Zero
IF/ID ID/EX EX/MEM MEM/WB
add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)
ALUop
ControlWB
ME
EX
WB
ME
WB
Is this right?RegWrite
inst[31:25],inst[5:0]
Pipelined processor
24
ReadAddress
Instruc(onMemory
PC ALU
WriteData
4Add
ReadData1
ReadData2
ReadReg1
ReadReg2
WriteReg
Register
File
inst[25:21]
inst[20:16]
inst[31:0]
mux
0
1
mux
0
1sign-extend 3216
DataMemory
Address ReadData
mux
1
0
WriteData
mux
1
0
AddShi>le>2
ALUSrcMemtoReg
MemReadRegDst
RegWrite MemWrite
PCSrc
Zero
IF/ID ID/EX EX/MEM MEM/WB
inst
[15:
11]
ALUop
ControlWB
ME
EX
WB
ME
WB
RegWrite
inst[31:25],inst[5:0]
5-stage pipelined processor
25
ReadAddress
Instruc(onMemory
PC ALU
WriteData
4Add
ReadData1
ReadData2
ReadReg1
ReadReg2
WriteReg
Register
File
inst[25:21]
inst[20:16]
inst[31:0]
mux
0
1
mux
0
1sign-extend 3216
DataMemory
Address ReadData
mux
1
0
WriteData
mux
1
0
AddShi>le>2
ALUSrcMemtoReg
MemReadRegDst
RegWrite MemWrite
PCSrc
Zero
IF/ID ID/EX EX/MEM MEM/WB
inst
[15:
11]
ALUop
ControlWB
ME
EX
WB
ME
WB
RegWrite
inst[31:25],inst[5:0]
Simplified pipeline diagram• Use symbols to represent the physical resources
with the abbreviations for pipeline stages.• IF, ID, EXE, MEM, WB
• Horizontal axis represent the timeline, vertical axis for the instruction stream
• Example:
26
add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)
IF EXE WBID MEM
IF EXE WBID MEM
IF EXEID MEM
IF EXEID
IF ID
WB
WBMEM
EXE WBMEM
Pipeline hazards
28
Pipeline hazards• Even though we perfectly divide pipeline stages, it’s
still hard to achieve CPI == 1.• Pipeline hazards:
• Structural hazard• The hardware does not allow two pipeline stages to work concurrently
• Data hazard• A later instruction in a pipeline stage depends on the outcome of an earlier
instruction in the pipeline
• Control hazard• The processor is not clear about what’s the next instruction to fetch
29
Can we get the right result?• Given the current 5-stage pipeline,
how many of the following MIPS code can work correctly?
30
I II III IVadd $1, $2, $3lw $4, 0($1)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)
add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9, $1, $10sw $11, 0($12)
add $1, $2, $3lw $4, 0($5)bne $0, $7, Lsub $9,$10,$11sw $1, 0($12)
add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)
IF EXE WBID MEM
a:b:c:d:e:
b cannot get $1 produced by a before WB
both a and d are accessing $1 at 5th cycle
We don’t know if d & e will be executed or not
Data hazard Structural hazard
Control hazard
Structural hazard
31
Structural hazard• The hardware cannot support the combination of
instructions that we want to execute at the same cycle
• The original pipeline incurs structural hazard when two instructions competing the same register.
• Solution: write early, read late• Writes occur at the clock edge and complete long enough
before the end of the clock cycle.• This leaves enough time for outputs to settle for reads• The revised register file is the default one from now!
33
add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10, $1sw $1, 0($12)
MEM
EXE
IF EXEID MEM
IF EXEID
IF ID
IF ID
IF
WB
MEM
EXE
ID
WB
WBMEM
EXE WBMEM
WB
Structural hazard• The design of hardware causes structural hazard• We need to modify the hardware design to avoid
structural hazard
35
Data hazard
36
Data hazard• When an instruction in the pipeline needs a value
that is not available• Data dependences
• The output of an instruction is the input of a later instruction• May result in data hazard if the later instruction that
consumes the result is still in the pipeline
38
Sol. of data hazard I: Stall• When the source operand of an instruction is not ready,
stall the pipeline• Suspend the instruction and the following instruction• Allow the previous instructions to proceed• This introduces a pipeline bubble: a bubble does nothing,
propagate through the pipeline like a nop instruction
• How to stall the pipeline?• Disable the PC update• Disable the pipeline registers on the earlier pipeline stages• When the stall is over, re-enable the pipeline registers, PC
updates
40
Hazard detection & stall
41
ReadAddress
Instruc(onMemory
PC ALU
WriteData
4Add
ReadData1
ReadData2
ReadReg1
ReadReg2
WriteReg
Register
File
inst[25:21]
inst[20:16]
inst[31:0]
mux
0
1
mux
0
1sign-extend 3216
DataMemory
Address ReadData
mux
1
0
WriteData
mux
1
0
AddShi>le>2
MemtoReg
MemReadRegDst
RegWrite MemWrite
PCSrc
Zero
IF/ID ID/EX EX/MEM MEM/WB
inst
[15:
11]
ALUop
ControlWB
ME
EX
WB
ME
WB
RegWrite
ALUSrc
hazard detection
unit
ID/EX.MemReadPCWrite
IF/IDWrite
mux
0
Check if the destination register of EX == source register of the instruction in ID
Check if the destination register of MEM == source register of the instruction in ID
Insert a “noop” if we need to stall
inst[31:25],inst[5:0]
Performance of stall
42
add $1, $2, $3lw $4, 0($1)sub $5, $2, $4sub $1, $3, $1sw $1, 0($5)
WB
IF
IF EXEID MEM
IF EXEID
IF IF
ID ID
ID
IF
MEM WB
ID ID MEMEXE WB
IFIF ID MEMEXE WB
IF ID ID ID MEMEXE WB
15 cycles! CPI == 3(If there is no stall, CPI should be just 1!)
Insert a “noop” in EXE stageInsert another “noop” in EXE stage, previous noop goes to MEM stage
Sol. of data hazard II: Forwarding• The result is available after EXE and MEM stage,
but publicized in WB!• The data is already there, we should use it right
away!• Also called bypassing
43
add $1, $2, $3lw $4, 0($1)sub $5, $2, $4sub $1, $3, $1sw $1, 0($5)
IF EXEID
IF ID
IF
We can obtain the result here!
Sol. of data hazard II: Forwarding• Take the values, where ever they are!
44
add $1, $2, $3lw $4, 0($1)sub $5, $2, $4sub $1, $3, $1sw $1, 0($5)
IF EXEID
IF ID
IF
WBMEM
EXE
ID
IF
MEM WB
ID MEMEXE WB
IF ID MEMEXE WB
IF ID MEMEXE WB
10 cycles! CPI == 2 (Not optimal, but much better!)
When can/should we forward data?• If the instruction entering the EXE stage consumes a
result from a previous instruction that is entering MEM stage or WB stage• A source of the instruction entering EXE stage is the destination
of an instruction entering MEM/WB stage• The previous instruction must be an instruction that updates
register file
46
Forwarding in hardware
47
ReadAddress
Instruc(onMemory
PC ALU
WriteData
4Add
ReadData1
ReadData2
ReadReg1
ReadReg2
WriteReg
Register
File
inst[25:21]
inst[20:16]
inst[31:0]
mux
0
1
mux
0
1sign-extend 3216
DataMemory
Address ReadData
mux
1
0
WriteData
mux
1
0
AddShi>le>2
ALUSrc
MemtoReg
MemReadRegDst
RegWrite MemWrite
PCSrc
Zero
IF/ID ID/EX EX/MEM MEM/WB
inst
[15:
11]
ALUop
ControlWB
ME
EX
WB
ME
WB
RegWrite
forwardingunit
mux
ForwardA
ForwardB
ForwardA
ForwardBdestination of Ins#1
Rs of Ins#2
Rt of Ins#2
ALU result of Ins#1
Control of Ins#1Control of Ins#2
inst[31:25],inst[5:0]
previous instruction (Ins#1)curernt instruction (Ins#2)
How about load?
Forwarding in hardware
48
ReadAddress
Instruc(onMemory
PC ALU
WriteData
4Add
ReadData1
ReadData2
ReadReg1
ReadReg2
WriteReg
Register
File
inst[25:21]
inst[20:16]
inst[31:0]
mux
0
1
mux
0
1sign-extend 3216
DataMemory
Address ReadData
mux
1
0
WriteData
mux
1
0
AddShi>le>2
ALUSrc
MemtoReg
MemReadRegDst
RegWrite MemWrite
PCSrc
Zero
IF/ID ID/EX EX/MEM MEM/WB
inst
[15:
11]
ALUop
ControlWB
ME
EX
WB
ME
WB
RegWrite
forwardingunit
mux
ForwardA
ForwardB
ForwardA
ForwardB
Rd of Ins#1
ALU/MEM result of
Ins#1Control of Ins#1
inst[31:25],inst[5:0]
There is still a case that we have to stall...
• Revisit the following code:
49
add $1, $2, $3lw $4, 0($1)sub $5, $2, $4sub $1, $3, $1sw $1, 0($5)
IF EXEID
IF ID
IF
WBMEM
EXE
ID
IF
MEM WB
ID MEMEXE WB
IF ID MEMEXE WB
IF ID MEMEXE WB
lw generates result at MEM stage, we have to stall
• If the instruction entering EXE stage depends on a load instruction that does not finish its MEM stage yet, we have to stall!• We call this hazard detection
We need to know the following:1. If an instruction in EX/MEM updates a register (RegWrite)2. If an instruction in EX/MEM reads memory (MemRead)3. If the destination register of EX/MEM is a source of ID/EX (rs, rt of ID/EX == rt of EX/MEM #1)
Hazard detection with forwarding
50
ReadAddress
Instruc(onMemory
PC ALU
WriteData
4Add
ReadData1
ReadData2
ReadReg1
ReadReg2
WriteReg
Register
File
inst[25:21]
inst[20:16]
inst[31:0]
mux
0
1
mux
0
1sign-extend 3216
DataMemory
Address ReadData
mux
1
0
WriteData
mux
1
0
AddShi>le>2
ALUSrc
MemtoReg
MemReadRegDst
RegWrite MemWrite
PCSrc
Zero
IF/ID ID/EX EX/MEM MEM/WB
inst
[15:
11]
ALUop
ControlWB
ME
EX
WB
ME
WB
RegWrite
forwardingunit
mux
ForwardA
ForwardB
ForwardA
ForwardB
hazard detection
unit
ID/EX.MemReadPCWrite
IF/IDWrite
mux
0
inst[31:25],inst[5:0]
Control hazard
51
Control hazard• The processor cannot determine the next PC to
fetch
54
LOOP: lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3 addi $s0, $s0, 4 bne $t1, $t0, LOOP lw $t3, 0($s0)
WB
MEM
EXE
ID
WB
MEM
EXE
IF EXEID
IF
IF
WBMEM
EXE MEM
ID EXE
IF ID
IF
ID
WB
MEM WB
IF ID MEMEXE WBstall
7 cycles per loop
Reducing the overhead of
control hazards
55
Solution I: Delayed branches• An agreement between ISA and hardware
• “Branch delay” slots: the next N instructions after a branch are always executed
• Compiler decides the instructions in branch delay slots• Reordering the instruction cannot affect the correctness of the program
• MIPS has one branch delay slot
• Good• Simple hardware
• Bad • N cannot change• Sometimes cannot find good candidates for the slot
56
Solution I: Delayed branches
57
LOOP: lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3 addi $s0, $s0, 4 bne $t1, $t0, LOOP
branch delay slot
LOOP: lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3 bne $t1, $t0, LOOP addi $s0, $s0, 4 lw $t3, 0($s0)
WB
MEM
EXE
ID
WB
MEM
EXE
IF EXEID
IF
IF
WBMEM
EXE MEM
ID EXE
IF ID
IF
ID
IF
WB
MEM WB
ID MEMEXE WBstall
6 cycles per loop
Solution II: always predict not-taken• Always predict the next PC is PC+4
58
LOOP: lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3 addi $s0, $s0, 4 bne $t1, $t0, LOOP sw $v0, 0($s1) add $t4, $t3, $t5
WB
MEM
EXE
ID
WB
MEM
EXE
IF EXEID
IF
IF
WBMEM
EXE MEM
ID EXE
IF ID
IF
ID
IF
If branch is not taken: no stalls!If branch is taken: doesn’t hurt!
ID
IF
WB
MEM
nop
nop
lw $t3, 0($s0) IF
WB
nop
nop
ID
nop
MEMEXE WB
7 cycles per loopflush the instructions fetched incorrectly
Solution III: always predict taken
61
ReadAddress
Instruc(onMemory
PC ALU
WriteData
4Add
ReadData1
ReadData2
ReadReg1
ReadReg2
WriteReg
Register
File
inst[25:21]
inst[20:16]
inst[31:0]
mux
0
1
mux
0
1sign-extend 3216
DataMemory
Address ReadData
mux
1
0
WriteData
mux
1
0
AddShi>le>2
ALUSrc
MemtoReg
MemReadRegDst
RegWrite MemWrite
PCSrc
Zero
IF/ID ID/EX EX/MEM MEM/WB
inst
[15:
11]
ALUop
ControlWB
ME
EX
WB
ME
WB
RegWrite
forwardingunit
mux
ForwardA
ForwardB
ForwardA
ForwardB
hazard detection
unit
ID/EX.MemReadPCWrite
IF/IDWrite
mux
0
inst[31:25],inst[5:0]
Solution III: always predict taken
62
ReadAddress
Instruc(onMemory
PC ALU
WriteData
4Add
ReadData1
ReadData2
ReadReg1
ReadReg2
WriteReg
Register
File
inst[25:21]
inst[20:16]
inst[31:0]
mux
0
1
mux
0
1sign-extend 3216
DataMemory
Address ReadData
mux
1
0
WriteData
mux
1
0
Add
ALUSrc
MemtoReg
MemReadRegDst
RegWrite MemWrite
PCSrc
Zero
IF/ID ID/EX EX/MEM MEM/WB
inst
[15:
11]
ALUop
ControlWB
ME
EX
WB
ME
WB
RegWrite
forwardingunit
mux
ForwardA
ForwardB
ForwardA
ForwardB
hazard detection
unit
ID/EX.MemReadPCWrite
IF/IDWrite
mux
0
Shi>le>2
Still have to stall 1 cycle
inst[31:25],inst[5:0]
Solution III: always predict taken
63
ReadAddress
Instruc(onMemory
PC ALU
WriteData
4Add
ReadData1
ReadData2
ReadReg1
ReadReg2
WriteReg
Register
File
inst[25:21]
inst[20:16]
inst[31:0]
mux
0
1
mux
0
1sign-extend 3216
DataMemory
Address ReadData
mux
1
0
WriteData
mux
1
0
Add
ALUSrc
MemtoReg
MemReadRegDst
RegWrite MemWrite
PCSrc
Zero
IF/ID ID/EX EX/MEM MEM/WB
inst
[15:
11]
ALUop
ControlWB
ME
EX
WB
ME
WB
RegWrite
forwardingunit
mux
ForwardA
ForwardB
ForwardA
ForwardB
hazard detection
unit
ID/EX.MemReadPCWrite
IF/IDWrite
mux
0
Shi>le>2
Branch Target Buffer
Consult BTB in fetch stage
inst[31:25],inst[5:0]
Branch Target Buffer
64
PC
Branch Target Buffer
branch PCtarget address ortarget instruction
Solution III: always predict taken• Always predict taken with the help of BTB
65
LOOP: lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3 addi $s0, $s0, 4 bne $t1, $t0, LOOP
WB
MEM
EXE
ID
WB
MEM
EXE
IF EXEID
IF
IF
WBMEM
EXE MEM
ID EXE
IF ID
IF
ID
IF ID
IF
WB
MEM
EXE
ID
IF
WB
MEM
EXE
ID
WB
MEM WB
MEMEXE WB
5 cycles per loop(CPI == 1 !!!)
But what if the branch is not always taken?
lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3
Dynamic branch prediction
68
1-bit counter• Predict this branch will go the same way as the
result of the last time this branch executed• 1 for taken, 0 for not takens
69
0x400420 0x8048324 1
0x400464 0x8048392 1
0x400578 0x804850a 0
0x41000C 0x8049624 1
Branch Target Buffer
PC = 0x400420
Taken!
2-bit counter• A 2-bit counter for each branch• Predict taken if the counter value >= 2• If the prediction in taken states, fetch from target PC,
otherwise, use PC+4
Taken3 (11)
Taken2 (10)
NotTaken0 (00)
NotTaken1 (01)
taken
taken
taken
not taken
not taken
not taken
not taken
taken
71Branch Target Buffer
PC= 0x400420
Taken!0x400420 0x8048324 11
0x400464 0x8048392 10
0x400578 0x804850a 00
0x41000C 0x8049624 01
Performance of 2-bit counter• 2-bit state machine for each branch
for(i = 0; i < 10; i++) {! sum += a[i];}
90% accuracy!Taken3 (11)
Taken2 (10)
NotTaken0 (00)
NotTaken1 (01)
taken
taken
taken
not taken
not taken
not taken
not taken
taken • Application: 80% ALU, 20% Branch, and branch resolved in EX stage, average CPI?• 1+20%*(1-90%)*2 = 1.04 72
i state predict actual1 10 T T2 11 T T3 11 T T
4-9 11 T T10 11 T NT +
Make the prediction better• Consider the following code:
i = 0;do { if( i % 3 != 0) // Branch Y, taken if i % 3 == 0 a[i] *= 2; a[i] += i;} while ( ++i < 100) // Branch X
74
i branch result0 Y T0 X T1 Y NT1 X T2 Y NT2 X T3 Y T3 X T4 Y NT4 X T5 Y NT5 X T6 Y T6 X T7 Y NT
Can we capture the pattern?
Predict using history• Instead of using the PC to choose the predictor, use
a bit vector (global history register, GHR) made up of the previous branch outcomes.
• Each entry in the history table has its own counter.
75
0111101100111110
history table
n-bit GHR
2n entries
= 101 (T, NT, T)
Taken!
index
Performance of global history predictor
• Consider the following code:i = 0;do { if( i % 3 != 0) // Branch Y, taken if i % 3 == 0 a[i] *= 2; a[i] += i;// Branch Y} while ( ++i < 100) // Branch X
76
i ? GHR BHT prediction actual New BHT
0 Y 0000 10 T T 110 X 0001 10 T T 111 Y 0011 10 T NT 011 X 0110 10 T T 112 Y 1101 10 T NT 012 X 1010 10 T T 113 Y 0101 10 T T 113 X 1011 10 T T 114 Y 0111 10 T NT 014 X 1110 10 T T 115 Y 1101 01 NT NT 005 X 1010 11 T T 116 Y 0101 11 T T 116 X 1011 11 T T 117 Y 0111 01 NT NT 007 X 1110 11 T T 118 Y 1101 00 NT NT 008 X 1010 11 T T 119 Y 0101 11 T T 119 X 1011 11 T T 11
10 Y 0111 00 NT NT 00
Assume that we start with a 4-bit GHR= 0, all counters are 10.
Nearly perfect after this
Branch prediction and modern processors
79
Deeper pipeline
80
• Higher frequencies by shortening the pipeline stages• Higher marketing values since consumers usually link
performance with frequencies• Potentially higher power consumption as
dynamic/active power = aCV2f• If the execution time is better, still consume less energy
Case Study
81
Intel Pentium 4 Microarch.
82
Intel Pentium 4• Very deep pipeline: in order to achieve high frequency!
(start from 1.5GHz)• 20 stages in Netburst
• 31 stages in Prescott
• 103W (3.6GHz, 65nm) • Reference
• The Microarchitecture of the Pentium 4 Processor
1 2 3 4 5Drive
6Alloc
7 8 9Que
10Sch
11Sch
12Sch
13Disp
14Disp
15RF
16RF
17Ex
18Flgs
19Br Ck
20DriveTC Nxt IP TC Fetch Rename
83
AMD Athlon 64
84
AMD Athlon 64• 12 stage pipeline
• 89W TDP (Opteron 2.2GHz 90nm)
1Inst. AddrDecode
2Inst Mem
Read
3Inst. Byte
Pick
4ID1
5ID2
6Inst. Dbl. & Pack
7ID and Pack
8Dispatch
9Scheduling
10Execution
11D-CacheAddress
12D-cacheAccess
85
Demo revisited• Why the sorting the array speed up the code despite
the increased instruction count?
88
if(option) std::sort(data, data + arraySize);
for (unsigned i = 0; i < 100000; ++i) { int threshold = std::rand(); for (unsigned i = 0; i < arraySize; ++i) { if (data[i] >= threshold) sum ++; } }
Deep pipelining and data hazards
89
Data hazard revisited• How many cycles it takes to execute the following
code? • Draw the pipeline execution diagram
• assume that we have full data forwarding.lw $t1, 0($a0)lw $a0, 0($t1)bne $a0, $zero, 0
IF ID
IF EXE
ID
WB
MEM
ID
EXE
ID
IF
MEM
IF
ID WB
EX
90
MEM WB
9 cycles
INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES
2-2
2.1 THE SKYLAKE MICROARCHITECTURE The Skylake microarchitecture builds on the successes of the Haswell and Broadwell microarchitectures. The basic pipeline functionality of the Skylake microarchitecture is depicted in Figure 2-1.
The Skylake microarchitecture offers the following enhancements:• Larger internal buffers to enable deeper OOO execution and higher cache bandwidth.• Improved front end throughput.• Improved branch predictor.• Improved divider throughput and latency.• Lower power consumption.• Improved SMT performance with Hyper-Threading Technology.• Balanced floating-point ADD, MUL, FMA throughput and latency.
The microarchitecture supports flexible integration of multiple processor cores with a shared uncore sub-system consisting of a number of components including a ring interconnect to multiple slices of L3 (an off-die L4 is optional), processor graphics, integrated memory controller, interconnect fabrics, etc. A four-core configuration can be supported similar to the arrangement shown in Figure 2-3.
Figure 2-1. CPU Core Pipeline Functionality of the Skylake Microarchitecture
32K L1 Instruction Cache
MSROM Decoded Icache (DSB)
Legacy DecodePipeline
Instruction Decode Queue (IDQ,, or micro-op queue)
Allocate/Rename/Retire/MoveElimination/ZeroIdiom
32K L1 Data Cache
256K L2 Cache (Unified)
Int ALU, Vec FMA,Vec MUL,Vec Add,Vec ALU,Vec Shft,Divide,
Branch2
Port 2LD/STA
Scheduler
BPU
Port 0
Int ALU, Fast LEA,Vec FMA,Vec MUL,Vec Add,Vec ALU,Vec Shft,Int MUL,Slow LEA
Int ALU, Fast LEA,Vec SHUF,Vec ALU,
CVT
Int ALU, Int Shft,Branch1,
Port 3LD/STA
Port 4STD
Port 7STA
Port 1 Port 5 Port 6
5 uops/cycle4 uops/cycle6 uops/cycle
Intel’s latest SkyLake
92
Good reference for intel microarchitectures: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf