66
Processor Design Pipelined Processor Hung-Wei Tseng

Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

  • Upload
    others

  • View
    17

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Processor Design � Pipelined Processor

Hung-Wei Tseng

Page 2: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Pipelining

7

Page 3: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Pipelining• Break up the logic with “pipeline registers” into

pipeline stages• Each stage can act on different instruction/data• States/Control signals of instructions are hold in

pipeline registers

8

latc

hpi

pelin

e re

g.

latc

hpi

pelin

e re

g.

pipe

line

reg.

pipe

line

reg.

pipe

line

reg.

pipe

line

reg.

Page 4: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Pipelining

9

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #1pi

pelin

e re

g

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #2

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #3

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #4

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #5

After the 5th cycle, the processor can do 5 instructions in parallel

Page 5: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Pipelining

10

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #6

cycle #7

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #8

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #9

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #10

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

The processor can complete 1 instruction each cycleCPI == 1 if everything works perfectly!

Page 6: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Single-cycle v.s. pipeline

11

v.s.

Page 7: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Cycle time of a pipeline processor• Critical path is the longest possible delay between two

registers in a design.• The critical path sets the cycle time, since the cycle

time must be long enough for a signal to traverse the critical path.

• Lengthening or shortening non-critical paths does not change performance

• Ideally, all paths are about the same length

13

Page 8: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Designing a 5-stage pipeline processor for

MIPS

15

Page 9: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Basic steps of execution• Instruction fetch: where?

• Decode:• What’s the instruction?• Where are the operands?

• Execute• Memory access

• Where is my data?

• Write back• Where to put the result

• Determine the next PC16

Processor

PC

120007a30: 0f00bb27 ldah gp,15(t12) 120007a34: 509cbd23 lda gp,-25520(gp)120007a38: 00005d24 ldah t1,0(gp)120007a3c: 0000bd24 ldah t4,0(gp)120007a40: 2ca422a0 ldl t0,-23508(t1)120007a44: 130020e4 beq t0,120007a94120007a48: 00003d24 ldah t0,0(gp)120007a4c: 2ca4e2b3 stl zero,-23508(t1)in

stru

ctio

n m

emor

yda

ta m

emor

y800bf9000: 00c2e800 12773376800bf9004: 00000008 8800bf9008: 00c2f000 12775424800bf900c: 00000008 8800bf9010: 00c2f800 12777472800bf9014: 00000008 8800bf9018: 00c30000 12779520800bf901c: 00000008 8

R0

R1

R2

R31

......

..

registersALU

instruction memory

registersALUs

data memory

registers

Page 10: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Pipeline a MIPS processor• Instruction Fetch

• Read from instruction memory

• Decode• Figure out the incoming instruction?• Fetch the operands from the registers

• Execution• Perform ALU functions

• Memory access• Read/write data memory

• Write back results to registers• Write to the register file

17

Execution (EXE)

Instruction Fetch (IF)

Instruction Decode (ID)

Memory Access (MEM)

Write Back (WB)

Page 11: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

From single-cycle to pipeline

18

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[15:11]inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemRead

RegDst

RegWrite MemWrite

PCSrc

Zero

PCSrc = Branch & Zero

IF/ID ID/EX EX/MEM MEM/WB

Instruction Fetch Instruction Decode Execution MemoryAccess

WriteBack

Will this work?

ALUop

Control

inst[31:25],inst[5:0]

Page 12: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Pipelined processor

19

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4

Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[15:11]inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemRead

RegDst

RegWrite MemWrite

PCSrc

Zero

IF/ID ID/EX EX/MEM MEM/WB

add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

ALUop

Control

inst[31:25],inst[5:0]

Page 13: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Pipelined processor

20

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4

Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[15:11]inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemRead

RegDst

RegWrite MemWrite

PCSrc

Zero

IF/ID ID/EX EX/MEM MEM/WB

add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

ALUop

Control

inst[31:25],inst[5:0]

Page 14: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Pipelined processor

21

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4

Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[15:11]inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemRead

RegDst

RegWrite MemWrite

PCSrc

Zero

IF/ID ID/EX EX/MEM MEM/WB

add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

ALUop

ControlWB

ME

EX

WB

ME

Where can I find these?

inst[31:25],inst[5:0]

Page 15: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Pipelined processor

22

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4

Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[15:11]

inst[31:0]mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemRead

RegDst

RegWrite MemWrite

PCSrc

Zero

IF/ID ID/EX EX/MEM MEM/WB

add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

ALUop

ControlWB

ME

EX

WB

ME

WB

inst[31:25],inst[5:0]

Page 16: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

RegDst

Pipelined processor

23

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4

Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[15:11]

inst[31:0]mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemRead

RegWrite MemWrite

PCSrc

Zero

IF/ID ID/EX EX/MEM MEM/WB

add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

ALUop

ControlWB

ME

EX

WB

ME

WB

Is this right?RegWrite

inst[31:25],inst[5:0]

Page 17: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Pipelined processor

24

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero

IF/ID ID/EX EX/MEM MEM/WB

inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

inst[31:25],inst[5:0]

Page 18: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

5-stage pipelined processor

25

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero

IF/ID ID/EX EX/MEM MEM/WB

inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

inst[31:25],inst[5:0]

Page 19: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Simplified pipeline diagram• Use symbols to represent the physical resources

with the abbreviations for pipeline stages.• IF, ID, EXE, MEM, WB

• Horizontal axis represent the timeline, vertical axis for the instruction stream

• Example:

26

add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

IF EXE WBID MEM

IF EXE WBID MEM

IF EXEID MEM

IF EXEID

IF ID

WB

WBMEM

EXE WBMEM

Page 20: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Pipeline hazards

28

Page 21: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Pipeline hazards• Even though we perfectly divide pipeline stages, it’s

still hard to achieve CPI == 1.• Pipeline hazards:

• Structural hazard• The hardware does not allow two pipeline stages to work concurrently

• Data hazard• A later instruction in a pipeline stage depends on the outcome of an earlier

instruction in the pipeline

• Control hazard• The processor is not clear about what’s the next instruction to fetch

29

Page 22: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Can we get the right result?• Given the current 5-stage pipeline,

how many of the following MIPS code can work correctly?

30

I II III IVadd $1, $2, $3lw $4, 0($1)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9, $1, $10sw $11, 0($12)

add $1, $2, $3lw $4, 0($5)bne $0, $7, Lsub $9,$10,$11sw $1, 0($12)

add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

IF EXE WBID MEM

a:b:c:d:e:

b cannot get $1 produced by a before WB

both a and d are accessing $1 at 5th cycle

We don’t know if d & e will be executed or not

Data hazard Structural hazard

Control hazard

Page 23: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Structural hazard

31

Page 24: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Structural hazard• The hardware cannot support the combination of

instructions that we want to execute at the same cycle

• The original pipeline incurs structural hazard when two instructions competing the same register.

• Solution: write early, read late• Writes occur at the clock edge and complete long enough

before the end of the clock cycle.• This leaves enough time for outputs to settle for reads• The revised register file is the default one from now!

33

add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10, $1sw $1, 0($12)

MEM

EXE

IF EXEID MEM

IF EXEID

IF ID

IF ID

IF

WB

MEM

EXE

ID

WB

WBMEM

EXE WBMEM

WB

Page 25: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Structural hazard• The design of hardware causes structural hazard• We need to modify the hardware design to avoid

structural hazard

35

Page 26: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Data hazard

36

Page 27: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Data hazard• When an instruction in the pipeline needs a value

that is not available• Data dependences

• The output of an instruction is the input of a later instruction• May result in data hazard if the later instruction that

consumes the result is still in the pipeline

38

Page 28: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Sol. of data hazard I: Stall• When the source operand of an instruction is not ready,

stall the pipeline• Suspend the instruction and the following instruction• Allow the previous instructions to proceed• This introduces a pipeline bubble: a bubble does nothing,

propagate through the pipeline like a nop instruction

• How to stall the pipeline?• Disable the PC update• Disable the pipeline registers on the earlier pipeline stages• When the stall is over, re-enable the pipeline registers, PC

updates

40

Page 29: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Hazard detection & stall

41

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

MemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero

IF/ID ID/EX EX/MEM MEM/WB

inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

ALUSrc

hazard detection

unit

ID/EX.MemReadPCWrite

IF/IDWrite

mux

0

Check if the destination register of EX == source register of the instruction in ID

Check if the destination register of MEM == source register of the instruction in ID

Insert a “noop” if we need to stall

inst[31:25],inst[5:0]

Page 30: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Performance of stall

42

add $1, $2, $3lw $4, 0($1)sub $5, $2, $4sub $1, $3, $1sw $1, 0($5)

WB

IF

IF EXEID MEM

IF EXEID

IF IF

ID ID

ID

IF

MEM WB

ID ID MEMEXE WB

IFIF ID MEMEXE WB

IF ID ID ID MEMEXE WB

15 cycles! CPI == 3(If there is no stall, CPI should be just 1!)

Insert a “noop” in EXE stageInsert another “noop” in EXE stage, previous noop goes to MEM stage

Page 31: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Sol. of data hazard II: Forwarding• The result is available after EXE and MEM stage,

but publicized in WB!• The data is already there, we should use it right

away!• Also called bypassing

43

add $1, $2, $3lw $4, 0($1)sub $5, $2, $4sub $1, $3, $1sw $1, 0($5)

IF EXEID

IF ID

IF

We can obtain the result here!

Page 32: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Sol. of data hazard II: Forwarding• Take the values, where ever they are!

44

add $1, $2, $3lw $4, 0($1)sub $5, $2, $4sub $1, $3, $1sw $1, 0($5)

IF EXEID

IF ID

IF

WBMEM

EXE

ID

IF

MEM WB

ID MEMEXE WB

IF ID MEMEXE WB

IF ID MEMEXE WB

10 cycles! CPI == 2 (Not optimal, but much better!)

Page 33: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

When can/should we forward data?• If the instruction entering the EXE stage consumes a

result from a previous instruction that is entering MEM stage or WB stage• A source of the instruction entering EXE stage is the destination

of an instruction entering MEM/WB stage• The previous instruction must be an instruction that updates

register file

46

Page 34: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Forwarding in hardware

47

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrc

MemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero

IF/ID ID/EX EX/MEM MEM/WB

inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

forwardingunit

mux

ForwardA

ForwardB

ForwardA

ForwardBdestination of Ins#1

Rs of Ins#2

Rt of Ins#2

ALU result of Ins#1

Control of Ins#1Control of Ins#2

inst[31:25],inst[5:0]

previous instruction (Ins#1)curernt instruction (Ins#2)

How about load?

Page 35: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Forwarding in hardware

48

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrc

MemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero

IF/ID ID/EX EX/MEM MEM/WB

inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

forwardingunit

mux

ForwardA

ForwardB

ForwardA

ForwardB

Rd of Ins#1

ALU/MEM result of

Ins#1Control of Ins#1

inst[31:25],inst[5:0]

Page 36: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

There is still a case that we have to stall...

• Revisit the following code:

49

add $1, $2, $3lw $4, 0($1)sub $5, $2, $4sub $1, $3, $1sw $1, 0($5)

IF EXEID

IF ID

IF

WBMEM

EXE

ID

IF

MEM WB

ID MEMEXE WB

IF ID MEMEXE WB

IF ID MEMEXE WB

lw generates result at MEM stage, we have to stall

• If the instruction entering EXE stage depends on a load instruction that does not finish its MEM stage yet, we have to stall!• We call this hazard detection

We need to know the following:1. If an instruction in EX/MEM updates a register (RegWrite)2. If an instruction in EX/MEM reads memory (MemRead)3. If the destination register of EX/MEM is a source of ID/EX (rs, rt of ID/EX == rt of EX/MEM #1)

Page 37: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Hazard detection with forwarding

50

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrc

MemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero

IF/ID ID/EX EX/MEM MEM/WB

inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

forwardingunit

mux

ForwardA

ForwardB

ForwardA

ForwardB

hazard detection

unit

ID/EX.MemReadPCWrite

IF/IDWrite

mux

0

inst[31:25],inst[5:0]

Page 38: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Control hazard

51

Page 39: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Control hazard• The processor cannot determine the next PC to

fetch

54

LOOP: lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3 addi $s0, $s0, 4 bne $t1, $t0, LOOP lw $t3, 0($s0)

WB

MEM

EXE

ID

WB

MEM

EXE

IF EXEID

IF

IF

WBMEM

EXE MEM

ID EXE

IF ID

IF

ID

WB

MEM WB

IF ID MEMEXE WBstall

7 cycles per loop

Page 40: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Reducing the overhead of

control hazards

55

Page 41: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Solution I: Delayed branches• An agreement between ISA and hardware

• “Branch delay” slots: the next N instructions after a branch are always executed

• Compiler decides the instructions in branch delay slots• Reordering the instruction cannot affect the correctness of the program

• MIPS has one branch delay slot

• Good• Simple hardware

• Bad • N cannot change• Sometimes cannot find good candidates for the slot

56

Page 42: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Solution I: Delayed branches

57

LOOP: lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3 addi $s0, $s0, 4 bne $t1, $t0, LOOP

branch delay slot

LOOP: lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3 bne $t1, $t0, LOOP addi $s0, $s0, 4 lw $t3, 0($s0)

WB

MEM

EXE

ID

WB

MEM

EXE

IF EXEID

IF

IF

WBMEM

EXE MEM

ID EXE

IF ID

IF

ID

IF

WB

MEM WB

ID MEMEXE WBstall

6 cycles per loop

Page 43: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Solution II: always predict not-taken• Always predict the next PC is PC+4

58

LOOP: lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3 addi $s0, $s0, 4 bne $t1, $t0, LOOP sw $v0, 0($s1) add $t4, $t3, $t5

WB

MEM

EXE

ID

WB

MEM

EXE

IF EXEID

IF

IF

WBMEM

EXE MEM

ID EXE

IF ID

IF

ID

IF

If branch is not taken: no stalls!If branch is taken: doesn’t hurt!

ID

IF

WB

MEM

nop

nop

lw $t3, 0($s0) IF

WB

nop

nop

ID

nop

MEMEXE WB

7 cycles per loopflush the instructions fetched incorrectly

Page 44: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Solution III: always predict taken

61

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrc

MemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero

IF/ID ID/EX EX/MEM MEM/WB

inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

forwardingunit

mux

ForwardA

ForwardB

ForwardA

ForwardB

hazard detection

unit

ID/EX.MemReadPCWrite

IF/IDWrite

mux

0

inst[31:25],inst[5:0]

Page 45: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Solution III: always predict taken

62

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

Add

ALUSrc

MemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero

IF/ID ID/EX EX/MEM MEM/WB

inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

forwardingunit

mux

ForwardA

ForwardB

ForwardA

ForwardB

hazard detection

unit

ID/EX.MemReadPCWrite

IF/IDWrite

mux

0

Shi>le>2

Still have to stall 1 cycle

inst[31:25],inst[5:0]

Page 46: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Solution III: always predict taken

63

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

Add

ALUSrc

MemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero

IF/ID ID/EX EX/MEM MEM/WB

inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

forwardingunit

mux

ForwardA

ForwardB

ForwardA

ForwardB

hazard detection

unit

ID/EX.MemReadPCWrite

IF/IDWrite

mux

0

Shi>le>2

Branch Target Buffer

Consult BTB in fetch stage

inst[31:25],inst[5:0]

Page 47: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Branch Target Buffer

64

PC

Branch Target Buffer

branch PCtarget address ortarget instruction

Page 48: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Solution III: always predict taken• Always predict taken with the help of BTB

65

LOOP: lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3 addi $s0, $s0, 4 bne $t1, $t0, LOOP

WB

MEM

EXE

ID

WB

MEM

EXE

IF EXEID

IF

IF

WBMEM

EXE MEM

ID EXE

IF ID

IF

ID

IF ID

IF

WB

MEM

EXE

ID

IF

WB

MEM

EXE

ID

WB

MEM WB

MEMEXE WB

5 cycles per loop(CPI == 1 !!!)

But what if the branch is not always taken?

lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3

Page 49: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Dynamic branch prediction

68

Page 50: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

1-bit counter• Predict this branch will go the same way as the

result of the last time this branch executed• 1 for taken, 0 for not takens

69

0x400420 0x8048324 1

0x400464 0x8048392 1

0x400578 0x804850a 0

0x41000C 0x8049624 1

Branch Target Buffer

PC = 0x400420

Taken!

Page 51: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

2-bit counter• A 2-bit counter for each branch• Predict taken if the counter value >= 2• If the prediction in taken states, fetch from target PC,

otherwise, use PC+4

Taken3 (11)

Taken2 (10)

NotTaken0 (00)

NotTaken1 (01)

taken

taken

taken

not taken

not taken

not taken

not taken

taken

71Branch Target Buffer

PC= 0x400420

Taken!0x400420 0x8048324 11

0x400464 0x8048392 10

0x400578 0x804850a 00

0x41000C 0x8049624 01

Page 52: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Performance of 2-bit counter• 2-bit state machine for each branch

for(i = 0; i < 10; i++) {! sum += a[i];}

90% accuracy!Taken3 (11)

Taken2 (10)

NotTaken0 (00)

NotTaken1 (01)

taken

taken

taken

not taken

not taken

not taken

not taken

taken • Application: 80% ALU, 20% Branch, and branch resolved in EX stage, average CPI?• 1+20%*(1-90%)*2 = 1.04 72

i state predict actual1 10 T T2 11 T T3 11 T T

4-9 11 T T10 11 T NT +

Page 53: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Make the prediction better• Consider the following code:

i = 0;do { if( i % 3 != 0) // Branch Y, taken if i % 3 == 0 a[i] *= 2; a[i] += i;} while ( ++i < 100) // Branch X

74

i branch result0 Y T0 X T1 Y NT1 X T2 Y NT2 X T3 Y T3 X T4 Y NT4 X T5 Y NT5 X T6 Y T6 X T7 Y NT

Can we capture the pattern?

Page 54: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Predict using history• Instead of using the PC to choose the predictor, use

a bit vector (global history register, GHR) made up of the previous branch outcomes.

• Each entry in the history table has its own counter.

75

0111101100111110

history table

n-bit GHR

2n entries

= 101 (T, NT, T)

Taken!

index

Page 55: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Performance of global history predictor

• Consider the following code:i = 0;do { if( i % 3 != 0) // Branch Y, taken if i % 3 == 0 a[i] *= 2; a[i] += i;// Branch Y} while ( ++i < 100) // Branch X

76

i ? GHR BHT prediction actual New BHT

0 Y 0000 10 T T 110 X 0001 10 T T 111 Y 0011 10 T NT 011 X 0110 10 T T 112 Y 1101 10 T NT 012 X 1010 10 T T 113 Y 0101 10 T T 113 X 1011 10 T T 114 Y 0111 10 T NT 014 X 1110 10 T T 115 Y 1101 01 NT NT 005 X 1010 11 T T 116 Y 0101 11 T T 116 X 1011 11 T T 117 Y 0111 01 NT NT 007 X 1110 11 T T 118 Y 1101 00 NT NT 008 X 1010 11 T T 119 Y 0101 11 T T 119 X 1011 11 T T 11

10 Y 0111 00 NT NT 00

Assume that we start with a 4-bit GHR= 0, all counters are 10.

Nearly perfect after this

Page 56: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Branch prediction and modern processors

79

Page 57: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Deeper pipeline

80

• Higher frequencies by shortening the pipeline stages• Higher marketing values since consumers usually link

performance with frequencies• Potentially higher power consumption as

dynamic/active power = aCV2f• If the execution time is better, still consume less energy

Page 58: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Case Study

81

Page 59: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Intel Pentium 4 Microarch.

82

Page 60: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Intel Pentium 4• Very deep pipeline: in order to achieve high frequency!

(start from 1.5GHz)• 20 stages in Netburst

• 31 stages in Prescott

• 103W (3.6GHz, 65nm) • Reference

• The Microarchitecture of the Pentium 4 Processor

1 2 3 4 5Drive

6Alloc

7 8 9Que

10Sch

11Sch

12Sch

13Disp

14Disp

15RF

16RF

17Ex

18Flgs

19Br Ck

20DriveTC Nxt IP TC Fetch Rename

83

Page 61: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

AMD Athlon 64

84

Page 62: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

AMD Athlon 64• 12 stage pipeline

• 89W TDP (Opteron 2.2GHz 90nm)

1Inst. AddrDecode

2Inst Mem

Read

3Inst. Byte

Pick

4ID1

5ID2

6Inst. Dbl. & Pack

7ID and Pack

8Dispatch

9Scheduling

10Execution

11D-CacheAddress

12D-cacheAccess

85

Page 63: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Demo revisited• Why the sorting the array speed up the code despite

the increased instruction count?

88

if(option) std::sort(data, data + arraySize);

for (unsigned i = 0; i < 100000; ++i) { int threshold = std::rand(); for (unsigned i = 0; i < arraySize; ++i) { if (data[i] >= threshold) sum ++; } }

Page 64: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Deep pipelining and data hazards

89

Page 65: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Data hazard revisited• How many cycles it takes to execute the following

code? • Draw the pipeline execution diagram

• assume that we have full data forwarding.lw $t1, 0($a0)lw $a0, 0($t1)bne $a0, $zero, 0

IF ID

IF EXE

ID

WB

MEM

ID

EXE

ID

IF

MEM

IF

ID WB

EX

90

MEM WB

9 cycles

Page 66: Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES

2-2

2.1 THE SKYLAKE MICROARCHITECTURE The Skylake microarchitecture builds on the successes of the Haswell and Broadwell microarchitectures. The basic pipeline functionality of the Skylake microarchitecture is depicted in Figure 2-1.

The Skylake microarchitecture offers the following enhancements:• Larger internal buffers to enable deeper OOO execution and higher cache bandwidth.• Improved front end throughput.• Improved branch predictor.• Improved divider throughput and latency.• Lower power consumption.• Improved SMT performance with Hyper-Threading Technology.• Balanced floating-point ADD, MUL, FMA throughput and latency.

The microarchitecture supports flexible integration of multiple processor cores with a shared uncore sub-system consisting of a number of components including a ring interconnect to multiple slices of L3 (an off-die L4 is optional), processor graphics, integrated memory controller, interconnect fabrics, etc. A four-core configuration can be supported similar to the arrangement shown in Figure 2-3.

Figure 2-1. CPU Core Pipeline Functionality of the Skylake Microarchitecture

32K L1 Instruction Cache

MSROM Decoded Icache (DSB)

Legacy DecodePipeline

Instruction Decode Queue (IDQ,, or micro-op queue)

Allocate/Rename/Retire/MoveElimination/ZeroIdiom

32K L1 Data Cache

256K L2 Cache (Unified)

Int ALU, Vec FMA,Vec MUL,Vec Add,Vec ALU,Vec Shft,Divide,

Branch2

Port 2LD/STA

Scheduler

BPU

Port 0

Int ALU, Fast LEA,Vec FMA,Vec MUL,Vec Add,Vec ALU,Vec Shft,Int MUL,Slow LEA

Int ALU, Fast LEA,Vec SHUF,Vec ALU,

CVT

Int ALU, Int Shft,Branch1,

Port 3LD/STA

Port 4STD

Port 7STA

Port 1 Port 5 Port 6

5 uops/cycle4 uops/cycle6 uops/cycle

Intel’s latest SkyLake

92

Good reference for intel microarchitectures: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf