Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets

Processor Design � Pipelined Processor

Hung-Wei Tseng

Pipelining

7

Pipelining• Break up the logic with “pipeline registers” into

pipeline stages• Each stage can act on different instruction/data• States/Control signals of instructions are hold in

pipeline registers

8

latc

hpi

pelin

e re

g.

latc

hpi

pelin

e re

g.

pipe

line

reg.

pipe

line

reg.

pipe

line

reg.

pipe

line

reg.

Pipelining

9

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #1pi

pelin

e re

g

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #2

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #3

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #4

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #5

After the 5th cycle, the processor can do 5 instructions in parallel

Pipelining

10

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #6

cycle #7

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #8

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #9

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

cycle #10

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

pipe

line

reg

The processor can complete 1 instruction each cycleCPI == 1 if everything works perfectly!

Single-cycle v.s. pipeline

11

v.s.

Cycle time of a pipeline processor• Critical path is the longest possible delay between two

registers in a design.• The critical path sets the cycle time, since the cycle

time must be long enough for a signal to traverse the critical path.

• Lengthening or shortening non-critical paths does not change performance

• Ideally, all paths are about the same length

13

Designing a 5-stage pipeline processor for

MIPS

15

Basic steps of execution• Instruction fetch: where?

• Decode:• What’s the instruction?• Where are the operands?

• Execute• Memory access

• Where is my data?

• Write back• Where to put the result

• Determine the next PC16

Processor

PC

120007a30: 0f00bb27 ldah gp,15(t12) 120007a34: 509cbd23 lda gp,-25520(gp)120007a38: 00005d24 ldah t1,0(gp)120007a3c: 0000bd24 ldah t4,0(gp)120007a40: 2ca422a0 ldl t0,-23508(t1)120007a44: 130020e4 beq t0,120007a94120007a48: 00003d24 ldah t0,0(gp)120007a4c: 2ca4e2b3 stl zero,-23508(t1)in

stru

ctio

n m

emor

yda

ta m

emor

y800bf9000: 00c2e800 12773376800bf9004: 00000008 8800bf9008: 00c2f000 12775424800bf900c: 00000008 8800bf9010: 00c2f800 12777472800bf9014: 00000008 8800bf9018: 00c30000 12779520800bf901c: 00000008 8

R0

R1

R2

R31

......

..

registersALU

instruction memory

registersALUs

data memory

registers

Pipeline a MIPS processor• Instruction Fetch

• Read from instruction memory

• Decode• Figure out the incoming instruction?• Fetch the operands from the registers

• Execution• Perform ALU functions

• Memory access• Read/write data memory

• Write back results to registers• Write to the register file

17

Execution (EXE)

Instruction Fetch (IF)

Instruction Decode (ID)

Memory Access (MEM)

Write Back (WB)

From single-cycle to pipeline

18

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[15:11]inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemRead

RegDst

RegWrite MemWrite

PCSrc

Zero

PCSrc = Branch & Zero

IF/ID ID/EX EX/MEM MEM/WB

Instruction Fetch Instruction Decode Execution MemoryAccess

WriteBack

Will this work?

ALUop

Control

inst[31:25],inst[5:0]

Pipelined processor

19

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4

Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[15:11]inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemRead

RegDst

RegWrite MemWrite

PCSrc

Zero


add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

ALUop

Control

inst[31:25],inst[5:0]

Pipelined processor

20

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4

Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[15:11]inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemRead

RegDst

RegWrite MemWrite

PCSrc

Zero


add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

ALUop

Control

inst[31:25],inst[5:0]

Pipelined processor

21

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4

Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[15:11]inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemRead

RegDst

RegWrite MemWrite

PCSrc

Zero


add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

ALUop

ControlWB

ME

EX

WB

ME

Where can I find these?

inst[31:25],inst[5:0]

Pipelined processor

22

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4

Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[15:11]

inst[31:0]mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemRead

RegDst

RegWrite MemWrite

PCSrc

Zero


add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

ALUop

ControlWB

ME

EX

WB

ME

WB

inst[31:25],inst[5:0]

RegDst

Pipelined processor

23

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4

Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[15:11]

inst[31:0]mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemRead

RegWrite MemWrite

PCSrc

Zero


add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

ALUop

ControlWB

ME

EX

WB

ME

WB

Is this right?RegWrite

inst[31:25],inst[5:0]

Pipelined processor

24

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero


inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

inst[31:25],inst[5:0]

5-stage pipelined processor

25

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrcMemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero


inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

inst[31:25],inst[5:0]

Simplified pipeline diagram• Use symbols to represent the physical resources

with the abbreviations for pipeline stages.• IF, ID, EXE, MEM, WB

• Horizontal axis represent the timeline, vertical axis for the instruction stream

• Example:

26

add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

IF EXE WBID MEM

IF EXE WBID MEM

IF EXEID MEM

IF EXEID

IF ID

WB

WBMEM

EXE WBMEM

Pipeline hazards

28

Pipeline hazards• Even though we perfectly divide pipeline stages, it’s

still hard to achieve CPI == 1.• Pipeline hazards:

• Structural hazard• The hardware does not allow two pipeline stages to work concurrently

• Data hazard• A later instruction in a pipeline stage depends on the outcome of an earlier

instruction in the pipeline

• Control hazard• The processor is not clear about what’s the next instruction to fetch

29

Can we get the right result?• Given the current 5-stage pipeline,

how many of the following MIPS code can work correctly?

30

I II III IVadd $1, $2, $3lw $4, 0($1)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9, $1, $10sw $11, 0($12)

add $1, $2, $3lw $4, 0($5)bne $0, $7, Lsub $9,$10,$11sw $1, 0($12)

add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10,$11sw $1, 0($12)

IF EXE WBID MEM

a:b:c:d:e:

b cannot get $1 produced by a before WB

both a and d are accessing $1 at 5th cycle

We don’t know if d & e will be executed or not

Data hazard Structural hazard

Control hazard

Structural hazard

31

Structural hazard• The hardware cannot support the combination of

instructions that we want to execute at the same cycle

• The original pipeline incurs structural hazard when two instructions competing the same register.

• Solution: write early, read late• Writes occur at the clock edge and complete long enough

before the end of the clock cycle.• This leaves enough time for outputs to settle for reads• The revised register file is the default one from now!

33

add $1, $2, $3lw $4, 0($5)sub $6, $7, $8sub $9,$10, $1sw $1, 0($12)

MEM

EXE

IF EXEID MEM

IF EXEID

IF ID

IF ID

IF

WB

MEM

EXE

ID

WB

WBMEM

EXE WBMEM

WB

Structural hazard• The design of hardware causes structural hazard• We need to modify the hardware design to avoid

structural hazard

35

Data hazard

36

Data hazard• When an instruction in the pipeline needs a value

that is not available• Data dependences

• The output of an instruction is the input of a later instruction• May result in data hazard if the later instruction that

consumes the result is still in the pipeline

38

Sol. of data hazard I: Stall• When the source operand of an instruction is not ready,

stall the pipeline• Suspend the instruction and the following instruction• Allow the previous instructions to proceed• This introduces a pipeline bubble: a bubble does nothing,

propagate through the pipeline like a nop instruction

• How to stall the pipeline?• Disable the PC update• Disable the pipeline registers on the earlier pipeline stages• When the stall is over, re-enable the pipeline registers, PC

updates

40

Hazard detection & stall

41

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

MemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero


inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

ALUSrc

hazard detection

unit

ID/EX.MemReadPCWrite

IF/IDWrite

mux

0

Check if the destination register of EX == source register of the instruction in ID

Check if the destination register of MEM == source register of the instruction in ID

Insert a “noop” if we need to stall

inst[31:25],inst[5:0]

Performance of stall

42

add $1, $2, $3lw $4, 0($1)sub $5, $2, $4sub $1, $3, $1sw $1, 0($5)

WB

IF

IF EXEID MEM

IF EXEID

IF IF

ID ID

ID

IF

MEM WB

ID ID MEMEXE WB

IFIF ID MEMEXE WB

IF ID ID ID MEMEXE WB

15 cycles! CPI == 3(If there is no stall, CPI should be just 1!)

Insert a “noop” in EXE stageInsert another “noop” in EXE stage, previous noop goes to MEM stage

Sol. of data hazard II: Forwarding• The result is available after EXE and MEM stage,

but publicized in WB!• The data is already there, we should use it right

away!• Also called bypassing

43

add $1, $2, $3lw $4, 0($1)sub $5, $2, $4sub $1, $3, $1sw $1, 0($5)

IF EXEID

IF ID

IF

We can obtain the result here!

Sol. of data hazard II: Forwarding• Take the values, where ever they are!

44

add $1, $2, $3lw $4, 0($1)sub $5, $2, $4sub $1, $3, $1sw $1, 0($5)

IF EXEID

IF ID

IF

WBMEM

EXE

ID

IF

MEM WB

ID MEMEXE WB

IF ID MEMEXE WB

IF ID MEMEXE WB

10 cycles! CPI == 2 (Not optimal, but much better!)

When can/should we forward data?• If the instruction entering the EXE stage consumes a

result from a previous instruction that is entering MEM stage or WB stage• A source of the instruction entering EXE stage is the destination

of an instruction entering MEM/WB stage• The previous instruction must be an instruction that updates

register file

46

Forwarding in hardware

47

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrc

MemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero


inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

forwardingunit

mux

ForwardA

ForwardB

ForwardA

ForwardBdestination of Ins#1

Rs of Ins#2

Rt of Ins#2

ALU result of Ins#1

Control of Ins#1Control of Ins#2

inst[31:25],inst[5:0]

previous instruction (Ins#1)curernt instruction (Ins#2)

How about load?

Forwarding in hardware

48

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrc

MemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero


inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

forwardingunit

mux

ForwardA

ForwardB

ForwardA

ForwardB

Rd of Ins#1

ALU/MEM result of

Ins#1Control of Ins#1

inst[31:25],inst[5:0]

There is still a case that we have to stall...

• Revisit the following code:

49

add $1, $2, $3lw $4, 0($1)sub $5, $2, $4sub $1, $3, $1sw $1, 0($5)

IF EXEID

IF ID

IF

WBMEM

EXE

ID

IF

MEM WB

ID MEMEXE WB

IF ID MEMEXE WB

IF ID MEMEXE WB

lw generates result at MEM stage, we have to stall

• If the instruction entering EXE stage depends on a load instruction that does not finish its MEM stage yet, we have to stall!• We call this hazard detection

We need to know the following:1. If an instruction in EX/MEM updates a register (RegWrite)2. If an instruction in EX/MEM reads memory (MemRead)3. If the destination register of EX/MEM is a source of ID/EX (rs, rt of ID/EX == rt of EX/MEM #1)

Hazard detection with forwarding

50

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrc

MemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero


inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

forwardingunit

mux

ForwardA

ForwardB

ForwardA

ForwardB

hazard detection

unit


IF/IDWrite

mux

0

inst[31:25],inst[5:0]

Control hazard

51

Control hazard• The processor cannot determine the next PC to

fetch

54

LOOP: lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3 addi $s0, $s0, 4 bne $t1, $t0, LOOP lw $t3, 0($s0)

WB

MEM

EXE

ID

WB

MEM

EXE

IF EXEID

IF

IF

WBMEM

EXE MEM

ID EXE

IF ID

IF

ID

WB

MEM WB

IF ID MEMEXE WBstall

7 cycles per loop

Reducing the overhead of

control hazards

55

Solution I: Delayed branches• An agreement between ISA and hardware

• “Branch delay” slots: the next N instructions after a branch are always executed

• Compiler decides the instructions in branch delay slots• Reordering the instruction cannot affect the correctness of the program

• MIPS has one branch delay slot

• Good• Simple hardware

• Bad • N cannot change• Sometimes cannot find good candidates for the slot

56

Solution I: Delayed branches

57

LOOP: lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3 addi $s0, $s0, 4 bne $t1, $t0, LOOP

branch delay slot

LOOP: lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3 bne $t1, $t0, LOOP addi $s0, $s0, 4 lw $t3, 0($s0)

WB

MEM

EXE

ID

WB

MEM

EXE

IF EXEID

IF

IF

WBMEM

EXE MEM

ID EXE

IF ID

IF

ID

IF

WB

MEM WB

ID MEMEXE WBstall

6 cycles per loop

Solution II: always predict not-taken• Always predict the next PC is PC+4

58

LOOP: lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3 addi $s0, $s0, 4 bne $t1, $t0, LOOP sw $v0, 0($s1) add $t4, $t3, $t5

WB

MEM

EXE

ID

WB

MEM

EXE

IF EXEID

IF

IF

WBMEM

EXE MEM

ID EXE

IF ID

IF

ID

IF

If branch is not taken: no stalls!If branch is taken: doesn’t hurt!

ID

IF

WB

MEM

nop

nop

lw $t3, 0($s0) IF

WB

nop

nop

ID

nop

MEMEXE WB

7 cycles per loopflush the instructions fetched incorrectly

Solution III: always predict taken

61

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

AddShi>le>2

ALUSrc

MemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero


inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

forwardingunit

mux

ForwardA

ForwardB

ForwardA

ForwardB

hazard detection

unit


IF/IDWrite

mux

0

inst[31:25],inst[5:0]


62

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

Add

ALUSrc

MemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero


inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

forwardingunit

mux

ForwardA

ForwardB

ForwardA

ForwardB

hazard detection

unit


IF/IDWrite

mux

0

Shi>le>2

Still have to stall 1 cycle

inst[31:25],inst[5:0]


63

ReadAddress

Instruc(onMemory

PC ALU

WriteData

4Add

ReadData1

ReadData2

ReadReg1

ReadReg2

WriteReg

Register

File

inst[25:21]

inst[20:16]

inst[31:0]

mux

0

1

mux

0

1sign-extend 3216

DataMemory

Address ReadData

mux

1

0

WriteData

mux

1

0

Add

ALUSrc

MemtoReg

MemReadRegDst

RegWrite MemWrite

PCSrc

Zero


inst

[15:

11]

ALUop

ControlWB

ME

EX

WB

ME

WB

RegWrite

forwardingunit

mux

ForwardA

ForwardB

ForwardA

ForwardB

hazard detection

unit


IF/IDWrite

mux

0

Shi>le>2

Branch Target Buffer

Consult BTB in fetch stage

inst[31:25],inst[5:0]


64

PC


branch PCtarget address ortarget instruction

Solution III: always predict taken• Always predict taken with the help of BTB

65

LOOP: lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3 addi $s0, $s0, 4 bne $t1, $t0, LOOP

WB

MEM

EXE

ID

WB

MEM

EXE

IF EXEID

IF

IF

WBMEM

EXE MEM

ID EXE

IF ID

IF

ID

IF ID

IF

WB

MEM

EXE

ID

IF

WB

MEM

EXE

ID

WB

MEM WB

MEMEXE WB

5 cycles per loop(CPI == 1 !!!)

But what if the branch is not always taken?

lw $t3, 0($s0) addi $t0, $t0, 1 add $v0, $v0, $t3

Dynamic branch prediction

68

1-bit counter• Predict this branch will go the same way as the

result of the last time this branch executed• 1 for taken, 0 for not takens

69

0x400420 0x8048324 1

0x400464 0x8048392 1

0x400578 0x804850a 0

0x41000C 0x8049624 1


PC = 0x400420

Taken!

2-bit counter• A 2-bit counter for each branch• Predict taken if the counter value >= 2• If the prediction in taken states, fetch from target PC,

otherwise, use PC+4

Taken3 (11)

Taken2 (10)

NotTaken0 (00)

NotTaken1 (01)

taken

taken

taken

not taken

not taken

not taken

not taken

taken

71Branch Target Buffer

PC= 0x400420

Taken!0x400420 0x8048324 11

0x400464 0x8048392 10

0x400578 0x804850a 00

0x41000C 0x8049624 01

Performance of 2-bit counter• 2-bit state machine for each branch

for(i = 0; i < 10; i++) {! sum += a[i];}

90% accuracy!Taken3 (11)

Taken2 (10)

NotTaken0 (00)

NotTaken1 (01)

taken

taken

taken

not taken

not taken

not taken

not taken

taken • Application: 80% ALU, 20% Branch, and branch resolved in EX stage, average CPI?• 1+20%*(1-90%)*2 = 1.04 72

i state predict actual1 10 T T2 11 T T3 11 T T

4-9 11 T T10 11 T NT +

Make the prediction better• Consider the following code:

i = 0;do { if( i % 3 != 0) // Branch Y, taken if i % 3 == 0 a[i] *= 2; a[i] += i;} while ( ++i < 100) // Branch X

74

i branch result0 Y T0 X T1 Y NT1 X T2 Y NT2 X T3 Y T3 X T4 Y NT4 X T5 Y NT5 X T6 Y T6 X T7 Y NT

Can we capture the pattern?

Predict using history• Instead of using the PC to choose the predictor, use

a bit vector (global history register, GHR) made up of the previous branch outcomes.

• Each entry in the history table has its own counter.

75

0111101100111110

history table

n-bit GHR

2n entries

= 101 (T, NT, T)

Taken!

index

Performance of global history predictor

• Consider the following code:i = 0;do { if( i % 3 != 0) // Branch Y, taken if i % 3 == 0 a[i] *= 2; a[i] += i;// Branch Y} while ( ++i < 100) // Branch X

76

i ? GHR BHT prediction actual New BHT

0 Y 0000 10 T T 110 X 0001 10 T T 111 Y 0011 10 T NT 011 X 0110 10 T T 112 Y 1101 10 T NT 012 X 1010 10 T T 113 Y 0101 10 T T 113 X 1011 10 T T 114 Y 0111 10 T NT 014 X 1110 10 T T 115 Y 1101 01 NT NT 005 X 1010 11 T T 116 Y 0101 11 T T 116 X 1011 11 T T 117 Y 0111 01 NT NT 007 X 1110 11 T T 118 Y 1101 00 NT NT 008 X 1010 11 T T 119 Y 0101 11 T T 119 X 1011 11 T T 11

10 Y 0111 00 NT NT 00

Assume that we start with a 4-bit GHR= 0, all counters are 10.

Nearly perfect after this

Branch prediction and modern processors

79

Deeper pipeline

80

• Higher frequencies by shortening the pipeline stages• Higher marketing values since consumers usually link

performance with frequencies• Potentially higher power consumption as

dynamic/active power = aCV2f• If the execution time is better, still consume less energy

Case Study

81

Intel Pentium 4 Microarch.

82

Intel Pentium 4• Very deep pipeline: in order to achieve high frequency!

(start from 1.5GHz)• 20 stages in Netburst

• 31 stages in Prescott

• 103W (3.6GHz, 65nm) • Reference

• The Microarchitecture of the Pentium 4 Processor

1 2 3 4 5Drive

6Alloc

7 8 9Que

10Sch

11Sch

12Sch

13Disp

14Disp

15RF

16RF

17Ex

18Flgs

19Br Ck

20DriveTC Nxt IP TC Fetch Rename

83

AMD Athlon 64

84

AMD Athlon 64• 12 stage pipeline

• 89W TDP (Opteron 2.2GHz 90nm)

1Inst. AddrDecode

2Inst Mem

Read

3Inst. Byte

Pick

4ID1

5ID2

6Inst. Dbl. & Pack

7ID and Pack

8Dispatch

9Scheduling

10Execution

11D-CacheAddress

12D-cacheAccess

85

Demo revisited• Why the sorting the array speed up the code despite

the increased instruction count?

88

if(option) std::sort(data, data + arraySize);

for (unsigned i = 0; i < 100000; ++i) { int threshold = std::rand(); for (unsigned i = 0; i < arraySize; ++i) { if (data[i] >= threshold) sum ++; } }

Deep pipelining and data hazards

89

Data hazard revisited• How many cycles it takes to execute the following

code? • Draw the pipeline execution diagram

• assume that we have full data forwarding.lw $t1, 0($a0)lw $a0, 0($t1)bne $a0, $zero, 0

IF ID

IF EXE

ID

WB

MEM

ID

EXE

ID

IF

MEM

IF

ID WB

EX

90

MEM WB

9 cycles

INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES

2-2

2.1 THE SKYLAKE MICROARCHITECTURE The Skylake microarchitecture builds on the successes of the Haswell and Broadwell microarchitectures. The basic pipeline functionality of the Skylake microarchitecture is depicted in Figure 2-1.

The Skylake microarchitecture offers the following enhancements:• Larger internal buffers to enable deeper OOO execution and higher cache bandwidth.• Improved front end throughput.• Improved branch predictor.• Improved divider throughput and latency.• Lower power consumption.• Improved SMT performance with Hyper-Threading Technology.• Balanced floating-point ADD, MUL, FMA throughput and latency.

The microarchitecture supports flexible integration of multiple processor cores with a shared uncore sub-system consisting of a number of components including a ring interconnect to multiple slices of L3 (an off-die L4 is optional), processor graphics, integrated memory controller, interconnect fabrics, etc. A four-core configuration can be supported similar to the arrangement shown in Figure 2-3.

Figure 2-1. CPU Core Pipeline Functionality of the Skylake Microarchitecture

32K L1 Instruction Cache

MSROM Decoded Icache (DSB)

Legacy DecodePipeline

Instruction Decode Queue (IDQ,, or micro-op queue)

Allocate/Rename/Retire/MoveElimination/ZeroIdiom

32K L1 Data Cache

256K L2 Cache (Unified)

Int ALU, Vec FMA,Vec MUL,Vec Add,Vec ALU,Vec Shft,Divide,

Branch2

Port 2LD/STA

Scheduler

BPU

Port 0

Int ALU, Fast LEA,Vec FMA,Vec MUL,Vec Add,Vec ALU,Vec Shft,Int MUL,Slow LEA

Int ALU, Fast LEA,Vec SHUF,Vec ALU,

CVT

Int ALU, Int Shft,Branch1,

Port 3LD/STA

Port 4STD

Port 7STA

Port 1 Port 5 Port 6

5 uops/cycle4 uops/cycle6 uops/cycle

Intel’s latest SkyLake

92

Good reference for intel microarchitectures: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

Documents

Processor Design Pipelined ProcessorCycle time of a pipeline processor • Critical path is the longest possible delay between two registers in a design. • The critical path sets