Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Pipelining

Forecast
  Big Picture
  Datapath
  Control
  Data Hazards
    Stalls
    Forwarding
  Control Hazards
  Exceptions

Motivation

Single cycle implementation
  CPI = 1
  Cycle = imem + RFrd + ALU + dmem + RFwr + muxes + control
  E.g. 500+250+500+500+250+0+0 = 2000ps
  Time/program = P x 2ns

Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
             = (code size) x (CPI) x (cycle time)

Multicycle

Multicycle implementation:

Cycle:   1  2  3  4  5  6  7  8  9 10 11 12 13
Instr:
i        F  D  X  M  W
i+1                     F  D  X
i+2                              F  D  X  M
i+3                                          F
i+4

Multicycle

Multicycle implementation
  CPI = 3, 4, 5
  Cycle = max(memory, RF, ALU, mux, control) = max(500,250,500) = 500ps
  Time/prog = P x 4 x 500 = P x 2000ps = P x 2ns

Would like:
  CPI = 1 + overhead from hazards (later)
  Cycle = 500ps + overhead
  In practice, ~3x improvement
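
The timing comparison above can be checked with a short calculation. This is only a sketch: the function names and the instruction count P are assumptions made for illustration, the stage delays are the ones from the Motivation slide, and the pipelined figure ignores hazards and latch overhead, so it is an upper bound on the benefit.

```python
# Stage delays in picoseconds, taken from the single-cycle example above.
STAGE_DELAYS_PS = {"imem": 500, "RFrd": 250, "ALU": 500, "dmem": 500, "RFwr": 250}

def single_cycle_time_ps(instructions):
    # CPI = 1 and the cycle must cover every stage delay in sequence.
    return instructions * 1 * sum(STAGE_DELAYS_PS.values())

def multicycle_time_ps(instructions, avg_cpi=4):
    # The cycle is set by the slowest unit; the slide uses an average CPI of 4.
    return instructions * avg_cpi * max(STAGE_DELAYS_PS.values())

def ideal_pipelined_time_ps(instructions, depth=5):
    # After the pipeline fills, one instruction completes per cycle.
    cycles = instructions + depth - 1
    return cycles * max(STAGE_DELAYS_PS.values())

P = 1_000_000  # hypothetical instruction count
print(single_cycle_time_ps(P))
print(multicycle_time_ps(P))
print(ideal_pipelined_time_ps(P))
```

With these numbers the ideal pipeline is close to 4x faster than either alternative; hazards and latch overhead account for the gap between that and the ~3x quoted on the slide.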

Big Picture

Instruction latency = 5 cycles
Instruction throughput = 1/5 instr/cycle
CPI = 5 cycles per instruction

Instead
  Pipelining: process instructions like a lunch buffet
  ALL microprocessors use it
    E.g. Pentium 4, Athlon, PowerPC G5

Big Picture

Instruction latency = 5 cycles (same)
Instruction throughput = 1 instr/cycle
CPI = 1 cycle per instruction
CPI = cycles between instruction completions = 1

Ideal Pipelining

Bandwidth increases linearly with pipeline depth
Latency increases by latch delays

[Figure: combinational logic of n gate delays, BW = ~(1/n); the same logic split by latches (L) into two stages of n/2 gate delays each, BW = ~(2/n); and into three stages of n/3 gate delays each, BW = ~(3/n)]

Ideal Pipelining

Cycle:   1  2  3  4  5  6  7  8  9 10 11 12 13
Instr:
i        F  D  X  M  W
i+1         F  D  X  M  W
i+2            F  D  X  M  W
i+3               F  D  X  M  W
i+4                  F  D  X  M  W
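
The staggered diagram above follows a simple rule: instruction i+k enters F in cycle k+1 and advances one stage per cycle. A minimal sketch of that rule follows; the function names are illustrative assumptions and no stalls are modeled.

```python
STAGES = ["F", "D", "X", "M", "W"]

def ideal_schedule(num_instructions):
    """Return {instruction index: {cycle: stage}} for a stall-free pipeline."""
    schedule = {}
    for k in range(num_instructions):
        # Instruction k is fetched in cycle k + 1, then moves one stage per cycle.
        schedule[k] = {k + 1 + s: stage for s, stage in enumerate(STAGES)}
    return schedule

def total_cycles(num_instructions):
    # The first instruction needs len(STAGES) cycles; each later one adds a cycle.
    return num_instructions + len(STAGES) - 1

sched = ideal_schedule(5)
for k, row in sched.items():
    label = "i" if k == 0 else f"i+{k}"
    cells = [row.get(cycle, " ") for cycle in range(1, 14)]
    print(f"{label:<4}" + " ".join(cells))
print(total_cycles(5))
```

Five instructions finish in 9 cycles here, against 25 cycles if each one ran for 5 cycles with no overlap.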

Pipelining Idealisms

Uniform subcomputations
  Can pipeline into stages with equal delay
Identical computations
  Can fill pipeline with identical work
Independent computations
  No relationships between work units
Are these practical?
  No, but can get close enough to get significant speedup

Complications

Datapath
  Five (or more) instructions in flight
Control
  Must correspond to multiple instructions
Instructions may have data and control flow dependences
  I.e. units of work are not independent
  One may have to stall and wait for another

Datapath (Fig. 6.10)

Datapath (Fig. 6.11)

Control

Control
  Set by 5 different instructions
  Divide and conquer: carry IR down the pipe
MIPS ISA requires the appearance of sequential execution
  True of most general purpose ISAs

Program Dependences

[Figure: three instructions i1, i2, i3 in program order and the precedences between them]

The implied sequential precedences are an overspecification. They are sufficient but not necessary to ensure program correctness.

A true dependence between two instructions may only involve one subcomputation of each instruction.

Program Data Dependences

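Instruction i precedes instruction j in program order. In the formulas, read D(x) as the set of locations x writes and R(x) as the set of locations x reads.

True dependence (RAW)
  j cannot execute until i produces its result
  D(i) ∩ R(j) ≠ ∅
Anti-dependence (WAR)
  j cannot write its result until i has read its sources
  R(i) ∩ D(j) ≠ ∅
Output dependence (WAW)
  j cannot write its result until i has written its result
  D(i) ∩ D(j) ≠ ∅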
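
The three set conditions translate directly into set intersections. The sketch below is illustrative: the function name and the way instructions are described (explicit write and read sets) are assumptions, not part of the slides.

```python
def classify_dependences(i_writes, i_reads, j_writes, j_reads):
    """Instruction i precedes j; each argument is a set of register or memory names."""
    found = []
    if set(i_writes) & set(j_reads):
        found.append("RAW")  # j reads something i writes
    if set(i_reads) & set(j_writes):
        found.append("WAR")  # j overwrites something i still has to read
    if set(i_writes) & set(j_writes):
        found.append("WAW")  # both write the same location
    return found

# i: add $1, $2, $3    j: sub $4, $5, $1
print(classify_dependences({"$1"}, {"$2", "$3"}, {"$4"}, {"$5", "$1"}))  # ['RAW']
```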

Control Dependences

Conditional branches
  Branch must execute to determine which instruction to fetch next
  Instructions following a conditional branch are control dependent on the branch instruction

Example (quicksort/MIPS)

# for (; (j < high) && (array[j] < array[low]) ; ++j );
# $10 = j
# $9 = high
# $6 = array
# $8 = low

        bge   done, $10, $9
        mul   $15, $10, 4
        addu  $24, $6, $15
        lw    $25, 0($24)
        mul   $13, $8, 4
        addu  $14, $6, $13
        lw    $15, 0($14)
        bge   done, $25, $15

cont:   addu  $10, $10, 1
        . . .

done:   addu  $11, $11, -1

Resolution of Pipeline Hazards

Pipeline hazards
  Potential violations of program dependences
  Must ensure program dependences are not violated
Hazard resolution
  Static: compiler/programmer guarantees correctness
  Dynamic: hardware performs checks at runtime
Pipeline interlock
  Hardware mechanism for dynamic hazard resolution
  Must detect and enforce dependences at runtime

Pipeline Hazards

Necessary conditions:
  WAR: write stage earlier than read stage
    Is this possible in IF-RD-EX-MEM-WB?
  WAW: write stage earlier than write stage
    Is this possible in IF-RD-EX-MEM-WB?
  RAW: read stage earlier than write stage
    Is this possible in IF-RD-EX-MEM-WB?
If conditions not met, no need to resolve
Check for both register and memory

Pipeline Hazard Analysis

[Figure: pipeline stages IF, ID, RD, ALU, MEM, WB annotated with the register file ports (RAdd1, RData1, RAdd2, RData2, WAdd, WData) and the labels D, S1, S2, W/R]

Memory hazards
  RAW: Yes/No?
  WAR: Yes/No?
  WAW: Yes/No?
Register hazards
  RAW: Yes/No?
  WAR: Yes/No?
  WAW: Yes/No?
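
One way to work through these questions is to apply the necessary conditions mechanically. The sketch below does that under stated assumptions: instructions issue in order, stages are numbered from 0 at IF in the IF-RD-EX-MEM-WB pipeline, registers are read only in RD and written only in WB, and memory is both read and written only in MEM. The function name is illustrative.

```python
def possible_hazards(read_stages, write_stages):
    """Apply the necessary conditions to the stages where a resource is read and written."""
    return {
        # RAW: some read stage is earlier than some write stage.
        "RAW": min(read_stages) < max(write_stages),
        # WAR: some write stage is earlier than some read stage.
        "WAR": min(write_stages) < max(read_stages),
        # WAW: writes happen in more than one stage, so they could reorder.
        "WAW": min(write_stages) < max(write_stages),
    }

IF, RD, EX, MEM, WB = range(5)
print("registers:", possible_hazards(read_stages=[RD], write_stages=[WB]))
print("memory:   ", possible_hazards(read_stages=[MEM], write_stages=[MEM]))
```

Under these assumptions only the register RAW condition is met, which is why the later slides concentrate on detecting and resolving RAW hazards.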

RAW Hazard
Earlier instruction produces a value used by a later instruction:
  add $1, $2, $3
  sub $4, $5, $1

Cycle:   1  2  3  4  5  6  7  8  9 10 11 12 13
Instr:
add      F  D  X  M  W
sub         F  D  X  M  W

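RAW Hazard - Stall

Detect dependence and stall:
  add $1, $2, $3
  sub $4, $5, $1

sub is held in D until add writes $1 (assuming the register file is written in the first half of a cycle and read in the second, so sub can read $1 in cycle 5):

Cycle:   1  2  3  4  5  6  7  8  9 10 11 12 13
Instr:
add      F  D  X  M  W
sub         F  D  D  D  X  M  W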

Control Dependence
One instruction affects which executes next
  sw $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8

Cycle:   1  2  3  4  5  6  7  8  9 10 11 12 13
Instr:
sw       F  D  X  M  W
bne         F  D  X  M  W
sub            F  D  X  M  W

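Control Dependence - Stall
Detect dependence and stall
  sw $4, 0($5)
  bne $2, $3, loop
  sub $6, $7, $8

The fetch of sub waits until the branch outcome is known (assuming the branch resolves in M, a 3-cycle stall marked "-"):

Cycle:   1  2  3  4  5  6  7  8  9 10 11 12 13
Instr:
sw       F  D  X  M  W
bne         F  D  X  M  W
sub            -  -  -  F  D  X  M  W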

Pipelined Datapath
Start with single-cycle datapath (F6.10)
Pipelined execution
  Assume each instruction has its own datapath
  But each instruction uses a different part in every cycle
  Multiplex all on to one datapath
  Latches separate cycles (like multicycle)
Ignore hazards for now
  Data
  Control

Pipelined Datapath (Fig. 6.12)

[Figure: pipelined datapath. The PC, instruction memory, register file with sign extend, ALU with branch-target adder and shift left 2, and data memory are separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]

Pipelined Datapath

Instruction flow
  add and load
  Write of registers
  Pass register specifiers
Any info needed by a later stage gets passed down the pipeline
  E.g. store value through EX

Pipelined Control

IF and ID
  None
EX
  ALUOp, ALUSrc, RegDst
MEM
  Branch, MemRead, MemWrite
WB
  MemtoReg, RegWrite

Figure 6.25

[Figure: pipelined datapath with the control signals identified: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, and MemtoReg, across the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]

Figure 6.29

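[Figure: the control unit decodes the instruction from IF/ID; the EX, M, and WB control fields are stored in ID/EX, the M and WB fields in EX/MEM, and the WB field in MEM/WB]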

Figure 6.30

[Figure: pipelined datapath with the control unit attached; the EX, M, and WB control fields travel through the ID/EX, EX/MEM, and MEM/WB pipeline registers and drive RegDst, ALUOp, ALUSrc, Branch, MemRead, MemWrite, MemtoReg, RegWrite, and PCSrc]

Pipelined Control

Controlled by different instructions
Decode instructions and pass the signals down the pipe
Control sequencing is embedded in the pipeline
  No explicit FSM
  Instead, distributed FSM
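
The idea of decoding once and passing the signals down the pipe can be sketched as follows. The signal grouping comes from the Pipelined Control slide; the function names, the dictionary layout, and the control values shown for lw are illustrative assumptions rather than a description of the actual hardware.

```python
# Control signals grouped by the stage that consumes them.
EX_SIGNALS = ("ALUOp", "ALUSrc", "RegDst")
MEM_SIGNALS = ("Branch", "MemRead", "MemWrite")
WB_SIGNALS = ("MemtoReg", "RegWrite")

def decode(control_word):
    """Split a full control word into the bundle written into ID/EX."""
    return {
        "EX": {s: control_word[s] for s in EX_SIGNALS},
        "M": {s: control_word[s] for s in MEM_SIGNALS},
        "WB": {s: control_word[s] for s in WB_SIGNALS},
    }

def advance(bundle):
    """Consume the first remaining group and pass the rest to the next latch."""
    remaining = dict(bundle)
    first = next(iter(remaining))
    used = remaining.pop(first)
    return used, remaining

# Hypothetical control word for a load word instruction.
lw_controls = {"ALUOp": 0b00, "ALUSrc": 1, "RegDst": 0,
               "Branch": 0, "MemRead": 1, "MemWrite": 0,
               "MemtoReg": 1, "RegWrite": 1}

id_ex = decode(lw_controls)
ex_used, ex_mem = advance(id_ex)    # EX stage uses ALUOp, ALUSrc, RegDst
mem_used, mem_wb = advance(ex_mem)  # MEM stage uses Branch, MemRead, MemWrite
wb_used, done = advance(mem_wb)     # WB stage uses MemtoReg, RegWrite
print(ex_used, mem_used, wb_used)
```

Each latch holds only the signals still to be used, so no central state machine has to remember which instruction is in which stage.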

Pipelining

Not too complex yet
  Data hazards
  Control hazards
  Exceptions

RAW Hazards
Must first detect RAW hazards
  Pipeline analysis proves that WAR/WAW don’t occur

ID/EX.WriteRegister = IF/ID.ReadRegister1

ID/EX.WriteRegister = IF/ID.ReadRegister2

EX/MEM.WriteRegister = IF/ID.ReadRegister1

EX/MEM.WriteRegister = IF/ID.ReadRegister2

MEM/WB.WriteRegister = IF/ID.ReadRegister1

MEM/WB.WriteRegister = IF/ID.ReadRegister2

RAW Hazards

Not all hazards because
  WriteRegister not used (e.g. sw)
  ReadRegister not used (e.g. addi, jump)
  Do something only if necessary
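
The six comparisons, together with the "only if necessary" filter, can be sketched as below. The InFlight class, its field name, and the function name are hypothetical; the sketch assumes each latch records either the register its instruction will write or nothing at all.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InFlight:
    """Destination of the instruction held in a pipeline latch (hypothetical layout)."""
    write_register: Optional[int]  # None when no register is written (e.g. sw)

def raw_hazard(read_registers, id_ex, ex_mem, mem_wb):
    """read_registers: the registers the instruction in IF/ID actually reads."""
    for latch in (id_ex, ex_mem, mem_wb):
        if latch.write_register is None:
            continue  # WriteRegister not used, so no hazard from this latch
        if latch.write_register in read_registers:
            return True
    return False

# add $1, $2, $3 sits in ID/EX while sub $4, $5, $1 is in IF/ID.
print(raw_hazard({5, 1}, InFlight(1), InFlight(None), InFlight(None)))  # True
```

Passing only the registers that are really read covers the ReadRegister-not-used case, since an unused field is never compared.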

RAW Hazards

Hazard Detection Unit
  Several 5-bit (or 6-bit) comparators
Response? Stall pipeline
  Instructions in IF and ID stay
  IF/ID pipeline latch not updated
  Send ‘nop’ down pipeline (called a bubble)
  PCWrite, IF/IDWrite, and nop mux

RAW Hazard Forwarding

A better response – forwarding
  Also called bypassing
Comparators ensure register is read after it is written
Instead of stalling until write occurs
  Use mux to select forwarded value rather than register value
  Control mux with hazard detection logic
Use temporary results, don’t wait for them to be written
  Register file forwarding to handle read/write to same register
  ALU forwarding

Forwarding

Time (in clock cycles)  CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:   10    10    10    10    10/-20  -20   -20   -20   -20
Value of EX/MEM:        X     X     X     -20   X       X     X     X     X
Value of MEM/WB:        X     X     X     X     -20     X     X     X     X

Program execution order (in instructions); each instruction passes through IM, Reg, ALU, DM, Reg in consecutive cycles:
  sub $2, $1, $3      CC 1 to CC 5
  and $12, $2, $5     CC 2 to CC 6
  or  $13, $6, $2     CC 3 to CC 7
  add $14, $2, $2     CC 4 to CC 8
  sw  $15, 100($2)    CC 5 to CC 9

what if this $2 was $13?
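
The mux selection for one ALU input can be sketched as a priority check. This follows the usual textbook formulation and is an assumption about the details: the function name and return labels are illustrative, and the test against register 0 reflects the MIPS convention that $0 is always zero.

```python
def forward_select(id_ex_rs, ex_mem_rd, ex_mem_regwrite, mem_wb_rd, mem_wb_regwrite):
    """Choose the source for one ALU operand: 'EX/MEM', 'MEM/WB', or 'REG'."""
    # The most recent producer wins, so EX/MEM is checked before MEM/WB.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX/MEM"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM/WB"
    return "REG"  # no hazard: use the value read from the register file

# and $12, $2, $5 in EX while sub $2, $1, $3 is in EX/MEM:
print(forward_select(2, 2, True, 0, False))
# or $13, $6, $2 in EX while sub is in MEM/WB and `and` ($12) is in EX/MEM:
print(forward_select(2, 12, True, 2, True))
```

The same function is applied once per ALU operand (Rs and Rt), which is why the forwarding datapath has two multiplexers in front of the ALU.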

Forwarding

[Figure: pipelined datapath with a forwarding unit. The unit compares EX/MEM.RegisterRd and MEM/WB.RegisterRd with the Rs and Rt fields carried down from IF/ID (IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd) and controls the multiplexers at the ALU inputs]

Forwarding Paths (ALU instructions)

[Figure: forwarding paths a, b, and c into the ALU across stages IF, ID, RD, ALU, MEM, WB, where instruction i writes R1 and a later instruction reads R1]
  (i -> i+1): forwarding via path a
  (i -> i+2): forwarding via path b
  (i -> i+3): i writes R1 before i+3 reads R1

Write before Read RF

Register file design
  2-phase clocks common
  Write RF on first phase
  Read RF on second phase
Hence, same cycle:
  Write $1
  Read $1
No bypass needed
  If read before write or DFF-based, need bypass

Implementation of ALU Forwarding

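[Figure: ALU forwarding implementation. Comparators (Comp) on the register specifiers control multiplexers that select each ALU input from the register file or from a forwarded ALU result]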

Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register. Thus, we need a hazard detection unit to “stall” the instruction that follows the load.

Can't always forward

Time (in clock cycles): CC 1 to CC 9

Program execution order (in instructions); each instruction passes through IM, Reg, ALU, DM, Reg in consecutive cycles:
  lw  $2, 20($1)     CC 1 to CC 5
  and $4, $2, $5     CC 2 to CC 6
  or  $8, $2, $6     CC 3 to CC 7
  add $9, $4, $2     CC 4 to CC 8
  slt $1, $6, $7     CC 5 to CC 9

Stalling

We can stall the pipeline by keeping an instruction in the same stage

Time (in clock cycles): CC 1 to CC 10

Program execution order (in instructions):
  lw  $2, 20($1)
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7

[Figure: and repeats its Reg stage and or repeats its IM stage for one cycle, a bubble follows lw down the pipeline, and the sequence finishes in CC 10]
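
The load-use check and the stall response (PCWrite, IF/IDWrite, and the nop mux) can be sketched together. The condition shown is the common textbook one and is an assumption here, as are the function name and the InsertBubble key.

```python
def hazard_unit(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Return the stall controls for the instruction currently in IF/ID."""
    # A load in EX whose destination (Rt) is a source of the next instruction
    # cannot be satisfied by forwarding in time, so stall for one cycle.
    stall = bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))
    return {
        "PCWrite": not stall,      # hold the PC so the same fetch repeats
        "IF/IDWrite": not stall,   # hold IF/ID so decode repeats
        "InsertBubble": stall,     # zero the control signals entering ID/EX
    }

# lw $2, 20($1) in EX, and $4, $2, $5 in ID:
print(hazard_unit(True, 2, 2, 5))
# and $4, $2, $5 in EX (not a load), or $8, $2, $6 in ID:
print(hazard_unit(False, 4, 2, 6))
```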

Forwarding Paths(Load instructions)

[Figure: load forwarding path d across stages IF, ID, RD, ALU, MEM, WB, where load instruction i writes R1 from MEM[] and a later instruction reads R1]
  (i -> i+1): stall i+1, then forwarding via path d
  (i -> i+2): i writes R1 before i+2 reads R1

Implementation of Load Forwarding

[Figure: load forwarding implementation. The ALU forwarding datapath is extended with a path from the D-cache load data to the ALU input multiplexers, plus comparators that detect a load and stall IF, ID, and RD]

Control Flow Hazards

Control flow instructions
  branches, jumps, jals, returns
  Can’t fetch until branch outcome known
  Too late for next IF

What to do?
  Always stall
    Easy to implement
    Performs poorly
    1/6th of instructions are branches, each branch takes 3 cycles
    CPI = 1 + 3 x 1/6 = 1.5 (lower bound)
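
The CPI figure above is a base CPI plus the stall cycles contributed by branches. A minimal sketch of that arithmetic, with an illustrative function name:

```python
from fractions import Fraction

def stall_cpi(base_cpi, branch_fraction, branch_penalty_cycles):
    # Every branch adds its penalty; all other instructions add nothing.
    return base_cpi + branch_fraction * branch_penalty_cycles

cpi = stall_cpi(1, Fraction(1, 6), 3)
print(float(cpi))  # 1.5
```

It is a lower bound because data hazard stalls are not counted.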


Predict branch not taken
  Send sequential instructions down pipeline
  Kill instructions later if incorrect
  Must stop memory accesses and RF writes
  Late flush of instructions on misprediction
    Complex
    Global signal (wire delay)

When we decide to branch, other instructions are in the pipeline!

We are predicting “branch not taken”
  Need to add hardware for flushing instructions if we are wrong

Branch Hazards

Time (in clock cycles): CC 1 to CC 9

Program execution order (in instructions), with instruction addresses:
  40 beq $1, $3, 7       CC 1 to CC 5
  44 and $12, $2, $5     CC 2 to CC 6
  48 or  $13, $6, $2     CC 3 to CC 7
  52 add $14, $2, $2     CC 4 to CC 8
  72 lw  $4, 50($7)      CC 5 to CC 9

Flushing Instructions

[Figure: pipelined datapath with hazard detection unit and forwarding unit; the register comparison (=) and the branch-target adder sit in the ID stage, and an IF.Flush signal clears the IF/ID register]


Even better but more complex
  Predict taken
  Predict both (eager execution)
  Predict one or the other dynamically
    Adapt to program branch patterns
    Lots of chip real estate these days
      Pentium III, 4, Alpha 21264
    Current research topic

Improving Performance

Try and avoid stalls! E.g., reorder these instructions:
  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)

Add a “branch delay slot”
  Always execute the instruction following the branch, the “delay slot” (later example on MIPS pipeline)
  Put useful instruction there, otherwise ‘nop’
  Rely on compiler to “fill” the slot with something useful

Superscalar: start more than one instruction in the same cycle
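
For the reordering exercise above, a small checker can count how many adjacent load-use pairs an instruction order contains. This is a sketch under stated assumptions: it handles only lw, sw, and three-register ALU instructions, it treats any use of a loaded register by the very next instruction as one stall, and the reordered sequence is one possible answer rather than the only one.

```python
import re

def parse(line):
    """Split 'lw $t0, 0($t1)' into ('lw', ['$t0', '$t1'])."""
    op, rest = line.split(None, 1)
    return op, re.findall(r"\$\w+", rest)

def sources(op, regs):
    # sw reads every register it names; other instructions write the first one.
    return regs if op == "sw" else regs[1:]

def count_load_use_stalls(program):
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_regs = parse(prev)
        cur_op, cur_regs = parse(cur)
        # A load followed immediately by a reader of its destination stalls.
        if prev_op == "lw" and prev_regs[0] in sources(cur_op, cur_regs):
            stalls += 1
    return stalls

original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]
print(count_load_use_stalls(original), count_load_use_stalls(reordered))
```

Swapping the two stores is safe because they write different addresses, and it puts one instruction between the second load and its use.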

Dynamic Scheduling

The hardware performs the “scheduling”
  Hardware tries to find instructions to execute
  Out of order execution is possible
  Speculative execution and dynamic branch prediction

All modern processors are very complicated
  DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
  PowerPC and Pentium: branch history table
  Compiler technology important

This class has given you the background you need to learn more

Exceptions

Even worse: in one cycle
  I/O interrupt
  User trap to OS (EX)
  Illegal instruction (ID)
  Arithmetic overflow
  Hardware error
  Etc.

Interrupt priorities must be supported
