56
Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Embed Size (px)

Citation preview

Page 1: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Chapter 6: Pipelining

Instructor: Mozafar Bag Mohammadi

University of Ilam

Page 2: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Pipelining

Forecast Big Picture Datapath Control Data Hazards

Stalls Forwarding

Control Hazards Exceptions

Page 3: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Motivation

Single cycle implementation CPI = 1 Cycle = imem + RFrd + ALU + dmem + RFwr +

muxes + control E.g. 500+250+500+500+250+0+0 = 2000ps Time/program = P x 2ns

Instructions Cycles

Program InstructionTimeCycle

(code size)

X X

(CPI) (cycle time)

Page 4: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Multicycle

Multicycle implementation:

Cycle:Instr:

1 2 3 4 5 6 7 8 9 10 11 12 13

i F D X M W

i+1 F D X

i+2 F D X M

i+3 F

i+4

Page 5: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Multicycle

Multicycle implementation CPI = 3, 4, 5 Cycle = max(memory, RF, ALU, mux, control) =max(500,250,500) = 500ps Time/prog = P x 4 x 500 = P x 2000ps = P x 2ns

Would like: CPI = 1 + overhead from hazards (later) Cycle = 500ps + overhead In practice, ~3x improvement

Page 6: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Big Picture

Instruction latency = 5 cycles Instruction throughput = 1/5 instr/cycle CPI = 5 cycles per instruction Instead

Pipelining: process instructions like a lunch buffet ALL microprocessors use it

E.g. Pentium 4, Athlon, PowerPC G5

Page 7: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Big Picture

Instruction Latency = 5 cycles (same) Instruction throughput = 1 instr/cycle CPI = 1 cycle per instruction CPI = cycle between instruction completion = 1

Page 8: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Ideal Pipelining

Bandwidth increases linearly with pipeline depth Latency increases by latch delays

GateDelay

Comb. Logicn Gate Delay

GateDelayL Gate

DelayL

L GateDelayL Gate

DelayL

L BW = ~(1/n)

n--2n--2

n--3n--3

n--3

BW = ~(2/n)

BW = ~(3/n)

Page 9: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Ideal Pipelining

Cycle:Instr:

1 2 3 4 5 6 7 8 9 10 11 12 13

i F D X M W

i+1 F D X M W

i+2 F D X M W

i+3 F D X M W

i+4 F D X M W

Page 10: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Pipelining Idealisms

Uniform subcomputations Can pipeline into stages with equal delay

Identical computations Can fill pipeline with identical work

Independent computations No relationships between work units

Are these practical? No, but can get close enough to get significant speedup

Page 11: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Complications

Datapath Five (or more) instructions in flight

Control Must correspond to multiple instructions

Instructions may have data and control flow dependences I.e. units of work are not independent

One may have to stall and wait for another

Page 12: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Datapath (Fig. 6.10)

Page 13: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Datapath (Fig. 6.11)

Page 14: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Control

Control Set by 5 different instructions Divide and conquer: carry IR down the pipe

MIPS ISA requires the appearance of sequential execution True of most general purpose ISAs

Page 15: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Program Dependences

i1: xxxx

i2: xxxx

i3: xxxx

i1

i2

i3

i1:

i2:

i3:

The implied sequential precedences are an overspecification. It is sufficient but not necessary to ensure program correctness.

A true dependence between two instructions may only involve one subcomputationof each instruction.

Page 16: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Program Data Dependences

True dependence (RAW) j cannot execute until i produces

its result Anti-dependence (WAR)

j cannot write its result until i has read its sources

Output dependence (WAW) j cannot write its result until i has

written its result

)()( jRiD

)()( jDiR

)()( jDiD

Page 17: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Control Dependences

Conditional branches Branch must execute to determine which

instruction to fetch next Instructions following a conditional branch are

control dependent on the branch instruction

Page 18: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Example (quicksort/MIPS)

# for (; (j < high) && (array[j] < array[low]) ; ++j );# $10 = j# $9 = high# $6 = array# $8 = low

bge done, $10, $9mul $15, $10, 4addu $24, $6, $15lw $25, 0($24)mul $13, $8, 4addu $14, $6, $13lw $15, 0($14)bge done, $25, $15

cont:addu $10, $10, 1. . .

done:addu $11, $11, -1

Page 19: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Resolution of Pipeline Hazards

Pipeline hazards Potential violations of program dependences Must ensure program dependences are not violated

Hazard resolution Static: compiler/programmer guarantees correctness Dynamic: hardware performs checks at runtime

Pipeline interlock Hardware mechanism for dynamic hazard resolution Must detect and enforce dependences at runtime

Page 20: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Pipeline Hazards

Necessary conditions: WAR: write stage earlier than read stage

Is this possible in IF-RD-EX-MEM-WB ? WAW: write stage earlier than write stage

Is this possible in IF-RD-EX-MEM-WB ? RAW: read stage earlier than write stage

Is this possible in IF-RD-EX-MEM-WB? If conditions not met, no need to resolve Check for both register and memory

Page 21: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Pipeline Hazard Analysis

ALU

RD

IFIF

ID

RD

ALU

MEM

WB

D

S1

S2

W/RWData

RData2

RegisterFile

RAdd2RData1

WAdd

RAdd1

Memory hazards RAW: Yes/No? WAR: Yes/No? WAW: Yes/No?

Register hazards RAW: Yes/No? WAR: Yes/No? WAW: Yes/No?

Page 22: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

RAW Hazard Earlier instruction produces a value used by a later

instruction: add $1, $2, $3 sub $4, $5, $1

Cycle:Instr:

1 2 3 4 5 6 7 8 9 10 11 12 13

add F D X M W

sub F D X M W

Page 23: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

RAW Hazard - Stall

Detect dependence and stall: add $1, $2, $3 sub $4, $5, $1

Cycle:Instr:

1 2 3 4 5 6 7 8 9 10 11 12 13

add F D X M W

sub F D X M W

Page 24: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Control Dependence One instruction affects which executes next

sw $4, 0($5) bne $2, $3, loop sub $6, $7, $8

Cycle:Instr:

1 2 3 4 5 6 7 8 9 10 11 12 13

sw F D X M W

bne F D X M W

sub F D X M W

Page 25: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Control Dependence - Stall Detect dependence and stall

sw $4, 0($5) bne $2, $3, loop sub $6, $7, $8

Cycle:Instr:

1 2 3 4 5 6 7 8 9 10 11 12 13

sw F D X M W

bne F D X M W

sub F D X M W

Page 26: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Pipelined Datapath Start with single-cycle datapath (F6.10) Pipelined execution

Assume each instruction has its own datapath But each instruction uses a different part in every cycle Multiplex all on to one datapath Latches separate cycles (like multicycle)

Ignore hazards for now Data control

Page 27: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Pipelined Datapath (Fig. 6.12)

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Datamemory

Address

Page 28: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Pipelined Datapath

Instruction flow add and load Write of registers Pass register specifiers

Any info needed by a later stage gets passed down the pipeline E.g. store value through EX

Page 29: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Pipelined Control

IF and ID None

EX ALUop, ALUsrc, RegDst

MEM Branch, MemRead, MemWrite

WB MemtoReg, RegWrite

Page 30: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Figure 6.25

PC

Instructionmemory

Address

Inst

ruct

ion

Instruction[20– 16]

MemtoReg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0Registers

Writeregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1Write

data

Read

data Mux

1

ALUcontrol

RegWrite

MemRead

Instruction[15– 11]

6

IF/ID ID/EX EX/MEM MEM/WB

MemWrite

Address

Datamemory

PCSrc

Zero

AddAdd

result

Shiftleft 2

ALUresult

ALU

Zero

Add

0

1

Mux

0

1

Mux

Page 31: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Figure 6.29

Control

EX

M

WB

M

WB

WB

IF/ID ID/EX EX/MEM MEM/WB

Instruction

Page 32: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Figure 6.30

PC

Instructionmemory

Inst

ruct

ion

Add

Instruction[20– 16]

Me

mto

Re

g

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

Writedata

Readdata

Mux

1

ALUcontrol

Shiftleft 2

Re

gWrit

e

MemRead

Control

ALU

Instruction[15– 11]

6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

Mux

0

1

Me

mW

rite

AddressData

memory

Address

Page 33: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Pipelined Control

Controlled by different instructions Decode instructions and pass the signals

down the pipe Control sequencing is embedded in the

pipeline No explicit FSM Instead, distributed FSM

Page 34: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Pipelining

Not too complex yet Data hazards Control hazards Exceptions

Page 35: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

RAW Hazards Must first detect RAW hazards

Pipeline analysis proves that WAR/WAW don’t occur

ID/EX.WriteRegister = IF/ID.ReadRegister1

ID/EX.WriteRegister = IF/ID.ReadRegister2

EX/MEM.WriteRegister = IF/ID.ReadRegister1

EX/MEM.WriteRegister = IF/ID.ReadRegister2

MEM/WB.WriteRegister = IF/ID.ReadRegister1

MEM/WB.WriteRegister = IF/ID.ReadRegister2

Page 36: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

RAW Hazards

Not all hazards because WriteRegister not used (e.g. sw) ReadRegister not used (e.g. addi, jump) Do something only if necessary

Page 37: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

RAW Hazards

Hazard Detection Unit Several 5-bit (or 6-bit) comparators

Response? Stall pipeline Instructions in IF and ID stay IF/ID pipeline latch not updated Send ‘nop’ down pipeline (called a bubble) PCWrite, IF/IDWrite, and nop mux

Page 38: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

RAW Hazard Forwarding

A better response – forwarding Also called bypassing

Comparators ensure register is read after it is written

Instead of stalling until write occurs Use mux to select forwarded value rather than

register value Control mux with hazard detection logic

Page 39: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Use temporary results, don’t wait for them to be written register file forwarding to handle read/write to same register ALU forwarding

Forwarding

what if this $2 was $13?

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecution order(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2 :

DM Reg

Reg

Reg

Reg

X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :

DM

Page 40: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Forwarding

PCInstruction

memory

Registers

Mux

Mux

Control

ALU

EX

M

WB

M

WB

WB

ID/EX

EX/MEM

MEM/WB

Datamemory

Mux

Forwardingunit

IF/ID

Inst

ruct

ion

Mux

RdEX/MEM.RegisterRd

MEM/WB.RegisterRd

Rt

Rt

Rs

IF/ID.RegisterRd

IF/ID.RegisterRt

IF/ID.RegisterRt

IF/ID.RegisterRs

Page 41: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Forwarding Paths (ALU instructions)

FORWARDING

IF

ID

b

ALU

PATHS

a

i+1: i+2: i+3:

i: R1

i: R1

i: R1

(i i+1)

Forwarding

via Path a

i+1:

i+1:

i+2:

(i i+2)

Forwardingvia Path b

(i i+3)

i writes R1before i+3 reads R1

RD

ALU

MEM

WB

R1 R1

R1

R1

R1c

Page 42: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Write before Read RF

Register file design 2-phase clocks common Write RF on first phase Read RF on second phase

Hence, same cycle: Write $1 Read $1

No bypass needed If read before write or DFF-based, need bypass

Page 43: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Implementation of ALU Forwarding

ALU

RegisterFile

••

• •

1 0 1 0

1 0 1 0

ALU

Comp Comp Comp Comp

Page 44: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the

same register. Thus, we need a hazard detection unit to “stall” the load instruction

Can't always forward

Reg

IM

Reg

Reg

IM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

DM Reg

Reg

Reg

DM

Page 45: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Stalling

We can stall the pipeline by keeping an instruction in the same stage

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

Reg

IM

Reg

Reg

IM DM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)

IM Reg DM RegIM

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9 CC 10

DM Reg

RegReg

Reg

bubble

Page 46: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Forwarding Paths(Load instructions)

(i i+2)(i i+1)

IF

ID

eLOAD

FORWARDINGPATH(s)

i+1: i+1: i+2:

i+1:

RD

ALU

MEM

WB

i:R1

i:R1

i:R1

(i i+1)

Stall i+1 Forwarding

via Path d

i writes R1before i+2 reads R1

d

R1 R1 R1

R1

MEM[]

MEM[]

MEM[]

Page 47: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Implementation of Load Forwarding

ALU

RegisterFile

••

• •

1 0 1 0

1 0 1 0

ALU

CompComp CompComp

1 0 1 0

Load

StallIF,ID,RD

•Data

Add

D-Cache

r

LOAD

Page 48: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Control Flow Hazards

Control flow instructions branches, jumps, jals, returns Can’t fetch until branch outcome known Too late for next IF

Page 49: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Control Flow Hazards

What to do? Always stall Easy to implement Performs poorly 1/6th instructions are branches, each branch takes

3 cycles CPI = 1 + 3 x 1/6 = 1.5 (lower bound)

Page 50: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Control Flow Hazards

Predict branch not taken Send sequential instructions down pipeline Kill instructions later if incorrect Must stop memory accesses and RF writes Late flush of instructions on misprediction

Complex Global signal (wire delay)

Page 51: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

When we decide to branch, other instructions are in the pipeline!

We are predicting “branch not taken” need to add hardware for flushing instructions if we are wrong

Branch Hazards

Reg

Reg

CC 1

Time (in clock cycles)

40 beq $1, $3, 7

Programexecutionorder(in instructions)

IM Reg

IM DM

IM DM

IM DM

DM

DM Reg

Reg Reg

Reg

Reg

RegIM

44 and $12, $2, $5

48 or $13, $6, $2

52 add $14, $2, $2

72 lw $4, 50($7)

CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9

Reg

Page 52: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Flushing Instructions

PCInstruction

memory

4

Registers

Mux

Mux

Mux

ALU

EX

M

WB

M

WB

WB

ID/EX

0

EX/MEM

MEM/WB

Datamemory

Mux

Hazarddetection

unit

Forwardingunit

IF.Flush

IF/ID

Signextend

Control

Mux

=

Shiftleft 2

Mux

Page 53: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Control Flow Hazards

Even better but more complex Predict taken Predict both (eager execution) Predict one or the other dynamically

Adapt to program branch patterns Lots of chip real estate these days

Pentium III, 4, Alpha 21264 Current research topic

Page 54: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Improving Performance

Try and avoid stalls! E.g., reorder these instructions:lw $t0, 0($t1)lw $t2, 4($t1)sw $t2, 0($t1)sw $t0, 4($t1)

Add a “branch delay slot” Always execute following instruction “delay slot” (later example on MIPS pipeline) Put useful instruction there, otherwise ‘nop’ rely on compiler to “fill” the slot with something useful

Superscalar: start more than one instruction in the same cycle

Page 55: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Dynamic Scheduling

The hardware performs the “scheduling” hardware tries to find instructions to execute out of order execution is possible speculative execution and dynamic branch prediction

All modern processors are very complicated DEC Alpha 21264: 9 stage pipeline, 6 instruction issue PowerPC and Pentium: branch history table Compiler technology important

This class has given you the background you need to learn more

Page 56: Chapter 6: Pipelining Instructor: Mozafar Bag Mohammadi University of Ilam

Exceptions

Even worse: in one cycle I/O interrupt User trap to OS (EX) Illegal instruction (ID) Arithmetic overflow Hardware error Etc.

Interrupt priorities must be supported