1 COMP541 Pipelined MIPS Montek Singh Apr 9, 2012

1

COMP541COMP541

Pipelined MIPSPipelined MIPS

Montek SinghMontek Singh

Apr 9, 2012Apr 9, 2012

2

TopicsTopics Today’s topic: PipeliningToday’s topic: Pipelining

Can think of it as:Can think of it as:A way to parallelize, orA way to parallelize, orA way to make better utilization of the hardware.A way to make better utilization of the hardware.

Goal: Try to use all hardware every clock cycleGoal: Try to use all hardware every clock cycle

ReadingReading Section 7.5 of textbookSection 7.5 of textbook

ParallelismParallelism Two types of parallelism:Two types of parallelism:

Spatial parallelismSpatial parallelismduplicate hardware performs multiple tasks at onceduplicate hardware performs multiple tasks at once

Temporal parallelismTemporal parallelism task is broken into multiple stagestask is broken into multiple stages

– each stage operating on different parts of distinct instructionseach stage operating on different parts of distinct instructionsalso called pipeliningalso called pipelining

– example: an assembly lineexample: an assembly line

Parallelism DefinitionsParallelism Definitions Some definitions:Some definitions:

Token:Token: A group of inputs processed together to A group of inputs processed together to produce a group of outputsproduce a group of outputsa “bundle”a “bundle”

Latency:Latency: Time for one token to pass from start to end Time for one token to pass from start to end Throughput:Throughput: The number of tokens that can be The number of tokens that can be

processed per unit timeprocessed per unit time

Parallelism increases throughputParallelism increases throughput Often sacrificing latencyOften sacrificing latency

Parallelism ExampleParallelism Example Ben is baking cookies Ben is baking cookies

2-part task:2-part task: It takes 5 minutes to roll the cookies…It takes 5 minutes to roll the cookies…… … and 15 minutes to bake themand 15 minutes to bake them

After finishing one batch he immediately starts the After finishing one batch he immediately starts the next batch. What is the latency and throughput if Ben next batch. What is the latency and throughput if Ben does NOT use parallelism?does NOT use parallelism?

Latency = 5 + 15 = 20 min = 1/3 hourLatency = 5 + 15 = 20 min = 1/3 hour

Throughput = 1 tray/ 20 min = 3 Throughput = 1 tray/ 20 min = 3 trays/hourtrays/hour

Parallelism ExampleParallelism Example What is the latency and throughput if Ben uses What is the latency and throughput if Ben uses

parallelism?parallelism? Spatial parallelism: Spatial parallelism: Ben asks Allysa to help, using her Ben asks Allysa to help, using her

own ovenown oven Temporal parallelism:Temporal parallelism: Ben breaks the task into two Ben breaks the task into two

stages: roll and baking. He uses two trays. While the stages: roll and baking. He uses two trays. While the first batch is baking he rolls the second batch, and so first batch is baking he rolls the second batch, and so on.on.

Spatial ParallelismSpatial ParallelismS

pat

ial

Par

alle

lism

Roll

Bake

Ben 1 Ben 1

Alyssa 1 Alyssa 1

Ben 2 Ben 2

Alyssa 2 Alyssa 2

Time

0 5 10 15 20 25 30 35 40 45 50

Tray 1

Tray 2

Tray 3

Tray 4

Latency:time to

first tray

Legend

Latency = ?

Throughput = ?

Spatial ParallelismSpatial ParallelismS

pat

ial

Par

alle

lism

Roll

Bake

Ben 1 Ben 1

Alyssa 1 Alyssa 1

Ben 2 Ben 2

Alyssa 2 Alyssa 2

Time

0 5 10 15 20 25 30 35 40 45 50

Tray 1

Tray 2

Tray 3

Tray 4

Latency:time to

first tray

Legend

Latency = 5 + 15 = 20 min = 1/3 hour (same) Throughput = 2 trays/ 20 min = 6 trays/hour

(doubled)

Temporal ParallelismTemporal ParallelismT

emp

ora

lP

aral

leli

sm

Ben 1 Ben 1

Ben 2 Ben 2

Ben 3 Ben 3

Time

0 5 10 15 20 25 30 35 40 45 50

Latency:time to

first tray

Tray 1

Tray 2

Tray 3

Latency = ?

Throughput = ?

Temporal ParallelismTemporal ParallelismT

emp

ora

lP

aral

leli

sm

Ben 1 Ben 1

Ben 2 Ben 2

Ben 3 Ben 3

Time

0 5 10 15 20 25 30 35 40 45 50

Latency:time to

first tray

Tray 1

Tray 2

Tray 3

Latency = 5 + 15 = 20 min = 1/3 hourThroughput = 1 trays/ 15 min = 4 trays/hour

Using both techniques, the throughput would be 8 trays/hour

Pipelined MIPSPipelined MIPS Temporal parallelismTemporal parallelism Divide single-cycle processor into 5 stages:Divide single-cycle processor into 5 stages:

FetchFetch DecodeDecode ExecuteExecute MemoryMemory WritebackWriteback

Add pipeline registers between stagesAdd pipeline registers between stages

Single-Cycle vs. Pipelined Single-Cycle vs. Pipelined PerformancePerformance PipeliningPipelining

Break instruction into 5 stepsBreak instruction into 5 steps Each is 250 ps long (length of longest step)Each is 250 ps long (length of longest step)

Time (ps)Instr

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead / Write

WriteReg

1

2

0 100 200 300 400 500 600 700 800 900 1100 1200 1300 1400 1500 1600 1700 1800 19001000

Instr

1

2

3

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead / Write

WriteReg

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead/Write

WriteReg

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead/Write

WriteReg

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead/Write

WriteReg

Single-Cycle

Pipelined Write Reg and Read Reg occur during the first half and second half of the clock cycle, respectively. This allows more instr overlap!

Pipelining AbstractionPipelining Abstraction 5 stages of execution5 stages of execution

instruction memory (fetch)instruction memory (fetch) register file readregister file read ALU/executionALU/execution data memory accessdata memory access register file write (writeback)register file write (writeback)

Time (cycles)

lw $s2, 40($0) RF 40

$0RF

$s2+ DM

RF $t2

$t1RF

$s3+ DM

RF $s5

$s1RF

$s4- DM

RF $t6

$t5RF

$s5& DM

RF 20

$s1RF

$s6+ DM

RF $t4

$t3RF

$s7| DM

add $s3, $t1, $t2

sub $s4, $s1, $s5

and $s5, $t5, $t6

sw $s6, 20($s1)

or $s7, $t3, $t4

1 2 3 4 5 6 7 8 9 10

add

IM

IM

IM

IM

IM

IMlw

sub

and

sw

or

Single-Cycle and Pipelined Single-Cycle and Pipelined DatapathDatapath Key difference: insertion of pipeline registersKey difference: insertion of pipeline registers

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PCF0

1PC' InstrD

25:21

20:16

15:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchM

ResultW

PCPlus4EPCPlus4F

ZeroM

CLK CLK

ALU

WriteRegE4:0

CLK

CLK

CLK

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PC0

1PC' Instr

25:21

20:16

15:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

Zero

CLK

ALU

Fetch Decode Execute Memory Writeback

Multi-Cycle and Pipelined Multi-Cycle and Pipelined DatapathDatapath Key difference: full-length registers for datapath Key difference: full-length registers for datapath and and

controlcontrol

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PCF0

1PC' InstrD

25:21

20:16

15:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

ResultW

PCPlus4EPCPlus4F

ZeroM

CLK CLK

ALU

WriteRegE4:0

CLK

CLK

CLK

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PC0

1PC' Instr

25:21

20:16

15:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

Zero

CLK

ALU


SignImm

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1 0

1

PC0

1

PC' Instr25:21

20:16

15:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

RegD

st

Mem

toRe

g ZeroCLK

ALU

WD

WE

CLK

Adr

0

1Data

CLK

CLK

A

B00

01

10

11

4

CLK

ENEN

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PCF0

1PC' InstrD

25:21

20:16

15:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

ResultW

PCPlus4EPCPlus4F

ZeroM

CLK CLK

ALU

WriteRegE4:0

CLK

CLK

CLK

SignImm

CLK

A RD

InstructionMemory

+4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PC0

1PC' Instr

25:21

20:16

15:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

Zero

CLK

ALU


One ProblemOne Problem There is a problem:There is a problem:

ResultW and WriteReg are out of stepResultW and WriteReg are out of step

Corrected Pipelined DatapathCorrected Pipelined Datapath Solution: One modification neededSolution: One modification needed

Send both ResultW and WriteReg through an equal Send both ResultW and WriteReg through an equal number of registersnumber of registers

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PCF0

1PC' InstrD

25:21

20:16

15:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4EPCPlus4F

ZeroM

CLK CLK

WriteRegW4:0

ALU

WriteRegE4:0

CLK

CLK

CLK


Pipelined ControlPipelined Control What is similar/different w.r.t. single-cycle MIPS?What is similar/different w.r.t. single-cycle MIPS?

Same control signal valuesSame control signal values Values must be delayed and delivered in correct cyclesValues must be delayed and delivered in correct cycles

Control signals must go through registers as well!Control signals must go through registers as well!

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4EPCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

ZeroM

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU

RegWriteE RegWriteM RegWriteW

MemtoRegE MemtoRegM MemtoRegW

MemWriteE MemWriteM

BranchE BranchM

RegDstE

ALUSrcE

WriteRegE4:0

Pipelining ChallengesPipelining Challenges ““Hazards”Hazards”

when an instruction depends on results from previous when an instruction depends on results from previous instruction that has not yet completedinstruction that has not yet completed

2 types of hazards:2 types of hazards:Data hazard:Data hazard: register value not written back to register file register value not written back to register file

yetyet– an instruction produces a result needed by next instructionan instruction produces a result needed by next instruction– new value will be stored during write-back stagenew value will be stored during write-back stage– new value needed by next instr during register readnew value needed by next instr during register read

following too close behind!following too close behind!

Control hazard:Control hazard: don don’’t know which is next instructiont know which is next instruction– next PC not decided yetnext PC not decided yet– cause by conditional branchescause by conditional branches– cannot fetch next instr if current branch is not yet decidedcannot fetch next instr if current branch is not yet decided

Data Hazard: ExampleData Hazard: Example First instruction computes a result ($s0) First instruction computes a result ($s0)

needed by the next 3 instructionsneeded by the next 3 instructions first two cause problemsfirst two cause problems third actually does not!third actually does not!

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMadd

or

sub

Handling Data HazardsHandling Data Hazards Static/compiler approaches:Static/compiler approaches:

Insert Insert nop’nop’s (no operations) in code at compile times (no operations) in code at compile time Rearrange code at compile timeRearrange code at compile time

Dynamic/runtime approaches:Dynamic/runtime approaches: Forward data at run timeForward data at run time Stall the processor at run timeStall the processor at run time

Compile-Time Hazard EliminationCompile-Time Hazard Elimination Insert enough nops between dependent instrsInsert enough nops between dependent instrs Or re-arrange: move independent instructions Or re-arrange: move independent instructions

earlierearlier

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMadd

or

sub

nop

nop

RF RFDMnopIM

RF RFDMnopIM

9 10

Dynamic Approach: Data Dynamic Approach: Data ForwardingForwarding Also known as Also known as bypassingbypassing

results are actually available even though not stored results are actually available even though not stored in RFin RF

grab a copy and send where needed!grab a copy and send where needed!

Note:Note: forwarding actually not needed for sub. Why?forwarding actually not needed for sub. Why?

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMadd

or

sub

Data ForwardingData Forwarding

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

AL

U



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

For

wa

rdA

E

For

wa

rdB

E

20:16RtE

RsD

RdD

RtD

Reg

Wri

teM

Reg

Wri

teW

Hazard Unit

PCPlus4E

BranchE BranchM

ZeroM

Data ForwardingData Forwarding Forward to Execute stage from either:Forward to Execute stage from either:

Memory stage orMemory stage or Writeback stageWriteback stage

Forwarding logic for Forwarding logic for ForwardAEForwardAE::if ((if ((rsErsE != 0) AND ( != 0) AND (rsErsE == == WriteRegMWriteRegM) AND ) AND RegWriteMRegWriteM) then) then

ForwardAEForwardAE = 10 = 10

else if ((else if ((rsErsE != 0) AND ( != 0) AND (rsErsE == == WriteRegWWriteRegW) AND ) AND RegWriteWRegWriteW) then ) then ForwardAEForwardAE = 01 = 01

elseelse


Forwarding logic for Forwarding logic for ForwardBEForwardBE same, but replace same, but replace rsErsE with with rtErtE

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

AL

U



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

For

wa

rdA

E

For

wa

rdB

E

20:16RtE

RsD

RdD

RtD

Reg

Wri

teM

Reg

Wri

teW

Hazard Unit

PCPlus4E

BranchE BranchM

ZeroM

Data ForwardingData Forwardingif ((if ((rsErsE != 0) AND ( != 0) AND (rsErsE == == WriteRegMWriteRegM) AND ) AND RegWriteMRegWriteM) then) then


else if ((else if ((rsErsE != 0) AND ( != 0) AND (rsErsE == == WriteRegWWriteRegW) AND ) AND RegWriteWRegWriteW)) then )) then ForwardAEForwardAE = 01 = 01

elseelse


26

Forwarding may not always Forwarding may not always work…work… Example: Load followed immediately by R-Example: Load followed immediately by R-

typetype loads are harder to deal with because they need mem loads are harder to deal with because they need mem

access!access!need one extra cycle, compared to R-type instructionsneed one extra cycle, compared to R-type instructions

Time (cycles)

lw $s0, 40($0) RF 40

$0RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMlw

or

sub

Trouble!

lw has a 2-cycle latency!

StallingStalling Stall for a cycle, then forwardStall for a cycle, then forward

solves the Load followed by R-type problemsolves the Load followed by R-type problem

Time (cycles)

lw $s0, 40($0) RF 40

$0RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMlw

or

sub

9

RF $s1

$s0

IMor

Stall

Stalling HardwareStalling Hardware

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

Sta

llF

Sta

llD

For

war

dAE

For

war

dBE

20:16RtE

RsD

RdD

RtD

Reg

Writ

eM

Reg

Writ

eW

Mem

toR

egE

Hazard Unit

Flu

shE

PCPlus4E

BranchE BranchM

ZeroM

EN

EN

CLR

Stalling ControlStalling Control Stalling logic:Stalling logic:

lwstalllwstall = (( = ((rsDrsD == == rtErtE) OR () OR (rtDrtD == == rtErtE)) AND )) AND MemtoRegEMemtoRegE

StallFStallF = = StallDStallD = = FlushEFlushE = = lwstalllwstall

Stalling ControlStalling Control

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

Sta

llF

Sta

llD

For

war

dAE

For

war

dBE

20:16RtE

RsD

RdD

RtD

Reg

Writ

eM

Reg

Writ

eW

Mem

toR

egE

Hazard Unit

Flu

shE

PCPlus4E

BranchE BranchM

ZeroM

EN

EN

CLR

lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE

StallF = StallD = FlushE = lwstall

Control HazardsControl Hazards beqbeq: :

branch is not determined until the fourth stage of the pipelinebranch is not determined until the fourth stage of the pipeline Instructions after the branch are fetched before branch occursInstructions after the branch are fetched before branch occurs These instructions must be flushed if the branch is takenThese instructions must be flushed if the branch is taken

Effect & SolutionsEffect & Solutions Could always stall when branch decodedCould always stall when branch decoded

Expensive: 3 cycles lost per branch!Expensive: 3 cycles lost per branch!

Could predict banch outcome, and flush if Could predict banch outcome, and flush if wrongwrong Branch misprediction penaltyBranch misprediction penalty

Instructions flushed when branch is takenInstructions flushed when branch is takenMay be reduced by determining branch earlierMay be reduced by determining branch earlier

33

Control Hazards: FlushingControl Hazards: Flushing FlushingFlushing

turn instruction into a NOP (put all zeros)turn instruction into a NOP (put all zeros) renders it harmless!renders it harmless!

Time (cycles)

beq $t1, $t2, 40 RF $t2

$t1RF- DM

RF $s1

$s0RF& DM

RF $s0

$s4RF| DM

RF $s5

$s0RF- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMlw

or

sub

20

24

28

2C

30

...

...

9

Flushthese

instructions

64 slt $t3, $s2, $s3 RF $s3

$s2RF

$t3slt DMIM

slt

Control Hazards: Original Pipeline (for Control Hazards: Original Pipeline (for comparison)comparison)

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

AL

U



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

Sta

llF

Sta

llD

For

war

dAE

For

war

dBE

20:16RtE

RsD

RdD

RtD

Reg

Writ

eM

Reg

Writ

eW

Mem

toR

egE

Hazard Unit

Flu

shE

PCPlus4E

BranchE BranchM

ZeroM

EN

EN

CLR

Control Hazards: Early Branch ResolutionControl Hazards: Early Branch Resolution

EqualD

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchD

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcD

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

=

SignImmD

Sta

llF

Sta

llD

For

war

dAE

For

war

dBE

20:16RtE

RsD

RdE

RtD

Reg

Writ

eM

Reg

Writ

eW

Mem

toR

egE

Hazard Unit

Flu

shE

EN

EN

CLR

CLR

Introduced another data hazard in Decode stage (fix a few slides away)

Control Hazards with Early Branch Control Hazards with Early Branch ResolutionResolution

Time (cycles)

beq $t1, $t2, 40 RF $t2

$t1RF- DM

RF $s1

$s0RF& DMand $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

andIM

IMlw20

24

28

2C

30

...

...

9

Flushthis

instruction

64 slt $t3, $s2, $s3 RF $s3

$s2RF

$t3slt DMIM

slt

Penalty now only one lost cycle

Aside: Delayed BranchAside: Delayed Branch MIPS always executes instruction following a MIPS always executes instruction following a

branchbranch So branch So branch delayeddelayed

This allows us to avoid killing inst.This allows us to avoid killing inst. Compilers move instruction that has no conflict w/ Compilers move instruction that has no conflict w/

branch into delay slotbranch into delay slot

38

ExampleExample This sequenceThis sequence

add $4 $5 $6add $4 $5 $6

beq $1 $2 40beq $1 $2 40

reordered to thisreordered to thisbeq $1 $2 40beq $1 $2 40

add $4 $5 $6add $4 $5 $6

39

Handling the New HazardsHandling the New Hazards

EqualD

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchD

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcD

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

0

1

0

1

=

SignImmD

Sta

llF

Sta

llD

For

war

dAE

For

war

dBE

For

war

dAD

For

war

dBD

20:16RtE

RsD

RdD

RtD

Reg

Writ

eE

Reg

Writ

eM

Reg

Writ

eW

Mem

toR

egE

Bra

nchD

Hazard Unit

Flu

shE

EN

EN

CLR

CLR

Control Forwarding and Stalling Control Forwarding and Stalling HardwareHardware Forwarding logic:Forwarding logic:

ForwardADForwardAD = ( = (rsDrsD !=0) AND ( !=0) AND (rsDrsD == == WriteRegMWriteRegM) AND ) AND RegWriteMRegWriteM

ForwardBDForwardBD = ( = (rtDrtD !=0) AND ( !=0) AND (rtDrtD == WriteRegM) AND == WriteRegM) AND RegWriteMRegWriteM

Stalling logic:Stalling logic:branchstallbranchstall = = BranchDBranchD AND AND RegWriteERegWriteE AND AND

((WriteRegEWriteRegE == == rsDrsD OR OR WriteRegEWriteRegE == == rtDrtD) )

OR OR

BranchDBranchD AND AND MemtoRegMMemtoRegM AND AND

((WriteRegMWriteRegM == == rsDrsD OR OR WriteRegMWriteRegM == == rtDrtD))

StallFStallF = = StallDStallD = = FlushEFlushE = = lwstalllwstall OR OR branchstallbranchstall

Branch PredictionBranch Prediction Especially important if branch penalty > 1 Especially important if branch penalty > 1

cyclecycle Guess whether branch will be takenGuess whether branch will be taken

Backward branches are usually taken (loops)Backward branches are usually taken (loops) Perhaps consider history of whether branch was Perhaps consider history of whether branch was

previously taken to improve the guesspreviously taken to improve the guess

Good prediction reduces the fraction of Good prediction reduces the fraction of branches requiring a flush branches requiring a flush

Pipelined Performance ExamplePipelined Performance Example Ideally CPI = 1Ideally CPI = 1

But less due to: stalls (caused by loads and branches)But less due to: stalls (caused by loads and branches)

SPECINT2000 benchmark: SPECINT2000 benchmark: 25% loads25% loads 10% stores 10% stores 11% branches11% branches 2% jumps2% jumps 52% R-type52% R-type

Suppose:Suppose: 40% of loads used by next instruction40% of loads used by next instruction 25% of branches mispredicted25% of branches mispredicted All jumps flush next instructionAll jumps flush next instruction

What is the average CPI?What is the average CPI?

Pipelined Performance ExamplePipelined Performance Example SPECINT2000 benchmark: SPECINT2000 benchmark:

25% loads25% loads 10% stores 10% stores 11% branches11% branches 2% jumps2% jumps 52% R-type52% R-type

Suppose:Suppose: 40% of loads used by next instruction40% of loads used by next instruction 25% of branches mispredicted25% of branches mispredicted All jumps flush next instructionAll jumps flush next instruction

What is the average CPI?What is the average CPI? Load/Branch CPI = 1 when no stalling, 2 when stalling. Thus,Load/Branch CPI = 1 when no stalling, 2 when stalling. Thus,

CPIlw = 1(0.6) + 2(0.4) = 1.4CPIlw = 1(0.6) + 2(0.4) = 1.4 CPIbeq = 1(0.75) + 2(0.25) = 1.25CPIbeq = 1(0.75) + 2(0.25) = 1.25 Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) +

(0.52)(1) = 1.15(0.52)(1) = 1.15

Pipelined PerformancePipelined Performance Pipelined processor critical path:Pipelined processor critical path:

TTcc = max {= max {

ttpcqpcq + + ttmemmem + + ttsetup,setup,

2(2(ttRFreadRFread + + ttmuxmux + + tteq eq + + ttAND AND + + ttmuxmux + + ttsetup setup ),),

ttpcqpcq + + ttmux mux + + ttmuxmux + + ttALUALU + + ttsetup,setup,

ttpcqpcq + + ttmemwritememwrite + + ttsetup,setup,

2(2(ttpcqpcq + + ttmuxmux + + ttRFwriteRFwrite) }) }

EqualD

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchD

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcD

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

AL

U



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

0

1

0

1

=

SignImmD

Sta

llF

Sta

llD

For

war

dAE

For

war

dBE

For

war

dAD

For

war

dBD

20:16 RtE

RsD

RdD

RtD

Reg

Writ

eE

Reg

Writ

eM

Reg

Writ

eW

Mem

toR

egE

Bra

nchD

Hazard Unit

Flu

shE

EN

EN

CLR

CLR

Pipelined Performance ExamplePipelined Performance Example

Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup )

= 2[150 + 25 + 40 + 15 + 25 + 20] ps = 550 ps

Pipelined Performance ExamplePipelined Performance Example For a program with 100 billion instructions For a program with 100 billion instructions

executing on a pipelined MIPS processor,executing on a pipelined MIPS processor, CPI = 1.15CPI = 1.15TTcc = 550 ps = 550 ps

Execution Time = (# instructions) × CPI × TcExecution Time = (# instructions) × CPI × Tc

= (100 × 109)(1.15)(550 × 10-12)= (100 × 109)(1.15)(550 × 10-12)

= 63 seconds= 63 seconds

SummarySummary Pipelining benefitsPipelining benefits

use hardware more efficientlyuse hardware more efficiently throughput increasesthroughput increases

Pipelining challenges/drawbacksPipelining challenges/drawbacks latency increaseslatency increases hazards ensuehazards ensue energy/power consumption increasesenergy/power consumption increases

All modern processors are pipelinedAll modern processors are pipelined some have way more than 5 pipeline stagessome have way more than 5 pipeline stages

some Pentium’s have had 20-40 pipeline stages!some Pentium’s have had 20-40 pipeline stages!

Documents

1 COMP541 Pipelined MIPS Montek Singh Apr 9, 2012