Lecture 09: RISC-V Pipeline Implementa8on - PASSLab · Lecture 09: RISC-V Pipeline Implementa8on...

Preview:

Citation preview

Lecture09:RISC-VPipelineImplementa8on

CSE564ComputerArchitectureSummer2017

DepartmentofComputerScienceandEngineeringYonghongYan

yan@oakland.eduwww.secs.oakland.edu/~yan

1

Acknowledgement

•  SlidesadaptedfromComputerScience152:ComputerArchitectureandEngineering,Spring2016byDr.GeorgeMichelogiannakisfromUCBerkeley

2

Introduc8on

•  CPUperformancefactors–  InstrucNoncount

•  DeterminedbyISAandcompiler–  CPIandCycleNme

•  DeterminedbyCPUhardware

•  ThreegroupsofinstrucNons–  Memoryreference:lw,sw–  ArithmeNc/logical:add,sub,and,or,slt–  Controltransfer:jal,jalr,b*

•  CPI–  Single-cycle,CPI=1–  5stageunpipelined,CPI=5–  5stagepipelined,CPI=1

CPU Time = InstructionsProgram

* CyclesInstruction

*TimeCycle

AnIdealPipeline

•  Allobjectsgothroughthesamestages•  Nosharingofresourcesbetweenanytwostages•  PropagaNondelaythroughallpipelinestagesisequal•  Theschedulingofanobjectenteringthepipelineisnot

affectedbytheobjectsinotherstages

4

stage 1

stage 2

stage 3

stage 4

Thesecondi+onsgenerallyholdforindustrialassemblylines,butinstruc+onsdependoneachother!

Review:UnpipelinedDatapathforRISC-V

5

0x4

RegWriteEn

AddAdd

clk

WBSelMemWrite

addr

wdata

rdataDataMemory

we

WASel Op2SelImmSelOpCode

clk

clk

addrinst

Inst.Memory

PC rd1

GPRs

rs1rs2

wawd rd2

we

ImmSelect

ALU

ALUControl

PCSelbrrindjabspc+4

Bcomp?BrLogic

Review:HardwiredControlTable

6

Opcode ImmSel Op2Sel FuncSel MemWr RFWen WBSel WASel PCSel

ALU ALUi LW SW BEQtrue

BEQfalse

JAL

JALR

Op2Sel=Reg/Imm WBSel=ALU/Mem/PC PCSel=pc+4/br/rind/jabs

* * *no yes rindPC rd

jabs* * * no

yes PC rdpc+4BrType12 * * no no * *

brBrType12 * * no no * *pc+4BsType12 Imm + yes no * *

pc+4* Reg Func no yes ALU rdIType12 Imm Op pc+4no yes ALU rd

pc+4IType12 Imm + no yes Mem rd

PipelinedDatapath

7

ClockperiodcanbereducedbydividingtheexecuNonofaninstrucNonintomulNplecycles

tC>max{tIM,tRF,tALU,tDM,tRW}(=tDMprobably)

However,CPIwillincreaseunlessinstruc+onsarepipelined

write-backphase

fetchphase

executephase

decode&Reg-fetchphase

memoryphase

addr

wdata

rdataDataMemory

weALU

ImmSelect

0x4Add

addrrdata

Inst.Memory

rd1

GPRs

rs1rs2

wawdrd2

we

IRPC

TechnologyAssump8ons

8

Thus,thefollowingNmingassumpNonisreasonable

• Asmallamountofveryfastmemory(caches)backedupbyalarge,slowermemory• FastALU(atleastforintegers)• MulNportedRegisterfiles(slower!)

tIM~=tRF~=tALU~=tDM~=tRW

A5-stagepipelinewillbefocusofourdetaileddesign-somecommercialdesignshaveover30pipeline

stagestodoanintegeradd!

5-StagePipelinedExecu8on

9

+me t0 t1 t2 t3 t4 t5 t6 t7 ....instrucNon1 IF1 ID1 EX1 MA1 WB1instrucNon2 IF2 ID2 EX2 MA2 WB2instrucNon3 IF3 ID3 EX3 MA3 WB3instrucNon4 IF4 ID4 EX4 MA4 WB4instrucNon5 IF5 ID5 EX5 MA5 WB5

Write-Back(WB)

I-Fetch(IF)

Execute(EX)

Decode,Reg.Fetch(ID)

Memory(MA)

addr

wdata

rdataDataMemory

weALU

ImmSelect

0x4Add

addrrdata

Inst.Memory

rd1

GPRs

rs1rs2wawdrd2

we

IRPC

5-StagePipelinedExecu8onResourceUsageDiagram

10

+me t0 t1 t2 t3 t4 t5 t6 t7 ....IF I1 I2 I3 I4 I5 ID I1 I2 I3 I4 I5EX I1 I2 I3 I4 I5MA I1 I2 I3 I4 I5WB I1 I2 I3 I4 I5

Resources

Write-Back(WB)

I-Fetch(IF)

Execute(EX)

Decode,Reg.Fetch(ID)

Memory(MA)

addr

wdata

rdataDataMemory

weALU

0x4Add

addrrdata

Inst.Memory

rd1

GPRs

rs1rs2wawdrd2

we

IRPC

ImmSelect

PipelinedExecu8on:ALUInstruc8ons

11

IRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawdrd2

we

wdata

addr

wdata

rdataDataMemory

we

Notquitecorrect!WeneedanInstruc+onReg(IR)foreachstage

PipelinedRISC-VDatapathwithoutjumps

12

IRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawdrd2

we

DataMemorywdata

addr

wdata

rdata

we

ImmSel Op2Sel

WBSelMemWrite

RegWriteEn

F D E M W

ControlPointsNeedtoBeConnected

ALUControl

Instruc8onsinteractwitheachotherinpipeline

•  AninstrucNoninthepipelinemayneedaresourcebeingusedbyanotherinstrucNoninthepipelineàstructuralhazard

•  AninstrucNonmaydependonsomethingproducedbyanearlierinstrucNon–  Dependencemaybeforadatavalue

àdatahazard–  DependencemaybeforthenextinstrucNon’saddress

àcontrolhazard(branches,excep+ons)

13

ResolvingStructuralHazards

•  StructuralhazardoccurswhentwoinstrucNonsneedsamehardwareresourceatsameNme–  CanresolveinhardwarebystallingnewerinstrucNonNllolder

instrucNonfinishedwithresource•  Astructuralhazardcanalwaysbeavoidedbyaddingmorehardwaretodesign–  E.g.,iftwoinstrucNonsbothneedaporttomemoryatsame

Nme,couldavoidhazardbyaddingsecondporttomemory•  Our5-stagepipelinehasnostructuralhazardsbydesign

–  ThankstoRISC-VISA,whichwasdesignedforpipelining

14

DataHazards

15

... x1 ← x0 + 10 x4 ← x1 + 17 ...

x1 is stale. Oops!

x1 ← … x4 ← x1 …

IR IR IR

PC A

B

Y

R

MD1 MD2

addr inst

Inst Memory

0x4 Add

IR

Imm Select

ALU rd1

GPRs

rs1 rs2

wa wd rd2

we

wdata

addr

wdata

rdata Data Memory

we

HowWouldYouResolveThis?

•  ThreeopNons–  Wait(stall)–  Bypass:askthemforwhatyouneedbeforehis/herfinal

deliverable–  Speculateonvaluestoread

16

ResolvingDataHazards(1)

17

Strategy 1: Wait for the result to be available by freezing earlier pipeline stages è interlocks

InterlockstoresolveDataHazards

18

IR IR IR

PC A

B

Y

R

MD1 MD2

addr inst

Inst Memory

0x4 Add

IR

Imm Select

ALU rd1

GPRs

rs1 rs2

wa wd rd2

we

wdata

addr

wdata

rdata Data Memory

we

bubble

... x1 ← x0 + 10 x4 ← x1 + 17 ...

Stall Condition

StalledStagesandPipelineBubbles

19

stalled stages

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .

IF I1 I2 I3 I3 I3 I3 I4 I5 ID I1 I2 I2 I2 I2 I3 I4 I5 EX I1 - - - I2 I3 I4 I5 MA I1 - - - I2 I3 I4 I5 WB I1 - - - I2 I3 I4 I5

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .

(I1) x1 ← (x0) + 10 IF1 ID1 EX1 MA1 WB1 (I2) x4 ← (x1) + 17 IF2 ID2 ID2 ID2 ID2 EX2 MA2 WB2 (I3) IF3 IF3 IF3 IF3 ID3 EX3 MA3 WB3 (I4) IF4 ID4 EX4 MA4 WB4 (I5) IF5 ID5 EX5 MA5 WB5

Resource Usage

- ⇒ pipeline bubble

InterlockControlLogic

20

IR IR IR

PC A

B

Y

R

MD1 MD2

addr inst

Inst Memory

0x4 Add

IR

Imm Select

ALU rd1

GPRs

rs1 rs2

wa wd rd2

we

wdata

addr

wdata

rdata Data Memory

we

bubble

Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.

stall Cstall

ws

rs2 rs1 ?

InterlockControlLogicignoringjumps&branches

21

Shouldwealwaysstallifanrsfieldmatchessomerd?

IRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR ALUrd1

GPRs

rs1rs2

wawdrd2

we

wdata

addr

wdata

rdataDataMemory

we

bubble

stallCstall

wsW

rs1rs2 ?

weW

re1 re2Cre

wsEweM wsM

Cdest CdestweE

noteveryinstrucNonwritesaregister=>wenoteveryinstrucNonreadsaregister=>re

ImmSelect

we:writeenable,1-biton/offws:writeselect,5-bitregisternumberre:readenable,1-biton/offrs:readselect,5-bitregisternumber

InRISC-VSodorImplementa8on

22

Source&Des8na8onRegisters

23

ALUI/LW/JALRALU

SW/Bcond

func7 rs2 rs1 func3 rd opcode immediate12 rs1 func3 rd opcode imm rs2 rs1 func3 imm

Jump Offset[19:0]

opcode

rd opcode source(s) des+na+on

ALU rd<=rs1func10rs2 rs1,rs2 rdALUI rd<=rs1opimm rs1 rdLW rd<=M[rs1+imm] rs1 rdSW M[rs1+imm]<=rs2 rs1,rs2 -Bcondrs1,rs2 rs1,rs2 -

true: PC<=PC+imm false: PC<=PC+4

JAL x1<=PC,PC<=PC+imm - rdJALR rd<=PC,PC<=rs1+imm rs1 rd

DerivingtheStallSignal

24

Cdestws=rd

we=Caseopcode

ALU,ALUi,LW,JALR=>on... =>off

Crere1=Caseopcode

ALU,ALUi, =>on =>off

re2=Caseopcode

=>on ->off

LW,SW,Bcond,JALRJAL

ALU,SW,Bcond...

Cstall stall=((rs1D==wsE)&&weE+ (rs1D==wsM)&&weM+ (rs1D==wsW)&&weW)&&re1D+ ((rs2D==wsE)&&weE+ (rs2D==wsM)&&weM+ (rs2D==wsW)&&weW)&&re2D

HazardsduetoLoads&Stores

25

...M[x1+7]<=x2x4<=M[x3+5]...

IRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawdrd2

we

wdata

addr

wdata

rdataDataMemory

we

bubble

StallCondi+on

Isthereanypossibledatahazardinthisinstruc+onsequence?

Whatifx1+7=x3+5?

Load&StoreHazards

26

However,thehazardisavoidedbecauseourmemorysystemcompleteswritesinonecycle!Load/StorehazardsaresomeNmesresolvedinthepipelineandsomeNmesinthememorysystemitself.Moreonthislaterinthecourse.

...M[x1+7]<=x2x4<=M[x3+5]...

x1+7=x3+5=>datahazard

ResolvingDataHazards(2)

27

Strategy2:Routedataassoonaspossibleaweritiscalculatedtotheearlierpipelinestageàbypass

Bypassing

28

Eachstallorkillintroducesabubbleinthepipeline =>CPI>1

time t0 t1 t2 t3 t4 t5 t6 t7 . . . . (I1) x1 ← x0 + 10 IF1 ID1 EX1 MA1 WB1 (I2) x4 ← x1 + 17 IF2 ID2 ID2 ID2 ID2 EX2 MA2 WB2 (I3) IF3 IF3 IF3 IF3 ID3 EX3 MA3 (I4) stalled stages IF4 ID4 EX4 (I5) IF5 ID5

time t0 t1 t2 t3 t4 t5 t6 t7 . . . . (I1) x1 ← x0 + 10 IF1 ID1 EX1 MA1 WB1 (I2) x4 ← x1 + 17 IF2 ID2 EX2 MA2 WB2 (I3) IF3 ID3 EX3 MA3 WB3 (I4) IF4 ID4 EX4 MA4 WB4 (I5) IF5 ID5 EX5 MA5 WB5

Anewdatapath,i.e.,abypass,cangetthedatafromtheoutputoftheALUtoitsinput

HardwareSupportforForwarding

Detec8ngRAWHazards

•  Pass register numbers along pipeline –  ID/EX.RegisterRs = register number for Rs in ID/EX –  ID/EX.RegisterRt = register number for Rt in ID/EX –  ID/EX.RegisterRd = register number for Rd in ID/EX

•  Current instruction being executed in ID/EX register •  Previous instruction is in the EX/MEM register •  Second previous is in the MEM/WB register •  RAW Data hazards when

1a. EX/MEM.RegisterRd = ID/EX.RegisterRs 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt

FwdfromEX/MEMpipelinereg

FwdfromMEM/WBpipelinereg

Detec8ngtheNeedtoForward•  But only if forwarding instruction will write to a register!

–  EX/MEM.RegWrite, MEM/WB.RegWrite

•  And only if Rd for that instruction is not R0 –  EX/MEM.RegisterRd ≠ 0 –  MEM/WB.RegisterRd ≠ 0

ForwardingCondi8ons

•  Detecting RAW hazard with Previous Instruction –  if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)

and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01 (Forward from EX/MEM pipe stage)

–  if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01 (Forward from EX/MEM pipe stage)

•  Detecting RAW hazard with Second Previous –  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)

and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10 (Forward from MEM/WB pipe stage)

–  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10 (Forward from MEM/WB pipe stage)

AddingaBypass

33

ASrc

...(I1)x1<=x0+10(I2)x4<=x1+17

x4<=x1... x1<=...

IRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawdrd2

we

wdata

addr

wdata

rdataDataMemory

we

bubble

stall

D

E M W

Whendoesthisbypasshelp?x1<=M[x0+10]x4<=x1+17

JAL500x4<=x1+17

yes no no

TheBypassSignalDerivingitfromtheStallSignal

34

ASrc=(rs1D==wsE)&&weE&&re1D

we=CaseopcodeALU,ALUi,LW,,JALJALR=>on...=>off

NobecauseonlyALUandALUiinstrucNonscanbenefitfromthisbypass

Isthiscorrect?

SplitweEintotwocomponents:we-bypass,we-stall

stall=(((rs1D==wsE)&&weE+(rs1D==wsM)&&weM+(rs1D==wsW)&&weW)&&re1D+((rs2D==wsE)&&weE+(rs2D==wsM)&&weM+(rs2D==wsW)&&weW)&&re2D)

ws=rd

BypassandStallSignals

35

we-bypassE=CaseopcodeEALU,ALUi =>on

... =>off

ASrc=(rs1D==wsE)&&we-bypassE&&re1D

SplitweEintotwocomponents:we-bypass,we-stall

stall=((rs1D==wsE)&&we-stallE+ (rs1D==wsM)&&weM+(rs1D==wsW)&&weW)&&re1D

+((rs2D==wsE)&&weE+(rs2D==wsM)&&weM+(rs2D==wsW)&&weW)&&re2D

we-stallE=CaseopcodeELW,JAL,JALR=>on

JAL =>on... =>off

FullyBypassedDatapath

36

ASrcIRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR ALU

ImmSelect

rd1

GPRs

rs1rs2

wawdrd2

we

wdata

addr

wdata

rdataDataMemory

we

bubble

stall

D

E M W

PCforJAL,...

BSrc

Istheres+llaneedforthestallsignal?stall=(rs1D==wsE)&&(opcodeE==LWE)&&(wsE!=0)&&re1D

+(rs2D==wsE)&&(opcodeE==LWE)&&(wsE!=0)&&re2D

ControlHazards

WhatdoweneedtocalculatenextPC?•  ForJumps

–  Opcode,PCandoffset

•  ForJumpRegister–  Opcode,Registervalue,andPC

•  ForCondiNonalBranches–  Opcode,Register(forcondiNon),PCandoffset

•  ForallotherinstrucNons–  OpcodeandPC(andhavetoknowit’snotoneofabove)

37

PCCalcula8onBubbles

38

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .

(I1) x1 ← x0 + 10 IF1 ID1 EX1 MA1 WB1 (I2) x3 ← x2 + 17 IF2 IF2 ID2 EX2 MA2 WB2 (I3) IF3 IF3 ID3 EX3 MA3 WB3 (I4) IF4 IF4 ID4 EX4 MA4 WB4

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .

IF I1 - I2 - I3 - I4 ID I1 - I2 - I3 - I4 EX I1 - I2 - I3 - I4 MA I1 - I2 - I3 - I4 WB I1 - I2 - I3 - I4

Resource Usage

- ⇒ pipeline bubble

SpeculatenextaddressisPC+4

39

A jump instruction kills (not stalls) the following instruction

stall

How?

I2

I1

104

IR IR

PC addr inst

Inst Memory

0x4 Add

bubble

IR

E M Add

Jump?

PCSrc (pc+4 / jabs / rind/ br)

I1 096 ADD I2 100 J 304 I3 104 ADD I4 304 ADD

kill

PipeliningJumps

40

I2

I1

104

stall

IR IR

PC addr inst

Inst Memory

0x4 Add

bubble

IR

E M Add

Jump?

PCSrc (pc+4 / jabs / rind/ br)

IRSrcD = Case opcodeD JAL ⇒ bubble ... ⇒ IM

To kill a fetched instruction -- Insert a mux before IR

Any interaction between stall and jump?

bubble

IRSrcD

I2 I1

304 bubble

I1 096 ADD I2 100 J 304 I3 104 ADD I4 304 ADD

kill

JumpPipelineDiagrams

41

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .

IF I1 I2 I3 I4 I5 ID I1 I2 - I4 I5 EX I1 I2 - I4 I5 MA I1 I2 - I4 I5 WB I1 I2 - I4 I5

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .

(I1) 096: ADD IF1 ID1 EX1 MA1 WB1 (I2) 100: J 304 IF2 ID2 EX2 MA2 WB2 (I3) 104: ADD IF3 - - - - (I4) 304: ADD IF4 ID4 EX4 MA4 WB4

Resource Usage

- ⇒ pipeline bubble

PipeliningCondi8onalBranches

42

I1 096ADDI2 100BEQx1,x2+200I3 104ADDI4 304ADD

BEQ?

I2

I1

104

stall

IR IR

PC addrinst

InstMemory

0x4Add

bubble

IR

E MAdd

PCSrc(pc+4/jabs/rind/br)

bubble

IRSrcD

BranchcondiNonisnotknownunNltheexecutestage

whatac+onshouldbetakeninthedecodestage?

AYALU

Taken?

PipeliningCondi8onalBranches

43

I1 096ADDI2 100BEQx1,x2+200I3 104ADDI4 304ADD

stall

IR IR

PC addrinst

InstMemory

0x4Add

bubble

IR

E MAdd

PCSrc(pc+4/jabs/rind/br)

bubble

IRSrcD

AYALU

Taken?

Ifthebranchistaken-killthetwofollowinginstrucNons-theinstrucNonatthedecodestageisnotvalid⇒stallsignalisnotvalid

I2 I1

108I3

Bcond?

?

PipeliningCondi8onalBranches

44

I1: 096ADDI2: 100BEQx1,x2+200I3: 104ADDI4: 304ADD

stall

IR IR

PC addrinst

InstMemory

0x4Add

bubble

IR

E M

PCSrc(pc+4/jabs/rind/br)

bubble AYALU

Taken?I2 I1

108I3

Bcond?

Jump?

IRSrcD

IRSrcE

Ifthebranchistaken-killthetwofollowinginstrucNons-theinstrucNonatthedecodestageisnotvalid⇒stallsignalisnotvalid

Add

PC

BranchPipelineDiagrams(resolvedinexecutestage)

45

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .

IF I1 I2 I3 I4 I5 ID I1 I2 I3 - I5 EX I1 I2 - - I5 MA I1 I2 - - I5 WB I1 I2 - - I5

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .

(I1) 096: ADD IF1 ID1 EX1 MA1 WB1 (I2) 100: BEQ +200 IF2 ID2 EX2 MA2 WB2 (I3) 104: ADD IF3 ID3 - - - (I4) 108: IF4 - - - - (I5) 304: ADD IF5 ID5 EX5 MA5 WB5

Resource Usage

- ⇒ pipeline bubble

WhatIf…

•  Weusedasimplebranchthatcomparesonlyoneregister(rs1)againstzero

•  Canwedoanybeyer?

46

IR IR IR

PC A

B

Y

R

MD1 MD2

addr inst

Inst Memory

0x4 Add

IR

Imm Select

ALU rd1

GPRs

rs1 rs2

wa wd rd2

we

wdata

addr

wdata

rdata Data Memory

we

Usesimplerbranches(e.g.,onlycompareoneregagainstzero)withcompareindecodestage

47

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .

IF I1 I2 I3 I4 I5 ID I1 I2 - I4 I5 EX I1 I2 - I4 I5 MA I1 I2 - I4 I5 WB I1 I2 - I4 I5

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .

(I1) 096: ADD IF1 ID1 EX1 MA1 WB1 (I2) 100: BEQZ +200 IF2 ID2 EX2 MA2 WB2 (I3) 104: ADD IF3 - - - - (I4) 300: ADD IF4 ID4 EX4 MA4 WB4

Resource Usage

- ⇒ pipeline bubble

BranchDelaySlots(exposecontrolhazardtosoeware)

•  ChangetheISAseman8cssothattheinstrucNonthatfollowsajumporbranchisalwaysexecuted–  givescompilertheflexibilitytoputinausefulinstrucNonwherenormally

apipelinebubblewouldhaveresulted.

48

Delayslotinstruc+onexecutedregardlessofbranchoutcome

I1 096 ADD I2 100 BEQZ r1, +200 I3 104 ADD I4 300 ADD

BranchPipelineDiagrams(branchdelayslot)

49

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .

IF I1 I2 I3 I4 ID I1 I2 I3 I4 EX I1 I2 I3 I4 MA I1 I2 I3 I4 WB I1 I2 I3 I4

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .

(I1) 096: ADD IF1 ID1 EX1 MA1 WB1 (I2) 100: BEQZ +200 IF2 ID2 EX2 MA2 WB2 (I3) 104: ADD IF3 ID3 EX3 MA3 WB3 (I4) 300: ADD IF4 ID4 EX4 MA4 WB4

Resource Usage

Post-1990RISCISAsdon’thavedelayslots

•  EncodesmicroarchitecturaldetailintoISA–  C.f.IBM650drumlayout

•  Whataretheproblemswithdelayslots?

•  Performanceissues–  E.g.,I-cachemissorpagefaultondelayslotinstrucNoncauses

machinetowait,evenifdelayslotisaNOP•  Complicatesmoreadvancedmicroarchitectures

–  30-stagepipelinewithfour-instrucNon-per-cycleissue•  Complicatesthecompiler’sjob•  BeyerbranchpredicNonreducedneedfordelayslots

50

WhyanInstruc8onmaynotbedispatchedeverycycle(CPI>1)

•  Fullbypassingmaybetooexpensivetoimplement–  typicallyallfrequentlyusedpathsareprovided–  someinfrequentlyusedbypasspathsmayincreasecycleNmeandcounteractthebenefitofreducingCPI

•  Loadshavetwo-cyclelatency–  InstrucNonawerloadcannotuseloadresult– MIPS-IISAdefinedloaddelayslots,asowware-visiblepipelinehazard(compilerschedulesindependentinstrucNonorinsertsNOPtoavoidhazard).RemovedinMIPS-II(pipelineinterlocksaddedinhardware)

•  MIPS:“MicroprocessorwithoutInterlockedPipelineStages”•  CondiNonalbranchesmaycausebubbles

–  killfollowinginstrucNon(s)ifnodelayslots

51

Machineswithso^ware-visibledelayslotsmayexecutesignificantnumberofNOPinstruc+onsinsertedbythecompiler.NOPsincreaseinstruc+ons/program!

RISC-VBranchesandJumps

•  JAL:uncondi8onaljumptoPC+immediate

•  JALR:indirectjumptors1+immediate

•  Branch:if(rs1condsrs2),branchtoPC+immediate

52

RISC-VBranchesandJumps

53

Instruc<on Takenknown? Targetknown?

JAL

JALRB<cond.>

EachinstrucNonfetchdependsononeortwopiecesofinformaNonfromtheprecedinginstrucNon:

1)IstheprecedinginstrucNonatakenbranch?2)Ifso,whatisthetargetaddress?

•  JAL:uncondi8onaljumptoPC+immediate•  JALR:indirectjumptors1+immediate•  Branch:if(rs1condsrs2),branchtoPC+immediate

AeerInst.Decode

AeerInst.Decode AeerInst.Decode

AeerInst.Decode AeerReg.Fetch

AeerExecute

BranchPenal8esinModernPipelines

54

A PCGeneraNon/MuxP InstrucNonFetchStage1F InstrucNonFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstrucNonstoFuncNonalunitsR RegisterFileReadE IntegerExecute

Remainderofexecutepipeline(+another6stages)

UltraSPARC-IIIinstrucNonfetchpipelinestages(in-orderissue,4-waysuperscalar,750MHz,2000)

BranchTargetAddressKnown

BranchDirec+on&JumpRegisterTargetKnown

ReducingControlFlowPenalty

•  SowwaresoluNons–  Eliminatebranches-loopunrolling

•  Increasestherunlength–  ReduceresoluNonNme-instrucNonscheduling

•  ComputethebranchcondiNonasearlyaspossible(oflimitedvaluebecausebranchesowenincriNcalpaththroughcode)

•  HardwaresoluNons–  Findsomethingelsetodo-delayslots

•  Replacespipelinebubbleswithusefulwork(requiressowwarecooperaNon)

–  Speculate-branchpredicNon•  SpeculaNveexecuNonofinstrucNonsbeyondthebranch

55

BranchPredic8on

•  Mo+va+on:–  BranchpenalNeslimitperformanceofdeeplypipelined

processors–  Modernbranchpredictorshavehighaccuracy–  (>95%)andcanreducebranchpenalNessignificantly

•  Requiredhardwaresupport:–  Predic+onstructures:

•  Branchhistorytables,branchtargetbuffers,etc.

–  Mispredictrecoverymechanisms:•  Keepresultcomputa+onseparatefromcommit •  KillinstrucNonsfollowingbranchinpipeline•  Restorestatetothatfollowingbranch

56

Sta8cBranchPredic8on

57

Overallprobabilityabranchistakenis~60-70%but:

ISAcanayachpreferreddirecNonsemanNcstobranches,e.g.,MotorolaMC88110

bne0(preferredtaken) beq0(nottaken)

backward90%

forward50%

WhatC++statementdoesthislooklike

WhatC++statementdoesthislooklike

DynamicBranchPredic8onlearningbasedonpastbehavior

•  TemporalcorrelaNon(Nme)–  IfItellyouthatacertainbranchwastakenlastNme,doesthishelp?

–  ThewayabranchresolvesmaybeagoodpredictorofthewayitwillresolveatthenextexecuNon

•  SpaNalcorrelaNon(space)–  Severalbranchesmayresolveinahighlycorrelatedmanner–  Forinstance,apreferredpathofexecuNon

58

DynamicBranchPredic8on

•  1-bitpredicNonscheme–  Low-porNonaddressasaddressforaone-bitflagforTakenor

NotTakenhistorically–  Simple

•  2-bitpredicNon–  Misstwicetochange

BranchPredic8onBits

•  Assume2BPbitsperinstrucNon•  ChangethepredicNonawertwoconsecuNvemistakes!

60

¬takewrong

taken¬taken

taken

taken

taken¬takeright

takeright

takewrong

¬taken

¬taken¬taken

BPstate: (predicttake/¬take)x(lastpredic+onright/wrong)

BranchHistoryTable

61

4K-entryBHT,2bits/entry,~80-90%correctpredicNons

00FetchPC

Branch? TargetPC

+

I-Cache

Opcode offsetInstruc+on

kBHTIndex

2k-entryBHT,2bits/entry

Taken/¬Taken?

Exploi8ngSpa8alCorrela8onYehandPaC,1992

62

IffirstcondiNonfalse,secondcondiNonalsofalseHistoryregister,H,recordsthedirecNonofthelastNbranchesexecutedbytheprocessor

if (x[i] < 7) then!!y += 1;!

if (x[i] < 5) then!!c -= 4;!

Two-LevelBranchPredictor

63

Pen+umProusestheresultfromthelasttwobranchestoselectoneofthefoursetsofBHTbits(~95%correct)

0 0

kFetchPC

ShiwinTaken/¬Takenresultsofeachbranch

2-bitglobalbranchhistoryshiwregister

Taken/¬Taken?

Specula8ngBothDirec8ons•  AnalternaNvetobranchpredicNonistoexecutebothdirecNonsofabranchspeculaNvely

–  resourcerequirementisproporNonaltothenumberofconcurrentspeculaNveexecuNons

–  onlyhalftheresourcesengageinusefulworkwhenbothdirecNonsofabranchareexecutedspeculaNvely

–  branchpredicNontakeslessresourcesthanspeculaNveexecuNonofbothpaths

•  WithaccuratebranchpredicNon,itismorecosteffecNvetodedicateallresourcestothepredicteddirecNon!–  Whatwouldyouchoosewith80%accuracy?

64

AreWeMissingSomething?

•  Knowingwhetherabranchistakenornotisgreat,butwhatelsedoweneedtoknowaboutit?

Branchtargetaddress

65

Limita8onsofBHTs

66

OnlypredictsbranchdirecNon.Therefore,cannotredirectfetchstreamunNlawerbranchtargetisdetermined.

UltraSPARC-IIIfetchpipeline

Correctlypredictedtakenbranch

penalty

JumpRegisterpenalty

A PCGeneraNon/MuxP InstrucNonFetchStage1F InstrucNonFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstrucNonstoFuncNonalunitsR RegisterFileReadE IntegerExecute

Remainderofexecutepipeline(+another6stages)

BranchTargetBuffer

67

BPbitsarestoredwiththepredictedtargetaddress.IFstage:If(BP=taken)thennPC=targetelsenPC=PC+4Later:checkpredic+on,ifwrongthenkilltheinstruc+onandupdateBTB&BPbelseupdateBPb

IMEM

PC

BranchTargetBuffer(2kentries)

k

BPbpredicted

target BP

target

AddressCollisions(Mis-Predic8on)

68

WhatwillbefetchedawertheinstrucNonat1028?BTBpredicNon = Correcttarget = =>

Assumea128-entryBTB

BPbtargettake236

1028Add.....

132Jump+104

InstrucNonMemory

2361032

killPC=236andfetchPC=1032

Isthisacommonoccurrence?

BTBisonlyforControlInstruc8ons

•  IsevenbranchpredicNonfastenoughtoavoidbubbles?•  WhendoweindextheBTB?

–  i.e.,whatstateisthebranchin,inordertoavoidbubbles?

•  BTBcontainsusefulinforma8onforbranchandjumpinstruc8onsonly=>Donotupdateitforotherinstruc8ons

•  ForallotherinstrucNonsthenextPCisPC+4!

•  Howtoachievethiseffectwithoutdecodingtheinstruc+on?

69

BranchTargetBuffer(BTB)

70

• KeepboththebranchPCandtargetPCintheBTB• PC+4isfetchedifmatchfails• OnlytakenbranchesandjumpsheldinBTB• NextPCdeterminedbeforebranchfetchedanddecoded

2k-entry direct-mapped BTB (can also be associative) I-Cache PC

k

Valid

valid

EntryPC

=

match

predicted

target

targetPC

AreWeMissingSomething?(2)

•  WhendoweupdatetheBTBorBHT?

71

IR IR IR

PC A

B

Y

R

MD1 MD2

addr inst

Inst Memory

0x4 Add

IR

Imm Select

ALU rd1

GPRs

rs1 rs2

wa wd rd2

we

wdata

addr

wdata

rdata Data Memory

we

CombiningBTBandBHT

•  BTBentriesareconsiderablymoreexpensivethanBHT,butcanredirectfetchesatearlierstageinpipelineandcanaccelerateindirectbranches(JR)

•  BHTcanholdmanymoreentriesandismoreaccurate

72

A PCGeneraNon/MuxP InstrucNonFetchStage1F InstrucNonFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstrucNonstoFuncNonalunitsR RegisterFileReadE IntegerExecute

BTB

BHTBHTinlaterpipelinestagecorrectswhenBTBmissesapredictedtakenbranch

BTB/BHTonlyupdateda^erbranchresolvesinEstage

UsesofJumpRegister(JR)

•  Switchstatements(jumptoaddressofmatchingcase)

•  DynamicfuncNoncall(jumptorun-NmefuncNonaddress)

•  SubrouNnereturns(jumptoreturnaddress)

73

HowwelldoesBTBworkforeachofthesecases?

BTBworkswellifsamecaseusedrepeatedly

BTBworkswellifsamefuncNonusuallycalled,(e.g.,inC++programming,whenobjectshavesametypeinvirtualfuncNoncall)

BTBworkswellifusuallyreturntothesameplace⇒O^enonefunc+oncalledfrommanydis+nctcallsites!

Subrou8neReturnStack

SmallstructuretoaccelerateJRforsubrouNnereturns,typicallymuchmoreaccuratethanBTBs.

74

&fb() &fc()

Pushcalladdresswhenfunc+oncallexecuted

Popreturnaddresswhensubrou+nereturndecoded

fa() { fb(); } fb() { fc(); } fc() { fd(); }

&fd() kentries(typicallyk=8-16)

Recommended