58
CSC 631: High-Performance Computer Architecture Spring 2017 Lecture 6: Out-of-Order Processors Supercomputers Definitions of a supercomputer: § Fastest machine in world at given task § A device to turn a compute-bound problem into an I/O bound problem § Any machine costing $30M+ § Any machine designed by Seymour Cray § CDC6600 (Cray, 1964) regarded as first supercomputer 2

CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

CSC631:High-PerformanceComputerArchitecture

Spring2017Lecture6:Out-of-OrderProcessors

Supercomputers

Definitionsofasupercomputer:§ Fastestmachineinworldatgiventask§ Adevicetoturnacompute-boundproblemintoanI/Oboundproblem

§ Anymachinecosting$30M+§ AnymachinedesignedbySeymourCray

§ CDC6600(Cray,1964)regardedasfirstsupercomputer

2

Page 2: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

CDC6600SeymourCray,1963

§ Afastpipelinedmachinewith60-bitwords- 128Kwordmainmemorycapacity,32banks

§ Tenfunctionalunits(parallel,unpipelined)- FloatingPoint:adder,2multipliers,divider- Integer:adder,2incrementers,...

§ Hardwiredcontrol(nomicrocoding)§ Scoreboard fordynamicschedulingofinstructions§ TenPeripheralProcessorsforInput/Output

- afastmulti-threaded12-bitintegerALU§ Veryfastclock,10MHz(FPaddin4clocks)§ >400,000transistors,750sq.ft.,5tons,150kW,novelfreon-basedtechnologyforcooling

§ Fastestmachineinworldfor5years(until7600)- over100sold($7-10Meach)

33/10/2009

CDC6600:ALoad/StoreArchitecture

4

• Separateinstructionstomanipulatethreetypesofreg.• 8x60-bitdataregisters(X)• 8x18-bitaddressregisters(A)• 8x18-bitindexregisters(B)

• Allarithmeticandlogicinstructionsareregister-to-register

•OnlyLoadandStoreinstructionsrefertomemory!

Touchingaddressregisters1to5initiatesaload6to7initiatesastore

- veryusefulforvectoroperations

opcode i j k Ri ¬ Rj op Rk

opcode i j disp Ri ¬ M[Rj + disp]

6 3 3 3

6 3 3 18

Page 3: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

CDC6600:Datapath

5

AddressRegsIndexRegs8x18-bit8x18-bit

OperandRegs8x60-bit

Inst.Stack8x60-bit

IR

10FunctionalUnits

CentralMemory128Kwords,32banks,1µscycle

resultaddr

result

operand

operandaddr

CDC6600ISAdesignedtosimplifyhigh-performanceimplementation

§ Useofthree-address,register-registerALUinstructionssimplifiespipelinedimplementation- Only3-bitregisterspecifier fieldscheckedfordependencies- Noimplicitdependenciesbetweeninputsandoutputs

§ Decouplingsettingofaddressregister(Ar)fromretrievingvaluefromdataregister(Xr)simplifiesprovidingmultipleoutstandingmemoryaccesses- Softwarecanscheduleloadofaddressregisterbeforeuseofvalue- Caninterleaveindependentinstructionsinbetween

§ CDC6600hasmultipleparallelbutunpipelined functionalunits- E.g.,2separatemultipliers

§ Follow-onmachineCDC7600usedpipelinedfunctionalunits- ForeshadowslaterRISCdesigns

6

Page 4: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

CDC6600:VectorAddition

7

B0←- nloop: JZEB0,exit

A0←B0+a0 loadX0A1←B0+b0 loadX1X6←X0+X1A6←B0+c0 storeX6B0←B0+1jumploop

Ai=addressregisterBi=indexregisterXi=dataregister

CDC6600Scoreboard

§ Instructionsdispatchedin-ordertofunctionalunitsprovidednostructuralhazardorWAW- Stallonstructuralhazard,nofunctionalunitsavailable-Onlyonependingwritetoanyregister

§ Instructionswaitforinputoperands(RAWhazards)beforeexecution- Canexecuteout-of-order

§ Instructionswaitforoutputregistertobereadbyprecedinginstructions(WAR)- Resultheldinfunctionalunituntilregisterfree

8

Page 5: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

9[©IBM]

IBM360/91Floating-PointUnitR.M.Tomasulo,1967

10

Mult

1

123456

loadbuffers(frommemory)

1234

Adder

123

Floating-PointRegfile

storebuffers(tomemory)

...

instructions

Commonbusensuresthatdataismadeavailableimmediatelytoalltheinstructionswaitingforit.Matchtag,ifequal,copyvalue&setpresence“p”.

Distributereservationstationstofunctionalunits

<tag,result>

p tag/datap tag/datap tag/data

p tag/datap tag/datap tag/data

p tag/datap tag/datap tag/data

p tag/datap tag/data

p tag/datap tag/data2

p tag/datap tag/datap tag/data

p tag/datap tag/datap tag/data

p tag/datap tag/datap tag/datap tag/data

Page 6: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

Out-of-OrderFadesintoBackgroundOut-of-orderprocessingimplementedcommerciallyin1960s,butdisappearedagainuntil1990sastwomajorproblemshadtobesolved:§ Precisetraps- ImprecisetrapscomplicatedebuggingandOScode-Note,preciseinterruptsarerelativelyeasytoprovide

§ Branchprediction- Amountofexploitableinstruction-levelparallelism(ILP)limitedbycontrolhazards

Also,simplermachinedesignsinnewtechnologybeatcomplicatedmachinesinoldtechnology- Bigadvantagetofitprocessor &cachesononechip-Microprocessorshaderaof1%/weekperformancescaling

11

SeparatingCompletionfromCommit

§ Re-orderbufferholdsregisterresultsfromcompletionuntilcommit- Entriesallocatedinprogramorderduringdecode- Bufferscompletedvaluesandexceptionstateuntilin-ordercommitpoint

- Completedvaluescanbeusedbydependentsbeforecommitted(bypassing)

- Eachentryholdsprogramcounter,instructiontype,destinationregisterspecifier andvalueifany,andexceptionstatus(infooftencompressedtosavehardware)

§ Memoryreorderingneedsspecialdatastructures- Speculativestoreaddressanddatabuffers- Speculativeloadaddressanddatabuffers

12

Page 7: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

In-OrderCommitforPreciseTraps

§ In-orderinstructionfetchanddecode,anddispatchtoreservationstationsinsidereorderbuffer

§ Instructionsissuefromreservationstationsout-of-order§ Out-of-ordercompletion,valuesstoredintemporarybuffers§ Commitisin-order,checksfortraps,andifnoneupdatesarchitecturalstate

13

Fetch Decode

Execute

CommitReorderBuffer

In-order In-orderOut-of-order

Trap?Kill

Kill Kill

InjecthandlerPC

PhasesofInstructionExecution

14

Fetch: Instructionbitsretrievedfrominstructioncache.I-cache

FetchBuffer

IssueBuffer

FunctionalUnits

ArchitecturalState

Execute:Instructionsandoperands issuedtofunctionalunits.Whenexecutioncompletes,allresultsandexceptionflagsareavailable.

Decode: Instructions dispatchedtoappropriateissue buffer

ResultBufferCommit:Instructionirrevocablyupdatesarchitecturalstate(aka“graduation”),ortakesprecisetrap/interrupt.

PC

Commit

Decode/Rename

Page 8: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

In-OrderversusOut-of-OrderPhases

§ Instructionfetch/decode/renamealwaysin-order-NeedtoparseISAsequentiallytogetcorrectsemantics- ProposalsforspeculativeOoO instructionfetch,e.g.,Multiscalar.Predictcontrolflowanddatadependenciesacrosssequential programsegmentsfetched/decoded/executedinparallel,fixup ifpredictionwrong

§ Dispatch(placeinstructionintomachinebufferstowaitforissue)alsoalwaysin-order- Someuse“Dispatch”tomeanissue,butnotintheselectures

15

In-OrderVersusOut-of-OrderIssue

§ In-orderissue:- IssuestallsonRAWdependenciesorstructuralhazards,orpossiblyWAR/WAWhazards

- Instructioncannotissuetoexecutionunitsunlessallprecedinginstructionshaveissuedtoexecutionunits

§ Out-of-orderissue:- Instructionsdispatchedinprogramordertoreservationstations(orotherformsofinstructionbuffer)towaitforoperandstoarrive,orotherhazardstoclear

-Whileearlierinstructionswaitinissuebuffers,followinginstructionscanbedispatchedandissuedout-of-order

16

Page 9: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

In-OrderversusOut-of-OrderCompletion

§ Allbutthesimplestmachineshaveout-of-ordercompletion,duetodifferentlatenciesoffunctionalunitsanddesiretobypassvaluesassoonasavailable

§ ClassicRISC5-stageintegerpipelinejustbarelyhasin-ordercompletion- Loadtakestwocycles,butfollowingone-cycleintegeropcompletesatsametime, notearlier

- AddingpipelinedFPUimmediatelybringsOoO completion

17

In-OrderversusOut-of-OrderCommit

§ In-ordercommitsupportsprecisetraps,standardtoday- Someproposalstoreducethecostofin-ordercommitbyretiringsomeinstructionsearlytocompactreorderbuffer,butthisisjustanoptimizedin-ordercommit

§ Out-of-ordercommitwaseffectivelywhatearlyOoOmachinesimplemented(imprecisetraps)ascompletionirrevocablychangedmachinestate- i.e.,complete==commitinthesemachines

18

Page 10: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

OoO DesignChoices

§ Wherearereservationstations?- Partofreorderbuffer,orinseparateissuewindow?- Distributedbyfunctionalunits,orcentralized?

§ Howisregisterrenamingperformed?- Tagsanddataheldinreservationstations,withseparatearchitecturalregisterfile

- Tagsonlyinreservationstations,dataheldinunifiedphysicalregisterfile

19

“Data-in-ROB”Design(HPPA8000,PentiumPro,Core2Duo,Nehalem)

§ Managedascircularbufferinprogramorder,newinstructionsdispatchedtofreeslots,oldestinstructioncommitted/reclaimedwhendone(“p”bitsetonresult)

§ TagisgivenbyindexinROB(Freepointervalue)§ Indispatch,non-busysourceoperandsreadfromarchitecturalregisterfileandcopiedtoSrc1andSrc2withpresencebit“p”set.Busyoperandscopytagofproducerandclear“p”bit.

§ Setvalidbit“v”ondispatch,setissuedbit“i”onissue§ Oncompletion,searchsourcetags,set“p”bitandcopydataintosrc ontagmatch.WriteresultandexceptionflagstoROB.

§ Oncommit,checkexceptionstatus,andcopyresultintoarchitecturalregisterfileifnotrap.

§ Ontrap,flushmachineandROB,setfree=oldest,jumptohandler

Tagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv Opcode

Oldest

Free

Page 11: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

ManagingRenameforData-in-ROB

§ If“p”bitset,thenusevalueinarchitecturalregisterfile§ Else,tagfieldindicatesinstructionthatwill/hasproducedvalue§ Fordispatch,readsourceoperands<p,tag,value>fromarch.regfile,andalsoread<p,result>fromproducinginstructioninROB,bypassingasneeded.CopytoROB

§ Writedestinationarch.registerentrywith<0,Free,_>,toassigntagtoROBindexofthisinstruction

§ Oncommit,updatearch.regfile with<1,_,Result>§ Ontrap,resettable(Allp=1)

21

Tagp ValueTagp ValueTagp Value

Tagp Value

Oneentryperarch.register

Renametableassociatedwitharchitecturalregisters,managedindecode/dispatch

ROB

DataMovementinData-in-ROBDesign

22

ArchitecturalRegisterFile

Readoperandsduringdecode

SourceOperands

Writesourcesindispatch

Readoperandsatissue

FunctionalUnits

Writeresultsatcompletion

Readresultsforcommit

Bypassnewervaluesatdispatch

ResultData

Writeresultsatcommit

Page 12: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

UnifiedPhysicalRegisterFile(MIPSR10K,Alpha21264,IntelPentium4&Sandy/IvyBridge)

§ Renameallarchitecturalregistersintoasinglephysicalregisterfileduringdecode,noregistervaluesread

§ Functionalunitsreadandwritefromsingleunifiedregisterfileholdingcommittedandtemporaryregistersinexecute

§ Commitonlyupdatesmappingofarchitecturalregistertophysicalregister,nodatamovement

23

UnifiedPhysicalRegisterFile

Readoperandsatissue

FunctionalUnits

Writeresultsatcompletion

CommittedRegisterMapping

DecodeStageRegisterMapping

LifetimeofPhysicalRegisters

24

ld x1, (x3)addi x3, x1, #4sub x6, x7, x9add x3, x3, x6ld x6, (x1)add x6, x6, x3sd x6, (x1)ld x6, (x11)

ld P1, (Px)addi P2, P1, #4sub P3, Py, Pzadd P4, P2, P3ld P5, (P1)add P6, P5, P4sd P6, (P1)ld P7, (Pw)

Rename

Whencanwereuseaphysicalregister?Whennextwriterofsamearchitecturalregistercommits

• Physicalregfile holdscommittedandspeculativevalues• PhysicalregistersdecoupledfromROBentries(nodatainROB)

Page 13: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

PhysicalRegisterManagement

25

op p1 PR1 p2 PR2exuse Rd PRdLPRd

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

x5P5x6P6x7

x0P8x1

x2P7x3

x4

ROB

Rename Table

Physical Regs Free List

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

ppp

P0P1P3P2P4

(LPRd requires third read port on Rename Table for each instruction)

<x1>P8 p

PhysicalRegisterManagement

26

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8

Page 14: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

PhysicalRegisterManagement

27

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<R1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1

PhysicalRegisterManagement

28

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<R1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3

x sub p P6 p P5 x6 P3

Page 15: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

PhysicalRegisterManagement

29

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3

x sub p P6 p P5 x6 P3P1

P2

x add P1 P3 x3 P2

PhysicalRegisterManagement

30

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3

x sub p P6 p P5 x6 P3P1

P2

x add P1 P3 x3 P2x ld P0 x6 P4P3

P4

Page 16: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

PhysicalRegisterManagement

31

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

x ld p P7 x1 P0x addi P0 x3 P1x sub p P6 p P5 x6 P3

x ld p P7 x1 P0

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

P5

P3

P1

P2

x add P1 P3 x3 P2x ld P0 x6 P4P3

P4

Execute & Commitp

p

p<x1>

P8

x

PhysicalRegisterManagement

32

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

x sub p P6 p P5 x6 P3x addi P0 x3 P1x addi P0 x3 P1

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

P8

x x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

P5

P3

P1

P2

x add P1 P3 x3 P2x ld P0 x6 P4P3

P4

Execute & Commitp

p

p<x1>

P8

x

p

p<x3>

P7

Page 17: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

MIPSR10KTrapHandling

§ Renametableisrepairedbyunrenaming instructionsinreverseorderusingthePRd/LPRd fields

§ TheAlpha21264hadsimilarphysicalregisterfilescheme,butkeptcompleterenametablesnapshotsforeachinstructioninROB(80snapshotstotal)- Flashcopyallbitsfromsnapshottoactivetableinonecycle

33

PartII:AdvancedOut-of-OrderSuperscalarDesigns

34

Page 18: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

ROB

DataMovementinData-in-ROBDesign

35

ArchitecturalRegisterFile

Readoperandsduringdecode

SourceOperands

Writesourcesindispatch

Readoperandsatissue

FunctionalUnits

Writeresultsatcompletion

Readresultsforcommit

Bypassnewervaluesatdispatch

ResultData

Writeresultsatcommit

UnifiedPhysicalRegisterFile(MIPSR10K,Alpha21264,IntelPentium4&Sandy/IvyBridge)

§ Renameallarchitecturalregistersintoasinglephysicalregisterfileduringdecode,noregistervaluesread

§ Functionalunitsreadandwritefromsingleunifiedregisterfileholdingcommittedandtemporaryregistersinexecute

§ Commitonlyupdatesmappingofarchitecturalregistertophysicalregister,nodatamovement

36

UnifiedPhysicalRegisterFile

Readoperandsatissue

FunctionalUnits

Writeresultsatcompletion

CommittedRegisterMapping

DecodeStageRegisterMapping

Page 19: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

LifetimeofPhysicalRegisters

37

ld x1, (x3)addi x3, x1, #4sub x6, x7, x9add x3, x3, x6ld x6, (x1)add x6, x6, x3sd x6, (x1)ld x6, (x11)

ld P1, (Px)addi P2, P1, #4sub P3, Py, Pzadd P4, P2, P3ld P5, (P1)add P6, P5, P4sd P6, (P1)ld P7, (Pw)

Rename

Whencanwereuseaphysicalregister?Whennextwriterofsamearchitecturalregistercommits

• Physicalregfile holdscommittedandspeculativevalues• PhysicalregistersdecoupledfromROBentries(nodatainROB)

PhysicalRegisterManagement

38

op p1 PR1 p2 PR2exuse Rd PRdLPRd

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

x5P5x6P6x7

x0P8x1

x2P7x3

x4

ROB

Rename Table

Physical Regs Free List

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

ppp

P0P1P3P2P4

(LPRd requires third read port on Rename Table for each instruction)

<x1>P8 p

Page 20: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

PhysicalRegisterManagement

39

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8

PhysicalRegisterManagement

40

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<R1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1

Page 21: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

PhysicalRegisterManagement

41

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<R1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3

x sub p P6 p P5 x6 P3

PhysicalRegisterManagement

42

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3

x sub p P6 p P5 x6 P3P1

P2

x add P1 P3 x3 P2

Page 22: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

PhysicalRegisterManagement

43

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3

x sub p P6 p P5 x6 P3P1

P2

x add P1 P3 x3 P2x ld P0 x6 P4P3

P4

PhysicalRegisterManagement

44

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

x ld p P7 x1 P0x addi P0 x3 P1x sub p P6 p P5 x6 P3

x ld p P7 x1 P0

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

P5

P3

P1

P2

x add P1 P3 x3 P2x ld P0 x6 P4P3

P4

Execute & Commitp

p

p<x1>

P8

x

Page 23: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

PhysicalRegisterManagement

45

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB

x sub p P6 p P5 x6 P3x addi P0 x3 P1x addi P0 x3 P1

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

P8

x x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

P5

P3

P1

P2

x add P1 P3 x3 P2x ld P0 x6 P4P3

P4

Execute & Commitp

p

p<x1>

P8

x

p

p<x3>

P7

MIPSR10KTrapHandling

§ Renametableisrepairedbyunrenaming instructionsinreverseorderusingthePRd/LPRd fields

§ TheAlpha21264hadsimilarphysicalregisterfilescheme,butkeptcompleterenametablesnapshotsforeachinstructioninROB(80snapshotstotal)- Flashcopyallbitsfromsnapshottoactivetableinonecycle

46

Page 24: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

ReorderBufferHoldsActiveInstructions(DecodedbutnotCommitted)

47

(Olderinstructions)

(Newerinstructions)

Cyclet

…ld x1, (x3)add x3, x1, x2sub x6, x7, x9add x3, x3, x6ld x6, (x1)add x6, x6, x3sd x6, (x1)ld x6, (x1)…

Commit

Fetch

Cyclet+1

Execute

…ld x1, (x3)add x3, x1, x2sub x6, x7, x9add x3, x3, x6ld x6, (x1)add x6, x6, x3sd x6, (x1)ld x6, (x1)…

ROBcontents

SeparateIssueWindowfromROB

48

Reorderbufferusedtoholdexceptioninformationforcommit.

Theissuewindowholdsonlyinstructionsthathavebeendecodedandrenamedbutnotissuedintoexecution.Hasregistertagsandpresencebits,andpointertoROBentry.

op p1 PR1 p2 PR2 PRduse ex ROB#

ROBisusuallyseveraltimeslargerthanissuewindow– why?

Rd LPRd PC Except?Oldest

Free

Done?

Page 25: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

SuperscalarRegisterRenaming§ Duringdecode,instructionsallocatednewphysicaldestinationregister§ Sourceoperandsrenamedtophysicalregisterwithnewestvalue§ Executionunitonlyseesphysicalregisternumbers

49

RenameTable

Op Src1 Src2Dest Op Src1 Src2Dest

RegisterFreeList

Op PSrc1 PSrc2PDestOp PSrc1 PSrc2PDest

UpdateMapping

Doesthiswork?

Inst1 Inst2

ReadAddresses

ReadDataWrite

Ports

SuperscalarRegisterRenaming

50

RenameTable

Op Src1 Src2Dest Op Src1 Src2Dest

RegisterFreeList

Op PSrc1 PSrc2PDestOp PSrc1 PSrc2PDest

UpdateMapping

Inst1 Inst2

ReadAddresses

ReadDataWrite

Ports =?=?

MustcheckforRAWhazardsbetweeninstructionsissuinginsamecycle.Canbedoneinparallelwithrenamelookup.

MIPSR10Krenames4serially-RAW-dependentinsts/cycle

Page 26: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

ControlFlowPenalty

51

I-cache

Fetch Buffer

IssueBuffer

Func.Units

Arch.State

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started

Modernprocessorsmayhave>10pipelinestagesbetweennextPCcalculationandbranchresolution!

Howmuchworkislostifpipelinedoesn’tfollowcorrectinstructionflow?

~Looplengthxpipelinewidth+buffers

ReducingControlFlowPenalty

§ Softwaresolutions- Eliminatebranches- loopunrolling- Increasestherunlength

- Reduceresolutiontime- instructionscheduling- Computethebranchconditionasearlyaspossible(oflimitedvaluebecausebranchesoftenincriticalpaththroughcode)

§ Hardwaresolutions- Findsomethingelsetodo- delayslots- Replacespipelinebubbleswithusefulwork(requiressoftwarecooperation)– quicklyseediminishingreturns

- Speculate- branchprediction- Speculativeexecutionofinstructionsbeyondthebranch- Manyadvancesinaccuracy

52

Page 27: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

BranchPrediction

53

Motivation:BranchpenaltieslimitperformanceofdeeplypipelinedprocessorsModernbranchpredictorshavehighaccuracy(>95%)andcanreducebranchpenaltiessignificantly

Requiredhardwaresupport:Predictionstructures:

• Branchhistorytables,branchtargetbuffers,etc.

Mispredict recoverymechanisms:• Keepresultcomputationseparatefromcommit• Killinstructionsfollowingbranchinpipeline• Restorestatetothatfollowingbranch

ImportanceofBranchPrediction

§ Consider4-waysuperscalarwith8pipelinestagesfromfetchtodispatch, and80-entryROB,and3cyclesfromissuetobranchresolution

§ Onamispredict,couldthrowaway8*4+(80-1)=111instructions

§ Improvingfrom90%to95%predictionaccuracy,removes50%ofbranchmispredicts- If1/6instructionsarebranches,thenmovefrom60instructionsbetweenmispredicts,to120instructionsbetweenmispredicts

54

Page 28: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

StaticBranchPrediction

55

Overallprobabilityabranchistakenis~60-70%but:

ISAcanattachpreferreddirectionsemanticstobranches,e.g.,MotorolaMC88110

bne0 (preferredtaken) beq0 (nottaken)

ISAcanallowarbitrarychoiceofstaticallypredicteddirection,e.g.,HPPA-RISC,IntelIA-64

typicallyreportedas~80%accurate

backward90%

forward50%

DynamicBranchPredictionlearningbasedonpastbehavior

§ Temporalcorrelation- Thewayabranchresolvesmaybeagoodpredictorofthewayitwillresolveatthenextexecution

§ Spatialcorrelation- Severalbranchesmayresolveinahighlycorrelatedmanner(apreferredpathofexecution)

56

Page 29: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

One-BitBranchHistoryPredictor

§ Foreachbranch,rememberlastwaybranchwent§ Hasproblemwithloop-closingbackwardbranches,astwomispredicts occuroneveryloopexecution1. firstiterationpredictsloopbackwardsbranchnot-taken

(loopwasexitedlasttime)2. lastiterationpredictsloopbackwardsbranchtaken(loop

continuedlasttime)

57

BranchPredictionBits

58

• Assume2BPbitsperinstruction• Changethepredictionaftertwoconsecutivemistakes!

¬takewrongtaken

¬taken

taken

taken

taken¬takeright

takeright

takewrong

¬taken

¬taken¬taken

BPstate:(predict take/¬take)x(lastprediction right/wrong)

Page 30: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

BranchHistoryTable(BHT)

59

4K-entryBHT,2bits/entry,~80-90%correctpredictions

0 0FetchPC

Branch? TargetPC

+

I-Cache

Opcode offsetInstruction

kBHTIndex

2k-entryBHT,2bits/entry

Taken/¬Taken?

ExploitingSpatialCorrelationYehandPatt,1992

60

Historyregister,H,recordsthedirectionofthelastNbranchesexecutedbytheprocessor

if (x[i] < 7) theny += 1;

if (x[i] < 5) thenc -= 4;

Iffirstconditionfalse,secondconditionalsofalse

Page 31: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

Two-LevelBranchPredictor

61

PentiumProusestheresultfromthelasttwobranchestoselectoneofthefoursetsofBHTbits(~95%correct)

0 0

kFetchPC

ShiftinTaken/¬Takenresultsofeachbranch

2-bitglobalbranchhistoryshiftregister

Taken/¬Taken?

SpeculatingBothDirections

§ Analternativetobranchpredictionistoexecutebothdirectionsofabranchspeculatively- resourcerequirementisproportionaltothenumberofconcurrentspeculativeexecutions

- onlyhalftheresourcesengageinusefulworkwhenbothdirectionsofabranchareexecutedspeculatively

- branchpredictiontakeslessresourcesthanspeculativeexecutionofbothpaths

§ Withaccuratebranchprediction,itismorecosteffectivetodedicateallresourcestothepredicteddirection!

62

Page 32: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

LimitationsofBHTs

63

Onlypredictsbranchdirection.Therefore,cannotredirectfetchstreamuntilafterbranchtargetisdetermined.

UltraSPARC-IIIfetchpipeline

Correctlypredictedtakenbranch

penalty

JumpRegisterpenalty

A PCGeneration/MuxP InstructionFetchStage1F InstructionFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstructionstoFunctionalunitsR RegisterFileReadE IntegerExecute

Remainderofexecutepipeline(+another6stages)

BranchTargetBuffer(BTB)

64

• KeepboththebranchPCandtargetPCintheBTB• PC+4isfetchedifmatchfails• Onlytaken branchesandjumpsheldinBTB• NextPCdeterminedbefore branchfetchedanddecoded

2k-entry direct-mapped BTB(can also be associative)

I-Cache PC

k

Valid

valid

EntryPC

=

match

predicted

target

targetPC

Page 33: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

CombiningBTBandBHT

§ BTBentriesareconsiderablymoreexpensivethanBHT,butcanredirectfetchesatearlierstageinpipelineandcanaccelerateindirectbranches(JR)

§ BHTcanholdmanymoreentriesandismoreaccurate

65

A PCGeneration/MuxP InstructionFetchStage1F InstructionFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstructionstoFunctionalunitsR RegisterFileReadE IntegerExecute

BTB

BHTBHTinlaterpipelinestagecorrectswhenBTBmissesapredictedtakenbranch

BTB/BHTonlyupdatedafterbranchresolvesinEstage

UsesofJumpRegister(JR)

§ Switchstatements(jumptoaddressofmatchingcase)

§ Dynamicfunctioncall(jumptorun-timefunctionaddress)

§ Subroutinereturns(jumptoreturnaddress)

66

HowwelldoesBTBworkforeachofthesecases?

BTBworkswellifsamecaseusedrepeatedly

BTBworkswellifsamefunctionusuallycalled,(e.g.,inC++programming,whenobjectshavesametypeinvirtualfunctioncall)

BTBworkswellifusuallyreturntothesameplaceÞ Oftenonefunctioncalledfrommanydistinctcallsites!

Page 34: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

SubroutineReturnStack

SmallstructuretoaccelerateJRforsubroutinereturns,typicallymuchmoreaccuratethanBTBs.

67

&fb()&fc()

Pushcalladdresswhenfunctioncallexecuted

Popreturnaddresswhensubroutinereturndecoded

fa() { fb(); }fb() { fc(); }fc() { fd(); }

&fd() kentries(typicallyk=8-16)

ReturnStackinPipeline

§ Howtousereturnstack(RS)indeepfetchpipeline?§ Onlyknowifsubroutinecall/returnatdecode

68

A PCGeneration/MuxP InstructionFetchStage1F InstructionFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstructionstoFunctionalunitsR RegisterFileReadE IntegerExecute

RSRSPush/Popafterdecodegiveslargebubbleinfetchstream.

ReturnStackpredictionchecked

Page 35: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

ReturnStackinPipeline

§ CanrememberwhetherPCissubroutinecall/returnusingBTB-likestructure

§ Insteadoftarget-PC,juststorepush/popbit

69

A PCGeneration/MuxP InstructionFetchStage1F InstructionFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstructionstoFunctionalunitsR RegisterFileReadE IntegerExecute

RS

Push/Popbeforeinstructionsdecoded!

ReturnStackpredictionchecked

In-Ordervs.Out-of-OrderBranchPrediction

Fetch

Decode

Execute

Commit

In-OrderIssue Out-of-OrderIssue

Fetch

Decode

Execute

Commit

ROB

Br.Pred.

Resolve

Br.Pred.

Resolve

§ Speculativefetchbutnotspeculativeexecution- branchresolvesbeforelaterinstructionscomplete

§ Completedvaluesheldinbypassnetworkuntilcommit

§ Speculativeexecution,withbranchesresolvedafterlaterinstructionscomplete

§ CompletedvaluesheldinrenameregistersinROBorunifiedphysicalregisterfileuntilcommit

• Bothstylesofmachinecanusesamebranchpredictorsinfront-endfetchpipeline,andbothcanexecutemultipleinstructionspercycle

• Commontohave10-30pipelinestagesineitherstyleofdesign

In-Order

In-Order

In-Order

Out-of-Order

70

Page 36: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

InO vs.OoO Mispredict Recovery

§ In-orderexecution?- Designsonoinstructionissuedafterbranchcanwrite-backbeforebranchresolves

- Killallinstructionsinpipelinebehindmispredicted branch§ Out-of-orderexecution?-Multipleinstructionsfollowingbranchinprogramordercancompletebeforebranchresolves

- Asimplesolutionwouldbetohandlelikeprecisetraps- Problem?

71

BranchMispredictioninPipeline

§ CanhavemultipleunresolvedbranchesinROB§ Canresolvebranchesout-of-orderbykillingalltheinstructionsinROBthatfollowamispredicted branch

§ MIPSR10Kusesfourmaskbitstotaginstructionsthataredependentonuptofourspeculativebranches

§ Maskbitsclearedasbranchresolves,andreusedfornextbranch72

Fetch Decode

Execute

CommitReorder Buffer

Kill

Kill Kill

BranchResolution

Inject correct PC

BranchPrediction

PC

Complete

Page 37: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

RenameTableRecovery

§ Havetoquicklyrecoverrenametableonbranchmispredicts

§ MIPSR10Konlyhasfoursnapshotsforeachoffouroutstandingspeculativebranches

§ Alpha21264has80snapshots,oneperROBinstruction

73

ImprovingInstructionFetch

§ Performanceofspeculativeout-of-ordermachinesoftenlimitedbyinstructionfetchbandwidth- speculativeexecutioncanfetch2-3xmoreinstructionsthanarecommitted

-mispredictpenaltiesdominatedbytimetorefillinstructionwindow

- takenbranchesareparticularlytroublesome

Page 38: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

IncreasingTakenBranchBandwidth(Alpha21264I-Cache)

§ Fold2-waytagsandBTBintopredictednextblock§ Taketagchecks,inst.decode,branchpredictoutofloop§ RawRAMspeedoncriticalloop(1cycleat~1GHz)§ 2-bithysteresiscounterperblockpreventsovertraining

CachedInstructions

LinePredict

WayPredict

TagWay0

TagWay1

=? =?

fastfetchpath

PCGeneration

PC

BranchPredictionInstructionDecodeValidityChecks

4insts

Hit/Miss/Way

TournamentBranchPredictor(Alpha21264)

§ Choicepredictorlearnswhetherbesttouselocalorglobalbranchhistoryinpredictingnextbranch

§ Globalhistoryisspeculativelyupdatedbutrestoredonmispredict

§ Claim90-100%successonrangeofapplications

Local history table (1,024x10b)

PC

Local prediction (1,024x3b)

Global Prediction (4,096x2b)

Choice Prediction (4,096x2b)

Global History (12b)Prediction

Page 39: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

TakenBranchLimit

§ Integercodeshaveatakenbranchevery6-9instructions

§ Toavoidfetchbottleneck,mustexecutemultipletakenbranchespercyclewhenincreasingperformance

§ Thisimplies:- predictingmultiplebranchespercycle- fetchingmultiplenon-contiguousblockspercycle

BranchAddressCache(Yeh,Marr,Patt)

PCk

EntryPC

=

match

Valid

valid

predicted

target#1

target#1len

len#1

predicted

target#2

target#2

ExtendBTBtoreturnmultiplebranchpredictionspercycle

Page 40: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

FetchingMultipleBasicBlocks

§ Requireseither-multiportedcache:expensive- interleaving:bankconflictswilloccur

§ Mergingmultipleblockstofeedtodecodersaddslatencyincreasingmispredictpenaltyandreducingbranchthroughput

TraceCache

§ KeyIdea:Packmultiplenon-contiguousbasicblocksintoonecontiguoustracecacheline

BR BR BR

• Singlefetchbringsinmultiplebasicblocks

• Tracecacheindexedbystartaddress andnextnbranchpredictions

• UsedinIntelPentium-4processortoholddecodeduops

BRBRBR

Page 41: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

Load-StoreQueueDesign

§ Aftercontrolhazards,datahazardsthroughmemoryareprobablynextmostimportantbottlenecktosuperscalarperformance

§ Modernsuperscalarsuseverysophisticatedload-storereorderingtechniquestoreduceeffectivememorylatencybyallowingloadstobespeculativelyissued

81

SpeculativeStoreBuffer§ Justlikeregisterupdates,storesshouldnotmodifythememoryuntilaftertheinstructioniscommitted.Aspeculativestorebufferisastructureintroducedtoholdspeculativestoredata.

§ Duringdecode,storebufferslotallocatedinprogramorder

§ Storessplitinto“storeaddress”and“storedata”micro-operations

§ “Storeaddress”executionwritestag§ “Storedata”executionwritesdata§ Storecommitswhenoldestinstructionandbothaddressanddataavailable:- clearspeculativebitandeventuallymovedatatocache

§ Onstoreabort:- clearvalidbit

82

DataTags

StoreCommitPath

SpeculativeStoreBuffer

L1DataCache

Tag DataSVTag DataSVTag DataSVTag DataSVTag DataSVTag DataSV

StoreAddress

StoreData

Page 42: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

Loadbypassfromspeculativestorebuffer

83

§ Ifdatainbothstorebufferandcache,whichshouldweuse?Speculativestorebuffer

§ Ifsameaddressinstorebuffertwice,whichshouldweuse?Youngeststoreolderthanload

Data

LoadAddress

Tags

SpeculativeStoreBuffer L1DataCache

LoadData

Tag DataSVTag DataSVTag DataSVTag DataSVTag DataSVTag DataSV

MemoryDependencies

sd x1, (x2)ld x3, (x4)

§ Whencanweexecutetheload?

84

Page 43: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

In-OrderMemoryQueue

§ Executeallloadsandstoresinprogramorder

§ =>LoadandstorecannotleaveROBforexecutionuntilallpreviousloadsandstoreshavecompletedexecution

§ Canstillexecuteloadsandstoresspeculatively,andout-of-orderwithrespecttootherinstructions

§ Needastructuretohandlememoryordering…

85

ConservativeO-o-OLoadExecution

sd x1, (x2)ld x3, (x4)

§ Canexecuteloadbeforestore,ifaddressesknownandx4 !=x2

§ Eachloadaddresscomparedwithaddressesofallpreviousuncommittedstores- canusepartialconservativechecki.e.,bottom12bitsofaddress,tosavehardware

§ Don’texecuteloadifanypreviousstoreaddressnotknown

§ (MIPSR10K,16-entryaddressqueue)

86

Page 44: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

AddressSpeculation

sd x1, (x2)ld x3, (x4)

§ Guessthatx4 !=x2§ Executeloadbeforestoreaddressknown§ Needtoholdallcompletedbutuncommittedload/storeaddressesinprogramorder

§ Ifsubsequentlyfindx4==x2,squashloadandallfollowinginstructions

§ =>Largepenaltyforinaccurateaddressspeculation

87

MemoryDependencePrediction(Alpha21264)

sd x1, (x2)ld x3, (x4)

§ Guessthatx4 !=x2 andexecuteloadbeforestore

§ Iflaterfindx4==x2,squashloadandallfollowinginstructions,butmarkloadinstructionasstore-wait

§ Subsequentexecutionsofthesameloadinstructionwillwaitforallpreviousstorestocomplete

§ Periodicallyclearstore-waitbits88

Page 45: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

PartIII:MoreAdvancedOut-of-OrderSuperscalarDesigns

89

LimitationsofBHTs

90

Onlypredictsbranchdirection.Therefore,cannotredirectfetchstreamuntilafterbranchtargetisdetermined.

UltraSPARC-IIIfetchpipeline

Correctlypredictedtakenbranch

penalty

JumpRegisterpenalty

A PCGeneration/MuxP InstructionFetchStage1F InstructionFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstructionstoFunctionalunitsR RegisterFileReadE IntegerExecute

Remainderofexecutepipeline(+another6stages)

Page 46: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

BranchTargetBuffer(BTB)

91

• KeepboththebranchPCandtargetPCintheBTB• PC+4isfetchedifmatchfails• Onlytaken branchesandjumpsheldinBTB• NextPCdeterminedbefore branchfetchedanddecoded

2k-entry direct-mapped BTB(can also be associative)

I-Cache PC

k

Valid

valid

EntryPC

=

match

predicted

target

targetPC

CombiningBTBandBHT

§ BTBentriesareconsiderablymoreexpensivethanBHT,butcanredirectfetchesatearlierstageinpipelineandcanaccelerateindirectbranches(JR)

§ BHTcanholdmanymoreentriesandismoreaccurate

92

A PCGeneration/MuxP InstructionFetchStage1F InstructionFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstructionstoFunctionalunitsR RegisterFileReadE IntegerExecute

BTB

BHTBHTinlaterpipelinestagecorrectswhenBTBmissesapredictedtakenbranch

BTB/BHTonlyupdatedafterbranchresolvesinEstage

Page 47: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

UsesofJumpRegister(JR)

§ Switchstatements(jumptoaddressofmatchingcase)

§ Dynamicfunctioncall(jumptorun-timefunctionaddress)

§ Subroutinereturns(jumptoreturnaddress)

93

HowwelldoesBTBworkforeachofthesecases?

BTBworkswellifsamecaseusedrepeatedly

BTBworkswellifsamefunctionusuallycalled,(e.g.,inC++programming,whenobjectshavesametypeinvirtualfunctioncall)

BTBworkswellifusuallyreturntothesameplaceÞ Oftenonefunctioncalledfrommanydistinctcallsites!

SubroutineReturnStack

SmallstructuretoaccelerateJRforsubroutinereturns,typicallymuchmoreaccuratethanBTBs.

94

&fb()&fc()

Pushcalladdresswhenfunctioncallexecuted

Popreturnaddresswhensubroutinereturndecoded

fa() { fb(); }fb() { fc(); }fc() { fd(); }

&fd() kentries(typicallyk=8-16)

Page 48: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

ReturnStackinPipeline

§ Howtousereturnstack(RS)indeepfetchpipeline?§ Onlyknowifsubroutinecall/returnatdecode

95

A PCGeneration/MuxP InstructionFetchStage1F InstructionFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstructionstoFunctionalunitsR RegisterFileReadE IntegerExecute

RSRSPush/Popafterdecodegiveslargebubbleinfetchstream.

ReturnStackpredictionchecked

ReturnStackinPipeline

§ CanrememberwhetherPCissubroutinecall/returnusingBTB-likestructure

§ Insteadoftarget-PC,juststorepush/popbit

96

A PCGeneration/MuxP InstructionFetchStage1F InstructionFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstructionstoFunctionalunitsR RegisterFileReadE IntegerExecute

RS

Push/Popbeforeinstructionsdecoded!

ReturnStackpredictionchecked

Page 49: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

In-Ordervs.Out-of-OrderBranchPrediction

Fetch

Decode

Execute

Commit

In-OrderIssue Out-of-OrderIssue

Fetch

Decode

Execute

Commit

ROB

Br.Pred.

Resolve

Br.Pred.

Resolve

§ Speculativefetchbutnotspeculativeexecution- branchresolvesbeforelaterinstructionscomplete

§ Completedvaluesheldinbypassnetworkuntilcommit

§ Speculativeexecution,withbranchesresolvedafterlaterinstructionscomplete

§ CompletedvaluesheldinrenameregistersinROBorunifiedphysicalregisterfileuntilcommit

• Bothstylesofmachinecanusesamebranchpredictorsinfront-endfetchpipeline,andbothcanexecutemultipleinstructionspercycle

• Commontohave10-30pipelinestagesineitherstyleofdesign

In-Order

In-Order

In-Order

Out-of-Order

97

InO vs.OoO Mispredict Recovery

§ In-orderexecution?- Designsonoinstructionissuedafterbranchcanwrite-backbeforebranchresolves

- Killallinstructionsinpipelinebehindmispredicted branch§ Out-of-orderexecution?-Multipleinstructionsfollowingbranchinprogramordercancompletebeforebranchresolves

- Asimplesolutionwouldbetohandlelikeprecisetraps- Problem?

98

Page 50: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

BranchMispredictioninPipeline

§ CanhavemultipleunresolvedbranchesinROB§ Canresolvebranchesout-of-orderbykillingalltheinstructionsinROBthatfollowamispredicted branch

§ MIPSR10Kusesfourmaskbitstotaginstructionsthataredependentonuptofourspeculativebranches

§ Maskbitsclearedasbranchresolves,andreusedfornextbranch99

Fetch Decode

Execute

CommitReorder Buffer

Kill

Kill Kill

BranchResolution

Inject correct PC

BranchPrediction

PC

Complete

RenameTableRecovery

§ Havetoquicklyrecoverrenametableonbranchmispredicts

§ MIPSR10Konlyhasfoursnapshotsforeachoffouroutstandingspeculativebranches

§ Alpha21264has80snapshots,oneperROBinstruction

100

Page 51: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

ImprovingInstructionFetch

§ Performanceofspeculativeout-of-ordermachinesoftenlimitedbyinstructionfetchbandwidth- speculativeexecutioncanfetch2-3xmoreinstructionsthanarecommitted

-mispredictpenaltiesdominatedbytimetorefillinstructionwindow

- takenbranchesareparticularlytroublesome

IncreasingTakenBranchBandwidth(Alpha21264I-Cache)

§ Fold2-waytagsandBTBintopredictednextblock§ Taketagchecks,inst.decode,branchpredictoutofloop§ RawRAMspeedoncriticalloop(1cycleat~1GHz)§ 2-bithysteresiscounterperblockpreventsovertraining

CachedInstructions

LinePredict

WayPredict

TagWay0

TagWay1

=? =?

fastfetchpath

PCGeneration

PC

BranchPredictionInstructionDecodeValidityChecks

4insts

Hit/Miss/Way

Page 52: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

TournamentBranchPredictor(Alpha21264)

§ Choicepredictorlearnswhetherbesttouselocalorglobalbranchhistoryinpredictingnextbranch

§ Globalhistoryisspeculativelyupdatedbutrestoredonmispredict

§ Claim90-100%successonrangeofapplications

Local history table (1,024x10b)

PC

Local prediction (1,024x3b)

Global Prediction (4,096x2b)

Choice Prediction (4,096x2b)

Global History (12b)Prediction

TakenBranchLimit

§ Integercodeshaveatakenbranchevery6-9instructions

§ Toavoidfetchbottleneck,mustexecutemultipletakenbranchespercyclewhenincreasingperformance

§ Thisimplies:- predictingmultiplebranchespercycle- fetchingmultiplenon-contiguousblockspercycle

Page 53: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

BranchAddressCache(Yeh,Marr,Patt)

PCk

EntryPC

=

match

Valid

valid

predicted

target#1

target#1len

len#1

predicted

target#2

target#2

ExtendBTBtoreturnmultiplebranchpredictionspercycle

FetchingMultipleBasicBlocks

§ Requireseither-multiportedcache:expensive- interleaving:bankconflictswilloccur

§ Mergingmultipleblockstofeedtodecodersaddslatencyincreasingmispredictpenaltyandreducingbranchthroughput

Page 54: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

TraceCache

§ KeyIdea:Packmultiplenon-contiguousbasicblocksintoonecontiguoustracecacheline

BR BR BR

• Singlefetchbringsinmultiplebasicblocks

• Tracecacheindexedbystartaddress andnextnbranchpredictions

• UsedinIntelPentium-4processortoholddecodeduops

BRBRBR

Load-StoreQueueDesign

§ Aftercontrolhazards,datahazardsthroughmemoryareprobablynextmostimportantbottlenecktosuperscalarperformance

§ Modernsuperscalarsuseverysophisticatedload-storereorderingtechniquestoreduceeffectivememorylatencybyallowingloadstobespeculativelyissued

108

Page 55: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

SpeculativeStoreBuffer§ Justlikeregisterupdates,storesshouldnotmodifythememoryuntilaftertheinstructioniscommitted.Aspeculativestorebufferisastructureintroducedtoholdspeculativestoredata.

§ Duringdecode,storebufferslotallocatedinprogramorder

§ Storessplitinto“storeaddress”and“storedata”micro-operations

§ “Storeaddress”executionwritestag§ “Storedata”executionwritesdata§ Storecommitswhenoldestinstructionandbothaddressanddataavailable:- clearspeculativebitandeventuallymovedatatocache

§ Onstoreabort:- clearvalidbit

109

DataTags

StoreCommitPath

SpeculativeStoreBuffer

L1DataCache

Tag DataSVTag DataSVTag DataSVTag DataSVTag DataSVTag DataSV

StoreAddress

StoreData

Loadbypassfromspeculativestorebuffer

110

§ Ifdatainbothstorebufferandcache,whichshouldweuse?Speculativestorebuffer

§ Ifsameaddressinstorebuffertwice,whichshouldweuse?Youngeststoreolderthanload

Data

LoadAddress

Tags

SpeculativeStoreBuffer L1DataCache

LoadData

Tag DataSVTag DataSVTag DataSVTag DataSVTag DataSVTag DataSV

Page 56: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

MemoryDependencies

sd x1, (x2)ld x3, (x4)

§ Whencanweexecutetheload?

111

In-OrderMemoryQueue

§ Executeallloadsandstoresinprogramorder

§ =>LoadandstorecannotleaveROBforexecutionuntilallpreviousloadsandstoreshavecompletedexecution

§ Canstillexecuteloadsandstoresspeculatively,andout-of-orderwithrespecttootherinstructions

§ Needastructuretohandlememoryordering…

112

Page 57: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

ConservativeO-o-OLoadExecution

sd x1, (x2)ld x3, (x4)

§ Canexecuteloadbeforestore,ifaddressesknownandx4 !=x2

§ Eachloadaddresscomparedwithaddressesofallpreviousuncommittedstores- canusepartialconservativechecki.e.,bottom12bitsofaddress,tosavehardware

§ Don’texecuteloadifanypreviousstoreaddressnotknown

§ (MIPSR10K,16-entryaddressqueue)

113

AddressSpeculation

sd x1, (x2)ld x3, (x4)

§ Guessthatx4 !=x2§ Executeloadbeforestoreaddressknown§ Needtoholdallcompletedbutuncommittedload/storeaddressesinprogramorder

§ Ifsubsequentlyfindx4==x2,squashloadandallfollowinginstructions

§ =>Largepenaltyforinaccurateaddressspeculation

114

Page 58: CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

MemoryDependencePrediction(Alpha21264)

sd x1, (x2)ld x3, (x4)

§ Guessthatx4 !=x2 andexecuteloadbeforestore

§ Iflaterfindx4==x2,squashloadandallfollowinginstructions,butmarkloadinstructionasstore-wait

§ Subsequentexecutionsofthesameloadinstructionwillwaitforallpreviousstorestocomplete

§ Periodicallyclearstore-waitbits115

Acknowledgements

§ Thesecoursenotesweredevelopedby:- Krste Asanovic (UCB)- Arvind(MIT)- JoelEmer (Intel/MIT)- JamesHoe(CMU)- JohnKubiatowicz (UCB)- DavidPatterson(UCB)

116