CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring

CSC631:High-PerformanceComputerArchitecture

Spring2017Lecture6:Out-of-OrderProcessors

Supercomputers

Definitionsofasupercomputer:§ Fastestmachineinworldatgiventask§ Adevicetoturnacompute-boundproblemintoanI/Oboundproblem

§ Anymachinecosting$30M+§ AnymachinedesignedbySeymourCray

§ CDC6600(Cray,1964)regardedasfirstsupercomputer

2

CDC6600SeymourCray,1963

§ Afastpipelinedmachinewith60-bitwords- 128Kwordmainmemorycapacity,32banks

§ Tenfunctionalunits(parallel,unpipelined)- FloatingPoint:adder,2multipliers,divider- Integer:adder,2incrementers,...

§ Hardwiredcontrol(nomicrocoding)§ Scoreboard fordynamicschedulingofinstructions§ TenPeripheralProcessorsforInput/Output

- afastmulti-threaded12-bitintegerALU§ Veryfastclock,10MHz(FPaddin4clocks)§ >400,000transistors,750sq.ft.,5tons,150kW,novelfreon-basedtechnologyforcooling

§ Fastestmachineinworldfor5years(until7600)- over100sold($7-10Meach)

33/10/2009

CDC6600:ALoad/StoreArchitecture

4

• Separateinstructionstomanipulatethreetypesofreg.• 8x60-bitdataregisters(X)• 8x18-bitaddressregisters(A)• 8x18-bitindexregisters(B)

• Allarithmeticandlogicinstructionsareregister-to-register

•OnlyLoadandStoreinstructionsrefertomemory!

Touchingaddressregisters1to5initiatesaload6to7initiatesastore

- veryusefulforvectoroperations

opcode i j k Ri ¬ Rj op Rk

opcode i j disp Ri ¬ M[Rj + disp]

6 3 3 3

6 3 3 18

CDC6600:Datapath

5

AddressRegsIndexRegs8x18-bit8x18-bit

OperandRegs8x60-bit

Inst.Stack8x60-bit

IR

10FunctionalUnits

CentralMemory128Kwords,32banks,1µscycle

resultaddr

result

operand

operandaddr

CDC6600ISAdesignedtosimplifyhigh-performanceimplementation

§ Useofthree-address,register-registerALUinstructionssimplifiespipelinedimplementation- Only3-bitregisterspecifier fieldscheckedfordependencies- Noimplicitdependenciesbetweeninputsandoutputs

§ Decouplingsettingofaddressregister(Ar)fromretrievingvaluefromdataregister(Xr)simplifiesprovidingmultipleoutstandingmemoryaccesses- Softwarecanscheduleloadofaddressregisterbeforeuseofvalue- Caninterleaveindependentinstructionsinbetween

§ CDC6600hasmultipleparallelbutunpipelined functionalunits- E.g.,2separatemultipliers

§ Follow-onmachineCDC7600usedpipelinedfunctionalunits- ForeshadowslaterRISCdesigns

6

CDC6600:VectorAddition

7

B0←- nloop: JZEB0,exit

A0←B0+a0 loadX0A1←B0+b0 loadX1X6←X0+X1A6←B0+c0 storeX6B0←B0+1jumploop

Ai=addressregisterBi=indexregisterXi=dataregister

CDC6600Scoreboard

§ Instructionsdispatchedin-ordertofunctionalunitsprovidednostructuralhazardorWAW- Stallonstructuralhazard,nofunctionalunitsavailable-Onlyonependingwritetoanyregister

§ Instructionswaitforinputoperands(RAWhazards)beforeexecution- Canexecuteout-of-order

§ Instructionswaitforoutputregistertobereadbyprecedinginstructions(WAR)- Resultheldinfunctionalunituntilregisterfree

8

9[©IBM]

IBM360/91Floating-PointUnitR.M.Tomasulo,1967

10

Mult

1

123456

loadbuffers(frommemory)

1234

Adder

123

Floating-PointRegfile

storebuffers(tomemory)

...

instructions

Commonbusensuresthatdataismadeavailableimmediatelytoalltheinstructionswaitingforit.Matchtag,ifequal,copyvalue&setpresence“p”.

Distributereservationstationstofunctionalunits

<tag,result>

p tag/datap tag/datap tag/data



p tag/datap tag/data

p tag/datap tag/data2



p tag/datap tag/datap tag/datap tag/data

Out-of-OrderFadesintoBackgroundOut-of-orderprocessingimplementedcommerciallyin1960s,butdisappearedagainuntil1990sastwomajorproblemshadtobesolved:§ Precisetraps- ImprecisetrapscomplicatedebuggingandOScode-Note,preciseinterruptsarerelativelyeasytoprovide

§ Branchprediction- Amountofexploitableinstruction-levelparallelism(ILP)limitedbycontrolhazards

Also,simplermachinedesignsinnewtechnologybeatcomplicatedmachinesinoldtechnology- Bigadvantagetofitprocessor &cachesononechip-Microprocessorshaderaof1%/weekperformancescaling

11

SeparatingCompletionfromCommit

§ Re-orderbufferholdsregisterresultsfromcompletionuntilcommit- Entriesallocatedinprogramorderduringdecode- Bufferscompletedvaluesandexceptionstateuntilin-ordercommitpoint

- Completedvaluescanbeusedbydependentsbeforecommitted(bypassing)

- Eachentryholdsprogramcounter,instructiontype,destinationregisterspecifier andvalueifany,andexceptionstatus(infooftencompressedtosavehardware)

§ Memoryreorderingneedsspecialdatastructures- Speculativestoreaddressanddatabuffers- Speculativeloadaddressanddatabuffers

12

In-OrderCommitforPreciseTraps

§ In-orderinstructionfetchanddecode,anddispatchtoreservationstationsinsidereorderbuffer

§ Instructionsissuefromreservationstationsout-of-order§ Out-of-ordercompletion,valuesstoredintemporarybuffers§ Commitisin-order,checksfortraps,andifnoneupdatesarchitecturalstate

13

Fetch Decode

Execute

CommitReorderBuffer

In-order In-orderOut-of-order

Trap?Kill

Kill Kill

InjecthandlerPC

PhasesofInstructionExecution

14

Fetch: Instructionbitsretrievedfrominstructioncache.I-cache

FetchBuffer

IssueBuffer

FunctionalUnits

ArchitecturalState

Execute:Instructionsandoperands issuedtofunctionalunits.Whenexecutioncompletes,allresultsandexceptionflagsareavailable.

Decode: Instructions dispatchedtoappropriateissue buffer

ResultBufferCommit:Instructionirrevocablyupdatesarchitecturalstate(aka“graduation”),ortakesprecisetrap/interrupt.

PC

Commit

Decode/Rename

In-OrderversusOut-of-OrderPhases

§ Instructionfetch/decode/renamealwaysin-order-NeedtoparseISAsequentiallytogetcorrectsemantics- ProposalsforspeculativeOoO instructionfetch,e.g.,Multiscalar.Predictcontrolflowanddatadependenciesacrosssequential programsegmentsfetched/decoded/executedinparallel,fixup ifpredictionwrong

§ Dispatch(placeinstructionintomachinebufferstowaitforissue)alsoalwaysin-order- Someuse“Dispatch”tomeanissue,butnotintheselectures

15

In-OrderVersusOut-of-OrderIssue

§ In-orderissue:- IssuestallsonRAWdependenciesorstructuralhazards,orpossiblyWAR/WAWhazards

- Instructioncannotissuetoexecutionunitsunlessallprecedinginstructionshaveissuedtoexecutionunits

§ Out-of-orderissue:- Instructionsdispatchedinprogramordertoreservationstations(orotherformsofinstructionbuffer)towaitforoperandstoarrive,orotherhazardstoclear

-Whileearlierinstructionswaitinissuebuffers,followinginstructionscanbedispatchedandissuedout-of-order

16

In-OrderversusOut-of-OrderCompletion

§ Allbutthesimplestmachineshaveout-of-ordercompletion,duetodifferentlatenciesoffunctionalunitsanddesiretobypassvaluesassoonasavailable

§ ClassicRISC5-stageintegerpipelinejustbarelyhasin-ordercompletion- Loadtakestwocycles,butfollowingone-cycleintegeropcompletesatsametime, notearlier

- AddingpipelinedFPUimmediatelybringsOoO completion

17

In-OrderversusOut-of-OrderCommit

§ In-ordercommitsupportsprecisetraps,standardtoday- Someproposalstoreducethecostofin-ordercommitbyretiringsomeinstructionsearlytocompactreorderbuffer,butthisisjustanoptimizedin-ordercommit

§ Out-of-ordercommitwaseffectivelywhatearlyOoOmachinesimplemented(imprecisetraps)ascompletionirrevocablychangedmachinestate- i.e.,complete==commitinthesemachines

18

OoO DesignChoices

§ Wherearereservationstations?- Partofreorderbuffer,orinseparateissuewindow?- Distributedbyfunctionalunits,orcentralized?

§ Howisregisterrenamingperformed?- Tagsanddataheldinreservationstations,withseparatearchitecturalregisterfile

- Tagsonlyinreservationstations,dataheldinunifiedphysicalregisterfile

19

“Data-in-ROB”Design(HPPA8000,PentiumPro,Core2Duo,Nehalem)

§ Managedascircularbufferinprogramorder,newinstructionsdispatchedtofreeslots,oldestinstructioncommitted/reclaimedwhendone(“p”bitsetonresult)

§ TagisgivenbyindexinROB(Freepointervalue)§ Indispatch,non-busysourceoperandsreadfromarchitecturalregisterfileandcopiedtoSrc1andSrc2withpresencebit“p”set.Busyoperandscopytagofproducerandclear“p”bit.

§ Setvalidbit“v”ondispatch,setissuedbit“i”onissue§ Oncompletion,searchsourcetags,set“p”bitandcopydataintosrc ontagmatch.WriteresultandexceptionflagstoROB.

§ Oncommit,checkexceptionstatus,andcopyresultintoarchitecturalregisterfileifnotrap.

§ Ontrap,flushmachineandROB,setfree=oldest,jumptohandler

Tagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv OpcodeTagp Src1 Tagp Src2 Regp Result Except?iv Opcode

Oldest

Free

ManagingRenameforData-in-ROB

§ If“p”bitset,thenusevalueinarchitecturalregisterfile§ Else,tagfieldindicatesinstructionthatwill/hasproducedvalue§ Fordispatch,readsourceoperands<p,tag,value>fromarch.regfile,andalsoread<p,result>fromproducinginstructioninROB,bypassingasneeded.CopytoROB

§ Writedestinationarch.registerentrywith<0,Free,_>,toassigntagtoROBindexofthisinstruction

§ Oncommit,updatearch.regfile with<1,_,Result>§ Ontrap,resettable(Allp=1)

21

Tagp ValueTagp ValueTagp Value

Tagp Value

Oneentryperarch.register

Renametableassociatedwitharchitecturalregisters,managedindecode/dispatch

ROB

DataMovementinData-in-ROBDesign

22

ArchitecturalRegisterFile

Readoperandsduringdecode

SourceOperands

Writesourcesindispatch

Readoperandsatissue

FunctionalUnits

Writeresultsatcompletion

Readresultsforcommit

Bypassnewervaluesatdispatch

ResultData

Writeresultsatcommit

UnifiedPhysicalRegisterFile(MIPSR10K,Alpha21264,IntelPentium4&Sandy/IvyBridge)

§ Renameallarchitecturalregistersintoasinglephysicalregisterfileduringdecode,noregistervaluesread

§ Functionalunitsreadandwritefromsingleunifiedregisterfileholdingcommittedandtemporaryregistersinexecute

§ Commitonlyupdatesmappingofarchitecturalregistertophysicalregister,nodatamovement

23

UnifiedPhysicalRegisterFile

Readoperandsatissue

FunctionalUnits


CommittedRegisterMapping

DecodeStageRegisterMapping

LifetimeofPhysicalRegisters

24

ld x1, (x3)addi x3, x1, #4sub x6, x7, x9add x3, x3, x6ld x6, (x1)add x6, x6, x3sd x6, (x1)ld x6, (x11)

ld P1, (Px)addi P2, P1, #4sub P3, Py, Pzadd P4, P2, P3ld P5, (P1)add P6, P5, P4sd P6, (P1)ld P7, (Pw)

Rename

Whencanwereuseaphysicalregister?Whennextwriterofsamearchitecturalregistercommits

• Physicalregfile holdscommittedandspeculativevalues• PhysicalregistersdecoupledfromROBentries(nodatainROB)

PhysicalRegisterManagement

25

op p1 PR1 p2 PR2exuse Rd PRdLPRd

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

x5P5x6P6x7

x0P8x1

x2P7x3

x4

ROB

Rename Table

Physical Regs Free List

ld x1, 0(x3)addi x3, x1, #4sub x6, x7, x6add x3, x3, x6ld x6, 0(x1)

ppp

P0P1P3P2P4

(LPRd requires third read port on Rename Table for each instruction)

<x1>P8 p


26

op p1 PR1 p2 PR2exuse Rd PRdLPRdROB


Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8


27



Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<R1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1


28



Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<R1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3

x sub p P6 p P5 x6 P3


29



Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3

x sub p P6 p P5 x6 P3P1

P2

x add P1 P3 x3 P2


30



Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3


P2

x add P1 P3 x3 P2x ld P0 x6 P4P3

P4


31


x ld p P7 x1 P0x addi P0 x3 P1x sub p P6 p P5 x6 P3

x ld p P7 x1 P0


Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

P5

P3

P1

P2


P4

Execute & Commitp

p

p<x1>

P8

x


32


x sub p P6 p P5 x6 P3x addi P0 x3 P1x addi P0 x3 P1


Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

P8

x x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

P5

P3

P1

P2


P4

Execute & Commitp

p

p<x1>

P8

x

p

p<x3>

P7

MIPSR10KTrapHandling

§ Renametableisrepairedbyunrenaming instructionsinreverseorderusingthePRd/LPRd fields

§ TheAlpha21264hadsimilarphysicalregisterfilescheme,butkeptcompleterenametablesnapshotsforeachinstructioninROB(80snapshotstotal)- Flashcopyallbitsfromsnapshottoactivetableinonecycle

33

PartII:AdvancedOut-of-OrderSuperscalarDesigns

34

ROB

DataMovementinData-in-ROBDesign

35

ArchitecturalRegisterFile

Readoperandsduringdecode

SourceOperands

Writesourcesindispatch

Readoperandsatissue

FunctionalUnits


Readresultsforcommit

Bypassnewervaluesatdispatch

ResultData

Writeresultsatcommit

UnifiedPhysicalRegisterFile(MIPSR10K,Alpha21264,IntelPentium4&Sandy/IvyBridge)

§ Renameallarchitecturalregistersintoasinglephysicalregisterfileduringdecode,noregistervaluesread

§ Functionalunitsreadandwritefromsingleunifiedregisterfileholdingcommittedandtemporaryregistersinexecute

§ Commitonlyupdatesmappingofarchitecturalregistertophysicalregister,nodatamovement

36

UnifiedPhysicalRegisterFile

Readoperandsatissue

FunctionalUnits


CommittedRegisterMapping

DecodeStageRegisterMapping

LifetimeofPhysicalRegisters

37

ld x1, (x3)addi x3, x1, #4sub x6, x7, x9add x3, x3, x6ld x6, (x1)add x6, x6, x3sd x6, (x1)ld x6, (x11)

ld P1, (Px)addi P2, P1, #4sub P3, Py, Pzadd P4, P2, P3ld P5, (P1)add P6, P5, P4sd P6, (P1)ld P7, (Pw)

Rename

Whencanwereuseaphysicalregister?Whennextwriterofsamearchitecturalregistercommits

• Physicalregfile holdscommittedandspeculativevalues• PhysicalregistersdecoupledfromROBentries(nodatainROB)


38

op p1 PR1 p2 PR2exuse Rd PRdLPRd

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

x5P5x6P6x7

x0P8x1

x2P7x3

x4

ROB

Rename Table

Physical Regs Free List


ppp

P0P1P3P2P4

(LPRd requires third read port on Rename Table for each instruction)

<x1>P8 p


39



Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8


40



Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<R1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1


41



Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<R1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3

x sub p P6 p P5 x6 P3


42



Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3


P2

x add P1 P3 x3 P2


43



Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

x addi P0 x3 P1P5

P3


P2


P4


44


x ld p P7 x1 P0x addi P0 x3 P1x sub p P6 p P5 x6 P3

x ld p P7 x1 P0


Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

<x1>P8 p

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

P5

P3

P1

P2


P4

Execute & Commitp

p

p<x1>

P8

x


45


x sub p P6 p P5 x6 P3x addi P0 x3 P1x addi P0 x3 P1


Free ListP0P1P3P2P4

<x6>P5<x7>P6<x3>P7

P0

Pn

P1P2P3P4

Physical Regs

ppp

P8

x x ld p P7 x1 P0

x5P5x6P6x7

x0P8x1

x2P7x3

x4

Rename Table

P0

P8P7

P1

P5

P3

P1

P2


P4

Execute & Commitp

p

p<x1>

P8

x

p

p<x3>

P7

MIPSR10KTrapHandling

§ Renametableisrepairedbyunrenaming instructionsinreverseorderusingthePRd/LPRd fields

§ TheAlpha21264hadsimilarphysicalregisterfilescheme,butkeptcompleterenametablesnapshotsforeachinstructioninROB(80snapshotstotal)- Flashcopyallbitsfromsnapshottoactivetableinonecycle

46

ReorderBufferHoldsActiveInstructions(DecodedbutnotCommitted)

47

(Olderinstructions)

(Newerinstructions)

Cyclet

…ld x1, (x3)add x3, x1, x2sub x6, x7, x9add x3, x3, x6ld x6, (x1)add x6, x6, x3sd x6, (x1)ld x6, (x1)…

Commit

Fetch

Cyclet+1

Execute

…ld x1, (x3)add x3, x1, x2sub x6, x7, x9add x3, x3, x6ld x6, (x1)add x6, x6, x3sd x6, (x1)ld x6, (x1)…

ROBcontents

SeparateIssueWindowfromROB

48

Reorderbufferusedtoholdexceptioninformationforcommit.

Theissuewindowholdsonlyinstructionsthathavebeendecodedandrenamedbutnotissuedintoexecution.Hasregistertagsandpresencebits,andpointertoROBentry.

op p1 PR1 p2 PR2 PRduse ex ROB#

ROBisusuallyseveraltimeslargerthanissuewindow– why?

Rd LPRd PC Except?Oldest

Free

Done?

SuperscalarRegisterRenaming§ Duringdecode,instructionsallocatednewphysicaldestinationregister§ Sourceoperandsrenamedtophysicalregisterwithnewestvalue§ Executionunitonlyseesphysicalregisternumbers

49

RenameTable

Op Src1 Src2Dest Op Src1 Src2Dest

RegisterFreeList

Op PSrc1 PSrc2PDestOp PSrc1 PSrc2PDest

UpdateMapping

Doesthiswork?

Inst1 Inst2

ReadAddresses

ReadDataWrite

Ports

SuperscalarRegisterRenaming

50

RenameTable

Op Src1 Src2Dest Op Src1 Src2Dest

RegisterFreeList

Op PSrc1 PSrc2PDestOp PSrc1 PSrc2PDest

UpdateMapping

Inst1 Inst2

ReadAddresses

ReadDataWrite

Ports =?=?

MustcheckforRAWhazardsbetweeninstructionsissuinginsamecycle.Canbedoneinparallelwithrenamelookup.

MIPSR10Krenames4serially-RAW-dependentinsts/cycle

ControlFlowPenalty

51

I-cache

Fetch Buffer

IssueBuffer

Func.Units

Arch.State

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started

Modernprocessorsmayhave>10pipelinestagesbetweennextPCcalculationandbranchresolution!

Howmuchworkislostifpipelinedoesn’tfollowcorrectinstructionflow?

~Looplengthxpipelinewidth+buffers

ReducingControlFlowPenalty

§ Softwaresolutions- Eliminatebranches- loopunrolling- Increasestherunlength

- Reduceresolutiontime- instructionscheduling- Computethebranchconditionasearlyaspossible(oflimitedvaluebecausebranchesoftenincriticalpaththroughcode)

§ Hardwaresolutions- Findsomethingelsetodo- delayslots- Replacespipelinebubbleswithusefulwork(requiressoftwarecooperation)– quicklyseediminishingreturns

- Speculate- branchprediction- Speculativeexecutionofinstructionsbeyondthebranch- Manyadvancesinaccuracy

52

BranchPrediction

53

Motivation:BranchpenaltieslimitperformanceofdeeplypipelinedprocessorsModernbranchpredictorshavehighaccuracy(>95%)andcanreducebranchpenaltiessignificantly

Requiredhardwaresupport:Predictionstructures:

• Branchhistorytables,branchtargetbuffers,etc.

Mispredict recoverymechanisms:• Keepresultcomputationseparatefromcommit• Killinstructionsfollowingbranchinpipeline• Restorestatetothatfollowingbranch

ImportanceofBranchPrediction

§ Consider4-waysuperscalarwith8pipelinestagesfromfetchtodispatch, and80-entryROB,and3cyclesfromissuetobranchresolution

§ Onamispredict,couldthrowaway8*4+(80-1)=111instructions

§ Improvingfrom90%to95%predictionaccuracy,removes50%ofbranchmispredicts- If1/6instructionsarebranches,thenmovefrom60instructionsbetweenmispredicts,to120instructionsbetweenmispredicts

54

StaticBranchPrediction

55

Overallprobabilityabranchistakenis~60-70%but:

ISAcanattachpreferreddirectionsemanticstobranches,e.g.,MotorolaMC88110

bne0 (preferredtaken) beq0 (nottaken)

ISAcanallowarbitrarychoiceofstaticallypredicteddirection,e.g.,HPPA-RISC,IntelIA-64

typicallyreportedas~80%accurate

backward90%

forward50%

DynamicBranchPredictionlearningbasedonpastbehavior

§ Temporalcorrelation- Thewayabranchresolvesmaybeagoodpredictorofthewayitwillresolveatthenextexecution

§ Spatialcorrelation- Severalbranchesmayresolveinahighlycorrelatedmanner(apreferredpathofexecution)

56

One-BitBranchHistoryPredictor

§ Foreachbranch,rememberlastwaybranchwent§ Hasproblemwithloop-closingbackwardbranches,astwomispredicts occuroneveryloopexecution1. firstiterationpredictsloopbackwardsbranchnot-taken

(loopwasexitedlasttime)2. lastiterationpredictsloopbackwardsbranchtaken(loop

continuedlasttime)

57

BranchPredictionBits

58

• Assume2BPbitsperinstruction• Changethepredictionaftertwoconsecutivemistakes!

¬takewrongtaken

¬taken

taken

taken

taken¬takeright

takeright

takewrong

¬taken

¬taken¬taken

BPstate:(predict take/¬take)x(lastprediction right/wrong)

BranchHistoryTable(BHT)

59

4K-entryBHT,2bits/entry,~80-90%correctpredictions

0 0FetchPC

Branch? TargetPC

+

I-Cache

Opcode offsetInstruction

kBHTIndex

2k-entryBHT,2bits/entry

Taken/¬Taken?

ExploitingSpatialCorrelationYehandPatt,1992

60

Historyregister,H,recordsthedirectionofthelastNbranchesexecutedbytheprocessor

if (x[i] < 7) theny += 1;

if (x[i] < 5) thenc -= 4;

Iffirstconditionfalse,secondconditionalsofalse

Two-LevelBranchPredictor

61

PentiumProusestheresultfromthelasttwobranchestoselectoneofthefoursetsofBHTbits(~95%correct)

0 0

kFetchPC

ShiftinTaken/¬Takenresultsofeachbranch

2-bitglobalbranchhistoryshiftregister

Taken/¬Taken?

SpeculatingBothDirections

§ Analternativetobranchpredictionistoexecutebothdirectionsofabranchspeculatively- resourcerequirementisproportionaltothenumberofconcurrentspeculativeexecutions

- onlyhalftheresourcesengageinusefulworkwhenbothdirectionsofabranchareexecutedspeculatively

- branchpredictiontakeslessresourcesthanspeculativeexecutionofbothpaths

§ Withaccuratebranchprediction,itismorecosteffectivetodedicateallresourcestothepredicteddirection!

62

LimitationsofBHTs

63

Onlypredictsbranchdirection.Therefore,cannotredirectfetchstreamuntilafterbranchtargetisdetermined.

UltraSPARC-IIIfetchpipeline

Correctlypredictedtakenbranch

penalty

JumpRegisterpenalty

A PCGeneration/MuxP InstructionFetchStage1F InstructionFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstructionstoFunctionalunitsR RegisterFileReadE IntegerExecute

Remainderofexecutepipeline(+another6stages)

BranchTargetBuffer(BTB)

64

• KeepboththebranchPCandtargetPCintheBTB• PC+4isfetchedifmatchfails• Onlytaken branchesandjumpsheldinBTB• NextPCdeterminedbefore branchfetchedanddecoded

2k-entry direct-mapped BTB(can also be associative)

I-Cache PC

k

Valid

valid

EntryPC

=

match

predicted

target

targetPC

CombiningBTBandBHT

§ BTBentriesareconsiderablymoreexpensivethanBHT,butcanredirectfetchesatearlierstageinpipelineandcanaccelerateindirectbranches(JR)

§ BHTcanholdmanymoreentriesandismoreaccurate

65


BTB

BHTBHTinlaterpipelinestagecorrectswhenBTBmissesapredictedtakenbranch

BTB/BHTonlyupdatedafterbranchresolvesinEstage

UsesofJumpRegister(JR)

§ Switchstatements(jumptoaddressofmatchingcase)

§ Dynamicfunctioncall(jumptorun-timefunctionaddress)

§ Subroutinereturns(jumptoreturnaddress)

66

HowwelldoesBTBworkforeachofthesecases?

BTBworkswellifsamecaseusedrepeatedly

BTBworkswellifsamefunctionusuallycalled,(e.g.,inC++programming,whenobjectshavesametypeinvirtualfunctioncall)

BTBworkswellifusuallyreturntothesameplaceÞ Oftenonefunctioncalledfrommanydistinctcallsites!

SubroutineReturnStack

SmallstructuretoaccelerateJRforsubroutinereturns,typicallymuchmoreaccuratethanBTBs.

67

&fb()&fc()

Pushcalladdresswhenfunctioncallexecuted

Popreturnaddresswhensubroutinereturndecoded

fa() { fb(); }fb() { fc(); }fc() { fd(); }

&fd() kentries(typicallyk=8-16)

ReturnStackinPipeline

§ Howtousereturnstack(RS)indeepfetchpipeline?§ Onlyknowifsubroutinecall/returnatdecode

68


RSRSPush/Popafterdecodegiveslargebubbleinfetchstream.

ReturnStackpredictionchecked


§ CanrememberwhetherPCissubroutinecall/returnusingBTB-likestructure

§ Insteadoftarget-PC,juststorepush/popbit

69


RS

Push/Popbeforeinstructionsdecoded!


In-Ordervs.Out-of-OrderBranchPrediction

Fetch

Decode

Execute

Commit

In-OrderIssue Out-of-OrderIssue

Fetch

Decode

Execute

Commit

ROB

Br.Pred.

Resolve

Br.Pred.

Resolve

§ Speculativefetchbutnotspeculativeexecution- branchresolvesbeforelaterinstructionscomplete

§ Completedvaluesheldinbypassnetworkuntilcommit

§ Speculativeexecution,withbranchesresolvedafterlaterinstructionscomplete

§ CompletedvaluesheldinrenameregistersinROBorunifiedphysicalregisterfileuntilcommit

• Bothstylesofmachinecanusesamebranchpredictorsinfront-endfetchpipeline,andbothcanexecutemultipleinstructionspercycle

• Commontohave10-30pipelinestagesineitherstyleofdesign

In-Order

In-Order

In-Order

Out-of-Order

70

InO vs.OoO Mispredict Recovery

§ In-orderexecution?- Designsonoinstructionissuedafterbranchcanwrite-backbeforebranchresolves

- Killallinstructionsinpipelinebehindmispredicted branch§ Out-of-orderexecution?-Multipleinstructionsfollowingbranchinprogramordercancompletebeforebranchresolves

- Asimplesolutionwouldbetohandlelikeprecisetraps- Problem?

71

BranchMispredictioninPipeline

§ CanhavemultipleunresolvedbranchesinROB§ Canresolvebranchesout-of-orderbykillingalltheinstructionsinROBthatfollowamispredicted branch

§ MIPSR10Kusesfourmaskbitstotaginstructionsthataredependentonuptofourspeculativebranches

§ Maskbitsclearedasbranchresolves,andreusedfornextbranch72

Fetch Decode

Execute

CommitReorder Buffer

Kill

Kill Kill

BranchResolution

Inject correct PC

BranchPrediction

PC

Complete

RenameTableRecovery

§ Havetoquicklyrecoverrenametableonbranchmispredicts

§ MIPSR10Konlyhasfoursnapshotsforeachoffouroutstandingspeculativebranches

§ Alpha21264has80snapshots,oneperROBinstruction

73

ImprovingInstructionFetch

§ Performanceofspeculativeout-of-ordermachinesoftenlimitedbyinstructionfetchbandwidth- speculativeexecutioncanfetch2-3xmoreinstructionsthanarecommitted

-mispredictpenaltiesdominatedbytimetorefillinstructionwindow

- takenbranchesareparticularlytroublesome

IncreasingTakenBranchBandwidth(Alpha21264I-Cache)

§ Fold2-waytagsandBTBintopredictednextblock§ Taketagchecks,inst.decode,branchpredictoutofloop§ RawRAMspeedoncriticalloop(1cycleat~1GHz)§ 2-bithysteresiscounterperblockpreventsovertraining

CachedInstructions

LinePredict

WayPredict

TagWay0

TagWay1

=? =?

fastfetchpath

PCGeneration

PC

BranchPredictionInstructionDecodeValidityChecks

4insts

Hit/Miss/Way

TournamentBranchPredictor(Alpha21264)

§ Choicepredictorlearnswhetherbesttouselocalorglobalbranchhistoryinpredictingnextbranch

§ Globalhistoryisspeculativelyupdatedbutrestoredonmispredict

§ Claim90-100%successonrangeofapplications

Local history table (1,024x10b)

PC

Local prediction (1,024x3b)

Global Prediction (4,096x2b)

Choice Prediction (4,096x2b)

Global History (12b)Prediction

TakenBranchLimit

§ Integercodeshaveatakenbranchevery6-9instructions

§ Toavoidfetchbottleneck,mustexecutemultipletakenbranchespercyclewhenincreasingperformance

§ Thisimplies:- predictingmultiplebranchespercycle- fetchingmultiplenon-contiguousblockspercycle

BranchAddressCache(Yeh,Marr,Patt)

PCk

EntryPC

=

match

Valid

valid

predicted

target#1

target#1len

len#1

predicted

target#2

target#2

ExtendBTBtoreturnmultiplebranchpredictionspercycle

FetchingMultipleBasicBlocks

§ Requireseither-multiportedcache:expensive- interleaving:bankconflictswilloccur

§ Mergingmultipleblockstofeedtodecodersaddslatencyincreasingmispredictpenaltyandreducingbranchthroughput

TraceCache

§ KeyIdea:Packmultiplenon-contiguousbasicblocksintoonecontiguoustracecacheline

BR BR BR

• Singlefetchbringsinmultiplebasicblocks

• Tracecacheindexedbystartaddress andnextnbranchpredictions

• UsedinIntelPentium-4processortoholddecodeduops

BRBRBR

Load-StoreQueueDesign

§ Aftercontrolhazards,datahazardsthroughmemoryareprobablynextmostimportantbottlenecktosuperscalarperformance

§ Modernsuperscalarsuseverysophisticatedload-storereorderingtechniquestoreduceeffectivememorylatencybyallowingloadstobespeculativelyissued

81

SpeculativeStoreBuffer§ Justlikeregisterupdates,storesshouldnotmodifythememoryuntilaftertheinstructioniscommitted.Aspeculativestorebufferisastructureintroducedtoholdspeculativestoredata.

§ Duringdecode,storebufferslotallocatedinprogramorder

§ Storessplitinto“storeaddress”and“storedata”micro-operations

§ “Storeaddress”executionwritestag§ “Storedata”executionwritesdata§ Storecommitswhenoldestinstructionandbothaddressanddataavailable:- clearspeculativebitandeventuallymovedatatocache

§ Onstoreabort:- clearvalidbit

82

DataTags

StoreCommitPath

SpeculativeStoreBuffer

L1DataCache

Tag DataSVTag DataSVTag DataSVTag DataSVTag DataSVTag DataSV

StoreAddress

StoreData

Loadbypassfromspeculativestorebuffer

83

§ Ifdatainbothstorebufferandcache,whichshouldweuse?Speculativestorebuffer

§ Ifsameaddressinstorebuffertwice,whichshouldweuse?Youngeststoreolderthanload

Data

LoadAddress

Tags

SpeculativeStoreBuffer L1DataCache

LoadData


MemoryDependencies

sd x1, (x2)ld x3, (x4)

§ Whencanweexecutetheload?

84

In-OrderMemoryQueue

§ Executeallloadsandstoresinprogramorder

§ =>LoadandstorecannotleaveROBforexecutionuntilallpreviousloadsandstoreshavecompletedexecution

§ Canstillexecuteloadsandstoresspeculatively,andout-of-orderwithrespecttootherinstructions

§ Needastructuretohandlememoryordering…

85

ConservativeO-o-OLoadExecution

sd x1, (x2)ld x3, (x4)

§ Canexecuteloadbeforestore,ifaddressesknownandx4 !=x2

§ Eachloadaddresscomparedwithaddressesofallpreviousuncommittedstores- canusepartialconservativechecki.e.,bottom12bitsofaddress,tosavehardware

§ Don’texecuteloadifanypreviousstoreaddressnotknown

§ (MIPSR10K,16-entryaddressqueue)

86

AddressSpeculation

sd x1, (x2)ld x3, (x4)

§ Guessthatx4 !=x2§ Executeloadbeforestoreaddressknown§ Needtoholdallcompletedbutuncommittedload/storeaddressesinprogramorder

§ Ifsubsequentlyfindx4==x2,squashloadandallfollowinginstructions

§ =>Largepenaltyforinaccurateaddressspeculation

87

MemoryDependencePrediction(Alpha21264)

sd x1, (x2)ld x3, (x4)

§ Guessthatx4 !=x2 andexecuteloadbeforestore

§ Iflaterfindx4==x2,squashloadandallfollowinginstructions,butmarkloadinstructionasstore-wait

§ Subsequentexecutionsofthesameloadinstructionwillwaitforallpreviousstorestocomplete

§ Periodicallyclearstore-waitbits88

PartIII:MoreAdvancedOut-of-OrderSuperscalarDesigns

89

LimitationsofBHTs

90

Onlypredictsbranchdirection.Therefore,cannotredirectfetchstreamuntilafterbranchtargetisdetermined.

UltraSPARC-IIIfetchpipeline

Correctlypredictedtakenbranch

penalty

JumpRegisterpenalty


Remainderofexecutepipeline(+another6stages)

BranchTargetBuffer(BTB)

91

• KeepboththebranchPCandtargetPCintheBTB• PC+4isfetchedifmatchfails• Onlytaken branchesandjumpsheldinBTB• NextPCdeterminedbefore branchfetchedanddecoded

2k-entry direct-mapped BTB(can also be associative)

I-Cache PC

k

Valid

valid

EntryPC

=

match

predicted

target

targetPC

CombiningBTBandBHT

§ BTBentriesareconsiderablymoreexpensivethanBHT,butcanredirectfetchesatearlierstageinpipelineandcanaccelerateindirectbranches(JR)

§ BHTcanholdmanymoreentriesandismoreaccurate

92


BTB

BHTBHTinlaterpipelinestagecorrectswhenBTBmissesapredictedtakenbranch

BTB/BHTonlyupdatedafterbranchresolvesinEstage

UsesofJumpRegister(JR)

§ Switchstatements(jumptoaddressofmatchingcase)

§ Dynamicfunctioncall(jumptorun-timefunctionaddress)

§ Subroutinereturns(jumptoreturnaddress)

93

HowwelldoesBTBworkforeachofthesecases?

BTBworkswellifsamecaseusedrepeatedly

BTBworkswellifsamefunctionusuallycalled,(e.g.,inC++programming,whenobjectshavesametypeinvirtualfunctioncall)

BTBworkswellifusuallyreturntothesameplaceÞ Oftenonefunctioncalledfrommanydistinctcallsites!

SubroutineReturnStack

SmallstructuretoaccelerateJRforsubroutinereturns,typicallymuchmoreaccuratethanBTBs.

94

&fb()&fc()

Pushcalladdresswhenfunctioncallexecuted

Popreturnaddresswhensubroutinereturndecoded

fa() { fb(); }fb() { fc(); }fc() { fd(); }

&fd() kentries(typicallyk=8-16)


§ Howtousereturnstack(RS)indeepfetchpipeline?§ Onlyknowifsubroutinecall/returnatdecode

95


RSRSPush/Popafterdecodegiveslargebubbleinfetchstream.



§ CanrememberwhetherPCissubroutinecall/returnusingBTB-likestructure

§ Insteadoftarget-PC,juststorepush/popbit

96


RS

Push/Popbeforeinstructionsdecoded!


In-Ordervs.Out-of-OrderBranchPrediction

Fetch

Decode

Execute

Commit

In-OrderIssue Out-of-OrderIssue

Fetch

Decode

Execute

Commit

ROB

Br.Pred.

Resolve

Br.Pred.

Resolve

§ Speculativefetchbutnotspeculativeexecution- branchresolvesbeforelaterinstructionscomplete

§ Completedvaluesheldinbypassnetworkuntilcommit

§ Speculativeexecution,withbranchesresolvedafterlaterinstructionscomplete

§ CompletedvaluesheldinrenameregistersinROBorunifiedphysicalregisterfileuntilcommit

• Bothstylesofmachinecanusesamebranchpredictorsinfront-endfetchpipeline,andbothcanexecutemultipleinstructionspercycle

• Commontohave10-30pipelinestagesineitherstyleofdesign

In-Order

In-Order

In-Order

Out-of-Order

97

InO vs.OoO Mispredict Recovery

§ In-orderexecution?- Designsonoinstructionissuedafterbranchcanwrite-backbeforebranchresolves

- Killallinstructionsinpipelinebehindmispredicted branch§ Out-of-orderexecution?-Multipleinstructionsfollowingbranchinprogramordercancompletebeforebranchresolves

- Asimplesolutionwouldbetohandlelikeprecisetraps- Problem?

98

BranchMispredictioninPipeline

§ CanhavemultipleunresolvedbranchesinROB§ Canresolvebranchesout-of-orderbykillingalltheinstructionsinROBthatfollowamispredicted branch

§ MIPSR10Kusesfourmaskbitstotaginstructionsthataredependentonuptofourspeculativebranches

§ Maskbitsclearedasbranchresolves,andreusedfornextbranch99

Fetch Decode

Execute

CommitReorder Buffer

Kill

Kill Kill

BranchResolution

Inject correct PC

BranchPrediction

PC

Complete

RenameTableRecovery

§ Havetoquicklyrecoverrenametableonbranchmispredicts

§ MIPSR10Konlyhasfoursnapshotsforeachoffouroutstandingspeculativebranches

§ Alpha21264has80snapshots,oneperROBinstruction

100

ImprovingInstructionFetch

§ Performanceofspeculativeout-of-ordermachinesoftenlimitedbyinstructionfetchbandwidth- speculativeexecutioncanfetch2-3xmoreinstructionsthanarecommitted

-mispredictpenaltiesdominatedbytimetorefillinstructionwindow

- takenbranchesareparticularlytroublesome

IncreasingTakenBranchBandwidth(Alpha21264I-Cache)

§ Fold2-waytagsandBTBintopredictednextblock§ Taketagchecks,inst.decode,branchpredictoutofloop§ RawRAMspeedoncriticalloop(1cycleat~1GHz)§ 2-bithysteresiscounterperblockpreventsovertraining

CachedInstructions

LinePredict

WayPredict

TagWay0

TagWay1

=? =?

fastfetchpath

PCGeneration

PC

BranchPredictionInstructionDecodeValidityChecks

4insts

Hit/Miss/Way

TournamentBranchPredictor(Alpha21264)

§ Choicepredictorlearnswhetherbesttouselocalorglobalbranchhistoryinpredictingnextbranch

§ Globalhistoryisspeculativelyupdatedbutrestoredonmispredict

§ Claim90-100%successonrangeofapplications

Local history table (1,024x10b)

PC

Local prediction (1,024x3b)

Global Prediction (4,096x2b)

Choice Prediction (4,096x2b)

Global History (12b)Prediction

TakenBranchLimit

§ Integercodeshaveatakenbranchevery6-9instructions

§ Toavoidfetchbottleneck,mustexecutemultipletakenbranchespercyclewhenincreasingperformance

§ Thisimplies:- predictingmultiplebranchespercycle- fetchingmultiplenon-contiguousblockspercycle

BranchAddressCache(Yeh,Marr,Patt)

PCk

EntryPC

=

match

Valid

valid

predicted

target#1

target#1len

len#1

predicted

target#2

target#2

ExtendBTBtoreturnmultiplebranchpredictionspercycle

FetchingMultipleBasicBlocks

§ Requireseither-multiportedcache:expensive- interleaving:bankconflictswilloccur

§ Mergingmultipleblockstofeedtodecodersaddslatencyincreasingmispredictpenaltyandreducingbranchthroughput

TraceCache

§ KeyIdea:Packmultiplenon-contiguousbasicblocksintoonecontiguoustracecacheline

BR BR BR

• Singlefetchbringsinmultiplebasicblocks

• Tracecacheindexedbystartaddress andnextnbranchpredictions

• UsedinIntelPentium-4processortoholddecodeduops

BRBRBR

Load-StoreQueueDesign

§ Aftercontrolhazards,datahazardsthroughmemoryareprobablynextmostimportantbottlenecktosuperscalarperformance

§ Modernsuperscalarsuseverysophisticatedload-storereorderingtechniquestoreduceeffectivememorylatencybyallowingloadstobespeculativelyissued

108

SpeculativeStoreBuffer§ Justlikeregisterupdates,storesshouldnotmodifythememoryuntilaftertheinstructioniscommitted.Aspeculativestorebufferisastructureintroducedtoholdspeculativestoredata.

§ Duringdecode,storebufferslotallocatedinprogramorder

§ Storessplitinto“storeaddress”and“storedata”micro-operations

§ “Storeaddress”executionwritestag§ “Storedata”executionwritesdata§ Storecommitswhenoldestinstructionandbothaddressanddataavailable:- clearspeculativebitandeventuallymovedatatocache

§ Onstoreabort:- clearvalidbit

109

DataTags

StoreCommitPath

SpeculativeStoreBuffer

L1DataCache


StoreAddress

StoreData

Loadbypassfromspeculativestorebuffer

110

§ Ifdatainbothstorebufferandcache,whichshouldweuse?Speculativestorebuffer

§ Ifsameaddressinstorebuffertwice,whichshouldweuse?Youngeststoreolderthanload

Data

LoadAddress

Tags

SpeculativeStoreBuffer L1DataCache

LoadData


MemoryDependencies

sd x1, (x2)ld x3, (x4)

§ Whencanweexecutetheload?

111

In-OrderMemoryQueue

§ Executeallloadsandstoresinprogramorder

§ =>LoadandstorecannotleaveROBforexecutionuntilallpreviousloadsandstoreshavecompletedexecution

§ Canstillexecuteloadsandstoresspeculatively,andout-of-orderwithrespecttootherinstructions

§ Needastructuretohandlememoryordering…

112

ConservativeO-o-OLoadExecution

sd x1, (x2)ld x3, (x4)

§ Canexecuteloadbeforestore,ifaddressesknownandx4 !=x2

§ Eachloadaddresscomparedwithaddressesofallpreviousuncommittedstores- canusepartialconservativechecki.e.,bottom12bitsofaddress,tosavehardware

§ Don’texecuteloadifanypreviousstoreaddressnotknown

§ (MIPSR10K,16-entryaddressqueue)

113

AddressSpeculation

sd x1, (x2)ld x3, (x4)

§ Guessthatx4 !=x2§ Executeloadbeforestoreaddressknown§ Needtoholdallcompletedbutuncommittedload/storeaddressesinprogramorder

§ Ifsubsequentlyfindx4==x2,squashloadandallfollowinginstructions

§ =>Largepenaltyforinaccurateaddressspeculation

114

MemoryDependencePrediction(Alpha21264)

sd x1, (x2)ld x3, (x4)

§ Guessthatx4 !=x2 andexecuteloadbeforestore

§ Iflaterfindx4==x2,squashloadandallfollowinginstructions,butmarkloadinstructionasstore-wait

§ Subsequentexecutionsofthesameloadinstructionwillwaitforallpreviousstorestocomplete

§ Periodicallyclearstore-waitbits115

Acknowledgements

§ Thesecoursenotesweredevelopedby:- Krste Asanovic (UCB)- Arvind(MIT)- JoelEmer (Intel/MIT)- JamesHoe(CMU)- JohnKubiatowicz (UCB)- DavidPatterson(UCB)

116

Documents

CSC 631: High-Performance Computer Architectureharmanani.github.io/classes/csc631/Notes/L06-ModernOutOf... · 2021. 3. 7. · CSC 631: High-Performance Computer Architecture Spring