an open-source out-of-order RISC-V core...it is open-source wriAen in Chisel (16k loc) It is...

Preview:

Citation preview

BOOMv2anopen-sourceout-of-orderRISC-Vcore

ChristopherCelio,Pi-FengChiu,BorivojeNikolic,DavidPaAerson,KrsteAsanović2017RISC-VWorkshop#7

UC Berkeley

UC Berkeley WhatisBOOM?

2http://ucb-bar.github.io/riscv-boom

◾"BerkeleyOut-of-OrderMachine"◾out-of-order◾superscalar◾ implementsRV64G,bootsLinux◾ Itissynthesizable◾ itisopen-source◾wriAeninChisel(16kloc)◾ Itisparameterizablegenerator◾builtontopofRocket-chipSoCEcosystem

integer

load/store

fp

fetchdec, ren,

& dis

issue/rrdqueues

BOOMv2-2f4i

wb

UC Berkeley BOOMfitsintoRocket-chipSoC

3

BOOM

Rocket

Rocket

UC Berkeley Lotsofneatfeatures

4

◾ advancedbranchpredicTon- BTB,RAS(1-cycle)- gshareandTAGE-basedcondiTonalpredictorimplementaTons(3-cycle)- fitsinsingle-portSRAM

◾ loadscanissueout-of-orderwrttootherloads,stores◾ hardwareperformancecounters(~40events)

FRFlsu

btbbpd I$D$

IRFIQ1IQ0

ROB

imulDFMA

SFMA

IQ2

idiv

mshrs

csr

dec

UC Berkeley ParameterizedSuperscalar

5

valexe_units=ArrayBuffer[ExecutionUnit]()exe_units+=Module(newALUExeUnit(is_branch_unit=true,has_fpu=true,has_mul=true))exe_units+=Module(newALUMemExeUnit(fp_mem_support=true,has_div=true))

Issue Select

RegfileWriteback

dual-issue (5r,3w)

bypassing

ALU

div

LSUAgen D$

bypassing

ALU

FPU

bypassnetwork

RegfileRead

imul

UC Berkeley ParameterizedSuperscalar

5

OR

valexe_units=ArrayBuffer[ExecutionUnit]()exe_units+=Module(newALUExeUnit(is_branch_unit=true,has_fpu=true,has_mul=true))exe_units+=Module(newALUMemExeUnit(fp_mem_support=true,has_div=true))

Issue Select

RegfileWriteback

dual-issue (5r,3w)

bypassing

ALU

div

LSUAgen D$

bypassing

ALU

FPU

bypassnetwork

RegfileRead

imul

exe_units+=Module(newALUExeUnit(is_branch_unit=true))exe_units+=Module(newALUExeUnit(has_fpu=true,has_mul=true))exe_units+=Module(newALUExeUnit(has_div=true))exe_units+=Module(newMemExeUnit())

Issue Select

RegfileWriteback

Quad-issue (9r,4w)

ALU

div

LSUAgen D$

ALU

imul

FPU

ALU

bypassing

bypassnetwork

RegfileRead

Async. FIFOs + level shifters

BOOM Domain (0.5V-1V)

Uncore Domain

VDDVDDCORE

BOOM Core

L1-to-L2 Crossbar

MMIO RouterRead/write all counters

and control registers

1MB L2 CacheBank0

μArchCounters

μArchCounters

JTAGDigital

IO Pads

Configurable Memory Traffic Switcher

To FPGA GPIO

BTB

Load/Store Unit

int int agen

Phys. Int RF (semi-custom

bit cell)

FPU

6KBBr Pred

Issue Units

16KB ScalarInst. Cache

(foundry 6T SRAM Macros)

Arbiter

16KB ScalarData Cache

(foundry 8T SRAM Macros)

Phys.FP RF

Tile Link IO Bidirectional SERDES

UC Berkeley Goodnews

◾ BOOMhas(finally)beentapedout!- 2017Aug15

◾ TSMC28nmHPM◾ Academictest-chip◾ BOOMisinsupportofstudyingotherfeatures

6

BOOMv2tapingoutanout-of-orderprocessorwitha2personteamin4months

ChristopherCelio,Pi-FengChiu,BorivojeNikolic,DavidPaAerson,KrsteAsanović2017CARRVWorkshop

UC Berkeley

ProcessorSiFive U54

Rocket (RV64GC)

Berkeley BOOMv2

UltraSPARC T2

ARM Cortex-A9

Intel Xeon Ivy

Language Chisel Chisel Verilog - SystemVerilog

Core LoC 8,000 16,000 290,000 - NDA

Total LoC 34,000 50,000 1,300,000 - NDA

Foundry TSMC TSMC TI TSMC Intel

Technology 28 nm (HPC)

28 nm (HPM)

65 nm 40 nm (G)

22 nm

Core+L1 Area 0.54 mm2 0.52 mm2 ~12 mm2 ~2.5 mm2 ~12 mm2

Coremark/MHz 2.75 3.92 1.64* 3.71 5.60

Frequency 1.5 GHz 1.2 GHz** 1.17 GHz 1.4 GHz 3.3 GHz

UC Berkeley Comparison:Industry

8

core+L1+L216kB/16kB

ProcessorSiFive U54

Rocket (RV64GC)

Berkeley BOOMv2

UltraSPARC T2

ARM Cortex-A9

Intel Xeon Ivy

Language Chisel Chisel Verilog - SystemVerilog

Core LoC 8,000 16,000 290,000 - NDA

Total LoC 34,000 50,000 1,300,000 - NDA

Foundry TSMC TSMC TI TSMC Intel

Technology 28 nm (HPC)

28 nm (HPM)

65 nm 40 nm (G)

22 nm

Core+L1 Area 0.54 mm2 0.52 mm2 ~12 mm2 ~2.5 mm2 ~12 mm2

Coremark/MHz 2.75 3.92 1.64* 3.71 5.60

Frequency 1.5 GHz 1.2 GHz** 1.17 GHz 1.4 GHz 3.3 GHz

UC Berkeley Comparison:Industry

9

core+L1+L2

*Fromeembc.org.32threads/8coresachieve13Cm/MHz.**EsTmated

16kB/16kB

UC Berkeley LimitsofEducaPonalLibraries

◾ BOOMv1- EducaTonal45nmlibraries- CACTImodelforSRAMs- synthesizedflip-flopsformostnon-cachememoryarrays- goodforcatchingsmallTmingbugs- notaccurateenoughtojusTfysweepingredesigns

◾ BOOMv2- TSMC28nmHPM(highperformancemobile)- LVT-basedstandardcells-memorycompiler(one-andtwo-port)

10

UC Berkeley 4monthsofagiletape-out

11

*ignore the Y-axis: -- too many parameters/variables changing between each run -- doesn't capture DRC violations

UC Berkeley BOOMv1

12

Fetch

Decode &Rename

Unified Issue

Window

PhysicalScalar Register

File (7x3)(PRF)

ROB

Commitwakeup

uop tags

inst

Fetch: 2wIssue: 3-issue

PRF: 7r3w (110)2

2

2

3

LSU

Datacache

ALUFPU iMul

ALUFDiv IDiv

BOOMv1-2f3i

int/idiv/fdiv

load/store

int/fma

fetch

issue/rrdqueues

wbdec, ren,

& dis

◾ shortpipeline- inspiredbyR10K,21264,Cortex-A9

◾ unifiedissuewindow

UC Berkeley BOOMv1:Frontend

13

I$

Fetch1

PC1

Fetch2

µBTB

Fetch0

PC2

hash

TLB

idx

next-pc

taken

I-Cachebackendredirecttarget pre-decode

& target compute

Fetch Buffer

BPD  redirect

UC Berkeley BOOMv1:Frontend

13

I$

Fetch1

PC1

Fetch2

µBTB

Fetch0

PC2

hash

TLB

idx

next-pc

taken

I-Cachebackendredirecttarget pre-decode

& target compute

Fetch Buffer

BPD  redirect

idx

 redirect

I$Fetch Buffer

Fetch1

PC1

Fetch2

+8backendredirecttarget

Fetch0

PC2

BPD

redirect

BTB

Fetch3

hash

TLB

pre-decode & target compute

checker{hit,

target}next-pc

taken

I-Cache

decode info

RAS

I$

Fetch1

PC1

Fetch2

µBTB

Fetch0

PC2

hash

TLB

idx

next-pc

taken

I-Cachebackendredirecttarget pre-decode

& target compute

Fetch Buffer

BPD  redirect

UC Berkeley Frontend

14

BOOMv1

BOOMv2

◾ BTBinSRAM- set-associaTve- parTallytagged- Checkertoverifyintegrity

◾ BPD(CondiTonalPredictor)- hashgetsenTrestage- redirectbasedonBTB- redirectpushedback(I$)

UC Berkeley Regfile:(thefirstP&R)

◾ Regfileblowsup◾ criTcalpathsinissue-select,registerread

15

regfilerob

iq

rename

lsu

mshr

bpdalu

fpu

alu

UC Berkeley BOOMv2

◾ splitunifiedissuewindow(1day)◾ splitphysicalregisterfile(1week)◾ issue/registerreadseparatestages(~1day)◾ 2stagerename(~1week)◾ 3stagefetch(3weeks)

16

Fetch

Decode &Rename

Unified Issue

Window

PhysicalScalar Register

File (7x3)(PRF)

ROB

Commitwakeup

uop tags

inst

Fetch: 2wIssue: 3-issue

PRF: 7r3w (110)2

2

2

3

LSU

Datacache

Before

ALUFPU iMul

ALUFDiv IDiv

Fetch

Decode &Rename1

FP Issue Window

PhysicalFP RegisterFile (3x2)(PFRF)

ROB

Commit

wakeup

uop tags

inst

Fetch: 2wIssue: 4-issue

PIRF: 6r3w (96)PFRF: 3r1w (48)

2

2

2

LSU

Datacache

FPU FDvALU

iMul iDiv

Physical Int Register

File (6x3)(PIRF)

uop tags uop tags

ALU

ALU Issue Window

Mem Issue Window

toInt(LL

wport)

toPFRF(LL wport)

2

I2F

toPFRF(LL wport)

F2I

Rename2 &Dispatch

Afterinteger

load/store

fp

fetchdec, ren,

& dis

issue/rrdqueues

BOOMv2-2f4i

wb

UC Berkeley RegfileChallenges

◾ foundaTonalIPprovidememorieswithsingleanddual-portmemories

◾ companiesbuildtheirownhand-crahedregisterfiles◾ notenoughlaborinacademiatosupportacustomizedregisterfile

17

... ...

WBL0 WBLB0

WWL0

WBLn WBLBn

WWLnWWLn

RWL[m-1,0]

RBL[m-1,0]x n write ports

x m read ports

UC Berkeley Regfile

18

•  Routing 4480 wires causes congestion issues

Q

4480

70x64 flip-flops

........

.............

read_data{5-0}[63-0]

7-de

ep lo

gic

tree

.......

......

...

.......

70 e

ntrie

s

64 bits

read_data{5-0}[63] read_data{5-0}[0]

RF unit cell

•  Reduce read_data routing wires to 384

WD0

WD1

WD2

WS[1:0]

D

EN

WE

Q

OE[5-0]

Tri-state buf [5-0]

RD

[5-0

]

RF unit cell

Congestion!

Tri-state buffer mimics the PG in a custom RF cell so that read data wire can be shared

synthesized array

WS

WD

validaddrdata

decode

WE

Write Port (x3)

addrRead Port (x6)

Regfile Bit Block

2

6RE

Flip flop(dff)D

Q

decode

RD

read data wires

UC Berkeley Regfile:OurquicksoluPon

◾ hand-crahourownbitblockoutofstandardcells- tri-statetodrivereadwires- hierarchicalbitlines

◾ pre-placearraysofbitblocks◾ letrouterhandlethe18wiresper1flip-flop◾ blackboxinChisel- simulatecontrolusingChiselmodel- verifyatgate-level 19

BOOMv1-2f3i

int/idiv/fdiv

load/store

int/fma

fetch

issue/rrdqueues

wb

integer

load/store

fp

fetchdec, ren,

& dis

issue/rrdqueues

BOOMv2-2f4i

wb

UC Berkeley BeforeandAXer

20

UC Berkeley Results

21

UC Berkeley Results

◾ Hardtocompute--notapplestooranges- alotoftheworkwasaboutdesignrulechecks,notperformance

21

UC Berkeley Results

◾ Hardtocompute--notapplestooranges- alotoftheworkwasaboutdesignrulechecks,notperformance

◾ Roughly25%decreaseinclockperiod◾ Roughly20%decreaseinCoremark/MHz- knownissuefromload-usedelay

21

UC Berkeley Results

◾ Hardtocompute--notapplestooranges- alotoftheworkwasaboutdesignrulechecks,notperformance

◾ Roughly25%decreaseinclockperiod◾ Roughly20%decreaseinCoremark/MHz- knownissuefromload-usedelay

◾ Barelyawin?◾ ChangesfixedDRCerrors,geometryerrors(shorts),so...

21

UC Berkeley Results

◾ Hardtocompute--notapplestooranges- alotoftheworkwasaboutdesignrulechecks,notperformance

◾ Roughly25%decreaseinclockperiod◾ Roughly20%decreaseinCoremark/MHz- knownissuefromload-usedelay

◾ Barelyawin?◾ ChangesfixedDRCerrors,geometryerrors(shorts),so...- infinitelyFaster!BeAer!Stronger!

21

UC Berkeley AgileHardwareDevelopment◾ RTLhackingcanbeveryagile

- ~6minutestocompile,build,andrun"riscv-tests"regressionsuite(10KHzforVerilatorsimulator)

- Chiselallowsforquick,far-reachingchanges- generatorapproachallowsforlate-bindingdesigndecisions- smallchanges,improvements(thatdon'taffectfloorplan)areagile

22

UC Berkeley AgileHardwareDevelopment◾ RTLhackingcanbeveryagile

- ~6minutestocompile,build,andrun"riscv-tests"regressionsuite(10KHzforVerilatorsimulator)

- Chiselallowsforquick,far-reachingchanges- generatorapproachallowsforlate-bindingdesigndecisions- smallchanges,improvements(thatdon'taffectfloorplan)areagile

◾ PhysicaldesignisaboAleneck- 2-3hoursforsynthesisresults- 8-24hoursforp&rresults- toomuchvalueprovidedbymanualintervenTon- reportsaredifficulttoreasonabout- sadlycan'tthrowthisatacomputerandget7gooddesignsaweeklater

22

UC Berkeley AgileHardwareDevelopment◾ RTLhackingcanbeveryagile

- ~6minutestocompile,build,andrun"riscv-tests"regressionsuite(10KHzforVerilatorsimulator)

- Chiselallowsforquick,far-reachingchanges- generatorapproachallowsforlate-bindingdesigndecisions- smallchanges,improvements(thatdon'taffectfloorplan)areagile

◾ PhysicaldesignisaboAleneck- 2-3hoursforsynthesisresults- 8-24hoursforp&rresults- toomuchvalueprovidedbymanualintervenTon- reportsaredifficulttoreasonabout- sadlycan'tthrowthisatacomputerandget7gooddesignsaweeklater

◾VerificaTonisanotherboAleneck- IcanwritebugsfasterthanIcanfindthem

22

UC Berkeley FutureDirecPons(forBOOM)

◾ AlotofimprovementstomaketoBOOM- CleardirecTonoffurtherIPC/QoRimprovements

23

UC Berkeley FutureDirecPons(forme)

◾ SeeDaveDitzel'stalkfromyesterdaytolearnmore◾ RISC-VFoundingGoldMember- onlotsofworkinggroups!

◾WearecommiAedtomaintainingtheBOOMopen-sourcerepository

◾We'rehiring...

24

"Esperanto Technologies designs high-performance energy-efficient computing solutions." - riscv.org

UC Berkeley A2-persontapeouttakesavillage!

◾ RISC-VISA- veryout-of-orderfriendly!

◾ ChiselhardwareconstrucTonlanguage- object-oriented,funcTonalprogramming

◾ FIRRTL- exposedRTLintermediaterepresentaTon(IR)

◾ Rocket-chip- AfullworkingSoCplarormbuiltaroundtheRocketin-ordercore

◾ Thanksto:- KrsteAsanović,RimasAvizienis,JonathanBachrach,ScoABeamer,DavidBiancolin,ChristopherCelio,HenryCook,PalmerDabbelt,JohnHauser,AdamIzraelevitz,SagarKarandikar,BenKeller,DonggyuKim,JackKoenig,JimLawson,YunsupLee,RichardLin,EricLove,MarTnMaas,ChickMarkley,AlbertMagyar,HowardMao,MiquelMoreto,QuanNguyen,AlbertOu,DavidA.PaAerson,BrianRichards,ColinSchmidt,WenyuTang,StephenTwigg,HuyVo,AndrewWaterman,AngieWang,andmore...

25

UC Berkeley QuesTons?

◾ Researchpar*allyfundedbyDARPAAwardNumberHR0011-12-2-0016,theCenterforFutureArchitectureResearch,amemberofSTARnet,aSemiconductorResearchCorpora*onprogramsponsoredbyMARCOandDARPA,andASPIRELabindustrialsponsorsandaffiliatesIntel,Google,Huawei,Nokia,NVIDIA,Oracle,andSamsung.

◾ Approvedforpublicrelease;distribu*onisunlimited.Thecontentofthispresenta*ondoesnotnecessarilyreflecttheposi*onorthepolicyoftheUSgovernmentandnoofficialendorsementshouldbeinferred.

◾ Anyopinions,findings,conclusions,orrecommenda*onsinthispaperaresolelythoseoftheauthorsanddoesnotnecessarilyreflecttheposi*onorthepolicyofthesponsors.

26

FundingAcknowledgements

Recommended