Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... ·...

Preview:

Citation preview

ExperiencesUsingtheRISC-VEcosystemtoDesignanAccelerator-CentricSoCinTSMC16nm

TutuAjayi2, KhalidAl-Hawaj1, Aporva Amarnath2, SteveDai1,ScottDavidson4, PaulGao4, Gai Liu1, AnujRao4,AustinRovinski2, Ningxiao Sun4, ChristopherTorng1, LuisVega4,Bandhav Veluri4, ShaolinXie4, ChunZhao4 RitchieZhao1,

ChristopherBatten1,RonaldG.Dreslinski2,RajeshK.Gupta3,MichaelB.Taylor4,Zhiru Zhang1

1 CornellUniversity2 UniversityofMichigan

3 UniversityofCalifornia,SanDiego4 BespokeSiliconGroup,(U.Washington/UCSanDiego)

MICRO-50 October 14, 2017

ComputerArchitectureResearchPrototypingPrototypingis importanttocomplementtheresultsofsimulation-basedresearch

Manybenefitstoprototyping:

• Validatingassumptions• Validating designmethodologies• Measuring realsystem-levelperformanceandenergy efficiency

• Creating platformsforsoftwareresearch• Buildingcredibilitywithindustry• Buildingintuitionforphysicaldesign• Pedagogical benefits• Buildingrealthingsisfun!

Celerity::Introduction

TheContinuingNeedforBuildingPrototypes

Theriseofthe darksiliconera [1],inwhichanincreasingfractionofsiliconmustremainunpowered,ismotivatinganincreasingtrendtowardsaccelerator-centricarchitectures.

Specializationresearchrequires:

• Newsimulation-based evaluationmethodologiesbasedonaccelerators[2]

• Newprototypingmethodologiesforrapidlybuildingaccelerator-centricprototypes

Unfortunately,buildingresearchprototypescanbetremendouslychallenging.

“Shrink” “Dim”

“Specialize” “Magic”

Celerity::Introduction

The FourHorsemenofthe Coming

DarkSilicon Apocalypse

[1]M.Taylor.“IsDarkSiliconUseful?HarnessingtheFourHorsemenoftheComingDarkSiliconApocalypse,”InDesignAutomationConference,2012.[2]Y.Shao,etal.“Aladdin:APre-RTL,Power-PerformanceAcceleratorSimulatorEnablingLargeDesignSpaceExplorationofCustomizedArchitectures”,ISCA2014

Prototyping withtheRISC-VSoftware/Hardware Ecosystem

Celerity::Introduction

SoftwareToolchain

• Acomplete,off-the-shelfsoftwarestack(e.g.,binutils,GCC,newlib/glibc,Linuxkernel&distros)forbothembeddedandgeneral-purpose

Architecture

• RISC-VISAspecificationdesignedtobebothmodularandextensible,withasmallbaseISAandoptionalextensions

Microarchitecture

• On-chipnetworkspecificationsandimplementations(NASTI,TileLink)• RISC-Vprocessorimplementationsforbothin-order(BerkeleyRocket)andout-of-order(BerkeleyBOOM)cores

PhysicalDesign

• PreviousspinsofchipsforreferenceTesting

• Standardcoreverificationtestsuites+Turn-keyFPGAgateware

ApplicationAlgorithm

Operating System

Instruction Set Architecture

Register-Transfer Level

Circuits

Programming Language

Compilers

Microarchitecture

Gate-Level

TechnologyDevices

TheCeleritySystem-on-Chip

Celerity, anaccelerator-centricSoCwithatieredacceleratorfabric

thattargets highlyperformantandenergy-efficientembeddedsystems

FundedbytheDARPACRAFTprogram,“CircuitRealizationAtFasterTimescales”

Thegoalwastodevelopnewmethodologiestodesignchipsmorequickly

Celerity::Introduction

General-Purpose

Tier

MassivelyParallel

Tier

Specialization

Tier

WeleveragedtheRISC-Vsoftware/hardwareecosystem as webuiltCelerity,andwebelieveitwasinstrumentalinenablingateamof20graduatestudentstotapeoutacomplexSoCinonly9months

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

RISC-VVanilla-5

Core

I Mem

XB

AR

NoC

Router

D Mem

BaseJumpF

SBand

Motherboard

Celerity:ChipOverview

• TSMC16nmFFC• 25mm2 diearea(5mmx5mm)• ~385milliontransistors• 511RISC-Vcores

• 5Linux-capableRV64GBerkeleyRocketcores• 496-core RV32IM meshtiledarray“manycore”• 10-coreRV32IMmeshtiledarray(lowvoltage)

• Binarized NeuralNetworkSpecializedAccelerator• On-chipsynthesizablePLLsandDC/DCLDO

• Developedin-house• 3Clockdomains

• 400MHz– DDRI/O• 625MHz– Rocketcore+Specializedaccelerator• 1.05GHz– Manycorearray

• 672-pinflipchipBGApackage• 9-monthsfromPDKaccesstotape-out

Celerity::Introduction

http://www.opencelerity.org

Agenda

• Introduction• ForeachTier:

• Whatdidwebuild?• Howdidwebuildit?• RISC-VEcosystemSuccesses• RISC-VEcosystemChallenges

• Conclusion

Celerity::Introduction

General-Purpose

Tier

MassivelyParallel

Tier

Specialization

Tier

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

RISC-VVanilla-5

Core

I Mem

XB

AR

NoC

Router

D Mem

BaseJumpF

SBand

Motherboard

Celerity:General-Purpose Tier

Celerity::General-PurposeTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V• ChallengeswithRISC-V

General-Purpose

Tier

MassivelyParallel

Tier

Specialization

Tier

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

RISC-VVanilla-5

Core

I Mem

XB

AR

NoC

Router

D Mem

BaseJumpF

SBand

Motherboard

General-PurposeTierOverview

• 5 BerkeleyRocketCores(RV64G)• Workload

• General-purposecompute• Operatingsystem(e.g.Linux&TCP/IPStack)• InterruptandExceptionhandling• Programdispatchandcontrolflow

• Interface• Interfacetooff-chipI/Oandotherperipherals• 4Coresconnecttothemanycore array• 1CoreinterfaceswiththeBNN

• Memory• Eachcoreexecutesindependentlywithinitsownaddressspace

• Memorymanagementforalltiers

Celerity::General-PurposeTier::Whatisit?• Howdidwebuildit?• Successeswith RISC-V• ChallengeswithRISC-V

Man

ycore

BNN

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

BaseJumpF

SB

BaseJumpM

otherboard

Berkeley RocketCores

• 5BerkeleyRocketCores(https://github.com/freechipsproject/rocket-chip)

• GeneratedfromChisel• RV64GISA• 5-stage,in-order,scalarprocessor• Double-precisionfloatingpoint• I-Cache:16KB4-wayassoc.• D-Cache:16KB4-wayassoc.

• PhysicalImplementation• 625MHz(CriticalpathinFSB)• 0.19mm2 percore

http://www.lowrisc.org/docs/tagged-memory-v0.1/rocket-core/

Celerity::General-PurposeTier::Whatisit?• Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

DesignIterations2.Alpaca

3.Bison 4.Coyote

1.Loopback

Bas

eJum

pM

othe

rboa

rd

Bas

eJum

pFS

B Loopback FIFO

Bas

eJum

pM

othe

rboa

rd

Bas

eJum

pFS

B

NA

STI RISC-V Rocket Core

I-CacheD-CacheN

AST

I RISC-V Rocket Core

I-CacheD-Cache

RoCC AcceleratorBas

eJum

pM

othe

rboa

rd

Bas

eJum

pFS

B

Celerity::General-PurposeTier::Whatisit?• Howdidwebuildit?• SuccesseswithRISC-V•ChallengeswithRISC-V

ImplementedNASTIbridgeandconnectedrocketcoreBaselinedesigntovalidateFSBandNorthbridge

ImplementedacceleratorconnectedthroughBlackboxed RoCC ModularizedRoCCinterfacetoaccelerator

Bas

eJum

pM

othe

rboa

rd

Bas

eJum

pFS

B NA

STI RISC-V Rocket Core

I-CacheD-Cache RoC

C

Accelerator

… …

BaseJump Motherboard Celerity SoC

Off-ChipInterfaceandNorthbridge

• Open-sourceBaseJumpIPLibrary• http://bjump.org

• FrontSidebus• BaseJumpCommunicationLink• HighSpeed(DDR)Source-SynchronousCommunicationInterface

• Packaging• ModifiedBaseJumpBGAPackageandI/ORing

• Validation• BaseJumpSuperTroublePCB (DaughterCard)• BaseJumpMotherboard(ZedBoard)

DRAM Controller

Ethernet

SSD

L2 $

JTAG

Bas

eJum

pFS

B &

FPG

A B

ridge

NA

STI RISC-V Rocket Core

I-CacheD-Cache RoC

C

NA

STI RISC-V Rocket Core

I-CacheD-Cache RoC

C

NA

STI RISC-V Rocket Core

I-CacheD-Cache RoC

C

NA

STI RISC-V Rocket Core

I-CacheD-Cache RoC

C

NA

STI RISC-V Rocket Core

I-CacheD-Cache RoC

C

Bas

eJum

pFP

GA

Brid

ge

Clocks

...

Celerity::General-PurposeTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

RISC-VSuccesses

• BerkeleyRocketCores• Veryquicklygeneratedvalidateddesigns• Vibrantecosystemtoprovidefeedbackandsupport• TestandValidationinfrastructure• SoftwareandToolchainsupport

• FlexiblememorysystemandperipheralI/Osupport• EasyintegrationwithBaseJump IPLibrary

• Balancesextensibilityandsoftwaresupport

Celerity::General-PurposeTier::Whatisit?• Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

RISC-VLessonsLearned

• Componentstability,compatibilityandversioning• Chiseladoption• RTLsimulationissues

• DecipheringChiselgeneratedRTL• RegisterinitializationandX-Pessimism

Celerity::General-PurposeTier::Whatisit?• Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

RISC-VVanilla-5

Core

I Mem

XB

AR

NoC

Router

D Mem

BaseJumpF

SBand

Motherboard

Celerity:MassivelyParallel Tier

General-Purpose

Tier

MassivelyParallel

Tier

Specialization

Tier

DevelopedbyTaylor’sBespokeSiliconGroup@UW

Celerity::Massively ParallelTier::Whatisit?• Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

http://bjump.org/manycore

Thetiledarchitecture

TheVanillacore: SimplebutefficienttorunCcodewithoutanytoolchainmodification

• ISA:RV32IM• Pipeline:5-stage,fullyforwarded,in-order,singleissue

• Scratchpadmemory:4KBforIMem,4KBforDMem

• SecondTape-outofthistiledarchitecture(10-core)

...

… … …...

...

...

...

… … …

NOCRouter

RISC-V Core

MEM

C

ross

bar

DMEM

IMEM

496RISC-VCores

Celerity::Massively ParallelTier::Whatisit? • Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

MeshNetwork

• LinkProtocol:Forward/Reversepaths,parameterizable address/databits

• Credit-Based:Eachpacketwillbeacknowledgedwithresponse

• Flowcontrol:Endpointcontrolsthenumberoftheoutstandingpacket.

• Router:simpleXY-dimensionrouting,buffered

17Celerity::MassivelyParallelTier::Whatisit? • Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

...

...

...

Forward packetForward responseReverse packetReverse response

buffered

router

tilelink protocol

Manycore LinkstoGeneral-PurposeandSpecializedTier

CrossClockDomaininterface

• ToGeneral-PurposeTier:ConvertRoCCtolinkprotocol,supportconfiguringDMA,writeandresetmanycore etc.

• ToSpecializedTier:Aggregatelinkinterfacetoincreasethebandwidthandthroughput

Asy

nc F

IFO

Endp

oint

DM

A

L1D Cache

Core

req

resp

cmdrespbusy

link_to_rocc Router

...

… … …...

...

...

...

… … …

Rocket

Rocket

Rocket

Rocket

RoCC

RoCC

RoCC

RoCC

General-Purpose Tier clock domain

Massively Parallel Tier clock domain

Specialized Tier clock domain

Asy

nc

FIFO

Asy

nc

FIFO

Asy

nc

FIFO

Asy

nc

FIFO

32

32

32

32

64

64

64

Celerity::Massively ParallelTier::Whatisit? • Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

Cross ClkDomain

Cross Clk Domain

ProgrammingModel

Producer-consumerprogrammingmodel:

extendedinstructionsforefficientinter-tilesynchronization

• LoadReserved(lr.w):loadvalueandsetthereservationaddress

• Load-on-broken-reservation(lr.lbr):stallifthereservedaddressdidn’twrittenbyothercores

• Consumer: waiton<address,value>• Benefits:Nopolling,nointerrupt,fastresponse,stalledpipelinecansavepower

InputSplit Join

Feedback

Pipeline

Output

Producer-consumer Programming

DMEM

Core A Core B

NoC

Remote store

Reserved Address

Invoke pipelineStalled Pipeline

waiting for events

Celerity::Massively ParallelTier::Whatisit? • Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

ThreadDensity Comparison

[1]J.Balkind,etal.“OpenPiton:AnOpenSourceManycoreResearchFramework,”intheInternationalConferenceonArchitecturalSupportforProgrammingLanguagesandOperatingSystems(ASPLOS),2016.[2]R.Balasubramanian,etal."EnablingGPGPULow-LevelHardwareExplorationswithMIAOW:AnOpen-SourceRTLImplementationofaGPGPU,"inACMTransactionsonArchitectureandCodeOptimization(TACO). 12.2(2015):21.

ConfigurationNormalizedArea

(32nm)

Area

Ratio

CelerityTile@16nm

D-MEM=4KBI-MEM=4KB

0.024*(32/16)2=0.096mm2 1x

OpenPitonTile@32nm

L1D-Cache=8KBL1I-Cache=16KB

L1.5/L2Cache=72KB1.17mm2 [1] 12x

RawTile@180nm

L1D-Cache=32KBL1I-SRAM=96KB

16.0*(32/180)2=0.506mm2 5.25x

MIAOWGPUComputeUnitLane

@32nm

VRF=256KBSRF=2KB

15.0/16=0.938mm2[2] 9.75x

Celerity::Massively ParallelTier::Whatisit? • Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

• Timing:1.05GHz@16nm• Area:0.024mm2 @16nm• SiUtilizationRatio:90%

NormalizedPhysicalThreads(ALUops)perArea

Howdidwebuildthemassivelyparalleltier?

BasejumpSTLlibrary

Dataflow

NoC

Arithmetic

RISC-Vtoolchain

AssemblyTestSuite

ModifiedRuntime

CCompiler

…Inhouse

Design Testing

I Mem D MemRF

Hard-macro

One tile

Floorplan

RISC-VVanilla-5

Core

I Mem

XB

AR

NoC

Router

D Mem

Celerity::Massively ParallelTier::Whatisit?• Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

HierarchicalFlow

RISC-VEcosystemSuccesses

• ModularISA• Flexibleforbothcomplexcores(i.e.Rocket) and simplecores(i.e.Vanilla)

• ExtensibleRoCCinterface• 4 customizableinstructions:weusedone

• Comprehensiveassembly testsuite(434testcases)• Off-the-shelftoolchain

Celerity::Massively ParallelTier::Whatisit?•Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

BuildinguptheRISC-VEcosystem

We provideanefficientRV-32IM implementation

inSystemVerilog.

We consolidatedInformationaboutRoCCthat

wasscattered acrosstheinternet.

WithCelerity

• Efficientopensourcecore• BasedonSystemverilog• Siliconproven

• PublicRoCC documentV.2[bjump.org/rocc_doc ]

• ExportedRoCCinterfaceontoplevel

Celerity::Massively ParallelTier::Whatisit?•Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

Celerity:SpecializationTier

General-Purpose

Tier

MassivelyParallel

Tier

Specialization

Tier

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-CacheN

AST

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

RISC-VVanilla-5

Core

I Mem

XB

AR

NoC

Router

D Mem

BaseJumpF

SBand

Motherboard

CaseStudy:MappingFlexibleImageRecognitiontoaTieredAcceleratorFabric

Threestepstomapapplicationstotieredacceleratorfabric:

Step1. Implementthealgorithmusingthegeneral-purposetierStep2. Acceleratethealgorithmusingeitherthemassively

paralleltierOR thespecializationtierStep3. Improveperformancebycooperativelyusingboththe

specializationAND themassivelyparalleltier

Convolution Pooling Convolution Pooling Fully-connected

bird(0.02)boat(0.94)

cat(0.04)dog(0.01)

MassivelyParallel Tier

Specialization Tier

General-PurposeTier

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Step1:AlgorithmtoApplicationBinarizedNeuralNetworks

• Trainingusuallyusesfloatingpoint,whileinferenceusuallyuseslowerprecisionweightsandactivations(often8-bitorlower)toreduceimplementationcomplexity

• Rastergari etal.[3]andCourbariaux etal.[4]haverecentlyshownsingle-bitprecisionweightsandactivationscanachieveanaccuracyof89.8%onCIFAR-10

• Performancetargetrequiresultra-lowlatency(batchsizeofone)andhighthroughput(60classifications/second)

[3]M.Rastergari,etal.“Xnor-net:Imagenet classificationusingbinaryconvolutionalneuralnetworks,”InEuropeanConferenceonComputerVision,2016.[4]M.Courbariaux,etal.“Binarizedneuralnetworks:Trainingdeepneuralnetworkswithweightsandactivationsconstrainedto+1or-1,”arXiv preprintarXiv:1602.02830(2016).

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Step1:AlgorithmtoApplicationCharacterizingBNNExecution

• Usingjustthegeneral-purposetierwould be200xslowerthanthe performancetarget(60classifications/sec)• Binarizedconvolutionallayersconsumeover97%ofdynamicinstructioncount• Perfectaccelerationofjustthebinarizedconvolutionallayersisstill5xslowerthanperformancetarget• Perfectaccelerationofalllayersusingthemassivelyparalleltiercouldmeetperformancetargetbutwithsignificantenergyconsumption

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Step 2:ApplicationtoAcceleratorBNNSpecializedAccelerator

1. AcceleratorisconfiguredtoprocessalayerthroughRoCCcommandmessages

2. MemoryUnitstartsstreamingtheweightsintotheacceleratorandunpackingthebinarizedweightsintoappropriatebuffers

3. Binaryconvolutioncomputeunitprocessesinputactivationsandweightstoproduceoutputactivations

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Off-Ch

ipI/O

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

Step2:ApplicationtoAcceleratorGeneral-PurposeTierforWeightStorage

• The BNN specialized accelerator can use one of the Rocket cores’ caches to load every layer’s weights

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Step3:AssistingAcceleratorsMassivelyParallelTierforWeightStorage

Off-Ch

ipI/O

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

• The BNN specialized accelerator can use one of the Rocket cores’ caches to load every layer’s weights

• Each core in the massively parallel tier executes a remote-load-store program to orchestrate sending weights to the specialization tier via a hardware FIFO

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

PerformanceBenefitsofCooperativelyUsingtheMassivelyParallelandtheSpecializationTiers

General-Purpose Tier Software implementation assuming ideal performance estimated with an optimistic one instruction per cycle

Specialization Tier Full-system RTL simulation of the BNN specialized accelerator running with a frequency of 625 MHz

Specialization + MassivelyParallel Tiers

Full-system RTL simulation of the BNN specialized accelerator with the weights being streamed from the manycore

General-Purpose Tier Specialization Tier Specialization + Massively Parallel Tiers

Runtime per Image (ms) 4,024 20 3.3

Power (Watts) 0.2 – 0.5 0.2 – 0.5 0.5 – 2.0

Improvement in Perf / Power 1x ~200x ~400x

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

DesignMethodology

SystemCConstraints

StratusHLS

RTL

PyMTL

Wrappers&

Adapters

FinalRTL

void bnn::dma_req() {while( 1 ) {DmaMsg msg = dma_req.get();

for ( int i = 0; i < msg.len; i++ ) {HLS_PIPELINE_LOOP( HARD_STALL, 1 );

int req_type = 0;word_t data = 0;addr_t addr = msg.base + i*8;

if ( type == DMA_TYPE_WRITE ) {data = msg.data;req_type = MemReqMsg::WRITE;} else {req_type = MemReqMsg::READ;}

memreq.put(MemReqMsg(req_type,addr,data));}

dma_resp.put(DMA_REQ_DONE);}}

Including

RoCCInterfaces

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

DesignMethodology

SystemCConstraints

StratusHLS

RTL

PyMTL

Wrappers&

Adapters

FinalRTL HardMacro

ASICFlow

Constraints

Files

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Including

RoCCInterfaces

RISC-VEcosystemSuccessesandChallenges

Successes

• TheRoCCcommandandmemoryinterfacewerebothsignificantsuccesses.Weconnected the acceleratorwithnochangestoRV64Gcore,justaswedidforthemanycorearrayinthemassivelyparallel tier.

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Challenges

• SmallchallengeintheRoCCacceleratorinterfaceatthespecificcommitwechosetouse

• MemorymanagementunitinRV64Gusedonlyphysicaladdresses

• Wedidasmallworkaroundtogiveusvirtualaddressesaswell

• Thischallengehasalreadybeenfixedupstream

TheCeleritySystem-on-Chip

Celerity, anaccelerator-centricSoCwithatieredacceleratorfabricthat

targetshighlyperformantandenergy-efficientembeddedsystems

Celerity’s goal wastodevelopnewmethodologiestodesignchipsmorequickly

WebelievetheRISC-Vsoftware/hardwareecosystem wasinstrumentalinenablingateamof20graduatestudents

totapeoutacomplexSoC inonly9months

General-Purpose

Tier

MassivelyParallel

Tier

Specialization

Tier

Wethankthemanycontributorstotheopen-sourceRISC-VsoftwareandhardwareecosystemwithspecialthankstoU.C.

BerkeleyforformingtheRISC-Vecosystem

Celerity::Conclusion

Acknowledgements:DARPA,undertheCRAFTprogram

SpecialthankstoDr.LintonSalmonforprogramsupportandcoordination

http://www.opencelerity.org

Recommended