Evaluation of RISC-V RTL with FPGA-Accelerated Simulation
Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach, Krste Asanovic
CARRV 2017, 10/14/2017
BAR
Evaluation Methodologies For Computer Architecture Research
Analytic Power/Energy Modeling
Microarchitectural Cycle-by-Cycle Software Simulators
Simulation Sampling
How To Do Computer Architecture Research In The Past?
Few lines of C/C++ code
Paper
100M~1B instructions
How Easy Single-Cycle Perfect Caches?
▪ 10 lines of C++ code in MARSSx86 → Easy to implement unrealistic, non-cycle-accurate models
No µarch Simulation Any More!
▪ Validation is difficult
  - Is your modeling still cycle-accurate?
▪ Simulation is too slow → Simulation Sampling
Validation of one design instance [1]
Other design points?
Your target modeling?
[1] Gutierrez et al. Sources of error in full-system simulation, ISPASS 2014
No Simulation Sampling!
▪ Phase-based Sampling (e.g. SimPoint)
  - No theoretically guaranteed error bounds
  - Short periods of phases show similar IPC whenever repeated
  - 401.bzip2 with its reference input in BOOM-2w
▪ Statistical Sampling (e.g. SMARTS)
  - Statistically bounded errors
  - State warming problems
    • µarchitectural state should be recovered from functional simulators
▪ What about managed-language applications?
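The bounded errors of statistical sampling can be illustrated with a confidence interval on the mean CPI of the sampled units. This is a minimal sketch, assuming independent samples and a normal approximation (z = 1.96 for roughly 95% confidence). The function name and the sample values are hypothetical and are not taken from SMARTS or from the talk.

```python
import math
import statistics

def cpi_confidence_interval(cpi_samples, z=1.96):
    """Return (mean, half_width) for the mean CPI of the sampled units.

    Assumes independent samples and a normal approximation; z=1.96
    corresponds to roughly 95% confidence.
    """
    n = len(cpi_samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(cpi_samples)
    std = statistics.stdev(cpi_samples)   # sample standard deviation
    half_width = z * std / math.sqrt(n)   # standard error scaled by z
    return mean, half_width

# Hypothetical per-sample CPI measurements (not data from the talk).
samples = [1.10, 1.25, 0.98, 1.40, 1.15, 1.05, 1.30, 1.20]
mean, hw = cpi_confidence_interval(samples)
print(f"CPI = {mean:.3f} +/- {hw:.3f} ({100 * hw / mean:.1f}% relative error)")
```

The half-width shrinks with the square root of the number of samples, which is why more samples tighten the bound. The sketch does not address the state warming problem listed above.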
Build RTL To Validate Your Design Ideas!
▪ RTL is no longer difficult
  - Hardware construction languages (e.g. Chisel)
  - RISC-V implementation codebase (e.g. Rocket Chip)
▪ RTL simulation is no longer slow
  - Tens of MIPS using FPGAs
[Figure: flow diagram with labels: Hardware Specification; RTL Implementation; FPGA-Accelerated Simulation; Performance Evaluation; CAD Tools; Area, Timing, Power Evaluation]
Old BOOM Layout
[Figure: floorplan of the old BOOM layout. Labeled blocks: RegFile, ICache, Uncore, LSU, Rename Table, FPU, ROB, Free List, Issue Window, Branch Predictor, ALUs, Fetch Buffer, DCache, DCache Control, IDIV, IMUL, Busy Table, Bypasses]
MIDAS: Turn Your RTL into Gold
▪ Automatically transforms any RTL design into an FPGA-accelerated RTL simulator
▪ Quickly evaluates performance, power, and energy of RTL designs with realistic software applications
▪ Flexibly co-simulates abstract HW & SW models
Target RTL
MIDAS Custom Compiler Passes
▪ FIRRTL: IR for RTL transforms (https://github.com/freechipsproject/firrtl)
▪ Instrument RTL designs for
  - Accurate performance modeling
  - Easy simulation control in FPGA
  - Interactions with abstract timing models
  - RTL state snapshots for energy modeling
[Figure: Target RTL Design → FIRRTL Compiler → FPGA-Accelerated RTL Simulator. Compiler passes shown: Macro Mapping, FAME1 Transform, Scan Chain Insertion, Simulation Mapping, Platform Mapping]
Mapping Simulation to the FPGA Host
[Figure: target system (I/O Devices, Processor, L2 Cache / Main Memory) mapped onto an FPGA board (e.g. Xilinx Zynq). Labels: Software, FPGA, I/O Devices, Processor, Memory System Timing, Simulation Driver, Board DRAM, L2 Cache / Main Memory, I/O Endpoints]
▪ Simulation Speed: 3.56 MHz (ISCA '16) → ~40 MHz (CARRV '17)
▪ I/O Endpoints
  - Low-level timing tokens <-> High-level transactions
  - Optimize the communications between SW & FPGA
▪ L2 / Main Memory
  - Abstract timing model (L2$ tags) in FPGA
  - Actual data in Board DRAM
  - Set size, associativity, block size, latency: runtime configurable
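The idea of an abstract timing model that keeps only tags, with the data living elsewhere, can be sketched in software. The class below is an illustrative assumption, not the MIDAS implementation: the class name, the LRU replacement policy, the 64-byte block size, and the 8-way associativity are all hypothetical. The 1 MiB capacity and the 23-cycle and 80-cycle latencies are the values given in the slides.

```python
class TagOnlyCacheModel:
    """Timing-only set-associative cache: keeps tags and LRU order, no data."""

    def __init__(self, num_sets, ways, block_bytes, hit_latency, miss_latency):
        self.configure(num_sets, ways, block_bytes, hit_latency, miss_latency)

    def configure(self, num_sets, ways, block_bytes, hit_latency, miss_latency):
        """(Re)configure the model at runtime; all tags are invalidated."""
        self.num_sets = num_sets
        self.ways = ways
        self.block_bytes = block_bytes
        self.hit_latency = hit_latency
        self.miss_latency = miss_latency
        # One list of tags per set, ordered from least to most recently used.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, addr):
        """Return the latency in cycles for an access to byte address addr."""
        block = addr // self.block_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)      # hit: move the tag to the MRU position
            tags.append(tag)
            return self.hit_latency
        if len(tags) >= self.ways:
            tags.pop(0)           # miss in a full set: evict the LRU tag
        tags.append(tag)
        return self.miss_latency

# 1 MiB = 2048 sets x 8 ways x 64-byte blocks (ways and block size are assumed).
l2 = TagOnlyCacheModel(num_sets=2048, ways=8, block_bytes=64,
                       hit_latency=23, miss_latency=23 + 80)
print(l2.access(0x1000), l2.access(0x1000))
```

Calling `configure` again with different parameters mimics the runtime configurability mentioned above without rebuilding the model.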
Memory Timing Model Validation
▪ A pointer-chase µbenchmark of ccbench running in BOOM (https://github.com/ucb-bar/ccbench)
▪ L1$: 16 KiB, 6 cycles
▪ L2$: 1 MiB, 6 + 23 cycles (runtime configurable)
▪ DRAM: 6 + 23 + 80 cycles (runtime configurable)
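The latencies above imply a step-shaped curve of cycles per load versus working-set size. The sketch below encodes that idealized expectation, assuming each load hits the smallest level that holds the whole working set and ignoring TLB misses, conflict misses, and prefetching. The function name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB

# Capacities in bytes and cumulative latencies in cycles, from the slide.
L1_BYTES, L1_CYCLES = 16 * KIB, 6
L2_BYTES, L2_CYCLES = 1 * MIB, 6 + 23
DRAM_CYCLES = 6 + 23 + 80

def expected_latency(working_set_bytes):
    """Idealized cycles per load for a pointer chase over the working set."""
    if working_set_bytes <= L1_BYTES:
        return L1_CYCLES
    if working_set_bytes <= L2_BYTES:
        return L2_CYCLES
    return DRAM_CYCLES

for size in (8 * KIB, 256 * KIB, 16 * MIB):
    print(f"{size // KIB:>6} KiB -> {expected_latency(size)} cycles")
```

Comparing such a step function with the measured pointer-chase latencies is one simple way to check that a timing model behaves as configured.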
Target Design: Rocket Chip Generator
Parameter | Rocket (In-order Processor) | BOOM-2w (Version 1) (Out-of-order Processor)
Fetch-width | 1 | 2
Issue-width | 1 | 3
Issue slots | 20
ROB size | 80
Ld/St entries | 16/16
Physical registers | 32 (int) / 32 (fp) | 110
Branch predictor | gshare: 16 KiB history
BTB entries | 40 | 40
RAS entries | 2 | 4
MSHR entries | 2 | 2
L1 I$ / D$ | 16 KiB or 32 KiB
ITLB / DTLB reaches | 128 KiB / 128 KiB
L2 $ | 1 MiB / 23 cycles
DRAM latency | 80 cycles
SPEC2006int Benchmark Suite
▪ Instruction count with RISC-V for the reference inputs
▪ Instruction Count Average: 2.08 Trillion
▪ 445.gobmk, 456.hmmer, 462.libquantum: fail in BOOM
Benchmark | Instruction Count (T)
400.perlbench | 2.48
401.bzip2 | 3.08
403.gcc | 1.37
429.mcf | 0.29
445.gobmk | 2.04
456.hmmer | 2.95
458.sjeng | 2.85
462.libquantum | 2.09
464.h264ref | 5.07
471.omnetpp | 0.61
473.astar | 1.05
483.xalancbmk | 1.10
Simulation Time for SPEC2006int
Simulators | Speed | Average | Max (464.h264ref: 5T)
GEM5+Ruby | 100 KIPS [1] | 240 days (8 months) | 640 days (1.8 years)
MARSSx86 | 400 KIPS [2] | 60 days (2 months) | 160 days (5.3 months)
MIDAS | 18 MIPS (40 MHz) | 1.4 days | 3 days
[1] http://gem5-users.gem5.narkive.com/hCJL6O2Q/gem5-simulation-speed
[2] Patel et al. MARSSx86: A full system simulator for x86 CPUs, DAC 2011.
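The Average column follows from dividing the mean instruction count by each simulator's speed. The sketch below redoes that arithmetic using the instruction counts from the SPEC2006int table and the speeds listed above. The helper name is hypothetical, and the calculation assumes a constant simulation rate over the whole run.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Instruction counts in trillions, from the SPEC2006int table.
instruction_counts_t = {
    "400.perlbench": 2.48, "401.bzip2": 3.08, "403.gcc": 1.37,
    "429.mcf": 0.29, "445.gobmk": 2.04, "456.hmmer": 2.95,
    "458.sjeng": 2.85, "462.libquantum": 2.09, "464.h264ref": 5.07,
    "471.omnetpp": 0.61, "473.astar": 1.05, "483.xalancbmk": 1.10,
}

# Simulation speeds in instructions per second, from the table above.
speeds_ips = {"GEM5+Ruby": 100e3, "MARSSx86": 400e3, "MIDAS": 18e6}

def simulation_days(instructions, ips):
    """Days needed to simulate `instructions` at a constant rate of `ips`."""
    return instructions / ips / SECONDS_PER_DAY

average_t = sum(instruction_counts_t.values()) / len(instruction_counts_t)
print(f"average instruction count: {average_t:.2f} T")
for name, ips in speeds_ips.items():
    days = simulation_days(average_t * 1e12, ips)
    print(f"{name}: {days:.1f} days on average")
```

The results come out close to the Average column: roughly 241 days, 60 days, and 1.3 days respectively.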
Case Study: IPC
[Figure: IPC (axis 0.0 to 1.2) for Rocket 16KiB L1, Rocket 32KiB L1, BOOM-2w 16KiB L1, BOOM-2w 32KiB L1, and Cortex A9; off-scale data label: 1.41]
▪ Rocket is comparable to Cortex A9
▪ BOOM outperforms Cortex A9
▪ Performance improvement: 16 KiB L1$ → 32 KiB L1$
Case Study: MPKIs of BOOM-2w
[Figure: MPKI (axis 0 to 60) for Conditional Branch, Indirect Branch, L1 I-Cache, ITLB, L1 D-Cache, DTLB, L2 Cache]
▪ 40 BTB entries, 32 KiB L1$, 1 MiB L2$, TLB reach = 128 KiB
▪ 471.omnetpp, 483.xalancbmk: Large indirect branch MPKIs → Bigger BTBs
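MPKI is the number of miss or misprediction events per thousand retired instructions. A minimal sketch of the calculation follows. The counter values are hypothetical placeholders, not measurements from the talk.

```python
def mpki(miss_events, instructions):
    """Misses (or mispredictions) per kilo-instruction."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return 1000.0 * miss_events / instructions

# Hypothetical counter values, for illustration only.
counters = {"Indirect Branch": 4_200_000, "L1 D-Cache": 31_000_000}
retired = 1_000_000_000
for event, count in counters.items():
    print(f"{event}: {mpki(count, retired):.1f} MPKI")
```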
Case Study: Issue Queue Utilization
[Figure: Issue Slots / Issue Width (%) (axis 0 to 100), split into Issued Slots, Empty Slots, and Non-issued Slots]
▪ Issued Slots / Cycle = IPC
▪ Empty Slots: frontend hazards, lack of resources (403.gcc)
▪ Non-issued slots: backend hazards
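The breakdown above can be computed from three counters: cycles, issued slots, and empty slots. The sketch below assumes that every cycle offers `issue_width` slots and that the slots left over after issued and empty ones are the non-issued slots. The function name and the counter values are hypothetical. The issue width of 3 is the BOOM-2w value from the target design table.

```python
def issue_slot_breakdown(cycles, issue_width, issued, empty):
    """Split all issue slots into issued, empty, and non-issued percentages."""
    total = cycles * issue_width
    non_issued = total - issued - empty
    if total <= 0 or min(issued, empty, non_issued) < 0:
        raise ValueError("inconsistent counters")
    return {
        "issued_pct": 100.0 * issued / total,
        "empty_pct": 100.0 * empty / total,
        "non_issued_pct": 100.0 * non_issued / total,
        # Per the slide, issued slots per cycle corresponds to IPC.
        "issued_per_cycle": issued / cycles,
    }

# Hypothetical counters, for illustration only.
result = issue_slot_breakdown(cycles=1000, issue_width=3, issued=1200, empty=900)
print(result)
```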
Strober Power/Energy Modeling
[Figure: flow diagram with labels: Full Program Execution In FPGA, RTL State Snapshots, I/O Traces, RTL Simulation, Post-synthesis Designs, RTL Signal Activities, PrimeTime PX, Average Power]
▪ No state warming is necessary in RTL/gate-level simulation
Case Study: Power Breakdown
[Figure: stacked power (mW, axis 0 to 600) for Rocket and BOOM-2w on 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 458.sjeng, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk. Components: Misc, Uncore, L1 D-cache control, L1 D-cache meta + data, L1 I-cache, ROB, LSU, FPU, Integer Unit, Issue Logic, Register File, Rename + Control, Branch Predictor, Fetch Unit]
▪ Synopsys 32nm educational technology
▪ 50 random sample RTL state snapshots
▪ Pipeline Utilization ↑ → Power ↑
▪ BOOM: 40% power from Rename Logic & Register File
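One way to draw a fixed number of random snapshots (50 on this slide) from a run whose length is not known in advance is reservoir sampling. The sketch below is a generic reservoir sampler (Algorithm R). It is an assumption about how such samples could be picked, not a description of the tool's implementation, and `snapshot_points` stands in for hypothetical snapshot candidates.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Pick k items uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                reservoir[j] = item      # replace with probability k / (i + 1)
    return reservoir

# Hypothetical snapshot candidates: one every 1000 cycles of a 1M-cycle run.
snapshot_points = range(0, 1_000_000, 1000)
chosen = reservoir_sample(snapshot_points, 50, random.Random(0))
print(len(chosen), "snapshots selected")
```

Each selected snapshot would then be replayed for power analysis, and the average over the samples gives the power estimate.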
Case Study: Energy Per Instruction
[Figure: EPI (pJ / Inst, axis 0 to 1000) for Rocket 32 KiB L1 and BOOM-2w 32 KiB L1; off-scale data labels: 1923.4, 1027.0]
▪ BOOM-2w is performant, Rocket is more energy efficient
▪ 429.mcf, 471.omnetpp: low power, less energy efficient
▪ Pipeline Utilization ↑ → Energy Efficiency ↑
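Energy per instruction relates average power to instruction throughput: EPI = power / (IPC × clock frequency). The sketch below shows the unit conversion to pJ per instruction. The function name and all numeric inputs are hypothetical, including the clock frequency, which is not stated in the slides.

```python
def energy_per_instruction_pj(avg_power_mw, ipc, clock_mhz):
    """EPI in pJ/instruction from average power, IPC, and clock frequency."""
    if ipc <= 0 or clock_mhz <= 0:
        raise ValueError("ipc and clock_mhz must be positive")
    instructions_per_second = ipc * clock_mhz * 1e6
    joules_per_instruction = (avg_power_mw * 1e-3) / instructions_per_second
    return joules_per_instruction * 1e12   # J -> pJ

# Hypothetical operating points, for illustration only.
print(energy_per_instruction_pj(avg_power_mw=100.0, ipc=0.5, clock_mhz=1000.0))
print(energy_per_instruction_pj(avg_power_mw=100.0, ipc=0.1, clock_mhz=1000.0))
```

At equal power, a lower IPC gives a higher EPI, which is consistent with the observation that low pipeline utilization hurts energy efficiency.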
On-going MIDAS Research at UCB
▪ Debugging, validation for RTL designs
  - Assertion detection
  - Commit log comparison with Spike
▪ Memory system timing modeling
  - Realistic DRAM timing models in FPGA
▪ FireSim: datacenter simulation (fires.im)
  - Connect Rocket Chips with NICs & switch timing models
  - Amazon blog post, 7th RISC-V workshop
MIDAS (Strober) Is Open-Source Now
Your Processors / Accelerators
▪ Examples: https://github.com/ucb-bar/midas-examples
▪ Rocket Chip template: https://github.com/ucb-bar/midas-top-release
▪ Will support various host platforms (Xilinx Zynq, Amazon EC2 F1, Intel Xeon + In-package FPGA)
https://github.com/ucb-bar/midas-release
strober.org
Acknowledgements
▪ Funding:
  - DARPA Award Number HR0011-12-2-0016
  - The Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA
  - ASPIRE Lab industrial sponsors and affiliates: Intel, Google, HPE, Huawei, LGE, Nokia, NVIDIA, Oracle, and Samsung
  - Kwanjeong Scholarship