Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Strober:FastandAccurateSample-BasedEnergySimulationFrameworkforArbitraryRTL
DonggyuKim,AdamIzraelevitz,ChristopherCelio,Hokeun Kim,BrianZimmer,Yunsup Lee,
JonathanBachrach,KrsteAsanović
ISCA20166/20/2016
BAR
EnergyEvaluationIsImportant
2
§ Energyefficiency:keydesignmetric§ Limitedenergyefficiencyimprovementfromtechnologyscaling
§ Greatopportunityforcomputerarchitects&softwareengineers
§ Howtoevaluateenergyefficiency?
Novel Hardware Designs
New Algorithms & Software Implementation
Energy EfficientSystems
BAR
EvaluatingExistingSystemsIsEasy
3
§ SiliconPrototyping- Justrun&measure!-Veryaccurate&fast-High-latencyturn-around
§ Whataboutnewsystems?-Modeling-Simulation
PrototypesfromUCBerkeley
BAR
WhatIsAGoodSimulationMethodology?
4
§ Fastenoughtosimulatethewholeexecutionofrealapplications- First70Bcyclesofgcc onthein-order-processor(Rocket)
§ General,easytouseforanyhardwaredesign- e.g.Acceleratorsfordeeplearning
§ Minimalmodelingerrorsè Accurateenergypredictionforrealsystems
Cycles (Billions)
CPI
BAR
EvaluatingNovelSystemsIsHard
5
Analytic Power Modeling+ µarch Software Simulation
RTL / Gate-levelSimulation
e.g. McPAT, Wattch + GEM5 Synopsys VCS
Accuracy ? [1](Validation against real systems) �
Generality �(Additional development efforts) �
Speed �(< 300 KHz)
�(< 1 KHz)
[1] Xi et. al. “Quantifying sources of error in McPAT and potential impacts on architectural studies”,HPCA `15
BAR
Full Program Execution
TheStrober Framework
6
Chisel RTL
Custom Transforms
FPGA Simulation
RTL State Snapshots
I/O Traces
CAD Tools
Gate-level Simulation
Post-layoutDesigns
BAR
Chisel:ConstructingHardwareInaScalaEmbeddedLanguage
7
§ Constructhardwaregenerator- Cleansimplesetofdesignconstructionprimitives
- Advancedparameterizationsystems- Object-oriented/functionalprogramingwithScala
§ FlexibleIntermediateRepresentationforRTL(FIRRTL)- Customtransforms forhardwaredesigns(e.g.scanchains)
chisel.eecs.berkeley.edu
Chisel RTL
Chisel Frontend
FIRRTL
Backend Passes+ Custom Transforms
Verilog
BAR
8
§ Problems- PerformancemodelingonheterogeneousFPGAplatform
- SimulationstalltoreadRTLstatesnapshotsfromFPGAsimulation
§ Solutions:FAME1Transform- Automatically generatesFPGA-acceleratedsimulators
- Systematically attachesenablesforregisterstostallsimulation
- Channels:tokens+I/Otraces
AutoFAME1Transform
FAME1 Transform
BAR
RTLStateSnapshotting
9
§ Addscanchainssystematically&automatically for-Registers:valuesarecopiedrightafterthesimulationstalls-SRAM/BRAM:addressesaregeneratedforthereadport
RAM
Address Generation
RAM
Add Register Scan Chains
Add RAM Scan Chains
BAR
SampleReplaysonGate-levelSimulation
10
§ IndependentRTLStateSnapshots§ Parallelize samplereplaysonmultipleinstancesofgate-levelsimulation
Custom Transforms
Chisel RTL
FPGA Simulation
Chisel Verilog Backend
Verilog RTL
Logic Synthesis / Place & Route
Post-layout Design
Gate-level Simulation
Samples
BAR
WhatIsFormalVerificationToolFor?
11
§ Problem:RTLnamemanglinginlogicsynthesis§ Solution:Formalverificationtool-FindmatchingpointsbetweenRTL&gate-level
Custom Transforms
Chisel RTL
FPGA Simulation
Chisel Verilog Backend
Verilog RTL
Logic Synthesis / Place & Route
Post-layout Design
Gate-level Simulation
Samples
Verification Tool
Matching Points
BAR
RTLStateLoadingonGate-levelSimulation
12
§ Problem:SlowRTLstateloadingongate-leveldesignswithsimulationscripts(e.g.SynopsysUCLIscripts)-400commands/secà 80sec for35kflip-flops
§ Solution:CustomVPIroutinestoloadRTLstate-20,000commands/secà 1.8sec for35kflip-flops
Gate-level Simulation
VPI Routines
RTL Snapshots
BAR
RegisterRetiming
13
§ Problem:RegisterRetimingforan-cycledatapathe.g.FloatingPointUnit
§ Solution- I/OtracesforthelastncyclesinFPGAsimulation-Forcedbeforereplaysingate-levelsimulation
RegisterRetiming
BAR
x86 Server
FPGA Board( e.g. Xilinx Zynq)
ARM Core FPGA
I/O Devices Target Design
RTL StateSnapshots
Scan Chain
Gate-levelSimulation
AveragePower
DRAM
IO Trace
Lat Pipe & IO Trace
Simulation Driver
SimulationExecutionModel
14
§ SimulationDriver- PeriodicallyreadsRTLsnapshotsfromtheFPGASimulator
§ AutomaticallygeneratetheinterfaceforspecificFPGAplatforms(e.g.XilinxZynq)
BAR
HowAreWeDifferent?
15
§ ArchitecturalStateSampling(e.g.SMARTS[1])
- Takearchitecturalstatesnapshotsfromsoftware/FPGAfunctionalsimulators(e.g.Simics,QEMU,ProtoFlex)
- Replaysnapshotsonµarchsimulators- µarchstatewarmingproblems
§ FPGA-acceleratedPowerEstimation(e.g.PrEsto[2])-Manually selectsimportantsignalstotrainpowermodels-Manually addsthepowermodelstotheFPGAsimulators-Needsdesigners’intuition&additionalmanualefforts
[1] Wunderlich et. al. “SMARTS: accelerating microarchitecture simulation via rigorous statisticalsampling”, ISCA `02
[2] Sunwoo et. al. “PrEsto: An FPGA- accelerated Power Estimation Methodology for Complex Systems”,FPGA `10
BAR
TargetDesign:RocketChipGenerator
16
https://github.com/ucb-bar/rocket-chip.git
Rocket [1](In-order Processor)
BOOM-1w [2](Out-of-order Processor)
BOOM-2w [2](Out-of-order Processor)
Fetch-width 1 1 2
Issue-width 1 1 2
Issue slots 12 16
ROB size 24 32
Ld/St entries 8/8 8/8
Physical registers 32(int)/32(fp) 100 110
L1 I$ / D$ 16 KiB / 16 KiB 16 KiB / 16 KiB 16 KiB / 16 KiB
DRAM latency 100 cycles 100 cycles 100 cycles
[1] Asanović et. al. “The Rocket Chip Generator”, UCB Tech Repor t[2] Celio et. al. “The Berkeley Out-of-Order Machine(BOOM): An Industry-
Competitive, Synthesizable, Parameterized RISC-V Processor”, UCB Tech Repor t
BAR
HowFastIsStrober?
17
§ Example:Two-wayOut-of-orderProcessor(BOOM)
- ASICtoolchaincanbeparallelizedwithFPGASimulation
§ SimulationPerformanceComparisons-Gate-levelsimulation=264years(100Bcycles/12Hz)- Fastestµarchsoftwaresimulation=3.86days(100Bcycles/300KHz)- FPGAsimulation=7.7hours(100Bcycles/3.6MHz)- SampleRecording=1.0hour(1.3x100x2ln(10B/(100x1000)))- SampleReplays=39min (100� (1000cycles/12Hz+2.5min)/10)- Strober Total=9.4hours
Execution Length 100B cycles FPGA Synthesis Time 1 hour# of snapshots 100 FPGA Clock Rate 50 MHzReplay Length 1000 cycles FPGA Simulation Speed 3.6 MHz
# of Instances of Gate-level Simulation 10 Gate-level Simulation Speed 12Hz
BAR
UnderlyingTheory:CentralLimitTheorem
18
AveragePower
UpperError Bound
LowerError Bound
SamplingDensity
Power
Confidence (e.g. 95%)
Determined by Sample Size, Variance, Not by Population Size
BAR
HowAccurateIsStrober?
19
§ Technology:TSMC45nm§ TheoreticalErrorBoundsvs.ActualErrors
- Replay30randomsamplesnapshotsof128cycles- Repeatfivetimesforbenchmarksonthein-order-processor(Rocket)- Theoreticalerrorbound:computedfromCentralLimitTheorem- Actualerror:comparedagainstthewholeexecutionofbenchmarks- Errorsareindependentofthelengthofexecution
0.0%
1.0%
2.0%
3.0%
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
vvadd towers dhrystone qsort spmv dgemm
Erro
r
Theoretical Error Bound (99% Confidence) Actual Error
BAR
0
50
100
150
200
250
300
coremark linuxboot gcc
mW
Power Breakdonw for the Two-way Out-of-Order Processor(BOOM)MiscUncoreL1 D-cache controlL1 D-cache meta+dataL1 I-cacheROBFPULSUInteger UnitIssue LogicRegister FileRename + Decode LogicFetch Unit
Evaluation:PowerBreakDown
20
§ Coremark:designedtofitinL1cachesandstressprocessor’sintegerpipeline
§ Linuxboot,gcc:largermemoryfootprinttostressmemorysystem§ Replay30randomsamplesnapshotsof1024cycles
BAR
0
50
100
150
200
250
300
350
400
450
500
coremark linuxboot gcc
mW
Power Breakdonw for the Two-way Out-of-Order Processor(BOOM)DRAMMiscUncoreL1 D-cache controlL1 D-cache meta+dataL1 I-cacheROBFPULSUInteger UnitIssue LogicRegister FileRename + Decode LogicFetch Unit
Evaluation:DRAMPower
21
§ Micron’sLPDDR2SDRAMS4- 8banks,16K(16x1024)rowsforeachbank
§ Eventcountersreadoutfromscanchains§ SpreadsheetpowercalculatorbyMicron
BAR
Conclusion
22
§ AnynovelhardwaredesignwritteninChisele.g.Acceleratorsfordeeplearning
§ Nomodeling isnecessaryè AccurateEnergyEvaluation
§ Ordersofmagnitudespeedupoverexistingmethodologies
§ AutomaticallygeneratesFPGASimulatorsforperformanceevaluationandRTLstatesnapshotting
§ Operateswithindustry-standardCADtools[*]toreplayRTLsnapshotsforaveragepower
*Widelyavailabletoacademicsthroughacademiclicensingprograms
Open-source this summer: Strober.org
BAR
Acknowledgements
23
§ Funding:-DARPAAwardNumberHR0011-12-2-0016- TheCenterforFutureArchitectureResearch,amemberofSTARnet,aSemiconductorResearchCorporationprogramsponsoredbyMARCOandDARPA-ASPIRELabindustrialsponsorsandaffiliates:Intel,Google,HPE,Huawei,LGE,Nokia,NVIDIA,Oracle,andSamsung- Kwanjeong Scholarship