Strober: Fast and Accurate Sample-Based Energy Simulation Framework for Arbitrary RTL

Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, Krste Asanović

ISCA 2016, 6/20/2016

Slides (PDF): isca2016.eecs.umich.edu/wp-content/uploads/2016/07/2B-2.pdf · 7/2/2016


Energy Evaluation Is Important

• Energy efficiency: key design metric
• Limited energy efficiency improvement from technology scaling
• Great opportunity for computer architects & software engineers
• How to evaluate energy efficiency?

[Diagram labels: Novel Hardware Designs; New Algorithms & Software Implementation; Energy Efficient Systems]


Evaluating Existing Systems Is Easy

• Silicon prototyping
  - Just run & measure!
  - Very accurate & fast
  - High-latency turn-around
• What about new systems?
  - Modeling
  - Simulation

[Figure: Prototypes from UC Berkeley]


What Is A Good Simulation Methodology?

• Fast enough to simulate the whole execution of real applications
  - First 70B cycles of gcc on the in-order processor (Rocket)
• General, easy to use for any hardware design
  - e.g. accelerators for deep learning
• Minimal modeling errors → accurate energy prediction for real systems

[Figure: CPI vs. cycles (billions)]


Evaluating Novel Systems Is Hard

            | Analytic Power Modeling + µarch Software Simulation | RTL / Gate-level Simulation
e.g.        | McPAT, Wattch + GEM5                                 | Synopsys VCS
Accuracy    | ? [1] (validation against real systems)              |
Generality  | (additional development efforts)                     |
Speed       | < 300 KHz                                            | < 1 KHz

[1] Xi et al., “Quantifying sources of error in McPAT and potential impacts on architectural studies”, HPCA '15


The Strober Framework

[Diagram blocks: Full Program Execution; Chisel RTL; Custom Transforms; FPGA Simulation; RTL State Snapshots; I/O Traces; CAD Tools; Post-layout Designs; Gate-level Simulation]


Chisel: Constructing Hardware In a Scala Embedded Language

• Construct hardware generator
  - Clean simple set of design construction primitives
  - Advanced parameterization systems
  - Object-oriented/functional programming with Scala
• Flexible Intermediate Representation for RTL (FIRRTL)
  - Custom transforms for hardware designs (e.g. scan chains)

chisel.eecs.berkeley.edu

[Diagram: Chisel RTL → Chisel Frontend → FIRRTL → Backend Passes + Custom Transforms → Verilog]


Auto FAME1 Transform

• Problems
  - Performance modeling on heterogeneous FPGA platform
  - Simulation stall to read RTL state snapshots from FPGA simulation
• Solutions: FAME1 Transform
  - Automatically generates FPGA-accelerated simulators
  - Systematically attaches enables for registers to stall simulation
  - Channels: tokens + I/O traces

[Diagram: FAME1 Transform]
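The stall mechanism can be illustrated in software. The sketch below assumes a hypothetical one-register target (`Fame1Counter`) and a token queue standing in for an input channel; none of these names come from Strober, and the real transform operates on RTL rather than Python objects. The register updates only when the enable fires, so the target holds its state while the host waits.

```python
from collections import deque
from typing import Deque, List


class Fame1Counter:
    """Toy target: an accumulator register gated by a host-level enable."""

    def __init__(self) -> None:
        self.acc = 0                           # target register
        self.in_channel: Deque[int] = deque()  # input tokens from the host
        self.io_trace: List[int] = []          # outputs, one per target cycle
        self.target_cycles = 0

    def host_step(self) -> bool:
        """Advance one host cycle; returns True if a target cycle fired."""
        fire = len(self.in_channel) > 0        # enable: an input token is ready
        if not fire:
            return False                       # stall: register holds its value
        self.acc += self.in_channel.popleft()  # consume token, update register
        self.io_trace.append(self.acc)         # produce one output token
        self.target_cycles += 1
        return True


sim = Fame1Counter()
sim.in_channel.extend([1, 2])
fired = [sim.host_step() for _ in range(4)]  # the last two host cycles stall
sim.in_channel.append(3)
fired.append(sim.host_step())
```

Target time advances only on fired host cycles, which is what makes it safe to pause the simulator and read out state between target cycles.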


RTL State Snapshotting

• Add scan chains systematically & automatically for
  - Registers: values are copied right after the simulation stalls
  - SRAM/BRAM: addresses are generated for the read port

[Diagram: Add Register Scan Chains; Add RAM Scan Chains (RAM, Address Generation)]
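A minimal software sketch of the register scan-chain idea, assuming a hypothetical `ScanChain` class that is not part of Strober: once the simulation stalls, every register value is copied into a shadow register, and the shadow registers are then shifted out one value per step.

```python
from typing import Dict, List


class ScanChain:
    """Toy model of a register scan chain (illustrative, not Strober's RTL)."""

    def __init__(self, registers: Dict[str, int]) -> None:
        self.registers = registers          # live target registers
        self.shadow: List[int] = []         # shadow copies, one per register
        self.order = sorted(registers)      # fixed chain order

    def capture(self) -> None:
        """Copy all register values at once, as done right after a stall."""
        self.shadow = [self.registers[name] for name in self.order]

    def shift_out(self) -> int:
        """Shift the chain by one position and return the value at its head."""
        return self.shadow.pop(0)

    def read_snapshot(self) -> Dict[str, int]:
        """Capture, then drain the chain into a name -> value snapshot."""
        self.capture()
        return {name: self.shift_out() for name in self.order}


regs = {"pc": 0x80000000, "x1": 7, "x2": 42}
chain = ScanChain(regs)
snapshot = chain.read_snapshot()
regs["x1"] = 8   # the target keeps running; the snapshot is unaffected
```

Separating the capture step from the shift step is the point of the shadow registers: the snapshot is consistent even though reading it out takes many cycles.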


Sample Replays on Gate-level Simulation

• Independent RTL state snapshots
• Parallelize sample replays on multiple instances of gate-level simulation

[Diagram blocks: Chisel RTL; Custom Transforms; FPGA Simulation; Samples; Chisel Verilog Backend; Verilog RTL; Logic Synthesis / Place & Route; Post-layout Design; Gate-level Simulation]
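Because the snapshots are independent, replays can be handed to a pool of workers. The sketch below assumes a hypothetical `replay_snapshot` function that returns a made-up power number; a real flow would launch one gate-level simulator process per worker instead.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import List


def replay_snapshot(snapshot_id: int) -> float:
    """Hypothetical stand-in for one gate-level replay; returns power in mW.

    A real replay would load the snapshot into a gate-level simulator and
    run power analysis; here a deterministic fake value is returned.
    """
    return 100.0 + (snapshot_id % 5)


def average_power(snapshot_ids: List[int], instances: int) -> float:
    """Replay independent snapshots on `instances` parallel workers."""
    with ThreadPoolExecutor(max_workers=instances) as pool:
        powers = list(pool.map(replay_snapshot, snapshot_ids))
    return mean(powers)


avg = average_power(list(range(30)), instances=10)
```

No ordering or shared state is needed between replays, so wall-clock replay time scales down with the number of simulator instances.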


What Is Formal Verification Tool For?

• Problem: RTL name mangling in logic synthesis
• Solution: Formal verification tool
  - Find matching points between RTL & gate-level

[Diagram blocks: Chisel RTL; Custom Transforms; FPGA Simulation; Samples; Chisel Verilog Backend; Verilog RTL; Logic Synthesis / Place & Route; Post-layout Design; Gate-level Simulation; Verification Tool; Matching Points]


RTL State Loading on Gate-level Simulation

• Problem: Slow RTL state loading on gate-level designs with simulation scripts (e.g. Synopsys UCLI scripts)
  - 400 commands/sec → 80 sec for 35k flip-flops
• Solution: Custom VPI routines to load RTL state
  - 20,000 commands/sec → 1.8 sec for 35k flip-flops

[Diagram blocks: RTL Snapshots; VPI Routines; Gate-level Simulation]


Register Retiming

• Problem: Register retiming for an n-cycle datapath, e.g. Floating Point Unit
• Solution
  - I/O traces for the last n cycles in FPGA simulation
  - Forced before replays in gate-level simulation

[Diagram: Register Retiming]
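Keeping only the last n cycles of I/O can be sketched with a bounded buffer. The `IoTraceBuffer` class and the (input, output) tuple layout below are assumptions made for illustration, not Strober's actual trace format.

```python
from collections import deque
from typing import Deque, List, Tuple

IoSample = Tuple[int, int]  # (input value, output value) for one cycle


class IoTraceBuffer:
    """Keeps the I/O values of the last n cycles of a retimed datapath."""

    def __init__(self, n: int) -> None:
        self.trace: Deque[IoSample] = deque(maxlen=n)

    def record(self, io: IoSample) -> None:
        self.trace.append(io)   # the oldest entry drops out once n are stored

    def replay_prefix(self) -> List[IoSample]:
        """Values to force, oldest first, before the replay proper starts."""
        return list(self.trace)


buf = IoTraceBuffer(n=3)        # e.g. a 3-cycle floating-point pipeline
for cycle in range(10):
    buf.record((cycle, cycle * cycle))
prefix = buf.replay_prefix()
```

Forcing these n cycles first lets the retimed registers in the gate-level netlist settle to a matching state even though they have no one-to-one counterpart in the RTL snapshot.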


Simulation Execution Model

• Simulation Driver
  - Periodically reads RTL snapshots from the FPGA simulator
• Automatically generate the interface for specific FPGA platforms (e.g. Xilinx Zynq)

[Diagram blocks: x86 Server; FPGA Board (e.g. Xilinx Zynq); ARM Core; FPGA; I/O Devices; Target Design; RTL State Snapshots; Scan Chain; Gate-level Simulation; Average Power; DRAM; IO Trace; Lat Pipe & IO Trace; Simulation Driver]


How Are We Different?

• Architectural State Sampling (e.g. SMARTS [1])
  - Take architectural state snapshots from software/FPGA functional simulators (e.g. Simics, QEMU, ProtoFlex)
  - Replay snapshots on µarch simulators
  - µarch state warming problems
• FPGA-accelerated Power Estimation (e.g. PrEsto [2])
  - Manually selects important signals to train power models
  - Manually adds the power models to the FPGA simulators
  - Needs designers' intuition & additional manual efforts

[1] Wunderlich et al., “SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling”, ISCA '02
[2] Sunwoo et al., “PrEsto: An FPGA-accelerated Power Estimation Methodology for Complex Systems”, FPGA '10


Target Design: Rocket Chip Generator

https://github.com/ucb-bar/rocket-chip.git

                    | Rocket [1]           | BOOM-1w [2]              | BOOM-2w [2]
                    | (In-order Processor) | (Out-of-order Processor) | (Out-of-order Processor)
Fetch-width         | 1                    | 1                        | 2
Issue-width         | 1                    | 1                        | 2
Issue slots         | -                    | 12                       | 16
ROB size            | -                    | 24                       | 32
Ld/St entries       | -                    | 8/8                      | 8/8
Physical registers  | 32(int)/32(fp)       | 100                      | 110
L1 I$ / D$          | 16 KiB / 16 KiB      | 16 KiB / 16 KiB          | 16 KiB / 16 KiB
DRAM latency        | 100 cycles           | 100 cycles               | 100 cycles

[1] Asanović et al., “The Rocket Chip Generator”, UCB Tech Report
[2] Celio et al., “The Berkeley Out-of-Order Machine (BOOM): An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor”, UCB Tech Report


How Fast Is Strober?

• Example: Two-way Out-of-order Processor (BOOM)
  - ASIC toolchain can be parallelized with FPGA simulation
• Simulation Performance Comparisons
  - Gate-level simulation = 264 years (100B cycles / 12 Hz)
  - Fastest µarch software simulation = 3.86 days (100B cycles / 300 KHz)
  - FPGA simulation = 7.7 hours (100B cycles / 3.6 MHz)
  - Sample recording = 1.0 hour (1.3 x 100 x 2 ln(100B / (100 x 1000)))
  - Sample replays = 39 min (100 x (1000 cycles / 12 Hz + 2.5 min) / 10)
  - Strober total = 9.4 hours

Execution Length                        | 100B cycles | FPGA Synthesis Time         | 1 hour
# of snapshots                          | 100         | FPGA Clock Rate             | 50 MHz
Replay Length                           | 1000 cycles | FPGA Simulation Speed       | 3.6 MHz
# of Instances of Gate-level Simulation | 10          | Gate-level Simulation Speed | 12 Hz
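The time estimates on this slide can be checked with a few lines of arithmetic. Two assumptions are made here and flagged in the comments: the factor 1.3 is read as seconds per recorded snapshot, and the logarithm uses the full 100B-cycle execution length from the parameter table, which is the value that reproduces the 1.0 hour figure.

```python
import math

EXEC_CYCLES = 100e9             # execution length: 100B cycles
N_SNAPSHOTS = 100
REPLAY_CYCLES = 1000
GATE_INSTANCES = 10
GATE_HZ = 12.0                  # gate-level simulation speed
UARCH_HZ = 300e3                # fastest µarch software simulation
FPGA_SIM_HZ = 3.6e6             # FPGA simulation speed
RECORD_SEC = 1.3                # assumption: seconds per recorded snapshot
REPLAY_OVERHEAD_SEC = 2.5 * 60  # the 2.5 min term added to each replay

gate_years = EXEC_CYCLES / GATE_HZ / (365.25 * 24 * 3600)
uarch_days = EXEC_CYCLES / UARCH_HZ / (24 * 3600)
fpga_hours = EXEC_CYCLES / FPGA_SIM_HZ / 3600

# Recording term from the slide: 1.3 x n x 2 ln(N / (n x L)),
# with N taken as the 100B-cycle execution length (assumption).
records = 2 * N_SNAPSHOTS * math.log(EXEC_CYCLES / (N_SNAPSHOTS * REPLAY_CYCLES))
record_hours = RECORD_SEC * records / 3600

replay_sec = N_SNAPSHOTS * (REPLAY_CYCLES / GATE_HZ + REPLAY_OVERHEAD_SEC) / GATE_INSTANCES
replay_min = replay_sec / 60

total_hours = fpga_hours + record_hours + replay_min / 60
print(f"{gate_years:.0f} years, {uarch_days:.2f} days, {fpga_hours:.1f} h")
print(f"{record_hours:.1f} h, {replay_min:.0f} min, total {total_hours:.1f} h")
```

The total is the sum of FPGA simulation, sample recording, and sample replays; the 1 hour of FPGA synthesis is not added.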


Underlying Theory: Central Limit Theorem

[Figure: sampling density vs. power, marking the average power with lower and upper error bounds at a given confidence (e.g. 95%)]

• Determined by sample size and variance, not by population size
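A minimal sketch of how such an error bound can be computed from replayed samples, using the normal approximation that the Central Limit Theorem justifies for reasonably large sample counts. The function name and the sample values are made up for illustration; the exact estimator used by Strober may differ.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import List, Tuple


def power_confidence_interval(
    samples: List[float], confidence: float
) -> Tuple[float, float, float]:
    """Return (mean, lower bound, upper bound) using the normal approximation."""
    n = len(samples)
    avg = mean(samples)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(samples) / math.sqrt(n)
    return avg, avg - half_width, avg + half_width


# Hypothetical per-snapshot power measurements in mW (made-up values).
samples = [98.0, 102.0, 101.0, 99.0, 100.0, 103.0, 97.0, 100.0]
avg, lo, hi = power_confidence_interval(samples, confidence=0.95)
rel_error = (hi - avg) / avg   # relative error bound
```

The half-width depends only on the sample standard deviation and the sample count, so the bound does not grow with the length of the program being sampled.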


How Accurate Is Strober?

• Technology: TSMC 45nm
• Theoretical Error Bounds vs. Actual Errors
  - Replay 30 random sample snapshots of 128 cycles
  - Repeat five times for benchmarks on the in-order processor (Rocket)
  - Theoretical error bound: computed from Central Limit Theorem
  - Actual error: compared against the whole execution of benchmarks
  - Errors are independent of the length of execution

[Figure: theoretical error bound (99% confidence) and actual error for runs 1 to 5 of vvadd, towers, dhrystone, qsort, spmv, and dgemm; y-axis: error, 0.0% to 3.0%]


Evaluation: Power Breakdown

• Coremark: designed to fit in L1 caches and stress processor's integer pipeline
• Linux boot, gcc: larger memory footprint to stress memory system
• Replay 30 random sample snapshots of 1024 cycles

[Figure: Power Breakdown for the Two-way Out-of-Order Processor (BOOM); power in mW (axis 0 to 300) for coremark, linuxboot, and gcc; components: Misc, Uncore, L1 D-cache control, L1 D-cache meta+data, L1 I-cache, ROB, FPU, LSU, Integer Unit, Issue Logic, Register File, Rename + Decode Logic, Fetch Unit]


Evaluation: DRAM Power

• Micron's LPDDR2 SDRAM S4
  - 8 banks, 16K (16 x 1024) rows for each bank
• Event counters read out from scan chains
• Spreadsheet power calculator by Micron

[Figure: Power Breakdown for the Two-way Out-of-Order Processor (BOOM); power in mW (axis 0 to 500) for coremark, linuxboot, and gcc; components: DRAM, Misc, Uncore, L1 D-cache control, L1 D-cache meta+data, L1 I-cache, ROB, FPU, LSU, Integer Unit, Issue Logic, Register File, Rename + Decode Logic, Fetch Unit]


Conclusion

• Any novel hardware design written in Chisel, e.g. accelerators for deep learning
• No modeling is necessary → accurate energy evaluation
• Orders of magnitude speedup over existing methodologies
• Automatically generates FPGA simulators for performance evaluation and RTL state snapshotting
• Operates with industry-standard CAD tools [*] to replay RTL snapshots for average power

* Widely available to academics through academic licensing programs

Open-source this summer: Strober.org


Acknowledgements

• Funding:
  - DARPA Award Number HR0011-12-2-0016
  - The Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA
  - ASPIRE Lab industrial sponsors and affiliates: Intel, Google, HPE, Huawei, LGE, Nokia, NVIDIA, Oracle, and Samsung
  - Kwanjeong Scholarship