Power Optimization Toolbox for Logic Synthesis and Mapping

Stephen Jang Kevin Chung Xilinx Inc. Alan Mishchenko Robert Brayton UC Berkeley


Page 1: Stephen Jang     Kevin Chung  Xilinx Inc. Alan Mishchenko     Robert Brayton  UC Berkeley

Power Optimization Toolbox for Logic Synthesis and Mapping

Alan Mishchenko, Robert Brayton (UC Berkeley)

Page 2: Outline

• Introduction
• Background
• Contributions
    • SimSwitch: Switching activity estimation
    • PowerMap: Mapping for power reduction
    • PowerDC: Re-synthesis for power reduction
• Experiments
• Conclusions

Page 3: Introduction

• High power dissipation is a rising concern
    • It was shown that, in FPGAs, 2/3 of dissipation is due to dynamic power [J. Anderson, F. N. Najm, FPGA’02]
• Minimization of dynamic power is achieved by reducing the total switching activity of the nodes
• This work
    • Uses sequential simulation to estimate switching
    • Controls switching during synthesis and mapping

$$P_{dynamic} = \frac{1}{2} f V^2 \sum_{i \in signals} C_i S_i$$

f is the clock frequency, V the supply voltage, Ci the capacitance switched by signal i, and Si is the probability of signal i making a transition (switching)
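The dynamic power formula can be evaluated directly once per-signal capacitances and switching probabilities are known. The sketch below is only an illustration: the function name `dynamic_power` and all numeric values are hypothetical and do not come from the slides.

```python
def dynamic_power(f_hz, vdd, signals):
    """Evaluate P_dynamic = 1/2 * f * V^2 * sum_i(C_i * S_i).

    signals is a list of (C_i, S_i) pairs: capacitance in farads and
    switching probability of signal i.
    """
    return 0.5 * f_hz * vdd ** 2 * sum(c * s for c, s in signals)


# Hypothetical example: three signals at 100 MHz and 1.0 V.
signals = [(2e-15, 0.5), (1e-15, 0.1), (4e-15, 0.25)]
power = dynamic_power(100e6, 1.0, signals)

# Halving every switching probability halves the dynamic power,
# which is why the flow targets switching activity.
cooler = [(c, s / 2) for c, s in signals]
power_cooler = dynamic_power(100e6, 1.0, cooler)
```

Because f, V, and the capacitances are largely fixed at this stage of the flow, the switching probabilities S_i are the quantity that synthesis and mapping can influence.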

Page 4: Background

• Boolean network
    • And-Inverter Graphs
• Technology mapping
    • LUTs and standard cells
• SAT-based re-synthesis
    • Resubstitution with don’t-cares

Page 5: AIGs: Unifying Representation

• An underlying data structure for various computations
    • Rewriting, resubstitution, simulation, SAT sweeping, induction, etc. are based on the same AIG manager
• A unifying representation for the whole synthesis/mapping/resynthesis/verification flow
    • Synthesis, mapping, verification use the same data structure
    • Allows multiple structures to be stored and used for mapping
• The main functional representation in ABC
    • A foundation of “contemporary logic synthesis”

Page 6: AIG Definition and Examples

AIG is a Boolean network composed of two-input ANDs and inverters

Karnaugh map of F (rows: cd, columns: ab); the same map is shown for both examples:

cd \ ab   00   01   11   10
00         0    0    1    0
01         0    0    1    1
11         0    1    1    0
10         0    0    1    0

Example 1: F(a,b,c,d) = ab + d(ac’+bc)
6 nodes, 4 levels

Example 2: F(a,b,c,d) = ac’(b’d’)’ + bc(a’d’)’ = ac’(b+d) + bc(a+d)
7 nodes, 3 levels

[Figure: the two AIGs implementing the expressions above]
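As a sanity check, the two expressions for F can be compared on all 16 input assignments, and the second one can be rebuilt from two-input ANDs and inverters only. This is a small illustrative sketch; the helper names (`f1`, `f2`, `f2_aig`, `and2`, `inv`) are made up for the example.

```python
from itertools import product


def f1(a, b, c, d):
    # F = ab + d(ac' + bc)
    return (a & b) | (d & ((a & (1 - c)) | (b & c)))


def f2(a, b, c, d):
    # F = ac'(b + d) + bc(a + d)
    return (a & (1 - c) & (b | d)) | (b & c & (a | d))


def and2(x, y):
    return x & y


def inv(x):
    return 1 - x


def f2_aig(a, b, c, d):
    # Same function as f2, using only AND2 and inverters.
    # OR is expressed by De Morgan: x + y = (x'y')'.
    b_or_d = inv(and2(inv(b), inv(d)))
    a_or_d = inv(and2(inv(a), inv(d)))
    t1 = and2(and2(a, inv(c)), b_or_d)
    t2 = and2(and2(b, c), a_or_d)
    return inv(and2(inv(t1), inv(t2)))  # seven and2 calls in total


GRAY = [(0, 0), (0, 1), (1, 1), (1, 0)]  # Karnaugh-map ordering 00, 01, 11, 10

# Rows are indexed by cd, columns by ab.
kmap = [[f1(a, b, c, d) for (a, b) in GRAY] for (c, d) in GRAY]

all_equal = all(
    f1(*v) == f2(*v) == f2_aig(*v) for v in product((0, 1), repeat=4)
)
```

The seven `and2` calls in `f2_aig` correspond to the 7-node, 3-level structure quoted on the slide.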

Page 7: Three Tricks That Make AIGs Tick

• Structural hashing
    • Makes sure AIG is always stored in a compact form
    • Is applied during AIG construction
        • Propagates constants
        • Ensures each node is structurally unique
• Complemented edges
    • Represents inverters as attributes on the edges
        • Leads to fast, uniform manipulation
        • Does not use memory for inverters
        • Leads to efficient structural hashing
• Memory allocation
    • Uses fixed amount of memory for each node
        • Can be done by a simple custom memory manager
        • Even dynamic fanout manipulation is supported!
    • Allocates memory for nodes in a topological order
        • Optimized for traversal in the same topological order
        • Small static memory footprint for many applications

[Figure: the same logic over inputs a, b, c, d drawn twice, labeled “Without hashing” and “With hashing”]
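The first two tricks can be sketched in a few lines. The toy manager below is an assumption-laden illustration, not ABC's AIG package: the class name `Aig`, its methods, and the literal encoding (2 * node id + complement bit, with node 0 as the constant) are choices made for this example.

```python
class Aig:
    """Toy AIG manager with structural hashing and complemented edges."""

    def __init__(self):
        self.fanins = [None]  # node 0 is the constant; literal 0 = false, 1 = true
        self.table = {}       # (lit0, lit1) -> literal of an existing AND node

    def new_input(self):
        self.fanins.append(None)
        return 2 * (len(self.fanins) - 1)

    def and2(self, x, y):
        if x > y:
            x, y = y, x          # canonical fanin order
        if x == 0:
            return 0             # constant propagation: 0 & y = 0
        if x == 1:
            return y             # constant propagation: 1 & y = y
        if x == y:
            return x             # x & x = x
        if x ^ 1 == y:
            return 0             # x & x' = 0
        key = (x, y)
        lit = self.table.get(key)
        if lit is None:          # create the node only if it is structurally new
            self.fanins.append(key)
            lit = 2 * (len(self.fanins) - 1)
            self.table[key] = lit
        return lit

    def num_ands(self):
        return len(self.table)


def neg(lit):
    # An inverter is just the low bit of the literal; it costs no node.
    return lit ^ 1


aig = Aig()
a, b = aig.new_input(), aig.new_input()
n1 = aig.and2(a, b)
n2 = aig.and2(b, a)              # same structure, so the same node is returned
or_ab = neg(aig.and2(neg(a), neg(b)))
```

Because every AND is looked up before it is created, the graph stays compact during construction, and complement bits let the hash key distinguish `a & b` from `a' & b'` without storing inverter nodes.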

Page 8: SimSwitch

• Fast sequential logic simulator
    • Useful for switching activity estimation
• Improvements in simulation
    • Compact logic representation
        • only 12 bytes per AIG node
    • Recycling simulation memory
        • allocate simulation memory only for nodes on the frontier
    • Bit-parallel simulation of two time frames
        • When comparing simulation info in two consecutive time frames, avoids storing the simulation info from the previous frame
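The idea of bit-parallel switching estimation can be illustrated with a much simpler simulator than SimSwitch. The sketch below is an assumption, not the authors' implementation: it keeps one simulation word per node, counts toggles by XOR-ing the old and new word at the moment a node is overwritten (so no separate copy of the previous frame is kept), and uses an AIGER-like literal encoding. The function name, the circuit format, and the random input model are all hypothetical.

```python
import random


def popcount(x):
    return bin(x).count("1")


def estimate_switching(num_pis, latches, ands, frames, width, seed=0):
    """Estimate per-node toggle probability by random sequential simulation.

    Literals are 2*var + complement; var 0 is constant 0.
    Vars 1..num_pis are primary inputs, followed by latch outputs, followed
    by AND nodes in topological order.
    latches: list of next-state literals; ands: list of (lit0, lit1).
    Each Python int holds `width` independent patterns. Use frames >= 2.
    """
    rng = random.Random(seed)
    mask = (1 << width) - 1
    first_latch = 1 + num_pis
    first_and = first_latch + len(latches)
    val = [0] * (first_and + len(ands))
    toggles = [0] * len(val)

    def lit(l):
        return val[l >> 1] ^ (mask if l & 1 else 0)

    def update(var, new, frame):
        if frame > 0:  # frame 0 has no predecessor to compare with
            toggles[var] += popcount(val[var] ^ new)
        val[var] = new

    state = [0] * len(latches)  # all latches start at 0
    for frame in range(frames):
        for i in range(num_pis):
            update(1 + i, rng.getrandbits(width), frame)
        for i, word in enumerate(state):
            update(first_latch + i, word, frame)
        for i, (l0, l1) in enumerate(ands):
            update(first_and + i, lit(l0) & lit(l1), frame)
        state = [lit(l) for l in latches]  # next-state values for the next frame
    samples = width * max(frames - 1, 1)
    return [t / samples for t in toggles]


# A toggle flip-flop: the latch (var 1) is fed by its own complement (literal 3).
tff = estimate_switching(0, [3], [], frames=5, width=8)

# AND of two random inputs (vars 1 and 2 feed var 3).
and_gate = estimate_switching(2, [], [(2, 4)], frames=65, width=64)
```

With uniformly random inputs, an AND output is 1 with probability 0.25, so its toggle probability should be close to 2 * 0.25 * 0.75 = 0.375, while the inputs themselves toggle with probability close to 0.5.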

Page 9: Simulation Runtime Evaluation

Runtime (sec) for 64 frames and the given number of input patterns

Design   AIG     FF      2560   5120   10240   20480
C1       304K    1585    0.1    0.2    0.2     0.4
C2       362K    27514   2.7    2.9    4.1     6.6
C3       842K    58322   7.4    7.6    10.2    18.2
C4       1306K   87157   12.1   15.4   15.7    24.2

Intel Xeon 2-CPU 4-core computer with 8GB RAM.

Less than 100 MB of memory was used in these experiments.

Page 10: Review of Cut-Based Mapping

Input: And-Inverter Graph

1. Compute K-feasible cuts for each node
2. Compute best arrival time at each node
    • In topological order (from PI to PO)
    • Compute the depth of all cuts and choose the best one
3. Iterate area recovery
    • Using area flow
    • Using exact local area
4. Choose the best cover
    • In reverse topological order (from PO to PI)

Output: Mapped netlist

S. Chatterjee et al, “Reducing structural bias in technology mapping”, Proc. ICCAD’05.
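Steps 1 and 2 of the procedure above can be sketched as follows. This is a textbook-style illustration under simplifying assumptions (exhaustive cut enumeration, unit LUT delay, no area recovery, no cover selection), not the mapper used in the paper; the function names and the circuit encoding are hypothetical.

```python
def enumerate_cuts(num_pis, ands, k):
    """Enumerate K-feasible cuts in topological order.

    Vars 1..num_pis are primary inputs; ands[i] = (fanin0, fanin1) defines
    var num_pis + 1 + i. A cut is a frozenset of leaf vars.
    """
    cuts = {v: [frozenset([v])] for v in range(1, num_pis + 1)}
    for i, (f0, f1) in enumerate(ands):
        v = num_pis + 1 + i
        merged = {frozenset([v])}                 # trivial cut
        for c0 in cuts[f0]:
            for c1 in cuts[f1]:
                c = c0 | c1
                if len(c) <= k:                   # keep only K-feasible cuts
                    merged.add(c)
        # Drop dominated cuts (proper supersets of another cut).
        cuts[v] = [c for c in merged if not any(o < c for o in merged)]
    return cuts


def best_depths(num_pis, ands, cuts):
    """Best arrival time per node, assuming unit delay per LUT."""
    depth = {v: 0 for v in range(1, num_pis + 1)}
    for i in range(len(ands)):
        v = num_pis + 1 + i
        depth[v] = min(
            1 + max(depth[leaf] for leaf in c)
            for c in cuts[v]
            if c != frozenset([v])                # the trivial cut is not a LUT
        )
    return depth


# Inputs 1..4; node 5 = AND(1, 2), node 6 = AND(3, 4), node 7 = AND(5, 6).
ands = [(1, 2), (3, 4), (5, 6)]
cuts4 = enumerate_cuts(4, ands, k=4)
cuts2 = enumerate_cuts(4, ands, k=2)
depth4 = best_depths(4, ands, cuts4)
depth2 = best_depths(4, ands, cuts2)
```

With 4-input LUTs the whole example fits in one LUT (depth 1); with 2-input LUTs node 7 can only use the cut {5, 6}, giving depth 2.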

Page 11: Cost Functions

• Area flow (J. Cong, FPGA’99; S. Chatterjee, ICCAD’05)

$$AF(n) = Area(n) + \sum_i \frac{AF(Leaf_i(n))}{NumFanout(Leaf_i(n))}$$

• Wire flow (S. Jang, FPGA’08)

$$EF(n) = Edge(n) + \sum_i \frac{EF(Leaf_i(n))}{NumFanout(Leaf_i(n))}$$

• Switching flow (This work)

$$SwitchFlow(n) = Switch(n) + \sum_i \frac{SwitchFlow(Leaf_i(n))}{NumFanout(Leaf_i(n))}$$
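All three cost functions share one recursive shape: a node's own cost plus each cut leaf's flow divided by that leaf's fanout count. The sketch below evaluates that shape for switching flow and picks the cheaper of two candidate cuts. The node names, switching values, fanout counts, and the choice to seed primary inputs with their own switching activity are hypothetical assumptions for the example.

```python
def flow(own_cost, cut, leaf_flow, num_fanout):
    """own_cost(n) + sum over cut leaves of flow(leaf) / NumFanout(leaf).

    With own_cost = Area(n), Edge(n) or Switch(n) this gives area flow,
    wire flow or switching flow, respectively.
    """
    return own_cost + sum(leaf_flow[l] / max(num_fanout[l], 1) for l in cut)


# Hypothetical data: switching activity and fanout counts.
switch = {"a": 0.5, "b": 0.5, "c": 0.1, "x": 0.4, "n": 0.2}
num_fanout = {"a": 1, "b": 2, "c": 1, "x": 1}

# Assumption: the flow of a primary input is its own switching activity.
switch_flow = {"a": switch["a"], "b": switch["b"], "c": switch["c"]}

# Internal node x is implemented with the cut {a, b}.
switch_flow["x"] = flow(switch["x"], ["a", "b"], switch_flow, num_fanout)

# Node n has two candidate cuts; the mapper would prefer the smaller flow.
candidates = {
    "cut_through_x": ["x", "c"],
    "cut_at_inputs": ["a", "b", "c"],
}
costs = {
    name: flow(switch["n"], cut, switch_flow, num_fanout)
    for name, cut in candidates.items()
}
best = min(costs, key=costs.get)
```

In this example the cut that absorbs `x` avoids paying for the switching of the wire `x`, so it has the lower switching flow.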

Page 12: Understanding a Cost-Function Flow

[Figure: example network with nodes n1 to n5, annotated “Resources owned by the nodes”]

Page 13: SAT-based Re-synthesis Framework

• SAT-based re-synthesis (FPGA’09) has these features
    • substantial optimization power
        • due to the use of internal don’t-cares
    • scalable local computation
        • due to the use of windowing
    • practical computation speed
        • due to the use of Boolean satisfiability for functional manipulation
    • ability to use various optimization objectives
        • due to the flexible conceptual framework

Page 14: Two Ways to Cool Down a Hot Wire

[Figure: a network with a hot wire (nodes n1 to n5 and n8 to n10, inputs a and b, signals x and z), followed by two transformed versions labeled as below]

• Resubstitute the hot wire with a cool one
• Remove the hot wire

Page 15: Experimental Setup

• Considered 20 industrial designs (12K to 165K 6-LUTs)
• Used Intel Xeon 2-CPU 4-core computer with 8GB RAM
• Verified the results using command “cec” in ABC
• Experimental runs performed:
    • Baseline: comb synthesis with choices
        • (dch; if -e)^2 (WireMap [FPGA’08] is disabled)
    • FullOpt: complete flow including high-effort seq and synthesis
        • (scl; lcorr; scorr) + (dch; if)^2 (WireMap is enabled)
    • PowerMap: power-aware LUT-mapping
        • FullOpt + (dch; if -p)^2
    • PowerDC: power-aware resynthesis
        • PowerMap + (mfs -p)^2

Page 16: Experimental Data

Column groups, left to right: Design (name); Statistics (PI, PO); then FF, LUT, Lv, Pwr for each of Base, FullOpt, PowerMap, and PowerDC.

name PI PO FF LUT Lv Pwr FF LUT Lv Pwr FF LUT Lv Pwr FF LUT Lv Pwr

D01 4725 16657 43309 71956 14 63592 41868 70060 13 53666 41868 67214 12 49268 41868 65693 12 46139

D02 8561 20356 65881 144295 27 59306 41327 90304 27 34090 41327 86789 25 33418 41327 85798 25 30497

D03 8879 52334 41521 123845 11 79928 39884 122947 11 66571 39884 119625 10 61239 39884 117946 10 57068

D04 781 5563 16205 50328 10 38961 15123 48392 10 29901 15123 47361 9 26706 15123 46651 9 23938

D05 3343 5533 23740 65704 17 37158 18933 55512 13 25415 18933 52086 12 20771 18933 50973 12 16247

D06 3664 21989 29947 36188 8 36899 27896 33733 8 27844 27896 33220 8 25985 27896 32962 8 25351

D07 1284 2929 81328 164437 10 90849 74898 153317 11 66515 74898 145918 10 56595 74898 143357 10 52249

D08 261 359 4586 12412 22 6480 4463 12325 23 4938 4463 11986 21 4461 4463 11844 21 3326

D09 2561 9586 23612 55639 9 40244 16290 37736 8 21981 16290 35123 7 20053 16290 34103 7 17538

D10 3765 9987 37630 96677 31 62304 36665 95005 31 50132 36665 90167 30 43864 36665 88191 30 37483

D11 2418 6000 34834 76798 19 51724 34446 75724 20 43953 34446 70531 18 36783 34446 69117 18 32231

D12 1134 3965 8371 14939 12 11632 8256 15316 12 10480 8256 14825 12 9254 8256 14623 12 8624

D13 210 299 6662 15888 11 5766 6591 15381 11 3893 6591 14960 11 3265 6591 14710 11 2498

D14 2326 3713 61789 109865 18 22738 36887 67338 19 7956 36887 66681 17 7582 36887 66097 17 7537

D15 2312 5523 26233 49031 7 35819 13575 27498 6 20779 13575 26234 6 17972 13575 25871 6 16819

D16 5124 21571 39127 146931 17 18685 37772 146298 16 14087 37772 139226 16 13095 37772 135903 16 12966

D17 2587 7025 6975 12528 12 8152 6429 11780 12 7124 6429 11957 11 6904 6429 11834 11 6685

D18 3918 6110 23727 35996 13 24990 22260 34630 12 21728 22260 34215 10 19339 22260 33994 10 18297

D19 4633 7540 28376 43515 13 31148 26241 41415 12 26941 26241 41120 10 24176 26241 40839 10 22822

D20 6631 19368 58322 158216 25 101921 53581 144455 25 75781 53581 139174 22 64332 53581 136747 22 54554

Geom 2438 6590 25656 55456 14.1 31294 22118 48700 13.7 22621 22118 47049 12.6 20238 22118 46318 12.6 18190

Ratio (vs. Base) - - 1 1 1 1 0.862 0.878 0.97 0.723 0.862 0.848 0.90 0.647 0.862 0.835 0.90 0.581

Ratio (vs. FullOpt) - - - - - - 1 1 1 1 1.000 0.966 0.92 0.895 1.000 0.951 0.92 0.804

Ratio (vs. PowerMap) - - - - - - - - - - 1 1 1 1 1.000 0.984 1.00 0.899
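The Ratio rows follow from the Geom row by simple division. The short sketch below reproduces the Pwr ratios from the four geometric-mean power values in the table; the helper names `geomean` and `ratio` are illustrative only.

```python
import math


def geomean(values):
    """Geometric mean, as used for the Geom row."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Geometric-mean Pwr values taken from the Geom row of the table.
geom_pwr = {"Base": 31294, "FullOpt": 22621, "PowerMap": 20238, "PowerDC": 18190}


def ratio(run, baseline):
    return round(geom_pwr[run] / geom_pwr[baseline], 3)


vs_base = [ratio(r, "Base") for r in ("FullOpt", "PowerMap", "PowerDC")]
vs_fullopt = [ratio(r, "FullOpt") for r in ("PowerMap", "PowerDC")]
vs_powermap = ratio("PowerDC", "PowerMap")
```

The PowerDC versus FullOpt ratio of 0.804 corresponds to the additional 19.6% reduction quoted in the abstract.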

Page 17: Power Reduction due to Power-Aware Optimization

Table 1: Inputs toggle rate is 0.25

Power                  BaseLine   FullOpt   PwrMap   PwrDC
Geomean                17312      12687     11445    10445
Ratio (vs. BaseLine)   1          0.73      0.66     0.60
Ratio (vs. FullOpt)    -          1         0.90     0.82
Ratio (vs. PwrMap)     -          -         1        0.91

Table 2: Inputs toggle rate is 0.50

Power                  BaseLine   FullOpt   PwrMap   PwrDC
Geomean                31294      22621     20238    18190
Ratio (vs. BaseLine)   1          0.72      0.64     0.58
Ratio (vs. FullOpt)    -          1         0.89     0.80
Ratio (vs. PwrMap)     -          -         1        0.89

The results are geometric averages over 20 industrial designs

Page 18: Changes in Wire Ratios due to Power-Aware Optimization

[Chart: “Wire Comparison”. Y-axis: reduction (negative is good), from -15.00% to 10.00%. Series: PowerMap vs. FullOpt, PowerDC vs. PowerMap. Categories: T5, T4, T3, T2, T1, Total Wrs]

Wire group codes: T5: “hot wires” (p > 0.4) … T1: “cold wires” (p < 0.1), where p is the probability of switching (note that p can be more than 0.5)

Page 19: Power Dissipation per Wire Group With / Without Power-Aware Optimization

[Chart: “Power distribution before/after optimization”. X-axis: switching frequency (temperature), groups T5 to T1. Y-axis: percentage, from 0.00% to 100.00%. Series: Wire, Wire2, Pwr, Pwr2]

         T5       T4      T3      T2      T1
Wire     13.90%   1.62%   0.91%   1.29%   82.27%
Wire2    11.44%   1.79%   0.93%   1.29%   84.55%
Pwr      85.90%   7.36%   2.68%   2.32%   1.74%
Pwr2     82.62%   9.48%   3.19%   2.69%   2.02%

Wire (Wire2) are wires before (after) synthesis.
Pwr (Pwr2) are power dissipations before (after) synthesis.

Page 20: Conclusions

• Presented several contributions
    • SimSwitch: Estimation of switching activity
    • PowerMap: An extension of the priority cut LUT mapper [ICCAD’07] to prioritize cuts based on switching activity of the nodes
    • PowerDC: An extension of SAT-based resynthesis [FPGA’09] to remove signals with high switching
• Demonstrated reductions in switching activity (without degradation of area and delay)
    • 27% reduction due to seq synthesis [ICCAD’08] and WireMap [FPGA’08] against a plain-vanilla flow
    • +19% reduction due to PowerMap and PowerDC described in this paper

Page 21: Future Work

• Speeding up switching activity estimation
    • Current implementation can be made faster
• More accurate power estimation
    • Estimating glitching in addition to switching
• Making other transforms power-aware
    • Computing power-aware choices
    • Specialized logic structuring (power gating)
• Sequential techniques for power reduction
    • Clock-gating that uses induction to compute signals that are valid clock gates on the reachable states

Page 22: Abstract

The paper describes several complementary algorithms for power-aware logic optimization: (1) SimSwitch is an efficient sequential simulator for estimating switching activity of signals in large sequential designs. (2) PowerMap uses switching activity to make better decisions during power-aware technology mapping. (3) PowerDC is a resynthesis algorithm that eliminates wires with high switching activity. The proposed simulator draws on new ideas in logic representation and is geared for speed, e.g. it can simulate a 1M-node sequential design using 1000 bit patterns for 100 cycles in about 10 seconds on a typical one-core CPU. Experiments show that, although each technique contributes to the final quality, it is their combination that gives the best results. When applied to industrial designs in a highly-optimized industrial flow, previous work on sequential synthesis and wire-aware technology mapping led to a 27.6% reduction in switching activity, while the techniques of this paper reduce it additionally by 19.6% without a substantial increase in runtime or degradation of other metrics.