Power Optimization for Clock Network with Clock Gate Cloning and Flip-Flop Merging

Shih-Chuan Lo, Chih-Cheng Hsu, Mark Po-Hung Lin

Department of Electrical Engineering, National Chung Cheng University, Chiayi, Taiwan



Outline

• Introduction
• Preliminaries
• The Proposed Algorithms
• Experimental Results
• Conclusions

Outline

• Introduction
  - Low Power Design Methodologies
  - The Concept of Clock-Gating Cell
  - The Concept of Clock-Gate Cloning
  - The Concept of Flip-Flop Merging
  - Previous Work
  - Our Contributions
• Preliminaries
• The Proposed Algorithms
• Experimental Results
• Conclusions

Low Power Design Methodologies

• Clock gating cell (CG): [Wu et al., TCAS'00], [Shen et al., TVLSI'10]
• Clock gate cloning: [Teng & Soin, ICSE'10], [Vishweshwara et al., ISQED'12]
• Multi-bit flip-flop (MBFF): [Pokala et al., ASIC'92], [Kretchmer, EE Times Asia'01], [Chen et al., SNUG'10], [Lin et al., TCAD'11], [Wang et al., TCAD'12], [Jiang et al., TCAD'12], [Shyu et al., TVLSI'13], [Tsai et al., ISPD'13]
• …

The Concept of Clock-Gating Cell

• A clock-gating cell can turn off the clocks at flip-flop inputs when they are not required.
  - In Fig. (a), the FFs will load new data at their input pins “D” only when the enable signal “EN” is active.
  - In Fig. (b), the CG can shut off “gclk” to the FFs when “Din” is not changed.
  - Result: less clock network power and smaller chip area.

The Concept of Clock-Gate Cloning

• A clock buffer chain may:
  - lengthen the delay
  - degrade the circuit performance
  - increase power consumption
• After replicating sufficient CGs and connecting each CG to a smaller number of FFs:
  - The number of required clock buffers can be reduced.
  - Power consumption and path delay of the gated clock network can be minimized.

The Concept of Multi-bit Flip-Flop

• Replacing 1-bit FFs with MBFFs can reduce total clock power by up to 30%. [Jiang et al., TCAD'12]
• An MBFF contains several 1-bit FFs which share common inverters in the MBFF cell. [Chen et al., SNUG'10]
• Replacing several 1-bit FFs with an MBFF will reduce:
  - inverters in FF cells
  - clock sinks
  - clock drivers

Previous Work of CG Cloning

• [Teng & Soin, ICSE'10]
  - Introduced a cutting-based algorithm to split a CG and redistribute the CG fanout according to the cut line.
  - The CG splitting algorithm is performed iteratively until the timing violation of each CG's enable signal is eliminated.
• [Vishweshwara et al., ISQED'12]
  - Proposed a clustering-based algorithm to recursively replicate a CG when the CG has a large fanout, or when the spreading area of its fanout is larger than a limit.

Previous Work of FF Merging

• [Kretchmer, EE Times Asia'01], [Chen et al., SNUG'10]
  - Demonstrated the feasibility of applying MBFFs during logic synthesis.
• [Pokala et al., ASIC'92]
  - Applied MBFFs before placement optimization.
• [Tsai et al., ISPD'13]
  - Applied MBFFs during placement optimization.
• [Lin et al., TCAD'11], [Wang et al., TCAD'12], [Jiang et al., TCAD'12], [Shyu et al., TVLSI'13]
  - Performed power optimization with MBFFs at the post-placement stage for better timing budgeting.

Our Contributions

• We present the first problem formulation for gated clock network optimization with simultaneous CG cloning and FF merging.
• We introduce a novel optimization flow consisting of:
  - MBFF-aware CG cloning
  - CG-based FF merging
  - MBFF and CG placement optimization
• We formulate the MBFF-aware CG cloning optimization problem as a partitioning problem.
  - Our formulation maximizes the skew slack corresponding to different CGs subject to bounded slack constraints.
• Our experimental results show that the proposed approach leads to lower dynamic power and shorter clock wirelength.

Outline

• Introduction
• Preliminaries
  - Power Model of Gated Clock Network
  - Inter-CG Clock Skew due to CG Cloning
  - Control-Path Timing Constraint for Gated Clock Network
  - Data-Path Timing Constraint for FF Merging
  - Placement Density Constraint for CGs and MBFFs
  - Problem Formulation
• The Proposed Algorithms
• Experimental Results
• Conclusions

Power Model of Gated Clock Network

• The power dissipated in the gated clock network can be modelled as follows. [Shen et al., TVLSI'10]

$$P_d = 0.5 \cdot \frac{V_{dd}^2}{T_{period}} \left[ (c_0 l_{clk} + C_{buf}^{clk} + C_{cg})\,\alpha_{clk} + (c_0 l_{gclk} + C_{buf}^{gclk} + C_{f})\,\alpha_{gclk} + (c_0 l_{en} + C_{buf}^{en} + C_{cg})\,\alpha_{en} \right]$$

where
  - $P_d$: dynamic power consumption
  - $T_{period}$: clock period
  - $V_{dd}$: supply voltage
  - $l$: wirelength
  - $C$: input capacitance
  - $c_0$: unit wire capacitance
  - $\alpha$: switching activity
  - subscripts $clk$, $gclk$, $en$: clock net, gated clock tree, enable signal net
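A minimal Python sketch of a power model of this shape may help make the terms concrete. It assumes each of the three nets (clock net, gated clock tree, enable signal net) contributes its wire capacitance plus buffer and pin input capacitance, weighted by its switching activity. The function name, argument layout, and numeric values are hypothetical and are not taken from the slides.

```python
def dynamic_power(v_dd, t_period, c0, nets):
    """Estimate dynamic power of a gated clock network.

    nets is an iterable of (wirelength, c_buf, c_pin, alpha) tuples, one per
    net (assumed here: clock net, gated clock tree, enable signal net).
    c0 is the unit wire capacitance, so c0 * wirelength is the wire load.
    """
    switched_cap = sum(
        (c0 * wirelength + c_buf + c_pin) * alpha
        for wirelength, c_buf, c_pin, alpha in nets
    )
    # 0.5 * V_dd^2 / T_period scales switched capacitance into power.
    return 0.5 * v_dd ** 2 / t_period * switched_cap


# Illustrative values only (arbitrary units).
example = dynamic_power(
    v_dd=1.0,
    t_period=2.0,
    c0=0.5,
    nets=[(2.0, 1.0, 1.0, 1.0)],
)
```

With this form, shortening a net or lowering its switching activity reduces its contribution linearly, which is why wirelength and gating activity both appear as optimization handles.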

Inter-CG Clock Skew due to CG Cloning

• When a CG is replicated in the gated clock network, the inter-CG clock skew, $T_{skew}$, can be calculated as follows.

$$T_{skew} = \left| (T_i^{clk} + T_i^{gclk}) - (T_j^{clk} + T_j^{gclk}) \right|$$

where
  - $T_{skew}$: inter-CG clock skew among gated FFs
  - $T_i^{clk}$: interconnection delay from the clock root to $g_i$
  - $T_i^{gclk}$: interconnection delay from $g_i$ to the farthest gated FF
  - $T_{CG}$: CG delay
  - $g_i$: the $i$-th CG

• To minimize $T_{skew}$, we shall balance the wirelength and flip-flop fanout numbers among all different CGs.
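As a small illustration of the skew definition, the sketch below assumes each CG is described by its two interconnection delays (clock root to CG, and CG to its farthest gated FF) and reports the largest pairwise difference of their sums. The function name and the tuple layout are assumptions made for this example.

```python
def inter_cg_skew(delays):
    """Return the worst-case inter-CG skew.

    delays is a list of (t_clk, t_gclk) pairs, one per CG:
      t_clk  - interconnection delay from the clock root to the CG
      t_gclk - interconnection delay from the CG to its farthest gated FF
    The largest pairwise difference equals max(total) - min(total).
    """
    totals = [t_clk + t_gclk for t_clk, t_gclk in delays]
    return max(totals) - min(totals)


# Three cloned CGs with illustrative delays (arbitrary units).
skew = inter_cg_skew([(1.0, 2.0), (1.5, 2.5), (0.5, 2.0)])
```

Balancing wirelength and fanout across CGs pulls the per-CG totals together, which is what drives this value toward zero.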

Control-Path Timing Constraint for Gated Clock Network

• The figure shows the control-path timing of the gated clock network.

[Equation: control-path timing constraint on $T_i^{en}$, $T_{EL}$, $T_{CG}$, $T_i^{gclk}$, and $T_{period}$]

where
  - $T_i^{en}$: interconnection delay from the enable logic (EL) to $g_i$
  - $T_{EL}$: EL delay
  - $T_{period}$: clock period

Data-Path Timing Constraint for FF Merging

• Only FFs whose timing-feasible regions have a common intersection can be merged. [Lin et al., TCAD'11], [Wang et al., TCAD'12], [Jiang et al., TCAD'12]
• The timing-feasible region of a flip-flop can be obtained from the available timing slack on the corresponding data paths.
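The merge test can be sketched as a region intersection. The example below assumes, as a simplification, that each timing-feasible region is an axis-aligned rectangle `(x_lo, y_lo, x_hi, y_hi)`; the actual region shape used by the cited works may differ, and the function name is hypothetical.

```python
def common_region(regions):
    """Intersect axis-aligned rectangles given as (x_lo, y_lo, x_hi, y_hi).

    Returns the common rectangle, or None when the FFs share no feasible
    location and therefore cannot be merged under this simplified model.
    """
    x_lo = max(r[0] for r in regions)
    y_lo = max(r[1] for r in regions)
    x_hi = min(r[2] for r in regions)
    y_hi = min(r[3] for r in regions)
    if x_lo > x_hi or y_lo > y_hi:
        return None
    return (x_lo, y_lo, x_hi, y_hi)


mergeable = common_region([(0, 0, 4, 4), (2, 1, 6, 5)])
not_mergeable = common_region([(0, 0, 1, 1), (2, 2, 3, 3)])
```

A merged MBFF would then have to be placed inside the returned rectangle so that every original data path keeps non-negative slack.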

Placement Density Constraint for CGs and MBFFs

• We divide the chip area into a number of bins of equal size. [Lin et al., TCAD'11], [Wang et al., TCAD'12], [Jiang et al., TCAD'12]
• A CG or an MBFF can only be placed in a bin whose density is less than the maximum placement density.
  - This distributes logic cells evenly throughout the chip area in order to avoid routing congestion.
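The bin-based density check can be sketched as follows. This is an illustrative model only: square bins, density defined as used cell area divided by bin area, and the class and method names are all assumptions introduced for the example.

```python
import math


class BinGrid:
    """Uniform grid of square bins tracking the cell area placed in each bin."""

    def __init__(self, chip_w, chip_h, bin_size, max_density):
        self.bin_size = bin_size
        self.max_density = max_density
        self.cols = math.ceil(chip_w / bin_size)
        self.rows = math.ceil(chip_h / bin_size)
        self.used = [[0.0] * self.cols for _ in range(self.rows)]

    def bin_of(self, x, y):
        # Clamp so points on the top or right chip edge map to the last bin.
        col = min(int(x // self.bin_size), self.cols - 1)
        row = min(int(y // self.bin_size), self.rows - 1)
        return row, col

    def density(self, row, col):
        return self.used[row][col] / (self.bin_size ** 2)

    def try_place(self, x, y, cell_area):
        """Place a CG or MBFF only if the bin is below the maximum density."""
        row, col = self.bin_of(x, y)
        if self.density(row, col) >= self.max_density:
            return False
        self.used[row][col] += cell_area
        return True


grid = BinGrid(chip_w=20, chip_h=20, bin_size=10, max_density=0.5)
results = [grid.try_place(5, 5, 30), grid.try_place(5, 5, 30),
           grid.try_place(5, 5, 10), grid.try_place(15, 5, 10)]
```

In the usage lines, the third placement is rejected because the first bin has already reached a density of 0.6, while the neighbouring bin still accepts cells.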

Problem Formulation

• Input
  - A clock-gating domain containing a set of FFs which are controlled by the gated clock signals whose switching activities are the same.
  - A cell library containing both CG and MBFF cells.
• Objectives
  - Minimize $P_d$ and $T_{skew}$ of the clock-gating domain. ($P_d$ is the primary objective, while $T_{skew}$ is the secondary one because $T_{skew}$ can be further minimized after clock tree routing.)
• Constraints
  - Control-path timing constraint
  - Data-path timing constraint
  - Placement density constraint

Outline

• Introduction
• Preliminaries
• The Proposed Algorithms
  - The Proposed Algorithms Flow
  - MBFF-aware CG Cloning
  - CG-based FF Merging
  - MBFF and CG Placement Optimization
• Experimental Results
• Conclusions

The Proposed Algorithms Flow

[Flowchart: Initial placement / Cell library / Design constraints → MBFF-aware CG Cloning → CG-based FF Merging → MBFF & CG Placement Opt. → Optimized placement containing newly generated CGs and MBFFs]

MBFF-aware CG Cloning

• The CG must be replicated and its fanout FFs bisected when:
  - the control path violates the timing constraint, or
  - the CG drives too many FFs, leading to larger clock power consumption.

Hypergraph Construction

• According to the timing-feasible region of each FF, we construct the hypergraph H(V, E).
  - $v_i$: the timing-feasible region of the FF $f_i$
  - $e_i$: the intersection among the timing-feasible regions of different $f_i$
  - $w(e_i)$: the number of vertices connected by $e_i$

[Figure: example hypergraph with $w(e_1) = 4$, $w(e_2) = 2$, $w(e_3) = 3$]
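One possible way to build such a hypergraph is sketched below. It assumes that timing-feasible regions are axis-aligned rectangles and that each hyperedge is a maximal group of FFs whose regions share a common point; both choices are assumptions of this example rather than details stated on the slide. The function name is hypothetical.

```python
from itertools import product


def build_hyperedges(regions):
    """Build weighted hyperedges from rectangular timing-feasible regions.

    regions maps an FF name to (x_lo, y_lo, x_hi, y_hi). For axis-aligned
    rectangles, the lower-left corner of any common intersection lies at
    some (x_lo, y_lo) combination, so those points are enough to probe.
    Returns a list of (frozenset_of_ffs, weight), weight = vertex count.
    """
    xs = {r[0] for r in regions.values()}
    ys = {r[1] for r in regions.values()}
    candidates = set()
    for x, y in product(xs, ys):
        covering = frozenset(
            name for name, (x0, y0, x1, y1) in regions.items()
            if x0 <= x <= x1 and y0 <= y <= y1
        )
        if len(covering) >= 2:
            candidates.add(covering)
    # Keep only maximal groups; strict subsets add no new merging option.
    maximal = [s for s in candidates if not any(s < t for t in candidates)]
    return sorted(((s, len(s)) for s in maximal), key=lambda e: sorted(e[0]))


edges = build_hyperedges({
    "f1": (0, 0, 4, 4),
    "f2": (2, 2, 6, 6),
    "f3": (3, 3, 5, 5),
    "f4": (10, 10, 12, 12),
    "f5": (11, 11, 13, 13),
})
```

Here `f1`, `f2`, and `f3` overlap at a common point and form one hyperedge of weight 3, while `f4` and `f5` form a second hyperedge of weight 2.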

Cut-line Determination with Inter-CG Skew Budgeting

• The cut direction is determined by the physical dimension of the FF bounding box. [Teng & Soin, ICSE'10]
  - A vertical (horizontal) cut is applied if the dimension in the x-direction is larger (smaller) than that in the y-direction.
• To balance the delay passing through different CGs, we sweep the cut line to search for the maximum skew slack, $T_{skew\_slack}^{max}$.
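The cut-direction rule can be illustrated with a short sketch. The function names are hypothetical, FF positions are assumed to be `(x, y)` points, and the tie case (equal width and height) is not specified on the slide, so this example arbitrarily falls back to a horizontal cut.

```python
def cut_direction(ff_positions):
    """Choose the cut direction from the FF bounding-box dimensions."""
    xs = [x for x, _ in ff_positions]
    ys = [y for _, y in ff_positions]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    # Assumption: ties fall back to a horizontal cut.
    return "vertical" if width > height else "horizontal"


def bisect_ffs(ff_positions, cut):
    """Split FFs into two sets on either side of the cut coordinate."""
    direction = cut_direction(ff_positions)
    axis = 0 if direction == "vertical" else 1  # vertical cut splits by x
    low = [p for p in ff_positions if p[axis] <= cut]
    high = [p for p in ff_positions if p[axis] > cut]
    return direction, low, high


ffs = [(0, 0), (10, 2), (5, 1), (8, 2)]
direction, left, right = bisect_ffs(ffs, cut=5)
```

Sweeping `cut` over the candidate coordinates and scoring each split is then what a search for the maximum skew slack would iterate over.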

Skew Slack (1/2)

• In Fig. (c) (Fig. (d)), the CGs are placed at the position closest to (farthest from) the clock root within the respective FF bounding boxes, resulting in the shortest (longest) clock signal delay from the clock root to the FFs.

[Figure: ranges of the clock signal delay $T^{clk} + T^{gclk}$ for $g_i$ and $g_j$, bounded by $(T_i^{clk} + T_i^{gclk})_{min}$, $(T_i^{clk} + T_i^{gclk})_{max}$, $(T_j^{clk} + T_j^{gclk})_{min}$, and $(T_j^{clk} + T_j^{gclk})_{max}$]

Skew Slack (2/2)

• The skew slack, $T_{skew\_slack}$, can be calculated as the difference between the minimum longest and the maximum shortest clock signal delay.

$$T_{skew\_slack} = \min\left\{ (T_i^{clk} + T_i^{gclk})_{max},\ (T_j^{clk} + T_j^{gclk})_{max} \right\} - \max\left\{ (T_i^{clk} + T_i^{gclk})_{min},\ (T_j^{clk} + T_j^{gclk})_{min} \right\}$$

• To balance the delay passing through different CGs more easily, we would like to find a physical cut line which maximizes the skew slack.

[Figure: delay ranges $T^{clk} + T^{gclk}$ of $g_i$ and $g_j$ with their minimum and maximum values; the skew slack $T_{skew\_slack}$ is their overlap]
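The skew slack amounts to the overlap of two delay ranges, which the sketch below computes directly. It assumes each CG's achievable clock signal delay is summarized as a `(shortest, longest)` pair; the function names and the candidate-cut dictionary are hypothetical.

```python
def skew_slack(range_i, range_j):
    """Overlap of two (shortest, longest) clock delay ranges.

    A positive value means both CGs can be positioned to reach the same
    delay; a negative value means the ranges do not overlap.
    """
    min_i, max_i = range_i
    min_j, max_j = range_j
    return min(max_i, max_j) - max(min_i, min_j)


def best_cut(candidates):
    """Pick the cut whose two resulting CGs have the largest skew slack.

    candidates maps a cut coordinate to (range_i, range_j).
    """
    return max(candidates, key=lambda cut: skew_slack(*candidates[cut]))


candidates = {
    4.0: ((2.0, 5.0), (3.0, 7.0)),   # slack 2.0
    6.0: ((2.0, 6.0), (2.5, 6.5)),   # slack 3.5
}
chosen = best_cut(candidates)
```

A larger overlap leaves more freedom to equalize the two CG delays later, which is why the cut with the largest slack is preferred.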

MBFF-aware FF Swapping

• We perform the FM (Fiduccia-Mattheyses) algorithm on H(V, E) to move FFs between different FF sets such that the cut size is minimized.
  - Cut size: sum of edge weights on the cut line
• A balance condition requires that the skew slack after moving an FF to the other FF set must not be less than $T_{skew\_slack}^{max}$ scaled by a balance factor between 0 and 1.
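The cut-size objective and the gain of a single FF move can be sketched as below. This is only the bookkeeping at the core of an FM-style pass, not a full implementation with gain buckets or the skew-slack balance check; the data layout and function names are assumptions.

```python
def cut_size(hyperedges, side):
    """Sum the weights of hyperedges whose vertices span both FF sets.

    hyperedges is a list of (vertices, weight); side maps vertex -> 0 or 1.
    """
    return sum(
        weight for vertices, weight in hyperedges
        if len({side[v] for v in vertices}) == 2
    )


def move_gain(hyperedges, side, vertex):
    """Reduction in cut size if `vertex` is moved to the other FF set."""
    flipped = dict(side)
    flipped[vertex] = 1 - side[vertex]
    return cut_size(hyperedges, side) - cut_size(hyperedges, flipped)


hyperedges = [({"f1", "f2", "f3"}, 3), ({"f3", "f4"}, 2)]
side = {"f1": 0, "f2": 0, "f3": 1, "f4": 1}
current = cut_size(hyperedges, side)
gain_f3 = move_gain(hyperedges, side, "f3")
```

Moving `f3` uncuts the weight-3 hyperedge but cuts the weight-2 one, so the gain is 1. In the flow described above, such a move would additionally have to pass the skew-slack balance condition before being accepted.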

CG-based FF Merging

• We merge 1-bit FFs into MBFFs starting from the four boundaries of the FF bounding box toward the center area, based on:
  - INTEGRA [Jiang et al., TCAD'12]
  - Spiral clustering technique [Chang et al., ISPD'12]

MBFF and CG Placement Optimization

• We perform MBFF and CG placement optimization to:
  - minimize inter-CG clock skew
  - minimize wirelength
  - minimize required clock buffers
  - satisfy control/data-path timing constraints
  - satisfy placement density constraints

MBFF Placement

• When placing the MBFFs controlled by the same CG, we search for placement bins such that:
  - the placement density constraint is satisfied,
  - each MBFF lies in its corresponding timing-feasible region, and
  - the FF bounding box of the CG fanouts is minimized.
• A smaller FF bounding box results in shorter gated clock signal wirelength, and hence smaller $T^{gclk}$ and $P_d$.

CG Placement

• The CGs are initially placed at feasible positions which satisfy the control-path timing constraints.
  - The feasible region of a CG is roughly an ellipse whose two foci are at the positions of the enable logic and one of the CG fanout FFs.
• We perform an iterative optimization algorithm to:
  - Move CGs around their feasible regions until the inter-CG clock skew cannot be further minimized.
  - Add clock buffers to either clock path from the clock root to a CG for delay balance.
  - Insert buffers to either enable signal path from the enable logic to a CG for a larger feasible region of the CG.
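The elliptical feasible region can be illustrated with a simple membership test. The sketch assumes the control-path timing budget has already been translated into a maximum total distance from the enable logic to the CG plus from the CG to the fanout FF, and it uses Euclidean distance; both are simplifications introduced here, and the function name is hypothetical.

```python
import math


def in_cg_feasible_region(point, enable_logic, fanout_ff, max_total_dist):
    """Check whether a CG position lies inside an ellipse-shaped region.

    The foci are the enable logic and one fanout FF. A point is inside
    when the sum of its distances to both foci stays within the budget.
    """
    total = math.dist(point, enable_logic) + math.dist(point, fanout_ff)
    return total <= max_total_dist


inside = in_cg_feasible_region((3, 4), (0, 0), (6, 0), max_total_dist=10.5)
outside = in_cg_feasible_region((3, 5), (0, 0), (6, 0), max_total_dist=10.5)
```

Inserting buffers on the enable path, as described above, corresponds to relaxing the budget in this model, which enlarges the ellipse.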

Outline

• Introduction
• Preliminaries
• The Proposed Algorithms
• Experiments
  - Experimental Setups
  - Experimental Comparisons
  - Experimental Results
• Conclusions

Experimental Setups

• Programming language
  - C++
• Platform
  - 2.26 GHz Intel Xeon machine under the Linux operating system
• We adopted the benchmark circuits in [Jiang et al., TCAD'12].
  - Added other logical, physical, and timing information for CGs, clock root, and EL.
  - Referred to the Nangate 45nm Open Cell Library to set the input capacitance.
  - Assumed that all FFs in each circuit are initially connected to the same CG.
  - Chose the circuits containing fewer than 1,000 FFs with reasonable FF bounding boxes.

Experimental Comparisons

• Reference Flow 1 & 2
  - The CG cloning technique is based on the MBFF-aware CG cloning without applying MBFF-aware FF swapping.
  - The FF merging technique is exactly the same as the CG-based FF merging.

Experimental Results (1/2)

• Comparison of the numbers of MBFFs with different bit numbers (“# of FFs”) and CG numbers (“# of CGs”).
  - Compared with “Reference Flow 1”, the proposed flow results in many more MBFFs with similar clock gate numbers.
  - Compared with “Reference Flow 2”, the proposed flow results in slightly more CGs and slightly fewer MBFFs.

Experimental Results (2/2)

• Comparison of the dynamic power consumption
  - 15% less than that resulting from “Reference Flow 1”.
  - 10% less than that resulting from “Reference Flow 2”.
• Comparison of the clock net wirelength
  - 22% less than that resulting from “Reference Flow 1”.
  - 18% less than that resulting from “Reference Flow 2”.
• Comparison of the signal net wirelength
  - 2% less than that resulting from “Reference Flow 2”.

Outline

• Introduction
• Preliminaries
• The Proposed Algorithms
• Experimental Results
• Conclusions

Conclusions

• We have presented a new problem formulation for clock network optimization with both CGs and MBFFs.

• We have also introduced novel techniques to optimize the gated clock network with CG cloning and FF merging simultaneously.

• The experimental results have shown that the proposed approach results in lower dynamic power and shorter clock wirelength compared with flows which optimize the gated clock network with CGs and MBFFs separately.

Thanks for Your Attention