Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning


Andrew B. Kahng and Xu Xu
UCSD CSE and ECE Depts.

Work supported in part by MARCO GSRC

Outline

• Motivation
• Performance driven bipartition problem
• New bipartitioning algorithm
• Experimental results
• Conclusion and future work

Partitioning and Performance

The hypergraph partitioning problem is to divide the nodes of a hypergraph into roughly equal parts; the traditional objective is to minimize cutsize.

In performance-driven partitioning, we also seek to minimize path delay on timing paths.

– Reduces delay by 16% while increasing cutsize by 17%

– Requires substantial gate replication

Previous Work (I)
• [Cong et al. ISPD-2002]

– Global clustering based algorithm with retiming

[Figure: flow with stages labeled "Min-delay clustering w/ retiming", "Min-cutsize clustering", and "De-clustering and refinement"]

– 14% reduction of delay with 10% increase in cutsize

– 139% increase in runtime compared with hMetis

Previous Work (II)

• [Ababei et al. ICCAD-2002]
– Reweighting based method

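[Figure: flow with stages labeled "Input", "Global timing analysis", "Find critical paths", "Reweighting" (path based or net based), and "Cutsize oriented partitioner, such as hMetis, MLPart"; the example graph carries edge weights of 1 and 2]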

Motivating Questions
• Can we avoid global timing analysis?
– Global timing analysis is extremely time-consuming
• Can we improve path delay without significantly degrading cutsize?
– Need smooth tradeoff between delay and cutsize
• Can we reduce implementation overheads?
– Previous methods store thousands of critical paths and continuously update them


Delay Model
Delay = hop_delay + node_delay
[Figure: a timing path crossing between Part 0 and Part 1; legend: FF nodes, combinational nodes, hop, cut]
• [Cong et al. ISPD-2002]: hop_delay = 5, node_delay = 1; Delay = 3x5 + 5x1 = 20
• [Ababei et al. ICCAD-2002]: hop_delay = Elmore delay, node_delay = constant
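The constant-delay example can be checked with a short sketch. It assumes, as the arithmetic 3x5 + 5x1 suggests, that the path has 3 hops and 5 nodes and that each contributes a fixed delay; the function and parameter names are illustrative, not from the original work.

```python
def path_delay(num_hops, num_nodes, hop_delay=5, node_delay=1):
    """Delay of one path: each hop and each node adds a constant delay."""
    return num_hops * hop_delay + num_nodes * node_delay


# 3 hops and 5 nodes with hop_delay = 5 and node_delay = 1
print(path_delay(3, 5))
```

With an Elmore-style hop delay the per-hop term would no longer be a constant, so this sketch covers only the constant-delay case.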

Performance Driven Bipartition Problem

Given:
• Hypergraph H = (V,E)
• Area balance tolerance s (0 < s < 1), a parameter to control allowable slack in the area constraint
• α, a given parameter which captures the tradeoff between cutsize and path delay (hopcount)
Find: a bipartition (V0|V1) which satisfies the area constraint and minimizes
α·(cutsize) + (1-α)·(Max_hopcount)
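A minimal sketch of the weighted objective, assuming the tradeoff parameter (called `alpha` here, an assumed name) lies between 0 and 1 and multiplies cutsize, while its complement multiplies the maximum hopcount. The area constraint is not modeled.

```python
def objective(cutsize, max_hopcount, alpha):
    """Weighted cost: alpha * cutsize + (1 - alpha) * max_hopcount."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return alpha * cutsize + (1.0 - alpha) * max_hopcount


# alpha = 1 reduces to pure min-cut; alpha = 0 looks only at hopcount.
print(objective(100, 4, 1.0), objective(100, 4, 0.0), objective(100, 4, 0.5))
```

Sweeping `alpha` between the two extremes is one way to explore the cutsize-delay tradeoff under this cost.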


Unidirectional Partition
• Path delay is minimized, with hopcount = 1, if the partition is unidirectional ("acyclic"), that is, all cuts are in the same direction
Problem:
• High cutsize
• A unidirectional solution may not exist
Can we achieve a "locally unidirectional" partition?
[Figure: two example partitions into Part 0 and Part 1, one with max hopcount = 5 and one with max hopcount = 3]

V-Shaped Nodes
V-shaped node: if a combinational node v satisfies:
• there exist vj, vt in the other part, and
• there is a path from vj to vt that includes only v
then v is a V-shaped node
[Figure: node v in one part, with vj and vt in the other part]
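The definition can be sketched as a local test on a directed netlist graph. This assumes the graph is given as predecessor and successor maps and that `part` maps each node to 0 or 1; all names here are hypothetical.

```python
def is_v_shaped(v, part, preds, succs, combinational):
    """True if combinational node v has a fan-in node vj and a fan-out node vt
    that both lie in the part v is not in, so vj -> v -> vt crosses the cut twice."""
    if v not in combinational:
        return False
    other = 1 - part[v]
    has_in = any(part[u] == other for u in preds.get(v, ()))
    has_out = any(part[w] == other for w in succs.get(v, ()))
    return has_in and has_out


preds = {"v": ["vj"], "vt": ["v"]}
succs = {"vj": ["v"], "v": ["vt"]}
part = {"vj": 1, "v": 0, "vt": 1}
print(is_v_shaped("v", part, preds, succs, {"vj", "v", "vt"}))
```

The test looks only at the immediate neighbors of v, which is why no global timing analysis is needed to evaluate it.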

V-Shaped Nodes in Critical Paths

Empirical observations from study of partitioning solutions:
• there are V-shaped nodes in the partitioning solutions
• every V-shaped node is included in many critical paths
• every critical path contains several V-shaped nodes

For testcase 1:
• Number of nets: 16377
• Number of critical paths: 26772
• On average, one critical path contains 27.6 nodes
• On average, one critical path contains 3.4 V-nodes
• On average, one V-node belongs to 233.7 critical paths

Key Idea: V-Shaped Nodes Elimination

Move V-shaped node "b" to reduce path hopcount
Before move:
• PATH: abc, hopcount = 2
• PATH: dbc, hopcount = 1
• PATH: ebc, hopcount = 1
After moving b:
• PATH: abc, hopcount = 0
• PATH: dbc, hopcount = 1
• PATH: ebc, hopcount = 1
[Figure: nodes a, b, c, d, e, f in Part 0 and Part 1, before and after moving b]
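The hopcounts for paths abc, dbc, and ebc can be reproduced with a small sketch. The part assignment below is an assumption chosen to match the listed hopcounts (the figure itself did not survive extraction): a and c share a part, while b, d, and e start in the other.

```python
def hopcount(path, part):
    """Number of consecutive node pairs on the path that lie in different parts."""
    return sum(1 for a, b in zip(path, path[1:]) if part[a] != part[b])


part = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 1}
paths = ["abc", "dbc", "ebc"]

before = {p: hopcount(p, part) for p in paths}
part["b"] = 0  # move the V-shaped node b across the cut
after = {p: hopcount(p, part) for p in paths}

print(before)
print(after)
```

Only path abc improves. The other two paths keep a hopcount of 1 because the cut between b and c is traded for a cut between d (or e) and b.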

Distance-k V-Shaped Nodes Elimination

k = 2: moving the V2 nodes "b, c" reduces path hopcount from 2 to 0
[Figure: path through nodes a, b, c, d across Part 0 and Part 1, before and after moving b and c]
Problems with large k:
• Cutsize may be greatly increased
• Delay of one path may be reduced while the delay of other paths is increased

New Gain Function

[Figure: node v before and after the move]
• g(v): traditional FM gain
• rj(v): reduction of Vj nodes after moving v
Gain(v) = δ(0)·g(v) + δ(1)·r1(v)
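One reading of the gain function, sketched under the assumption that δ(0) weights the traditional FM gain g(v) and each δ(j), j ≥ 1, weights the reduction rj(v). The generalization to a sum over j and the names used are assumptions.

```python
def biased_gain(g, r, delta):
    """delta[0] * g + sum over j >= 1 of delta[j] * r[j].

    g     -- traditional FM gain of the node
    r     -- dict mapping j to r_j, the reduction of V_j nodes (missing j counts as 0)
    delta -- list of weights, delta[0] for cutsize and delta[j] for V_j reduction
    """
    return delta[0] * g + sum(delta[j] * r.get(j, 0) for j in range(1, len(delta)))


# A move that worsens the cut by 1 but removes one V1 node, with delta = (1, 10)
print(biased_gain(-1, {1: 1}, [1, 10]))
```

With weights like these, a move that removes a V-shaped node can be preferred even when it costs a little cutsize.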

Distance-k Unidirectional Algorithm

Calculate initial gains for all nodes and store the gains
Select the node v with maximum gain
/* CLIP-like method: move the cluster that v belongs to */
Reset the gains of all nodes to zero
Move v and update the gains of v and its neighbors
While (at least one node has not been moved)
    Select one node v with the maximum updated gain
    Move v and update the related gains
Find the point in the move sequence at which the sum of gains is maximum; undo all moves after this point
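The final step of the pass (keep the best prefix of the move sequence, undo the rest) can be sketched as follows. Gain selection and updating are not modeled; the gain values and node names are made up for illustration.

```python
from itertools import accumulate


def best_prefix(gains):
    """Return (number of moves to keep, gain sum) maximizing the running sum.
    Keeping zero moves (sum 0) is allowed if every prefix is a loss."""
    best_len, best_sum = 0, 0
    for length, total in enumerate(accumulate(gains), start=1):
        if total > best_sum:
            best_len, best_sum = length, total
    return best_len, best_sum


def undo_after(part, moves, keep):
    """Flip back every node moved after the first `keep` moves."""
    for v in moves[keep:]:
        part[v] = 1 - part[v]


moves = ["a", "b", "c", "d", "e"]   # order in which nodes were moved
gains = [3, -1, 2, -4, 1]           # gain realized by each move
part = {v: 1 for v in moves}        # state after all five moved from part 0 to part 1

keep, total = best_prefix(gains)
undo_after(part, moves, keep)
print(keep, total, part)
```

Accepting temporarily negative gains before rolling back is what lets an FM-style pass climb out of local minima.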


Experimental Setup

• Four industry testcases obtained as LEF/DEF
• Model of Ababei et al. (ICCAD-2002) used to calculate delay
• Partitioning solutions compared to results of MLPart
– strongest multilevel netlist partitioning code
– website: http://nexus6.cs.ucla.edu/GSRC/bookshelf/Slots/Partitioning/MLPart
• All tests on 600MHz Intel Pentium-III Xeon

Biasing against V1 Nodes vs. MLPart

            MLPart                              MLPart + V-shaped nodes removal
Testcase    cutsize   h     delay   time(s)     cutsize   h     delay   time(s)
1           820.7     5.3   352.8   11.79       856.1     3.3   266.8   12.58
2           169.9     3.5   220.7   13.45       189.8     2.5   211.2   15.32
3           141.3     3     291.6   16.67       152.3     2.3   283.6   18.27
4           408.7     5.3   302.6   12.43       421.2     3.6   252.7   14.03

δ(0)=1, δ(1)=10

• Reduction of delay: 4.5%-24.4%, average: 15.1%
• Increase of cutsize: 3.0%-10.0%, average: 4.9%
• Increase of runtime: 6.3%-11.4%, average: 9.7%
Using the delay model in Cong et al. ISPD-2002:
• Reduction of delay: 4.3%-21.2%, average: 14.7%

Biasing against V2 Nodes vs. MLPart

            MLPart                              MLPart + Vk=2 nodes removal
Testcase    cutsize   h     delay   time(s)     cutsize   h     delay   time(s)
1           820.7     5.3   352.8   11.79       847.5     3     262.1   13.16
2           169.9     3.5   220.7   13.45       183.2     2     202.5   15.67
3           141.3     3     291.6   16.67       149.2     2     275.6   18.92
4           408.7     5.3   302.6   12.43       416.7     3.4   243.5   14.79

δ(0)=1, δ(1)=30, δ(2)=3

• Reduction of delay: 8.9%-30.0%, average: 18.7%
• Increase of cutsize: 3.1%-7.2%, average: 3.5%
• Increase of runtime: 11.9%-15.9%, average: 13.1%
Using the delay model in Cong et al. ISPD-2002:
• Reduction of delay: 8.3%-28.7%, average: 17.3%


Conclusions
• Simple yet efficient timing-driven partitioning that does not require global timing analysis
• Negligible implementation and runtime overhead
• Significantly reduces path delay, with cutsize and runtime almost the same as leading-edge MLPart
• Similar improvements observed with different path delay metrics
• Future work
– Impact of new partitioner on placement
– Efficient methods for biasing δ(k), k>2

Thank you!

Future Work
• Impact of new partitioner on placement
• Efficient methods for biasing δ(k), k>2

Why Performance Driven Partitioning?
• Achieving timing closure becomes increasingly difficult in deep-submicron technologies due to non-ideal scaling of interconnect delay
• Routing alone can no longer solve the timing problem, even with aggressive optimizations (buffer insertion, buffer/wire sizing, …)
• Timing needs to be addressed at all design stages
• Partitioning is a critical step in defining interconnect timing properties, but is traditionally driven by a cutsize objective

Previous Work (I)
• With Logic Replication
– Retiming
– Replication graph
• Without Logic Replication
– Net based reweighting
– Path based reweighting

FM Partitioning and Gain Function

Gain(v) = Reduction of cutsize after moving v
[Figure: node v before and after a move between Part 0 and Part 1; in the example, Gain(v) = -1]
• Start with random partition
• Move the node with the max gain and lock it
• Keep moving until all nodes are locked
• Find the best point in the move sequence
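A minimal sketch of the FM gain as "reduction of cutsize after moving v", recomputing the cut from scratch for clarity. Production FM implementations update gains incrementally instead. The small netlist is invented so that the move has gain -1, like the example on the slide.

```python
def cutsize(nets, part):
    """Number of nets whose pins span both parts."""
    return sum(1 for net in nets if len({part[u] for u in net}) > 1)


def fm_gain(v, nets, part):
    """Cutsize before moving v minus cutsize after moving v to the other part."""
    moved = dict(part)
    moved[v] = 1 - part[v]
    return cutsize(nets, part) - cutsize(nets, moved)


nets = [("v", "x"), ("v", "y"), ("v", "z")]
part = {"v": 0, "x": 0, "y": 0, "z": 1}
print(fm_gain("v", nets, part))
```

Here one net is cut before the move and two are cut after it, so the move makes the cut worse by one.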

Procedure to Calculate rj(v)
Delete all FF nodes and their related edges
In the remaining graph, BFS from v
For each level j from 1 to k
    rj' = 1 if v is a Vj node before moving, otherwise 0
    rj'' = 1 if v is a Vj node after moving, otherwise 0
    rj = rj' - rj'' (the reduction of Vj nodes)

CLIP Algorithm


Reminiscent of CLIP (Deng et al. DAC 1996) in how it induces movement of clusters across the cutline.

Distance-k V-Shaped Nodes

Distance-k V-shaped nodes (Vk-nodes): if k combinational nodes vi,1 … vi,k satisfy:
• vi,1 … vi,k are in the same part
• vj, vt are in the other part
• there is a path from vj to vt that passes only through vi,1 … vi,k
then vi,1 … vi,k are distance-k V-shaped nodes
[Figure: vj and vt in one part; vi,1 … vi,k in the other part]
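Along a single path, distance-k V-shaped nodes show up as runs of k consecutive nodes in one part, bounded on both sides by nodes of the other part. The sketch below finds such runs, assuming the path is given as a list of combinational nodes and `part` maps each node to 0 or 1; the names are illustrative.

```python
def v_runs(path, part):
    """Return (k, nodes) for each maximal same-part run that has a node of the
    other part immediately before it and immediately after it on the path."""
    runs = []
    n = len(path)
    i = 1
    while i < n - 1:
        if part[path[i]] != part[path[i - 1]]:
            j = i
            while j + 1 < n and part[path[j + 1]] == part[path[i]]:
                j += 1
            if j < n - 1:  # a node of the other part follows the run
                runs.append((j - i + 1, path[i:j + 1]))
            i = j + 1
        else:
            i += 1
    return runs


part = {"a": 0, "b": 1, "c": 1, "d": 0}
print(v_runs(["a", "b", "c", "d"], part))
```

For the path a, b, c, d this reports b and c as a single k = 2 run, which is the configuration that moving "b, c" together eliminates.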

Notation

• H(V,E) = circuit hypergraph
• V = set of nodes representing components of the circuit
• E = set of signal nets
• A bipartition (V0|V1) of H(V,E) divides V into two disjoint subsets s.t. V = V0 ∪ V1, which are called Part 0 and Part 1
• A = the total area of all the nodes in V
• A0 = the area of all the nodes in V0
