IEEE Trans. on Circuits and Systems-I, vol. 41, no. 5, pp. 395-404, May 1994

Clock Distribution in General VLSI Circuits

Parameswaran Ramanathan*, Anthony J. Dupont†, Kang G. Shin‡

* Department of Electrical and Computer Engineering, University of Wisconsin, Madison, Wisconsin [email protected], (608) 263-0557
† AT&T Bell Laboratories, 2000 N. Naperville Road, Naperville, IL [email protected], (708) 979-1057
‡ Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, Michigan [email protected], (313) 763-0391

ABSTRACT

Minimization of clock skew in VLSI circuits to within a tolerable range is important for dependable operation of any digital system. Moreover, excessive delay through a clock distribution network can significantly degrade the performance of the digital system. Differences in path lengths and active elements of a clock distribution network are largely responsible for clock skew, while excessive delay is a result of very long signal routes in the network. Improved integrated circuit processes are placing an increasing demand on current clock routing schemes through higher clock rates and larger die sizes.

This paper proposes a clock routing scheme that primarily minimizes clock skew in a general VLSI circuit whose functional elements may be of various sizes and placements. A secondary objective is to reduce the overall network delay. The clock distribution network is generated based on the analysis of RC trees. In the networks so generated, the delay seen from the clock entry point of a circuit to all modules within the circuit is nearly identical. In constructing the clock distribution networks, the fanout of a buffer is accounted for and flexibility in placement of buffers is utilized.

Index Terms: clock skew, VLSI routing, RC trees, synchronization, networks.

The work reported here is supported in part by the National Science Foundation under Grant MIP-9009154 and NASA under grants NAG-1-296 and NAG-1-492. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the NSF or NASA.

1 Introduction

The clock signal in a digital circuit is responsible for synchronizing the transfer of data between processing elements and defines the precise instants when the circuit is allowed to change state. Since this signal is needed almost universally across the circuit, its characteristics determine the performance of the circuit it controls. It is therefore crucial that the distribution of the clock signal to all processing elements be free of any timing uncertainties.

Ideally, the clock signal should appear to all processing elements at the same instant. However, since the clock signal must be routed via a distribution network that includes clock distribution logic (buffers) and interconnects, it may arrive at the inputs of the processing elements at different times. This difference in arrival times is defined as the clock skew, which, if greater than a certain threshold, may cause incorrect operation of the circuit. Moreover, the distance through the distribution network that a clock signal must travel determines the propagation delay of the signal. An excessively long signal path will introduce large delays that will significantly degrade the performance of the circuit.

The problem of eliminating or minimizing clock skews and/or large clock delays has been an ever increasing challenge due to rapid improvements in VLSI technology. Higher clock speeds due to shrinking circuit geometries have reduced the permissible delay and skew in delivering a clock signal. Meanwhile, larger die sizes have increased the minimum clock delays due to larger clock distribution networks, thus making the problem even more difficult.

The solutions that have been proposed in the literature can be classified into two different categories. The first category of solutions deals with the problem by equalizing the length of clock signal paths to all the processing elements [8,9,14]. In doing so, they do not account for clock skews that arise due to variations in process parameters. On the other hand, the second category of solutions deals mainly with skews due to process variations [7,12]. The approach in this category of solutions is to tune FET parameters in clock buffers to eliminate or reduce process-dependent skews. Since, in practice, skews occur due to both differences in the length of clock signal paths and process variations, it is necessary to use solutions from both categories to alleviate the clock skew problem.

The solution proposed in this paper belongs to the first category. The main difference between the proposed solution and other related work is that the proposed solution is applicable to general VLSI circuits in which the processing elements are of different sizes and placed in any arbitrary form. For instance, in [14], Wann and Franklin proposed a scheme that is applicable only if all the processing elements are identical in size and placed in the form of a symmetric array. Since, in a general VLSI circuit, processing elements are usually of different sizes and placed based on considerations such as area, the scheme in [14] cannot be easily extended to general VLSI circuits. Recently, Jackson et al. described a scheme for clock routing in small-cell Application Specific Integrated Circuits (ASICs) [8], in which they assumed unlimited freedom in placing the clock routes. Since such freedom may not exist when processing elements are large macro-structures, the scheme in [8] also cannot be applied to general VLSI circuits.

The scheme proposed in this paper is an extension of the solution in [9]. The primary objective of the scheme is to minimize the clock skew subject to a secondary objective of minimizing the clock delay. Unlike [9], the scheme proposed here uses a more accurate estimate of the clock delay and skew based on the tree structure of the clock distribution network [6,11]. Furthermore, unlike [9], the proposed scheme also determines the location of the clock buffers based on delays, skews, and fanout restrictions.

The basic idea of the proposed scheme is to recursively construct a delay-balanced clock distribution network starting from the clock inputs of the processing elements. It is based on the assumption that the relative placement of all the processing elements has already been determined by some placement algorithm. At all stages of the construction of the network, the interconnecting paths are selected in such a way that the delay (rather than length) to all processing elements is the same. Besides, the construction of the network is such that it can be easily parallelized.

More recently, a similar approach has been proposed in [1,13]. The main difference between our approach and the approaches in [1,13] lies in the algorithms used to select the two sub-trees to combine when building the clock distribution tree in a bottom-up fashion. Since the selection algorithms are mainly heuristics, further studies are required to determine which algorithm is better. Such studies are beyond the scope of this paper.

In this paper, several random circuit layouts and one actual integrated circuit were used as examples to measure the effectiveness of the proposed scheme. As shown in Section 6, the runtimes for generating the clock distribution network in each one of these examples are fairly small. Further, the resulting skews are also a small fraction of the delay seen in the network.

This paper is organized as follows. In Section 2 the clock distribution problem is defined and the notations used are introduced. This is followed by a description of the proposed algorithm in Section 3. Correctness of the algorithm is proved in Section 4 and some practical considerations follow in Section 5. The results for several example layouts are given in Section 6. Finally, the paper concludes with Section 7.

2 Problem Formulation

As mentioned earlier, the primary objective of the proposed scheme is to minimize the clock skew subject to a secondary objective of decreasing the delay caused by the clock routing. In this section, we formulate this problem and introduce the notations used in this paper. A brief review of the RC tree delay model from [6,11] is also presented since it forms the basis of the proposed scheme.

Consider a typical clock distribution tree shown in Fig. 1. At various points in such a tree, the clock signal splits into two or more branches in order to distribute the signal directly to the processing elements or to extend the signal across the circuit where it may be split further into more branches. The proposed scheme improves upon the general tree structure by forcing all branches to be balanced with respect to delay, creating a routing tree in which the delay seen from the entry point of the clock signal to all processing elements is nearly identical.

In generating the clock distribution tree, the proposed scheme overcomes the main deficiency of the scheme in [9]. The primary consideration in [9] is the length of the signal route from the entry point to any single processing element. The skew between all processing elements is assumed to be minimum if all such lengths are the same.
However, this assumption is not valid because the delay of a particular signal path is affected by the capacitance of other paths in the distribution tree. For example, according to [9], the delays from a to d and g in Fig. 1 are the same because the lengths of the respective signal paths are the same. However, the branch which provides the clock signals from b to e and f contributes a capacitance which would actually increase the delay from a to g. As a result, there will be a skew between the clock signals at d and g. This problem becomes more severe as the size of the clock distribution tree increases, and is hence addressed here by using a more accurate RC tree delay model [6,11].

Figure 1: Clock distribution example. (Numbers indicate length of represented path.)

2.1 Layout Representation

The floorplan of a given circuit, such as the one in Fig. 2, is defined to consist of modules and channels. A module is a single processing element to which a clock signal must be routed. A circuit may be composed of any number of modules. A channel is the space between the modules through which signals can be routed. It is through these channels that the clock is to be distributed in the circuit. A channel may be either horizontal or vertical depending on its orientation within the given circuit floorplan. The perimeter of the circuit floorplan can be viewed as a set of two vertical and two horizontal channels which frame the floorplan.

Figure 2: Example circuit floorplan. (X denotes clock signal input.)

The floorplan of a given circuit is represented by an undirected weighted graph referred to as the floorplan graph [2,3,9]. The edges of the floorplan graph represent the channels in the circuit floorplan, while the vertices of the floorplan graph represent either the intersection points of the floorplan channels or the clock signal inputs to the individual modules. Two vertices, s and t, are connected by an edge in the floorplan graph if and only if there is a channel in the floorplan from s to t which does not contain any other vertex of the floorplan graph.

The weight of an edge between vertices s and t in the floorplan graph is equal to the physical length of the channel between the two points corresponding to the two vertices. The floorplan graph corresponding to the floorplan in Fig. 2 is shown in Fig. 3. The clock inputs to the modules are marked with × while all other vertices carry a distinct marker. Vertices are numbered and the weights of all the edges are also shown.

Figure 3: Floorplan graph for the example in Fig. 2.

Definition 1: In a floorplan graph, G, a path from vertex s to vertex t is a finite sequence of adjacent vertices $v_0 v_1 \cdots v_r$, $r \ge 1$, such that: (i) $v_0 = s$ and $v_r = t$ are the endpoints of the path, (ii) $\forall\, j \in \{0, \ldots, r-1\}$, $v_j$ and $v_{j+1}$ lie on a common edge. The length of the path p, denoted by L(p), is equal to the sum of the weights of the edges between adjacent vertices.

Definition 2: In a floorplan graph G, a c-net N is a quadruple $\langle p_1, p_2, L, R \rangle$, where $p_1$ and $p_2$ are two distinct paths in G such that (i) $p_1$ and $p_2$ share a common endpoint referred to as the head node of the c-net, denoted by Head(N), (ii) L and R are either c-nets or the clock inputs of modules, and (iii) if L (R) is a c-net, then $p_1$ ($p_2$) is a path from Head(N) to Head(L) (Head(R)). Otherwise, $p_1$ ($p_2$) is a path from Head(N) to L (R).

Definition 3: The set of leaf nodes of a c-net $N = \langle p_1, p_2, L, R \rangle$ is defined to be the union of the leaves of L and R, where if either L or R is a clock input to a module, it is taken to be a leaf node. Let the set of leaf nodes of a c-net N be denoted by Leaves(N). The number of leaf nodes in Leaves(N) is referred to as the net-fanout of N, denoted by N-fan(N).

Fig. 4 illustrates a c-net, $N = \langle p_1, p_2, L, R \rangle$, which includes two c-nets, L and R. Head(N) directly corresponds to vertex a, since both paths, $p_1$ and $p_2$, have a as their shared endpoint. Vertices b and c are the head nodes of c-nets L and R, respectively. While the leaf nodes of c-net L are only the vertices d and e and those of c-net R are f and g, the leaf nodes of c-net N include those of both L and R: d, e, f and g. The inherent tree structure of the c-net is also illustrated in Fig. 4.
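The recursive structure in Definitions 2 and 3 can be sketched as a small data type. The following Python sketch is only illustrative: the class name `CNet`, the field names, and the use of vertex labels as strings are assumptions made here, and each path of Fig. 4 is reduced to its two endpoints.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class CNet:
    """A c-net <p1, p2, L, R> as in Definition 2 (hypothetical representation)."""
    head: str                    # Head(N): endpoint shared by p1 and p2
    p1: List[str]                # vertices from Head(N) to Head(L), or to L itself
    p2: List[str]                # vertices from Head(N) to Head(R), or to R itself
    left: Union["CNet", str]     # L: a c-net or a module clock input (vertex label)
    right: Union["CNet", str]    # R: a c-net or a module clock input (vertex label)


def leaves(n: Union[CNet, str]) -> List[str]:
    """Leaves(N): union of the leaves of L and R (Definition 3)."""
    if isinstance(n, CNet):
        return leaves(n.left) + leaves(n.right)
    return [n]  # a module clock input is itself a leaf node


def net_fanout(n: Union[CNet, str]) -> int:
    """N-fan(N): number of leaf nodes."""
    return len(leaves(n))


# The c-net of Fig. 4, with each path reduced to its two endpoints (assumed).
L = CNet("b", ["b", "d"], ["b", "e"], "d", "e")
R = CNet("c", ["c", "f"], ["c", "g"], "f", "g")
N = CNet("a", ["a", "b"], ["a", "c"], L, R)

print(leaves(N))      # ['d', 'e', 'f', 'g']
print(net_fanout(N))  # 4
```

Representing L and R as either a nested `CNet` or a bare label mirrors the "c-net or clock input" alternative in Definition 2, so the leaf set falls out of a plain recursion.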

Figure 4: Example c-net within a floorplan graph. (a) Physical c-net structure. (b) Conceptual tree structure of a c-net.

2.2 RC Tree Delays

The paths in a c-net represent the physical conduits of the clock distribution network through which the clock signal can propagate from the clock entry point of the circuit to the various modules. It is the built-in line resistance and capacitance of these physical conduits that determine the propagation delay through the network. Each path in the c-net can be represented as a series resistance, $R_i$, and a shunt capacitance, $C_i$, which represent the distributed resistance and capacitance used to model transmission lines. The circuit configuration of the c-net shown in Fig. 4 is illustrated in Fig. 5.

Figure 5: Equivalent RC circuit for the c-net of Fig. 4.

Values for both the resistance and capacitance of each path are determined from the length of the path. Given a path $p_i$ with length $L(p_i)$, and values for resistance per-unit-length r and capacitance per-unit-length c, which are determined from the physical dimensions and properties of the conduit, we can determine $R_i$ and $C_i$ as:

$$R_i = r L(p_i) \quad \text{and} \quad C_i = c L(p_i).$$

The capacitance $C_g$ in Fig. 5 represents the gate capacitance of an input buffer which is assumed to exist within each module to which the clock is routed. The values for $C_g$ may vary from module to module in a realistic circuit, but for simplicity a uniform value is assumed. With these values, the delay from the head node of a c-net to any leaf node can be found by applying the algorithm for RC tree delays [6,11]. As shown in [6], for a unit step function, the time response at node i within a tree of RC interconnects can be approximated by

$$v_i(t) = 1 - e^{-t/T_{D_i}}$$

where $T_{D_i}$ is an estimate of the signal delay seen from the head node of a tree structure, such as a c-net, to a leaf node i. $T_{D_i}$ is commonly referred to as the Elmore delay [5] and its value is given by:

$$T_{D_i} = \sum_k R_{ik} C_k \tag{2.1}$$

where $R_{ik}$ is the sum of the resistances of those paths in the c-net which are shared by the routes from the head node to vertices i and k, and $C_k$ is the shunt capacitance seen at the vertex k plus any gate capacitance, $C_g$, if k is a leaf node.

By applying Eq. (2.1) to any c-net, a unique delay can be found from the head node of the c-net to each leaf node. If N is a c-net then the delay from Head(N) to a leaf node $x \in \mathrm{Leaves}(N)$ is denoted by D[N,x].
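Eq. (2.1) can be evaluated directly from its definition. The sketch below is a minimal illustration, assuming the tree is given as a parent map and that each path's resistance and capacitance are keyed by the vertex at which the path ends; the function names, the path lengths, and the unit values of r, c and $C_g$ are all hypothetical.

```python
def route_segments(parent, node):
    """Segments on the route from the head node to `node`.

    Each non-head vertex names the segment (path) that ends at it.
    """
    segs = set()
    while parent[node] is not None:
        segs.add(node)
        node = parent[node]
    return segs


def elmore_delay(parent, R, C, i):
    """T_Di = sum over k of R_ik * C_k, as in Eq. (2.1)."""
    route_i = route_segments(parent, i)
    total = 0.0
    for k, C_k in C.items():
        shared = route_i & route_segments(parent, k)
        R_ik = sum(R[s] for s in shared)   # resistance common to both routes
        total += R_ik * C_k
    return total


# Topology of Fig. 5: head a, internal vertices b and c, leaves d, e, f, g.
parent = {"a": None, "b": "a", "c": "a", "d": "b", "e": "b", "f": "c", "g": "c"}
# Hypothetical lengths l1..l6 of paths p1..p6, keyed by the vertex each path ends at.
length = {"b": 1.0, "c": 2.0, "d": 1.0, "e": 1.0, "f": 3.0, "g": 4.0}
r, c, Cg = 1.0, 1.0, 0.0          # illustrative unit values, not process data
leaf_nodes = {"d", "e", "f", "g"}

R = {v: r * l for v, l in length.items()}
C = {v: c * l + (Cg if v in leaf_nodes else 0.0) for v, l in length.items()}

print(elmore_delay(parent, R, C, "g"))  # 34.0 = l2*(l2 + l5) + (l2 + l6)*l6
```

The double loop follows the definition literally and is quadratic in the number of vertices; a production implementation would more likely accumulate downstream capacitance in one tree traversal.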

For example, by applying Eq. (2.1) to the c-net N in Fig. 5, the delay from the head node, Head(N) = a, to the leaf node g is given by:

$$D[N,g] = R_2 (C_2 + C_5 + C_g) + (R_2 + R_6)(C_6 + C_g).$$

If the gate capacitance, $C_g$, is taken to be zero, and the lengths of all paths $p_i$ in c-net N are represented by the term $l_i = L(p_i)$, then the delay expressed in terms of path length is:

$$D[N,g] = rc \left( l_2 (l_2 + l_5) + (l_2 + l_6)\, l_6 \right),$$

where r is the resistance per-unit-length and c is the capacitance per-unit-length.

Each c-net N has a unique delay from its head node to each leaf node. The average of these delays is referred to as the net-delay of the c-net, denoted by D[N]:

$$D[N] = \frac{1}{\text{N-fan}(N)} \sum_{x \in \mathrm{Leaves}(N)} D[N,x]. \tag{2.2}$$

In addition, the capacitance seen in all the paths of a c-net N can be represented by a single lumped capacitance at the head node of N. Such a capacitance, referred to as the net-capacitance C[N], is defined to be the sum of the lengths of all paths within N multiplied by c, the capacitance per-unit-length.

3 Proposed Algorithm

The solution proposed in this paper is based on the RC tree analysis as described in Section 2. A clock distribution network is constructed from the bottom up, starting with the clock inputs at the individual modules and ending with a single node as entry to the network. The clock routes are determined with respect to the placement of modules and channels and the layout constraints involved. Additionally, the fanout of clock buffers is considered and flexibility in placement of buffers is utilized to allow the maximum possibility for an optimal routing scheme. In previous work, the clock distribution paths were determined either to fit pre-placed buffers, or with no consideration for buffer placement during routing.

A routing solution is obtained by repetitively selecting pairs of existing c-nets and combining them via an interconnecting path to form a larger c-net. Initially, the layout consists of a set of modules which, for ease of presentation, are considered individual c-nets, with net-delays of zero and net-capacitance taken to be the value of the input buffer gate capacitance $C_g$. Associated with each c-net is a level which indicates the "height" of its head node with respect to its leaf nodes. The initial c-nets (modules) are at level 0, a c-net constructed from two level-0 c-nets is at level 1, a c-net constructed from two level-1 c-nets is at level 2, and so on. A new c-net can only be constructed from two c-nets having the same level number, i.e., all level-0 c-nets must be combined into level-1 c-nets before level-1 c-nets are allowed to be combined into larger c-nets. As the level number of the c-nets increases, the number of c-nets to be combined decreases. The number of c-nets to be combined at level i + 1 is approximately one half of that at level i. By continuously combining c-nets, a final single c-net is obtained which represents the final clock routing solution.

The general algorithm, called Croute, for routing a set of modules into a clock distribution network is presented in Fig. 6. Note that for simplicity, buffer placement and fanout are not considered in the algorithm but will be discussed later. Croute, as presented, assumes infinite fanout for the clock buffer driving the entire clock network.

1.  Let $\Omega_i$ be the set of c-nets at level i.
2.  Done := FALSE;
3.  Let i := 0 and initialize $\Omega_0$ to module clock inputs.
4.  repeat
5.      while $|\Omega_i| > 0$ do
6.          Select a c-net, A, from $\Omega_i$;
7.          Find a "suitable" c-net, B, from $\Omega_i$;
8.          Combine A and B into new c-net C and place C in $\Omega_{i+1}$;
9.      end;
10.     if $|\Omega_{i+1}| = 1$ then
11.         Done := TRUE;
12.     else
13.         i := i + 1;
14. until Done = TRUE;

Figure 6: Algorithm Croute.

In what follows, we give a detailed account of each step in Fig. 6.

Step 1: The set of c-nets to be combined at level i is denoted by $\Omega_i$.

Step 2: Initialize the "Done" flag to FALSE to indicate an incomplete network.

Step 3: Initially, all modules are considered to be level-0 c-nets with net-delays of zero and net-capacitances corresponding to the input gate capacitance, $C_g$, of the module.

Step 5: While there are still level-i c-nets which need to be combined to level-(i + 1) c-nets, execute Steps 6 through 8.

Step 6: This step is self-explanatory. A c-net A is selected and removed from the set of c-nets $\Omega_i$.

Step 7: A second c-net, if any, is selected and removed from the set of c-nets, $\Omega_i$. If the number of c-nets initially at level i is odd, then there will be a "lone" c-net, selected in Step 6, for which no second c-net exists with which to combine into a new c-net. Such a c-net will simply be included in the set of c-nets at level i + 1, $\Omega_{i+1}$, and become a candidate for combination during the next iteration of the loop. However, if a second c-net B does exist, then B must satisfy certain criteria in order to be paired with A. In order to minimize the overall delay within the network, it is necessary to minimize the length of the interconnect needed between the two selected c-nets. Therefore, it is desirable for A and B to be physically close, and the criteria may require B to be the closest c-net to A. Additional conditions may be imposed depending on the desired resulting clock distribution network.

Step 8: This is the heart of the algorithm. The primary objective when creating a new c-net, $N = \langle p_1, p_2, A, B \rangle$, is to balance the difference in net-delays, D[A] and D[B], of the two original c-nets, A and B, by determining paths of some length such that the delay seen from Head(N) to all Leaves(A) and Leaves(B) is the same. The process for creating such a c-net is outlined in the following steps.

Step 8.1: Given two c-nets A and B, find a path p within the floorplan graph, having Head(A) and Head(B) as endpoints and a minimum length given by:

$$L(p) \ge \begin{cases} \dfrac{-rC[A] + \sqrt{(rC[A])^2 + 4rc\,\bigl|D[A] - D[B]\bigr|}}{2rc} & \text{if } D[A] \le D[B] \\[2ex] \dfrac{-rC[B] + \sqrt{(rC[B])^2 + 4rc\,\bigl|D[A] - D[B]\bigr|}}{2rc} & \text{if } D[A] > D[B]. \end{cases}$$

Step 8.2: Create a new c-net C by choosing a vertex contained in p to be Head(C), splitting p into two paths, $p_1$ and $p_2$, such that:

$$L(p_1) = \frac{D[B] - D[A] + rcL(p)^2 + rL(p)C[B]}{2rcL(p) + rC[A] + rC[B]}.$$

Step 8.3: Given the new c-net, $C = \langle p_1, p_2, A, B \rangle$, calculate values for the net-delay, D[C], and the net-capacitance, C[C]. Include C in the set $\Omega_{i+1}$.

Step 10: The set $\Omega_{i+1}$ is examined to see whether it contains more than one c-net.

Step 11: If a single c-net remains, then that is the final routing network, so set the "Done" flag to TRUE.

Step 13: Otherwise, repeat the loop with the next level i + 1.

Step 14: Continue with Step 5 if "Done" is not TRUE.

It is important that the path selected in Step 8.1 be of sufficient length to completely compensate for the difference in the net-delays of A and B. If p were to be of length less than the minimum required, the resulting paths $p_1$ and $p_2$ would not be long enough and would be unable to contribute enough of an rc delay to equalize the delay seen from Head(C) to Leaves(A) and Leaves(B). Occasionally, the difference between the two net-delays is significant enough to require a path of substantial length. If the delay difference is equivalent to that of a single buffer, then a buffer may be included with the path to aid in equalizing the delays. Note that a buffer so inserted is placed without regard to fanout issues and is employed simply for delay purposes.

4 Proof of Correctness

The basic idea of algorithm Croute is to repeatedly combine two smaller c-nets into a larger c-net. In this section, we prove that the sequence of steps in Croute is correct in the sense that it results in a delay-balanced clock distribution tree.

Fig. 7 shows a conceptual picture of the construction of a larger c-net C from two smaller c-nets A and B. The creation of this c-net depends primarily on selecting a path p in the floorplan graph which joins the two vertices, Head(A) and Head(B). A vertex contained in p is selected to become the head node of the new c-net, C, which effectively splits p into two paths, each having the selected vertex as one of their endpoints. For example, in Fig. 7, vertex h is the selected head node which creates the new paths, $p_1$ and $p_2$.
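As a numerical illustration of Steps 8.1 and 8.2, the following Python sketch computes the minimum interconnect length and the position of the new head node for two delay-balanced c-nets. The function names and the unit parameter values are assumptions chosen to keep the arithmetic easy to follow; `delay_via` uses the delay expression $D + rcL^2 + rLC$ that appears in the balance condition of Section 4.

```python
import math


def min_path_length(r, c, D_A, C_A, D_B, C_B):
    """Smallest admissible L(p) according to Step 8.1."""
    diff = abs(D_A - D_B)
    # The bound uses the net-capacitance of the c-net with the smaller net-delay.
    C_x = C_A if D_A <= D_B else C_B
    return (-r * C_x + math.sqrt((r * C_x) ** 2 + 4 * r * c * diff)) / (2 * r * c)


def split_path(r, c, L_p, D_A, C_A, D_B, C_B):
    """Lengths (L(p1), L(p2)) from Step 8.2, with L(p2) = L(p) - L(p1)."""
    L_p1 = (D_B - D_A + r * c * L_p ** 2 + r * L_p * C_B) / (
        2 * r * c * L_p + r * C_A + r * C_B
    )
    return L_p1, L_p - L_p1


def delay_via(r, c, L, D, C):
    """Delay from the new head node through a path of length L into a
    delay-balanced c-net with net-delay D and net-capacitance C."""
    return D + r * c * L ** 2 + r * L * C


# Illustrative unit values (assumed, not taken from the paper's experiments).
r, c = 1.0, 1.0
D_A, C_A = 0.0, 1.0
D_B, C_B = 6.0, 1.0

L_min = min_path_length(r, c, D_A, C_A, D_B, C_B)    # 2.0
L_p1, L_p2 = split_path(r, c, 3.0, D_A, C_A, D_B, C_B)
print(L_min, L_p1, L_p2)                              # 2.0 2.25 0.75
print(delay_via(r, c, L_p1, D_A, C_A),
      delay_via(r, c, L_p2, D_B, C_B))                # 7.3125 7.3125
```

With a path of exactly the minimum length (2.0 here), the same formulas give L(p1) = 2 and L(p2) = 0, i.e. the new head node coincides with Head(B), which is the boundary case of the inequality in Step 8.1.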
The sum of the lengths of the individual paths, $p_1$ and $p_2$, exactly equals the length of the original path p, i.e.,

$$L(p) = L(p_1) + L(p_2). \tag{4.1}$$

Figure 7: Construction of c-net C from c-nets A and B.

Having split the path, p, into paths $p_1$ and $p_2$, the delays seen from Head(C) to the leaf nodes of both c-nets A and B need to be determined. Since, except for the vertex h, path $p_1$ is disjoint from path $p_2$, there is no common shared resistance between the two. Thus the delay seen from h through path $p_1$ to the leaf nodes of c-net A is different from the delay seen from h through path $p_2$ to the leaf nodes of B.

Lemma 1: If $N = \langle p_1, p_2, A, B \rangle$ is a c-net, then the delay from Head(N) to $x \in \mathrm{Leaves}(N)$ is:

$$D[N,x] = \begin{cases} D[A,x] + rcL(p_1)^2 + rL(p_1)C[A] & \text{if } x \in \mathrm{Leaves}(A) \\ D[B,x] + rcL(p_2)^2 + rL(p_2)C[B] & \text{if } x \in \mathrm{Leaves}(B), \end{cases}$$

where D[A,x] and D[B,x] are the delays to the leaf node x from Head(A) and Head(B), respectively, and C[A] and C[B] are the sum of all capacitances in c-nets A and B, respectively.

Proof: From Eq. (2.1), the delay from Head(C) to a leaf node $x \in \mathrm{Leaves}(A)$ of a c-net $C = \langle p_1, p_2, A, B \rangle$ is

$$D[C,x] = \sum_{l \in C} R_{xl} C_l. \tag{4.2}$$

Path $p_1$ contributes a shared resistance and shunt capacitance to all nodes in A. Factoring out the terms dependent on $p_1$, Eq. (4.2) becomes

$$D[C,x] = \sum_{k \in A} \bigl( rL(p_1) + R_{xk} \bigr) C_k + rcL(p_1)^2$$

where $rL(p_1)$ and $cL(p_1)$ are the resistance and capacitance of path $p_1$, respectively. Substituting

$$D[A,x] = \sum_{k \in A} R_{xk} C_k \quad \text{and} \quad C[A] = \sum_{k \in A} C_k$$

into the above expression leads to:

$$D[C,x] = D[A,x] + rL(p_1)C[A] + rcL(p_1)^2.$$

The case where the leaf node $x \in \mathrm{Leaves}(B)$ can be proved similarly.

Definition 4: A delay-balanced c-net N is defined to be one in which the delay from the head node Head(N) to all vertices $x \in \mathrm{Leaves}(N)$ is the same.

To create a new delay-balanced c-net, C, from two delay-balanced c-nets A and B, the new head node must be properly selected so that the path lengths, $L(p_1)$ and $L(p_2)$, can compensate for the differences in the net-delays, D[A] and D[B]. This is accomplished by Steps 8.1 and 8.2 as shown in the following theorem.

Theorem 1: If c-nets A and B are delay-balanced, then a c-net C created by Steps 8.1 and 8.2 from the algorithm is also delay-balanced.

Proof: C-net $C = \langle p_1, p_2, A, B \rangle$ is delay-balanced if and only if

$$\text{for all } x \in \mathrm{Leaves}(A) \text{ and for all } y \in \mathrm{Leaves}(B): \quad D[C,x] = D[C,y]. \tag{4.3}$$

Since A and B are delay-balanced, it follows from Lemma 1 that

$$\text{for all } x \in \mathrm{Leaves}(A): \quad D[C,x] = D[A] + rcL(p_1)^2 + rL(p_1)C[A]$$

and

$$\text{for all } y \in \mathrm{Leaves}(B): \quad D[C,y] = D[B] + rcL(p_2)^2 + rL(p_2)C[B].$$

Substituting the above expressions into Eq. (4.3) gives

$$D[A] + rcL(p_1)^2 + rL(p_1)C[A] = D[B] + rcL(p_2)^2 + rL(p_2)C[B]. \tag{4.4}$$

Solving for $L(p_1)$ by substituting $L(p) - L(p_1)$ for $L(p_2)$ we get

$$L(p_1) = \frac{D[B] - D[A] + rcL(p)^2 + rL(p)C[B]}{2rcL(p) + rC[A] + rC[B]}. \tag{4.5}$$

Hence Step 8.2 of algorithm Croute.

Similarly, substituting $L(p) - L(p_2)$ for $L(p_1)$ in Eq. (4.4) and solving for $L(p_2)$ we get

$$L(p_2) = \frac{D[A] - D[B] + rcL(p)^2 + rL(p)C[A]}{2rcL(p) + rC[A] + rC[B]}. \tag{4.6}$$

Since $L(p_1) \ge 0$ and $L(p_2) \ge 0$, Eqs. (4.5) and (4.6) together imply that

$$L(p) \ge \begin{cases} \dfrac{-rC[A] + \sqrt{(rC[A])^2 + 4rc\,\bigl|D[A] - D[B]\bigr|}}{2rc} & \text{if } D[A] \le D[B] \\[2ex] \dfrac{-rC[B] + \sqrt{(rC[B])^2 + 4rc\,\bigl|D[A] - D[B]\bigr|}}{2rc} & \text{if } D[A] > D[B]. \end{cases}$$

Hence Step 8.1 of algorithm Croute.

5 Practical Considerations

Implementation of the algorithm presented in Section 3 requires consideration of a number of additional issues. The following subsections address some of these issues.

5.1 Fanout and Placement of Clock Buffers

The maximum number of active elements that can be driven by a single buffer is known as the buffer's fanout. The clock distribution network must incorporate buffers since the number of modules that will typically make up a circuit will exceed the fanout of a single buffer. At each level i in algorithm Croute, the set of all c-nets to be combined is partitioned into small subsets of cardinality less than or equal to the fanout of a buffer. Algorithm Croute is then applied to each independent subset to form a single c-net to which a buffer is assigned. Such a buffer assignment scheme minimizes the total number of buffers required in the clock distribution network. Furthermore, by dividing the c-nets into subsets, the problem can be parallelized by combining the c-nets in each subset independently.

Placement of assigned buffers is handled in the following manner. The vertex that is the head node of the c-net indicates the general location in the floorplan where the buffer is to be placed. It is assumed that any buffer is small relative to the size of the channel and can be included as part of the routing without consuming excessive area. The new head node of the now buffered c-net would then correspond to the input of this buffer, with the net-delay increased to include the delay of the buffer. The net-capacitance of the c-net would be assigned a value equal to the input gate capacitance of the buffer.

5.2 Pairing of C-nets for Combination

In Steps 6 and 7 of algorithm Croute, two c-nets are selected from the set of eligible c-nets for combination into a new c-net. As mentioned in Step 7, the determination of which two c-nets to select is important in order to minimize the overall delay of the clock distribution network. This objective can be realized if the head nodes of the two selected c-nets are required to be physically close to each other in the circuit floorplan.

In the current implementation of the algorithm, this objective is accomplished by first ordering the c-nets into a single list. The order of the list is based on the geometric distance of c-nets from a reference vertex. At any given level i, the set of c-nets $\Omega_i$ is ordered as follows.
The farthest c-net from the reference vertex is placed first in the list. Its nearest neighbor is then selected and placed next in the list. The next farthest c-net and its nearest neighbor are then placed in the list as third and fourth. The process is repeated until all c-nets have been included in the ordered list. After all c-nets have been included, the c-nets are combined in pairs starting from the top of the list.

The reference vertex may be any vertex in the floorplan graph. However, each vertex will produce a different ordering that will result in a different network having different delay and skew values. In our algorithm, all vertices in the periphery of the floorplan are tried as reference vertices and the one producing a network with the least skew is selected as the solution. As shown later in Section 6, in spite of this repeated use of the algorithm for each reference vertex, the total runtime for determining the clock distribution tree is fairly small.

5.3 Path and Vertex Selection

Algorithm Croute as presented in Section 3 is based on the assumption that paths of arbitrary lengths exist within a given floorplan graph. Moreover, in Step 8.2, when selecting a point to become the head node of a new c-net, it is assumed that any point in the interconnecting path may be used. Under these assumptions, perfectly balanced c-nets are produced and a resulting network would have virtually no skew. However, a typical realistic routing network would branch almost exclusively at channel intersections and avoid branches within a channel. These channel intersections are defined by the vertices in the floorplan graph. This consideration constrains all path endpoints to correspond to defined vertices. In addition, selected head nodes must correspond to those vertices defining the interconnecting path.

These restrictions cause unbalanced c-nets to be created. Unbalanced c-nets result when a vertex does not exist at the desired location of a new head node as determined in Step 8.2 of the algorithm. An ideal solution is obtained by selecting that vertex in the path which falls closest to the desired position.

Step 8.1 of algorithm Croute specifies that a single path be chosen for the interconnect between the two c-net head nodes to be combined.
However, as discussed above, the selectedpath will most likely not have a vertex at the position where the head node should be placed.Rather than accepting the closest vertex to that position in the single path, a number ofpaths may be examined which may result in a more optimal �t. The current implementationallows the �rst k shortest paths to be selected for some �xed k [10]. Each path is checkedto determine which provides a vertex closest to the optimal position if that path were usedas the interconnect. The path which provides the closest �t is selected for use. By thismethod, every c{net being constructed will be balanced as nearly as possible with respectto delay. 19
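
The selection rule described above can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the implementation evaluated in this paper. The names `closest_vertex` and `select_path_and_head` are hypothetical. Positions are measured as distance along each path rather than by the delay balance of Step 8.2, and each candidate path is assumed to arrive with its own precomputed ideal head-node position. Enumeration of the k shortest paths [10] is assumed to be done elsewhere.

```python
from typing import List, Tuple

# Hypothetical representation: a candidate path is a list of
# (vertex_name, distance_from_first_head_node) pairs.
Path = List[Tuple[str, float]]


def closest_vertex(path: Path, target: float) -> Tuple[str, float]:
    """Return (vertex, error) for the vertex on `path` nearest to `target`."""
    vertex, pos = min(path, key=lambda vp: abs(vp[1] - target))
    return vertex, abs(pos - target)


def select_path_and_head(candidates: List[Tuple[Path, float]]) -> Tuple[int, str, float]:
    """Pick the candidate whose best vertex lies closest to its ideal position.

    `candidates` holds (path, target) pairs, where `target` is the ideal
    (possibly mid-channel) head-node position assumed for that path.
    Returns (path_index, vertex, error).
    """
    best = None
    for i, (path, target) in enumerate(candidates):
        vertex, err = closest_vertex(path, target)
        if best is None or err < best[2]:
            best = (i, vertex, err)
    return best


# Two illustrative candidate paths (made-up distances, in um).
candidates = [
    ([("a", 0.0), ("b", 300.0), ("c", 700.0)], 450.0),
    ([("a", 0.0), ("d", 200.0), ("e", 480.0), ("c", 760.0)], 470.0),
]
print(select_path_and_head(candidates))  # (1, 'e', 10.0)
```

In this made-up example the shortest path offers no vertex within 150 um of the ideal position, whereas the second path offers vertex `e` only 10 um away, so the second path would be preferred even though it is longer.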

6 Numerical Results

Algorithm Croute was implemented in C on a Sun SPARCstation-1 with 16 Mbytes of memory running SunOS 4.1. A floorplan with module and channel locations identified is presented as input to the algorithm. The placement information is converted into a floorplan graph from which a clock distribution network is constructed. Eq. (2.1) is then applied to the completed network to determine the delays from the entry point to all module clock inputs. Network path information as well as delay and skew data are output from the program.

In this section, the values of resistance per unit length r, capacitance per unit length c, and input gate capacitance C_g were derived from a 1.2 µm (λ = 0.6 µm) CMOS process reported by MOSIS Parametric Test Results for a Hewlett Packard run dated December 17, 1989. In particular, the values used are

$$r = 34.4 \times 10^{-3}\ \Omega/\mu\mathrm{m}, \qquad c = 0.065\ \mathrm{fF}/\mu\mathrm{m}, \qquad C_g = 7.5\ \mathrm{fF}$$

where the width of a conduit is taken to be 3λ.

The algorithm was applied to a number of examples varying in size and number of modules. The first example is the floorplan graph in Fig. 3. Using a value of four as the fanout of a single buffer, a solution was generated for each vertex on the periphery of the floorplan graph. The solution giving the least skew is shown in Fig. 8. Delay and skew data are not presented for this solution because the dimensions of the floorplan are so small that such data are meaningless.

Since there are eight modules in the layout of Fig. 8 and the fanout of a single buffer is four, a minimum of two buffers is required as part of the clock distribution network. The head nodes of the c-nets that the buffers are to drive indicate the general area in the floorplan where the buffers are to be located. The required buffers and their placement are shown in Fig. 8.

The next set of examples consists of five randomly generated circuit layouts. A sixth example is also provided, which is derived from a real integrated circuit designed for an experimental distributed system.
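
To give a feel for the magnitudes implied by the process values quoted above, the following Python sketch estimates the delay of a single unbranched wire and reproduces the buffer count for the eight-module example. It is only a sketch under stated assumptions. Eq. (2.1) is not reproduced in this section, so the code assumes the common first-moment (Elmore [5]) estimate for a uniform wire driving one gate input, which may differ from the delay model actually used. The function names `wire_delay` and `min_buffers` and the chosen wire lengths are hypothetical.

```python
import math

# Process values quoted in Section 6 (1.2 um CMOS, conduit width 3*lambda).
R_PER_UM = 34.4e-3    # wire resistance, ohm per um
C_PER_UM = 0.065e-15  # wire capacitance, farad per um
C_GATE = 7.5e-15      # input gate capacitance, farad


def wire_delay(length_um: float, load_f: float = C_GATE) -> float:
    """Delay in seconds of a uniform wire driving a lumped capacitive load.

    Assumed model: R_wire * (C_wire / 2 + C_load), the usual first-moment
    estimate for a distributed RC line. This is an assumption of the sketch
    and is not taken from Eq. (2.1).
    """
    r_wire = R_PER_UM * length_um
    c_wire = C_PER_UM * length_um
    return r_wire * (c_wire / 2.0 + load_f)


def min_buffers(n_modules: int, fanout: int) -> int:
    """Lower bound on the buffers needed so every module input is driven."""
    return math.ceil(n_modules / fanout)


for length in (500, 1000, 4000):  # illustrative lengths in um
    print(f"{length:5d} um: {wire_delay(length) * 1e12:.2f} ps")

print(min_buffers(8, 4))  # eight modules, fanout of four
```

Under this assumed model a 1000 µm wire contributes roughly 1.4 ps and a 4000 µm wire roughly 19 ps, and the last line gives the minimum of two buffers stated for the layout of Fig. 8. A full network has many branches and loads, so its delay is not simply the single-wire figure.
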
The sixth example is the routing controller of the Hexagonal Architecture for Real-Time Systems (HARTS) [4], which is being built at the Real-Time Computing Laboratory, The University of Michigan, Ann Arbor. The routing controller is a custom VLSI chip designed for use as an intelligent front-end interface to implement the low-level message routing algorithms in HARTS.

Figure 8: Routing solution for circuit layout of Fig. 2. (X denotes a clock signal input; the modules are labeled M1 through M8 and the clock entry point is marked.)

Figure 9: HARTS Routing Controller.

A floorplan of the modules in one implementation of the controller is shown in Fig. 9. The modules of the floorplan include such items as two 64 × 16 bit static RAMs, an 8 × 8 bit FIFO, a micro-sequencer, controlling PLAs, an ALU, and various registers and additional logic. Dimensions of the circuit are 4311 µm × 4381 µm.

These six layouts are of different sizes and have different numbers of modules. Results were produced for two cases. In the first case, buffer fanout was neglected, while in the second case, the buffer fanout was assumed to be eight.

The results for the first case are given in Table 1. The runtimes shown indicate the time required to generate solutions for all reference points (peripheral vertices), which for these examples numbered from 19 to 30. The maximum delay seen in the network and the worst-case skew are given for the solution with the smallest worst-case skew.

Name   Dimensions (µm)   Number of Modules   Maximum Delay   Maximum Skew   CPU Time
ex1    515 × 700         18                  6.1 ps          0.40 ps        2.51 s
ex2    600 × 600         25                  8.8 ps          0.47 ps        4.17 s
ex3    900 × 600         32                  15.1 ps         0.47 ps        8.10 s
ex4    900 × 1200        36                  30.4 ps         1.1 ps         33.94 s
ex5    1000 × 1200       26                  26.8 ps         0.58 ps        5.3 s
RC¹    4311 × 4381       38                  412 ps          16.1 ps        21.9 s
¹ Routing Controller

Table 1: Results for infinite buffer fanout.

Although the first five sample layouts used here are smaller than today's integrated circuits, the skews shown are small relative to the delay seen in the network. As expected, the delay seen in the clock distribution network increases with the size of the floorplan.

The first five examples utilized a single interconnecting path search as illustrated in Step 8.1, rather than a multiple-path search as discussed in Section 5.3. However, a solution for the routing controller was generated utilizing a two-path search. The delay in the network increased from the 412 ps for the single-path search to 457 ps, while the skew was reduced from 16.2 ps to 8.1 ps. The runtime resulting from the two-path search, however, increased from 21.9 s to 3,037 s.

The results for the case when buffer fanout is eight are given in Table 2. All six layouts include a single buffer delay from their clock entry points to their modules. The delay for a given buffer is taken to be a constant value given as D. Since D is at least equal to, and more likely greater than, the delays given in Table 2, the ratios of skew to the maximum delay in clock distribution will be quite small. Also note that the skews in Table 2 are smaller than the skews in Table 1.

Name   Dimensions (µm)   Number of Modules   Maximum Delay   Maximum Skew   CPU Time
ex1    515 × 700         18                  D + 1.08 ps     0.24 ps        2.18 s
ex2    600 × 600         25                  D + 1.74 ps     0.28 ps        3.68 s
ex3    900 × 600         32                  D + 3.18 ps     0.37 ps        8.23 s
ex4    900 × 1200        36                  D + 4.6 ps      0.76 ps        7.71 s
ex5    1000 × 1200       26                  D + 5.0 ps      0.47 ps        4.87 s
RC¹    4311 × 4381       38                  D + 82.1 ps     8.1 ps         12.31 s
¹ Routing Controller

Table 2: Results for a buffer fanout of eight.

7 Conclusion

This paper presents a clock routing scheme that accommodates a variety of realistic layout concerns. As compared to other work in this area, a more accurate delay model, which represents the physical tree structure of a distribution network, is utilized. Buffer fanout and flexibility in the placement of those buffers are also taken into consideration. Additionally, the physical areas reserved for placement of signal routes in a circuit layout are observed and adhered to.

The proposed algorithm was implemented and solutions were generated for several example layouts. In all cases, the skews in the resulting clock distribution network were a small fraction of the minimum clock period. Since the proposed scheme does not require any particular placement or module size, it can be applied to any general VLSI circuit.

References

[1] T. Chao, Y.-C. Hsu, and J.-M. Ho, "Zero skew clock net routing," in Proc. Design Automation Conference, pp. 518–523, June 1992.

[2] W. Dai and E. S. Kuh, "Simultaneous floor planning and global routing for hierarchical building-block layout," IEEE Transactions on Computer-Aided Design, vol. CAD-6, no. 5, pp. 828–837, September 1987.

[3] W. Dai, M. Sato, and E. S. Kuh, "A dynamic and efficient representation of building block layout," in 24th ACM/IEEE Design Automation Conference, 1987.

[4] J. W. Dolter, P. Ramanathan, and K. G. Shin, "A microprogrammable VLSI routing controller for HARTS," in Proceedings IEEE International Conference on Computer Design: VLSI in Computers and Processors, pp. 160–163, 1989.

[5] W. C. Elmore, "The transient response of damped linear networks with particular regard to wide-band amplifiers," Journal of Applied Physics, vol. 19, no. 1, pp. 55–63, January 1948.

[6] E. G. Friedman and J. H. Mulligan, "Estimates of delay and skew in clock distribution networks," submitted to IEEE Journal of Solid-State Circuits.

[7] E. G. Friedman and S. Powell, "Design and analysis of a hierarchical clock distribution system for synchronous standard cell/macrocell VLSI," IEEE Journal of Solid-State Circuits, vol. SC-21, no. 2, pp. 240–246, April 1986.

[8] M. A. B. Jackson, A. Srinivasan, and E. S. Kuh, "Clock routing for high performance ICs," in 27th ACM/IEEE Design Automation Conference, 1990.

[9] P. Ramanathan and K. G. Shin, "A clock distribution scheme for non-symmetric VLSI circuits," in IEEE International Conference on Computer-Aided Design (ICCAD-89), pp. 398–401, 1989.

[10] E. M. Reingold, J. Nievergelt, and N. Deo, Combinatorial Algorithms: Theory and Practice, chapter 8, Prentice-Hall, 1977.

[11] J. Rubinstein, P. Penfield, and M. A. Horowitz, "Signal delay in RC tree networks," IEEE Transactions on Computer-Aided Design, vol. CAD-2, no. 3, pp. 202–211, July 1983.

[12] M. Shoji, "Elimination of process-dependent clock skew in CMOS VLSI," IEEE Journal of Solid-State Circuits, vol. SC-21, no. 5, pp. 869–880, October 1986.

[13] R. Tsay, "Exact zero skew," in Proc. ICCAD-91, pp. 336–339, November 1991.

[14] D. F. Wann and M. A. Franklin, "Asynchronous and clocked control structures for VLSI based interconnect networks," IEEE Transactions on Computers, vol. C-32, no. 5, pp. 284–293, March 1983.
