
CMOS differential logic family with self-timing and charge-recycling for high-speed and low-power VLSI


B.-S. Kong, J.-D. Im, Y.-C. Kim, S.-J. Jang and Y.-H. Jun

Abstract: The paper describes a differential CMOS logic family employing self-timing for speed enhancement and charge recycling for power reduction. The logic family is up to 49% faster than other types of dynamic circuits. A pseudo one-phase clocking pipeline configuration implemented with the proposed logic family can boost clock frequency by eliminating latching stages between pipeline sections. A 64-bit adder designed using the proposed logic family achieves 0.97 ns latency with power dissipation comparable to that of the conventional precharged differential logic family.

1 Introduction

As portable systems are attaining widespread use, low-power and high-speed operations have become major design concerns [1]. For implementing such systems, proper selection of logic style is very important, as the power and the speed are significantly affected by logic implementation [2-4]. This requirement has spawned many interesting circuit techniques. Charge-recycling differential logic (CRDL) [5] was proposed to improve power efficiency by reusing previously used charge. However, the circuit requires the design overhead of using p-channel devices with higher threshold voltage for maximum power efficiency. Half-rail differential logic (HRDL) [6] overcomes this drawback and uses only devices with nominal threshold, but at the expense of performance in terms of power and speed.

This paper describes a differential CMOS logic family called asynchronous sense differential logic (ASDL) [7]. The logic family utilises charge recycling to improve power efficiency, but without the aforementioned shortcomings. It also employs a high-speed sense amplifier activated in a self-timed manner to accelerate logic operations. Pairs of enable signals are used for this purpose, whose propagation delays are properly matched to meet the timing requirements dictated by the sense circuit. The following Sections describe the key characteristics of the ASDL technique.

2 Circuit description

© IEE, 2003
IEE Proceedings online no. 20030?71
doi: 10.1049/ip-cds:20030?71
Paper first received 2nd April 2001 and in revised form 13th May 2002
B.-S. Kong is with the School of Electronics, Telecommunications and Computer Engineering, Hankuk Aviation University, 200-1, Hwajun-dong, Deokyang-gu, Goyang, Kyunggi-do, 412-791, Korea
J.-D. Im, Y.-C. Kim, S.-J. Jang and Y.-H. Jun are with the Samsung Electronics Co., Ltd., San #24, Nongseo-ri, Kiheung-eup, Yongin, Kyunggi-do, 449-711, Korea
IEE Proc.-Circuits Devices Syst., Vol. 150, No. 1, February 2003

2.1 Circuit structure and operation

The circuit structure of ASDL is shown in Fig. 1. It consists of three components: the differential logic network, the enable inverters and the sense/equalise circuit. The differential logic network implements the logic function by configuring nMOS transistors in a differential form. The enable inverters, I1 and I2, generate the complementary enable outputs, E_o and E_ob, from the complementary enable inputs, E_i and E_ib, to control the operation of the sense/equalise circuit. A cross-coupled circuit consisting of M2 to M5 with equalisation transistor M1 constitutes the sense/equalise circuit.

When E_i and E_ib have low and high logic values, respectively, the circuit is in the equalise phase. During this phase, the sense circuit is disabled, and the output nodes are connected to each other by the equalise transistor, producing output voltages in between the supplies. When the logic states of E_i and E_ib are changed to high and low logic values, the circuit enters the evaluate phase. During this phase, the output nodes are separated from each other, and the differential logic network is enabled, pulling down one of the output nodes. After the enable inverter delay, the enable outputs, E_o and E_ob, change their values, and the sense circuit is activated, accelerating the respective output transitions.

Multiple ASDL gates can be cascaded to form a function block with multiple operating stages, as shown in Fig. 2a. The enable outputs at each stage are connected to the enable inputs of the next stage. With this configuration, the operating phases of each operating stage change progressively as the enable signals propagate along the stages, as indicated by the timing diagram in Fig. 2b.
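The progressive phase change along cascaded stages can be illustrated with a small timing sketch. It is a behavioural model only, assuming that the enable outputs of each stage lag its enable inputs by exactly one enable-inverter delay; the function name `stage_schedule` and the delay values are hypothetical and are not taken from the paper.

```python
from itertools import accumulate


def stage_schedule(enable_rise_time, inverter_delays):
    """Return (evaluate_start, sense_activation) times for cascaded stages.

    Assumption: stage k enters the evaluate phase when its enable input
    rises, and its sense circuit is activated one enable-inverter delay
    later, when its own enable outputs switch. Those enable outputs are
    the enable inputs of stage k + 1.
    """
    # Times at which the enable signal reaches each stage boundary.
    boundary_times = list(accumulate(inverter_delays, initial=enable_rise_time))
    evaluate_starts = boundary_times[:-1]
    sense_activations = boundary_times[1:]
    return list(zip(evaluate_starts, sense_activations))


# Hypothetical enable-inverter delays in nanoseconds for three stages.
schedule = stage_schedule(0.0, [0.10, 0.12, 0.11])
for k, (t_eval, t_sense) in enumerate(schedule, start=1):
    print(f"stage {k}: evaluate at {t_eval:.2f} ns, sense at {t_sense:.2f} ns")
```

In this model the sense activation time of one stage coincides with the start of evaluation in the next, which is the self-timed hand-off that the timing diagram describes.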

Fig. 2 Cascading asynchronous sense differential logic
a Block diagram
b Timing diagram

As compared to the differential cascode voltage switch (DCVS) logic [8], the proposed logic family improves power efficiency by reusing previously used charge. It also improves speed by providing additional low-impedance current paths using the sense circuit.

As compared to CRDL and HRDL, the circuit has some other important advantages. As mentioned earlier, the most critical disadvantage of CRDL is the design overhead of increasing the threshold voltage of p-channel devices. Although HRDL overcomes this drawback by employing stacked pull-up devices and complementary enable signals, the use of a larger number of devices and the speed-limiting effect related to the operation of the circuit lead to degraded performance. The device count overhead of HRDL comes from the use of two cross-coupled circuits, where non-negligible parasitic components of these circuits cause slow transitions of each node with increased power. Careful examination of HRDL operation reveals that a part of the charge flowing into the rising output is transferred into the rising enable signal, E_o, and the charge flowing out of the falling enable signal, E_ob, is transferred into the falling output. The charge transfers between these nodes prevent fast transitions of the outputs, leading to overall speed degradation. Slow evaluation of enable outputs, in turn, may cause additional loss of performance at the following stage due to unnecessarily delayed activation of the sense circuit.

On the contrary, in an ASDL gate, the circuits for accelerating the output transitions and for evaluating the enable outputs are effectively merged into one optimal cross-coupled circuit with a pair of inverters, reducing device count. Example implementations of an inverting stage with different logic families, depicted in Fig. 3, highlight the simplicity of the sense/equalise circuit of ASDL as compared to that of HRDL. Moreover, the enable outputs are not evaluated by the charges from the logic outputs but by those from the supply rails. This allows faster evaluation of the logic and enable outputs. Faster evaluation of enable outputs can also help optimise the speed of the subsequent stage by allowing earlier activation of the sense circuit in that stage. These effects of ASDL technology are combined to deliver speed and power performance superior to those of the conventional circuits.

Fig. 3 Inverter implementation with different logic families (CRDL, HRDL and ASDL)

2.2 Pseudo one-phase clocking pipeline configuration

The block diagram of the pseudo one-phase clocking pipeline configuration using the proposed logic family is shown in Fig. 4. The noninverting global clock CK drives the enable input E_i of the first operating stage in the CK-section, and the inverting global clock CKB drives the same input in the CKB-section. The remaining stages in each pipeline section receive local enable outputs as their enable inputs. Hence, the global clocks dictate the phase transitions of the first stage in each pipeline section, and the remaining stages change operating phases one after another as the local enable signals propagate along the stages. The noninverting enable output E_o of the last operating stage in each pipeline section is connected to an additional enable input of the first operating stage in the following pipeline section.

In each pipeline section, all the operating stages except for the first one are designed using the circuit shown in Fig. 1. The circuit diagram of the first operating stage is depicted in Fig. 4. Two n-channel transistors, M7 and M8, driven by the additional enable input, are inserted between the output nodes and the differential logic tree. An important timing condition for this signal is that it must always have transitions no later than the enable inputs. Namely, it becomes low by the time E_i transitions low, and becomes high before E_i transitions high.

Fig. 5 depicts the timing diagram for the operation of the pipeline configuration. Since the enable outputs of each operating stage are the delayed versions of the enable inputs to the stage, E_CK out of the CK-section is the delayed version of the global clock coming into the section. The amount of delay is thus equal to the latency of the pipeline section, as is indicated by t_op in the figure. Now, since the phase transition of the CKB-section is directed by the inverting global clock, there always exists an overlapping period between the evaluate phase of the last operating stage of the CK-section and that of the first operating stage of the CKB-section. The length of this overlapping period is again equal to the latency of the leading pipeline section. If the width of the overlapping period is enough for the following stage to have sufficient time for evaluating its inputs, the data from the leading pipeline section can be safely handed over to the following section.

When the leading stage enters the equalise phase, the inputs to the following stage become equalised and have intermediate logic values. Then, the additional enable input is immediately pulled down and allows the outputs of the following stage to be separated from the differential logic tree, causing no harm to the evaluated output values. Until the stage enters the equalise phase, these logic values are preserved by the sense circuit. Therefore, the pseudo one-phase pipeline configuration constructed with the ASDL logic family requires no additional latching stages between pipeline sections, whereas in conventional schemes latches are used to sample and hold the result before it is lost to precharge. Thus, the pipeline overhead due to these latches is totally eliminated, and the cycle time is the sum of the delays through all logically useful gates in the critical path.
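The claim that the evaluate-phase overlap equals the latency of the leading section can be checked with a short interval calculation. This is a sketch under simplifying assumptions that are not in the paper: a 50% duty cycle, no clock skew, a section latency shorter than half the clock period, and the evaluate window of the last CK-section stage modelled as the clock delayed by the section latency. The function names and the numbers are hypothetical.

```python
def interval_overlap(a, b):
    """Length of the overlap between two intervals given as (start, end)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def handover_overlap(clock_period, section_latency):
    """Evaluate-phase overlap at a CK-section to CKB-section boundary.

    Assumptions: 50% duty cycle, no clock skew, CK rises at t = 0, and
    the last CK-section stage is enabled by CK delayed by the section
    latency.
    """
    half = clock_period / 2.0
    # Evaluate window of the last stage of the CK-section (delayed CK high).
    last_ck_stage = (section_latency, half + section_latency)
    # Evaluate window of the first stage of the CKB-section (CKB high).
    first_ckb_stage = (half, clock_period)
    return interval_overlap(last_ck_stage, first_ckb_stage)


# Hypothetical numbers: 2 ns clock period, 0.6 ns section latency.
print(handover_overlap(2.0, 0.6))
```

For a latency below half the clock period the result equals the latency itself, up to floating-point rounding; with zero latency the two evaluate windows only touch and the overlap vanishes.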

3 Design consideration

Fig. 4 Pseudo one-phase clocking pipeline configuration (block diagram with CK-section and CKB-section; first operating stage with differential cascode logic tree)

Fig. 5 Timing diagram of pipeline configuration
t_op: latency of pipeline section

An important design issue of ASDL is related to the operation of the sense circuit. This is because the speed and the reliability of the logic family depend on the sensing ability. For selecting an optimum activation time of the sense circuit, we must first consider the sensitivity, which is defined as the minimum allowable signal voltage difference that can be detected correctly. The sensitivity value is affected by several factors such as difference in load capacitances, mismatches in the threshold voltages and the gain factors of MOS devices, and the amount of current drawn by the logic tree during sensing operation. Load difference and mismatches in threshold and gain factors tend to degrade the sensitivity, while the current through the logic tree helps improve it [8]. Since a smaller sensitivity value is desirable for high speed, we must minimise all these mismatch factors. Unavoidable mismatches can also be counterbalanced by changing the amount of current through the logic tree such that the sensitivity can be made to be zero or even have a negative value.

Once the sensitivity value of the sense circuit is found, the optimum activation time of the sense circuit can be determined by considering the sensing speed and reliability. The sense circuit must be activated at the earliest possible time as long as reliable sensing is maintained. To find this timing, we define the sensing voltage difference, ΔV_sensing, meaning the minimum output voltage difference for reliable sensing, to be the sum of the sensitivity value and some voltage margin. The amount of voltage margin can be estimated by simulation considering the worst-case process and environment conditions. Then, ΔV_sensing can be represented as

$$\Delta V_{sensing} = S + \Delta V_{margin} \qquad (1)$$

where ΔV_sensing is the sensing voltage difference, S is the sensitivity, and ΔV_margin is the voltage margin.
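Equation (1) can be turned into a rough estimate of the earliest safe activation time of the sense circuit. The sketch below is illustrative only: the linear growth of the output voltage difference at a constant `slope` is an assumption that does not come from the paper, and the function names and example values are hypothetical.

```python
def sensing_voltage_difference(sensitivity, margin):
    """Equation (1): sensing voltage difference = sensitivity + margin (volts)."""
    return sensitivity + margin


def earliest_activation_time(sensitivity, margin, slope):
    """Earliest sense-activation time after the start of evaluation.

    Assumption (not from the paper): the output voltage difference grows
    linearly at `slope` volts per nanosecond once evaluation starts.
    """
    if slope <= 0:
        raise ValueError("slope must be positive")
    required = sensing_voltage_difference(sensitivity, margin)
    # A zero or negative requirement means sensing is reliable immediately.
    return max(0.0, required / slope)


# Hypothetical values: 50 mV sensitivity, 100 mV margin, 1.5 V/ns slope.
print(earliest_activation_time(0.05, 0.10, 1.5))
# A sufficiently negative sensitivity removes the wait altogether.
print(earliest_activation_time(-0.20, 0.10, 1.0))
```

The second call reflects the remark that the sensitivity can be made zero or negative by adjusting the logic tree current, in which case the required voltage difference is no longer a constraint.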

With this definition, reliable sensing operation can be obtained by ensuring that the sense circuit is activated with the output voltage difference larger than ΔV_sensing. The next step is to select enable signals for activating the sense circuit with the right timing. We have multiple candidates for this since we have multiple inputs to an operating stage and each input has its own complementary enable signals. A pair of enable signals can span multiple operating stages if they have sufficient timing margin, i.e. the output voltage difference at the start of sensing is larger than ΔV_sensing. If the timing margin is not sufficient, the timing of the enable signals can be adjusted by changing the transistor size of the enable inverters or inserting additional delay elements along the enable signal paths. Reduction of the sensitivity value by increasing the transistor size of the logic tree can also be helpful.

If the inputs have a race among themselves, enable signals associated with some later inputs can be used. But, if the input race condition is too severe, this may not provide reliable operation because some earlier inputs can be equalised while the operating stage is still in the evaluate phase, which may harm the evaluated logic values of the outputs. This situation is similar to that occurring between pipeline sections described in the preceding Section. Thus, the circuit shown in Fig. 4 can be used with its additional enable input (the gate signal of M7 and M8) being connected to the noninverting enable output associated with the earliest input. Then, the outputs, after being evaluated, are disconnected from the differential logic tree as soon as the earliest input becomes equalised, with their logic values being preserved by the sense circuit until the stage enters the equalise phase.

Another important design issue is the timing constraint related to the pipeline configuration for safe data handover. Let us assume that a pipeline section composed of N operating stages has the latency of t_op, and the clock skew and the hold time are t_skew and t_hold, respectively. Then, at each pipeline boundary, the overlapping period between evaluation phases of consecutive pipeline sections is represented as

$$t_{overlap} = t_{op} - t_{skew} \qquad (2)$$

This overlapping period must be large enough for the first stage of the following pipeline section to be able to complete its evaluation, yielding the condition of

$$t_{overlap} \geq t_{op}/N - t_{\Delta V_{sensing}} + t_{hold} \qquad (3)$$

Here, t_ΔVsensing is defined as the time period required for the first operating stage of the following section to develop the output voltage difference of ΔV_sensing, and each operating stage in a pipeline section is assumed to have identical latency of t_op/N. Combining (2) and (3), we have

$$t_{op} - t_{skew} \geq t_{op}/N - t_{\Delta V_{sensing}} + t_{hold} \qquad (4)$$

This condition is easy to meet for ordinary implementations because t_op is sufficiently larger than t_op/N for practical values of N.
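The handover condition can be checked numerically for a candidate design. The following is a minimal sketch that assumes the reading of conditions (2) to (4) in which the overlap is the section latency minus the clock skew and the requirement is the per-stage latency minus the sensing development time plus the hold time; the function name `handover_check` and all numeric values are hypothetical.

```python
def handover_check(t_op, n_stages, t_skew, t_sensing, t_hold):
    """Evaluate conditions (2)-(4) for safe data handover.

    Returns (overlap, required, is_safe); all times share one unit.
    """
    if n_stages < 1:
        raise ValueError("n_stages must be at least 1")
    overlap = t_op - t_skew                          # equation (2)
    required = t_op / n_stages - t_sensing + t_hold  # right-hand side of (3)
    return overlap, required, overlap >= required    # condition (4)


# Hypothetical values in nanoseconds, chosen only for illustration.
overlap, required, ok = handover_check(
    t_op=1.0, n_stages=4, t_skew=0.1, t_sensing=0.05, t_hold=0.05
)
print(f"overlap = {overlap:.2f} ns, required = {required:.2f} ns, safe = {ok}")
```

Because the requirement scales with the per-stage latency while the overlap scales with the latency of the whole section, the margin widens as the number of stages per section grows, which is why the condition is described as easy to meet.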

4 Performance evaluation

To assess the performance of the proposed logic family, XOR/XNOR gates with various fan-in numbers are designed and simulated with a 0.35 µm triple-metal CMOS process technology having a gate oxide thickness of 10 nm and a threshold voltage of 0.35 V. Simulation conditions are typical process corners at room temperature with a supply voltage of 2.5 V, with output loads consisting of a fanout-three load and an additional capacitive load of 80 fF. All the logic families are designed to have the same power consumption for each fan-in number. Fig. 6 plots the simulation results for these logic circuits in terms of propagation delay and power consumption. As shown in the figure, ASDL is 28-49% faster than DCVS, and 15-25% and 11-18% faster than CRDL (with no V_t increase) and HRDL, respectively.

Fig. 6 Simulation comparison: propagation delay and power consumption against fan-in number
Process parameter: typical; supply voltage: 2.5 V; temperature: 25°C; load condition: a fanout-three load plus 80 fF capacitive load

A high-speed 64-bit adder was designed to demonstrate practical application of ASDL technology. The adder employed the carry selection scheme for both the local and the block carry generations. For local carry propagation, a Manchester carry chain was used. The block diagram of the adder is depicted in Fig. 7. All the blocks except PROP were designed using the proposed circuit. The PROP block was designed using conventional DCVS with no output buffers. Fig. 8a shows a detailed schematic diagram of an 8-bit Manchester carry chain. For high-speed carry propagation, the sensitivity value of the sense circuit in the carry chain cell was made negative such that a pair of complementary enable signals can drive up to four consecutive chain stages. The worst-case carry-chain path, i.e. propagating carry values from the lowest to the highest bit positions, consumes 0.47 ns, as illustrated in Fig. 8b.

At the physical design level, each function block is laid out as symmetrically as possible to minimise the effect of possible mismatches of MOS devices on the sensitivity. Interconnect wires are routed with the same trace length and width for complementary signals to match loading capacitances. Propagation delay and crosstalk-related mismatches are minimised by avoiding routing signal lines in parallel in close proximity and by minimising the number of crossings of a signal line over other lines. The adder uses 9118 transistors including periphery circuits, and the geometry occupies 1.63 × 0.46 mm². The simulated addition time and power consumption of the adder, broken down into components, are illustrated in Fig. 9 and in Table 1, respectively. As shown in the figure, the ASDL adder achieves 0.97 ns addition time with a power consumption of ~300 mW. The addition time of the conventional design with DCVS is ~1.92 ns for the same power consumption.

Fig. 7 Block diagram of 64-bit adder

Fig. 9 Comparison of 64-bit adders: addition time
Process parameter: typical; supply voltage: 2.5 V; temperature: 25°C; clock frequency: 500 MHz

Table 1: Comparison of 64-bit adders: power consumption

Power          PROP, mW   CARRY, mW   ECG, mW   SUM, mW   SMUX, mW   Total, mW
Conventional   36.0       113.4       11.3      88.1      51.2       300.0
New            28.5       137.6       10.9      71.8      51.5       300.3

Process parameter: typical; supply voltage: 2.5 V; temperature: 25°C; clock frequency: 500 MHz
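The component figures in Table 1 and the addition times quoted in the text can be cross-checked with a few lines of arithmetic. The numbers below are copied from the table and the text; the dictionary and function names are hypothetical, and the latency comparison treats the approximate 1.92 ns figure as exact.

```python
# Power breakdown from Table 1, in mW (PROP, CARRY, ECG, SUM, SMUX).
POWER_MW = {
    "conventional": [36.0, 113.4, 11.3, 88.1, 51.2],
    "new": [28.5, 137.6, 10.9, 71.8, 51.5],
}
# Simulated addition times quoted in the text, in ns.
ADDITION_TIME_NS = {"conventional": 1.92, "new": 0.97}


def total_power(design):
    """Sum the per-block power figures for one design, rounded to 0.1 mW."""
    return round(sum(POWER_MW[design]), 1)


def latency_reduction_percent():
    """Reduction in addition time of the new design relative to the conventional one."""
    old, new = ADDITION_TIME_NS["conventional"], ADDITION_TIME_NS["new"]
    return 100.0 * (old - new) / old


print(total_power("conventional"), total_power("new"))
print(f"{latency_reduction_percent():.1f}% shorter addition time")
```

The per-block sums reproduce the Total column (300.0 mW and 300.3 mW), and the addition time is reduced by roughly 49% at essentially equal power.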

5 Conclusion

In this paper, a differential CMOS logic family with self-timing and charge recycling was described. A pseudo one-phase clocking pipeline configuration using the proposed circuit with no additional intermediate latching stages was also introduced. For guaranteeing reliable operation of the circuit, some important design issues were considered. The discussion reveals that, because of delay-matched self-timing operation, the logic family is more suitable for regular-structured logic implementations than for random logic implementations. If the logic functions are random, the timing of each logic gate must be individually optimised, and thus the design process would be time-consuming because timing relationships tend to be too complex to handle. For regular-structured implementations, timing relationships are uniform and easy to handle, and thus the speed advantage can be fully exploited.

6 Acknowledgment

This work was supported by LG Semicon (currently Hynix Semiconductor) and the IC Design Education Center (IDEC).

References

1 CHANDRAKASAN, A.P., SHENG, S., and BRODERSEN, R.W.: 'Low-power CMOS digital design', IEEE J. Solid-State Circuits, 1992, 27, (4), pp. 473-483

2 LAW, C.F., ROFAIL, S.S., and YEO, K.S.: 'Low-power circuit implementation for partial-product addition using pass-transistor logic', IEE Proc., Circuits Devices Syst., 1999, 146, (3), pp. 124-129

3 LAW, C.F., ROFAIL, S.S., and YEO, K.S.: 'A low-power 16 × 16 parallel multiplier utilizing pass-transistor logic', IEEE J. Solid-State Circuits, 1999, 34, (10), pp. 1395-1399

4 MISHRA, S.M., ROFAIL, S.S., and SENG, Y.K.: '[...] positions: impact on the performance and power di[...] latches and flip-flops', IEE Proc., Circuits Devices Syst., 1999, 146, (5), pp. 279-284

5 KONG, B.S., CHOI, J.S., LEE, S.J., and LEE, K.: 'Charge-recycling differential logic for low-power application', IEEE J. Solid-State Circuits, 1996, 31, (9), pp. 1267-1276

6 CHOE, S.Y., RIGBY, G.A., and HELLESTRAND, G.R.: 'Half-rail differential logic'. IEEE International Solid-State Circuits Conference (ISSCC), Dig. Tech. Papers, San Francisco, USA, February 1997, pp. 420-421

7 KONG, B.S., IM, J.D., KIM, Y.C., JANG, S.J., and JUN, Y.H.: 'Asynchronous sense differential logic'. IEEE International Solid-State Circuits Conference (ISSCC), Dig. Tech. Papers, San Francisco, USA, February 1999, pp. [...]

8 HELLER, L.G., GRIFFIN, W.R., DAVIS, J.W., and THOMA, N.G.: 'Cascode voltage switch logic: a differential CMOS logic family'. IEEE International Solid-State Circuits Conference (ISSCC), Dig. Tech. Papers, San Francisco, USA, February 1984, pp. 16-17