Low Power Multiplier


Low Power Multiplier Design Using Delayed Evaluation

Gerald E. Sobelman and Donovan L. Raatz

Abstract - A circuit design technique for very low power parallel multipliers is presented. The design uses dynamic CMOS circuits together with a self-timed evaluate signal in such a way that each carry-save or carry-propagate adder within the array evaluates only after all of its inputs have stabilized. This technique avoids the spurious switching of internal nodes so that the average power dissipation is minimized. Circuit simulation results are presented which illustrate the power dissipation characteristics of the multiplier.

I. INTRODUCTION

A multiplier is one of the key hardware blocks in most digital signal processing (DSP) systems. Typical DSP applications where a multiplier plays an important role include digital filtering, digital communications and spectral analysis. Many current DSP applications are targeted at portable, battery-operated systems, so that power dissipation becomes one of the primary design constraints. Since multipliers are rather complex circuits and typically must operate at a high system clock rate, reducing the power dissipation of a multiplier is an essential part of satisfying the overall power budget.

G. Sobelman is with the Department of Electrical Engineering, University of Minnesota, Minneapolis, MN 55455. D. Raatz was with the University of Minnesota when this work was performed and is presently with Motorola, Inc., Austin, TX. This work was supported by a grant from the House Ear Institute.

Power dissipation in CMOS circuits is primarily due to the charging and discharging of capacitive nodes throughout the circuit, and is characterized by the equation:

$$P = C V^2 f$$

where C is the effective capacitance that is charging and discharging, V is the power supply voltage and f is the switching frequency [1]. The precise value of the effective capacitance is difficult to determine analytically because it depends on the particular set of input data and prior state of the circuit. However, it is easy to see that power can be reduced by minimizing the capacitance of circuit nodes wherever possible and by reducing the number of switching events that occur. In addition, it is clear that the power supply voltage has a large impact on the magnitude of the power dissipation due to the squared term in the above equation. Therefore, one should try to operate at the lowest possible power supply voltage that is consistent with the given speed constraint of the application.
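To make the quadratic dependence on supply voltage concrete, the following minimal sketch evaluates P = C V^2 f at two supply voltages. The capacitance and frequency values are hypothetical placeholders chosen only for illustration; they are not figures from this paper.

```python
def dynamic_power(c_eff_farads, vdd_volts, freq_hz):
    """Average switching power P = C * V^2 * f, in watts."""
    return c_eff_farads * vdd_volts ** 2 * freq_hz


# Hypothetical operating point (assumed for illustration only).
C_EFF = 50e-12   # effective switched capacitance, farads
FREQ = 1e6       # switching frequency, hertz

p_5v = dynamic_power(C_EFF, 5.0, FREQ)
p_1v2 = dynamic_power(C_EFF, 1.2, FREQ)

print(f"P at 5.0 V: {p_5v * 1e6:.1f} uW")
print(f"P at 1.2 V: {p_1v2 * 1e6:.1f} uW")
print(f"Reduction factor: {p_5v / p_1v2:.2f}")
```

With C and f held fixed, the reduction factor depends only on the voltage ratio, (5.0/1.2)^2, which is roughly 17.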

In this paper, we describe a dynamic CMOS array-type multiplier that has very low power dissipation. A self-timed evaluate signal is generated in such a way that each row of adders in the multiplier evaluates only after all of its inputs have stabilized to their final values. In this way, a large number of intermediate transitions that would normally occur in a static CMOS implementation are avoided, thereby saving a correspondingly large fraction of the power consumed by the circuit.

0-7803-2570-2/95 $4.00 © 1995 IEEE 1564

II. MULTIPLIER ARCHITECTURE

We consider the design of parallel multipliers in which all bits of both operands are presented in parallel to the multiplier and where the product is available within a single clock cycle. The operands are each assumed to be N-bit twos complement numbers and the product from the multiplier is to be in the form of a 2N-bit twos complement number. There are many alternative architectures that can be used, including both tree-type and array-type designs [2]. Here, we will focus on a Booth encoded array-type multiplier of the type shown in Figure 1 [3]. For concreteness, we will use the example of a 12-bit by 12-bit multiplier in this paper, but the basic ideas to be introduced can be applied to any size of the wordlength N. The modified Booth encoding of the Y operand reduces the number of partial products by a factor of 2, so that in the present case we have a total of 6 partial products which must be summed. This is accomplished using a series of 5 carry-save adders (CSAs) followed by a final carry-propagate adder (CPA). Each CSA is composed of a parallel set of 12 full adders, although some of these can be replaced by half adders in those cases where only two inputs at a given bit position must be summed. Each CSA performs a 3-to-2 compression of 3 input operands into a sum vector and a carry vector. The CPA is a 24-bit two-operand adder that computes the final 24-bit output vector corresponding to the product. A simple but relatively slow implementation of the CPA is a ripple-carry structure, but various types of carry lookahead structures may be used if a shorter latency is required. Note that we make use of the so-called "sign-generate" method to minimize the number of sign-extension bits that are required in each partial product [3].
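The 3-to-2 compression performed by a CSA can be sketched behaviorally as below. This is only an arithmetic model on non-negative Python integers, not the authors' circuit; the function name `carry_save_add` is an illustrative assumption.

```python
from typing import Tuple


def carry_save_add(a: int, b: int, c: int) -> Tuple[int, int]:
    """Compress three non-negative operands into a sum vector and a carry vector.

    Each bit position acts as an independent full adder:
    the sum bit is the 3-way XOR and the carry bit is the majority function,
    which is weighted one position higher (hence the left shift).
    """
    sum_vec = a ^ b ^ c
    carry_vec = ((a & b) | (a & c) | (b & c)) << 1
    return sum_vec, carry_vec


# The two output vectors still have to be added by a carry-propagate adder.
s, cy = carry_save_add(0b1011, 0b0110, 0b1101)
print(s + cy == 0b1011 + 0b0110 + 0b1101)
```

No carry ripples between bit positions inside the CSA, which is why only the final CPA needs a carry chain.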

In addition to the adder array itself, there are three other types of blocks that are used in the architecture, namely the Booth decoders, the select boxes and the add-one generators. The Booth decoders implement the well known "modified Booth" algorithm in which 3 adjacent multiplier bits are mapped into the signed-digit set {-2, -1, 0, +1, +2}. The select boxes (13 per row) use the Booth-encoded multiplier digits to form the corresponding partial product, i.e. -2X, -X, 0, X or 2X. The multiply by 2 is implemented as a simple left-shift of the bits of X, and the negation is implemented via bit complementation and adding 1 to the least-significant bit position. The add-one generators form the 13th bit position in each row, and are used as the additional +1 term needed to form the 2's complement of the X operand.
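The modified Booth recoding described above can be illustrated with a short behavioral sketch. It uses the standard radix-4 rule (digit = y[2k-1] + y[2k] - 2*y[2k+1], with y[-1] = 0); the function names and the integer-based formulation are assumptions made for illustration and say nothing about the gate-level decoder.

```python
def booth_radix4_digits(y, n_bits=12):
    """Recode an n_bits twos complement multiplier into n_bits/2 digits
    from the set {-2, -1, 0, +1, +2}, least-significant digit first."""
    assert n_bits % 2 == 0
    assert -(1 << (n_bits - 1)) <= y < (1 << (n_bits - 1))
    y_pattern = y & ((1 << n_bits) - 1)          # twos complement bit pattern
    # bits[0] is the implicit y[-1] = 0; bits[i + 1] holds y[i].
    bits = [0] + [(y_pattern >> i) & 1 for i in range(n_bits)]
    digits = []
    for i in range(0, n_bits, 2):
        y_prev, y_lo, y_hi = bits[i], bits[i + 1], bits[i + 2]
        digits.append(y_prev + y_lo - 2 * y_hi)
    return digits


def booth_multiply(x, y, n_bits=12):
    """Sum the partial products digit * X * 4^k (the select-box outputs)."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, n_bits)))


print(booth_radix4_digits(7))        # six digits for a 12-bit multiplier
print(booth_multiply(-1234, 987) == -1234 * 987)
```

For a 12-bit multiplier this yields 6 digits, matching the 6 partial products mentioned in the text.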

that occurs over millisecond time-scales.) The probability that a precharged node is discharged during a given evaluate phase depends on the series/parallel structure of the pull-down network and the probabilistic values of the input logic levels. For example, assuming equally likely 0 or 1 values at all three inputs to a full adder, there is a 50% probability that the "sum" output node will discharge and a 50% probability that the "carry out" output node will discharge. Therefore, each output node only requires precharge current on half of the clock cycles. Finally, one must consider the charging and discharging of the clock node, which is connected to the gates of the precharge and evaluate transistors in the circuit. This will add to the power dissipation, but its effect may be reduced by using minimum size devices. With proper design, the positive aspects (reduced parasitic capacitance and the elimination of spurious switching events) will outweigh the negative aspects (charging and discharging of the clock node), resulting in a very power efficient dynamic CMOS implementation.
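The 50% figures quoted for the full adder can be checked by enumerating all eight input combinations. The sketch below assumes independent, equally likely inputs and counts how often each logic function evaluates to 1. Whether a 1 or a 0 at the logic output corresponds to a discharged dynamic node depends on the circuit style, but since each function is 1 for exactly half of the combinations, the probability is 50% either way.

```python
from itertools import product


def full_adder_one_probabilities():
    """Return (P[sum = 1], P[carry = 1]) for uniformly random inputs."""
    sum_ones = 0
    carry_ones = 0
    for a, b, cin in product((0, 1), repeat=3):
        sum_ones += a ^ b ^ cin                       # 3-way XOR
        carry_ones += (a & b) | (a & cin) | (b & cin)  # majority
    return sum_ones / 8, carry_ones / 8


p_sum, p_carry = full_adder_one_probabilities()
print(p_sum, p_carry)
```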

Therefore, let us consider a dynamic multiplier design in which the Booth decoders, select boxes, add-one generators, CSAs and the CPA are all implemented using dynamic logic. Since this is a multi-level logic circuit, we need a way of cascading the levels of logic in a reliable fashion. Typically, Domino CMOS [4] can be used to cascade several logic levels, but there is a fundamental problem with that approach in this design. A Domino circuit can only implement functions which are positive in all of their input variables. In the present case, the sum output of the full adder in each bit position of a CSA is not a positive function of the true-valued inputs, as it is a 3-way XOR function. One could create both true and complemented versions of each adder input and then create a Domino circuit in that way, but this would have to be repeated at every level. In other words, we would need both true and complemented sum and carry outputs as well. This type of scheme could be implemented using the dynamic CVS circuit design methodology [5]. Therefore, although the dynamic CVS approach would allow for a single [...] at each level of logic, it was not adopted in our design because of the presence of the additional switching nodes in the circuit.
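The restriction mentioned above, that a Domino stage can only realize functions that are positive (monotonically non-decreasing) in every input, can be tested exhaustively for small functions. The sketch below is a generic truth-table check; the helper name `is_positive_in_all_inputs` is an assumption made for illustration.

```python
from itertools import product


def is_positive_in_all_inputs(func, n_inputs):
    """True if raising any single input from 0 to 1 never lowers the output."""
    for bits in product((0, 1), repeat=n_inputs):
        for i in range(n_inputs):
            if bits[i] == 0:
                raised = bits[:i] + (1,) + bits[i + 1:]
                if func(*bits) > func(*raised):
                    return False
    return True


def full_adder_sum(a, b, c):
    return a ^ b ^ c


def full_adder_carry(a, b, c):
    return (a & b) | (a & c) | (b & c)


print(is_positive_in_all_inputs(full_adder_sum, 3))    # XOR is not positive
print(is_positive_in_all_inputs(full_adder_carry, 3))  # majority is positive
```

The carry (majority) function passes the check while the sum (XOR) function fails it, which is the obstacle to a plain Domino implementation of the CSA rows.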



IV. DELAYED EVALUATION TECHNIQUE

The fully dynamic circuit implementation is seen to offer the possibility of significantly reduced power dissipation. However, a practical timing problem must be overcome in order for the method to be applied: One must ensure that the sequence of evaluations within the array occurs in the proper order. We make use of a self-timed precharge/evaluate signal that delays the evaluation of a CSA or a CPA until all of its inputs have become stable [6].

As shown in Figure 2, we have used a simple delay line composed of a tapped inverter string in order to create the delayed clock signals for each adder row in the array. There are a total of 6 such delayed clock signals (one each for the 5 carry-save adders and 1 for the carry-propagate adder). We refer to the primary clock phase which controls the Booth decoder and select logic as $\phi$ and we refer to the 6 delayed clock lines as $\phi_1$ through $\phi_6$, respectively. The propagation delay through a string of four inverters was found to provide adequate timing margin for the inputs to a given unit to become stable. Alternatively, it would also be possible to construct a delay line using tracking cells [7], which are replicas of the actual critical path delay circuitry. Such an approach would be slightly more complex in its implementation but would provide a somewhat more robust tracking of the circuit delay under any given environmental operating conditions.
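The timing requirement on the tapped delay line can be sketched with a simple behavioral model. It assumes that each of the 6 taps is separated from the previous clock phase by four inverters, and it uses hypothetical inverter and unit delays; none of these numbers come from the paper's simulations.

```python
def tap_times(n_taps=6, inverters_per_tap=4, inverter_delay_ns=0.5):
    """Arrival time of each delayed clock relative to the primary clock."""
    return [k * inverters_per_tap * inverter_delay_ns for k in range(1, n_taps + 1)]


def inputs_stable_before_evaluate(taps, unit_delays_ns):
    """Check that every unit's inputs settle before its own clock arrives.

    unit_delays_ns[0] is the Booth decoder/select logic, clocked at t = 0;
    unit_delays_ns[k] is the unit clocked by tap k (k = 1..len(taps)).
    """
    start_times = [0.0] + list(taps)
    for k in range(1, len(start_times)):
        if start_times[k - 1] + unit_delays_ns[k - 1] > start_times[k]:
            return False
    return True


taps = tap_times()
print(taps)
# Hypothetical evaluation delays: decoder/select, 5 CSAs, CPA.
print(inputs_stable_before_evaluate(taps, [1.5] * 7))
```

A tracking-cell delay line would replace the fixed `inverter_delay_ns` with a delay that follows the actual critical path, which is the robustness advantage noted in the text.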

Another benefit of the delayed evaluation technique relates to smoothing of the instantaneous power dissipation characteristics. In a standard dynamic CMOS design, the precharge signal causes all precharge nodes to change at the same time, which produces a large current spike. (Of course, nodes that were not discharged during the previous evaluate phase are already at the precharge level and therefore do not contribute to the power dissipation.) This large current spike can lead to a significant voltage drop at the power supply node and may also be unsustainable within the limitations of a battery-powered system. In contrast to this, the delayed evaluation method automatically produces a delayed precharge effect. The same delay line that separates the evaluation of each CSA and the CPA from the others also provides a corresponding time-separation of the precharging events for each unit.


V. PERFORMANCE RESULTS

While the delayed-evaluation technique can be applied for 3 volt or 5 volt system operation, our primary interest is in single-cell battery-powered systems with a typical operating supply voltage of 1.2 volts. In order to obtain reasonable switching speeds at this voltage level, it is necessary to use a low-threshold CMOS process where the magnitudes of the NMOS and PMOS transistor threshold voltages are on the order of 0.5 volt. In our circuit simulations, we have used the model parameters of a generic 2µ CMOS process but with the magnitude of the VTO parameters set to 0.5 volt. The default device size is 4µ/2µ, and the default source/drain area is 20 µ².

HSpice simulations of the multiplier showed it to work correctly for all input patterns tested. The simulation was done using a clock frequency of 1 MHz and a power supply voltage of 1.2 volts. The average power dissipation over one clock cycle was found to be approximately 100 µW. Figure 3 shows the instantaneous power dissipation for a typical multiplication sequence. The first group of peaks occurs during the evaluation phase and the second group of peaks occurs during the precharge phase. Within the evaluation group of peaks, one can make the following identifications: The first very narrow peak is due to the gate capacitance of the precharge and evaluate transistors in the Booth decoders, select boxes and add-one generators as the primary clock signal $\phi$ goes from low to high. The next, broader peak is due to the evaluation of those same blocks. There is a series of 5 smaller peaks that represent the sequential evaluation of the 5 carry-save adders, followed by a final broad peak from the carry-propagate adder. Beginning at 500 nanoseconds, a similar set of peaks is visible (although not as distinctly separated), which demonstrates the action of the delayed precharging effect.
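As a rough consistency check, the reported operating point can be related back to P = C V^2 f. The sketch below takes the average power as 100 µW at 1 MHz and 1.2 volts and derives the energy per clock cycle and the implied effective switched capacitance. These derived quantities are back-of-the-envelope estimates, not values stated by the authors.

```python
def energy_per_cycle(p_avg_watts, freq_hz):
    """Average energy drawn per clock cycle, in joules."""
    return p_avg_watts / freq_hz


def effective_capacitance(p_avg_watts, vdd_volts, freq_hz):
    """Solve P = C * V^2 * f for C, in farads."""
    return p_avg_watts / (vdd_volts ** 2 * freq_hz)


P_AVG = 100e-6   # average power, watts (100 uW, as read from the text)
VDD = 1.2        # supply voltage, volts
FREQ = 1e6       # clock frequency, hertz

e_cycle = energy_per_cycle(P_AVG, FREQ)
c_eff = effective_capacitance(P_AVG, VDD, FREQ)

print(f"Energy per multiplication: {e_cycle * 1e12:.0f} pJ")
print(f"Implied effective capacitance: {c_eff * 1e12:.1f} pF")
```

Since one product is produced per clock cycle, the energy per cycle is also the energy per multiplication under these assumptions.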

VI. CONCLUSIONS

We have shown that the delayed evaluation technique based on the self-timed precharge/evaluate timing chain can achieve extremely low power dissipation in the context of a parallel multiplier design. The delay line provides a time-separation for the sequence of precharge and evaluate events, thereby also smoothing the fluctuations in the instantaneous current drawn from the power supply. Simulation results confirm the presence of this sequence of events, and show that the average power dissipation is only 100 µW.

While this work has focused on the design of a parallel multiplier, the delayed evaluation technique may also find application in the implementation of other low-power, multi-level logic functions.

REFERENCES

[1] A. Chandrakasan, S. Sheng and R. Brodersen, Low-Power CMOS Digital Design, IEEE Journal of Solid-State Circuits, Vol. 27, pp. 473-484 (1992).
[2] M. Santoro and M. Horowitz, SPIM: A Pipelined 64x64-bit Iterative Multiplier, IEEE Journal of Solid-State Circuits, Vol. 24, pp. 487-493 (1989).
[3] M. Annaratone, Digital CMOS Circuit Design, Kluwer (1986).
[4] R. Krambeck, C. Lee and H.-F. Law, High-Speed Compact Circuits with CMOS, IEEE Journal of Solid-State Circuits, Vol. SC-17, pp. 614-619 (1982).
[5] G. Heller et al., Cascode Voltage Switch Logic: A Differential CMOS Logic Family, IEEE International Solid-State Circuits Conference, pp. 16-17 (1984).
[6] D. Raatz and G. Sobelman, U.S. Patent No. 5,333,119.
[7] M. Dean, STRiP: A Self-Timed RISC Processor, Technical Report No. CSL-TR-92-543, Computer Systems Laboratory, Stanford Univ. (1992).

Figure 3. Instantaneous Power Dissipation
