

1997 IEEE International Symposium on Circuits and Systems, June 9-12, 1997, Hong Kong

Use of Selective Precharge for Low-Power Content-Addressable Memories

Charles A. Zukowski, Shao-Yi Wang
Department of Electrical Engineering, Columbia University, New York, N.Y. 10027, USA

Abstract

A general technique to reduce the energy used by individual CMOS logic gates in large fan-in logic arrays is derived and applied to the comparators in a content-addressable memory (CAM), an important application where power dissipation is often large and the technique works particularly well. A small subset of the inputs is removed from the large parallel pulldown switch and used to control the precharge instead, greatly reducing the number of cycles requiring a full charge/discharge sequence in many cases, with only a modest delay penalty. Estimates of the optimal number of bits to remove and of the performance gain as a function of various parameters are provided.

1. Introduction

Content-addressable memories (CAMs [1]) are important because they are becoming necessary to quickly search routing tables and packet queues in real-time communication networks [2], and they are also needed to search growing cache and address translation tables in high-performance (and highly integrated) processors [3]. The design of CAM comparators is particularly challenging, though, because the comparators must all be active for each search (causing a particularly bad power problem), they contain very large fan-in logic gates, and speed is always a primary goal.

Large fan-in logic gates are common in CMOS array structures, such as decoders, encoders, PLAs, ROMs, and comparator arrays. The pulldown transistors in these logic gates can be connected in either series or parallel [4], based on performance considerations, and the desired logic function (AND or OR) can be achieved by choosing the polarities of the inputs and outputs. A single parallel n-channel switch is generally preferred due to its speed and area, and precharging is often used between evaluations to avoid the need for a series pullup switch or a resistive load. Unfortunately, in many applications, such as the CAM comparator array, the parallel switch turns on in almost all cases, drawing energy on almost every cycle. A series switch would be off in most cases, so its use could drastically reduce energy use, but its speed is generally unacceptable.

One common technique that has been used to address the energy problem in some array structures is to power down inactive blocks [5]: precharging is only done in the one block that is active during a particular time period. This approach is generally not possible in a CAM, where the goal is to do a parallel search of an entire array at once, but a similar idea can be applied to each comparator individually, trading a little speed for vastly improved power performance. A small subset of the inputs to a logic gate can be used to do a precalculation, and the results used to decide if the remaining majority of the logic gate needs to be precharged at all. This approach is referred to here as selective precharge, and it can be applied to any large fan-in logic gate. In this paper we investigate how selective precharge can best be applied to CAM design, and how much it might be able to improve performance.

In the second section, the selective precharge approach for large fan-in gates is presented. In the third section, the basic structure and constraints in a CAM are reviewed. The fourth section contains a discussion of how selective precharge can be applied to CAMs, and some estimates of the performance improvements. We show that the power associated with the match lines can be reduced by an order of magnitude in a large CAM by using only a few inputs in the precharge calculation.

2. Selective Precharge

The basic precharged (NOR) gate commonly found in logic arrays is pictured in fig. 1a. Precharging is done with a p-channel pullup transistor (MA), and during evaluation, if one (or more) of the pulldown transistors is on, the large capacitance at the gate output is discharged. A ground switch (MB) can be used to disable the pulldown path during precharge if the inputs (Di) cannot be forced low during that time.


Since only one of the D inputs must be high to cause a transition, one would expect that energy would be needed to precharge most gates during most cycles. In a decoder, only one output does not have a transition on a given cycle. In a PLA with random and independent inputs, a gate implementing a product term with 6 inputs could be expected to switch on 63 out of 64 cycles. Even if the swing is reduced on the output, the frequency of transitions leads to a potential power problem.
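
To make these switching statistics concrete, here is a small illustrative calculation (not taken from the paper) of the expected fraction of cycles on which a precharged gate must be recharged, assuming random and independent inputs: a parallel (NOR-style) pulldown discharges whenever any input is high, while the series (NAND-style) alternative discussed next discharges only when all inputs are high.

```python
# Illustrative sketch: expected discharge (and hence recharge) frequency of a
# precharged gate with n random, independent inputs, each high with p = 0.5.
def discharge_prob_parallel(n: int, p: float = 0.5) -> float:
    """Parallel (NOR-style) pulldown: output falls if ANY input is high."""
    return 1.0 - (1.0 - p) ** n

def discharge_prob_series(n: int, p: float = 0.5) -> float:
    """Series (NAND-style) pulldown: output falls only if ALL inputs are high."""
    return p ** n

if __name__ == "__main__":
    n = 6  # the 6-input PLA product term used as an example in the text
    print(f"parallel: {discharge_prob_parallel(n):.4f}")  # 63/64 ~= 0.9844
    print(f"series:   {discharge_prob_series(n):.4f}")    # 1/64  ~= 0.0156
```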

Figure 1. Conventional large fan-in precharged logic gates.

The alternate precharged NAND gate is pictured in fig. 1b. Power use in this case is much lower because the frequency of transitions is generally much lower (e.g., 1 out of 64 cycles for the 6-input PLA gate with random and independent inputs). The output capacitance is smaller as well, but this is roughly replaced by internal switch capacitance. This is not generally a feasible solution to the power dissipation problem, though, because its delay grows (asymptotically) as the square of the number of inputs. There are also potential problems with charge sharing at the output. The basic problem with a large fan-in logic gate cannot be solved by avoiding precharging either. If a standard static CMOS gate is used, one of the switches must be a slow series one (NAND or NOR). A pseudo-nMOS approach has large d.c. power dissipation.

The basic selective precharge approach is pictured in fig. 2. In this case, a small subset of the inputs {D0 ... Dk-1} is used in the precharge circuit to do a conditional (selective) precharge, while the rest are used in a conventional parallel configuration. This performs the same logic function as the circuits in figure 1 if the output is initially at a low voltage, so provision must be made for a predischarge of the output for the (rare) case where a precharge occurs in the previous cycle that is not followed by a discharge. Depending on the application, this could be done with an additional clock phase and a predischarge transistor (MC), or it could be done by the precharge circuit whenever a precharge does not occur (fig. 3). There are many other straightforward variations of this basic approach once a subset of inputs is isolated for earlier processing. For example, it can be combined with reduced-swing techniques, used with diode (or diode-connected) pulldown switches, and combined with buffering. Fig. 3 shows a variation where the basic cells for the inputs D0-Dk-1 are identical to the rest, connected with a fast parallel switch (at the expense of a small amount of energy savings in the precharge circuit). All of these variations fit well within a regular array structure.

The rough goal of selective precharge is to increase the delay slightly by pre-processing a small number of inputs, but at the same time greatly reduce the number of large logic gate outputs that must switch. For example, in a large NOR gate with random inputs, if two inputs are moved to the precharge circuit, a precharge is only done on 1 out of 4 cycles, roughly reducing the energy use by 75%. The main cost is a slightly slower precharge that cannot occur until at least some of the inputs have arrived. As more inputs are moved, the energy use in the main part of the array continues to drop quickly, but the energy used in the precharge circuit eventually becomes significant, so k should have an optimal value between 0 and n-1.

Figure 2. Large fan-in gate using selective precharge.


Figure 3. Selective precharge using two NOR stages.


The technique could also be applied recursively, i.e., there could be more than one stage of selective precharge, but the large initial gains and the overhead of extra clock phases and buffering would quickly lead to diminishing returns.
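
As a quick numerical check of the precharge-frequency argument above (an illustrative calculation, not results from the paper), moving k inputs into the precharge circuit means the main line is precharged only when all k of those inputs are low, i.e., on roughly 2^-k of the cycles for random, independent inputs; k = 2 therefore removes about 75% of the main-array precharge activity, matching the example in the text.

```python
# Illustrative sketch: fraction of cycles on which the main (n-k)-input line
# must be precharged when k inputs are handled by the precharge circuit,
# assuming random, independent inputs (each high with probability 0.5).
def main_line_precharge_prob(k: int) -> float:
    return 0.5 ** k  # precharge only when all k "early" inputs are low

for k in (0, 1, 2, 4, 7):
    saved = 1.0 - main_line_precharge_prob(k)
    print(f"k={k}: precharge on {main_line_precharge_prob(k):.2%} of cycles, "
          f"~{saved:.0%} of main-array precharge energy avoided")
```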

3. CAM Structure

A general CAM has the basic structure illustrated in fig. 4. Each row of the memory array (register) must be compared to the contents of the (vertical) data bus {B0 ... Bn-1} during a search, producing a number of word match signals. Words are often quite wide, i.e., n is large. Each comparator is generally formed with a large wired-AND (i.e., NOR) gate that extends across an entire row. Since these generally use an approach along the lines of fig. 1a to obtain speed, on any given cycle the large capacitance associated with each match line avoids switching only in the (generally few) rows that do achieve a match. Most of the match lines must be recharged on every cycle. Even if a sense amp is used to reduce the swing, a large number of words in the memory generally leads to significant power dissipation.

Figure 4. Basic structure of a content-addressable memory.

Another source of power dissipation in the CAM array is the switching of the data lines that must extend along each column. On average (assuming random data), during each new search half of these lines must switch state, so one quarter must be charged during every cycle. Furthermore, a significant swing is needed to drive the logic in each comparator cell.

At the edge of the CAM array there is a decoder (used to load the memory) and an encoder, which can both also be constructed with a number of large wired-AND circuits. In some applications, multiple matches are possible and a priority blanking circuit is needed to filter out all but the highest-priority word match signals. Such a filter can be implemented with a lookahead tree structure [6], a ripple-carry approach, or some combination of the two.
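
Before turning to the comparators, a rough worked estimate of the data-line term above may be helpful (a sketch with assumed capacitance and supply values, not figures from the paper): with random data, each line switches with probability one half, and half of those switches are charging events, so about one quarter of the lines draw C*Vdd^2 from the supply on each search.

```python
# Illustrative estimate of data-line energy per search, assuming random data:
# each column line switches with probability 1/2, and half of those switches
# are low-to-high charging events, so ~1/4 of the lines are charged per cycle.
# N_DATA_LINES, C_LINE and VDD are assumed example values, not numbers from
# the paper; arrays using complementary line pairs would double the line count.
N_DATA_LINES = 128   # assumed: one data line per bit of a 128-bit word
C_LINE = 1.0e-12     # assumed column-line capacitance per line, in farads
VDD = 3.3            # assumed supply voltage, in volts

# Energy drawn from the supply per low-to-high transition is C * Vdd^2.
energy_per_search = 0.25 * N_DATA_LINES * C_LINE * VDD ** 2
print(f"expected data-line energy per search ~= {energy_per_search * 1e12:.1f} pJ")
```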

4. Using Selective Precharge in CAMs

In this section we primarily investigate the use of selective precharge in the CAM comparators. An important initial observation is that selective precharge does not increase the worst-case cycle time of a comparator very much. Adding selective precharge, in a manner similar to that pictured in fig. 3, simply partitions the large match line capacitance into two parts, each connected to similar drivers. For both the total precharge time and the total discharge time (whose sum equals the worst-case cycle time), the partition roughly changes the delay from R*C_M to R*C_MA + R*C_MB; if C_M ~= C_MA + C_MB, the total delay is roughly the same. Of course there is some extra delay overhead for the new buffer in the middle, unless the technology has scaled to the point where resistance on the line is significant, in which case the buffer could act as a repeater. In addition, selective precharge precludes the overlap of precharge with the arrival of the data inputs, at least for some bits, increasing the latency.

To find the best choice of k (the number of bits used for selective precharge) in a CAM comparator, and to estimate the energy savings, we designed a word match circuit similar to the one shown in fig. 3 using a 0.35 micron CMOS technology. Parasitics that would arise in a full layout were carefully estimated. Transistor sizing for the buffer and precharge transistors was scaled so that total cycle time was kept constant as k varied. Based on a macromodel for the energy use as a function of k, verified by simulation assuming random and independent inputs, fig. 5 shows how energy per cycle varies with k for n = 128 and two different loadings from the encoder. Fig. 6 indicates the optimal value of k for a range of values of n.
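
A greatly simplified stand-in for such a macromodel (a sketch with assumed per-cell and load capacitances, not the calibrated 0.35 micron model behind figs. 5 and 6) already reproduces the qualitative behavior: the main match-line segment is precharged with probability 2^-k, the small k-bit stage toggles on most cycles, and the total energy has a shallow minimum at a small value of k.

```python
# Simplified energy-per-cycle macromodel for a selectively precharged CAM
# word-match circuit, assuming random and independent inputs. All capacitance
# and voltage values are assumed examples, not extracted parasitics.
VDD = 3.3          # assumed supply voltage (V)
C_CELL = 2.0e-15   # assumed match-line capacitance contributed per bit cell (F)
C_LOAD = 40.0e-15  # assumed fixed load on the main line, e.g. from the encoder (F)
C_PRE = 4.0e-15    # assumed fixed overhead of the precharge stage / buffer (F)

def energy_per_cycle(n: int, k: int) -> float:
    """Expected energy drawn per cycle (J) with k of n bits in the precharge stage."""
    p_main = 0.5 ** k                   # main segment precharged only if the k bits all match
    c_main = (n - k) * C_CELL + C_LOAD  # capacitance of the main match-line segment
    p_pre = 1.0 - 0.5 ** k              # k-bit stage toggles unless all k mismatch inputs stay low
    c_pre = k * C_CELL + (C_PRE if k > 0 else 0.0)
    return VDD ** 2 * (p_main * c_main + p_pre * c_pre)

n = 128
best_k = min(range(n), key=lambda k: energy_per_cycle(n, k))
print(f"n={n}: optimal k ~= {best_k}, "
      f"energy ratio vs k=0: {energy_per_cycle(n, best_k) / energy_per_cycle(n, 0):.2f}")
```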

There are some interesting ways that selective precharge could be applied across the array of comparators. The priority blanking circuit might be able to take advantage of early inputs for the highest-priority match lines, in which case selective precharge might not be used in a priority subset of the array. As a result, some overlap between data arrival and precharge would then be possible, and most of the power savings would still be realized. Alternately, selective precharge allows data-dependent early completion for some of the signals, and a self-timed priority encoder might be able to take advantage of this. Another variation of a CAM is pictured in fig. 7. Here the array is divided into blocks that are searched sequentially, so the energy to send data across low-priority blocks is only expended if necessary. In this case, selective precharge can be done in later blocks, possibly with different values of k (indicated with shading), while earlier blocks are evaluating.


Selective precharge can also be used in the decoder, but here it becomes similar to block power-up in a standard memory. In the encoder, selective precharge is not as useful unless matches usually occur in the few highest-priority locations, due to highly non-independent input statistics. Fortunately, the encoder array is generally much smaller than the CAM comparator array.

Figure 5. Average energy use vs. number of input bits used for selective precharge, k (n = 128, C_l = 4 and 40 fF; selective precharge (s) and conventional approach (c)).

Figure 6. Optimal k vs. total number of input bits, n (n = 8 - 128, C_l = 4 and 40 fF).

Figure 7. Use of varying amounts of selective precharge in a CAM comparator array.

5. Conclusion

The comparator array in a CAM is a particularly attractive application for selective precharge. Without this technique, on any given cycle most of the word match lines will switch, dissipating significant power. By partitioning the word line, and doing a partial calculation first, e.g., separating 7 out of 128 bits in the array if C_l = 4 fF, most of this energy (about 85% in our example) can be saved unless the chosen bits correspond to a data field that happens to usually match. Furthermore, the delay does not have to suffer much to achieve these gains. Various CAM architectures can even allow a small increase in delay for some portions of the array without hurting overall performance. While the selective precharge technique can be applied to any array of large fan-in gates with somewhat random input patterns, the CAM array appears to be a particularly interesting one.

Acknowledgment

This work was supported in part by an IBM Partnership Award and the Columbia University Center for Telecommunications Research. We would also like to thank NeoParadigm Labs for design environment support, and Dr. Ben S. Wu for invaluable discussion.

References

[1] Grosspietsch, K. E., "Associative Processors and Memories: A Survey," IEEE Micro, June 1992, pp. 12-19.
[2] Pei, T. and Zukowski, C., "Putting Routing Tables into Silicon," IEEE Network, Jan. 1992, pp. 42-50.
[3] McAuley, A. J. and Cotton, C. J., "A Self-Testing Reconfigurable CAM," IEEE Journal of Solid-State Circuits, Vol. 26, No. 3, Mar. 1991, pp. 257-261.
[4] Weste, N. and Eshraghian, K., Principles of CMOS VLSI Design: A Systems Perspective, 2nd Ed., Addison-Wesley, USA, 1993.
[5] Amrutur, B. S. and Horowitz, M., "Techniques to Reduce Power in Fast Wide Memories (CMOS SRAMs)," 1994 IEEE Symposium on Low Power Electronics, Digest of Technical Papers, pp. 92-93.
[6] Yamagata, T., et al., "A 288-kb Fully Parallel Content Addressable Memory Using a Stacked-Capacitor Cell Structure," IEEE Journal of Solid-State Circuits, Vol. 27, No. 12, Dec. 1992, pp. 1927-1933.
