Serial Serial Mul

Embed Size (px)

DESCRIPTION

multiplier

Citation preview

  • This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

    A High Bit Rate Serial-Serial Multiplier WithOn-the-Fly Accumulation by Asynchronous Counters

    Manas Ranjan Meher, Student Member, IEEE, Ching Chuen Jong, and Chip-Hong Chang, Senior Member, IEEE

    AbstractA novel approach of designing serial-serial hybridmultiplier is proposed for applications with high data samplingrate ( 4 GHz). The conventional way of partial product formationis revamped. Our proposed technique effectively forms the entirepartial product matrix in just sampling cycles for anmultiplication instead of at least cycles in the conventionalserial-serial multipliers. It achieves a high bit sampling rate byreplacing conventional full adders and 5:3 counters with asyn-chronous 1s counters so that the critical path is limited to only anAND gate and a D flip-flop (DFF). The use of 1s counter to columncompress the partial products preliminarily reduces the heightof the partial product matrix from to

    , resultingin a significant complexity reduction of the resultant adder tree.The proposed hybrid column compressed multiplier consists ofa serial-serial data accumulation unit and a parallel carry saveadder (CSA) array that occupies approximately 35% and 58%less silicon area than the full CSA array multiplier with operandsof wordlength 32 32 and 64 64, respectively. The post-layoutsimulation results based on 90-nm seven metal single poly CMOSprocess technology shows that our 64 64 multiplier dissipates39% less average power at a sampling rate of 4 GHz, and hasonly 11% additional delay penalty to complete a multiplicationcompared to the conventional fully parallel CSA array multiplier.

    Index TermsBinary multiplication, on-chip serial-link bus ar-chitecture, on-the-fly accumulation, parallel multipliers, serial-se-rial and serial-parallel multiplier.

    I. INTRODUCTION

    M ULTIPLIERS are the fundamental and essentialbuilding blocks of VLSI systems. The design and im-plementation approaches of multipliers contribute substantiallyto the area, speed and power consumption of computational in-tensive VLSI systems. Often, the delay of multipliers dominatesthe critical path of these systems and due to issues concerningreliability and portability, power consumption is a criticalcriterion for applications that demand low-power as its primarymetric. While low power and high speed multiplier circuitsare highly demanded, it is not always possible to achieve bothcriteria simultaneously. Therefore, a good multiplier design

    Manuscript received August 15, 2009; revised November 27, 2009, April 11,2010, and July 05, 2010; accepted July 06, 2010. The work of M. R. Meher wassupported by Nanyang Technological University under a Research Scholarship.

    M. R. Meher is with the Integrated Systems Research Laboratory, Schoolof Electrical and Electronic Engineering, Nanyang Technological University,639798 Singapore (e-mail: [email protected]).

    C. C. Jong and C. H. Chang are with the School of Electrical and ElectronicEngineering, Nanyang Technological University, 639798 Singapore (e-mail: [email protected]; [email protected]).

    Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TVLSI.2010.2060374

    requires some tradeoff between speed and power consumption.As the device size shrinks, there are other side effects and itbecomes increasingly difficult to achieve a good tradeoff bydevice scaling or sizing the transistors [1]. A more effectivedriver to optimize the area, delay and power consumption ofarithmetic computations in battery powered VLSI circuits isto explore alternative architectural concepts for the design ofdigital multipliers.

    Typically, hardware implementation of a multiplication oper-ation consists of three stages, specifically the generation of par-tial products (PPs), the reduction of PPs and the final carry-prop-agation addition [2]. The partial products can be generated ei-ther in parallel or serially, depending on the target applicationand the availability of input data. The partial products are gen-erally reduced by carry-save adders (CSAs) using an array ora tree structure. Carry propagation addition is inevitable whenthe number of partial products is reduced to two rows. This finaladder can be a simple ripple carry adder (RCA) for low poweror a carry look-ahead adder (CLA) for high speed [2]. As theheight of PP tree increases linearly with the wordlength of themultiplier, it aggravates the area, delay and power dissipation ofthe two subsequent stages.

    Therefore, it is highly desirable to reduce the number of par-tial products before the CSA stage. This can be achieved byModified Booth algorithm to reduce the height of the PP matrix[3]. Another approach is to use high order column compressorsinstead of full adders (FAs) to increase the PP reduction ratio ofthe CSA stage [4], [5]. The drawback is that Modified Booth en-coder adds both area and delay overheads to the simple partialproduct generation process, and higher order compressors areslower and consume more power than the full adders. Hence ahybrid combination of both techniques is often considered.

    To reduce the wiring cost, it has been a common practiceto transmit data over a communication channel through a highspeed serial link [6]. For some integrated circuits constrainedby I/O pins, the designers often try to reduce the number ofI/O pads by serializing I/O data because I/O pads occupy largesilicon area and consume high power [7]. Therefore, effort hasbeen made to design high speed serial interface [6], [8][12] inorder to facilitate on-chip buffering and parallel processing.

    Parallel multipliers are popular for their high speed operationbut long wordlength multiplications are often constrained by thehardware cost and power consumption of the applications. Inpublic key cryptography like RSA encryption and decryption,integer multiplications of 1024 bits are typical [13]. In EllipticCurve Cryptography (ECC), key lengths of 112 bits and 109bits are commonly used for prime field and binary field multi-

    1063-8210/$26.00 2010 IEEE

  • This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    2 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

    Fig. 1. On-chip communication among parallel functional modules in SoC. (a) Conventional bus structure. (b) Serial-link bus structure [16].

    Fig. 2. Suitability of serial-serial multiplier for upcoming on-chip serial-linkbus architectures in complex SoC.

    plication, respectively. Therefore, low-cost serial multipliers arewidely adopted in hardware cryptography [14], [15].

    Serial multipliers also find applications in system-on-chip(SoC) design. As technology scales, more intellectual propertycores and logic blocks will be integrated in an SoC, resulting inlarger interconnect area and higher power dissipation [17]. Theincrease in integration density of the on-chip modules causes thebuses connecting these modules to become highly congested.To overcome this problem, new techniques have been evolvedrecently to have on-chip data transfer in a high speed serial linkinstead of conventional bus [16][18]. Fig. 1(a) and (b) depictthe conventional on-chip bus and alternative on-chip serial-linkbus structures, respectively. In Fig. 1(b), the serializer at thesource module converts the parallel outputs to a bit streamthat can be transferred in a simple routing network and at thedestination module they are converted back to parallel data bythe deserializer. The on-chip serial-link is capable of transmit-ting data at Gb/s so that a chunk of parallel data is availablewhen the destination module finishes the previous computation.Under the new on-chip communication paradigm for digitalsignal processing, it is desirable to have a low complexity dataprocessing unit as the destination module that is able to performpartial computation on the incoming data stream at high speedwhile the data is being buffered. Fig. 2 illustrates a potentialuse of a serial-serial multiplier as a destination module in a SoCwith serial-link bus architecture. The low complexity precom-putation unit forms part of the serial-serial multiplier and couldperform partial computation on the high speed serial bit stream.

    The unit doubles as a buffer and eliminates the deserializer.As the data has been partially processed and buffered, thecompletion of the multiplication can be done at a lower speedwith a less complex parallel multiplier. The challenge in sucha scheme lies in reducing the critical path delay of the precom-putation unit to that of the deserializer, which usually has bitrate in the order of several Gb/s. We introduce this new schemefor the design of serial-serial multiplier suitable for SoCs withon-chip serial-link bus architecture [16]. The proposed schemecould also be used as an alternative to embedded multipliersin the future field-programmable gate array (FPGA), whereconfigurable logic blocks (CLBs), embedded multipliers andmemory blocks are integrated with serializers/deserializersto facilitate on-chip serial data transfer in order to reduceinterconnect complexity [19].

    The rest of this paper is organized as follows. Section IIrevisits some of the existing serial multiplier architectures inthe literature. In Section III, a serial accumulator developedbased on the new design paradigm is proposed to deal withvery high-speed data sampling rate of above 4 GHz. Theaccumulator employs asynchronous counters1 to perform bitaccumulation at each bit position of the PP matrix, resulting inlow critical path delay and small area, especially for operandswith long wordlength. Asynchronous counter has a low hard-ware complexity but the outputs are not synchronized with theclock which leads to a timing delay before all output bits of thecounter have settled to their final states. The correct output ofthe counter is read after a timing delay to be analyzed from thetiming diagram in Section VI-B. The data dependent counterschange states only when the input bit is 1, which leads tolow switching power dissipation. The height of the PP matrixafter buffering by the asynchronous counters is reduced loga-rithmically to before it is further reduced by theCSA tree. The details of the newly proposed serial-serial mul-tiplier are described in Section IV. The method is extended tosigned multiplication of 2s complement numbers in Section V.Section VI presents the application-specific integrated circuit(ASIC) implementation results and their comparisons withother designs on various performance measures. The energyefficiency of the proposed serial-serial multiplier design isdemonstrated also by the post-layout simulation results of

    1An asynchronous counter is also called a ripple counter, where the clockinput of each flip-flop of the counter is driven by the output of its precedingflip-flop, except for the first flip-flop, which is driven by the clock signal.

  • This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    MEHER et al.: HIGH BIT RATE SERIAL-SERIAL MULTIPLIER 3

    Fig. 3. Bit serial multiplier and its basic bit-cell [23].

    the designs implemented with STM 90-nm CMOS process.Section VII concludes this paper.

    II. REVIEW OF SERIAL MULTIPLIERSSerial multipliers are popular for their low area and power

    [20][35], and are more suitable for bit-serial signal processingapplications with I/O constraints and on-chip serial-link busarchitectures. They are broadly classified into two categories,namely serial-serial and serial-parallel multiplier. In a serial-serial multiplier both the operands are loaded in a bit-serialfashion, reducing the data input pads to two [20][32]. On theother hand, a serial-parallel multiplier loads one operand in abit-serial fashion and the other is always available for paralleloperation [27], [33][35]. Lyon [21] proposed a bit-serial inputoutput multiplier in 1976 which features high throughput at theexpense of truncated output. A full precision bit-serial multi-plier was introduced by Strader et al. for unsigned numbers [22].The rudimentary cell consists of a 5:3 counter and some DFFs.Later, Gnanasekaran [28] extended the work in [22] and devel-oped the first bit-serial multiplier that directly handles the nega-tive weight of the most significant bit (MSB) in 2s complementrepresentation. This method needs only cells for an -bitmultiplication but it introduces an XOR gate in the critical path,which ends up with a more complicated overall design. Lenneet al. [23] designed a bit-serial-serial multiplier that is modularin structure and can operate on both signed and unsigned num-bers. The 1-bit slice of a typical serial-serial multiplier, called abit-cell (BC), is excerpted from [23] and shown in Fig. 3. suchcells are interconnected to produce the output in a bit-serialmanner for an serial-serial multiplier. The operand bitsand are loaded serially in each cycle and added with the farcarry , local carry , and the partial sum in the 5:3counter. The cascaded cells form a shift register chain withof one cell connected to the of the next cell. The addition ofthe symmetric partial product bits (i.e., and ) is seriallyenabled by pulling first bit (FB) signal low after the first cycle.The partial sum is shifted out serially through another shift reg-ister chain formed by the cascaded to connection. Theregisters are cleared by activating last bit (LB) before the start

    Fig. 4. CSAS serial-parallel multipliers [34]. (a) Unsigned CSAS. (b) 2s com-plement CSAS.

    of a new multiplication. Aggoun et al. [30] proposed a new ar-chitecture for serial-serial multiplication with 50% reduction inhardware without degrading the speed. It is observed that all thereported multipliers have a common computational unit knownas the 5:3 counter in the critical path. In addition, there are alsoDFFs and AND gates in the critical path, which lower the opera-tion speed and limit the input bit rate. Many attempts have beenmade to reduce either the hardware cost or latency [22], [23] butthere is no improvement on the critical path. To overcome thisproblem, Pekmestzi et al. [26] designed a systolic serial multi-plier with a critical path comprising an FA, a 2:1 MUX, a DFFand an AND gate. Nibouche et al. [31] proposed two serial-se-rial architectures, Structure I and Structure II, which can handlea sampling frequency close to that of the serial-parallel multi-plier. The critical path of Structure I consists of an FA, a DFF,and an AND gate but it has a latency of cycles for anmultiplication and requires cycles to complete one multipli-cation.

    To reduce the number of computational cycles from toin an serial multiplier, several serial-parallel multi-

    pliers have been developed over the years [27], [33][35]. Mostof them are based on a carry save add shift (CSAS) structure.Fig. 4 shows the unsigned and 2s complement serial-parallelmultiplier based on the CSAS structure. It can be observed thatthe critical path consists of an FA, a DFF, and an AND gate forthe unsigned multiplier in Fig. 4(a) and an extra XOR gate forthe 2s complement multiplier in Fig. 4(b). Gnanasekaran [34]proposed a fast CSAS multiplier capable of producing -bitoutput in clock cycles at the expense of an extra RCA. Alow complexity 2s complement serial-par-allel multiplier was proposed by Sunder et al. [35]. It used theBaughWooley algorithm to avoid the sign extension problem[2]. Saleh et al. [27] designed many serial-parallel systolic andnon-systolic multipliers with low complexity based on theBooths multi-bit recoding. Although the sampling frequencyhas been improved in the serial-parallel multipliers and the totalnumber of computational cycles is halved, one of the operandsneeds to be loaded in parallel. Recently, there has not been any

  • This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    4 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

    new development in serial-serial multiplier design due to thematurity of conventional architectures.

    Parallel multipliers are more popular as the size is less criticaldue to technology scaling. However, with the emerging devel-opment of the on-chip serial-link bus architectures [16], [17],[19], serial-serial multipliers could find their potential roles inthe new generation of SoCs and FPGAs. In the following sec-tions, an approach to the design of serial multiplier that is ca-pable of processing input data at Gb/s without input bufferingand with reduced total number of computational cycles is pro-posed.

    III. PROPOSED SERIAL ACCUMULATORAccumulation is an integral part of serial multiplier design.

    A typical accumulator is simply an adder that successively addsthe current input with the value stored in its internal register.Generally, the adder can be a simple RCA but the speed of accu-mulation is limited by the carry propagation chain. The accumu-lation can be speed up by using a CSA with two registers to storethe intermediate sum and carry vectors, but a more complex fastvector merged adder is needed to add the final outputs of theseregisters. In either case, the basic functional unit is an FA cell.A new approach to serial accumulation of data by using asyn-chronous counters is suggested here which essentially count thenumber of 1s in respective input sequences (columns). A rudi-mentary version of our proposed serial accumulator was firstintroduced in [36].

    An accumulation of integers forcan be mathematically expressed as

    (1)

    For ease of exposition, let be an unsigned integer repre-sented by binary weighted bits. Hence, (1) can be rewrittenas

    (2)

    where is the th bit of the th operand and isassociated with a positional weight . By changing the orderof accumulation of and , (2) becomes

    (3)

    The inner-summation of (3), , represents thesum of all the 1s presented in the th column and it can easilybe accomplished by a simple serial 1s counter of width

    . Thus, an -bit accumulator can be realized withsuch dedicated counters to concurrently accumulate the 1s in

    each bit position. The dependency graph (DG) of such a schemewith and is shown in Fig. 5. The nodes in the DGperform , which is equivalent to an increment ofif . An example illustrating the serial accumulation ofthe inputs, 10110001, 00000001, 11001101, 11100011,01110001, 11001101, and 10001101 in the DG is shown

    Fig. 5. Dependency graph of the proposed accumulator for 7 8-bit operands.

    in Fig. 6. The accumulated output after each cycle is indicatedby the integer in each node and the final accumulated output ineach column can be represented by a 3-bit binary number. Atthe end of the th iteration, each counter produces an -bit bi-nary weighted output. By arranging all counter output bits of thesame positional weight in the same column with the least signif-icant bit (LSB) of each counter output aligned at the th column,the result of the accumulation can be obtained by column com-pressing the counter outputs with a CSA tree structure of height

    . Note that the height of the CSA tree would be a binary expo-nent of should the summation be carried out without the 1scounters. The architecture of the accumulator corresponding toFig. 5 is shown in Fig. 7. The bits of the input operands areserially fed into their corresponding counters from column 0(right-most in Fig. 5) to column 7. These counters execute in-dependently and concurrently. In each cycle of accumulation,a new operand is loaded and the counters corresponding to thecolumns that have a 1 input are incremented. The counters canbe clocked at high frequency and all the operands will be ac-cumulated at the end of the th clock. The final outputs of thecounters need to be further reduced to only two rows of partialproducts by a CSA tree, such as a Wallaces or Daddas tree [2].A carry propagate adder is then used to obtain the final sum.

    In Fig. 7, the counters C are used to count the number of 1s ina column. Each of them is a simple DFF-based ripple counter.The clock is provided to the first DFF and all the other DFFsare triggered by the preceding DFF outputs. A typical 3-bit 1scounter is shown in Fig. 8. The clock input is synchronized withthe input data rate and thus the operands can be accumulatedwith a high frequency defined by the setup time and propagationdelay of a DFF. Moreover, the counters change states only whenthe input is 1, which leads to low switching power. This simple

  • This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    MEHER et al.: HIGH BIT RATE SERIAL-SERIAL MULTIPLIER 5

    Fig. 6. Example for Fig. 5.

    Fig. 7. Architecture of an accumulator with and .

    Fig. 8. Hardware architecture of a 3-bit 1s counter.

    and efficient bit accumulation technique is used to design theproposed serial-serial multiplier.

    IV. PROPOSED SERIAL-SERIAL MULTIPLIER

    This section proposes a new technique of generating the in-dividual row of partial products by considering two serial in-puts, one starting from the LSB and the other from MSB. Usingthis feeding sequence and the proposed counter-based accumu-lation technique presented in Section III, it takes only cycles tocomplete the entire partial product generation and accumulationprocess for an multiplication. The theoretical underpin-ning of this design is elaborated as follows.

    The product of two -bit unsigned binary numbers andcan be expressed as

    (4)

    where and are the th and th bits of and , respectively,with bit 0 being the LSB.

    By reversing the sequence of index of (4)

    (5)

    By decomposing and rearranging (5)

    (6)

    where

    .

    In (6) the partial product row, , can be generated in. If is fed MSB (bit ) first and is fed LSB

    (bit 0) first, then in , is a partial product bitgenerated by the current input bits and , arepartial product bits of the current input bit, and each of thepreceding input bits of , i.e., , for ,and are the partial product bits of the input bit,and each of the preceding input bits of , i.e., for

    . By appropriately sequencing the input bitsof and into a shift register, one PP in each cycle

    can be generated. Consequently, can be obtainedin cycles.

    Fig. 9 illustrates the PP generation of an 8 8 multiplier fortwo unsigned numbers and . Fig. 9(a) shows the conven-tional partial product formation and Fig. 9(b) shows the genera-tion sequence of the PPs according to (6), with generatedin , for . is represented by thecenter column, by the columns to the right of the centercolumn and by the columns to the left.

    The PPs in Fig. 9(b) are generated in such an unconven-tional way in order to facilitate their accumulation on-the-flyby the proposed counter-based accumulation technique. A bankof counters is deployed, one in each column, to accumulate thebits arriving in the respective column. The DG shown in Fig. 10illustrates the complete operation of the PP generation and accu-mulation for a general multiplication. Each node (circle)in the DG represents a binary counter and an ancillary AND gateto generate a partial product bit. All nodes are identical in func-tionality and have three inputs and three outputs, except that thedata are propagated in different directions as depicted in Fig. 10.It can be seen from Fig. 10 that the nodes (L) on the left of themiddle column have the identical properties of shiftingto the immediate left node in , computing thePP bit and updating the counter output. Similarly, the nodes

  • This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    6 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

    Fig. 9. Partial product generation schemes for an 8 8 serial-serial multipli-cation: (a) conventional and (b) proposed.

    Fig. 10. Dependency graph of serial-serial partial product generationand accumulation.

    (R) on the right of the middle column exhibit identical prop-erty but shift instead. The nodes (C) on the middle columnreceive a new pair of input bits, and , in each clockcycle, process the pair to form the PP bit and propagateand to the left and right nodes respectively. At theregistered output of the nodes contains the sum of all 1s inthe respective columns. The complete architecture of the pro-posed counter-based serial-serial multiplier (8 8) is shownin Fig. 11. The functionality of the nodes in Fig. 10 is mappedto the structure of Fig. 11 with two shift registers to performthe left and right shift, counters of different bit widthssufficient to accumulate the number of 1s in their respectivecolumns and an array of AND gates to generate the PP

    Fig. 11. Proposed architecture for 8 8 serial-serial unsigned multiplication.

    bits serially. It can be seen from Fig. 9 that for an 8 8 multi-plication, the middle column has a maximum height of 8 and a4-bit counter is sufficient to account for the maximum columnsum. The column height decreases gradually to either side andthe counter width also decreases correspondingly.

    The operation of the multiplier can be explained with the aidof the structure in Fig. 11. A PP bit corresponding to the middlecolumn of the PP is produced by the center AND gate when a newpair of input bits ( and ) is latched by the two DFFs (topmiddle) in each clock cycle. In the next cycle, and areshifted to the left and right, respectively, to produce the partialproduct bits with another pair of input bits and bythe array of AND gates. The AND gates are gated by toensure that the outputs driving the counters are free of glitches.Each counter changes state at the rising edge of the clock lineonly if a 1 is produced by its driving AND gate. After cy-cles, the counters hold the sums of all the 1s in the respectivecolumns and their outputs are latched to the second stage forsummation. The latched outputs are wired to the correct FAsand HAs (half adders) according to the positional weights ofthe output bits produced by the counters. From Fig. 11, it is ob-served that the column height has been reduced from 8 to 4 andthe final product, , can be obtained with two stages of CSAtree and a final RCA. Similarly, for 16 16, 32 32 and 6464 multipliers the column heights are reduced logarithmicallyfrom 16, 32, and 64 to 5, 6, and 7, respectively. This drasticreduction in column height leads to a much simpler CSA tree,and hence reducing the overall hardware complexity and powerconsumption. The latching register between the counter and theadder stages not only makes it possible to pipeline the serial dataaccumulation and the CSA tree reduction, but also prevents thespurious transitions from propagating into the adder tree. Twoclocks, namely and , are employed to synchro-nize the data flow between the two stages. The counter stageis driven by to process the inputs at high speed as thecritical path is defined by the delay of only an AND gate and a

  • This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    MEHER et al.: HIGH BIT RATE SERIAL-SERIAL MULTIPLIER 7

    Fig. 12. Snapshot of the output sequence of the proposed multiplier (withoutactual timing).

    DFF. The counters and shift registers are reset to 0 in everycycles to allow a new set of operands to be loaded. is de-rived from to drive the latching register. The output se-quence of a 16 16 multiplication is depicted in Fig. 12 withoutthe complication of the actual timing. The detailed performanceand timing analysis will be discussed in Section VI. The -bitproduct of an multiplication is produced in parallel by thefinal RCA and can be interfaced with other logic circuitry eitherin parallel or in serial depending on the requirement.

    V. EXTENSION FOR 2S COMPLEMENT NUMBERSMost digital systems operate on signed numbers commonly

    represented in 2s complement. In this section, the unsigned se-rial-serial multiplication architecture of Fig. 11 is extended todeal with signed multiplication. Using the BaughWooley al-gorithm [2], the multiplication of two signed number and

    represented in 2s complement form is given by

    (7)

    The multiplication of two unsigned numbers by the multiplierof Fig. 11 can be expressed as

    (8)

    The difference between (7) and (8) is given by

    (9)As and , (9) canbe simplified to

    (10)

    To extend the unsigned multiplier architecture of Fig. 11for signed multiplication without introducing a high over-head, the difference expressed in (10) must be simplified.Since , the following summation termsembedded in (10) can be simplified by the closed formexpression for the sum of a geometric progression, i.e.,

    (11)By substituting (11) into (10), the constant termscancel out and the difference, is reduced to

    (12)

    The difference of (12) is added to so that the structure ofFig. 11 can be modified to obtain , the product of the 2scomplement multiplication

    (13)

    In the proposed PP generation method, arrives only in. The generation of in (13) has to be

    delayed until . The remaining termscan be computed during the initial cycles. Hence, the dif-ference can be corrected in the CSA tree. It is trivial from (13)that a -bit shift register, a NAND gate and several FAs are

  • This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    8 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

    Fig. 13. Proposed architecture for 8 8 serial-serial 2s complement multipli-cation.

    required for adding . The architecture of the proposed 2s com-plement serial-serial multiplier is depicted in Fig. 13. It is notedthat the input can be fed at the same speed as the unsigned mul-tiplier. A control input is required to latch in thefirst clock cycle to generate serially incycles. The bits of to be added in the CSA tree are marked inFig. 13. The addition of the term raises the height of CSA treeby only two bits regardless of the wordlength of the operands.

    VI. PERFORMANCE COMPARISON ANDIMPLEMENTATION RESULTS

    In this section, the performances of the proposed unsignedand 2s complement multiplier architectures are evaluated interms of area, speed, power, and other implementation factors.As the circuits being compared contain different types of logicgates and logic modules, the simplistic unit gate model that as-sumes all cells used are primitive two-input logic gate is inade-quate. It is better to preserve the basic modules as described inthe architectures as long as they correspond to the commonlyavailable cells in a typical standard cell library. The area anddelay of the proposed multiplier are derived and expressed interms of the area and delay of the basic logic modules that canbe found in a typical standard cell library for different operatorsizes. These theoretical estimates are then calibrated by the basiclogic modules from STM 90-nm CMOS standard cell libraryand used to benchmark the proposed multiplier design againstother serial-serial multipliers in Section VI-A. The proposed andanother parallel CSA array multipliers are also physically syn-thesized and laid out using the same cell library. The results arepresented and discussed in Section VI-B.

    A. Comparison With Serial-Serial MultipliersIn our proposed technique, PP bits are formed and accu-

    mulated by the 1s counters in just cycles to reduce the heightof the CSA tree logarithmically from to . The se-rial operands are input at a high frequency as the critical pathof the input stage consists of only an AND gate and a DFF. Thecritical path for the 2s complement multiplier remains the same,

    TABLE INORMALIZED AREA AND DELAY OF BASIC LOGIC MODULES

    Consisting of 2 FA and 1 HA

    which is in contrary to most CSAS 2s complement multiplierswhere an extra XOR gate delay is required in the critical path.For the proposed structures, the clock period, , is deter-mined by the delay of a three-input AND gate and aDFF

    (14)

    The latency required to accumulate all the partialproduct bits of an multiplication is cycles of

    (15)

    The counter outputs are latched to the adder stage. For properpipelining, must be times slower than . Thecritical path delay of the second stage is constrained bythe delay of the CSA tree and the final CPA

    (16)

    where and are the delays of an FAand the final CPA, respectively. Since(see Table I), .

    The area complexity of the proposed unsigned and signedmultiplier is derived from its basic logic modules. The 1st stageconsists of DFFs and AND gates only. The height of the partialproduct column is symmetrical around the center column anddecrements from to 1 towards both sides. Thus, the widthof the counter at columns away from the center column is

    , and the total number of DFFs required forthe counters in an multiplier is given by

    (17)

    where is the number of DFFs required by thecounter in the center column . By substitutinginto the power-series of (18) [37]

    (18)

    we have

    (19)

  • This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    MEHER et al.: HIGH BIT RATE SERIAL-SERIAL MULTIPLIER 9

    TABLE IIDELAY COMPARISON OF SERIAL-SERIAL MULTIPLIERS

    In addition, DFFs are also needed for the shift registers atthe top of the structure and DFFs are required to latch thecounter outputs to the CSA tree. Thus, the total number DFFsrequired is

    (20)

    The area complexity of the second stage can be determinedfrom the number of FAs and HAs required for the CSA treeand the CPA. The CPA is assumed to be implemented by thesimplest and lower power dissipation RCA. It should be notedthe pattern of the dot matrix in this stage is different from thatof the conventional multiplier (see Fig. 11). Owing to thepreliminary counter-based reduction, the height of the matrixis reduced logarithmically from to . In thefollowing derivations, the number of HAs and FAs is derivedfrom the number of stages, , that is required to reducethe height to 2. The number of HAs, , required for theproposed multiplier depends on only . is constantfor , . The number ofHAs in the first stage ( for the stage below the latchingregister) is , and for each of the subsequent stages.One HA is also required in the final RCA. Therefore, canbe expressed as

    (21)

    If is a power of two, the number of FAs in each stage of theCSA tree with reduction stages follows the pattern of

    . If is not a power of two, the number of FAs ineach stage is increased by 2 for every increment of with thesame . Hence, extra FAs are needed. Therefore,the total number of FAs, , required by an multipliercan be expressed as

    (22)

    where the first two terms account for the FAs in the CSA treewhen is an exact power of two, and the third term accountsfor the FAs if is deviated from the power of two for the same, and the last term accounts for the FAs in the final RCA. Note

    that the same power series in (18) is also applied here to simplifythe expression.

    In general the area of the proposed serial-serial unsigned mul-tiplier in terms of basic logic modules can be expressed as

    (23)where , , , and are the area of a DFF, FA,HA, and three-input AND gate, respectively.

    For the proposed 2s complement serial-serial multiplier ofFig. 13, additional DFFs, one two-input AND gate, onetwo-input NAND gate, inverters, and FAs are re-quired. Thus, its total area is

    (24)where , , and are the area of an inverter,two-input NAND, and two-input AND gate, respectively.

    To comprehensively evaluate the area-time complexity of ourproposed design against some other serial-serial multipliers, thearea and delay of the logic cells are normalized with a basic in-verter from the same standard cell library. This benchmarkingmethod has been widely adopted in the literature as a fair al-ternative practice when it is impractical to implement all com-petitor circuits for different dimensions [27], [35]. The normal-ized areas and delays of various logic cells are tabulated inTable I using the STM 90-nm standard cell library datasheet[38], where the delay and area of an inverter are 7.2 ps and1.56 m , respectively. As an inverter has a normalized delayof unity, a two-input NAND gate with a normalized delay of 1.57in Table I implies that its actual delay is 11.3 ps.The data for the logic modules which are not available in thelibrary (e.g., 5:3 Counter) is interpolated from their constituentlogic gates.

    Table II compares the critical path delay, cycles requiredand computation time of the proposed design with the existingserial-serial multipliers. In the Area Expression column of

  • This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    10 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

    TABLE IIIAREA COMPARISON OF SERIAL-SERIAL MULTIPLIERS

    Fig. 14. Normalized computation time for unsigned serial-serial multipliers.

    Table II, the critical path delay is expressed in terms of thewordlength and the normalized delays from Table I. Amongall, the proposed counter-based multiplier exhibits the lowestcritical path delay of 21.41. This is achieved by eliminating thecomplex 5:3 counter and replacing it by a simple 1s counter.In addition, all the PP bits can be accumulated in just cyclesin the proposed method while the others need cycles. Thelogarithmically reduced partial product bits are summed ina single cycle of which can be pipelined with theaccumulation stage. The period of is approximatedtimes as that of . Thus, an initial latency of cyclesof is used to produce the first output and thereafter oneoutput is generated in every cycle. All other serial-serialmultipliers are not easily pipelinable due to their tight integra-tion of the PP formation, summation and carry propagation.The critical path delay remains the same for the proposed 2scomplement multiplier while the others have an additional XORgate or MUX in the critical path [23], [28], [35].

    Fig. 14 shows the total computation time (normalized) of theexisting and proposed unsigned serial-serial multipliers for8, 16, 32, 64, 128 in a bar chart. It is evident from Fig. 14 thatthe proposed multiplier is much faster than all the other existingserial-serial multipliers, making it suitable for on-the-fly mul-tiplication for high data rate applications, especially those withon-chip serial-link bus architectures. To the best of our knowl-edge this is the first serial-serial multiplier that can operate at aninput data sampling rate that is even higher than the serial-par-allel multipliers [27], [33][35].

    The proposed multiplier has an area penalty over othermethods due to its semi-serial and semi-parallel structure. Itneeds a small adder tree of height in the secondstage which is not required in the other methods. It also needsmore flip flops to accumulate and latch the PP bits. The areacomplexities of different serial-serial multipliers are comparedin Table III, where the Normalized Area column lists the areain terms of the wordlength and the normalized values fromTable I. The approximation of is used to simplifythe normalized area expression of our design in the last columnof Table III. The cost of performing the accumulation as fastas the input data buffering required by an I/O limited designby using mixed execution modes in the two stages is the hard-ware overhead. The tradeoff between area and delay is betterillustrated by the area-delay product (ADP) for 8, 16, 32,64, 128 in Fig. 15. The ADP is shown in logarithmic scale dueto the large variation in ADP for different . The area-timetradeoff of our proposed multiplier is acceptable based on themoderate ADP from Fig. 15. In comparison with the parallelmultiplier, the proposed multiplier has better ADP, which isdiscussed in the next subsection.

    B. ASIC Implementation and Post-Layout Simulation Result

    The designs of the proposed 16 16, 32 32, and 6464 multipliers were coded in structural VHDL, synthesized andtechnology mapped to the STM 90-nm standard cell library bySynopsys Design Compiler to obtain the gate-level netlists andthen placed and routed using Cadence SoC Encounter. The lay-outs were generated using Cadence Virtuoso and they were DRC

  • This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    MEHER et al.: HIGH BIT RATE SERIAL-SERIAL MULTIPLIER 11

    Fig. 15. Normalized area-delay-product comparison.

    and LVS verified. The post-layout simulation results were ob-tained by extracting the RC from the routed designs using Syn-opsys StarRC-XT. A transistor level simulation was performedon the extracted netlist. The post-layout simulation results ofthe proposed multiplier are compared with those of the con-ventional parallel CSA array multiplier to ascertain its merits.For a fair comparison, I/O pads were not included in the layoutand the conventional 16 16, 32 32, and 64 64 parallelCSA array multipliers were implemented with on-chip serialdata input buffer.

    An important implementation issue associated with ourcounter-based serial-serial multiplier is that due to the asyn-chronous nature of the serial 1s counter, the counter output isnot available immediately after clock cycles as the last inputhas to propagate from the LSB to the MSB of the counter. Asetup and hold violation can occur if the counter output bits areread at the edge of before they are stabilized. In thiscase the latching of counter outputs to the second adder stagemust be delayed by in the worst case forthe longest counter of width at the center columnof Fig. 11. This is addressed by refraining to triggerthe counter for the last , which provides an additionalguard band of for clock uncertainty.The serial data inputs and are kept at 0 duringthese cycles to maintain the counter outputs.

    A min-max timing analysis is performed on the design afterplacement and routing to ensure that the worst-case delay hasbeen considered to avoid any timing violations. The timingwaveforms of a 16 16 multiplication captured from thepost-layout simulation with Synopsys NanoSim are shown atthe bottom of Fig. 16, where and are the inputs (all

    Fig. 16. Timing relation between , , and of asynchronouscounter and for a 16 16 multiplication.

    TABLE IVAREA m , DELAY (ns) AND ADP m s COMPARISON

    1s for the worse case) to the 5-bit counter at the center column(for , it is a 4-bit counter as in Fig. 11), Counter isthe output of the 5-bit counter, Latch is the output of thelatching register, and is set at 4 GHz. The timingwaveforms in the shaded window is zoomed in and general-ized for multiplication as shown at the top of Fig. 16.From the standard cell library, it is known that the setup time,clock-Q delay and hold time of a DFF are 70, 40, and 20 ps,respectively. It can be seen that there are almost two cyclesof 40 ps and one cycle of

    20 ps before and after the active edge of .Thus, the setup and hold time violations are avoided. It can beseen from Fig. 16 that the counter is able to count from 0 to 16(10 in HEX) with the outputs correctly latched to the secondpipeline adder tree on .

    The post-layout area, delay and ADP of the proposed and con-ventional array multipliers for wordlengths of 16 16, 32 32,and 64 64 are tabulated in Table IV. The delay of the CSAarray multipliers consists of the time required to buffer a com-plete set of operands and the time required for the PP reductionand final CPA. For the proposed counter-based multipliers, thedelay consists of the accumulation delay , the delay ofCSA tree and final CPA , and the additional delay re-quired for bit-propagation of the counters .From the post-layout simulation results of Table IV, our pro-posed multiplier achieves a delay comparable to parallel mul-tiplier but the area is significantly smaller, especially for largeroperands due to its simpler adder network. From Table IV, the

  • This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    12 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

    Fig. 17. Power convergence result of the proposed 64 64 multiplier.

    TABLE VPOWER (mW) AND ENERGY (pJ) COMPARISON

    ADPs of the proposed 32 32 and 64 64 multipliers are ap-proximately 28% and 52% lower than the corresponding CSAarray counterparts.

    If a series of multiplications is to be performed consecu-tively, the proposed multipliers can be easily pipelined intotwo stages. Although and are defined by(14) and (16), respectively, and of (15) mustbe equalized in pipeline mode for proper synchronizationof the two stages and accordingly and arethen determined. In the current implementation, , with

    , is derived from usinga clock divider circuit. In the two-stage pipelined configuration,the proposed 16 16, 32 32, and 64 64 multipliers canachieve throughput rates as high as 216, 120, and 64 MHz,respectively. Note that both the proposed and the CSA arraymultiplier reported in Table IV are not pipelined at the addertree. For a fully pipelined CSA multiplier, the clocking fre-quency could be higher, and so is its throughput, but the areaand routing complexity would be significantly increased aseach stage of FAs requires a set of DFFs.

    The average power consumption was estimated by SynopsysNanosim with random serial inputs at 4 GHz and a supplyvoltage of 1 V. Monte Carlo statistical model [39] is adoptedto obtain the mean power dissipation of each design with99.9% confidence level that the error is bound below 0.25% atconvergence. The Monte Carlo simulation of the mean powerfor the proposed 64 64 multiplier is shown in Fig. 17. Itshows that the cumulative average power recorded after every0.25 s (Sampling Time) converges rapidly within the 0.25%error bound of the final estimate. The average power consump-tions of the proposed and the CSA multipliers were listed inTable V. The energy per operation in Table V is obtained by theproduct of the estimated mean power and the computation time

    of one multiplication operation. In the case of our proposedcounter-based multiplier, the additional cycles required forcompensating the bit propagation of the counters from the LSBto the MSB were included.

    VII. CONCLUSIONIn this paper, a new method for computing serial-serial mul-

    tiplication is introduced by using low complexity asynchronouscounters. By exploiting the relationship among the bits of a par-tial product matrix, it is possible to generate all the rows se-rially in just cycles for an multiplication. Employingcounters to count the number of 1s in each column allows thepartial product bits to be generated on-the-fly and partially ac-cumulated in place with a critical path delay of only an ANDgate and a DFF. The counter-based accumulation reduces thePP height logarithmically and makes it possible to achieve aneffective reduction rate of using an FA-based CSA tree.The post-layout simulation results show that the serial input canbe sampled at a rate as high as 4.54 GHz when the multiplier ismapped to an ASIC with the STM 90-nm CMOS process. Thesampling rate can be increased to 8 GHz if high speed cells fromthe same library are used. The proposed counter-based multi-plier outperforms many serial-serial and serial-parallel multi-pliers in speed but its hybrid architecture does carry an area over-head. Overall, the ADP is comparable to other serial-serial mul-tipliers. Comparing with the 32 32 and 64 64 parallel CSAarray multipliers, the proposed multiplier has comparable speedbut is 35% and 58% more area efficient, respectively. In addi-tion, our 64 64 multiplier consumes about 31% less energyper operation than its parallel counterpart. Last but not least, thecounter-based approach has clear advantage of low I/O require-ment and hence is most suitable for complex SoCs, advancedFPGAs and high-speed bit-serial applications.

    REFERENCES[1] D. Bailey, E. Soenen, P. Gupta, P. Villarrubia, and D. Sang, Chal-

    lenges at 45 nm and beyond, in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des. (ICCAD), San Jose, CA, 2008, pp. 1118.

    [2] B. Parhami, Computer Arithmetic: Algorithms and Hardware De-signs. New York: Oxford Univ. Press, 2009.

    [3] A. D. Booth, A signed binary multiplication technique, Quarterly J.Mechan. Appl. Math., vol. 4, no. 2, pp. 236240, Aug. 1951.

    [4] R. Menon and D. Radhakrishnan, High performance 5:2 compressorarchitectures, IEE Proc-Circuits Devices Syst., vol. 153, no. 5, pp.447452, Oct. 2006.

    [5] C. H. Chang, J. Gu, and M. Zhang, Ultra low-voltage low-powerCMOS 4-2 and 5-2 compressors for fast arithmetic circuits, IEEETrans. Circuits Syst. I, Reg. Papers, vol. 51, no. 10, pp. 19851997,Oct. 2004.

    [6] M. Burzio and P. Pellegrino, Serilizing-parallelizing circuit for highspeed digital signals, U.S. Patent 5 790 058, Aug. 4, 1998.

    [7] STMicroelectronics, Geneva, Switzerland, IO-PAD librarydatasheet, 2006.

    [8] H. K. Mecklai and A. L. Webb, Compact buffer design for serial I/O,U.S. Patent 6 167 109, Dec. 26, 2000.

    [9] T. Gloekler, I. Holm, R. C. Koester, and M. W. Riley, High speedon-chip serial link apparatus, U.S. Patent 2008/0 133 800, Jun. 5, 2008.

    [10] C. Dudha, T. Schoenauer, and P. Wallner, Parallelization of serial dig-ital input signals, U.S. Patent 2008/0 055 126 A1, Mar. 6, 2008.

    [11] W. M. Pitio, Serilizer, U.S. Patent 2003/0 102 992 A1, Jun. 5, 2003.[12] M. Koga, Serial/parallel converter, U.S. Patent 6 339 387 B1, Jan. 15,

    2002.[13] A. K. Lenstra and E. Verheul, Selecting cryptographic key sizes, J.

    Cryptology, vol. 14, no. 4, pp. 255293, 2001.

  • This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    MEHER et al.: HIGH BIT RATE SERIAL-SERIAL MULTIPLIER 13

    [14] J.-Y. Lai and C.-T. Huang, Elixir: High-throughput cost-effectivedual-field processors and the design framework for elliptic curvecryptography, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.16, no. 11, pp. 15671580, Nov. 2008.

    [15] A. Hariri and A. Reyhani-Masoleh, Bit-serial and bit-parallel mont-gomery multiplication and squaring over , IEEE Trans.Comput., vol. 58, no. 10, pp. 13321345, Oct. 2009.

    [16] M. Ghoneima, Y. Ismail, M. Khellah, J. Tschanz, and V. De, Serial-link bus: A low-power on-chip bus architecture, IEEE Trans. CircuitsSyst. I, Reg. Papers, vol. 56, no. 9, pp. 20202032, Sep. 2009.

    [17] R. Dobkin, A. Morgenshtein, A. Kolodny, and R. Ginosar, Parallel vs.serial on-chip communication, in Proc. ACM Int. Workshop Syst. LevelInterconnect Prediction (SLIP), Newcastle, U.K., 2008, pp. 4350.

    [18] R. Dobkin, M. Moyal, A. Kolodny, and R. Ginosar, Asynchronouscurrent mode serial communication, IEEE Trans. Very Large ScaleIntegr. (VLSI) Syst., vol. 18, no. 7, pp. 11071117, Jul. 2010.

    [19] P. Teehan, G. Lemieux, and R. Greenstreet, Towards reliable 5 Gbpswave-pipelined and 3 Gbps surfing interconnect in 65 nm FPGAs,in Proc. 17th ACM/SIGDA Int. Symp. Field-Program. Gate Arrays(FPGA), Newcastle, U.K., 2009, pp. 4352.

    [20] I. N. Chen and R. Willoner, An O(n) parallel multiplier with bit se-quential input and output, IEEE Trans. Comput., vol. C-28, no. 10, pp.721727, Oct. 1979.

    [21] R. F. Lyon, Twos complement pipeline multipliers, IEEE Commun.Lett., vol. 24, no. 4, pp. 418425, Apr. 1976.

    [22] N. R. Strader and V. T. Rhyne, A canonical bit-sequential multiplier,IEEE Trans. Comput., vol. C-31, no. 8, pp. 791795, Aug. 1982.

    [23] P. Ienne and M. A Viredaz, Bit-serial multipliers and squarers, IEEETrans. Comput., vol. 43, no. 12, pp. 14451450, Dec. 1994.

    [24] G. Bi and E. V. Jones, High-performance bit-serial adders andmultipliers, IEE Proc G-Circuits Devices Systs, vol. 139, no. 1, pp.109113, Feb. 1992.

    [25] A. Almiladi, M. K. Ibrahim, M. Al-Akidi, and A. Aggoun, Highperformance scalable bidirectional mixed radix- serial serial mul-tipliers, IET Proc-Comput. Digit. Tech., vol. 1, no. 5, pp. 632639,2007.

    [26] K. Z. Pekmestzi, P. Kalivas, and N. Moshopoulos, Long unsignednumber systolic serial multipliers and squarers, IEEE Trans. CircuitsSyst. II, Brief Papers, vol. 48, no. 3, pp. 316321, Mar. 2001.

    [27] H. I. Saleh, A. H. Khalil, M. A. Ashour, and A. E. Salama, Novelserial-parallel multipliers, IEE Proc-Circuits Devices Syst., vol. 148,no. 4, pp. 183189, Aug. 2001.

    [28] R. Gnanasekaran, On a bit-serial input and bit-serial output multi-plier, IEEE Trans. Comput., vol. C-32, no. 9, pp. 878880, 1983.

    [29] P. T. Balsara and D. T. Harpe, Understanding VLSI bit serial multi-pliers, IEEE Trans. Edu., vol. 39, no. 1, pp. 1928, Feb. 1996.

    [30] A. Aggoun, A. Ashur, and M. K. Ibrahim, Area-time efficient serial-serial multipliers, in Proc. IEEE Conf. Circuits Syst. (ISCAS), Geneva,Switzerland, 2000, pp. 585588.

    [31] O. Nibouche, A. Bouridarie, and M. Nibouche, New architecturesfor serial-serial multiplication, in Proc. IEEE Conf. Circuits Syst.(ISCAS), Sydney, Australia, 2001, vol. 2, pp. 705708.

    [32] S. Haynal and B. Parhami, Arithmetic structures for inner-product andother computations based on a latency-free bit-serial multiplier design,presented at the 13th Asilomar Conf. Signals, Syst. Comput., Geneva,Switzerland, 1996.

    [33] I.-C. Wu, A fast 1-D serial-parallel systolic multiplier, IEEE Trans.Comput., vol. C-36, no. 10, pp. 12431247, Oct. 1987.

    [34] R. Gnanasekaran, A fast serial-parallel binary multiplier, IEEETrans. Comput., vol. C-34, no. 8, pp. 741744, Aug. 1985.

    [35] S. Sunder, F. El-Guibaly, and A. Antoniou, Twos-complement fastserial-parallel multiplier, IEE Proc.-Circuits Devices Syst, vol. 142,no. 1, pp. 4144, Feb. 1995.

    [36] M. R. Meher, C. C. Jong, and C. H. Chang, High-speed and low-power serial accumulator for serial/parallel multiplier, in Proc. IEEEAsia-Pacific Conf. Circuits Syst. (APCCAS), Macau, China, 2008, pp.176179.

    [37] E. Kreyszig, Advanced Engineering Mathematics. New York, NJ:Wiley, 2006.

    [38] STMicroelectronics, Geneva, Switzerland, Standard cell librarydatasheet, 2006.

    [39] R. Burch, F. N. Najm, P. Yang, and T. N. Trick, A Monte Carlo ap-proach for power estimation, IEEE Trans. Very Large Scale Integr.(VLSI) Syst., vol. 1, no. 1, pp. 6371, Mar. 1993.

    Manas Ranjan Meher (S08) received the B.S.degree from G.M. College, Sambalpur University,Orissa, India, in 2001, the M.S. degree from UtkalUniversity, Orissa, India, in 2003, both in computerscience, and the Master of Technology degree inelectronics and communication engineering fromNational Institute of Technology (NIT), Rourkela,India, in 2006. He is currently, pursuing the Ph.D.degree from the School of Electrical and ElectronicEngineering, Nanyang Technological University,Singapore.

    He worked as a Research Engineer with the VLSI Design Laboratory, De-partment of Electronics and Communication Engineering, NIT. His researchinterest includes designing high-speed bit-serial arithmetic circuits, algorithmto hardware mapping of computational intensive DSP algorithms and digital ICdesign.

    Mr. Meher has been listed in Marquis Whos Who in the World since 2009.

    Ching Chuen Jong received the B.Sc. (Eng) andPh.D. degrees in electronic engineering from QueenMary, University of London, London, U.K.

    He had worked in the area of high-level synthesisof digital systems for over three years both in theacademics and the industries in U.K. before he joinedthe Nanyang Technological University, Singapore, asa faculty member, where he is currently an AssociateProfessor with the Division of Circuits and Systems,School of Electrical and Electronic Engineering.His current technical interest includes high-level

    synthesis, digital integrated circuit design, fast-prototyping of digital systemswith FPGA, reconfigurable systems and CAD for IC designs.

    Dr. Jong is a Chartered Engineer (CEng), a Chartered IT Professional (CITP),a member of the Institution of Engineering and Technology (IET), and a memberof the British Computer Society (BCS).

    Chip-Hong Chang (S92M98SM03) receivedthe B.Eng. (Hons) degree from National Universityof Singapore, Singapore, in 1989, and the M.Eng.and Ph.D. degrees from Nanyang TechnologicalUniversity (NTU), Singapore, in 1993 and 1998,respectively.

    He served as a Technical Consultant in industryprior to joining the School of Electrical and Elec-tronic Engineering, NTU, in 1999, where he is cur-rently an Associate Professor. He holds joint appoint-ments at the university as Assistant Chair of Alumni,

    School of EEE since June 2008, Deputy Director of the Centre for High Per-formance Embedded Systems (CHiPES) since 2000, Program Director of theCentre for Integrated Circuits and Systems (CICS) from 20032009, and Pro-gram Director of VLSI Laboratory since 2010. His current research interestsinclude low power arithmetic circuits, digital filter design, application specificdigital signal processing, and digital watermarking for IP protection. He haspublished three book chapters and over 150 research papers in refereed interna-tional journals and conferences.

    Dr. Chang serves as the Associate Editor of the IEEE TRANSACTIONS ONCIRCUITS AND SYSTEMSPART I: REGULAR PAPERS since 2010, Editorial Ad-visory Board Member of the Open Electrical and Electronic Engineering Journalsince 2007, and the Journal of Electrical and Computer Engineering since 2008,the special issue guest editor of the Journal of Circuits, Systems and Computersin 2010 and in several international conference advisory and technical programcommittee. He is a Fellow of the IET.