
International Journal of Electronic Engineering Research

Volume 1 Number 1 (2009) pp. 53-61

Research India Publications

http://www.ripublication.com/ijeer.htm

A 16×16 MUX Based Multiplier Design Using Optimized Static CMOS Logic Style

Abhijit Asati* and Chandrashekhar**

* Lecturer, Electrical & Electronics Engineering Group, BITS, Pilani, India

** Director, Central Electronics Engineering Research Institute, Pilani, India

Abstract

The simpler VLSI implementation of array multipliers makes them preferable for smaller operand sizes, in spite of their linear time complexity. In general, array multipliers have poor space complexity, $O(n^2)$: they require approximately $n^2$ cells to produce a product, so the circuit takes a large area and power as the operand size grows. In this paper we present a MUX based 16×16 unsigned multiplier circuit, which uses an efficient partial product generation and partial product addition technique. The time and space complexity of such a multiplier are much better than those of simpler array multiplier techniques. The multiplier has been designed using optimized static CMOS logic cells to provide the best area, power and delay performance. The multiplier circuit is implemented using conventional CMOS logic in the 0.6 µm N-well CMOS process (SCN_SUBM, lambda = 0.3) of MOSIS, and simulated after parasitic extraction. The simulation results show a large reduction in propagation delay and average power compared to the tree multiplier implementation of [3].

Keywords: MUX based, array, Wallace tree, Booth encoding, partial product, complexity, operand size.

Introduction

In digital signal processor implementations, both standard digital signal processors and ASIC digital signal processors, the multiplier is a fundamental building block. The performance of signal processing algorithms such as frequency domain filtering (FIR and IIR), frequency-time transformations (FFT) and correlation depends on the performance of the multiplier implementation. In most real-time DSP tasks, the multiplier block must operate at high speed while consuming little layout area and power. Multiplication algorithms differ in the means of partial product generation and partial product addition [1].

Array multipliers have linear time complexity, i.e. $O(n)$, so their delay degrades for larger operand sizes. Array multipliers also have poor space complexity, $O(n^2)$: they require approximately $n^2$ cells to produce a product, so the circuit takes a large area and power as the operand size grows [2], [4], [5]. The number of partial product rows can be reduced by a factor of $n$ using radix-$m$ Booth encoding (where $m = 2^n$). With Booth radix-4 ($m = 4 = 2^2$) encoding the partial product rows are halved [3]; the number of logic cells required to generate the partial products is therefore reduced to $n^2/2$ [2]. Further, in Wallace tree accumulation the ripple effect is reduced, so the product is produced in far less time and the time complexity is reduced to $O(\log n)$. However, a Wallace tree requires a larger gate and routing area than a regular array and is hence unsuitable for VLSI implementation [2]. The hardware reduction of the Booth encoding scheme can be combined with accelerated Wallace tree accumulation of the partial products to obtain the reduced time complexity of $O(\log n)$, which is well suited to multipliers with large operand sizes [2], [3].

For smaller operand sizes, tree based architectures may have a smaller gate delay but consume more silicon area due to increased routing and encoding overheads; array multipliers, on the other hand, have a larger gate delay but a smaller routing length. MUX based array multipliers give a faster and more compact implementation due to efficient partial product generation and efficient partial product addition.

In this paper we present an implementation of a 16×16 multiplier design using the MUX based array technique and static CMOS logic cells. These static CMOS logic cells provide the best area, power and delay performance, as described in [6]. The VLSI implementation of the multiplier circuit is done in the 0.6 µm N-well CMOS process (SCN_SUBM, lambda = 0.3) of MOSIS, using conventional CMOS logic. Simulation results are compared with a faster Booth encoded Wallace tree multiplier implementation [3]. Section II discusses the conventional static CMOS logic design style, Section III explains the MUX based multiplier algorithm, Section IV illustrates the multiplication logic, and Section V describes the schematics of the 4×4 and 16×16 multipliers. Physical implementation and results are described in Section VI. Section VII concludes the paper.
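As a rough illustration of the cell counts quoted in the introduction, the following sketch tabulates the approximate number of partial product cells for a plain array ($n^2$) and for radix-4 Booth encoding ($n^2/2$). The function names are hypothetical, and the model of one cell per partial product bit is an assumption made only for illustration; it ignores encoder and accumulation overheads.

```python
def partial_product_rows(n: int, bits_per_row: int = 1) -> int:
    """Rows of partial products when the multiplier operand is consumed
    bits_per_row bits at a time (1 = plain array, 2 = radix-4 Booth)."""
    return -(-n // bits_per_row)  # ceiling division


def approx_cells(n: int, bits_per_row: int = 1) -> int:
    """Rough cell count, assuming one cell per partial product bit."""
    return partial_product_rows(n, bits_per_row) * n


if __name__ == "__main__":
    print("n   array cells   radix-4 Booth cells")
    for n in (4, 8, 16, 32):
        print(f"{n:<3} {approx_cells(n):<13} {approx_cells(n, bits_per_row=2)}")
```

For n = 16 this gives 256 cells for the plain array and 128 for the radix-4 case, which is the halving described in the text.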

Conventional static CMOS Logic Design style

A static logic gate generates its output corresponding to the applied input voltages after a certain time delay, and it preserves its output level (or state) as long as the power supply is provided. In steady state each gate output is connected to either Vdd or Gnd through a low-resistance path, so for a static input the output levels are preserved; the operation of dynamic logic circuits, by contrast, relies on temporary storage of signal values on the capacitance of dynamic circuit nodes. The conventional static logic style offers a versatile implementation of logic functions based on the static, or steady state, behavior of simple CMOS structures. It is suitable and widely accepted for many VLSI circuit implementations due to important properties such as high speed, low power, large noise margins, no logic degradation and validity of the logic design style at scaled down technologies.

A logic gate with a fan-in of n requires 2n devices (n N-type + n P-type). Two logic blocks, the N-block and the P-block, form a CMOS gate. The topology of the N-block is the dual of that of the P-block. Since both blocks have an equal number of transistors, the transistor count may be high. The channel widths of series connected n-channel MOS transistors (NMOS) or p-channel MOS transistors (PMOS) have to be increased to obtain a reasonable conducting current to drive capacitive loads. The increase in the size of the PMOS transistors results in a significant area overhead and an increased gate input capacitance, which may lead to high dynamic power dissipation. The higher gate input capacitance also loads the previous stage and thereby increases the delay. The ratio of PMOS to NMOS transistor widths should be chosen optimally to achieve good noise margins, higher speed and lower power consumption, as described in [7], [9]. The short-circuit currents of a static CMOS gate can be minimized by sizing the transistors for equal rise and fall times.

The schematics of the 1-bit full adder, 2-input AND, 3-input AND and 2-input MUX functions implemented in the conventional static CMOS logic design style are shown in Figure 1. The full adder cell, designed using the principle of symmetry, has 28 transistors, as described in [6], [8]. The 28-transistor version performs considerably better than the 40-transistor version [6]. A 32-bit adder designed using complementary CMOS has a power-delay product of less than half that of the CPL version [6]. The 2-input AND cell, 3-input AND cell, 2-input MUX and other cells also provide a better power-delay product.

Figure 1: Schematics using the conventional static CMOS logic design style of (a) a complex full adder cell using the principle of symmetry, (b) a 2-input AND gate, (c) a 3-input AND gate and (d) a 2-input MUX.
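The logic function of a full adder cell such as the one in Figure 1(a), and the 2n device count of a static CMOS gate, can be checked with a short behavioural sketch. It assumes the common carry-reuse formulation, in which the carry is generated first and its complement is reused to form the sum; this is an assumption about the cell, and the function names are hypothetical. The sketch models logic only, not transistor sizing.

```python
from itertools import product


def static_gate_transistors(fan_in: int) -> int:
    """Devices in a single static CMOS gate: n NMOS plus n PMOS."""
    return 2 * fan_in


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (sum, carry_out) using the carry-reuse form (assumed):
    cout = ab + cin(a + b), sum = a b cin + (a + b + cin) * not(cout)."""
    cout = (a & b) | (cin & (a | b))
    not_cout = 1 - cout
    s = (a & b & cin) | ((a | b | cin) & not_cout)
    return s, cout


if __name__ == "__main__":
    # Exhaustive check against integer addition.
    for a, b, cin in product((0, 1), repeat=3):
        s, cout = full_adder(a, b, cin)
        assert 2 * cout + s == a + b + cin
    print("full adder truth table verified")
    print("2-input gate devices:", static_gate_transistors(2))
```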




MUX based multiplier algorithm

This is an unsigned multiplier algorithm in which one bit of the multiplier and one bit of the multiplicand are processed in parallel. The algorithm is symmetric, i.e., the multiplier and multiplicand can be interchanged. In this algorithm the sum of the two operands, progressively computed, is a useful quantity that is used in the computation of certain partial products. The different quantities are computed one bit at each step of the algorithm, and the appropriate quantity is then selected in the next step, if required. The parallel implementation of this algorithm yields an iterative array. Compared to an implementation based on the modified Booth algorithm, it uses the same amount of circuitry but yields faster multiplication. This multiplexer-based architecture computes the partial sums of the two operands in parallel, which simplifies tasks such as compression and accumulation. It also compares favorably in processing speed with other regular array architectures. The multiplication logic can be explained using equations 1 to 5.

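Let

$$X = x_{n-1} x_{n-2} \ldots x_0, \qquad Y = y_{n-1} y_{n-2} \ldots y_0, \qquad P = XY \tag{1}$$

$X_j$ and $Y_j$ are the binary numbers obtained by truncating $X$ and $Y$, respectively, up to the $(j+1)$th bit:

$$X_j = x_j x_{j-1} \ldots x_0, \qquad Y_j = y_j y_{j-1} \ldots y_0, \qquad 0 \le j \le n-1 \tag{2}$$

Since $X_j = x_j 2^j + X_{j-1}$ and $Y_j = y_j 2^j + Y_{j-1}$ (with $X_{-1} = Y_{-1} = 0$), the partial product $P_j = X_j Y_j$ satisfies

$$P_j = P_{j-1} + Z_j 2^j + x_j y_j 2^{2j} \tag{3}$$

where

$$Z_j = x_j Y_{j-1} + y_j X_{j-1} =
\begin{cases}
0 & \text{if } x_j = 0,\ y_j = 0 \\
X_{j-1} & \text{if } x_j = 0,\ y_j = 1 \\
Y_{j-1} & \text{if } x_j = 1,\ y_j = 0 \\
X_{j-1} + Y_{j-1} & \text{if } x_j = 1,\ y_j = 1
\end{cases} \tag{4}$$

Applying equation 3 repeatedly, and noting that $Z_0 = 0$, gives the product $P = P_{n-1}$:

$$P = \sum_{j=0}^{n-1} x_j y_j 2^{2j} + \sum_{j=1}^{n-1} Z_j 2^j \tag{5}$$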
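The decomposition of the product into AND terms $x_j y_j 2^{2j}$ and MUX-selected terms $Z_j 2^j$ can be checked numerically. The sketch below is a minimal behavioural model, not the authors' implementation: the function name `mux_based_product` is hypothetical, and Python integers stand in for the hardware signals.

```python
def mux_based_product(x: int, y: int, n: int) -> int:
    """Evaluate P = sum_j x_j y_j 2^(2j) + sum_j Z_j 2^j for n-bit unsigned x, y."""
    result = 0
    x_prev = 0  # X_{j-1}: bits j-1..0 of x (X_{-1} = 0)
    y_prev = 0  # Y_{j-1}: bits j-1..0 of y (Y_{-1} = 0)
    for j in range(n):
        xj = (x >> j) & 1
        yj = (y >> j) & 1
        # 4:1 MUX: the select pair (x_j, y_j) picks one of four candidates.
        candidates = {
            (0, 0): 0,
            (0, 1): x_prev,
            (1, 0): y_prev,
            (1, 1): x_prev + y_prev,
        }
        zj = candidates[(xj, yj)]
        result += (xj & yj) << (2 * j)  # AND term x_j y_j 2^(2j)
        result += zj << j               # MUX term Z_j 2^j
        x_prev |= xj << j
        y_prev |= yj << j
    return result


if __name__ == "__main__":
    # Exhaustive check for 4-bit operands, then one 16-bit spot check.
    for a in range(16):
        for b in range(16):
            assert mux_based_product(a, b, 4) == a * b
    assert mux_based_product(0xFFFF, 0xFFFF, 16) == 0xFFFF * 0xFFFF
    print(mux_based_product(7, 3, 4))
```

Keeping `x_prev` and `y_prev` as running values mirrors the progressive computation described in the text: each step only appends one bit to the truncated operands.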

Illustration of the Multiplication Logic

Example 1 shows the multiplication process for two 4-bit binary numbers using the MUX-based approach. The number of rows remains the same, but the number of partial product bits to be compressed in any column is now restricted to 3, which makes compression much faster and easier. If the carry bits C1, C2, C3 shown in Example 1 are handled separately, only 2 bits remain to be added in each column. Two columns can then be added simultaneously using a 2-bit CLA, which also accepts the carry input C1, C2 or C3 of the particular column (this is possible because these carries occur in alternate columns). Thus the first step of the algorithm is the generation of the partial product rows, and the second step performs the addition of these partial products together with compression. The multiplier is therefore faster than other regular array multipliers. It produces its output in time $T = (n+1)\,\tau_{FA\_2CLA}$, where $\tau_{FA\_2CLA}$ is the delay of a 2-bit CLA adder, with a timing overhead of one 4:1 MUX delay, while a regular array multiplier takes an approximate delay of $T = (2n)\,\tau_{FA}$. The main area overhead is due to the routing needed between the MUXes.
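For the 16-bit case considered in this paper, the two delay expressions work out as follows. This is plain substitution of n = 16, with $\tau$ denoting a cell delay; $\tau_{MUX}$ is a symbol introduced here for the 4:1 MUX overhead, and no numerical cell delays are assumed.

```latex
\begin{align*}
T_{\text{MUX array}} &= (n+1)\,\tau_{FA\_2CLA} + \tau_{MUX} = 17\,\tau_{FA\_2CLA} + \tau_{MUX}, \\
T_{\text{regular array}} &\approx 2n\,\tau_{FA} = 32\,\tau_{FA}.
\end{align*}
```

Under these expressions the MUX based array is the faster of the two whenever $17\,\tau_{FA\_2CLA} + \tau_{MUX} < 32\,\tau_{FA}$.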

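Example 1: The terms X0Y0, X1Y1, X2Y2 and X3Y3, at the positions shown below, have to be added to the appropriate terms selected by the 4:1 MUXes, based on the select lines shown in the first column. Let X = X3X2X1X0 = 0111 = (+7)10 and Y = Y3Y2Y1Y0 = 0011 = (+3)10.

Working of MUX: select lines 00/01/10/11 correspond to I1/I2/I3/I4.

The first row holds the AND terms X3Y3, X2Y2, X1Y1 and X0Y0 = 0, 0, 1, 1 in columns P6, P4, P2 and P0. Each following row holds the bits selected by the 4:1 MUXes of that row.

Select line for 4:1 MUX   P7  P6  P5  P4  P3  P2  P1  P0
AND terms XjYj                0       0       1       1
X1Y1 = 11 (I4)                                1   0
X2Y2 = 10 (I3)                        0   1   1
X3Y3 = 00 (I1)                0   0   0   0
Product                   0   0   0   1   0   1   0   1   = (21)10

The MUX inputs I1/I2/I3/I4 behind each selected bit are listed below. The symbolic form explains the operation performed by the algorithm, and the values after the equals sign show its application to the selected inputs X and Y.

X1Y1 = 11: P2: 0/0/0/C1 = 0/0/0/1; P1: 0/X0/Y0/S0 = 0/1/1/0
X2Y2 = 10: P4: 0/0/0/C2 = 0/0/0/1; P3: 0/X1/Y1/S1 = 0/1/1/1; P2: 0/X0/Y0/S0 = 0/1/1/0
X3Y3 = 00: P6: 0/0/0/C3 = 0/0/0/1; P5: 0/X2/Y2/S2 = 0/1/0/0; P4: 0/X1/Y1/S1 = 0/1/1/1; P3: 0/X0/Y0/S0 = 0/1/1/0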
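The rows of Example 1 can be reproduced bit by bit. The sketch below is an illustrative model under stated assumptions: the helper names `mux_rows` and `product_from_rows` are hypothetical, S_i and C_i are taken to be the sum and carry bits of a ripple addition of X and Y, and each row is returned as a mapping from column index to bit.

```python
def mux_rows(x: int, y: int, n: int) -> list[dict[int, int]]:
    """Return the partial product rows as {column: bit} dictionaries."""
    xb = [(x >> i) & 1 for i in range(n)]
    yb = [(y >> i) & 1 for i in range(n)]
    # Progressive sum bits S_i and carries C_i of X + Y, with C_0 = 0.
    s, c = [], [0]
    for i in range(n):
        total = xb[i] + yb[i] + c[i]
        s.append(total & 1)
        c.append(total >> 1)
    # Row 0: AND terms x_j y_j placed in column 2j.
    rows = [{2 * j: xb[j] & yb[j] for j in range(n)}]
    for j in range(1, n):
        sel = 2 * xb[j] + yb[j]  # select lines 00/01/10/11 -> I1/I2/I3/I4
        # Columns j .. 2j-1 choose among 0 / X_i / Y_i / S_i.
        row = {j + i: (0, xb[i], yb[i], s[i])[sel] for i in range(j)}
        # Column 2j chooses among 0 / 0 / 0 / C_j.
        row[2 * j] = (0, 0, 0, c[j])[sel]
        rows.append(row)
    return rows


def product_from_rows(rows: list[dict[int, int]]) -> int:
    """Add all row bits with their column weights."""
    return sum(bit << col for row in rows for col, bit in row.items())


if __name__ == "__main__":
    rows = mux_rows(0b0111, 0b0011, 4)
    for row in rows:
        print(dict(sorted(row.items(), reverse=True)))
    print(product_from_rows(rows))
```

For X = 7 and Y = 3 the second row contains C1 = 1 in column 2 and S0 = 0 in column 1, the third row contains 0, 1, 1 in columns 4, 3 and 2, and the rows add up to 21, in agreement with Example 1.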




Schematic 4×4 multiplier and 16×16 multiplier

The logic explained in Example 1 can be shown through a schematic which uses 4:1 multiplexers and AND gates, as shown in Figure 2. The multiplexers are used to choose the $Z_j$ for the $Z_j 2^j$ terms (refer to equation 5), while the AND gates are used to produce the $x_j y_j 2^{2j}$ terms. The complete logic structure to accumulate the partial product terms uses Cell-I and Cell-II, which are shown in Figure 3 [2]. A similar technique can be used in the design of the 16×16 multiplier.
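Counting the blocks of the 4×4 schematic gives a feel for how the partial product generation scales. Row j of the array (1 ≤ j ≤ n-1) uses j + 1 multiplexers, which matches the nine 4:1 MUXes and four AND gates of the 4×4 case. Extending the same count to n = 16 is an assumption based on the statement that a similar technique is used, and it excludes the accumulation cells (Cell-I and Cell-II).

```latex
\begin{align*}
N_{\text{MUX}}(n) &= \sum_{j=1}^{n-1} (j+1) = \frac{(n-1)(n+2)}{2}, & N_{\text{AND}}(n) &= n, \\
N_{\text{MUX}}(4) &= 9, & N_{\text{MUX}}(16) &= 135.
\end{align*}
```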

[Figure 2 schematic: nine 4:1 MUXes, selected by X1Y1, X2Y2 and X3Y3, take the data inputs (0, X0, Y0, S0), (0, X1, Y1, S1), (0, X2, Y2, S2), (0, 0, 0, C1), (0, 0, 0, C2) and (0, 0, 0, C3) and generate the terms Z1·2^1, Z2·2^2 and Z3·2^3; four AND2 gates generate the terms 2^0·X0Y0, 2^2·X1Y1, 2^4·X2Y2 and 2^6·X3Y3.]

Figure 2: Logic for MUX based multiplier implementation.

[Figure 3 schematic: Cell-I combines a 4:1 MUX and a full adder (FA), with inputs Xj, Yj, Xi, Yi, Si, Cin and Sin and outputs Cout and Sout. Cell-II combines two AND2 gates and two full adders, with inputs Xj, Yj, Cj, Cin and Sin and outputs Sout, Cout and Cj+1. Both cell symbols carry the labels Xi = Xj, Yi = Yj and Si = Sj.]

Figure 3: Cell-I and Cell-II used in MUX-based multiplier implementation.

Figure 4: Photomicrograph of a 16×16 MUX based multiplier.

Physical implementation and Result

The layout of the 16×16 MUX based unsigned multiplier circuit shown in Figure 4 is implemented in the 0.6 µm N-well CMOS process (SCN_SUBM, lambda = 0.3) of MOSIS, using conventional CMOS logic. A schematic library consisting of 7 functional cells is defined for the static CMOS design style, comprising the 1-bit full adder, 2-input AND, 3-input AND, 2-input MUX, 2-input XOR, 2-input OR and 3-input OR functions. Corresponding to the schematic library, physical libraries were designed in the conventional CMOS logic design style using the design principles of [7], [8], [9], [10]. Three different versions of each physical library were developed by sizing the W/L ratio of the NMOS transistor to values of 3, 5 and 7, respectively. (W/L values smaller than 3 were also tried but not considered further, as the weak transistor drive resulted in parasitic dominated, slower speeds, making them poor candidates for high performance.) The layout assembly of the 16-bit multiplier was carried out using these cell libraries and the automatic place and route tool L-Edit (SPR) from Tanner Research Inc. The physical library using a W/L ratio of 3 for the NMOS transistor gave the smallest average switching energy-delay product.

The generated layouts were simulated after parasitic extraction using the circuit simulator ELDO SPICE. The supply voltage VDD is kept at 3.3 V. Table 1 compares important parameters, namely propagation delay and power dissipation at a 20 MHz data rate, with the tree based implementation of [3]. Table 2 shows the maximum power, leakage power, transistor count, core area, total routing length and number of vias.

Table 1

Algorithm (technology)    VDD (V)   Propagation delay (ns)   Average power (mW)
Proposed (0.6 µm)         3.3       14.15                    22.05
BEWM, ref [3] (1.25 µm)   5         60                       100

Table 2

Algorithm (technology)   Maximum power (mW)   Leakage power (nW)   Transistor count   Core area (mm²)   Total routing length (mm)   Number of vias
Proposed (0.6 µm)        623.46               53.34                10168              23.76             1386.71                     3452

Comparing the two multiplier architectures shows that the proposed MUX based array multiplier reduces the propagation delay to approximately 0.235 of that of the reference design (14.15 ns versus 60 ns) and the average power consumption to approximately 0.22 (22.05 mW versus 100 mW). The maximum instantaneous power, leakage power, transistor count, core area, total routing length and number of vias are also shown for judging the VLSI implementation characteristics.
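The comparison can be reproduced from the Table 1 figures. The sketch below recomputes the delay and power ratios and additionally derives a power-delay product for each design; the power-delay product is not reported in the paper and is included only as an illustrative derived metric. Variable and function names are hypothetical.

```python
# Values copied from Table 1.
proposed = {"delay_ns": 14.15, "power_mw": 22.05}   # 0.6 um process, VDD = 3.3 V
reference = {"delay_ns": 60.0, "power_mw": 100.0}   # BEWM of [3], 1.25 um, VDD = 5 V


def ratio(key: str) -> float:
    """Proposed value divided by reference value for one Table 1 column."""
    return proposed[key] / reference[key]


def power_delay_product_pj(design: dict) -> float:
    """Average power times propagation delay; ns * mW gives pJ."""
    return design["delay_ns"] * design["power_mw"]


delay_ratio = ratio("delay_ns")
power_ratio = ratio("power_mw")

if __name__ == "__main__":
    print(f"delay ratio: {delay_ratio:.4f}")
    print(f"power ratio: {power_ratio:.4f}")
    print(f"PDP proposed:  {power_delay_product_pj(proposed):.1f} pJ")
    print(f"PDP reference: {power_delay_product_pj(reference):.1f} pJ")
```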




Conclusion

This paper presents a 16-bit MUX based unsigned multiplier implementation using an optimized static CMOS logic style. The multiplier algorithm performs efficient partial product generation and addition, which makes its time and space complexity better than those of other array multipliers. Compared with a faster tree multiplier implementation, the simulation results show that the propagation delay and the average switching power are each reduced to approximately 1/4.

References

[1] A. Hesham, "Technology scaling effects on multipliers", IEEE Transactions on Computers, Vol. 47, No. 11, pp. 1201-1215, November 1998.

[2] Z. Kiamal, "Multiplexer-based array multipliers", IEEE Transactions on Computers, Vol. 48, No. 1, pp. 15-23, January 1999.

[3] F. Jalil, "M×N Booth encoded multiplier generator using optimized Wallace trees", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 1, No. 2, pp. 120-125, June 1993.

[4] V. Chanramouli, "Self-Timed design in GaAs - case study on a high-speed, parallel multiplier", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 4, No. 1, pp. 146-149, March 1996.

[5] P. Kornerup, "A systolic, linear-array multiplier for a class of right-shift algorithms", IEEE Transactions on Computers, Vol. 43, No. 8, pp. 892-898, August 1994.

[6] Reto Zimmermann and Wolfgang Fichtner, "Low-Power Logic Styles: CMOS Versus Pass-Transistor Logic", IEEE Journal of Solid-State Circuits, Vol. 32, No. 7, pp. 1079-1090, July 1997.

[7] Mohab Anis, Mohamed Allam and Mohamed Elmasry, "Impact of Technology Scaling on CMOS Logic Styles", IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 49, No. 8, pp. 577-587, August 2002.

[8] S. M. Kang and Yusuf Leblebici, CMOS Digital Integrated Circuits: Analysis and Design, Third Edition, McGraw-Hill, 2003.

[9] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, 1994.

[10] Jan M. Rabaey, Anantha Chandrakasan and Borivoje Nikolic, Digital Integrated Circuits, Second Edition, Prentice-Hall of India Private Limited, 2004.