Upload
sridhar-pramod
View
217
Download
0
Embed Size (px)
Citation preview
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 1/63
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 2/63
Department of ECE, JNTUHCEH 2
multipliers have been plagued by complicated switching systems and/or irregularities
in design.
1.2 Power Optimization
Power refers to number of Joules dissipated over a certain amount of time
whereas energy is the measure of the total number of Joules dissipated by a circuit.
In digital CMOS design, the well-known power-delay product is commonly
used to assess the merits of designs. In a sense, this can be shown as power ×
delay = (energy/delay) × delay = energy, which implies delay is irrelevant.
1.3 Low-Power Multiplier Design
Multiplication consists of three steps: generation of partial products or
(PPG), reduction of partial products (PPR), and finally carry-propagate addition
(CPA).In general there are sequential and combinational multiplier implementations.
Now consider combinational case here because the scale of integration now is large
enough to accept parallel multiplier implementations in digital VLSI systems.
Different multiplication algorithms vary in the approaches of PPG, PPR, and CPA.
For PPG, AND gate array is the easiest, also use array of multiplexers for PPG.
For PPR, two alternatives exist: reduction by rows, performed by an array of
adders, and reduction by columns, performed by an array of counters. The final CPA
requires a fast adder scheme because it is on the critical path. In some cases, final CPA
is postponed if it is advantageous to keep redundant results from PPG for further
arithmetic operations.
1.4 Languages and Tools Used
Considered Verilog HDL as our primary language. For simulation used
synopsys VCS compiler. For synthesis have used Synopsys Design Compiler
90nm process technology.
1.5 Research Approach
The basic motive of the project was to study and develop an efficient fast and
low power multiplier. As the name suggests had to go for faster and low power factor
optimization simultaneously. The basic building block of a multiplier is ADDER
circuit. Hence turned our focus to the adders first. Studied the area occupied and the
time delay consumed by different adders and found out a proper relation between time
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 3/63
Department of ECE, JNTUHCEH 3
and area complexity of all the adders under consideration.
Then turned our focus to the Multipliers. In Multipliers studied different
multipliers writing programs, verifying waveforms and then finally calculating area
along with power consumed by the circuit. After knowing all this also calculated
delay for different multipliers which helped us to determine the best multiplier. HPM
multiplier was found to be the best multiplier among all with less power
consumption and proper area, delay trade-off. Future work will be to optimize power
consumed by different multipliers there by reducing number of gates used and area
occupied by them.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 4/63
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 5/63
Department of ECE, JNTUHCEH 5
Complexity (A) Delay (T) Product (A x T) Adder Schemes
O(n)
O(n)
O(n)
O(n)
O(n1/*1+1)
O(log n)
O(n2)
O(n*1+2/*1+1)
O(n log n)
Ripple-Carry
Carry-Select
Carry-Look ahead
Table 2.1 Categorization of adders with respect to delay time and capacity
2.2 Half and Full AdderThe basic building block of a multiplier is ADDER circuit. An adder or
summer is a digital circuit that performs addition of numbers. In many computers and
other kinds of processors, adders are used not only in the arithmetic logic unit, but also in
other parts of the processor, where they are used to calculate addresses, table indices, and
similar operations.
2.2.1 Half Adder
A half adder adds two one-bit binary numbers “ain” and “ bin”. It has two
outputs,“Sout” and “cout” (the value theoretically carried on to the next addition); the
final sum is, “2cout + sout”. The simplest half-adder design, pictured below in figure
2.1, incorporates an XOR gate for “sout” and an AND gate f or “cout”. Half
adders cannot be used compositely, given their incapacity for a carry-in bit. TABLE
2.2 shows the truth table of half adder.
Figure 2.1 Logic Diagram of Half Adder
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 6/63
Department of ECE, JNTUHCEH 6
HALF ADDER INPUTS HALF ADDER OUTPUTS
ain bin sout cout
0
0
1
1
0
1
0
1
0
1
1
1
0
0
0
1
Table 2.2 Truth Table of Half Adder
2.2.2 Full Adder
A full adder adds binary numbers and accounts for values carried in as well as
out. A one-bit full adder adds three one-bit numbers, often written as “ain”, “ bin”, and
“cin” where “ain” and bin are the operands, and “cin” is a bit carried in (in theory from
a past addition). The circuit produces a two-bit output sum typically represented by
the signals „cout‟ and „sout‟. The one-bit full adder's truth table shown in TABLE 2.3.
FULL ADDER INPUTS FULL ADDER OUTPUTS
Ain bin cin sout cout
0
0
0
0
1
1
1
1
0
0
1
1
0
0
1
1
0
1
0
1
0
1
0
1
0
1
1
0
1
0
0
1
0
0
0
1
0
1
1
0
Table 2.3 Truth Table of Full Adder
A full adder can be implemented in many different ways such as with a
custom transistor-level circuit or composed of other gates. One implementation is
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 7/63
Department of ECE, JNTUHCEH 7
= ⊕ ⊕
= · + · + ·
In this implementation for generation of carry, using three AND gates and two
OR gates as shown in figure 2.2. As per the logic diagram this adder has more number of
gates. The results from the synthesis tool are shown in table 2.4.
Figure 2.2 Logic Diagram of Full Adder1
Full adder can be implemented using only two types of gates and is convenient
if the circuit is being implemented using simple IC chips which contain only one gate
type per chip. In this light, „cout can be implemented as shown in figure 2.3.
Figure 2.3 Logic Diagram of Full Adder2
A full adder can be constructed from two half adders by connecting “ain” and“ bin” to the input of one half adder, connecting the sum from that to an input to the
second adder, connecting „cin‟ to the other input and OR the two carry outputs.
Equivalently, „sout‟ could be made the three-bit XOR of „ain‟, „ bin‟, and „cin‟, and
„cout‟ could be made the three-bit majority function of „ain‟, „ bin‟ and „cin‟ as shown in
figure 2.4 where h1 and h2 half adders shown in figure 2.1.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 8/63
Department of ECE, JNTUHCEH 8
Figure 2.4 Logic Diagram of Full Adder3
T he full adder can be viewed as a 3:2 compressor it sums three one-bit
inputs, and returns the result as a single two-bit number. Thus, for example, a binary
input of 101 results in an output of 1+0+1=10 (decimal number '2'). The carry-out
represents bit one of the result, while the sum represents bit zero. Likewise, a half
adder can be used as a 2:2 compressor. The results from the synthesis tool for the
three full adders are shown in TABLE 2.4. From figure it is clear that of all the full
adders shown the full adder1 is the better choice. So throughout the project used full
adder1 both as adder and 3:2 compressor.
Type of Full Adder Area (µm2) Delay (ps) Power (uw)
Full adder 1
Full Adder 2
Full Adder 3
30.1
41.1
51.5
0.45
0.45
0.53
6.6
12.64
14.72
Table 2.4 Comparisons of Different Implementations of Full Adders
2.3 Ripple Carry Adder
The well known adder architecture, ripple carry adder is composed of
cascaded full adders for n-bit adder, as shown in figure 2.5. It is constructed by
cascading full adder blocks in series. The carry out of one stage is fed directly to the
carry-in of the next stage. For an n-bit parallel adder it requires “n” full adders.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 9/63
Department of ECE, JNTUHCEH 9
Figure 2.5 4-Bit Ripple Carry Adder
Multiple full adder circuits can be cascaded in parallel to add an N-bit number.For an N- bit parallel adder, there must be N number of full adder circuits. A ripple carry
adder is a logic circuit in which the carry-out of each full adder is the carry in of the
succeeding next most significant full adder.
It is called a ripple carry adder because each carry bit gets rippled into the next
stage. In a ripple carry adder the sum and carry out bits of any half adder stage is not
valid until the carry in of that stage occurs. Propagation delays inside the logic circuitry
is the reason behind this. Propagation delay is time elapsed between the application of an
input and occurrence of the corresponding output. Not very efficient when large number
bit numbers are used. Delay increases linearly with bit length.
2.3.1 Delay
Delay from Carry-in to Carry-out is more important than from A to carry-out
or carry-in to SUM, because the carry-propagation chain will determine the latency
of the whole circuit for a Ripple-Carry adder.
2.4 Carry Select Adder
In Carry select adder scheme, blocks of bits are added in two ways: One
assuming a carry-in of 0 and the other with a carry-in of 1.
Because of multiplexers larger area is required. Have a lesser delay than Ripple
Carry Adders (half delay of RCA). Hence always go for Carry Select Adder while
working with smaller no of bits.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 10/63
Department of ECE, JNTUHCEH 10
Figure 2.6 Carry Select with 1 Level using n/2- bit RCA
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 11/63
Department of ECE, JNTUHCEH 11
As shown in the figure 2.6, is the basic building block of a carry-select adder, the
carry-select adder generally consists of two ripple carry adders and a multiplexer. Adding
two n-bit numbers with a carry-select adder is done with two adders (therefore two ripple
carry adders) in order to perform the calculation twice, one time with the assumption of the
carry being zero and the other assuming one. After the two results are calculated, the correct
sum, as well as the correct carry, is then selected with the multiplexer once the correct carry
is known.
The block size should have a delay, from addition inputs a and b to the carry out,
equal to that of the multiplexer chain leading into it, so that the carry out is calculated just in
time. where the resulting carry and sum bits are selected by the carry-in. Since one ripple
carry adder assumes a carry-in of 0, and the other assumes a carry-in of 1, selecting whichadder had the correct assumption via the actual carry-in yields the desired result.
2.5 Carry Look Ahead Adder
Carry Look Ahead Adder can produce carries faster due to carry bits generated in
parallel by an additional circuitry whenever inputs change. This technique uses carry
bypass logic to speed up the carry propagation. Let ai and bi be the augends and
addend inputs, ci the carry input, si and ci+1, the sum and carry-out to the i
th
bit position.If the auxiliary functions, pi and gi called the propagate and generate signals, the
sum output respectively are defined as follows.
Figure 2.7 4 BIT CLA Logic equations
pi = ai + bi……..(2.1) gi = ai bi……… (2.2)
si = ai xor bi xor ci .....(2.3) ci+1 = gi + pici .....(2.4)
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 12/63
Department of ECE, JNTUHCEH 12
ACLA = O (n) = 14n ……………(2.5)
TCLA = O (log n) = 4 log2n. ……….(2.6)
2.6 Binary To Excess - 1 Code Converter (BEC)
Binary to Excess One Converter can also be used as a adder where are having
results available and waiting for only carry from the previous stage. Then can use this BEC
converter can calculate the result with one as carry. Then by using the carry coming
from the previous stage as a select signal to the multiplexer and can get the original
result.
Figure 2.8 5 Bit Binary to Ecess – 1 Code Converter with out carry with carry
The BEC gets n inputs and generates n output; the BECWC (BEC with Carry)
gets n input and generates n+1 output to give the carry output.
The output value is one more than the given input. The detailed structures of the
5-bit BEC without carry (BEC) and with carry (BECWC) are shown figure 2.8. The
function table of BEC and BECWC are shown in TABLE 2.5.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 13/63
Department of ECE, JNTUHCEH 13
Table 2.5 Functional Table of 5 Bit BEC and BECWC
Input BEC without carry BEC with carry
b[4:0] x[4:0] cy x[4:0]
00000
00001
00010
00011
00100
11011
11100
11101
11110
11111
00001
00010
00011
00100
00101
11100
11101
11110
11111
00000
0
0
0
0
0
0
0
0
0
1
00001
00010
00011
00100
00101
11100
11101
11110
11111
00000
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 14/63
Department of ECE, JNTUHCEH 14
Chapter -3
The Multipliers
3.1 Introduction
High speed multiplication is a primary requirement of high performance digital
systems. In recent trends the column compression multipliers are popular for high speed
computations due to their higher speeds. The data flow in a column compression
multiplier is shown in figure 3.1.
Figure 3.1 Data flow in a column compression multiplier
As shown in the figure 3.1 it is clear that the total delay of the multiplier can be
split up into three parts: due to the Partial Product Generation (PPG), the Partial
Product Summation Tree (PPST), and finally due to the Final Adder. Of these the
dominant components of the multiplier delay are due to the PPST and the final adder. The
relative delay due to the PPG is small. Therefore significant improvement in the speed
of the multiplier can be achieved by reducing the delay in the PPST and the final adder
stage of the multiplier.
The first column compression multiplier was introduced by Wallace in 1964. He
reduced the partial product of N rows by grouping into sets of three row set and two row
set using (3, 2) counters and (2, 2) counters respectively. In 1965,
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 15/63
Department of ECE, JNTUHCEH 15
Dadda altered the approach of Wallace by starting with the exact placement of the
(3, 2) counters and (2, 2) counters in the maximum critical path delay of the multiplier.
Since 2000‟s, a closer reconsideration of Wallace and Dadda multipliers has been done
and proved that the Dadda multiplier is slightly faster than the Wallace multiplier and the
hardware required for Dadda multiplier is lesser than the Wallace multiplier. In 2006, H.
Eriksson along with his research team presented HPM reduction tree structure that has an
ease of layout compared to Dadda‟s approach . Compared to Dadda, HPM is slightly
faster and consumes lesser power while area being the same.
3.2 Wallace Tree Multiplier
The Wallace tree multiplier is considerably faster than a simple array multiplier
because its height is logarithmic in word size, not linear. However, in addition to the large
number of adders required, the Wallace tree‟s wiring is much less regular and more
complicated. As a result, Wallace trees are often avoided by designers, while design
complexity is a concern to them.
3.2.1 WALLACE Column Compression Algorithm
1. The N rows of partial products are together in sets of three each. Any additional
rows that are not a member of a group of three are transferred to the next level
without modification.
2. Within each group of three rows, (3,2) compressors are applied to the columns
containing three bits and (2,2) compressors are applied to the columns containing
two bits.
3. Columns containing only a single bit are transferred to the next level unchanged.
d0 = N
d j+1 = 2*[d j/3] + d j mod 3 ………. (3.1)
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 16/63
Department of ECE, JNTUHCEH 16
full
adder
full
adder
x7
y7
x6
y6
x5
y5
x4
y4
x3
y3
x2
y2
x1
y1
x0
y0
final adder
half half 7 6 5 4 3 2 1 0 adder adder 15 14 13 12 11 10 9 8
23 22 21 20 19 18 17 16 reduction 31 30 29 28 27 26 25 24 stage 1 39 38 37 36 35 34 33 32
47 46 45 44 43 42 41 40 55 54 53 52 51 50 49 48
63 62 61 60 59 58 57 56
23
c7
s7
c6
s6
c5
s5
c4
s4
c3
s3
c2
s2
c1
s1
c0 s0 0
stage 2 47 s15 s14 s13 s12 s11 s10 s9 s8 24
c15 c14 c13 c12 c11 c10 c9 c8 55 54 53 52 51 50 49 48
63 62 61 60 59 58 57 56
c23 s23 s22 s21 s20 s19 s18 s17 s16 s0 0 reduction 47 s15 s14 c22 c21 c20 c19 c18 c17 c16 stage 3 63 c30 s30 s29 s28 s27 s26 s25 s24 c8
c31 s31 c29 c28 c27 c26 c25 c24
reduction c30 s41 s40 s39 s38 s37 s36 s35 s34 s33 s32 s16 s0 0
stage 4 63 c41 c40 c39 c38 c37 c36 c35 c34 c33 c32
c31 s31 c29 c28 c27 c26 c25 c24
s44 s43 s42 s32 s16 s0 0
c52 c51 c50 c49 c48 c47 c46 c45 c44 c43 c42
p15 p14 p13 p12 p11 p10 p9 p8 p7 p6 p5 p4 p3 p2 p1 p0
Figure 3.2 8 x 8 Wallace Tree Multiplier Logarithmic Depth Hierarchy
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 17/63
Department of ECE, JNTUHCEH 17
3.2.2 Applying Wallace Compression Algorithm to 8 x 8 multiplier
Consider N- bit Multiplier X and N- bit Multiplicand. X and Y are represented as
Multiplicand, Y = yn-1 yn-2 yn-3 . . . . . . . . y3 y2 y1 y0 ……………..(3.2)
Multiplier,X = xn-1 xn-2 xn-3 . . . . . . . . x3 x2 x1 x0 ……………..(3.3)
The flow diagram below shows the intermediate state reductions of the multipliers are
being done by full adders and half adders while the final step additions being done by a RCA.
The flow diagram was done in Microsoft Excel sheet as shown in figure 3.2. The architecture of
the 8 x 8 Wallace Multiplier along with the theoretical delay values is shown in figure 3.3
where FA is full adder and HA is half adder.
Figure 3.3 Architecture of 8 x 8 Wallace Tree Multiplier with RCA as final adder
3.3 DADDA Multiplier
The Dadda multiplier is faster than Wallace Tree multiplier because the Wallace
tree‟s wiring is much less regular and more complicated. Dadda multiplier over comes
this disadvantage by placing the 3:2 compressors and 2:2 compressors in the critical path.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 18/63
Department of ECE, JNTUHCEH 18
The disadvantage in Dadda multiplier is that the layout is not regular.
However, unlike Wallace multipliers that reduce as much as possible on each layer,
Dadda multipliers do as few reductions as possible. Because of this, Dadda multipliers have
a less expensive reduction phase, but the numbers may be a few bits longer, thus requiring
slightly bigger adders.
Take any three wires with the same weights and input them into a full adder. The
result will be an output wire of the same weight and an output wire with a higher weight for
each three input wires. If there are two wires of same weight left, and the current number of
output wires with that weight is equal 2 (modulo 3), input them into a half adder.
Otherwise, pass them through to the next layer. If there is just one wire left, connect it to
the next layer.
3.3.1 Dadda Column Compression Algorithm
1. Let d1=2 and d j+1=[1.5*d j]. “dj: is the height of the matrix for the j th stage.
Repeat until the largest jth stage is reached in which the original N height
matrix contains at least which has more than “dj” partial products.
2. In the jth
stage from the end, place (3,2) and (2,2) compressors as required to
achieve a reduced matrix. Only columns with more than “dj” partial products as
they receive carries from less significant (3,2) and (2,2) compressors are reduced
3. Let j=j-1 and repeat step 2 until a matrix with a height of two is generated. This
should occur when j=1.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 19/63
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 20/63
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 21/63
Department of ECE, JNTUHCEH 21
63 55 47 39 31 23 15 7 6 5 4 3 2 1 0 62 54 46 38 30 22 14 13 12 11 10 9 8
61 53 45 37 29 21 20 19 18 17 16 N=8
60 52 44 36 28 27 26 25 24
59 51 43 35 34 33 32
58 50 42 41 40
57 49 48
56
63 55 47 39 31 23 29 21 6 5 4 3 2 1 0 62 54 46 38 30 36 28 13 12 11 10 9 8
61 53 45 37 43 35 20 19 18 17 16 N=7
60 52 44 50 42 27 26 25 24 59 51 57 49 34 33 32
58 c0 56 41 40
c1 s1 s0 48
63 55 47 39 31 44 50 42 20 5 4 3 2 1 0 62 54 46 38 51 57 49 27 12 11 10 9 8
61 53 45 58 c0 56 34 19 18 17 16 N=6
60 52 c1 s1 s0 41 26 25 24 59 c4 c3 c2 48 33 32
c5 s5 s4 s3 s2 40
63 55 47 39 52 c1 s1 s0 41 19 4 3 2 1 0 62 54 46 59 c4 c3 c2 48 26 11 10 9 8
61 53 c5 s5 s4 s3 s2 33 18 17 16 N=5
60 c10 c9 c8 c7 c6 40 25 24 c11 s11 s10 s9 s8 s7 s6 32
63 55 47 60 c10 c9 c8 c7 c6 40 18 3 2 1 0 62 54 c11 s11 s10 s9 s8 s7 s6 25 10 9 8
N=4
c19 s19 s18 s17 s16 s15 s14 s13 s12 24
63 55 c19 s19 s18 s17 s16 s15 s14 s13 s12 17 2 1 0 62 c28 c27 c26 c25 c24 c23 c22 c21 c20 24 9 8 N=3
c29 s29 s28 s27 s26 s25 s24 s23 s22 s21 s20 16
63 16 1 0 N=2
c41 s41 s40 s39 s38 s37 s36 s35 s34 s33 s32 s31 s30 8
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 22/63
Department of ECE, JNTUHCEH 22
Figure 3.6 8 x 8 HPM Multiplier Logarithmic Depth Hierarchy
3.4.2 Applying HPM Compression Algorithm to 8 x 8 multiplier
The flow diagram below shows the intermediate state reductions of the
multipliers are being done by full adders and half adders while the final step additions
being done by a RCA. The flow diagram is shown in figure 3.6 and the architecture
figure 3.7, where FA is full adder and HA is half adder.
Figure 3.7 Architecture of 8 x 8 HPM Multiplier with RCA as final adder
3.5 Analysis of Multipliers
The theoretical comparison of number of adders Wallace,Dadda and HPM
multipliers is shown in the table 3.1
Compression
Technique
Number of Full
Adders
Number of Half
Adders
Carry Propagation
Adder(CPA) Length
Wallace
Dadda
HPM
N2 - 4.N + 2 + S
N2 – 4.N + 3
N2 – 4.N + 3
> N
N – 1
N – 1
2.N – 1 – S
2.N – 2
2.N – 2
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 23/63
Department of ECE, JNTUHCEH 23
Table 3.1 Theoretical Comparison of Different Multipliers
Where, N = Multiplier and Multiplicand bit length and
S = Number of Reduction stages
Now TABLE 3.2 shows the synthesis results of three multipliers for area,
delay and power. simulated and synthesized 8 x 8 bit, 16 x 16 bit and 32 x 32 bit
multipliers using each column reduction technique. The 2:2 compressor used in
the design is the full adder1 shown in figure 2.1 and 3:2 compressor used is shown
in figure 2.2. All the results shown here are the combinational circuit results i.e.,
inputs and outputs are not registered.
Bit width of
Multiplier
PPST Area (µm2) Delay (ns) Power ( mW)
8 x 8
16 x 16
32 x 32
Wallace
Dadda
HPM
Wallace
Dadda
HPM
Dadda
HPM
2770.4
2240.6
2052.2
9273.5
9116.6
8934.3
47059.1
42724.7
7.62
6.63
5.81
15.46
13.94.
12.77
25.43
25.33
1.2
0.877
0.817
5.07
4.86
4.74
31.09
20.40
Table 3.2 Comparison of Implementation of Different Multipliers w.r.t area, delay and power
From the table 3.2 it is clear that HPM multiplier is the optimal multiplier
compared to the other two with respect to area, delay and power. So used HPM as our
column reduction technique in our proposed multiplier.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 24/63
Department of ECE, JNTUHCEH 24
Chapter - 4
TheProposed Design for High Speed and Low Power
4.1 Recursive Multiplication for High Speed
The architecture proposed in this project work is centered on a recursive
multiplication algorithm by Danysh and Swartzlander [15]. The authors present a
multiplication algorithm based on divide and conquer methodologies that introduces
greater regularity in design than standard column compression multipliers, while avoiding the
linear latency of array multipliers.
Recent studies have examined the consequences of technology scaling on arithmetic
circuitry. These investigations strongly support the need for the consideration of
interconnect layout as an integral part of future arithmetic circuitry. The predominant
advantage offered by the recursive multiplication scheme is the use of smaller multipliers to
implement a larger operation, which is in direct compliance with the presented results. This
structure promotes the notion of exploiting locally optimized arrays for reduced
interconnect power through shorter local interconnects, and a more regular integration of the
sub-components on a larger scale.
The recursive multiplier scheme works by executing an n-bit multiplication using 4
n/2-bit multipliers in parallel and adding up the results. The n/2-bit multipliers may he
further reduced, where each sub-multiplier carries out 4 parallel n/4-bit multiplications,
and so forth. In this manner a large multiplication is carried using recursions of simpler
base multiplier modules.
Mathematically, the recursive algorithm may be proved by first considering two
unsigned n-bit operands, the multiplier X and multiplicand A
=
and A may now he defined as:
X = XL + XH …………. (4.1) A = AL+AH …………….(4.2)
The overall multiplication of A and X is given by
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 25/63
Department of ECE, JNTUHCEH 25
= A . X ………..(4.3)
= (AL + AH) . (XL +XH)
= (AL.XL) + (AL.XH) + (AH.XL) + (AH.XH)
= P1 + P2 + P3 +P4 ……………..(4.4)
Figure 4.1 Pictorial Representation of RecursiveMultiplication
Therefore, the overall multiplication may be reduced to four smaller multiplications,
and this process may be repeated using even smaller base multipliers. In order to minimize
the delay introduced by subdividing the process, the result of the base multipliers, or the
intermediary products, will be kept in carry save form. Hence only one final fast adder will
be required to yield the final product. All the four N/2 multiplications derived above are
diagrammatically shown in figure 4.1 (for N = 8). The result from multiplier M1, M2, M3
and M4 are P1, P2, P3 and P4 respectively.
The architecture of recursive multiplier for N x N bit multiplier with RCA as
merging adder is shown in figure 4.2 Each N/2 multiplier in figure 4.2 uses HPM
algorithm for PPST.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 26/63
Department of ECE, JNTUHCEH 26
Figure 4.2 N – Bit Recursive Multiplier with RCA as Merging Adder
Figure 4.2 shows that all the inputs and outputs are registered, so the latency of
the multiplier becomes two i.e, it takes two clock cycles to get the first output and after
that output is obtained for each clock cycle.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 27/63
Department of ECE, JNTUHCEH 27
Now as mentioned earlier the partial products that are dependent (marked in black
in figure. 1(b) and 1(c)) are given to Multiplier M2 and M3 respectively. The products
obtained from M2 and M3 are given to a N- bit RCA and the obtained result is given to
N+1 bit RCA along with the MSB N/2 bits of product from M1 and the LSB N/2-bits of
product from M4. The MSB N/2 bits of M4 product are given to N/2-1 bit RCA with „1‟
as carry input and calculating the result before the actual carry arrives, and used a
multiplexer for selecting the product based on the actual carry generated by N+1-bit
RCA. This dependency and flow can be clearly observed in figure 4.2.
Now in order to improve the speed further replaced the N/2 – 1 RCA Adder
with BEC adder. The logical structure of a 7 – bit bec adder is shown in figure 4.3. So by
using this recursive multiplication and hybrid adder (combination of RCA and BEC)
for merging the products for four N/2 bit multipliers achieved speed. Now to reduce the
power have opted the twin precision multiplication which is described in next section.
Figure 4.3 7 – Bit BEC Adder without Carry
4.2 Twin Precision Multiplication For Low Power
Multiplier is a complex arithmetic operation, which is reflected in its relatively high
signal propagation delay, high power dissipation, and large area requirement.When choosing
a multiplier for a digital system,the bit-width of the multiplier is required to be at least as
wide as the largest operand of the applications that are to be executed onthat digital system.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 28/63
Department of ECE, JNTUHCEH 28
The bit-width of the multiplier is therefore, often much larger than the data
represented inside the operands,which leads to unnecessarily high power dissipation and
unnecessary long delay.
This resource waste could partially be remedied by having several multipliers ,each
with a specific bit-width,and use the particular multiplier with the smallest bit-width that is
large enough to accommodate the current multiplication. Such a scheme would assure that a
multiplication would be computed on a multiplier that has been optimized in terms of power
and delay for that specific bit-width.
Figure 4.4 (a) Partial Product Array for N = 8 (b) Partial Products showing the dependency.
In figure 4.4(b) the partial products are partitioned such that obtain four partial
product arrays of N = 4, of them the partial products that are marked as black are
dependent because the output storage bits are same for those arrays. So the partial
products that are in black cannot be operated simultaneously. Thus to increase the
throughput are using the independent partial products that are coated in ash are used
along with operand guarding. The architecture for N = 8 with HPM algorithm as
reduction technique, operand guarding and using the independent partial products for performing two N/2 bit multiplications simultaneously is shown in figure 4.5.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 29/63
Department of ECE, JNTUHCEH 29
Figure 4.5 Block Diagram of 8 x 8 Twin Precision Multiplier
4.3 Clock Gating and Recursive Multiplication
Clock gating is a popular technique used in many synchronous circuits for
reducing dynamic power dissipation. Recursive Multiplication is used to reduce the power.
4.3.1 Clock Gating
Clock gating saves power by adding more logic to a circuit to prune the clock tree.
Pr uning the clock disables portions of the circuitry so that the flip-flops in them do not
have to switch states. Switching states consumes power. When not being switched, the
switching power consumption goes to zero, and only leakage currents are incurred.
Clock gating works by taking the enable conditions attached to registers, and usesthem to gate the clocks. Therefore it is imperative that a design must contain these enable
conditions in order to use and benefit from clock gating. This clock gating process can
also save significant die area as well as power, since it removes large numbers of mu x‟s
and replaces them with clock gating logic. This clock gating logic is generally in the form
of "Integrated clock gating" (ICG) cells. However, note that the clock gating logic will
change the clock tree structure, since the clock gating logic will sit in the clock tree.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 30/63
Department of ECE, JNTUHCEH 30
Clock gating logic can be added into a design in a variety of ways:
1. Coded into the RTL code as enable conditions that can be automatically translated
into clock gating logic by synthesis tools (fine grain clock gating).
2. Inserted into the design manually by the RTL designers (typically as module level
clock gating) by instantiating library specific ICG (Integrated Clock Gating) cells
to gate the clocks of specific modules or registers.
3. Semi-automatically inserted into the RTL by automated clock gating tools. These
tools either insert ICG cells into the RTL, or add enable conditions into the RTL
code. These typically also offer sequential clock gating optimizations.
Sequential clock gating is the process of extracting/propagating the enable
conditions to the upstream/downstream sequential elements, so that additional registers can
be clock gated. Although asynchronous circuits by definition do not have a "clock",
the term perfect clock gating is used to illustrate how various clock gating techniques are
simply approximations of the data-dependent bheaviour exhibited by asynchronous
circuitry. As the granularity on which gate the clock of a synchronous circuit approaches
zero, the power consumption of that circuit approaches that of an asynchronous circuit:
the circuit only generates logic transitions when it is actively computing.
4.3.2 Recursive Multiplication with Clock Gating
In project clock gating is used for achieving operator isolation in the
Recursive multiplier for power reduction. As mentioned earlier recursive multiplier has
four N/2 bit multipliers of which M2 and M3 are dependent and M1 and M4 are
independent. So to achieve low power and double-throughput are using clock gating
technique and are isolating the M2 and M3 multipliers without transferring inputs
them. To perform twin precision multiplication an extra control input is needed. Here are
considering a two it input “Twin” as a control input. The “Twin” is passed through a 2:3
decoder which generates T[1], T[2] and T[3] as control signals and these signals are used
for the operator isolation. TABLE 4.1 shows the truth table of 2:3 decoder. As shown
in the TABLE 4.1 have four operating modes:
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 31/63
Department of ECE, JNTUHCEH 31
Mode 0 – Both M1 and M4 in operation for Twin Precision
Mode 1 – Only M1 in operation
Mode 2 – Only M4 in operation
Mode 3 – Full Mode operation
Operation
ModeT[1] T[2] T[3]
00 – Both M1
and M4 in
operation for
Twin Precision
1 0 1
01 – Only M1 in
operation1 0 0
10 – Only M4 in
operation0 0 1
11 – Full Mode
operation1 1 1
TABLE 4.1 Decoder Truth Table
Now the each output signal of the decoder is given to 2 input AND gate with
clock as another input thus generating three clocks namely clock1, clock2 and clock3 by
T[1], T[2] and T[3] respectively. Clock2 drives registers of M2 and M3 as shown in
figure 4.6. So only in mode 3 the multipliers M2 and M3 will be working and in
remaining all modes they are in off condition thus saving the switching power.
The advantage in this design compared to the regular twin precision multiplier in
is that are isolating the operator instead of operand guarding. So in this design can make
use of one multiplier at a time for one N/2- bit multiplication but in regular twin precision
have to give all zeros for MSB N/2 bits of multiplier and multiplicand in order to
operate the multiplier for same operation, so there is restriction in giving inputs which is
not feasible always. But the control circuit here provides the control to overcome this.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 32/63
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 33/63
Department of ECE, JNTUHCEH 33
The architecture shown in figure 4.6 has increased speed and also has the flexibility
for N/2 bit multiplication with less power consumption and double-throughput. This can
be clearly observed in the result analysis.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 34/63
Department of ECE, JNTUHCEH 34
Chapter - 5
ASIC Implementation of Proposed Design
5.1 Introduction
In this project implemented different types of multipliers but multipliers with
HPM column reduction technique, recursive multiplier and recursive multiplier with
clock gating are quite important here the implementation of these multipliers is being
described.
For VLSI (hardware) implementation followed ASIC design flow starting from
RTL description to the GDSII. The architecture is described using VERILOG HDL and
the functional simulation is done in VCS simulator , synthesis is carried out in DC
COMPILER.
5.2 ASIC Design Methodology
Application Specific Integrated Circuit (ASIC) Design, as the name suggests this
design focuses on the development of a hardware module which is completely dedicated
to that particular application or process. This type of design helps in the economical usage
of silicon and also has a good speed compared to the other implementations such as
FPGA and CPLD devices. In general for the development of an ASIC follow a flow
called ASIC design flow. ASIC design flow can be seen in figure 5.1, and the discussion
of each step is done in following sections.
Figure 5.1 ASIC Design
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 35/63
Department of ECE, JNTUHCEH 35
5.2.1 System Partitioning
This is the first step of the ASIC design flow; here the complex problem statement
is decomposed into smaller subsystems. The decomposition is carried out hierarchically
until each subsystem is of manageable size.
5.2.2 Design Entry
In any design, specifications are written first abstractly describing the
functionality, interface and overall architecture of the circuit. A behavioural description is
then created to analyze the design in terms of functionality, compliance to standards, and
other high-level issues. Typically behavioural (simulation model) descriptions are created in
HDLs. Here used VERILOG HDL for design entry.
5.2.3 Simulation
Simulation is carried out at this stage for the written code and this type of
simulation is called as behavioural simulation or functional simulation. Here,the
simulation is carried out by the help of “testbench”, testbench is a piece of code which
provide the required stimulus or inputs and control signals to the design, by observing
theoutputs in a waveform confirm the functionality. In t h design used VCS simulator
for functional simulation.
5.2.4 Synthesis
The next stage is synthesis, synthesis means converting the written code into gates
and its interconnections. In this stage the conversion of the code into gates and
interconnections is done by mapping to a particular technology i.e., either 0.35µm
technology or 0.18µm technology or 90nm technology. The technology here refers to the
gate length of the transistors used in our design. The output of this synthesis stage is the
“gate level netlist (.vg)” and “design constraints (.sdc)” files. Gate level netlist contains
information of the gates and interconnections and design constraints contain the
information such as the clock frequency, wire-load models used. This is the final stage of
the logical flow or the front-end flow. The output files i.e., .v and .sdc are taken as input
to the physical or backend flow.
RTL synthesis is an automated design task in which high-level design descriptions
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 36/63
Department of ECE, JNTUHCEH 36
written in Hardware Description Languages (such as VHDL, Verilog, or SystemVerilog) are
transformed into gate-level netlists. Gate-level netlist is basically a circuit implementation of
the design made of library components (both combinational and sequential cells) available
in the technology library and their interconnections. The netlist is generated by the synthesis
tool according to the constraints set by the designer.
Design Compiler is RTL Synthesis tool by Synopsys. It supports UNIX platforms
and is installed on Institute's computer systems (see here for available versions on each
platform linux). Design Compiler is not supported on Windows platform.
Synthesis with Design Compiler include the following main tasks: reading in the
design, setting constraints, optimizing the design, analyzing the results and saving the design
database. These tasks are described as follows
5.3 Synthesis Overview
Synthesis with Design Compiler include the following main tasks: reading in the
design, setting constraints, optimizing the design, analyzing the results and saving the design
database. These tasks are described below
5.3.1 Reading in the Design
The first task in synthesis is to read the design into Design Compiler memory.
Reading in an HDL design description consist of two tasks: analyzing and elaborating the
description. The analysis command (analyze) performs the following tasks
Reads the HDL source and checks it for syntactical errors Creates HDL library
objects in an HDL-independent intermediate format and saves these intermediate files in a
specified location
5.3.2 Constraining the design
The next task is to set the design constraints. Constraints are the instructions that the
designer gives to Design Compiler. They define what the synthesis tool can or cannot do
with the design or how the tool behaves. Usually this information can be derived from the
various design specifications (e.g. from timing specification).
There are basically two types of design constraints:
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 37/63
Department of ECE, JNTUHCEH 37
5.3.2.1 Design Rule Constraints
Design rules constraints are implicit constraints which means that they are defined
by the ASIC vendor in technology library. By specifying the technology library that Design
Compiler should use, also specify all design rules in that library and cannot discard or
override these rules.
5.3.2.2 Optimization Constraints
Optimization constraints are explicit constraints (set by the designer). They describe
the design goals (area, timing, and so on) the designer has set for the design and work as
instructions for the Design Compiler how to perform synthesis.
Design rule constraints comprise:
5.3.2.3 Maximum transition time
Longest time allowed for a driving pin of a net to change its logic value
5.3.2.4 Maximum fanout
Maximum fanout for a driving pin
5.3.2.5 Maximum (and minimum) capacitance
The maximum (and minimum) total capacitive load that an output pin can drive. The
total capacitance comprises of load pin capacitance and interconnect capacitances.
5.3.2.6 Cell degradation
Some technology libraries contain cell degradation tables. The cell degradation
tables list the maximum capacitance that can be driven by a cell as a function of the
transition times at the inputs of the cell.
5.3.2.7 System clock definition and clock delays
Clock constraints are the most important constraints in your ASIC design. The clocksignal is the synchronization signal that controls the operation of the system. The clock
signal also defines the timing requirements for all paths in the design. Most of the other
timing constraints are related to the clock signal.
5.3.2.8 Multicycle paths
A multicycle path is an exception to the default single cycle timing requirement of
paths. That is, on a multicycle path the signal requires more than a single clock cycle to
propagate from the path startpoint to the path endpoint.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 38/63
Department of ECE, JNTUHCEH 38
5.3.2.8 Input and output delays
Input and output delays constrain external path delays at the boundaries of a
design. Input delay is used to model the path delay from external inputs to the first
registers in the design. Output delay constrain the path from the last register to the
outputs of the design.
5.3.2.9 Minimum and maximum path delays
Minimum and maximum path delays allow constraining paths individually
and setting specific timing constraints on those paths..
5.3.4 Optimizing the Design
The following section presents the behavior of Design Compiler optimization step.
The optimization step translates the HDL description into gate-level netlist using the cells
available in the technology library. The optimization is done in several phases. In each
optimization phase different optimization techniques are applied according to the design
constraints.
5.3.4.1 Gate-level Optimizations
Gate-level optimizations work on the technology-independent netlist and maps it to
the library cells to produce a technology-specific gate-level netlist. Gate-level optimizations
include the following processes:
5.3.4.2 Area Optimization
Area optimization is the last step that Design Compiler performs on the design.
During this phase, only those optimizations that don't break design rules or timing
constraints are allowed.
5.3.5 Reporting and Analyzing the Design
Once the synthesis has been completed, need to analyze the results. Design
Compiler provides together with its graphical user interface (Design Vision) various means
to debug the synthesized design. These include both textual reports that can be generated for
different design objects and graphical views that help inspecting and visualizing the design.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 39/63
Department of ECE, JNTUHCEH 39
There are basically two types of analysis methods and tools:
5.3.5.1 Generating reports for design object properties
Reporting commands generate textual reports for various design objects:
timing and area, cells, clocks, ports, buses, pins, nets, hierarchy, resources,
constraints in the design, and so on.
5.3.5.2 Visualizing design objects (Design Vision)
Some design objects and their properties can be analyzed graphically. may
examine for example the design schematic and explore the design structure, visualize
critical and other timing paths in the design, generate histograms for various metrics
and so on.
5.3.6 Save Design
The final task in synthesis with Design Compiler is to save the synthesized design.
The design can be saved in many formats but should save for example the gate-level netlist
(usually in Verilog) and/or the design database. Remember that by default, Design Compiler
does not save anything when exiting.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 40/63
Department of ECE, JNTUHCEH 40
Chapter -6
Results
6.1 ASIC Results
In this chapter will see the simulation and synthesis results of the various
multipliers along with the proposed design.
6.1.1 Simulation Results
Figure 6.1 Simulation Result of 16 x 16 HPM Multiplier
Analysis
Signal In/Out Description
clk input Input to the multiplier
Rst input Input to the multiplier
datain1[15:0] input Input to the multiplier
datain2[15:0] input Input to the multiplier
dataout[31:0] output Output of the mulitplier
Table 6.1 Analysis of 16 x 16 HPM Multiplier
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 41/63
Department of ECE, JNTUHCEH 41
Figure 6.2 Simulation Result of 32 x 32 HPM Multiplier
Analysis
Signal In/Out Description
clk input Input to the multiplier
rst input Input to the multiplier
datain1[31:0] input Input to the multiplier
datain2[31:0] input Input to the multiplier
dataout[63:0] output Output of the mulitplier
Table 6.2 Analysis of 32 x 32 HPM Multiplier
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 42/63
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 43/63
Department of ECE, JNTUHCEH 43
Figure 6.4 Simulation Result of 32 x 32 Recursive Multiplier with clock gating
Analysis
Signal In/Out Description
clk input Input to the multiplier
Rst input Input to the multiplier
datain1[31:0] input Input to the multiplier
datain2[31:0] input Input to the multiplier
dataout[63:0] output Output of the mulitplier
Table 6.4 Analysis of 32 x 32 Recursive Multiplier with clock gating
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 44/63
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 45/63
Department of ECE, JNTUHCEH 45
6.2.2 Power report of 16 x 16 Basic HPM
The above report simply displays the total power of the design. Dynamic power is
the power dissipated when the circuit is active i.e. performing some function. Dynamic
power is further divided into two components: Switching power and Internal power.
Switching power is dissipated when charging and discharging the load capacitance at
the cell output. The amount of switching power depends on the switching activity (is related
to the operating frequency) of the cell. The more there are logic transitions on the cell
output, the more switching power increases.
Internal power is consumed within a cell for charging and discharging internal cell
capacitances. Internal power also includes short-circuit power. During logic transitions both
P and N type transistors are both on simultaneously for a short time causing direct
connection from Vdd rail to ground rail.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 46/63
Department of ECE, JNTUHCEH 46
6.2.3 Timing report of 16 x 16 Basic HPM
The delay report shows delay calculation in two sections: the first section for data
arrival time calculation and the second for data required time calculation. The data arrival
time is the time required for signal to travel from path start point to a path end point. The
data required time is the maximum time a signal has for traveling that path. The difference
of data required time and data arrival time is called slack or timing margin of the path. If
slack is negative, there is a timing violation on that path.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 47/63
Department of ECE, JNTUHCEH 47
6.2.4 Area Report of 32 x 32 Basic HPM
The above report simply displays the total area of the design. The total area is the
sum of three factors: combinational, noncombinational, and net interconnect area. The total
cell area is due to logic cells in design is shown by the combinational (basic logic gates like
ANDs, ORs, and the like) and the noncombinational (registers) factors. The third factor
affecting the area (net interconnect area) is due to the wires connecting these cells .
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 48/63
Department of ECE, JNTUHCEH 48
6.2.5 Power report of 32 x 32 Basic HPM
The above report simply displays the total power of the design. Dynamic power is
the power dissipated when the circuit is active i.e. performing some function. Dynamic
power is further divided into two components: Switching power and Internal power.
Switching power is dissipated when charging and discharging the load capacitance at
the cell output.The amount of switching power depends on the switching activity (is related
to the operating frequncy) of the cell. The more there are logic transitions on the cell output,
the more switching power increases.
Internal power is consumed within a cell for charging and discharging internal cell
capacitances. Internal power also includes short-circuit power. During logic transitions both
P and N type transistors are both on simultaneously for a short time causing direct
connection from Vdd rail to ground rail.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 49/63
Department of ECE, JNTUHCEH 49
6.2.6 Timing report of 32 x 32 Basic HPM
The delay report shows delay calculation in two sections: the first section for data
arrival time calculation and the second for data required time calculation. The data arrival
time is the time required for signal to travel from path start point to a path end point. The
data required time is the maximum time a signal has for traveling that path. The difference
of data required time and data arrival time is called slack or timing margin of the path. If
slack is negative, there is a timing violation on that path.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 50/63
Department of ECE, JNTUHCEH 50
6.2.7 Area Report of 16 x 16 Recursive Multiplier with Clock Gating
The above report simply displays the total area of the design. The total area is the
sum of three factors: combinational, noncombinational, and net interconnect area. The total
cell area is due to logic cells in design is shown by the combinational (basic logic gates like
ANDs, ORs, and the like) and the noncombinational (registers) factors. The third factor
affecting the area (net interconnect area) is due to the wires connecting these cells .
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 51/63
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 52/63
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 53/63
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 54/63
Department of ECE, JNTUHCEH 54
6.2.11 Power Report of 32 x 32 Recursive Multiplier with Clock Gating
The above report simply displays the total power of the design. Dynamic power isthe power dissipated when the circuit is active i.e. performing some function. Dynamic
power is further divided into two components: Switching power and Internal power.
Switching power is dissipated when charging and discharging the load capacitance at
the cell output.The amount of switching power depends on the switching activity (is related
to the operating frequncy) of the cell. The more there are logic transitions on the cell output,
the more switching power increases.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 55/63
Department of ECE, JNTUHCEH 55
Internal power is consumed within a cell for charging and discharging internal cell
capacitances. Internal power also includes short-circuit power. During logic transitions both
P and N type transistors are both on simultaneously for a short time causing direct
connection from Vdd rail to ground rail.
6.2.12 Timing Report of 32 x 32 Recursive Multiplier with Clock Gating
The delay report shows delay calculation in two sections: the first section for data
arrival time calculation and the second for data required time calculation. The data arrival
time is the time required for signal to travel from path start point to a path end point. The
data required time is the maximum time a signal has for traveling that path. The difference
of data
required time and data arrival time is called slack or timing margin of the path. If slack is
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 56/63
Department of ECE, JNTUHCEH 56
negative, there is a timing violation on that path.
6.3 Comparison
The comparison between the TABLE 5.1 (Basic HPM Multiplier) and TABLE 5.2
(Recursive multiplier with clock gating) summarizes the enhanced performance of the proposed
multiplier in terms of percentages which are listed in TABLE 5.3. The summary of Area, Power
and Delay comparisons in Table 5.1 and 5.2 for 16 and 32 bit are plotted in figures. 5.6, 5.7 and
5.8 respectively.
Multiplier
Word size
Type of
Operation
Area (µm2) Delay
(ns)
Power
(µW)
16 x 16
32 x 32
Basic HPM
Basic HPM
12771.8
44388.3
7.43
13.13
313
815
Table 6.5 HPM Multiplier
Table 6.6 Recursive Multiplier with Clock Gating
Multiplier
Word size
Type of
Operation
Area (µm2) Delay
(ns)
Power
(µW)
16 x 16
32 x 32
Recursive
Multiplication
with Clock
Gating
Recursive
Multiplication with
Clock gating
14297.7
48912.5
7.13
12.13
161
443.7
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 57/63
Department of ECE, JNTUHCEH 57
Table 6.7 Percentage Results of Recursive Multiplier with Clock gating with reference
to the HPM
Area Comparison plot for the tables 6.5 & 6.6 of HPM and Recursive Multiplier with
clock gating
Area comparison of 16 and 32 bit Multipliers
Figure 6.5 Area comparison plot
As shown in the figure 6.5 it is clearly observed that the Area occupied by the 16 bit
is more than compared to the 16 bit basic HPM multiplier, similarly the area occupied by
the 32 bit recursive multiplier is more than compared to the 32 bit basic HPM multiplier.
05000
10000
15000
20000
25000
30000
35000
40000
4500050000
16 x 16 32 x 32
HPM
Recursive
Multiplication with
Clock Gating
Multiplier
Word size
Area (%) Delay (%) Power (%)
16 x 16
32 x 32
11.94
10.19
-4.03
-7.16
-48.5
-45.5
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 58/63
Department of ECE, JNTUHCEH 58
Power Comparison plot for the tables 6.5 & 6.6 of HPM and Recursive Multiplier
with clock gating
Power comparison of 16 and 32 bit Multipliers
Figure 6.6 Power Comparison Plot
As shown in the figure 6.6 it is clearly observed that the power consumed by the 16
bit recursive multiplier is less than compared to the 16 bit basic HPM multiplier, similarly
the power consumed by the 32 bit recursive multiplier is less than compared to the 32 bit
basic HPM multiplier.
0
100
200
300
400
500
600
700
800
900
16 x 16 32 x 32
HPM
Recursive
Multiplication
with Clock
Gating
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 59/63
Department of ECE, JNTUHCEH 59
Delay Comparison plot for the tables 6.5 & 6.6 of HPM and Recursive Multiplier
with clock gating
Figure 6.7 Delay Comparison Plot
As shown in the figure 6.7 it is clearly observed that the Delay of the 16 bit recursive
multiplier is less than compared to the 16 bit basic HPM multiplier, similarly the delay of
the 32 bit recursive multiplier is less than compared to the 32 bit basic HPM multiplier.
Delay comparison of 16 and 32 bit Multipliers
0
2
4
6
8
10
12
14
16 x 16 32 x 32
HPM
Recursive
Multiplication with
Clock Gating
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 60/63
Department of ECE, JNTUHCEH 60
Chapter - 7
Conclusion
In this thesis, successfully achieved a faster and low power multiplication by using
a combination of High Performance Multiplication [HPM] column reduction technique,
implementing a N-bit multiplier by recursive multiplication and acceleration of the final
addition using a hybrid adder (RCA and BEC Adder) and low power has been achieved
by using clock gating technique. The result analysis shows that area overheads are not
significant when compared to the increase in speed and reduction in power
consumption. The proposed multiplier design technique can be implemented with any
type of parallel multipliers to achieve faster and low power performance.
The design is implemented using Verilog HDL and simulated with the help of
VCS Compiler and Synthesis is done by using Design Compiler and with the proposed
architecture, double-throughput has been achieved and the results show that for the 32-bit
proposed multiplier is as much as faster, occupies more area and consumes lesser power
with respect to the regular HPM multiplier.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 61/63
Department of ECE, JNTUHCEH 61
Chapter - 8
Future Scope
As an attempt to develop fast and low-power multiplier design, the research
presented in this dissertation has achieved good results and demonstrated the efficiency of
high level optimization techniques. However, there are limitations in our work and several
future research directions are possible.
The results analysis shows that there is a increase in speed and reduction in power
consumption at synthesis level, by implementing physical design we can still improve
increase in speed and power consumption that would prove better according to situation and
require less power and consume less time.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 62/63
Department of ECE, JNTUHCEH 62
BIBLIOGRAPHY
[1] B.Parhami, "Computer Arithmetic", Oxford University Press, 2000.
[2] E. E. Swartzlander, Jr. and G. Goto, "Computer arithmetic," The Computer
Engineering Handbook, V. G. Oklobdzija, ed., Boca Raton, FL: CRC Press, 2002.
[3] C. S. Wallace, “A Suggestion for a Fast Multiplier ,” IEEE Transactions on
Electronic Computers, Vol. EC-13, pp. 14-17, 1964.
[4] Luigi Dadda, “Some Schemes for Parallel Multipliers,” Alta Frequenza, Vol. 34, pp.
349-356, August 1965
[5] H. Eriksson, P. Larsson-Edefors, M. Sheeran, M. Själander, D. Johansson, and M.
Schölin, “Multiplier reduction tree with logarithmic logic depth and regular
connectivity,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2006, pp. 4 – 8.
[6] V. G. Oklobdzija and D.Villeger , “Improving Multiplier Design by Using Improved
Column Compression Tree and Optimized Final Adder in CMOS Technology”, IEEE
transactions on Very Large Scale Integration (VLSI) systems, Vol. 3, no. 2, June 1995.
[7] Magnus Själander and Per Larsson-Edefors, ” Multiplication Acceleration
Through Twin Precision “, IEEE Trans. O VLSI Systems vol. 17, no. 9, pp. 1233-1245 Sep
2009.[8] V. G. Oklobdzija and D.Villeger , “Improving Multiplier Design by Using Improved
Column Compression Tree and Optimized Final Adder in CMOS Technology”, IEEE
transactions on Very Large Scale Integration (VLSI) systems, Vol. 3, no. 2, June 1995.
[9] Paul F.Stelling, “Design strategies for optimal hybrid final adders in parallel
multiplier ”,Journal of VLSI signal processing, vol 14,pp,321-331,1996.
[10] Sabyasachi Das and Sunil P.Khatri,"Generation of the Optimal Bit-Width
Topology of the Fast Hybrid Adder in a Parallel Multiplier", International
Conference on Integrated Circuit Design and Technology (ICICDT) May, 2007.
[11] B.Ramkumar, Harish M Kittur and P.Mahesh Kannan, “ ASIC Implementation of
Modified Faster Carry Save Adder ”, European Journal of Scientific Research, Vol. 42,
Issue 1, 2010.
[12] B.Ramkumar, Harish M Kittur, “Low Area, Low Power CSLA”, IEEE transactions
on Very Large Scale Integration (VLSI) systems.
[13] K.C. Bickerstaff, E.E. Swartzlander, M.J. Schulte, Analysis of column
compression multipliers, Proceedings of 15th IEEE Symposium on Computer
Arithmeitc,2001.
8/13/2019 Design Of Fast and low power Multiplier
http://slidepdf.com/reader/full/design-of-fast-and-low-power-multiplier 63/63
[14] W. J. Townsend, Earl E. Swartzlander and J.A. Abraham, “A comparison of
Dadda and Wallace multiplier delays”, Advanced Signal Processing
Algorithms, Architectures and Implementations XIII. Proceedings of the SPIE, vol.
5205, 2003, pages 552-560.
[15] Danysh and Swamlander Jr., "A recursive fast multiplier", Asilomar Conf. on
Signals,Systems & Computers, vol. 1, pp. 197 -201, 1998.