VLSI Arithmetic Adders & Multipliers


Prof. Vojin G. Oklobdzija

University of California

http://www.ece.ucdavis.edu/acsel

Prof. V.G. Oklobdzija VLSI Arithmetic 2

Introduction

• Digital Computer Arithmetic belongs to Computer Architecture; however, it is also an aspect of logic design.

• The objective of Computer Arithmetic is to develop appropriate algorithms that utilize the available hardware in the most efficient way.

• Ultimately, speed, power and chip area are the most often used measures, making a strong link between the algorithms and technology of implementation.


Basic Operations

• Addition

• Multiplication

• Multiply-Add

• Division

• Evaluation of Functions

• Multi-Media

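Addition of Binary Numbers

Full Adder. The full adder is the fundamental building block of most arithmetic circuits.

The sum and carry outputs are described as:

$c_{i+1} = \bar{a}_i b_i c_i + a_i \bar{b}_i c_i + a_i b_i \bar{c}_i + a_i b_i c_i = a_i b_i + a_i c_i + b_i c_i$

$s_i = \bar{a}_i \bar{b}_i c_i + \bar{a}_i b_i \bar{c}_i + a_i \bar{b}_i \bar{c}_i + a_i b_i c_i$

[Figure: full adder block with inputs a_i, b_i, C_in and outputs s_i, C_out]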


Addition of Binary Numbers

Inputs             Outputs
c_i  a_i  b_i      s_i  c_{i+1}
 0    0    0        0     0
 0    0    1        1     0        Propagate
 0    1    0        1     0        Propagate
 0    1    1        0     1        Generate
 1    0    0        1     0
 1    0    1        0     1        Propagate
 1    1    0        0     1        Propagate
 1    1    1        1     1        Generate
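The truth table can be reproduced with a few lines of Python. This is only a minimal sketch; the names `full_adder` and `truth_table` are illustrative and do not come from the slides.

```python
from itertools import product


def full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    """Return (sum, carry_out) for three one-bit inputs."""
    s = a ^ b ^ c_in                           # sum is the XOR of all inputs
    c_out = (a & b) | (a & c_in) | (b & c_in)  # carry is the majority function
    return s, c_out


def truth_table():
    """Rows (c_i, a_i, b_i, s_i, c_{i+1}) in the same order as the table."""
    rows = []
    for c_in, a, b in product((0, 1), repeat=3):
        s, c_out = full_adder(a, b, c_in)
        rows.append((c_in, a, b, s, c_out))
    return rows


for row in truth_table():
    print(*row)
```

Each row satisfies a_i + b_i + c_i = s_i + 2 c_{i+1}, which is a quick sanity check on any full-adder description.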


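Full-Adder Implementation

Full adder operation is defined by the equations:

$s_i = \bar{a}_i \bar{b}_i c_i + \bar{a}_i b_i \bar{c}_i + a_i \bar{b}_i \bar{c}_i + a_i b_i c_i = a_i \oplus b_i \oplus c_i = p_i \oplus c_i$

$c_{i+1} = a_i b_i + c_i a_i \bar{b}_i + c_i \bar{a}_i b_i = g_i + p_i c_i$

Carry-Propagate $p_i$ and Carry-Generate $g_i$:

$p_i = a_i \oplus b_i$

$g_i = a_i \cdot b_i$

A one-bit adder could be implemented as shown:

[Figure: one-bit adder with inputs a_i, b_i, c_in and outputs s_i, c_out]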


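High-Speed Addition

$s_i = p_i \oplus c_i$

$c_{i+1} = g_i + p_i c_i$

with $p_i = a_i \oplus b_i$ and $g_i = a_i \cdot b_i$.

A one-bit adder could be implemented more efficiently, because a MUX is faster:

[Figure: one-bit adder using a multiplexer (inputs 0 and 1, select s) in the carry path; inputs a_i, b_i, c_in; outputs s_i, c_out]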


The Ripple-Carry Adder

[Figure: four full adders (FA) in a chain, with inputs A0 B0 ... A3 B3, sum outputs S0 ... S3 and carries Ci,0, Co,0 (= Ci,1), Co,1, Co,2, Co,3]

Worst-case delay is linear in the number of bits:

$t_{adder} = (N-1)\,t_{carry} + t_{sum}$

$t_d = O(N)$

Goal: Make the fastest possible carry path circuit

From Rabaey
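As an illustration of the ripple-carry scheme, the following Python sketch chains the one-bit equations s_i = p_i ⊕ c_i and c_{i+1} = g_i + p_i c_i. The helper names (`to_bits`, `from_bits`, `ripple_carry_add`, `ripple_delay`) are assumptions made for this example only.

```python
def to_bits(x: int, n: int) -> list[int]:
    """Little-endian list of the n low-order bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits: list[int]) -> int:
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Add two little-endian bit lists; returns (sum_bits, carry_out)."""
    carry = c_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        p = a ^ b                  # propagate
        g = a & b                  # generate
        sum_bits.append(p ^ carry)
        carry = g | (p & carry)    # the carry ripples to the next stage
    return sum_bits, carry


def ripple_delay(n: int, t_carry: float, t_sum: float) -> float:
    """Worst-case delay t_adder = (N - 1) * t_carry + t_sum."""
    return (n - 1) * t_carry + t_sum


bits, c_out = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(bits) + (c_out << 4))
```

The loop makes the linear dependence visible: each iteration needs the carry produced by the previous one, which is why the carry path is the part worth optimizing.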


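Inversion Property

[Figure: two full adders (FA), one with true and one with complemented inputs and outputs]

$\bar{S}(A, B, C_i) = S(\bar{A}, \bar{B}, \bar{C}_i)$

$\bar{C}_o(A, B, C_i) = C_o(\bar{A}, \bar{B}, \bar{C}_i)$

From Rabaey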
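The inversion property can be checked exhaustively over the eight input combinations. The sketch below is illustrative; `full_adder` and `inversion_property_holds` are names chosen for this example.

```python
from itertools import product


def full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    """Return (sum, carry_out) for three one-bit inputs."""
    return a ^ b ^ c_in, (a & b) | (a & c_in) | (b & c_in)


def inversion_property_holds() -> bool:
    """Check that inverting all inputs inverts both outputs."""
    for a, b, c in product((0, 1), repeat=3):
        s, c_out = full_adder(a, b, c)
        s_inv, c_out_inv = full_adder(1 - a, 1 - b, 1 - c)
        if s_inv != 1 - s or c_out_inv != 1 - c_out:
            return False
    return True


print(inversion_property_holds())
```

Because the property holds for every input combination, alternate stages of a carry chain can work on inverted signals, which removes inverters from the critical path.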


Minimize Critical Path by Reducing Inverting Stages

[Figure: ripple-carry chain of four FA’ cells (inputs A0 B0 ... A3 B3, sums S0 ... S3, carries Ci,0, Co,0 ... Co,3) with alternating even and odd cells]

Exploit Inversion Property

Note: need 2 different types of cells

From Rabaey


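Ripple Carry Adder

Carry-chain of an RCA implemented using a multiplexer from the standard cell library:

[Figure: three adder stages (inputs a_i b_i, a_{i+1} b_{i+1}, a_{i+2} b_{i+2}; sums s_i, s_{i+1}, s_{i+2}) with the carries c_in, c_i, c_{i+1}, c_out passing through multiplexers; the critical path runs along the carry chain]

Oklobdzija, ISCAS’88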


Manchester Carry-Chain Realization of the Carry Path

• Simple and very popular scheme for implementation of carry signal path

[Figure: Manchester carry chain between Carry in and Carry out, built from propagate devices, generate devices connected to Vdd, and predischarge & kill devices]

Original Design: T. Kilburn, D. B. G. Edwards, D. Aspinall, "Parallel Addition in Digital Computers: A New Fast "Carry" Circuit", Proceedings of IEE, Vol. 106, pt. B, p. 464, September 1959.


Manchester Carry Chain (CMOS)

[Figure: CMOS Manchester carry chain with carry input Ci,0, propagate signals P0 ... P4, generate signals G0 ... G4 and VDD]

Kilburn, et al, IEE Proc, 1959.

• Implement P with pass-transistors

• Implement G with pull-up, kill (delete) with pull-down

• Use dynamic logic to reduce the complexity and speed up


Pass-Transistor Realization in DPL

[Figure: DPL full adder built from an XOR/XNOR stage, multiplexers and buffers producing S and CO from inputs A, B, C and their complements, together with AND/NAND and OR/NOR pass-transistor gates]


Carry-Skip Adder

MacSorley, Proc IRE 1/61

Lehman, Burla, IRE Trans on Comp, 12/61


Carry-Skip Adder

[Figure: 4-bit chain of full adders (FA) with propagate/generate signals P0 G0 ... P3 G3 and carries Ci,0, Co,0 ... Co,3, shown without and with a bypass multiplexer controlled by BP = P0 P1 P2 P3]

Idea: If (P0 and P1 and P2 and P3 = 1) then Co,3 = Ci,0, else “kill” or “generate”.

From Rabaey
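The bypass idea can be sketched in Python as follows. This is an illustrative model, not the circuit from the slide: the function names and the default sizes (16 bits, 4-bit blocks) are assumptions.

```python
def carry_skip_block(a_bits, b_bits, c_in):
    """One block: ripple inside, bypass multiplexer at the block output.

    Bit lists are little-endian. Returns (sum_bits, carry_out, bypass).
    """
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    g = [a & b for a, b in zip(a_bits, b_bits)]
    carry = c_in
    sum_bits = []
    for p_i, g_i in zip(p, g):
        sum_bits.append(p_i ^ carry)
        carry = g_i | (p_i & carry)
    bypass = int(all(p))                # BP = P0 P1 P2 P3
    c_out = c_in if bypass else carry   # multiplexer selects the skip path
    return sum_bits, c_out, bypass


def carry_skip_add(x: int, y: int, n: int = 16, k: int = 4) -> int:
    """Add two n-bit integers using k-bit carry-skip blocks."""
    a = [(x >> i) & 1 for i in range(n)]
    b = [(y >> i) & 1 for i in range(n)]
    carry = 0
    sum_bits = []
    for lo in range(0, n, k):
        block_sum, carry, _ = carry_skip_block(a[lo:lo + k], b[lo:lo + k], carry)
        sum_bits.extend(block_sum)
    return sum(bit << i for i, bit in enumerate(sum_bits)) + (carry << n)


print(carry_skip_add(12345, 54321))
```

Functionally the bypass changes nothing, since a block whose bits all propagate passes its carry-in through anyway; the benefit is purely in delay, because the multiplexer lets the carry skip the block instead of rippling through it.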


Carry-Skip Adder: N-bits, k-bits/group, r=N/k groups

[Figure: N-bit carry-skip adder built from r groups of k bits (inputs a_0 b_0 ... a_{N-1} b_{N-1}, sums S_0 ... S_{N-1}); the group propagate signals P_0 ... P_{r-1} drive the AND-OR skip gates between C_in and C_out]

Critical path, delay = 2(k-1)+(N/k-2): the carry ripples through k-1 bits of the first group, skips the r-2 = N/k-2 intermediate groups, and ripples through k-1 bits of the last group. For N=32 and k=4 this gives 2(3)+(8-2) = 12.


Carry-Skip Adder

$t_d = 2(k-1)\,t_{RCA} + \left(\frac{N}{k} - 2\right) t_{SKIP}$

[Figure: t_p versus N for the ripple adder and the bypass adder; the curves cross at N = 4..8]


Variable Block Adder (Oklobdzija, Barnes: IBM 1985)


Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

[Figure: carry chain with groups G_0 ... G_m of varying size and group propagate signals P_0 ... P_m between C_in and C_out; the carry signal path alternates between rippling inside a group and skipping over groups]

Any-point-to-any-point delay = 9 as compared to 12 for CSKA


Carry-chain block size determination for a 32-bit Variable Block Adder(Oklobdzija, Barnes: IBM 1985)


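Delay Calculation for Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

[Figure: 4-bit carry chain with propagate and generate signals P0 G0 ... P3 G3, carry input Ci,0, block-propagate (BP) bypass and carry output Co,3]

Delay model: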


Variable Block Adder(Oklobdzija, Barnes: IBM 1985)

Variable Group Length

Oklobdzija, Barnes, Arith’85

321 cNcctd



Variable Block Lengths

• No closed form solution for delay

• It is a dynamic programming problem


Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

[Figure: delay (0 to 16) versus adder size N (4 to 60 bits) for CLA, VBA and multi-level VBA]


Fan-Out Dependency


Fan-In Dependency


Delay Comparison: Variable Block Adder(Oklobdzija, Barnes: IBM 1985)



Carry-Lookahead Adder(Weinberger and Smith)

A. Weinberger and J. L. Smith, “A Logic for High-Speed Addition”,

National Bureau of Standards, Circ. 591, p.3-12, 1958.



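$c_{i+1} = a_i b_i + c_i a_i \bar{b}_i + c_i \bar{a}_i b_i = g_i + p_i c_i$

$c_{i+2} = g_{i+1} + p_{i+1} c_{i+1}$
$\quad = g_{i+1} + p_{i+1}(g_i + p_i c_i)$
$\quad = g_{i+1} + p_{i+1} g_i + p_{i+1} p_i c_i$

$c_{i+3} = g_{i+2} + p_{i+2} c_{i+2}$
$\quad = g_{i+2} + p_{i+2}(g_{i+1} + p_{i+1} g_i + p_{i+1} p_i c_i)$
$\quad = g_{i+2} + p_{i+2} g_{i+1} + p_{i+2} p_{i+1} g_i + p_{i+2} p_{i+1} p_i c_i$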


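Carry-Lookahead Adder

$G_j = g_{i+3} + p_{i+3} g_{i+2} + p_{i+3} p_{i+2} g_{i+1} + p_{i+3} p_{i+2} p_{i+1} g_i$

$P_j = p_{i+3} p_{i+2} p_{i+1} p_i$

$c_{4(j+1)} = G_j + P_j c_{4j}$

One gate delay to calculate p, g.

One to calculate P and two for G.

Three gate delays to calculate C4(j+1).

Compare that to 8 in RCA!

[Figure: 4-bit P, G group: four adder cells (inputs a_i b_i ... a_{i+3} b_{i+3}) produce g_i, p_i ... g_{i+3}, p_{i+3}; the group block produces G_j, P_j and the carries C4j+1, C4j+2, C4j+3 and C4(j+1) from the carry input]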
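The group equations can be exercised with a small Python model of one 4-bit lookahead group. It is a sketch under the usual definitions p_i = a_i ⊕ b_i and g_i = a_i b_i; the function name `cla_group` is an assumption for this example.

```python
def cla_group(a_bits, b_bits, c_in):
    """4-bit carry-lookahead group (little-endian bit lists).

    Returns (sum_bits, c4, G, P), where G and P are the group generate and
    propagate signals and c4 = G + P * c_in.
    """
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    # Every carry is written directly in terms of p, g and c_in (no rippling).
    c1 = g[0] | (p[0] & c_in)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c_in)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c_in)
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    P = p[3] & p[2] & p[1] & p[0]
    c4 = G | (P & c_in)
    sum_bits = [p[0] ^ c_in, p[1] ^ c1, p[2] ^ c2, p[3] ^ c3]
    return sum_bits, c4, G, P


sum_bits, c4, G, P = cla_group([1, 0, 1, 1], [1, 1, 0, 0], 0)   # 13 + 3
print(sum(bit << i for i, bit in enumerate(sum_bits)) + (c4 << 4))
```

Since G and P do not depend on the carry input, a second lookahead level can combine four such groups with exactly the same equations.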



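$G^*_j = G_{i+3} + P_{i+3} G_{i+2} + P_{i+3} P_{i+2} G_{i+1} + P_{i+3} P_{i+2} P_{i+1} G_i$

$P^*_j = P_{i+3} P_{i+2} P_{i+1} P_i$

$c_{4(j+1)} = G^*_k + P^*_k c_{4j}$

[Figure: second-level lookahead block combining the group signals G_j, P_j ... G_{j+3}, P_{j+3} into G*, P* and the carries C4j+1, C4j+2, C4j+3 and C4(j+1) from C4j]

Additional two gate delays

C16 will take a total of 5 vs. 32 for RCA!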


32-bit Carry Lookahead Adder

[Figure: 32-bit CLA with carries C_in, C4, C8, C12, C16, C20, C24, C28 and C_out, organized as follows]

• Individual adders generating: g_i, p_i, and sum S_i

• Carry-lookahead blocks of 4 bits generating: G_i, P_i, and C_in for the adders

• Carry-lookahead super-blocks of 4-bit blocks generating: G*_i, P*_i, and C_in for the 4-bit blocks

• Group producing final carry C_out and C16

Critical path delay = 1Δ (for g_i, p_i) + 2x2Δ (for G, P) + 3x2Δ (for C_in) + 1 XOR-Δ (for Sum) = approx. 12Δ of delay


Carry-Lookahead Adder(Weinberger and Smith: original derivation )




Carry-Lookahead Adder (Weinberger and Smith): please notice the similarity with Parallel-Prefix Adders!

Delay Optimized CLA

B. Lee, V. G. Oklobdzija

Journal of VLSI Signal Processing, Vol.3, No.4, October 1991


Delay Optimized CLA: Lee-Oklobdzija ‘91

(a.) Fixed groups and levels

(b.) variable-sized groups, fixed levels

(c.) variable-sized groups and fixed levels

(d.) variable-sized groups and levels


Two-Levels of Logic Implementation of the Carry Block


Two-Levels of Logic Implementation of the Carry-Lookahead Block


Three-Levels of Logic Implementation of the Carry Block (restricted fan-in)


Three-Levels of Logic Implementation of the Carry Lookahead (restricted fan-in)


Delay Optimized CLA: Lee-Oklobdzija ‘91

Delay: Two-level BCLA Delay: Three-level BCLA


Delay Optimized CLA: Lee-Oklobdzija ‘91

(a.) 2-level BCLA =8.5nS (b.) 3-level BCLA =8.9nS

Motorola: CLA Implementation Example

A. Naini, D. Bearden and W. Anderson, “A 4.5nS 96b CMOS Adder Design”,

Proceedings of the IEEE Custom Integrated Circuits Conference, May 3-6, 1992.


Critical path in Motorola's 64-bit CLA

Critical path: A, B - G0 - G3:0 - G15:0 - G47:0 - C48 - C60 - C63 - S63

[Figure: organization of the 64-bit CLA: PG blocks combine the bit-level signals G0 P0 ... G63 P63 into group signals (G3:0 P3:0, G7:4 P7:4, ..., G63:60 P63:60) and then into wider groups (G15:0, G31:16, G47:32, G63:48, ..., G63:0); the carry block produces the carries C0, C4, C8, ..., C60, C61, C62, C63 and C64]


Motorola's 64-bit CLA

conventional PG Block


Motorola's 64-bit CLA

Modified PG Block

Intermediate propagate signals Pi:0 are generated to speed-up C3

Ling’s Adder

Huey Ling, “High-Speed Binary Adder”

IBM Journal of Research and Development, Vol.25, No.3, 1981.


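Ling Adder

Variation of CLA:

$G_i = g_i + p_i G_{i-1}$

$S_i = p_i \oplus G_{i-1}$

$p_i = a_i \oplus b_i$

$g_i = a_i b_i$

Ling’s equations:

$H_i = g_i + t_{i-1} H_{i-1}$

$S_i = t_i \oplus H_i + g_i t_{i-1} H_{i-1}$

$t_i = a_i + b_i$

$g_i = a_i b_i$

Ling, IBM J. Res. Dev, 5/81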


Ling Adder

$G_i = g_i + p_i G_{i-1}$

$G_i = g_i + g_i G_{i-1} + p_i G_{i-1} = g_i + (g_i + p_i) G_{i-1}$

$G_i = g_i + t_i G_{i-1}$

Ling’s equation:

$H_i = g_i + t_{i-1} H_{i-1}$

Doran, Trans on Comp 9/88

Propagates information on two bits


Ling Adder

Conventional:

$G_3 = g_3 + t_3 g_2 + t_3 t_2 g_1 + t_3 t_2 t_1 g_0$

Ling:

$H_3 = g_3 + t_2 g_2 + t_2 t_1 g_1 + t_2 t_1 t_0 g_0$
$\quad = g_3 + g_2 + t_2 g_1 + t_2 t_1 g_0$
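Ling's recurrence and sum equation can be checked against ordinary addition with a short Python sketch. It assumes a carry-in of 0 (so H_{-1} = 0 and t_{-1} = 0) and uses the relation G_i = t_i H_i to recover the conventional carry; the function name `ling_add` is hypothetical.

```python
def ling_add(a_bits, b_bits):
    """Add two little-endian bit lists using Ling's pseudo-carry H_i.

    H_i = g_i + t_{i-1} H_{i-1}
    S_i = (t_i xor H_i) + g_i t_{i-1} H_{i-1}
    Carry-in is assumed to be 0. Returns (sum_bits, carry_out).
    """
    g = [a & b for a, b in zip(a_bits, b_bits)]
    t = [a | b for a, b in zip(a_bits, b_bits)]
    h_prev = 0   # H_{-1}
    t_prev = 0   # t_{-1}
    sum_bits = []
    for g_i, t_i in zip(g, t):
        carry_in = t_prev & h_prev        # conventional carry G_{i-1}
        h_i = g_i | carry_in              # Ling's recurrence
        sum_bits.append((t_i ^ h_i) | (g_i & carry_in))
        h_prev, t_prev = h_i, t_i
    return sum_bits, t_prev & h_prev      # G_{n-1} = t_{n-1} H_{n-1}


bits, c_out = ling_add([1, 1, 0, 1], [1, 0, 1, 1])   # 11 + 13
print(sum(bit << i for i, bit in enumerate(bits)) + (c_out << 4))
```

The pseudo-carry H_i drops one t factor from every product term compared with G_i, which is where the reduction in gate fan-in comes from.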


S. Naffziger, ISSCC’96






















Results: S. Naffziger, “A Subnanosecond 64-b Adder”, ISSCC ‘96

• 0.5u Technology

• Speed: 0.930 nS

• Nominal process, 80C, V=3.3V

Conditional-Sum Adder

J. Sklansky, “Conditional-Sum Addition Logic”, IRE Transactions on Electronic

Computers, EC-9, p.226-231, 1960.





Carry-Select Adder

O. J. Bedrij, “Carry-Select Adder”, IRE Transactions on Electronic Computers, June

1962, p.340-34


Carry-Select Adder

O.J. Bedrij, IBM Poughkeepsie, 1962


Carry-Select Adder

Addition under assumption of Cin=0 and Cin=1.


Carry Select Adder: combining two 32-b VBAs in select mode

Delay = VBA32 + MUX
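The select mode can be illustrated with a small Python sketch. Note the substitution: simple ripple adders stand in for the 32-b VBAs of the slide, and the names `ripple` and `carry_select_add` are assumptions made for this example.

```python
def ripple(a_bits, b_bits, c_in):
    """Reference adder used for each half (stand-in for a faster adder)."""
    carry = c_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | ((a ^ b) & carry)
    return sum_bits, carry


def carry_select_add(a_bits, b_bits, c_in=0):
    """Split the operands in two halves (little-endian bit lists).

    The upper half is added twice, once assuming Cin=0 and once assuming
    Cin=1; the carry out of the lower half selects the valid result.
    """
    half = len(a_bits) // 2
    low_sum, low_carry = ripple(a_bits[:half], b_bits[:half], c_in)
    high_if_0 = ripple(a_bits[half:], b_bits[half:], 0)
    high_if_1 = ripple(a_bits[half:], b_bits[half:], 1)
    high_sum, c_out = high_if_1 if low_carry else high_if_0   # the MUX
    return low_sum + high_sum, c_out


a = [(45 >> i) & 1 for i in range(6)]
b = [(27 >> i) & 1 for i in range(6)]
bits, c_out = carry_select_add(a, b)
print(sum(bit << i for i, bit in enumerate(bits)) + (c_out << 6))
```

In hardware the three half-width additions run in parallel, so the total delay is that of one half-width adder plus the multiplexer, at the cost of duplicating the upper half.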

Addition Under Non-equal Signal Arrival Profile Assumption

P. Stelling , V. G. Oklobdzija, "Design Strategies for Optimal Hybrid Final Adders in a Parallel Multiplier", special issue on VLSI Arithmetic, Journal of VLSI Signal Processing, Kluwer

Academic Publishers, Vol.14, No.3, December 1996


Signal Arrival Profile from the Parallel Multiplier Partial-Product Reduction Tree

Oklobdzija and Villeger, IEEE Transactions on VLSI Systems, June, 1995









Performing Multiply-Add Operation in the Multiply Time

P. Stelling, V. G. Oklobdzija, " Achieving Multiply-Accumulate Operation in the

Multiply Time", Thirteenth International Symposium on Computer Arithmetic, Pacific

Grove, California, July 5 - 9, 1997.



Final Adder: Implementation







Recurrence Solver Based Adders

Kogge and Stone, IEEE Trans on Computers, August 1973

Bilgory and Gajski, 18th DAC, 1981

Brent and Kung, IEEE Trans on Computers, March 1982


Recurrence Solver Based Adders

• 1973, Kogge and Stone published a general recurrence scheme for parallel computation

• 1979, Brent and Kung published a Tech. Report on regular layout for parallel adders

• 1980, Guibas and Vuillemin developed a layout scheme based on the recurrence equation for addition

• 1980, Ladner and Fischer published “parallel prefix computation”, Journal of the ACM

• 1981, Bilgory and Gajski published a paper on recurrence structures for automatic cell generation


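Recurrence Solver Based Adders

They are based on the recurrence equation for P, G (what is new there since Weinberger ?!!):

$G_j = g_{i+3} + p_{i+3} g_{i+2} + p_{i+3} p_{i+2} g_{i+1} + p_{i+3} p_{i+2} p_{i+1} g_i$

$P_j = p_{i+3} p_{i+2} p_{i+1} p_i$

Or:

$G_i = g_i + p_i G_{i-1}$ and $P_i = p_i P_{i-1}$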
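The recurrence can be solved in parallel by treating (g, p) pairs as operands of an associative operator, (g, p) ∘ (g', p') = (g + p g', p p'). The Python sketch below uses a Kogge-Stone style doubling schedule as one possible arrangement; it is an assumption-laden illustration, and the names `combine`, `prefix_carries` and `prefix_add` are not from the slides.

```python
def combine(hi, lo):
    """Merge the (G, P) pair of a more significant span with a less significant one."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def prefix_carries(a_bits, b_bits, c_in=0):
    """Return [c_1, ..., c_n] using log2(n) combining levels."""
    n = len(a_bits)
    gp = [(a & b, a ^ b) for a, b in zip(a_bits, b_bits)]
    dist = 1
    while dist < n:
        # Each level doubles the span covered by every (G, P) pair.
        gp = [combine(gp[i], gp[i - dist]) if i >= dist else gp[i]
              for i in range(n)]
        dist *= 2
    # gp[i] now holds (G_{i:0}, P_{i:0}); c_{i+1} = G_{i:0} + P_{i:0} c_in
    return [g | (p & c_in) for g, p in gp]


def prefix_add(x: int, y: int, n: int, c_in: int = 0) -> int:
    a = [(x >> i) & 1 for i in range(n)]
    b = [(y >> i) & 1 for i in range(n)]
    carries = [c_in] + prefix_carries(a, b, c_in)   # carries[i] = c_i
    sum_bits = [a[i] ^ b[i] ^ carries[i] for i in range(n)]
    return sum(bit << i for i, bit in enumerate(sum_bits)) + (carries[n] << n)


print(prefix_add(200, 100, 8))
```

Other schedules (for example Brent-Kung or Ladner-Fischer) apply the same operator with a different trade-off between depth, fan-out and wiring.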


Recurrence Solver Based Adders

[Figure: 16-bit recurrence-solver carry network: a first row generates the pairs (g1, p1) ... (g16, p16), and the following rows generate the carries C1 ... C16]


Carry-Lookahead Adder (Weinberger and Smith)

Just to remind you! Please notice the similarity with Parallel-Prefix Adders!

Multiplexer Based Adder

Farooqui and Oklobdzija, 1999 Int’l Sym. on VLSI Technology, Taipei, Taiwan, June 8-10, 1999


Multiplexer Based Adder

• Based on the realization that MUX circuit is faster than a logic gate due to its transmission gate implementation.

• Based on Carry-Lookahead method (W-S), or recurrence solver.


Multiplexer Based Adder

A. A. Farooqui, V. G. Oklobdzija, F. Chechrazi, 1999 Int’l Sym. on VLSI Technology, Taipei, Taiwan, June 8-10, 1999.

[Figure: 4-bit MUX-based group carry generation: multiplexers combine a0 b0 ... a3 b3 into g01, g23, p23 and then into g03, p03]

[Figure: 16-bit MUX-based adder: four 4-bit MUX-based group carry generators (G0-3 P0-3, G4-7 P4-7, G8-11 P8-11, G12-15 P12-15), MUX and NOR / MUX and NAND stages forming G0-7, G0-11, G0-15 and the carries C3, C7, C11, C15, and pairs of 4-bit sum blocks (Cin0, Cin1) selected by multiplexers to give Sum 0-3, Sum 4-7, Sum 8-11 and Sum 12-15]

[Figure: MUX-based 4-bit sum generation from a0 b0 ... a2 b2, the p and g signals and Cin, producing Sum0 ... Sum3]

• Results in a very fast structure

• 7-MUX delays for a 64-b adder

• Delay using standard cell 0.25u, 2.5V, 25°C:

Adder Size (bits)    Delay (pS)
 8                   625
16                   665
32                   710
64                   903


DEC "Alpha" 21064 Adder

• Combination:

– 8-bit tapered pre-discharged Manchester Carry Chains, with Cin = 0 and Cin = 1

– 32-bit LSB Carry Lookahead Adder

– 32-bit MSB Conditional-Sum Adder

– Carry-Select on most significant 32-bits

– Latches in the middle: pipelined addition


DEC "Alpha" 21064 Adder

[Figure: adder organization by byte (Byte 7 down to Byte 0): input operands feed PGK cells and Manchester carry chains; switches, dual switches, latches, a look-ahead block and multiplexers (1/0 inputs) select the carries from Cin; "Latch & XOR" stages produce the eight result bytes]


DEC "Alpha" 21064 Adder: Results

• The first 200MHz processor

• Built using 0.75u technology

• V=3.3V, 30W

• Pipelined (two-latches) allowing 5nS throughput and 10nS latency



Conclusion: VLSI Implementation of Addition

• Currently, implementation parameters are not reflected in algorithms used for development

• Layout and wire delay effects are largely neglected, and this is becoming intolerable in the next generation of technology

• Transistor sizing has a large effect, which can outweigh the choice of algorithm

• There is a great disconnect between algorithm and implementation

• New rules and measures of goodness are needed

Multiplication

Parallel Multiplier Implementation


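Multiplication Algorithm:

$P = XY = X \times \sum_{i=0}^{n-1} y_i r^i = \sum_{i=0}^{n-1} X \times y_i r^i$

$p^{(0)} = 0$ initially

$p^{(j+1)} = \frac{1}{r}\left(p^{(j)} + r^n X y_j\right)$ for j=0,....,n-1

$p^{(n)} = XY$ after n steps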
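The recurrence translates directly into a shift-and-add loop. The sketch below assumes non-negative integers with Y given as n radix-r digits; the function name `shift_add_multiply` is illustrative only.

```python
def shift_add_multiply(x: int, y: int, n: int, r: int = 2) -> int:
    """Compute X*Y with p(j+1) = (p(j) + r**n * X * y_j) / r.

    y must fit in n radix-r digits; y_j is digit j of y (j = 0 is the least
    significant digit).
    """
    digits = [(y // r**j) % r for j in range(n)]
    p = 0                                  # p(0) = 0
    for y_j in digits:
        # The numerator is always a multiple of r, so the division is exact.
        p = (p + r**n * x * y_j) // r
    return p                               # p(n) = X * Y


print(hex(shift_add_multiply(0xB54, 0xB1B, 12)))
```

Adding X scaled by r^n and then dividing by r is the arithmetic view of the familiar hardware loop: add the multiplicand into the upper half of the product register, then shift the register right by one digit.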


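Parallel Multipliers

[Figure: partial product reduction shown in Step 0 through Step 4]


4:2 Compressor

[Figure: 4-2 compressor block with inputs I1, I2, I3, I4 and Ci, and outputs C, S and Co]


Re-designed 4:2 Compressor with 3 XOR Delay

[Figure: 4:2 compressor with a multiplexer (inputs 0 and 1); inputs I1, I2, I3, I4 and Cin; outputs S, C and Cout]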

A Method for Generation of Fast Parallel Multipliers

by

Vojin G. Oklobdzija, David Villeger, Simon S. Liu

Electrical and Computer Engineering, University of California, Davis


Carry Propagate Adder

Vertical Slices

Horizontal Propagation

Carry and Sum Connection to the Final Adder

Partial Product Matrix Divided into Vertical Compressor Slices


Idea !!!!!


[Figure: full adder with inputs A, B, Cin and outputs Sum, Carry, with the fast input and the fast output marked]

Signal Delays in a Full Adder (3,2) Counter

Three-Dimensional optimization Method: TDM

(Oklobdzija, Villeger, Liu, 1996)

[Figure: two full adders (inputs A, B, Cin; outputs Sum, Carry) connected as a 4:2 compressor with inputs I1, I2, I3, I4 and Cin and output Cout; 3 XOR delays]


[Figure: two full adders (inputs A, B, Cin; outputs Sum, Carry) with inputs In 1, In 2, In 3, In 4 and Carry In, and outputs Sum, Carry and Carry-Out]

Modified 4:2 Compressor with Optimal Interconnections of two Full Adders

3 XOR gates
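At the logic level, a 4:2 compressor can be modeled as two cascaded full adders. The Python sketch below shows only this plain composition, which is an assumption for illustration; it does not model the delay-optimized wiring of the figure, and the names are hypothetical.

```python
from itertools import product


def full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    """Return (sum, carry) of a (3,2) counter."""
    return a ^ b ^ c_in, (a & b) | (a & c_in) | (b & c_in)


def compressor_4_2(i1: int, i2: int, i3: int, i4: int, c_in: int):
    """Return (s, c, c_out) such that i1+i2+i3+i4+c_in == s + 2*(c + c_out)."""
    s1, c_out = full_adder(i1, i2, i3)   # c_out does not depend on c_in
    s, c = full_adder(s1, i4, c_in)
    return s, c, c_out


ok = all(
    sum(bits) == s + 2 * (c + c_out)
    for bits in product((0, 1), repeat=5)
    for s, c, c_out in [compressor_4_2(*bits)]
)
print(ok)
```

Because Carry-Out is independent of Carry In, a row of such compressors has no horizontal carry ripple, which is what makes them suitable for the vertical compressor slices.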


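Example of a 12 X 12 Multiplication

(Partial Product for X*Y = B54 * B1B)

Partial product rows, from multiplier bit y0 (LSB) to y11 (MSB):

y0  = 1:   1 0 1 1 0 1 0 1 0 1 0 0
y1  = 1:   1 0 1 1 0 1 0 1 0 1 0 0
y2  = 0:   0 0 0 0 0 0 0 0 0 0 0 0
y3  = 1:   1 0 1 1 0 1 0 1 0 1 0 0
y4  = 1:   1 0 1 1 0 1 0 1 0 1 0 0
y5  = 0:   0 0 0 0 0 0 0 0 0 0 0 0
y6  = 0:   0 0 0 0 0 0 0 0 0 0 0 0
y7  = 0:   0 0 0 0 0 0 0 0 0 0 0 0
y8  = 1:   1 0 1 1 0 1 0 1 0 1 0 0
y9  = 1:   1 0 1 1 0 1 0 1 0 1 0 0
y10 = 0:   0 0 0 0 0 0 0 0 0 0 0 0
y11 = 1:   1 0 1 1 0 1 0 1 0 1 0 0

Vertical Compressor Slice - VCS

[Figure: 3-dimensional view of partial product reduction: the bits of one vertical compressor slice (0 0 1 1 0 1 0) are reduced by full adders (FA) over time, down to the final adder]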


Method

[Figure: two full adders (inputs a, b, cin; outputs s, c) annotated with signal arrival times, comparing the worst-case arrangement with the TDM arrangement]


Example of an Optimized Interconnection

[Figure: four full adders (inputs a, b, cin; outputs s, c) at bit positions (n-1) and (n), annotated 2 xor, 0 xor, 1 xor, 3 xor and 3 xor]

Example of a not Optimized Interconnection

[Figure: four full adders (inputs a, b, cin; outputs s, c) at bit positions (n-1) and (n), annotated 2 xor, 0 xor, 1 xor, 4 xor and 3 xor]

Example of Delay Optimization


The 9th Vertical Compressor Slice of a Multiplier

[Figure: one half adder (inputs A, B) and six full adders (inputs A, B, Cin) with outputs C and S, annotated with signal arrival times between 0 and 5 XOR delays]

Computer Tools


Algorithm for Automatic Generation of Partial Product Array.

Initialize:

Form 2N-1 lists Li ( i = 0,...,2N-2 ) each consisting of pi elements where:

pi = i+1 for i ≤ N-1 and pi = 2N-1-i for i ≥ N

An element of a list Li ( j = 0,...,pi-1 ) is a pair: <nj, δj>i where:

nj : is a unique node identifying name

δj : is a delay associated with that node, representing the delay of a signal arriving at the node nj with respect to some reference point.

For i = 0, 1 and 2N-2: connect nodes from the corresponding lists Li directly to the CPA.

For i=2 to i=2N-3 {Partial Product Array Generation}
Begin For
    If length of Li is even Then
    Begin If
        sort the elements of Li in ascending order by the values of delay δj
        connect an HA to the first 2 elements of Li starting with the slowest input
        Ds = max {δA + DA-s, δB + DB-s}
        Dc = max {δA + DA-c, δB + DB-c}
        remove 2 elements from Li
        insert the pair <Ds,NetName> into Li
        insert the pair <Dc,NetName> into Li+1
        decrement the length of Li
        increment the length of Li+1
    End If;

    While length of Li > 3
    Begin While
        sort the elements of Li in ascending order by the values of delay δj
        connect an FA to the first 3 elements of Li starting with the slowest input of the FA:
        Ds = max {δA + DA-s, δB + DB-s, δCi + DCi-s}
        Dc = max {δA + DA-c, δB + DB-c, δCi + DCi-c}
        remove 3 elements from Li
        insert the pair <Ds,NetName> into Li
        insert the pair <Dc,NetName> into Li+1
        subtract 2 from the length of Li
        increment the length of Li+1
    End While;

    sort the elements of Li
    connect an FA to the last 3 nodes of Li
    connect the S and C to the bit i and i+1 of the CPA
End For;
End Method;


Delays

Delay(S) = MAX {Delay(A) + DA-S, Delay(B) + DB-S, Delay(Cin) + DCin-S}

Delay(C) = MAX {Delay(A) + DA-C, Delay(B) + DB-C, Delay(Cin) + DCin-C}

In our case the delays in a FA are:

FA: A→S = B→S = 2 XOR delays

FA: Cin→S = A→C = B→C = Cin→C = 1 XOR delay.

In a HA:

HA: A→S = B→S = 1 XOR delay, while HA: A→C = B→C = 0.5 XOR delay.
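The column-by-column procedure and the delay model can be combined into a small delay-only simulation. The Python sketch below is a simplified reading of the algorithm, not the authors' tool: it tracks arrival times only (no net names), assumes all partial-product bits arrive at time 0, and passes any column left with fewer than three signals straight to the CPA. All names are hypothetical.

```python
# Pin-to-pin delays in XOR units, taken from the delay model above.
FA_TO_SUM = (2.0, 2.0, 1.0)     # A->S, B->S, Cin->S
FA_TO_CARRY = (1.0, 1.0, 1.0)   # A->C, B->C, Cin->C
HA_TO_SUM = (1.0, 1.0)          # A->S, B->S
HA_TO_CARRY = (0.5, 0.5)        # A->C, B->C


def cell_delays(arrivals, to_sum, to_carry):
    """Output delays of one cell; arrivals are sorted ascending so that the
    earliest signal drives the slowest pin."""
    d_s = max(d + t for d, t in zip(arrivals, to_sum))
    d_c = max(d + t for d, t in zip(arrivals, to_carry))
    return d_s, d_c


def tdm_arrival_profile(n: int):
    """Arrival times of the signals entering each bit of the final adder (CPA)."""
    width = 2 * n - 1
    cols = [[0.0] * (i + 1 if i <= n - 1 else width - i) for i in range(width)]
    cpa = [[] for _ in range(width + 1)]
    for i in range(2, width - 1):          # columns 2 .. 2N-3
        col = cols[i]
        if len(col) % 2 == 0:              # even length: start with a half adder
            col.sort()
            d_s, d_c = cell_delays(col[:2], HA_TO_SUM, HA_TO_CARRY)
            del col[:2]
            col.append(d_s)
            cols[i + 1].append(d_c)
        while len(col) > 3:                # full adders on the 3 earliest signals
            col.sort()
            d_s, d_c = cell_delays(col[:3], FA_TO_SUM, FA_TO_CARRY)
            del col[:3]
            col.append(d_s)
            cols[i + 1].append(d_c)
        if len(col) == 3:                  # last FA feeds the CPA directly
            col.sort()
            d_s, d_c = cell_delays(col, FA_TO_SUM, FA_TO_CARRY)
            col.clear()
            cpa[i].append(d_s)
            cpa[i + 1].append(d_c)
    for i, col in enumerate(cols):         # remaining signals go straight to the CPA
        cpa[i].extend(col)
    return cpa


profile = tdm_arrival_profile(4)
print(max(max(bit) for bit in profile if bit))
```

The per-bit arrival times returned here form the non-uniform input profile of the final adder, which is the starting point for sizing that adder.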


[Figure: Equivalent XOR Delays: delay (XOR levels, 0 to 24) versus multiplier width (0 to 100) for TDM, Fadavi-Ardekani, 9:2, 4:2 and (3,2) schemes]


Comparison between TDM and other representative schemes, in XOR levels.

Multiplier     Wallace     4:2 Tree    Fadavi-          TDM
Word-length    Tree [7]    [11]        Ardekani [16]
  3             2           2           2                2
  4             4           3           3                3
  6             6           6           5                5
  8             8           6           7                5
  9             8           8           7                6
 11            10           9           8                7
 12            10           9           8                7
 16            12           9          10                8
 19            12          12          11                9
 24            14          12          12               10
 32            16          12          13               11
 42            16          15          14               12
 53            18          15          15               13
 64            20          15          16               14
 95            20          18          17               15


Critical Path Delay [CMOS: Leff = 1 µm, T = 25°C, Vcc = 5V]

N = 24-bits    4:2 Design    9:2 Design    Fadavi-Ardekani    TDM Design
Delay [nS]     14.0          13.0          11.7               10.5


Competing Approaches


Organization of Hitachi's DPL multiplier

[Figure: 54 bit x 54 bit multiplier: Booth's encoder, Wallace's tree built from 4-2 compressors, and a 108-b CLA adder with Conditional Carry Selection (CCS) producing the 108 bit result]


Hitachi's 4:2 compressor structure

[Figure: 4:2 compressor built from multiplexers (MUX) with inputs I1, I2, I3, I4 and Ci and outputs S, C and Co; 3 gates]


DPL multiplexer circuit

[Figure: DPL multiplexer (MUX) with data inputs D0, D1, select S (true and complemented) and outputs OUT (true and complemented)]

RECOMMENDATIONS


Conclusion

1. The key to improving multiplier speed was in optimizing interconnections, not the compressor circuit (as it was believed for so long).

2. With the increase in wire delay it is important to make a connection between layout topology and algorithm for optimal interconnection of the PPRT.

3. Using one of the “fast adders” (CLA) as a final adder was actually counterproductive. A simple final adder, optimized for the signal arrival profile, yields better results with less hardware.

4. It is possible to further optimize the PPRT and FA so that Multiply-Add operation (fused) can be performed in multiply time.

5. For larger multipliers / adders (as used in cryptography) the optimization procedures described yield even better results.

See: http://www.ece.ucdavis.edu/acsel/Publications.html


Read This !

1. E. Swartzlander, "Computer Arithmetic". Vol. 1&2, IEEE Computer Society Press, 1990.

2. K. Hwang, "Computer Arithmetic : Principles, Architecture and Design", John Wiley and Sons, 1979.

3. M. Ercegovac, “Digital Systems and Hardware/Firmware Algorithms”, Chapter 12: Arithmetic Algorithms and Processors, John Wiley & Sons, 1985.

4. A. Chandrakasan, W. Bowhill, F Fox, Editors, "Design of High Performance Microprocessors Circuits", IEEE Press, July 2000.

5. V. G. Oklobdzija, “High-Performance System Design: Circuits and Logic”, IEEE Press, July 1999.

Also: http://www.ece.ucdavis.edu/acsel/Publications.html


THE

END
