Faster functional modules: lessons taught from FP-ADDERs

Guy Even

Electrical Engineering Dept.

Tel-Aviv Univ.

Silicon Value Seminar (April 29, 2002)

Outline

• FP-Adder: an example of a complicated module

– brief overview

– focus on two sub-blocks

• Counting leading zeros – priority encoders

various design methods:

• divide & conquer

• parallel prefix computation

• redundant addition

• Adders:

– fast adders

– compound adders

Background

Faster clock rates require faster modules.

Example: Floating-Point Adders

• early designs: 50-60 logic levels.

• 15-20 gate levels per cycle ⇒ 3-4 cycles!

• new designs: 25 gate levels ⇒ 2 cycles.

How? better algorithms and faster sub-blocks…

Floating-Point Add

• Algorithm: Why 50-60 logic levels?

• Sub-Modules: List of sub-blocks.

floating-point number: S-sign, E-exponent, F-mantissa

$(-1)^{S} \cdot 2^{E} \cdot F$

floating-point addition:

input: (Sa,Ea,Fa) & (Sb,Eb,Fb)

output: (S,E,F) such that:

$(-1)^{S} \cdot 2^{E} \cdot F = (-1)^{Sa} \cdot 2^{Ea} \cdot Fa + (-1)^{Sb} \cdot 2^{Eb} \cdot Fb$

FP-Add: naïve algorithm

Pre-process:

– SWAP operands

– Align mantissa of smaller operand (shift right)

– Compute sticky-bit (OR of bits shifted outside)

Add/Sub:

– Add/Sub mantissas

– Convert sum to sign & mag (abs(negative sum))

Normalize sum (shift left)

Round:

– rounding decision

– INC according to rounding decision

RESULT

focus: normalization shift

Problem:

– LZ = number of leading zeros

– shift left by LZ positions

Use a priority encoder!

two types of priority encoders:

– unary

– binary

unary example:

X[1:4]=0010

A[1:4]=0011

binary example:

X[1:4]=0010

Y[2:0]=010
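
The two encoder types can be pinned down with a small behavioural model. The Python sketch below is not from the slides: the function names `unary_penc` and `binary_penc` are illustrative, bit strings are written with X[1] first as in the examples above, and n is assumed to be a power of two.

```python
def unary_penc(x: str) -> str:
    """A[i] = OR(X[1..i]): zeros up to the leading one, ones from there on."""
    seen = False
    out = []
    for bit in x:
        seen = seen or bit == "1"
        out.append("1" if seen else "0")
    return "".join(out)


def binary_penc(x: str) -> str:
    """Number of leading zeros of x as a (k+1)-bit binary string, n = 2**k."""
    n = len(x)
    k = n.bit_length() - 1          # assumes n is a power of two
    lz = n if "1" not in x else x.index("1")
    return format(lz, f"0{k + 1}b")


print(unary_penc("0010"))   # 0011
print(binary_penc("0010"))  # 010
```

The extra output bit Y[k] is needed only for the all-zero input, where the count equals n.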

Unary PENC

– input: X[1:n]

– Output: A[1:n]

– functionality:

$A[i] = \begin{cases} 1 & \text{if } \exists j \le i : X[j] = 1 \\ 0 & \text{otherwise.} \end{cases}$

Simpler: A[i] = OR(X[1], X[2], …, X[i])

Implementation: what is the best design?

Delay = Θ(log n) & Cost = Θ(n).

Unary PENC – divide & conquer

delay: O(log n), which is optimal; still O(log n) even if fan-out is considered

cost: O(n log n), which is not optimal

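[Figure: U-PENC(n/2) blocks on X[1:n/2] and X[1+n/2:n]. An OR-tree(n/2) computes the OR of X[1:n/2] (1 bit), and a row of n/2 OR gates, OR(n/2), combines that bit with each output of the right block. Outputs: A[1:n/2] and A[1+n/2:n].]

– linear fan-out, logarithmic delay

– share OR-tree: slight reduction of cost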
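
The divide & conquer recursion can be written down directly. This is a minimal Python sketch, not code from the slides; `u_penc_dc` is an illustrative name, the input is a non-empty list of 0/1 values with X[1] first, and taking the OR of the left half from the last output of the left block is one possible reading of "share OR-tree".

```python
def u_penc_dc(x):
    """Unary priority encoder by divide & conquer (len(x) >= 1)."""
    n = len(x)
    if n == 1:
        return list(x)
    a_left = u_penc_dc(x[: n // 2])
    a_right = u_penc_dc(x[n // 2 :])
    # OR of the whole left half. It equals the last output of the left
    # block, so no separate OR-tree is modelled here (an assumption).
    left_or = a_left[-1]
    # One OR gate per output of the right half; left_or drives all of
    # them, which is the linear fan-out noted on the slide.
    return a_left + [left_or | b for b in a_right]


print(u_penc_dc([0, 0, 1, 0]))  # [0, 0, 1, 1]
```

Each level of the recursion adds n/2 OR gates, which is where the O(n log n) cost comes from.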

Unary PENC - improve

Parallel Prefix Computation (PPC) [FL,BK]!

A[1] = X[1]

A[2] = OR(X[1], X[2])

A[3] = OR(X[1], X[2], X[3])

…

A[n] = OR(X[1], X[2], …, X[n])

Unary PENC = PPC(OR)

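[Figure: recursive PPC(OR) circuit. Adjacent input pairs (X[1],X[2]), (X[3],X[4]), …, (X[n-1],X[n]) are ORed and fed to a U-PENC(n/2), whose outputs are the even-indexed A[2], A[4], …, A[n]. A second row of OR gates produces the odd-indexed A[3], …, A[n-1], and A[1] = X[1].]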

delay = O(log n)

cost = O(n)
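
A software model of this recursion may help in checking the cost and delay claims. The sketch below is an illustration under stated assumptions, not the slides' circuit: `ppc` is a hypothetical name, lists are 0-based (so `x[0]` plays the role of X[1]), the length is assumed to be a power of two, and `op` is any associative two-input function.

```python
from operator import add, or_


def ppc(x, op):
    """Inclusive prefix computation: out[i] = x[0] op x[1] op ... op x[i]."""
    n = len(x)
    if n == 1:
        return list(x)
    # Row 1: combine adjacent pairs (n/2 gates).
    pairs = [op(x[2 * i], x[2 * i + 1]) for i in range(n // 2)]
    # Recursive PPC(n/2) yields the prefixes ending at odd 0-based
    # positions, i.e. A[2], A[4], ..., A[n] in 1-based indexing.
    half = ppc(pairs, op)
    out = [x[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(half[i // 2])
        else:
            # Row 2: one more gate for each remaining output.
            out.append(op(half[i // 2 - 1], x[i]))
    return out


print(ppc([0, 0, 1, 0], or_))   # [0, 0, 1, 1]
print(ppc([1, 2, 3, 4], add))   # [1, 3, 6, 10]
```

The gate count satisfies C(n) = C(n/2) + n - 1, which stays below 2n, and every level of recursion adds two gate rows, giving a depth of about 2 log n.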

PPC - properties

Fan-out:

Logarithmic fan-out can be decreased to constant (cost still O(n)).

Layout:

O(n log n) area.

Same design as “Brent-Kung” adder.

Applicable for every associative operator.

Binary PENC

– input: X[1:n] (n=2^k)

– Output: Y[k:0]

– functionality:

$\sum_{i=0}^{k} Y[i] \cdot 2^{i} = \text{number of leading zeros in } X[1:n]$

Relation to Unary PENC:

$\sum_{i=0}^{k} Y[i] \cdot 2^{i} = \sum_{j=1}^{n} (1 - A[j])$

Implementation: what is the best design?

Delay = Θ(log n) & Cost = Θ(n).

Binary PENC – simple & optimal

[Figure: X[1:n] → PPC (OR) → A[1:n] → diff(n) → encoder(n) → Y[k:0]]

A[i] = OR(X[1], X[2], …, X[i])

delay(diff(n)) = constant

delay(encoder(n)) = O(log n)

cost(diff(n)) = O(n)

cost(encoder(n)) = O(n)
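
The slides name diff(n) and encoder(n) without defining them. One plausible reading, assumed in the Python sketch below, is that diff marks the single 0-to-1 transition in the prefix-OR vector (padded with a 0 on the left and a 1 on the right), and that the encoder turns that one-hot vector into a binary number. The function name and the padding trick are illustrative choices, not taken from the talk.

```python
def binary_penc_ppc(x):
    """Leading-zero count of x (list of 0/1, X[1] first), as bits Y[k:0]."""
    n = len(x)
    k = n.bit_length() - 1              # assumes n == 2**k
    # PPC(OR), written as a plain running OR for brevity.
    a = []
    acc = 0
    for bit in x:
        acc |= bit
        a.append(acc)
    # diff: pad with A[0] = 0 and A[n+1] = 1 so that exactly one D[i]
    # is 1, at i = number of leading zeros (0..n). Constant delay.
    padded = [0] + a + [1]
    d = [padded[i + 1] & (1 - padded[i]) for i in range(n + 1)]
    # encoder: Y[b] is the OR of all D[i] whose index i has bit b set.
    y = [0] * (k + 1)
    for i, di in enumerate(d):
        for b in range(k + 1):
            if (i >> b) & 1:
                y[b] |= di
    return y[::-1]                      # most significant bit first


print(binary_penc_ppc([0, 0, 1, 0]))  # [0, 1, 0]
```

In hardware each Y[b] would be an OR-tree over about half of the D[i], which matches the O(log n) encoder delay quoted above.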

Binary PENC – with adder tree

[Figure: X[1:n] → PPC (OR) → A[1:n] → ADD-tree(n) → Y[k:0]]

A[i] = OR(X[1], X[2], …, X[i])

$\sum_{i=0}^{k} Y[i] \cdot 2^{i} = \sum_{j=1}^{n} (1 - A[j])$

problem:

adder(k) in tree ⇒ O(log k) delay per adder

total delay is O(log n · log k).

Redundant addition

Partial compression or (3:2)-addition:

[Figure: three 4-bit operands a3 a2 a1 a0, b3 b2 b1 b0 and c3 c2 c1 c0 are reduced to two numbers x3 x2 x1 x0 and y3 y2 y1 y0 with the same sum.]

add columns in parallel using Full-Adders

delay is constant!

Tree structure enables (n:2)-addition with O(log n) delay.

(n:2)-addition used in fast multipliers
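
A word-level model shows why (3:2)-addition has constant delay: every column is handled by its own full-adder and no carry travels between columns. The Python sketch below uses integers as bit vectors; the name `compress_3_to_2` is illustrative.

```python
def compress_3_to_2(a: int, b: int, c: int):
    """One row of full-adders: three operands in, two out, same total."""
    x = a ^ b ^ c                               # sum bit of every column
    y = ((a & b) | (a & c) | (b & c)) << 1      # carry bit, weight doubled
    return x, y


a, b, c = 0b1011, 0b0110, 0b1101
x, y = compress_3_to_2(a, b, c)
assert x + y == a + b + c
```

The final `x + y` is the only step that needs a carry-propagate adder.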

Binary PENC – O(log n) delay

Tree of Full-Adders:

delay of each full-adder is constant

depth is O(log n)

output is carry-save number

[Figure: X[1:n] → PPC (OR) → A[1:n] → FA-tree(n) → two numbers of width [k:0] → 2:1-Adder → Y[k:0]]

A[i] = OR(X[1], X[2], …, X[i])

$\sum_{i=0}^{k} Y[i] \cdot 2^{i} = \sum_{j=1}^{n} (1 - A[j])$
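
The full-adder tree can be modelled at word level by repeatedly compressing three operands into two until only a carry-save pair remains. This Python sketch is a simplified illustration, not the slides' circuit: it treats each value 1 - A[j] as a small integer, and the names are hypothetical.

```python
def compress_3_to_2(a, b, c):
    """Bitwise full-adders: the two results sum to a + b + c."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def leading_zeros_fa_tree(a_bits):
    """Sum the n one-bit values (1 - A[j]) with a tree of 3:2 compressors.

    Returns (count, levels), where levels is the number of compressor rows.
    """
    operands = [1 - a for a in a_bits]
    levels = 0
    while len(operands) > 2:
        nxt = []
        # Each full group of three becomes two operands in constant delay.
        for i in range(0, len(operands) - 2, 3):
            nxt.extend(compress_3_to_2(*operands[i:i + 3]))
        # Leftover operands (fewer than three) pass through unchanged.
        nxt.extend(operands[len(operands) - len(operands) % 3:])
        operands = nxt
        levels += 1
    # Carry-save pair -> binary number: the single carry-propagate addition.
    return sum(operands), levels


count, levels = leading_zeros_fa_tree([0, 0, 1, 1])
print(count)  # 2
```

Each row shrinks the operand count by roughly a factor of 2/3, so the number of rows grows logarithmically in n.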

Binary PENC – divide & conquer

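X_L = X[1:n/2], X_R = X[n/2+1:n]

$Y_L = \text{Bin-PENC}(X_L)$

$Y_R = \text{Bin-PENC}(X_R)$

$Y = \begin{cases} Y_L & \text{if } X_L \ne 00\cdots0 \\ n/2 + Y_R & \text{if } X_L = 00\cdots0 \end{cases}$

X_L = 00…0 ⇔ Y_L[k-1] = 1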
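
The recurrence translates almost line for line into a recursive function. The Python sketch below is an illustration only: `bin_penc_dc` is a hypothetical name, the input is a list of 0/1 values with X[1] first, and its length is assumed to be a power of two.

```python
def bin_penc_dc(x):
    """Leading-zero count of x by divide & conquer."""
    n = len(x)
    if n == 1:
        return 1 - x[0]
    y_left = bin_penc_dc(x[: n // 2])
    y_right = bin_penc_dc(x[n // 2 :])
    # X_L is all zeros exactly when Y_L == n/2, i.e. when the top bit
    # Y_L[k-1] is set; in hardware that bit drives the multiplexer.
    left_all_zero = y_left == n // 2
    return n // 2 + y_right if left_all_zero else y_left


print(bin_penc_dc([0, 0, 1, 0]))  # 2
```

Since Y_R is below n/2 unless X_R is also all zeros, adding n/2 = 2^(k-1) never needs a long carry chain, which is why a row of half-adders is enough.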

Binary PENC – divide & conquer

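[Figure: two Bin-PENC(n/2) blocks, on X[1:n/2] and X[1+n/2:n], produce YL and YR (k bits each). A half-adder stage adds 2^(k-1) to YR, and a MUX(k) controlled by YL[k-1] selects between YL and YR + 2^(k-1) to produce Y[k:0].]

+2^(k-1) (Half Adder): delay=constant, cost=O(log n)

MUX(k): delay=constant, cost=O(log n)

fan-out=k incurs O(log log n) delay

initial analysis:

delay = O(log n)

cost = O(n)

bottom line:

delay = O(log n · log log n)

cost = O(n)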

PENC – quick summary

design      method           cost                    delay
U-PENC      div & conquer    n log n                 log n
U-PENC      PPC              n (area = n log n)      log n
Bin-PENC    PPC+encoder      n                       log n
Bin-PENC    PPC+Add_tree     n                       log n
Bin-PENC    Div & Conquer    n                       log n · log log n

PENC - further issues

back to FP-Adder:

can we estimate LZ before subtracting?

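[Figure: example bit patterns 100…00, 111…10 and 000…01, illustrating a subtraction whose result has many leading zeros.]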

must pre-process to avoid “catastrophic cancellation”!

method: partial compression (signed half-adders).

focus – adder

Avoid INC after rounding decision by pre-computing increment.

[Figure: k-bit operands a and b enter a Compound Adder that outputs both a+b and a+b+1; a MUX controlled by the rounding decision selects the RESULT.]

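PPC Adder [FL,BK]

• computes carry bits C[n:1]

• sum bits satisfy: S[i]=XOR(A[i],B[i],C[i]).

• computation of carry bits C[n:1].

Define: σ[i] ∈ {k,p,g},

$\sigma[i] = \begin{cases} k & \text{if } A[i]+B[i] = 0 \\ p & \text{if } A[i]+B[i] = 1 \\ g & \text{if } A[i]+B[i] = 2 \end{cases}$

claim: $C[i+1] = 1 \iff \exists j \le i : \sigma[i:j] = p \cdots p\,g$

example:

A[3:0]=0100

B[3:0]=1110

σ[3:0]=pgpk

C[4:1]=1100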

PPC adder (cont.)

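how to compute the event $\exists j \le i : \sigma[i:j] = p \cdots p\,g$ (with σ[i] ∈ {k,p,g} the symbol of bit position i)?

define an operator ∗ : {k,p,g} × {k,p,g} → {k,p,g} as follows:

g ∗ x = g

p ∗ x = x

k ∗ x = k

claim: ∗ is associative.

definition: π[i] = σ[i] ∗ … ∗ σ[0].

claim: C[i+1] = 1 ⇔ π[i] = g

compute π[i] using PPC with ∗-gates!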
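
The carry computation can be checked with a short model. The Python sketch below is an illustration, not the slides' circuit: it takes p ∗ x = x (a propagate position passes on whatever arrives from below), uses `itertools.accumulate` as a serial stand-in for the parallel prefix network, and stores bits least significant first. The names `star` and `ppc_add` are hypothetical.

```python
from itertools import accumulate


def star(hi, lo):
    """Combine symbols; `hi` is the more significant one."""
    return lo if hi == "p" else hi      # g*x = g, k*x = k, p*x = x


def ppc_add(a_bits, b_bits):
    """Add two equal-length bit lists given least significant bit first."""
    sigma = ["kpg"[a + b] for a, b in zip(a_bits, b_bits)]
    # pi[i] = sigma[i] * ... * sigma[0]; any evaluation order gives the
    # same result because * is associative.
    pi = list(accumulate(sigma, lambda lo, hi: star(hi, lo)))
    carries = [0] + [1 if s == "g" else 0 for s in pi]   # C[0..n], C[0] = 0
    total = [a ^ b ^ c for a, b, c in zip(a_bits, b_bits, carries)]
    return total + [carries[-1]]        # n sum bits plus the carry-out


# A[3:0] = 0100 and B[3:0] = 1110, written least significant bit first.
print(ppc_add([0, 0, 1, 0], [0, 1, 1, 1]))  # [0, 1, 0, 0, 1], i.e. 18
```

For this input the prefix symbols are k, k, g, g from bit 0 upward, so C[4:1] = 1100.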

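Compound Adder [T]

how to compute a+b & a+b+1?

– use 2 separate adders

– understand PPC adder

a+b+1 = (a+0.5)+(b+0.5)

[Figure: A[n-1] … A[1] A[0] . 1 and B[n-1] … B[1] B[0] . 1, i.e. each operand extended by a 1 in the position just below bit 0.]

recall: π[i] = σ[i] ∗ … ∗ σ[0].

Now, for a+b+1: π'[i] = σ[i] ∗ … ∗ σ[0] ∗ g.

Therefore,

$C'[i+1] = 1 \iff \pi'[i] = g \iff \pi[i] \ast g = g \iff \pi[i] \ne k.$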
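
Both sums can therefore share a single prefix computation: the carries of a+b are 1 where the prefix symbol is g, and the carries of a+b+1 are 1 where it is anything other than k. The Python sketch below illustrates this under the same assumptions as a serial model (bits least significant first, p ∗ x = x, hypothetical names); it is not the slides' circuit.

```python
from itertools import accumulate


def star(hi, lo):
    """g*x = g, k*x = k, p*x = x (hi is the more significant symbol)."""
    return lo if hi == "p" else hi


def compound_add(a_bits, b_bits):
    """Return (a+b, a+b+1) as bit lists, least significant bit first."""
    sigma = ["kpg"[a + b] for a, b in zip(a_bits, b_bits)]
    pi = list(accumulate(sigma, lambda lo, hi: star(hi, lo)))
    # Carries of a+b:   C[i+1]  = 1 iff pi[i] == g, with C[0]  = 0.
    # Carries of a+b+1: C'[i+1] = 1 iff pi[i] != k, with C'[0] = 1.
    c0 = [0] + [int(s == "g") for s in pi]
    c1 = [1] + [int(s != "k") for s in pi]
    xor = [a ^ b for a, b in zip(a_bits, b_bits)]
    sum0 = [x ^ c for x, c in zip(xor, c0)] + [c0[-1]]
    sum1 = [x ^ c for x, c in zip(xor, c1)] + [c1[-1]]
    return sum0, sum1


s0, s1 = compound_add([1, 1, 0, 0], [0, 0, 0, 0])   # a = 3, b = 0
print(s0, s1)  # [1, 1, 0, 0, 0] [0, 0, 1, 0, 0]
```

A multiplexer driven by the rounding decision would then pick `sum0` or `sum1`, so no increment is needed after rounding.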

Conclusion

• faster modules require clever designs

• starting point: gate count (for delay & cost)

• Must take fan-out & layout into account

• lots of methods:

– divide & conquer

– parallel prefix computation

– redundant arithmetic