Faster functional modules: lessons taught from FP-ADDERs
Guy Even
Electrical Engineering Dept.
Tel-Aviv Univ.
Silicon Value Seminar (April 29, 2002)
Outline
• FP-Adder: an example of a complicated module
– brief overview
– focus on two sub-blocks
• Counting leading zeros – priority encoders
various design methods:
• divide & conquer
• parallel prefix computation
• redundant addition
• Adders:
– fast adders
– compound adders
Background
Faster clock rates require faster modules.
Example: Floating-Point Adders
• early designs: 50-60 logic levels.
• 15-20 gate levels per cycle ⇒ 3-4 cycles!
• new designs: 25 gate levels ⇒ 2 cycles.
How? better algorithms and faster sub-blocks…
Floating-Point Add
• Algorithm: Why 50-60 logic levels?
• Sub-Modules: List of sub-blocks.
floating-point number: S-sign, E-exponent, F-mantissa
$(-1)^{S} \cdot 2^{E} \cdot F$
floating-point addition:
input: (Sa,Ea,Fa) & (Sb,Eb,Fb)
output: (S,E,F) such that:
$(-1)^{S} \cdot 2^{E} \cdot F = (-1)^{S_a} \cdot 2^{E_a} \cdot F_a + (-1)^{S_b} \cdot 2^{E_b} \cdot F_b$
FP-Add: naïve algorithm
Pre-process:
– SWAP operands
– Align mantissa of smaller operand (shift right)
– Compute sticky-bit (OR of bits shifted outside)
Add/Sub:
– Add/Sub mantissas
– Convert sum to sign & mag (abs(negative sum))
– Normalize sum (shift left)
Round:
– rounding decision
– INC according to rounding decision
RESULT
focus: normalization shift
Problem:
– LZ = number of leading zeros
– shift left by LZ positions
Use a priority encoder!
two types of priority encoders:
– unary
  unary example: X[1:4]=0010, A[1:4]=0011
– binary
  binary example: X[1:4]=0010, Y[2:0]=010
Unary PENC
– input: X[1:n]
– Output: A[1:n]
– functionality:
$A[i] = \begin{cases} 1 & \text{if } \exists j \le i : X[j] = 1 \\ 0 & \text{otherwise.} \end{cases}$
Simpler: A[i] = OR(X[1], X[2], …, X[i])
Implementation: what is the best design?
Delay = Θ(log n) & Cost = Θ(n).
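The functionality above can be pinned down with a small software reference model. This is only a behavioral sketch: the name `unary_penc_spec` is made up for illustration, Python lists are 0-based (so `x[0]` plays the role of X[1]), and a sequential running OR says nothing about gate delay or cost.

```python
from itertools import accumulate
from operator import or_

def unary_penc_spec(x):
    """Behavioral reference: output bit i is the OR of input bits 0..i."""
    return list(accumulate(x, or_))

# The unary example from the slides: X[1:4]=0010 gives A[1:4]=0011.
print(unary_penc_spec([0, 0, 1, 0]))
```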
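Unary PENC – divide & conquer
delay: O(log n) is optimal; O(log n) even if fan-out considered
cost: O(n log n), not optimal
[Figure: recursive design. X[1:n/2] feeds U-PENC(n/2), giving A[1:n/2], and OR-tree(n/2), giving 1 bit. X[1+n/2:n] feeds a second U-PENC(n/2); its n/2 outputs pass through OR(n/2) together with the OR-tree bit, giving A[1+n/2:n].]
• linear fan-out, logarithmic delay
• share OR-tree: slight reduction of cost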
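The recursion can be sketched in software as follows. This is a minimal model assuming the input length is a power of two; `u_penc_dc` is a hypothetical name, and recursion depth here only mirrors the circuit structure, not its timing.

```python
def u_penc_dc(x):
    """Divide & conquer unary priority encoder; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [x[0]]
    left, right = x[:n // 2], x[n // 2:]
    a_left = u_penc_dc(left)      # U-PENC(n/2) on the left half
    a_right = u_penc_dc(right)    # U-PENC(n/2) on the right half
    # OR-tree(n/2) over the left half: one bit that fans out to n/2 OR gates.
    left_any = int(any(left))
    return a_left + [left_any | bit for bit in a_right]

print(u_penc_dc([0, 0, 1, 0]))
```

In this model `left_any` always equals the last bit of `a_left`, which is one way to read the "share OR-tree" remark.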
Unary PENC - improve
Parallel Prefix Computation (PPC) [FL,BK]!
A[1] = X[1]
A[2] = OR(X[1], X[2])
A[3] = OR(X[1], X[2], X[3])
⋮
A[n] = OR(X[1], X[2], …, X[n])
Unary PENC = PPC(OR)
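[Figure: PPC(OR) circuit. Adjacent inputs (X[1],X[2]), (X[3],X[4]), …, (X[n-1],X[n]) are ORed in pairs and fed to U-PENC(n/2), which outputs A[2], A[4], …, A[n]. A[1] = X[1], and each remaining odd output (A[3], …, A[n-1]) takes one more OR gate.]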
delay = O(log n)
cost = O(n)
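The recursive structure of the figure can be sketched in software for any associative operator. This is an illustrative model only: `ppc` is a made-up name, the input length is assumed to be a power of two, and indices are 0-based.

```python
from operator import or_, add

def ppc(x, op):
    """Parallel prefix: returns [x0, x0 op x1, ..., x0 op ... op x_{n-1}]."""
    n = len(x)
    if n == 1:
        return [x[0]]
    # First layer: combine adjacent pairs (n/2 gates).
    pairs = [op(x[2 * i], x[2 * i + 1]) for i in range(n // 2)]
    # Recursive PPC on n/2 values yields every second prefix.
    half = ppc(pairs, op)
    out = [x[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(half[i // 2])                    # taken directly from the recursion
        else:
            out.append(op(half[i // 2 - 1], x[i]))      # one extra gate
    return out

print(ppc([0, 0, 1, 0, 0, 1, 0, 0], or_))   # unary PENC = PPC(OR)
print(ppc([1, 2, 3, 4], add))               # works for any associative operator
```

Each level uses about n gates in total for a problem of size n and recurses on n/2, which is where the O(n) cost and O(log n) delay come from.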
PPC - properties
Fan-out:
Logarithmic fan-out can be decreased to constant (cost still O(n)).
Layout:
O(n log n) area.
Same design as “Brent-Kung” adder.
Applicable for every associative operator.
Binary PENC
– input: X[1:n] (n=2^k)
– Output: Y[k:0]
– functionality:
$\sum_{i} Y[i] \cdot 2^{i} = \text{number of leading zeros in } X[1:n]$
Relation to Unary PENC:
$\sum_{i} Y[i] \cdot 2^{i} = \sum_{j=1}^{n} (1 - A[j]).$
Implementation: what is the best design?
Delay = Θ(log n) & Cost = Θ(n).
Binary PENC – simple & optimal
[Figure: X[1:n] → PPC (OR) → A[1:n] → diff(n) → encoder(n) → Y[k:0]]
A[i] = OR(X[1], X[2], …, X[i])
delay(diff(n)) = constant
delay(encoder(n)) = O(log n)
cost(diff(n)) = O(n)
cost(encoder(n)) = O(n)
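A behavioral sketch of the PPC + diff + encoder pipeline follows. The slides do not spell out the diff stage, so the version here is an assumption: it produces a one-hot vector of length n+1 whose extra last position marks the all-zeros input. The name `bin_penc_encoder` is hypothetical, and a running OR stands in for the PPC circuit.

```python
def bin_penc_encoder(x):
    """Leading-zero count of x (len(x) a power of two) via PPC(OR), diff, encoder."""
    n = len(x)
    # PPC(OR) stand-in: a[i] = OR(x[0..i]).
    a, seen = [], 0
    for bit in x:
        seen |= bit
        a.append(seen)
    # diff(n) (assumed form): d is one-hot; d[p] = 1 where the first 1 of x sits,
    # and d[n] = 1 when x is all zeros. Each output depends on two neighbors only.
    d = [a[0]] + [a[i] & (1 - a[i - 1]) for i in range(1, n)] + [1 - a[n - 1]]
    # encoder(n): output bit i is the OR of all d[pos] whose index has bit i set.
    k = n.bit_length() - 1
    y = [0] * (k + 1)
    for pos, hot in enumerate(d):
        for i in range(k + 1):
            if hot and (pos >> i) & 1:
                y[i] = 1
    return sum(bit << i for i, bit in enumerate(y))

print(bin_penc_encoder([0, 0, 1, 0]))   # binary example from the slides: Y = 010
```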
Binary PENC – with adder tree
[Figure: X[1:n] → PPC (OR) → A[1:n] → ADD-tree(n) → Y[k:0]]
A[i] = OR(X[1], X[2], …, X[i])
$\sum_{i} Y[i] \cdot 2^{i} = \sum_{j=1}^{n} (1 - A[j])$
problem:
adder(k) in tree ⇒ O(log k) delay per adder
total delay is O(log n · log k).
Redundant addition
Partial compression or (3:2)-addition:
add columns in parallel using Full-Adders
[Figure: three operands a3 a2 a1 a0, b3 b2 b1 b0, c3 c2 c1 c0 are compressed into two vectors x3 x2 x1 x0 and y3 y2 y1 y0]
delay is constant!
Tree structure enables (n:2)-addition with O(log n) delay.
(n:2)-addition used in fast multipliers
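A minimal sketch of (3:2)- and (n:2)-addition on non-negative Python integers, where bitwise operators act on all columns at once. The names `csa` and `reduce_n_to_2` are illustrative, and the grouping of operands into triples is one possible tree shape, not necessarily the one used in a given design.

```python
def csa(a, b, c):
    """(3:2)-addition: one full adder per column, no carry propagation."""
    x = a ^ b ^ c                               # sum bit of every column
    y = ((a & b) | (a & c) | (b & c)) << 1      # carry bit, moved one column left
    return x, y                                 # invariant: a + b + c == x + y

def reduce_n_to_2(operands):
    """(n:2)-addition: apply layers of (3:2)-addition until two vectors remain."""
    ops = list(operands)
    while len(ops) > 2:
        full = len(ops) - len(ops) % 3
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(csa(ops[i], ops[i + 1], ops[i + 2]))
        nxt.extend(ops[full:])                  # leftovers pass to the next layer
        ops = nxt
    return ops

x, y = csa(5, 6, 7)
print(x + y == 5 + 6 + 7)
print(sum(reduce_n_to_2([3, 5, 7, 9, 11])))
```

Each layer shrinks the operand count by roughly a factor of 2/3, so the number of layers is logarithmic in n; only the final two vectors need a carry-propagate adder.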
Binary PENC – O(log n) delay
[Figure: X[1:n] → PPC (OR) → A[1:n] → FA-tree(n) → 2 × [k:0] → 2:1-Adder → Y[k:0]]
A[i] = OR(X[1], X[2], …, X[i])
$\sum_{i} Y[i] \cdot 2^{i} = \sum_{j=1}^{n} (1 - A[j])$
Tree of Full-Adders:
• delay of each full-adder is constant
• depth is O(log n)
• output is carry-save number
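The pipeline above can be modeled by summing the n one-bit terms (1 - A[j]) with layers of full adders and a single final addition. This is a sketch under assumptions: `lz_carry_save` is a made-up name, a running OR stands in for PPC(OR), and Python's built-in addition stands in for the 2:1-Adder.

```python
def lz_carry_save(x):
    """Leading-zero count of x via PPC(OR) followed by a carry-save tree."""
    # PPC(OR) stand-in: a[j] = OR(x[0..j]).
    a, seen = [], 0
    for bit in x:
        seen |= bit
        a.append(seen)
    terms = [1 - bit for bit in a]          # the n terms (1 - A[j])
    # FA-tree: each layer compresses triples into (sum, carry) pairs.
    while len(terms) > 2:
        full = len(terms) - len(terms) % 3
        nxt = []
        for i in range(0, full, 3):
            p, q, r = terms[i:i + 3]
            nxt.append(p ^ q ^ r)
            nxt.append(((p & q) | (p & r) | (q & r)) << 1)
        nxt.extend(terms[full:])
        terms = nxt
    # The carry-save result (at most two vectors) goes through one ordinary adder.
    return sum(terms)

print(lz_carry_save([0, 0, 1, 0]))
```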
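Binary PENC – divide & conquer
$X_L = X[1 : n/2], \quad X_R = X[n/2+1 : n]$
$Y_L = \text{Bin-PENC}(X_L)$
$Y_R = \text{Bin-PENC}(X_R)$
$Y = \begin{cases} 0 + Y_L & \text{if } X_L \ne 00\cdots0 \\ n/2 + Y_R & \text{if } X_L = 00\cdots0 \end{cases}$
$X_L = 00\cdots0 \iff Y_L[k-1] = 1$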
Binary PENC – divide & conquer
[Figure: X[1:n/2] → Bin-PENC(n/2) → YL (k bits). X[1+n/2:n] → Bin-PENC(n/2) → YR (k bits) → +2^(k-1) (Half Adder). MUX(k), selected by the 1-bit signal YL[k-1], chooses between the two and outputs Y[k:0].]
• +2^(k-1) (Half Adder): delay=constant, cost=O(log n)
• MUX(k): delay=constant, cost=O(log n); fan-out=k incurs O(log log n) delay
initial analysis:
delay = O(log n)
cost = O(n)
bottom line:
delay = O(log n · log log n)
cost = O(n)
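The recursion can be sketched as below. This is a behavioral model only: `bin_penc_dc` is a hypothetical name, the input length is assumed to be a power of two, and the integer addition of n/2 stands in for the half-adder trick on the top bit.

```python
def bin_penc_dc(x):
    """Divide & conquer binary priority encoder: returns the leading-zero count."""
    n = len(x)
    if n == 1:
        return 1 - x[0]
    half = n // 2                              # half = 2^(k-1)
    y_l = bin_penc_dc(x[:half])
    y_r = bin_penc_dc(x[half:])
    # The left half is all zeros exactly when bit k-1 of y_l is set (y_l == n/2).
    left_all_zero = (y_l >> (half.bit_length() - 1)) & 1
    # MUX selected by that single bit; it fans out to every output bit.
    return half + y_r if left_all_zero else y_l

print(bin_penc_dc([0, 0, 1, 0]))
```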
PENC – quick summary
design     method          cost                  delay
U-PENC     div & conquer   n log n               log n
U-PENC     PPC             n (area = n log n)    log n
Bin-PENC   PPC+encoder     n                     log n
Bin-PENC   PPC+Add_tree    n                     log n
Bin-PENC   Div & Conquer   n                     log n · log log n
PENC - further issues
back to FP-Adder:
can we estimate LZ before subtracting?
[Figure: bit-pattern example with rows 100, 111, 000 and 00, 10, 01]
must pre-process to avoid “catastrophic cancellation”!
method: partial compression (signed half-adders).
focus – adder
Avoid INC after rounding decision by pre-computing increment.
[Figure: a, b → Compound Adder → a+b and a+b+1 → MUX (select: rounding decision) → RESULT; bus width labeled k]
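PPC Adder [FL,BK]
• computes carry bits C[n:1]
• sum bits satisfy: S[i]=XOR(A[i],B[i],C[i]).
• computation of carry bits C[n:1].
Define:
$\sigma[i] \in \{k, p, g\}, \qquad \sigma[i] = \begin{cases} k & \text{if } A[i] + B[i] = 0 \\ p & \text{if } A[i] + B[i] = 1 \\ g & \text{if } A[i] + B[i] = 2 \end{cases}$
claim: $C[i+1] = 1 \iff \exists j \le i : \sigma[i:j] = p\,p \cdots p\,g$
example:
A[3:0]=0100
B[3:0]=1110
σ[3:0]=pgpk
C[4:1]=1100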
PPC adder (cont.)
how to compute the event: $\exists j \le i : \sigma[i:j] = p\,p \cdots p\,g$ ?
define an operator ∗ : {k,p,g} × {k,p,g} → {k,p,g} as follows:
g ∗ x = g
p ∗ x = x
k ∗ x = k
claim: ∗ is associative.
definition: π[i] = σ[i] ∗ … ∗ σ[0].
claim: C[i+1] = 1 ⇔ π[i] = g
compute π[i] using PPC with ∗-gates!
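The carry rule can be checked with a short model in which every bit position is classified as k, p or g and the prefixes are combined with the operator defined above (g absorbs, k absorbs, p passes its right operand through). This is a sketch under assumptions: `star` and `ppc_add` are made-up names, bits are stored LSB first, and a sequential scan stands in for the PPC circuit, which is legitimate only because the operator is associative.

```python
def star(hi, lo):
    """The operator on {k, p, g}: g*x = g, k*x = k, p*x = x."""
    return lo if hi == 'p' else hi

def ppc_add(a_bits, b_bits):
    """Add two equal-length bit lists (LSB first); returns n+1 sum bits, LSB first."""
    sigma = ['kpg'[a + b] for a, b in zip(a_bits, b_bits)]
    carries = [0]                    # C[0] = 0: no carry into bit 0
    prefix = None
    for s in sigma:
        # prefix for position i = sigma[i] * prefix for position i-1
        prefix = s if prefix is None else star(s, prefix)
        carries.append(1 if prefix == 'g' else 0)     # C[i+1] = 1 iff prefix is g
    sums = [a ^ b ^ c for a, b, c in zip(a_bits, b_bits, carries)]
    return sums + [carries[-1]]

# Example from the slides: A[3:0]=0100, B[3:0]=1110 (written LSB first here).
print(ppc_add([0, 0, 1, 0], [0, 1, 1, 1]))
```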
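Compound Adder [T]
how to compute a+b & a+b+1?
– use 2 separate adders
– understand PPC adder
a+b+1 = (a+0.5)+(b+0.5)
[Figure: the operands written as A[n-1] … A[1] A[0] . 1 and B[n-1] … B[1] B[0] . 1, each extended by a 1 in position -1]
recall: π[i] = σ[i] ∗ … ∗ σ[0].
Now, for a+b+1: π'[i] = σ[i] ∗ … ∗ σ[0] ∗ g.
Therefore,
$C'[i+1] = 1 \iff \pi'[i] = g \iff \pi[i] * g = g \iff \pi[i] \ne k.$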
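The conclusion above means both sums can share one prefix computation: the carries of a+b are 1 where the prefix is g, and the carries of a+b+1 are 1 where the prefix is not k. The sketch below illustrates this; `compound_add` is a hypothetical name, bits are LSB first, and the sequential scan again stands in for a PPC circuit.

```python
def compound_add(a_bits, b_bits):
    """Return (a+b, a+b+1) as integers from a single k/p/g prefix scan."""
    sigma = ['kpg'[a + b] for a, b in zip(a_bits, b_bits)]
    c0, c1 = [0], [1]        # carry into bit 0: 0 for a+b, 1 for a+b+1
    prefix = None
    for s in sigma:
        # p passes the lower prefix through; k and g override it.
        prefix = s if (prefix is None or s != 'p') else prefix
        c0.append(int(prefix == 'g'))     # carry of a+b
        c1.append(int(prefix != 'k'))     # carry of a+b+1

    def value(carries):
        bits = [a ^ b ^ c for a, b, c in zip(a_bits, b_bits, carries)]
        bits.append(carries[-1])
        return sum(bit << i for i, bit in enumerate(bits))

    return value(c0), value(c1)

print(compound_add([0, 0, 1, 0], [0, 1, 1, 1]))   # operands 4 and 14, LSB first
```

In an FP-adder the rounding decision would then act as the select input of a MUX that picks one of the two results.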
Conclusion
• faster modules require clever designs
• starting point: gate count (for delay & cost)
• must take fan-out & layout into account
• lots of methods:
– divide & conquer
– parallel prefix computation
– redundant arithmetic