mwe/PHD/1
Critical ALU Path Optimization and Implementation in a
BiCMOS Process for Gigahertz Range Processors
Matthew W. Ernest
Electrical, Computer and Systems Engineering Dept.
Rensselaer Polytechnic Institute
mwe/PHD/2
Overview
• Motivation
• Parallel Prefixes and Carry Types
• HBT Digital Circuits
• Pseudo-carry Adder
• Future Directions
mwe/PHD/3
Motivation
“Speed has always been important otherwise one wouldn't need the computer.” -Seymour Cray
• Ubiquity
• Simplicity
• Complexity
mwe/PHD/4
Parallel Prefixes
• The set of problems covering sequences of operations where terms are added in order to the result of the previous operation
• Carry computation is an application of parallel prefix theory
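Given: $x_0, x_1, x_2, \dots, x_k$ and an associative operator $\circ$
Find: $x_0,\ x_0 \circ x_1,\ x_0 \circ x_1 \circ x_2,\ \dots,\ x_0 \circ x_1 \circ x_2 \circ \dots \circ x_k$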
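The prefix problem stated above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in this work: the function names are hypothetical, and the "parallel" version only models the log-depth combining pattern (each round could run concurrently in hardware) while executing sequentially here.

```python
def sequential_prefix(xs, op):
    """Reference scan: each result reuses the previous one (depth n - 1)."""
    out = []
    for x in xs:
        out.append(x if not out else op(out[-1], x))
    return out


def parallel_prefix(xs, op):
    """Log-depth scan: in each round every element combines with the one
    `dist` positions earlier, so ceil(log2 n) rounds suffice.
    `op` must be associative; it need not be commutative."""
    out = list(xs)
    dist = 1
    while dist < len(out):
        prev = out
        out = [
            prev[i] if i < dist else op(prev[i - dist], prev[i])
            for i in range(len(prev))
        ]
        dist *= 2
    return out
```

String concatenation is a convenient test operator because it is associative but not commutative, so any ordering mistake shows up immediately.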
mwe/PHD/5
Carry Types: Carry Select
• Compute possible results in parallel
• Select when actual carry-in available
• Requires internal carry for blocks, e.g. ripple
• Delay: O(f(n/b) + b), min. O(n^(1/2))
• Area: O(f(n/b)b + b), approx. 2n
• Affected by block sizing
[Figure: carry-select blocks with multiplexer inputs labeled 0 and 1]
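A behavioral sketch of carry select, assuming ripple carry inside each block and a uniform block size (both are illustrative choices; the function names are hypothetical). Each block computes its result for carry-in 0 and carry-in 1 without waiting, and the actual carry only drives the selection.

```python
def ripple_add(a_bits, b_bits, cin):
    """Ripple-carry add of LSB-first bit lists; returns (sum_bits, carry_out)."""
    s, c = [], cin
    for a, b in zip(a_bits, b_bits):
        s.append(a ^ b ^ c)
        c = (a & b) | (c & (a ^ b))
    return s, c


def carry_select_add(a, b, n, block, cin=0):
    """n-bit carry-select add with blocks of `block` bits.
    Returns (sum mod 2**n, carry_out)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    total, carry = 0, cin
    for lo in range(0, n, block):
        hi = min(lo + block, n)
        # Both candidates depend only on the operands, not on the real carry.
        s0, c0 = ripple_add(a_bits[lo:hi], b_bits[lo:hi], 0)
        s1, c1 = ripple_add(a_bits[lo:hi], b_bits[lo:hi], 1)
        # The real carry-in acts only as a multiplexer select.
        s, carry = (s1, c1) if carry else (s0, c0)
        for i, bit in enumerate(s):
            total |= bit << (lo + i)
    return total, carry
```

With ripple blocks the critical path is one block ripple plus one select per block, which is where the block-size trade-off on the slide comes from.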
mwe/PHD/6
Carry Types: Carry look-ahead
• Carry-out can be “generated” at current position or carry-in “propagated”
• Delay: O(1)
• Area: O(n^2)
• High fan-in/fan-out
mwe/PHD/7
Carry Types: Block carry look-ahead
• A block propagates a carry if all bits in the block propagate a carry
• A block generates a carry if a bit generates a carry and all succeeding bits propagate
• Delay: O(log n)
• Area: O(n log n)
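The two block rules above define an associative operator on (generate, propagate) pairs, which is what lets a prefix tree compute all carries in O(log n) levels. The sketch below assumes G = A·B and P = A+B per bit and uses one possible tree arrangement (every position combined in every round); it is not necessarily the tree used in this work, and the names are hypothetical.

```python
def combine(hi, lo):
    """Merge two adjacent blocks, `hi` being the more significant.
    Each block is a (G, P) pair."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    # Generates if hi generates, or hi propagates what lo generates.
    # Propagates only if both blocks propagate.
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def block_cla_add(a, b, n, cin=0):
    """n-bit add using a log-depth prefix tree over (G, P) pairs.
    Returns (sum mod 2**n, carry_out)."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) | (b >> i)) & 1) for i in range(n)]
    dist = 1
    while dist < n:
        # After this round, gp[i] covers bits max(0, i - 2*dist + 1) .. i.
        gp = [gp[i] if i < dist else combine(gp[i], gp[i - dist]) for i in range(n)]
        dist *= 2
    # gp[i] now covers bits 0..i, so the carry out of bit i is G + P*cin.
    carries = [cin] + [g | (p & cin) for g, p in gp]
    total = 0
    for i in range(n):
        total |= (((a >> i) ^ (b >> i) ^ carries[i]) & 1) << i
    return total, carries[n]
```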
mwe/PHD/8
Block carry look-ahead trees
mwe/PHD/9
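Carry vs. Pseudo-carry
$C_{out} = G_n + P_n \cdot G_{n-1} + \dots + P_n \cdot P_{n-1} \cdots P_0 \cdot C_{in}$
If $G = A \cdot B$ and $P = A + B$ then $G = G \cdot P$
$C_{out} = P_n \cdot G_n + P_n \cdot G_{n-1} + \dots + P_n \cdot P_{n-1} \cdots P_0 \cdot C_{in}$
$C_{out} = P_n (G_n + G_{n-1} + \dots + P_{n-1} \cdots P_0 \cdot C_{in})$
$C_{out} = P_n \cdot H_n$
$H_n = G_n + G_{n-1} + \dots + P_{n-1} \cdots P_0 \cdot C_{in}$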
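The factorization can be checked exhaustively for small word sizes. The sketch below evaluates both sums of products term by term, assuming G = A·B and P = A+B per bit as stated on the slide; the helper names are hypothetical.

```python
def carry_out(g, p, cin):
    """C_out = G_n + P_n*G_(n-1) + ... + P_n*...*P_0*C_in (lists are LSB-first)."""
    result, prod = 0, 1
    for k in range(len(g) - 1, -1, -1):
        result |= prod & g[k]   # term: P_n ... P_(k+1) * G_k
        prod &= p[k]
    return result | (prod & cin)


def pseudo_carry(g, p, cin):
    """H_n = G_n + G_(n-1) + P_(n-1)*G_(n-2) + ... + P_(n-1)*...*P_0*C_in."""
    n = len(g) - 1
    result, prod = g[n], 1
    for k in range(n - 1, -1, -1):
        result |= prod & g[k]   # term: P_(n-1) ... P_(k+1) * G_k
        prod &= p[k]
    return result | (prod & cin)


def gp_bits(a, b, width):
    """Per-bit generate (AND) and propagate (OR) signals, LSB-first."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(width)]
    return g, p
```

Every product in the pseudo-carry has one fewer factor than the matching product in the carry, which is the saving the slide points to.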
mwe/PHD/10
Carry vs. Pseudo-carry
• Redundant terms create factorization opportunities
• Factorization moves terms from critical paths to non-critical paths
• Multiple paths can be parallelized
• Products with fewer terms lead to implementations with smaller, faster gates
mwe/PHD/11
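Deriving Block Pseudo-carry from Block Carry Look-ahead Terms
Block Generate:
$G^{i \cdot j}_{0} = G^{i}_{j} + P^{i}_{j} G^{i}_{j-1i} + \dots + P^{i}_{j} P^{i}_{j-1i} P^{i}_{j-2i} \cdots G^{i}_{0}$
If $G = A \cdot B$ and $P = A + B$ then $G = G \cdot P$
$G^{i \cdot j}_{0} = P^{i}_{j} G^{i}_{j} + P^{i}_{j} G^{i}_{j-1i} + \dots + P^{i}_{j} P^{i}_{j-1i} P^{i}_{j-2i} \cdots G^{i}_{0}$
$G^{i \cdot j}_{0} = P^{i}_{j} (G^{i}_{j} + G^{i}_{j-1i} + \dots + P^{i}_{j-1i} P^{i}_{j-2i} \cdots G^{i}_{0})$
$H^{i \cdot j}_{0} = G^{i}_{j} + G^{i}_{j-1i} + \dots + P^{i}_{j-1i} P^{i}_{j-2i} \cdots G^{i}_{0}$
• Pseudo-carries can be generated in blocks like carries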
mwe/PHD/12
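Generalized Pseudocarry Equations
$H^{2}_{s} = G^{1}_{s+1} + G^{1}_{s}$
$H^{i+j}_{s} = H^{j}_{s+i} + I^{j}_{s+i-1} \cdot H^{i}_{s}$
$H^{i+j+k}_{s} = H^{k}_{s+i+j} + I^{k}_{s+i+j-1} \cdot H^{j}_{s+i} + I^{k}_{s+i+j-1} \cdot I^{j}_{s+i-1} \cdot H^{i}_{s}$
$I^{p+q}_{t} = I^{q}_{t+p} \cdot I^{p}_{t}$
$I^{p+q+r}_{t} = I^{r}_{t+q+p} \cdot I^{q}_{t+p} \cdot I^{p}_{t}$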
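The two-way and three-way combining rules can be checked by brute force. The sketch below rests on an assumed reading of the notation, which the slides do not spell out: the superscript is the block width, the subscript is the lowest bit of the block, the block pseudo-carry is the top generate OR-ed with the ordinary block generate of the remaining bits, and I is the AND of the per-bit propagates in its block. It also assumes G = A·B and P = A+B. All function names are hypothetical.

```python
def block_h(g, p, s, w):
    """Assumed H^w_s: pseudo-carry of the w-bit block whose lowest bit is s."""
    top = s + w - 1
    result, prod = g[top], 1
    for k in range(top - 1, s - 1, -1):
        result |= prod & g[k]
        prod &= p[k]
    return result


def block_i(p, t, w):
    """Assumed I^w_t: AND of the propagates P_t .. P_(t+w-1)."""
    out = 1
    for k in range(t, t + w):
        out &= p[k]
    return out


def check_two_way(g, p, s, i, j):
    """H^(i+j)_s == H^j_(s+i) + I^j_(s+i-1) * H^i_s"""
    lhs = block_h(g, p, s, i + j)
    rhs = block_h(g, p, s + i, j) | (block_i(p, s + i - 1, j) & block_h(g, p, s, i))
    return lhs == rhs


def check_three_way(g, p, s, i, j, k):
    """H^(i+j+k)_s == H^k + I^k * H^j + I^k * I^j * H^i (indices as on the slide)."""
    ik = block_i(p, s + i + j - 1, k)
    ij = block_i(p, s + i - 1, j)
    lhs = block_h(g, p, s, i + j + k)
    rhs = (block_h(g, p, s + i + j, k)
           | (ik & block_h(g, p, s + i, j))
           | (ik & ij & block_h(g, p, s, i)))
    return lhs == rhs
```

Note that the I block sits one bit lower than the H block it multiplies; the identity only closes because G = G·P at the overlapping bit.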
mwe/PHD/13
Generating Sums Using Pseudocarry
$S_n = A_n \oplus B_n \oplus C_{n-1}$
If $T_n = A_n \oplus B_n$ and $C_m = P_m \cdot H_m$
then $S_n = T_n \oplus (P_{n-1} \cdot H_{n-1})$
• Sum with pseudo-carry no more complex than sum with carry
• Other look-ahead features still apply, e.g. Han-Carlson “every other carry”
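A behavioral sketch of sum generation from pseudo-carries, assuming G = A·B, P = A+B and T = A XOR B per bit. For brevity the pseudo-carries are evaluated bit by bit here using H_i = G_i + C_(i-1), which follows from the expansion of H; a real adder would produce them with a look-ahead tree. The function name is hypothetical.

```python
def add_with_pseudo_carry(a, b, n, cin=0):
    """n-bit add in which each sum bit is T_i XOR (P_(i-1) * H_(i-1)).
    Returns (sum mod 2**n, carry_out)."""
    total = 0
    carry_in = cin                     # stands in for P_(-1) * H_(-1)
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g, p, t = ai & bi, ai | bi, ai ^ bi
        total |= (t ^ carry_in) << i   # S_i = T_i xor (P_(i-1) * H_(i-1))
        h = g | carry_in               # H_i = G_i + C_(i-1)
        carry_in = p & h               # C_i = P_i * H_i
    return total, carry_in
```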
mwe/PHD/14
Adder comparison
Bits   Ripple   CSel A   CSel B   CSel C   CLA   PCLA
32     32       12       12       9        6     5
64     64       20       16       12       7     6
mwe/PHD/15
HBT Digital Circuits
• Exponential I/V relationship leads to high gain and fast switching
• Vertical arrangement allows critical dimensions to be smaller with tighter tolerances
• Traditionally high DC power consumption: compare increasing leakage and switching currents for FETs
mwe/PHD/16
Current Steering Logic
• Constant current source equals combined emitter currents
• Ratio of current through each transistor is exp. function of base voltage
• Difference in currents at collector converted to difference in voltage on pull-up resistors.
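The three points above can be put into numbers with the textbook ideal differential-pair relation. This sketch assumes matched transistors and neglects base current and emitter resistance; the thermal voltage default (about 25.85 mV, room temperature) and all example values are illustrative, not taken from this work.

```python
import math


def steer(i_tail, v_diff, v_t=0.025852):
    """Split a constant tail current between two matched bipolar transistors.
    v_diff is the base voltage of transistor 1 minus that of transistor 2.
    Returns (i1, i2) with i1 + i2 == i_tail and i1 / i2 == exp(v_diff / v_t)."""
    i1 = i_tail / (1.0 + math.exp(-v_diff / v_t))
    return i1, i_tail - i1


def differential_output(i_tail, v_diff, r_pullup, v_t=0.025852):
    """Voltage difference between the two collectors, each loaded by r_pullup."""
    i1, i2 = steer(i_tail, v_diff, v_t)
    return (i1 - i2) * r_pullup
```

For example, with a 100 mV input difference the ratio exp(0.1 / 0.025852) is already close to 48, so nearly all of the tail current flows through one side.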
mwe/PHD/17
Single-ended vs. Double-ended
Single-ended:
• Limited to simple functions
• Large fan-in
Double-ended:
• Any function of inputs
• Fan-in limited by supply voltage
mwe/PHD/18
Look-ahead gate w/ fully differential logic
[Figure: fully differential look-ahead gate schematic with true and complement inputs Hn, Hn-1, Hn-2, In and In-1]
mwe/PHD/19
Mixed input look-ahead gates
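[Figure: mixed-input look-ahead gate schematic with inputs Hn, Hn-1 and In and reference Vr]
• $I_n (H_n + H_{n-1}) + \overline{I_n} \cdot H_n$
• $= H_n + I_n \cdot H_{n-1}$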
• Two series-gated levels for three inputs
mwe/PHD/20
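Mixed input look-ahead gates
[Figure: mixed-input look-ahead gate schematic with inputs Hn, Hn-1, Hn-2, In and In-1]
• $I_n I_{n-1} (H_n + H_{n-1} + H_{n-2}) + I_n \overline{I_{n-1}} (H_n + H_{n-1}) + \overline{I_n \cdot I_{n-1}} \cdot H_n$
• $= H_n + I_n \cdot H_{n-1} + I_n \cdot I_{n-1} \cdot H_{n-2}$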
• Three series-gated levels for five inputs
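The two mixed-input gate expressions can be checked against their simplified forms by enumerating all inputs. The placement of the complemented I terms in the code is an assumption: the complements are chosen so that each selected form reduces to the stated look-ahead function, as a series-gated differential tree would allow. The function names are hypothetical.

```python
from itertools import product


def two_level(i_n, h_n, h_n1):
    """Three-input gate: (selected form, simplified look-ahead form)."""
    selected = (i_n & (h_n | h_n1)) | ((1 - i_n) & h_n)   # assumed complement on I_n
    simplified = h_n | (i_n & h_n1)
    return selected, simplified


def three_level(i_n, i_n1, h_n, h_n1, h_n2):
    """Five-input gate: (selected form, simplified look-ahead form)."""
    selected = ((i_n & i_n1 & (h_n | h_n1 | h_n2))
                | (i_n & (1 - i_n1) & (h_n | h_n1))        # assumed complement on I_(n-1)
                | ((1 - (i_n & i_n1)) & h_n))              # assumed complement on I_n * I_(n-1)
    simplified = h_n | (i_n & h_n1) | (i_n & i_n1 & h_n2)
    return selected, simplified


def all_agree():
    """True if both gates match their simplified forms for every input."""
    ok2 = all(s == f for s, f in (two_level(*v) for v in product((0, 1), repeat=3)))
    ok3 = all(s == f for s, f in (three_level(*v) for v in product((0, 1), repeat=5)))
    return ok2 and ok3
```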
mwe/PHD/21
Pseudocarry Blocks
[Figure: pseudocarry block tree in which $H^{2}_{s}$ blocks combine into $H^{6}_{s}$ blocks, then into $H^{18}_{s}$ and $H^{14}_{s}$ blocks, and finally into $H^{32}_{s}$]
mwe/PHD/22
Pseudocarry Tree Oscillator
[Figure: block diagram with inputs A, B, Cin and Select and output Cout; bus labels 32, 31, 0 and 1]
mwe/PHD/23
Carry Tree High-speed Output
2 x 165 ps
mwe/PHD/24
Breakdown of measured delay
Devices: 71%
Wire C: 12%
Temperature: 6%
Resistor model: 11%
Total measured delay = 165 ps
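As a quick piece of arithmetic, the percentages can be converted to picoseconds of the 165 ps total. The dictionary keys simply repeat the chart labels; the function name is hypothetical.

```python
TOTAL_DELAY_PS = 165.0

# Shares as read from the chart, expressed as fractions of the total.
SHARES = {
    "Devices": 0.71,
    "Wire C": 0.12,
    "Temperature": 0.06,
    "Resistor model": 0.11,
}


def breakdown_ps(total_ps, shares):
    """Convert fractional shares into absolute delay contributions in ps."""
    return {name: total_ps * share for name, share in shares.items()}


contributions = breakdown_ps(TOTAL_DELAY_PS, SHARES)
```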
mwe/PHD/25
Loaded vs. unloaded toggling
• At design time, fT peak at 1.2 mA/um^2 but limit at 2 mA/um^2
• For some devices, max. frequency when driving load can occur above fT peak current
• Models supported this; no reason at the time not to believe them
• However, models are never qualified above fT peak current!
mwe/PHD/26
Loaded vs. unloaded toggling
[Figure: Buffer Delay (0.00E+00 to 8.00E-11) versus Tail Current (0.00E+00 to 2.50E-03)]
mwe/PHD/27
Resistor Model Effects
           9805A       99B
           Simulated   Fabricated
Pull-up    444         528
Tail       1000        1091
mwe/PHD/28
Model parameter variation
[Figure: parameter value (0 to 500 ohms) of RB, RE and RC for design kits 9708A, 9802, 9805, 1999B and v2.3; DARPA02 Design and DARPA02 Fabrication marked]
mwe/PHD/29
Cadence internal parasitic methods
• Approximates all capacitance as polynomial function of distance between conductors
• Cannot extract RC and capacitance between conductors at the same time: killer for differential wiring!
• Convenient, but window of usability small and shrinking
mwe/PHD/30
QuickCap capacitance extraction
• Field solving with floating random walk method
• Accuracy almost wholly a function of run time: 4x run time gives ½ the error
• Random walks independent, near perfect parallelization
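The "4x run time, half the error" rule is the usual behavior of averaging independent random samples, where the error falls as one over the square root of the sample count. The sketch below is a generic Monte Carlo mean estimate, not QuickCap's algorithm; the seed, sample counts and names are arbitrary illustrative choices.

```python
import math
import random


def rms_error(samples_per_run, runs, rng):
    """RMS error of a Monte Carlo estimate of the mean of uniform(0, 1),
    whose true value is 0.5, measured over `runs` independent estimates."""
    sq = 0.0
    for _ in range(runs):
        est = sum(rng.random() for _ in range(samples_per_run)) / samples_per_run
        sq += (est - 0.5) ** 2
    return math.sqrt(sq / runs)


rng = random.Random(2002)            # fixed seed so the run is repeatable
base = rms_error(50, 500, rng)       # baseline "run time"
longer = rms_error(200, 500, rng)    # four times as many samples
ratio = longer / base                # expected to be near 0.5
```

Because each sample is independent, the samples can be divided among processors and the partial averages merged, which is the parallelization property noted above.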
mwe/PHD/31
Comparing parasitic extraction
[Figure: Delay (0 to 50 ps) versus Length (0 to 1200 um) for Qcap RC, RCNET, PCAP and Calc RC]
mwe/PHD/32
Cadence/QuickCap Design Flow
• Extract physical data from layout
• Compute RC with QuickCap
• Extract netlist from schematic
• Combine to simulate with Spectre
mwe/PHD/33
Partial manual extraction with QuickCap
• Identify main wires of oscillation paths: approx. dozen pairs
• QuickCap extraction for each wire-ground cap. and cap. between pair
• Add RC-ladder for each pair by hand to schematic and simulate
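A hand-built RC ladder can be sanity-checked against the Elmore delay estimate before simulating. The sketch below assumes a uniform wire split into equal segments with each segment's capacitance placed at its far end; the function name and the example values are hypothetical. For a differential pair switching in opposite directions, one common approximation (an assumption here, not something the slides state) is to count the pair-to-pair capacitance twice when forming the total capacitance.

```python
def elmore_delay(r_total, c_total, segments, c_load=0.0):
    """Elmore delay (seconds) of a uniform RC ladder.
    r_total, c_total: total wire resistance (ohms) and capacitance (farads).
    segments: number of equal R-C sections in the ladder.
    c_load: optional lumped capacitance at the far end."""
    r = r_total / segments
    c = c_total / segments
    delay = 0.0
    for k in range(1, segments + 1):
        # The capacitor at node k is charged through k series resistors.
        delay += (k * r) * c
    return delay + r_total * c_load
```

A single section gives R·C, while many sections approach R·C/2, which shows how much a too-coarse ladder overestimates the wire delay.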
mwe/PHD/34
Simulation with Parasitic Extraction
Feedback path   w/o parasitics   QuickCap parasitic   COEFGEN parasitic   Raphael parasitic   QuickCap parasitic
                (ps)             cap. (ps)            cap. (ps)           cap. (ps)           RC (ps)
Cin             100              121                  128                 131                 135
A1              103              123                  130                 129                 137
A31             108              127                  129                 132                 141
mwe/PHD/35
Pseudo-carry Tree configured as Ring Oscillator
[Figure: block diagram with inputs A, B, Cin, Sel0 and Sel1 and output Cout; operand constants 00...00 and 11...11; bus labels 32, 30 and 1]
mwe/PHD/36
SMI00 Test Structure Layout
mwe/PHD/37
SMI00 Test Structure
mwe/PHD/38
Carry Tree High-speed Outputs
16 x 146 ps
mwe/PHD/39
Comparisons of published adders
Reference   Type    Size       Gate Del.   Time
ZIMM96      Carry   32         5           -
STEL96      Adder   64(32)     12.5(12?)   -
WANG97      Adder   32         3           2.7ns
CHAN98      Adder   64(32)     27(19.5)    -
SILB98      Fixed   64         -           550 ps
AIPP99      Adder   64         -           660 ps
SAGE01      Adder   32[16x2]   -           <500ps
MATH01      Adder   64         -           482 ps
STAS01      Adder   64         -           440 ps
LEE02       Adder   64                     900 ps
VANA02      ALU     32         8           <200 ps
mwe/PHD/40
Cascode Output Stage
• Eliminates Miller capacitance between input and output
• Reduces Cjc and Cjs on outputs
• Shortens rise time, but increases delay
mwe/PHD/41
Dotted Emitter/Collector
mwe/PHD/42
“Wide/Short” gate with dotted emitter/collector
mwe/PHD/43
“Wide/Short” gate with dotted emitter/collector
• Shorter trees lead to lower supply voltages
• Wider trees reduce ratio of emitter-followers to terms computed, lowering total current
• More inputs per look-ahead gate means fewer look-ahead levels
• Elimination of single-ended inputs on critical H signals allows faster switching with reduced swing
mwe/PHD/44
Even wider look-ahead gate
Width limited by
• Accumulated Cjc and Cjs of dotted-and node
• Saturation vs. breakdown
• Fan-out loading from inputs and interconnect
mwe/PHD/45
Conclusions
• 32-bit addition depth reduced to 5 gates as fabricated. 4- and 3-gate-depth circuits designed.
• Gate to compute 3-way look-ahead fabricated. Up to 8-way look-ahead designed.
• Carry delay for 32-bit addition measured at 146 ps.
• QuickCap technology file for 5HP brings simulated results within 11% of measured.