10
ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2 24.2 A 240ps 64b Carry-Lookahead Adder in 90nm CMOS Sean Kao, Radu Zlatanovici, Borivoje Nikolic University of California, Berkeley, CA Fast and energy-efficient single-cycle 64b addition is essential for today’s high-performance microprocessor execution cores. The designer has several degrees of freedom to optimize the adder for performance and power. There is a choice of radix-2 or radix-4 trees, with full or sparse implementation of the conventional or Ling’s carry-lookahead (CLA) equations as well as the circuit design style. In this paper, the impact of these design choices on the performance and power of the domino CMOS adder are ana- lyzed and then confirmed using an optimization tool. An opti- mized topology, a sparse radix-4 Ling adder, is designed and fab- ricated in a general-purpose 90nm CMOS technology using stan- dard V TH transistors. It performs single-cycle 64b additions in 240ps and consumes 260mW at a nominal supply voltage of 1V and room temperature. In a high-performance integer execution unit, the microarchitec- ture sets the constraints for the adder design. The selected bit- slice height of 24 M1 tracks accommodates a wide-issue architec- ture. The bit-slice height and adder topology determine the length of the long ‘prefix’ wires in the carry-tree. In this study, the input capacitance per bit slice is limited to 27fF. The output load- ing capacitance of the adder equals its input capacitance, assum- ing that a buffer would be used to efficiently drive the local loop- back bus. The carry tree (computing the final carry[63:0]) and the final sum-select are in the critical path while the sum-precom- pute is non-critical (Fig. 24.2.1). Using a circuit-optimization framework [1], representative 64b CLA adder design choices are evaluated under these design constraints. The optimization framework is configured to minimize the adder delay subject to a maximum energy constraint by tuning transistor sizes. Ling’s pseudo-carry equations [2] reduce transistor stack height in the first stage of the carry tree [3]; in turn, they increase the complexity of the sum pre-computation. Figure 24.2.2 shows the energy–delay tradeoff curves for conventional CLA adders and Ling adders implemented with full (Kogge-Stone) radix-2 and radix-4 carry trees. For low delay targets, Ling adders are about 5% faster because of the reduced transistor stack in the first stage. However, when energy constraints are tighter, more com- plex sum-precompute blocks appear in the critical path, making Ling adders less efficient. Figure 24.2.2 also shows that radix-4 adders are more efficient than radix-2 adders for this technology and chosen design constraints. Radix-4 adders have fewer stages in the critical path composed of more complex gates with more branching than radix-2. The radix-4 adders are closer to the opti- mal number of stages with low output loads [4]. Sparse trees have been explored recently [5,6] because of their potential energy savings. Sparse trees compute only every second or fourth carry signal and use a more complex sum pre-computa- tion for the remaining bits. There are fewer “propagate” and “gen- erate” gates in the tree, but with a higher fanout at the final carry. The input gates in the critical path can be upsized, reduc- ing the effect of the internal wiring in the tree. As a tradeoff, there is a cost in the additional complexity and loading from the sum-precompute block. Figure 24.2.3 shows the energy-delay tradeoff curves for full and sparse domino Ling radix-2 and radix- 4 trees. Because of their larger stage count, radix-2 adders bene- fit more from sparse trees. Sparseness of 4 reduces the delay by 9% or the energy consumption by 50%. Increasing the sparseness for radix-4 adders shows a more delicate balance of tradeoffs between increased critical-path gate sizes and more complex sum-precompute. The radix-4 tree with the sparseness of 2 has the lowest delay (7.3 FO4) with less than 20% energy reduction compared to the full radix-4 tree. A single-rail domino radix-4 sparse-2 tree, implementing Ling’s pseudo-carry equations in the critical path and a static CMOS sum-precompute block in the non-critical path, is implemented in 7M1P 90nm CMOS. The adder core consists of 6028 transistors whose sizing is based on the results of the optimization frame- work. The size of the adder core is 417μm × 75μm. Figure 24.2.7 shows a micrograph of a part of the chip containing two adder cores and the corresponding testing circuitry. Figure 24.2.4 shows the block diagram of the adder, on-chip test- ing circuitry, and the corresponding timing waveforms. Delayed- precharge domino logic is used in the carry tree in order to hide the precharge phase from the overall cycle time. Most stages in the critical path use footless domino logic with stack node precharging, as shown in Fig. 24.2.5. Since the monotonicity of the global inputs a[63:0] and b[63:0] cannot be guaranteed, the first stage is implemented using footed domino logic. The inputs of the sum-select MUX, S 0 [63:0], S 1 [63:0], are outputs of a static block and non-monotonic, thus, psel must be a hard clock edge (Fig. 24.2.4). Critical timing edge arrivals can be fine tuned at runtime through the scan chain of the chip in order to ensure cor- rect functionality and best performance. The sparse-2 carry tree computes only even-order carries and each signal selects two sums (Fig. 24.2.5). The non-critical sum- precompute block has two types of paths: odd-order sums are pre- computed as a i + b i, and (a i + b i )’; even-order sums are a function of the carry into the previous bit and are more complex. The lay- out of critical and non-critical paths are interleaved such that the more complex even-order sum-precompute gates fit exactly in the space freed by the eliminated carry gates of the sparse tree, resulting in a compact bit-sliced layout. Measured results are presented in Fig. 24.2.6. At the nominal supply of 1V, the adder core runs at an average speed of 4.2GHz (240ps, approx. 7.7FO4) for the slowest input vector and con- sumes 260mW in the worst case, with 2.3mW of leakage at room temperature. The power measurements include the adder core and clock-generation circuitry, but not the on-chip testing circuit- ry. By increasing the supply voltage to 1.3V the delay decreases to 180ps with 606mW of active power and 4.9mW of leakage. Acknowledgments: This work was supported in part by NSF grant ECS-0238527, UC Micro and Intel Corp. The authors thank ST Microelectronics for test chip fabri- cation, Yong Yang for the help in the design of the test circuitry, Ryan Roberts for the board design. References: [1] R. Zlatanovici, B. Nikolic, “Power – Performance Optimization for Custom Digital Circuits,” Proc. PATMOS, pp. 404-414, Sept., 2005. [2] H. Ling, “High Speed Binary Adder,” IBM J. R&D, vol. 25, no. 3, pp. 156-166, May, 1981. [3] S. Naffziger, “A Sub-Nanosecond 0.5µm 64b Adder Design,” ISSCC Dig. Tech. Papers, pp. 362-363, Feb., 1996. [4] I. Sutherland, R. Sproull, H. Harris, “Logical Effort,” Morgan Kauffman 1999. [5] S. Mathew, et al, “A 4GHz 300mW 64b Integer Execution Unit ALU with Dual Supply Voltages in 90nm CMOS,” ISSCC Dig. Tech. Papers, pp. 162-163, Feb., 2004. [6] Y. Shimazaki, R. Zlatanovici, B. Nikolic, “A Shared-Well Dual-Supply Voltage 64-bit ALU,” ISSCC Dig. Tech. Papers, pp.104-105, Feb., 2003. 2006 IEEE International Solid-State Circuits Conference 1-4244-0079-1/06 ©2006 IEEE

ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL ...bora/publications/ISSCC06.pdfISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2 24.2 A 240ps 64b Carry-Lookahead

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL ...bora/publications/ISSCC06.pdfISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2 24.2 A 240ps 64b Carry-Lookahead

ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2

24.2 A 240ps 64b Carry-Lookahead Adder in 90nm CMOS

Sean Kao, Radu Zlatanovici, Borivoje Nikolic

University of California, Berkeley, CA

Fast and energy-efficient single-cycle 64b addition is essential fortoday’s high-performance microprocessor execution cores. Thedesigner has several degrees of freedom to optimize the adder forperformance and power. There is a choice of radix-2 or radix-4trees, with full or sparse implementation of the conventional orLing’s carry-lookahead (CLA) equations as well as the circuitdesign style. In this paper, the impact of these design choices onthe performance and power of the domino CMOS adder are ana-lyzed and then confirmed using an optimization tool. An opti-mized topology, a sparse radix-4 Ling adder, is designed and fab-ricated in a general-purpose 90nm CMOS technology using stan-dard VTH transistors. It performs single-cycle 64b additions in240ps and consumes 260mW at a nominal supply voltage of 1Vand room temperature.

In a high-performance integer execution unit, the microarchitec-ture sets the constraints for the adder design. The selected bit-slice height of 24 M1 tracks accommodates a wide-issue architec-ture. The bit-slice height and adder topology determine thelength of the long ‘prefix’ wires in the carry-tree. In this study, theinput capacitance per bit slice is limited to 27fF. The output load-ing capacitance of the adder equals its input capacitance, assum-ing that a buffer would be used to efficiently drive the local loop-back bus. The carry tree (computing the final carry[63:0]) and thefinal sum-select are in the critical path while the sum-precom-pute is non-critical (Fig. 24.2.1). Using a circuit-optimizationframework [1], representative 64b CLA adder design choices areevaluated under these design constraints. The optimizationframework is configured to minimize the adder delay subject to amaximum energy constraint by tuning transistor sizes.

Ling’s pseudo-carry equations [2] reduce transistor stack heightin the first stage of the carry tree [3]; in turn, they increase thecomplexity of the sum pre-computation. Figure 24.2.2 shows theenergy–delay tradeoff curves for conventional CLA adders andLing adders implemented with full (Kogge-Stone) radix-2 andradix-4 carry trees. For low delay targets, Ling adders are about5% faster because of the reduced transistor stack in the firststage. However, when energy constraints are tighter, more com-plex sum-precompute blocks appear in the critical path, makingLing adders less efficient. Figure 24.2.2 also shows that radix-4adders are more efficient than radix-2 adders for this technologyand chosen design constraints. Radix-4 adders have fewer stagesin the critical path composed of more complex gates with morebranching than radix-2. The radix-4 adders are closer to the opti-mal number of stages with low output loads [4].

Sparse trees have been explored recently [5,6] because of theirpotential energy savings. Sparse trees compute only every secondor fourth carry signal and use a more complex sum pre-computa-tion for the remaining bits. There are fewer “propagate” and “gen-erate” gates in the tree, but with a higher fanout at the finalcarry. The input gates in the critical path can be upsized, reduc-ing the effect of the internal wiring in the tree. As a tradeoff,there is a cost in the additional complexity and loading from thesum-precompute block. Figure 24.2.3 shows the energy-delaytradeoff curves for full and sparse domino Ling radix-2 and radix-4 trees. Because of their larger stage count, radix-2 adders bene-fit more from sparse trees. Sparseness of 4 reduces the delay by9% or the energy consumption by 50%. Increasing the sparsenessfor radix-4 adders shows a more delicate balance of tradeoffs

between increased critical-path gate sizes and more complexsum-precompute. The radix-4 tree with the sparseness of 2 hasthe lowest delay (7.3 FO4) with less than 20% energy reductioncompared to the full radix-4 tree.

A single-rail domino radix-4 sparse-2 tree, implementing Ling’spseudo-carry equations in the critical path and a static CMOSsum-precompute block in the non-critical path, is implemented in7M1P 90nm CMOS. The adder core consists of 6028 transistorswhose sizing is based on the results of the optimization frame-work. The size of the adder core is 417µm × 75µm. Figure 24.2.7shows a micrograph of a part of the chip containing two addercores and the corresponding testing circuitry.

Figure 24.2.4 shows the block diagram of the adder, on-chip test-ing circuitry, and the corresponding timing waveforms. Delayed-precharge domino logic is used in the carry tree in order to hidethe precharge phase from the overall cycle time. Most stages inthe critical path use footless domino logic with stack nodeprecharging, as shown in Fig. 24.2.5. Since the monotonicity ofthe global inputs a[63:0] and b[63:0] cannot be guaranteed, thefirst stage is implemented using footed domino logic. The inputsof the sum-select MUX, S0[63:0], S1[63:0], are outputs of a staticblock and non-monotonic, thus, psel must be a hard clock edge(Fig. 24.2.4). Critical timing edge arrivals can be fine tuned atruntime through the scan chain of the chip in order to ensure cor-rect functionality and best performance.

The sparse-2 carry tree computes only even-order carries andeach signal selects two sums (Fig. 24.2.5). The non-critical sum-precompute block has two types of paths: odd-order sums are pre-computed as ai + bi, and (ai + bi)’; even-order sums are a functionof the carry into the previous bit and are more complex. The lay-out of critical and non-critical paths are interleaved such that themore complex even-order sum-precompute gates fit exactly in thespace freed by the eliminated carry gates of the sparse tree,resulting in a compact bit-sliced layout.

Measured results are presented in Fig. 24.2.6. At the nominalsupply of 1V, the adder core runs at an average speed of 4.2GHz(240ps, approx. 7.7FO4) for the slowest input vector and con-sumes 260mW in the worst case, with 2.3mW of leakage at roomtemperature. The power measurements include the adder coreand clock-generation circuitry, but not the on-chip testing circuit-ry. By increasing the supply voltage to 1.3V the delay decreasesto 180ps with 606mW of active power and 4.9mW of leakage.

Acknowledgments:This work was supported in part by NSF grant ECS-0238527, UC Microand Intel Corp. The authors thank ST Microelectronics for test chip fabri-cation, Yong Yang for the help in the design of the test circuitry, RyanRoberts for the board design.

References:[1] R. Zlatanovici, B. Nikolic, “Power – Performance Optimization forCustom Digital Circuits,” Proc. PATMOS, pp. 404-414, Sept., 2005.[2] H. Ling, “High Speed Binary Adder,” IBM J. R&D, vol. 25, no. 3, pp.156-166, May, 1981.[3] S. Naffziger, “A Sub-Nanosecond 0.5µm 64b Adder Design,” ISSCC Dig.Tech. Papers, pp. 362-363, Feb., 1996.[4] I. Sutherland, R. Sproull, H. Harris, “Logical Effort,” Morgan Kauffman1999.[5] S. Mathew, et al, “A 4GHz 300mW 64b Integer Execution Unit ALUwith Dual Supply Voltages in 90nm CMOS,” ISSCC Dig. Tech. Papers, pp.162-163, Feb., 2004.[6] Y. Shimazaki, R. Zlatanovici, B. Nikolic, “A Shared-Well Dual-SupplyVoltage 64-bit ALU,” ISSCC Dig. Tech. Papers, pp.104-105, Feb., 2003.

• 2006 IEEE International Solid-State Circuits Conference 1-4244-0079-1/06 ©2006 IEEE

Page 2: ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL ...bora/publications/ISSCC06.pdfISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2 24.2 A 240ps 64b Carry-Lookahead

ISSCC 2006 / February 8, 2006 / 9:00 AM

Figure 24.2.1: 64b carry-lookahead adder structure. Figure 24.2.2: Ling versus CLA: Energy versus delay.

Figure 24.2.3: Full versus sparse trees: Radix-4 and radix-2 domino Ling adders in theenergy–delay space.

Figure 24.2.5: Adder circuit details and carry tree. Figure 24.2.6: Measured results.

Figure 24.2.4: Adder and testing circuitry block diagram and timing waveforms.

Generatesubtree

Propagatesubtree

G64

MUX

Carry[63:0]

S0[63:0]

S1[63:0]

Sum[63:0]

Sumprecompute

a[63:0], b[63:0], Cin Critical path

Non-critical path

0

10

20

30

40

50

6 7 8 9 10 11 12 13 14

Delay [FO4]

Ene

rgy

[pJ]

R2 Ling

R2 CLA

R4 Ling

R4 CLA

0

10

20

30

40

7 8 9 10 11 12 13 14 15

Delay [FO4]

Ene

rgy

[pJ]

R2 FullR2 SP2R2 SP4R4 FullR4 SP2R4 SP4

1

2

3

46

5

12

3

45

6

Scanchain Comparator

150

200

250

300

350

400

0.7 0.9 1.1 1.3 1.5

Supply Voltage [V]

Add

er D

elay

[ps]

0

200

400

600

150 200 250 300 350 400

Adder Delay [ps]

Pea

k P

ower

[mW

]

• 2006 IEEE International Solid-State Circuits Conference 1-4244-0079-1/06 ©2006 IEEE

Page 3: ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL ...bora/publications/ISSCC06.pdfISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2 24.2 A 240ps 64b Carry-Lookahead

• 2006 IEEE International Solid-State Circuits Conference 1-4244-0079-1/06 ©2006 IEEE

ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2

Figure 24.2.7: Micrograph of two adder cores and test circuitry.

ADDERCORE

TES

T CIR

CU

ITRY

ADDERCORE

TES

T CIR

CU

ITRY

CLOCK DRIVERS

Page 4: ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL ...bora/publications/ISSCC06.pdfISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2 24.2 A 240ps 64b Carry-Lookahead

• 2006 IEEE International Solid-State Circuits Conference 1-4244-0079-1/06 ©2006 IEEE

ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2

Figure 24.2.1: 64b carry-lookahead adder structure.

Generatesubtree

Propagatesubtree

G64

MUX

Carry[63:0]

S0[63:0]

S1[63:0]

Sum[63:0]

Sumprecompute

a[63:0], b[63:0], Cin Critical path

Non-critical path

Page 5: ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL ...bora/publications/ISSCC06.pdfISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2 24.2 A 240ps 64b Carry-Lookahead

• 2006 IEEE International Solid-State Circuits Conference 1-4244-0079-1/06 ©2006 IEEE

ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2

Figure 24.2.2: Ling versus CLA: Energy versus delay.

0

10

20

30

40

50

6 7 8 9 10 11 12 13 14

Delay [FO4]

Ene

rgy

[pJ]

R2 Ling

R2 CLA

R4 Ling

R4 CLA

Page 6: ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL ...bora/publications/ISSCC06.pdfISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2 24.2 A 240ps 64b Carry-Lookahead

• 2006 IEEE International Solid-State Circuits Conference 1-4244-0079-1/06 ©2006 IEEE

ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2

Figure 24.2.3: Full versus sparse trees: Radix-4 and radix-2 domino Ling adders in the energy–delay space.

0

10

20

30

40

7 8 9 10 11 12 13 14 15

Delay [FO4]

Ene

rgy

[pJ]

R2 FullR2 SP2R2 SP4R4 FullR4 SP2R4 SP4

1

2

3

46

5

12

3

45

6

Page 7: ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL ...bora/publications/ISSCC06.pdfISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2 24.2 A 240ps 64b Carry-Lookahead

• 2006 IEEE International Solid-State Circuits Conference 1-4244-0079-1/06 ©2006 IEEE

ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2

Figure 24.2.4: Adder and testing circuitry block diagram and timing waveforms.

Scanchain Comparator

Page 8: ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL ...bora/publications/ISSCC06.pdfISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2 24.2 A 240ps 64b Carry-Lookahead

• 2006 IEEE International Solid-State Circuits Conference 1-4244-0079-1/06 ©2006 IEEE

ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2

Figure 24.2.5: Adder circuit details and carry tree.

Page 9: ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL ...bora/publications/ISSCC06.pdfISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2 24.2 A 240ps 64b Carry-Lookahead

• 2006 IEEE International Solid-State Circuits Conference 1-4244-0079-1/06 ©2006 IEEE

ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2

Figure 24.2.6: Measured results.

150

200

250

300

350

400

0.7 0.9 1.1 1.3 1.5

Supply Voltage [V]

Add

er D

elay

[ps]

0

200

400

600

150 200 250 300 350 400

Adder Delay [ps]

Pea

k P

ower

[mW

]

Page 10: ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL ...bora/publications/ISSCC06.pdfISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2 24.2 A 240ps 64b Carry-Lookahead

• 2006 IEEE International Solid-State Circuits Conference 1-4244-0079-1/06 ©2006 IEEE

ISSCC 2006 / SESSION 24 / HIGH-PERFORMANCE DIGITAL CIRCUITS / 24.2

Figure 24.2.7: Micrograph of two adder cores and test circuitry.

ADDERCORE

TES

T CIR

CU

ITRY

ADDERCORE

TES

T CIR

CU

ITRY

CLOCK DRIVERS