High-Performance Arithmetic Challenges: From Architectures to Circuits

Ram K. Krishnamurthy, Sanu K. Mathew, Shekhar Borkar
Microprocessor Research, Intel Labs, Intel Corporation, Hillsboro, OR, USA

Prof. Vojin Oklobdzija
ACSEL Lab, Dept. of ECE, University of California, Davis, CA

16th IEEE International Computer Arithmetic Symposium, Santiago, June 18th, 2003

Outline

• Motivation
• Design choices for high-performance circuits
• SOI vs. bulk devices: ALU design test-case
  – 64-bit ALUs in PD-SOI and bulk CMOS
• Energy-efficient high-performance AGU/ALUs
  – 4GHz sparse-tree AGU design
  – 6.5-10GHz integer ALU design
• Summary

High-performance trends

• Frequency doubles every generation
• Performance-critical units
  – ALUs & AGUs
  – Register files, L0 caches
• Single-cycle latency & throughput

[Figure: processor clock frequency (MHz, log scale from 0.1 to 100000) vs. year (1970 to 2020), with points for the 8080, 8086, 386, Pentium® and Pentium® 4 processors and a 15-30 GHz annotation.]

64-bit ALUs in 0.18µm PD-SOI/Bulk CMOS: Design & Scaling Trends

[S. Mathew et al, ISSCC 2001]
[S. Mathew et al, JSSC, Nov 2001]

Design choices

• High performance devices:
  – Partially depleted Silicon-on-Insulator
  – Pros & cons vs. bulk CMOS
  – Scaling trends
• High performance circuit design:
  – Sparse-tree semi-dynamic AGU
  – Single-rail dynamic ALU

PD-SOI Devices

• Body of devices not tied to Vcc/Vss
• Body is isolated by buried oxide: floating body!

[Figure: PD-SOI cross-section on a P-substrate: buried oxide layer, an n+/n+ device with P-type body and a p+/p+ device with N-type body, separated by STI.]

History Effect in PD-SOI

• Delay = function of switching history
  – Capacitive coupling from S/G/D
  – Impact ionization, diode conduction
  – Transient Vbs differs from DC Vbs
• Complicates timing analysis

[Figure: PD-SOI NMOS cross-section (gate, n+ source and drain, buried oxide, backgate) showing the floating body potential and the coupling capacitances Cgb, Csb, Cdb and Cbox.]

64-bit ALU architecture

• Ideal test-bed for evaluating process technologies

[Figure: 64-bit ALU block diagram. Labels: external operands, two 9:1 muxes with mux control, sign control, shift control, 5:1 mux, single-rail adder core, Sum, 3:1 mux, 2:1 mux, 1200µm loopback bus, 0.5pF.]

High-performance Adders: Kogge-Stone

• Generate all carries: full-blown binary tree, energy-inefficient
• # Carry-merge stages = log2(N)
• Carry-merge gate: GG = Gi + Pi·Gi-1, GP = Pi·Pi-1

[Figure: Kogge-Stone carry-merge tree for odd and even input bits: PG generation, carry-merge stages CM1 to CM5, and final XORs producing Sum_odd and Sum_even; gate stages numbered 1 to 7.]
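The carry-merge recurrence on this slide can be sketched in software. The following is a minimal bit-level model, not the authors' circuit: the function name `kogge_stone_add` and the zero carry-in are assumptions made for illustration. It applies GG = Gi + Pi·Gj and GP = Pi·Pj at every bit position in each stage, so the number of stages is log2(N).

```python
def kogge_stone_add(a: int, b: int, n: int = 64):
    """Add two n-bit integers with a Kogge-Stone style prefix tree.

    Hypothetical helper for illustration. Carry-in is assumed to be 0.
    Returns (sum mod 2**n, carry_out, number_of_carry_merge_stages).
    """
    # PG generation: per-bit generate and propagate.
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]

    stages = 0
    span = 1
    while span < n:
        # Carry-merge at every bit: GG = Gi + Pi*Gj, GP = Pi*Pj, j = i - span.
        G_next, P_next = G[:], P[:]
        for i in range(span, n):
            G_next[i] = G[i] | (P[i] & G[i - span])
            P_next[i] = P[i] & P[i - span]
        G, P = G_next, P_next
        span *= 2
        stages += 1

    # After the tree, G[i] is the carry out of bit i.
    carries_in = [0] + G[:-1]
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries_in[i]) << i  # final sum XOR
    return total, G[n - 1], stages
```

For n = 64 the loop runs six carry-merge stages, and every stage touches nearly every bit position, which is the source of the energy cost noted on the slide.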

64-bit Han-Carlson adder core

• Carry-merge done on even bitslices
• 50% fewer carry-merge gates vs. Kogge-Stone
• Extra logic stage generates odd carries

[Figure: 64-bit Han-Carlson tree over bits b63 to b0: PG generator, carry-merge stages CM0 to CM5, odd carry generator and sum XOR, with odd and even bit slices marked; gate types 3N, 2P and 2N.]
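A rough software model can illustrate where the gate saving comes from. This is a sketch under assumptions: the names `han_carlson_add` and `kogge_stone_cells` are hypothetical, the carry-in is 0, N is a power of two, and the prefix tree is placed on odd bit indices (the slide labels its bit slices by a different convention). It counts merge cells only and says nothing about the actual gate types.

```python
def han_carlson_add(a: int, b: int, n: int = 64):
    """Han-Carlson style adder model (illustrative, carry-in assumed 0).

    Returns (sum mod 2**n, tree_cells, extra_stage_cells).
    """
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    tree_cells = 0

    # First merge level: each odd bit merges with its even neighbour.
    for i in range(1, n, 2):
        G[i], P[i] = g[i] | (p[i] & g[i - 1]), p[i] & p[i - 1]
        tree_cells += 1

    # Prefix tree on every second bit only.
    span = 2
    while span < n:
        G_next, P_next = G[:], P[:]
        for i in range(span + 1, n, 2):
            G_next[i] = G[i] | (P[i] & G[i - span])
            P_next[i] = P[i] & P[i - span]
            tree_cells += 1
        G, P = G_next, P_next
        span *= 2

    # Extra logic stage: the remaining carries from their lower neighbour.
    extra_cells = 0
    for i in range(2, n, 2):
        G[i] = g[i] | (p[i] & G[i - 1])
        extra_cells += 1

    carries_in = [0] + G[:-1]
    total = sum((p[i] ^ carries_in[i]) << i for i in range(n))
    return total, tree_cells, extra_cells


def kogge_stone_cells(n: int = 64) -> int:
    """Merge cells in a full Kogge-Stone tree: one per bit with a partner."""
    cells, span = 0, 1
    while span < n:
        cells += n - span
        span *= 2
    return cells
```

Under these counting assumptions, a 64-bit tree uses 161 merge cells against 321 for the full Kogge-Stone count, which is in line with the roughly 50% figure on the slide; the extra stage adds 31 more cells in this model.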

Energy-efficient adder core

• 43% less energy/transition at iso-performance

Adder architecture   Energy/transition
Kogge-Stone          120pJ
Han-Carlson          68pJ

Han-Carlson carry-merge tree

• Single-rail adder core
• CSG circuit generates dual-rail carry
• CSG: complementary signal generator

[Figure: carry-merge tree from PG generation (3N) through CM0 to CM6 on alternating 2P/2N gates. Even inputs produce Ceven; odd inputs pass through the odd carry generator to produce Codd. The CSG marks the boundary between the single-rail and dual-rail sections.]

Complementary signal gen.

• Domino-compatible Carry/Carry#
• Permits a single-rail carry-merge tree design
• Not time-borrowable
  – Penalty absorbed by placing gate at φ2 boundary

[Figure: CSG circuit with input Cin_i, a true pull-down path and a complementary pull-down path, each with a keeper, producing Carry_i and Carry#_i.]

Partial sum generator

• Generates domino-compatible partial sum
• Placing the gate at φ1 boundary mitigates output noise-glitches

[Figure: partial sum generator circuit clocked by φ1, with inputs Ai and Bi, a keeper, and outputs Psum_i, Pi and Gi.]

ALU performance in bulk CMOS

64b Han-Carlson ALU simulation results (0.18µm bulk CMOS, Vcc=1.5V)

ALU delay    482ps
Adder core   310ps

[Figure: ALU critical path: input, 9:1 mux, adder core (3N, then alternating 2P/2N stages, XOR, 2P sum), 5:1 mux, 3:1 mux, bus driver and 1200µm bus; clock phases φ1 and φ2 marked.]

Porting from bulk to PD-SOI

Flow: bulk design, direct port to an SOI design, then SOI-favored redesign to an SOI-optimal design.

• Design issues:
  – Noise tolerance due to lowered Vt
  – Min-delay timing analysis
• Motivation for redesign:
  – Reduced SOI stack penalty
  – Deeper stack design
  – Stage reduction
• Design choices:
  – Architecture should favor deep stack design
  – Avoid increase in fanouts

0.18µm Bulk & PD-SOI technologies

• Equal IOFF at DC Vbs
• SOI IDSAT is 1-2% lower

            Ioff (nA/µm)   Idsat (µA/µm)
NMOS-Bulk   3.3            1070
NMOS-SOI    3.3            1050
PMOS-Bulk   0.7            445
PMOS-SOI    0.7            441

History effect measurements in 0.18µm PD-SOI

• 11% history effect variation
• These gates are used in the ALU design: 5-11% delay variation
• Measurements agree with simulation results

[Figure: normalized delay (0.8 to 1) vs. pulse width (10ns to 100µs) for a transmission gate chain, a 3-NFET-stack chain and an inverter chain.]

Direct port of Han-Carlson ALU to PD-SOI

64b Han-Carlson ALU delay simulations (0.18µm technology, Vcc=1.5V)

                     Delay   % delay improvement over bulk
Bulk                 482ps   -
Direct-port to SOI   403ps   16%

• Adder core speedup = 14%
  – [Stasiak et al., ISSCC 2000]: 21% speedup

Speedup analysis

Stage type      Speedup over bulk from direct port to 0.18µm PD-SOI
Static gates    12-15%
Dynamic gates   2-9%
3:1 TG Mux      20%
5:1 TG Mux      23%
9:1 TG Mux      35%

• Diffusion dominated muxes: max. speedup
• Load dominated gates: speedup decreases

Motivation for PDSOI-optimal redesign

• Reduced stack penalty in SOI
• Deeper stack design: stage reduction
• ALU is amenable to such a redesign
  – Not true for all CPU critical paths
• SOI-optimal ALU architecture
  – Increasing stack depth must not increase fanouts
• A novel deep-stack sparse-tree ALU was developed

Sparse-tree adder core

• 50% reduced fanouts compared to Han-Carlson
• 7 gate stages (two fewer than Han-Carlson)

[Figure: 64-bit sparse-tree adder over bits b63 to b0. Gate stages: PG generator, 2N, 2P, 4N, 2P, 3N, Mux. The fast carry-merge tree combines 2-bit groups (1:0, 3:2, ..., 63:62) into 8-bit groups (7:0, 15:8, ..., 47:40), then 16-bit groups (15:0, 31:16, 47:32), then 31:0 and 47:0. Intermediate carry generators and per-group sum generators sit alongside the tree.]

Intermediate Carry Generator

• Generates 1 in 4 carries (C3, C7, C11, ..., C59)
• Non-critical path (ripple carry-select scheme)
• Fast carry selects between the conditional carries

[Figure: two rippled chains of carry-merge (CM) gates, one assuming carry-in 0 and one assuming carry-in 1, fed by group signals P3:0/G3:0, P7:4/G7:4 and P11:8/G11:8. 2:1 muxes controlled by the carry from the fast CM chain select C3, C7 and C11.]
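The ripple carry-select scheme above can be modelled in a few lines. This is a sketch, not the authors' circuit: `group_pg` and `intermediate_carries` are hypothetical names, groups are assumed to be 4 bits wide, and `carry_in` stands for the carry delivered by the fast carry-merge chain into the block starting at bit `block_lo`.

```python
def group_pg(a: int, b: int, lo: int, width: int = 4):
    """Group generate/propagate over bits lo .. lo+width-1 (illustrative)."""
    G, P = 0, 1
    for i in range(lo, lo + width):
        gi = (a >> i) & (b >> i) & 1
        pi = ((a >> i) ^ (b >> i)) & 1
        G, P = gi | (pi & G), pi & P
    return G, P


def intermediate_carries(a: int, b: int, block_lo: int, carry_in: int, groups: int = 3):
    """Carries out of each 4-bit group inside one block.

    Two conditional carry chains ripple across the groups, one assuming a
    block carry-in of 0 and one assuming 1. The real block carry then picks
    between them, which models the 2:1 muxes.
    """
    c0, c1 = 0, 1
    selected = []
    for k in range(groups):
        G, P = group_pg(a, b, block_lo + 4 * k)
        c0 = G | (P & c0)  # conditional carry if block carry-in were 0
        c1 = G | (P & c1)  # conditional carry if block carry-in were 1
        selected.append(c1 if carry_in else c0)
    return selected
```

With `block_lo=0` and `carry_in=0` the returned list corresponds to C3, C7 and C11. Both chains can be evaluated before the fast carry arrives, so only the final mux select depends on the fast path.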

Non-critical Sum Generator

• Non-critical path: ripple carry chain
• Reduced area, energy consumption, leakage
• Generate conditional sums for each bit
• 1 in 4 carry selects appropriate sum

[Figure: 4-bit conditional sum generator. Inputs Pi, Gi+1/Pi+1, Gi+2/Pi+2 and Gi+3/Pi+3 feed two rippled carry-merge (CM) chains, one for carry-in 0 and one for carry-in 1. XOR gates form the conditional sums Sum_i,0 and Sum_i,1, and 2:1 muxes controlled by Carry select Sum_i to Sum_i+3.]
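The conditional-sum idea can be sketched as follows. The function names `conditional_sums` and `select_sum` are hypothetical, and the model works on lists of per-bit propagate (XOR) and generate (AND) signals for one 4-bit group, least significant bit first.

```python
def conditional_sums(p, g):
    """Compute both candidate sums for one group (illustrative sketch).

    p, g: per-bit propagate and generate signals, LSB first.
    Returns (sums assuming carry-in 0, sums assuming carry-in 1).
    """
    out = []
    for assumed_cin in (0, 1):
        c, sums = assumed_cin, []
        for pi, gi in zip(p, g):
            sums.append(pi ^ c)   # sum bit for this assumed carry
            c = gi | (pi & c)     # ripple the carry to the next bit
        out.append(sums)
    return out[0], out[1]


def select_sum(sum0, sum1, carry):
    """2:1 mux: the 1-in-4 carry picks the matching precomputed sum."""
    return sum1 if carry else sum0
```

Because both candidates are ready before the selecting carry arrives, the ripple chain stays off the critical path.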

Sparse-tree adder critical path

• Fast carry-merge path: critical path
• Non-critical side-paths: ripple-carry

[Figure: from Input to Sumout. Fast carry-merge path: 2N, 2P, 4N, 2P, 3N. Side paths through the intermediate carry generator and sum generator: Inv, 3N, 2P, 2N, 2P, 3N.]

PD-SOI optimal redesign in 0.18µm

• Deeper stack redesign: additional 5% speedup

64b ALU delay simulations   Delay   Speedup over bulk
Bulk                        482ps   -
Direct-port SOI             403ps   16%
SOI-optimal redesign        380ps   21%


Margining for reverse-body bias in PD-SOI

• 400mV reverse bias increases rise-delay by 10%
• Difficult to detect for large circuits
• 10% margin required for all max-delay paths
• Overall PD-SOI speedup reduces to 11%

Reducing reverse-bias penalty in dynamic SOI gates

• Point solution for dynamic designs
• Pre-charging stack node decreases penalty to 2%
• Max-delay margin reduced to 2%
• Cost: 5% increase in clock energy

[Figure: two-input dynamic gate with inputs A and B and output Out. Labels: M1, P0, stack node, Body-A, Body-B.]

0.18µm ALU performance after margining

• Maximum PD-SOI speedup reduces to 19%

64b ALU delay simulations   Delay   Speedup over bulk   Speedup after margining
Bulk                        482ps   -                   -
Direct-port SOI             403ps   16%                 14%
SOI-optimal redesign        380ps   21%                 19%


Scaling to 0.13µm technologies

• Equal SOI & bulk IOFF-DC
• MOSFET & impact ionization data obtained from 0.13µm bulk measurements
• SOI parasitic BJT/diode characteristics unchanged from 0.18µm fitting

Scaling ALU designs to 0.13µm technology

• Maximum PD-SOI speedup reduces to 16%

64b ALU delay simulations   Delay   Speedup over bulk   Speedup after margining
Bulk                        351ps   -                   -
Direct-port SOI             312ps   11%                 9%
SOI-optimal redesign        286ps   18%                 16%
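The speedup columns in the 0.18µm and 0.13µm tables can be reproduced from the quoted delays. This is a sketch based on an observation, not a stated method: the figures match when the delay improvement is truncated to a whole percent and the max-delay margin is subtracted in percentage points. The function names are hypothetical.

```python
def speedup_pct(bulk_ps: int, soi_ps: int) -> int:
    """Delay improvement over bulk in percent, truncated to a whole number.

    Truncation is an assumption that happens to match the slide figures.
    """
    return (100 * (bulk_ps - soi_ps)) // bulk_ps


def after_margin(speedup: int, margin_pct: int = 2) -> int:
    """Subtract the max-delay margin in percentage points (assumed method)."""
    return speedup - margin_pct


# Delays quoted in the slides, in picoseconds: (bulk, SOI).
cases = {
    "0.18um direct port": (482, 403),
    "0.18um SOI-optimal": (482, 380),
    "0.13um direct port": (351, 312),
    "0.13um SOI-optimal": (351, 286),
}
for name, (bulk, soi) in cases.items():
    s = speedup_pct(bulk, soi)
    print(f"{name}: {s}% speedup, {after_margin(s)}% after a 2% margin")
```

The same subtraction with the unmitigated 10% margin gives 21% - 10% = 11% for the 0.18µm SOI-optimal design, which is the figure quoted before stack-node pre-charging is introduced.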


SOI vs. bulk Summary

• 482ps energy-efficient dynamic 64b ALU in 0.18µm bulk
  – 310ps adder core
• Direct port to 0.18µm SOI: 14% speedup
• SOI-optimal redesign: 19% speedup
• Floating body can get reverse-biased
  – Preconditioning reduces margin from 10% to 2%
• Scaling to 0.13µm decreases PD-SOI speedup
• Maximum PD-SOI speedup in 0.13µm falls to 16%

High-Performance Low Power Datapath design

• Goal: Shift the E-D curve

[Figure: energy vs. delay curve.]

A 4GHz 130nm Address Generation Unit with 32-bit Sparse-tree Adder Core

[S. Mathew et al, VLSI Symp. 2002]
[S. Mathew et al, JSSC May 2003]

Motivation

• AGUs: performance and peak-current limiters
• High activity: thermal hotspot
• Goal: high-performance energy-efficient design

[Figure: processor thermal map (temperature in °C) with the cache, execution core and AGU marked, and a 120°C label.]

AGU Architecture

• Single-cycle latency and throughput
• Effective Address = Base + Index*Scale + (Segment + Displacement)
• 2-phase address computation

[Figure: AGU datapath with 32-bit buses. Index passes through a 3b shift; Segment and Displacement are summed; Base, the shifted Index and that sum feed a 3:2 compressor, followed by a 32b adder that produces the Effective Address. Clocks: clk, clk2, clk3.]
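A small arithmetic model makes the two-phase computation concrete. It is a sketch under assumptions: the names `csa_3to2` and `effective_address` are hypothetical, the scale is expressed as a left-shift amount, Segment + Displacement is treated as one pre-summed term, and all values wrap at 32 bits.

```python
MASK32 = 0xFFFFFFFF


def csa_3to2(x: int, y: int, z: int):
    """3:2 compressor: reduce three words to a sum word and a carry word."""
    s = x ^ y ^ z                                # bitwise sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1       # majority bits, shifted one place up
    return s & MASK32, c & MASK32


def effective_address(base: int, index: int, scale_shift: int,
                      segment: int, displacement: int) -> int:
    """Base + Index*Scale + (Segment + Displacement), modulo 2**32."""
    seg_disp = (segment + displacement) & MASK32       # assumed pre-summed term
    scaled = (index << scale_shift) & MASK32           # index pre-scaled by shifting
    s, c = csa_3to2(base & MASK32, scaled, seg_disp)   # phase 1: carry-save form
    return (s + c) & MASK32                            # phase 2: carry-propagate add


# Example: base 0x1000, index 3 scaled by 4, segment 0x20000, displacement 0x10.
print(hex(effective_address(0x1000, 3, 2, 0x20000, 0x10)))  # 0x2101c
```

The compressor has no carry chain, so phase 1 is a single gate-level step per bit; only the phase 2 adder needs a carry tree.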

AGU Operation: Phase 1

• Index pre-scaled via 3-bit barrel shifter
• 3:2 compressor renders partial address: carry-save format
• Adder in pre-charge state

[Figure: AGU datapath with the 3b shift and 3:2 compressor highlighted; the compressor outputs are in carry-save format.]

AGU Operation: Phase 2

• Carry-save to binary format conversion: 2's complement parallel 32-bit adder

[Figure: AGU datapath with the 32b adder and the Effective Address output highlighted.]

Kogge-Stone Adder

• Critical path = PG + 5 + XOR = 7 gate stages
• Generate, Propagate fanout of 2, 3
• Maximum interconnect spans 16b
• Energy inefficient

[Figure: 32-bit Kogge-Stone tree over bit columns 0 to 31: PG stage, rows of carry-merge gates, and final XOR stage.]

Sparse-tree Adder Architecture

• Generate every 4th carry in parallel
• Side-path: 4-bit conditional sum generator
• 73% fewer carry-merge gates: energy-efficient

[Figure: 32-bit sparse tree over bits 0 to 31, arranged in 4-bit groups, producing carries C3, C7, C11, C15, C19, C23 and C27.]
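The "every 4th carry" idea can be sketched as a prefix tree that runs across 4-bit groups rather than across individual bits. This is a simplified model with a hypothetical function name, a carry-in of 0, and no attempt to mirror the gate-level structure. It also produces C31, which a 32-bit sum does not need.

```python
def sparse_tree_carries(a: int, b: int, n: int = 32, grp: int = 4):
    """Return the carries out of bits 3, 7, 11, ... (illustrative sketch)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    # Collapse each 4-bit group into one generate/propagate pair.
    G, P = [], []
    for lo in range(0, n, grp):
        Gg, Pg = 0, 1
        for i in range(lo, lo + grp):
            Gg, Pg = g[i] | (p[i] & Gg), p[i] & Pg
        G.append(Gg)
        P.append(Pg)

    # Prefix tree across groups only, so far fewer merge cells are needed
    # than in a tree that produces a carry at every bit.
    span = 1
    while span < len(G):
        G_next, P_next = G[:], P[:]
        for k in range(span, len(G)):
            G_next[k] = G[k] | (P[k] & G[k - span])
            P_next[k] = P[k] & P[k - span]
        G, P = G_next, P_next
        span *= 2
    return G  # G[k] is the carry out of bit 4k+3
```

Each returned carry would then select between the two precomputed sums of the next 4-bit group on the side path.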

Non-critical Sum Generator

• Non-critical path: ripple carry chain
• Reduced area, energy consumption, leakage
• Generate conditional sums for each bit
• Sparse-tree carry selects appropriate sum

[Figure: 4-bit conditional sum generator. Inputs Pi, Gi+1/Pi+1, Gi+2/Pi+2 and Gi+3/Pi+3 feed two rippled carry-merge (CM) chains, one for carry-in 0 and one for carry-in 1. XOR gates form the conditional sums Sum_i,0 and Sum_i,1, and 2:1 muxes controlled by Carry select the final sums.]

Optimized First-level Carry-merge: Conditional Carry for Cin=0

• Carry-merge stage reduces to inverter
• Conditional carry_0 = Gi#

[Figure: first-level carry-merge gate CM0 with Cin=0 and inputs Pi and Gi, reduced to an inverter from Gi to C#_0.]

Optimized First-level Carry-merge: Conditional carry for Cin=1

• Pi & Gi correlated
• Conditional carry_1 = Pi#

Ai   Bi   Pi   Gi   C#_1
0    0    0    0    1
0    1    1    0    0
1    0    1    0    0
1    1    1    1    0

[Figure: first-level carry-merge gate CM1 with Cin=1 and inputs Pi and Gi, reduced to an inverter from Pi to C#_1.]
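The two first-level simplifications can be checked by enumerating the four input combinations. This sketch assumes the OR form of propagate (Pi = Ai OR Bi), which is what the truth table on the slide shows; the function name is hypothetical.

```python
def first_level_conditional_carries(a_i: int, b_i: int):
    """Return (g, p, carry assuming Cin=0, carry assuming Cin=1) for one bit."""
    g = a_i & b_i              # generate
    p = a_i | b_i              # propagate, OR form as in the slide's table
    carry_cin0 = g | (p & 0)   # carry-merge with carry-in forced to 0
    carry_cin1 = g | (p & 1)   # carry-merge with carry-in forced to 1
    return g, p, carry_cin0, carry_cin1


for a_i in (0, 1):
    for b_i in (0, 1):
        g, p, c0, c1 = first_level_conditional_carries(a_i, b_i)
        # With Cin=0 the carry is just Gi; with Cin=1 it is just Pi,
        # because Gi=1 implies Pi=1 (the two signals are correlated).
        assert c0 == g and c1 == p
        print(a_i, b_i, p, g, "C#_0 =", 1 - c0, "C#_1 =", 1 - c1)
```

The complemented outputs C#_0 and C#_1 are therefore the inversions of Gi and Pi, so each first-level carry-merge gate collapses to a single inverter.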

Optimized Sum Generator

• Optimized 1st-level carry-merge
• Optimized non-critical path: 4 stages

[Figure: 4-bit conditional sum generator with the first-level carry-merge gates removed. Inputs Pi, Gi+1/Pi+1, Gi+2/Pi+2 and Gi+3/Pi+3; remaining CM gates, XORs forming Sum_i,0 and Sum_i,1, and 2:1 muxes controlled by Carry.]

Adder Core Critical Path

• Critical path: 7 gate stages, same as KS
• Sparse-tree: single-rail dynamic
• Exploit non-criticality of sum generator
  – Convert to static logic
• Semi-dynamic design

[Figure: path from the adder inputs to Sum31. The single-rail dynamic sparse-tree path runs through PG, GG1, GG3, GG7, GG15 and GG27 to produce C27. The static sum generator (latch, CM0, CM1, XOR) produces Sum31_0 and Sum31_1, and C27 selects Sum31. Clocks: clk, clk2, clk3.]

1st-level Carry-merge: Static Latch

• Holds state in pre-charge phase
• Prevents pre-charging of static stages

[Figure: clocked static latch circuit with inputs Pi, Gi, Gi-1 and clk, and output C#i.]

Domino-Static Interface

• Sum = Sum0 during pre-charge
• Mux output resolves during evaluation

[Figure: interface circuit shown for clk=0 and clk=1. Dynamic 2P/2N carry-merge gates with inputs Gi, Pi, G#i-1 produce Carry#i, which drives a mux selecting between Sum0_i and Sum1_i to produce Sum_i.]

Sparse-tree Architecture

• Performance impact: (20% speedup)
  – 33-50% reduced G/P fanouts
  – 80% reduced wiring complexity
  – 30% reduction in maximum interconnect
• Power impact: (56% reduction)
  – 73% fewer carry-merge gates
  – 50% reduction in average transistor size

Energy-delay Space

• 20% speedup over Kogge-Stone
• 56% worst-case energy reduction
  – Scales with activity factor

[Figure: worst-case energy (pJ, 0 to 100) vs. delay (ps, 140 to 280) for the dynamic Kogge-Stone and semi-dynamic sparse-tree adders, with the 4GHz design point and the 20% delay and 56% energy differences marked. 130nm CMOS, 1.2V, 110°C simulation.]

Semi-dynamic Design

• Static sum generators: low switching activity
• 71% lower average energy at 10% activity

[Figure: average energy (pJ, 0 to 40) vs. activity factor (0 to 0.5) for the dynamic Kogge-Stone and semi-dynamic sparse-tree adders, with the 71% difference marked.]

Dual-Vt Allocation

• Exploit non-criticality of sidepaths: use high-Vt devices
• 0% performance penalty
• 56% reduction in active leakage energy

                   Low-Vt   Dual-Vt
Delay              152ps    152ps
Switching energy   36pJ     34pJ (-6%)
Leakage energy     0.9pJ    0.4pJ (-56%)

130nm CMOS, 1.2V, 110°C simulation

Scaling Performance

• Average transistor size = 3.5µm: reduces impact of increasing leakage
• 80% reduction in wiring complexity: reduces impact of wire resistance
• 33% delay scaling, 50% energy reduction

                   130nm   100nm
Delay              152ps   102ps (-33%)
Switching energy   36pJ    18pJ (-50%)
Leakage energy     0.9pJ   0.7pJ (-23%)

A 6.5GHz, 130nm Single-ended Dynamic ALU

[M. Anders et al, ISSCC 2002]
[S. Vangal et al, JSSC November 2002]

32-bit ALU/Scheduler Loop

• Performance-critical execution core loop

[Figure: two ALU/scheduler pairs (ALU 0 and ALU 1, each marked X8). Each ALU has 5:1 input muxes fed from the RF and FIFO, and 32-bit dual-rail outputs (sum0/sum0#, sum1/sum1#) that return to the RF and FIFO. The schedulers exchange 8-bit sched0/sched0# and sched1/sched1# signals.]

Han-Carlson ALU Organization

• Single-rail dynamic 9-stage low-Vt design

[Figure: bit slices 31 to 0. RF operand and FIFO operands enter 5:1 muxes (5:1 mux control). Stages from top to bottom: Propagate/Generate/Partial Sum (dynamic), Carry merge 0 (static), Carry merge 1 (dynamic), ..., Carry merge 3 (dynamic), ..., Carry merge 5 (CSG) / Sum. Sum and Sum# outputs return on an 84µm loopback bus.]

Odd-bits CSG Sum Generation

• Final carry-merge CSG (dual-rail carry output) → pass-transistor sum XOR

[Figure: odd-bit CSG carry-merge circuit clocked by φ2, with inputs gi#, gi-1# and pi#, producing Carry_i and Carry#_i. These drive the sum generation stage, which combines them with Psum_i to produce Sum_i and Sum#_i.]

Even-bits CSG Sum Generation

• Domino-compatible sum
• Dual-rail sum from single-ended g inputs

[Figure: even-bit CSG carry-merge circuit clocked by φ2, with input gi#, producing Carry_i and Carry#_i. These drive the sum generation stage, which combines them with Psum_i to produce Sum_i and Sum#_i.]

Die Micro-photograph

• 130nm 6-metal dual-Vt CMOS
• Scheduler: 210μm x 210μm
• ALU: 84μm x 336μm

[Figure: die micro-photograph with the scheduler and ALU marked.]

Delay and Power Measurements

• 6.5GHz at 1.1V, 25ºC
• Power: 120mW total, 15mW leakage
• Scalable to 10GHz at 1.7V, 25ºC

[Figure: two plots at 25ºC against supply voltage (0.8V to 1.5V). Left: power and leakage power (mW, 0 to 450). Right: Fmax (GHz, 0 to 9). The design target is marked.]

Improvements Over Dual-rail Domino

Area                  50%
Performance (Delay)   10%
Active leakage        40%
Robustness            equal

• Leakage reduced by eliminating dual-rail logic
• Robustness not compromised
• CSG improves both area and performance

Summary

• 4GHz AGU in 1.2V, 130nm technology
  – Sparse-tree adder architecture described
  – 20% speedup and 56% energy reduction
  – Semi-dynamic design: energy scales with switching activity
  – Dual-Vt non-critical paths: low active leakage energy
• 6.5GHz ALU and scheduler loop at 1.1V, 25ºC
  – Scalable to 10GHz at 1.7V, 25ºC
