High-Performance Arithmetic Challenges: From Architectures to Circuits

Ram K. Krishnamurthy, Sanu K. Mathew, Shekhar Borkar
Microprocessor Research, Intel Labs
Intel Corporation, Hillsboro, OR, USA
[email protected]

Prof. Vojin Oklobdzija
ACSEL Lab, Dept. of ECE
University of California, Davis, CA, USA
[email protected]

16th IEEE International Computer Arithmetic Symposium, Santiago, June 18th 2003

Outline

• Motivation
• Design choices for high-performance circuits
• SOI vs. bulk devices: ALU design test-case
  – 64-bit ALUs in PD-SOI and bulk CMOS
• Energy-efficient high-performance AGU/ALUs
  – 4GHz sparse-tree AGU design
  – 6.5-10GHz integer ALU design
• Summary

High-performance trends

• Frequency doubles every generation
• Performance-critical units:
  – ALUs & AGUs
  – Register files, L0 caches
• Single-cycle latency & throughput

[Figure: clock frequency (MHz, log scale from 0.1 to 100000) vs. year (1970 to 2020), with points for the 8080, 8086, 386, Pentium® and Pentium® 4 processors and a 15-30 GHz projection.]

64-bit ALUs in 0.18µm PD-SOI/Bulk CMOS: Design & Scaling Trends

[S. Mathew et al, ISSCC 2001]
[S. Mathew et al, JSSC, Nov 2001]

Design choices

• High-performance devices:
  – Partially depleted Silicon-on-Insulator (PD-SOI)
  – Pros & cons vs. bulk CMOS
  – Scaling trends
• High-performance circuit design:
  – Sparse-tree semi-dynamic AGU
  – Single-rail dynamic ALU

PD-SOI Devices

• Body of devices not tied to Vcc/Vss
• Body is isolated by buried oxide → floating body!

[Figure: PD-SOI device cross-section with n+ and p+ source/drain regions, P-type and N-type bodies, STI isolation, buried oxide, and P-substrate.]

History Effect in PD-SOI

• Delay = function of switching history
  – Capacitive coupling from S/G/D
  – Impact ionization, diode conduction
  – Transient Vbs ≠ DC Vbs
• Complicates timing analysis

[Figure: SOI NMOS cross-section (source S, gate G, drain D, n+ regions, buried oxide, backgate) showing the body potential and the coupling capacitances Cgb, Csb, Cdb and Cbox.]

64-bit ALU architecture

• Ideal test-bed for evaluating process technologies

[Figure: 64-bit ALU block diagram. Labels: external operands, two 9:1 muxes, 5:1 mux, 3:1 mux, 2:1 mux, mux control, shift control, sign control, single-rail adder core, sum, 1200µm loopback bus, 0.5pF.]

High-performance Adders: Kogge-Stone

• Generate all carries: full-blown binary tree → energy-inefficient
• # Carry-merge stages = log2(N)
• Carry-merge (CM) gate: GG = Gi + Pi·Gi-1, GP = Pi·Pi-1

[Figure: Kogge-Stone critical paths for odd and even input bits, each passing through 7 numbered stages: PG Gen., CM1, CM2, CM3, CM4, CM5, XOR, producing Sum_odd and Sum_even.]
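The carry-merge recurrence on this slide can be sketched in software. The following is a minimal bit-level model, not the circuit itself: the function name `kogge_stone_add` and the list-based representation are illustrative assumptions, and the propagate signal is taken as the XOR of the operand bits so that it can be reused for the final sum.

```python
def kogge_stone_add(a: int, b: int, n: int = 64):
    """Model an n-bit Kogge-Stone adder.

    Returns (sum modulo 2**n, number of carry-merge stages).
    """
    mask = (1 << n) - 1
    a &= mask
    b &= mask

    # PG generation: per-bit generate and propagate.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    gg, gp = g[:], p[:]
    stages = 0
    d = 1
    while d < n:
        new_g, new_p = gg[:], gp[:]
        for i in range(d, n):
            # Carry-merge gate: GG = Gi + Pi.Gi-1, GP = Pi.Pi-1,
            # applied to groups that are d bits apart.
            new_g[i] = gg[i] | (gp[i] & gg[i - d])
            new_p[i] = gp[i] & gp[i - d]
        gg, gp = new_g, new_p
        d *= 2
        stages += 1

    # Sum XOR: the carry into bit i is the group generate of bits i-1..0.
    total = 0
    for i in range(n):
        carry_in = gg[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, stages


if __name__ == "__main__":
    print(kogge_stone_add(5, 7, 8))
```

Every bit position gets a carry-merge gate in every stage, which is the "full-blown binary tree" the slide calls energy-inefficient; for n = 64 the loop runs log2(64) = 6 times.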

64-bit Han-Carlson adder core

• Carry-merge done on even bitslices
• 50% fewer carry-merge gates vs. Kogge-Stone
• Extra logic stage generates odd carries

[Figure: Han-Carlson adder core over bitslices b63 ... b0 with alternating odd and even bits: PG generator, carry-merge stages 0 through 5 (CM0, CM1, ...), odd carry generator, and sum XOR; gate types 3N, 2P, 2N.]
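The structure described on this slide can be modeled at the bit level as follows. This is a sketch under stated assumptions: bits are indexed from 0, the tree is built on the odd-indexed positions (whether that set is called "odd" or "even" depends on the indexing convention), and the gate count simply counts one gate per carry-merge operation in the tree, excluding the extra carry stage. The names `han_carlson_add` and `kogge_stone_gate_count` are illustrative.

```python
def han_carlson_add(a: int, b: int, n: int = 64):
    """Model an n-bit Han-Carlson adder.

    Returns (sum modulo 2**n, carry-merge gates used in the tree).
    """
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    gg, gp = g[:], p[:]
    tree_gates = 0

    # First carry-merge stage: pair each tree bit with its lower neighbour.
    for i in range(1, n, 2):
        gg[i] = g[i] | (p[i] & g[i - 1])
        gp[i] = p[i] & p[i - 1]
        tree_gates += 1

    # Remaining stages: Kogge-Stone style merging on every other bitslice.
    d = 2
    while d < n:
        new_g, new_p = gg[:], gp[:]
        for i in range(d + 1, n, 2):
            new_g[i] = gg[i] | (gp[i] & gg[i - d])
            new_p[i] = gp[i] & gp[i - d]
            tree_gates += 1
        gg, gp = new_g, new_p
        d *= 2

    # Extra logic stage: derive the skipped carries from their neighbour.
    for i in range(2, n, 2):
        gg[i] = g[i] | (p[i] & gg[i - 1])

    total = 0
    for i in range(n):
        carry_in = gg[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, tree_gates


def kogge_stone_gate_count(n: int = 64) -> int:
    """Carry-merge gates in a full Kogge-Stone tree, under the same counting."""
    count, d = 0, 1
    while d < n:
        count += n - d
        d *= 2
    return count


if __name__ == "__main__":
    _, hc = han_carlson_add(0, 0)
    ks = kogge_stone_gate_count()
    print(hc, ks, f"{100 * (ks - hc) / ks:.1f}% fewer")
```

Under this counting model the 64-bit tree uses 161 carry-merge operations against 321 for Kogge-Stone, which is in line with the roughly 50% reduction quoted on the slide.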

Energy-efficient adder core

• 43% less energy/transition at iso-performance

Adder architecture    Energy/transition
Kogge-Stone           120pJ
Han-Carlson           68pJ

Han-Carlson carry-merge tree

• Single-rail adder core
• CSG circuit generates dual-rail carry
• CSG = complementary signal generator

[Figure: even-input and odd-input paths through PG gen. and carry-merge stages CM0 to CM6 (gate types 3N, 2P, 2N), with the carry-merge tree, odd carry generator, and CSG producing Ceven and Codd; the tree is marked single rail and the CSG outputs dual rail.]

Complementary signal gen.

• Domino-compatible Carry/Carry#
• Permits a single-rail carry-merge tree design
• Not time-borrowable
  – Penalty absorbed by placing gate at φ2 boundary

[Figure: CSG circuit with input Cin_i, true and complementary pull-down paths, two keepers, and outputs Carry_i and Carry#_i.]

Partial sum generator

• Generates domino-compatible partial sum
• Placing the gate at φ1 boundary mitigates output noise-glitches

[Figure: partial sum generator circuit clocked by φ1, with inputs Ai and Bi, a keeper, and outputs Psum_i, Pi and Gi.]

ALU performance in bulk CMOS

64b Han-Carlson ALU simulation results (0.18µm bulk CMOS, Vcc=1.5V):

ALU delay     482ps
Adder core    310ps

[Figure: ALU critical path. Labels: input, 9:1 mux, 5:1 mux, 3:1 mux, adder core gates (3N, 2P, 2N, ..., XOR, 2P), sum, bus driver, 1200µm bus.]

Porting from bulk to PD-SOI

• Flow: bulk design → (direct port) → SOI design → (SOI-favored redesign) → SOI-optimal design
• Design issues:
  – Noise tolerance due to lowered Vt
  – Min-delay timing analysis
• Motivation for redesign:
  – Reduced SOI stack penalty
  – Deeper stack design
  – Stage reduction
• Design choices:
  – Architecture should favor deep stack design
  – Avoid increase in fanouts

0.18µm Bulk & PD-SOI technologies

• Equal IOFF at DC Vbs
• SOI IDSAT is 1-2% lower

             Ioff (nA/µm)    Idsat (µA/µm)
NMOS-Bulk    3.3             1070
NMOS-SOI     3.3             1050
PMOS-Bulk    0.7             445
PMOS-SOI     0.7             441

History effect measurements in 0.18µm PD-SOI

• These gates are used in the ALU design
• 5-11% delay variation
• Measurements agree with simulation results

[Figure: normalized delay (0.8 to 1) vs. pulse width (10ns to 100µs) for a transmission gate chain, a 3-NFET-stack chain, and an inverter chain, annotated with history effect variations of 11%, 7% and 5%.]

Direct port of Han-Carlson ALU to PD-SOI

• Adder core speedup = 14%
  – [Stasiak et al., ISSCC 2000]: 21% speedup

64b Han-Carlson ALU delay simulations    Delay    % delay improvement over bulk
Bulk                                     482ps    -
Direct-port to SOI                       403ps    16%

0.18µm technology, Vcc=1.5V

Speedup analysis

Stage type       Speedup over bulk from direct port to 0.18µm PD-SOI
Static gates     12-15%
Dynamic gates    2-9%
3:1 TG Mux       20%
5:1 TG Mux       23%
9:1 TG Mux       35%

• Diffusion-dominated muxes → max. speedup
• Load-dominated gates → speedup decreases

Motivation for PD-SOI-optimal redesign

• Reduced stack penalty in SOI
• Deeper stack design → stage reduction
• ALU is amenable to such a redesign
  – Not true for all CPU critical paths
• SOI-optimal ALU architecture
  – Increasing stack depth must not increase fanouts
• A novel deep-stack sparse-tree ALU was developed

Sparse-tree adder core

• 50% reduced fanouts compared to Han-Carlson
• 7 gate stages (two fewer than Han-Carlson)

[Figure: 64-bit sparse-tree adder core over bits b63 ... b0. A PG generator feeds a fast carry-merge tree that merges bit groups (1:0, 3:2, ..., 63:62; then 7:0, 15:8, 23:16, 31:24, 39:32, 47:40; then 15:0, 31:16, 47:32; then 31:0, 47:0), four intermediate carry generators, and sixteen sum generators selected through muxes. Gate types along the path: 2N, 2P, 4N, 2P, 3N, Mux.]

Intermediate Carry Generator

• Generates 1 in 4 carries (C3, C7, C19, ... C59)
• Non-critical path (ripple carry-select scheme)
• Fast carry selects between the conditional carries

[Figure: two ripple chains of carry-merge (CM) gates, seeded with 0 and 1, over P3:0/G3:0, P7:4/G7:4 and P11:8/G11:8; 2:1 muxes controlled by the carry from the fast CM chain select C3, C7 and C11.]

Non-critical Sum Generator

• Non-critical path: ripple carry chain
  – Reduced area, energy consumption, leakage
• Generate conditional sums for each bit
• 1 in 4 carry selects appropriate sum

[Figure: 4-bit conditional sum generator. Inputs Pi, Gi+1, (Pi+2, Gi+2), (Pi+3, Gi+3); ripple CM gates seeded with 0 and 1; XOR gates forming the conditional sums Sum_i,0 and Sum_i,1; 2:1 muxes controlled by the carry producing Sum_i, Sum_i+1, Sum_i+2 and Sum_i+3.]

Sparse-tree adder critical path

• Fast carry-merge path → critical path
• Non-critical side-paths → ripple-carry

[Figure: path from Input to Sum_out showing the fast carry-merge path (gates 2N, 2P, 4N, 2P, 3N) alongside the intermediate carry generator and sum generator side-paths (gates Inv, 3N, 2P, 2N, 2P, 3N).]

PD-SOI optimal redesign in 0.18µm

• Deeper stack redesign → additional 5% speedup

64b ALU delay simulations    Delay    Speedup over bulk
Bulk                         482ps    -
Direct-port SOI              403ps    16%
SOI-optimal redesign         380ps    21%

0.18µm technology, Vcc=1.5V

Margining for reverse-body bias in PD-SOI

• 400mV reverse bias increases rise-delay by 10%
• Difficult to detect for large circuits
• 10% margin required for all max-delay paths
• Overall PD-SOI speedup reduces to 11%

Reducing reverse-bias penalty in dynamic SOI gates

• Point solution for dynamic designs
• Pre-charging stack node decreases penalty to 2%
• Max-delay margin reduced to 2%
• Cost: 5% increase in clock energy

[Figure: dynamic gate with stacked inputs A and B, output Out, devices M1 and P0, the stack node, and body nodes Body-A and Body-B.]

0.18µm ALU performance after margining

• Maximum PD-SOI speedup reduces to 19%

64b ALU delay simulations    Delay    Speedup over bulk    Speedup after margining
Bulk                         482ps    -                    -
Direct-port SOI              403ps    16%                  14%
SOI-optimal redesign         380ps    21%                  19%

0.18µm technology, Vcc=1.5V

Scaling to 0.13µm technologies

• Equal SOI & bulk IOFF-DC
• MOSFET & impact ionization data obtained from 0.13µm bulk measurements
• SOI parasitic BJT/diode characteristics unchanged from 0.18µm fitting

Scaling ALU designs to 0.13µm technology

• Maximum PD-SOI speedup reduces to 16%

64b ALU delay simulations    Delay    Speedup over bulk    Speedup after margining
Bulk                         351ps    -                    -
Direct-port SOI              312ps    11%                  9%
SOI-optimal redesign         286ps    18%                  16%

0.13µm technology, Vcc=1.2V
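The speedup columns in the delay tables can be reproduced from the quoted delays. The sketch below assumes that "speedup over bulk" means the delay reduction as a percentage of the bulk delay, truncated to a whole percent, and that the post-margining figure subtracts the 2% max-delay margin directly. Both are assumptions that happen to match the quoted numbers, not definitions given on the slides, and the helper names are illustrative.

```python
import math


def delay_reduction_percent(bulk_ps: float, soi_ps: float) -> float:
    """Delay improvement over bulk, as a percentage of the bulk delay."""
    return 100.0 * (bulk_ps - soi_ps) / bulk_ps


# Simulated 64b ALU delays (ps) quoted on the slides.
DELAYS_PS = {
    "0.18um": {"bulk": 482, "direct_port": 403, "soi_optimal": 380},
    "0.13um": {"bulk": 351, "direct_port": 312, "soi_optimal": 286},
}

# Assumption: the 2% max-delay margin is subtracted as percentage points.
MARGIN_POINTS = 2


def summarize(delays=DELAYS_PS):
    """Map (node, design) to (raw %, whole %, whole % after margining)."""
    table = {}
    for node, d in delays.items():
        for design in ("direct_port", "soi_optimal"):
            raw = delay_reduction_percent(d["bulk"], d[design])
            whole = math.floor(raw)  # truncation matches the quoted figures
            table[(node, design)] = (raw, whole, whole - MARGIN_POINTS)
    return table


if __name__ == "__main__":
    for key, (raw, whole, margined) in summarize().items():
        print(key, f"{raw:.1f}%", f"{whole}%", f"{margined}%")
```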

SOI vs. bulk Summary

• 482ps energy-efficient dynamic 64b ALU in 0.18µm bulk
  – 310ps adder core
• Direct port to 0.18µm SOI → 14% speedup
• SOI-optimal redesign → 19% speedup
• Floating body can get reverse-biased
  – Preconditioning reduces margin from 10% to 2%
• Scaling to 0.13µm decreases PD-SOI speedup
• Maximum PD-SOI speedup in 0.13µm falls to 16%

High-Performance Low Power Datapath design

• Goal: shift the E-D curve

[Figure: energy vs. delay curve.]

A 4GHz 130nm Address Generation Unit with 32-bit Sparse-tree Adder Core

[S. Mathew et al, VLSI Symp. 2002]
[S. Mathew et al, JSSC May 2003]

Motivation

• AGUs: performance and peak-current limiters
• High activity → thermal hotspot
• Goal: high-performance energy-efficient design

[Figure: processor thermal map (temperature in °C) showing the cache, the execution core, and the AGU, with a 120°C marking.]

AGU Architecture

• Single-cycle latency and throughput
• Effective Address = Base + Index*Scale + (Segment + Displacement)
• 2-phase address computation

[Figure: AGU datapath. 32-bit Base, Index (through a 3b shift), and Segment + Displacement inputs; 3:2 compressor; 32b adder producing the 32-bit Effective Address; clocks clk, clk2, clk3.]
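The two-phase computation on this slide can be sketched as a 3:2 compression into carry-save form followed by a single carry-propagate addition. This is an illustrative model only: the function names are hypothetical, Segment + Displacement is treated as one precomputed 32-bit input (as the parentheses in the slide's formula suggest), and the scale is assumed to be a left shift of 0 to 3 bits.

```python
MASK32 = 0xFFFFFFFF


def compress_3to2(x: int, y: int, z: int):
    """Row of full adders: reduce three operands to (sum, carry) words."""
    partial_sum = x ^ y ^ z
    carry = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum & MASK32, carry & MASK32


def effective_address(base: int, index: int, scale_shift: int,
                      seg_plus_disp: int) -> int:
    """Effective Address = Base + Index*Scale + (Segment + Displacement)."""
    if not 0 <= scale_shift <= 3:
        raise ValueError("scale_shift is assumed to be 0..3")
    # Phase 1: pre-scale the index, then compress to carry-save format.
    scaled_index = (index << scale_shift) & MASK32
    partial_sum, carry = compress_3to2(base & MASK32, scaled_index,
                                       seg_plus_disp & MASK32)
    # Phase 2: carry-save to binary conversion with one 32-bit addition.
    return (partial_sum + carry) & MASK32


if __name__ == "__main__":
    print(hex(effective_address(0x1000, 5, 2, 0x20)))
```

The compressor has no carry propagation across bit positions, so only the final addition needs a fast carry tree.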

AGU Operation: Phase 1

• Index pre-scaled via 3-bit barrel shifter
• 3:2 compressor renders partial address: carry-save format
• Adder in pre-charge state

[Figure: AGU datapath with the 3b shift and 3:2 compressor highlighted; the compressor output is in carry-save format.]

AGU Operation: Phase 2

• Carry-save to binary format conversion: 2's complement parallel 32-bit adder

[Figure: AGU datapath with the 32b adder, its clocks (clk, clk2, clk3), and the Effective Address output highlighted.]

Kogge-Stone Adder

• Critical path = PG + 5 + XOR = 7 gate stages
• Generate, propagate fanout of 2, 3
• Maximum interconnect spans 16b
• Energy inefficient

[Figure: 32-bit Kogge-Stone tree over bits 31 ... 0: PG row, rows of carry-merge gates, and XOR row.]

Sparse-tree Adder Architecture

• Generate every 4th carry in parallel
• Side-path: 4-bit conditional sum generator
• 73% fewer carry-merge gates → energy-efficient

[Figure: 32-bit sparse tree over bits 31 ... 0 producing carries C3, C7, C11, C15, C19, C23 and C27.]
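A bit-level sketch of this organization follows: only the carry out of every 4th bit is produced, and each 4-bit group ripples two conditional sums (for carry-in 0 and 1) that a 2:1 mux selects between. The function name is illustrative, and for brevity the sparse carries are computed here with a simple loop over groups, which stands in for the parallel carry-merge tree of the actual design.

```python
def sparse_tree_add(a: int, b: int, n: int = 32, group: int = 4):
    """Model a sparse-tree adder with conditional-sum side paths.

    Returns (sum modulo 2**n, [carry out of bit 3, bit 7, ...]).
    """
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    # Sparse carries: one carry per 4-bit group boundary.
    group_carry_in = [0]
    for lo in range(0, n - group, group):
        grp_g, grp_p = 0, 1
        for i in range(lo, lo + group):
            grp_g = g[i] | (p[i] & grp_g)
            grp_p &= p[i]
        group_carry_in.append(grp_g | (grp_p & group_carry_in[-1]))

    def conditional_sum(lo: int, carry_in: int) -> int:
        """Ripple a 4-bit sum for an assumed carry-in (non-critical path)."""
        bits, carry = 0, carry_in
        for k in range(group):
            i = lo + k
            bits |= (p[i] ^ carry) << k
            carry = g[i] | (p[i] & carry)
        return bits

    total = 0
    for idx, lo in enumerate(range(0, n, group)):
        sum0 = conditional_sum(lo, 0)
        sum1 = conditional_sum(lo, 1)
        # 2:1 mux: the sparse-tree carry selects the appropriate sum.
        total |= (sum1 if group_carry_in[idx] else sum0) << lo
    return total, group_carry_in[1:]


if __name__ == "__main__":
    print(sparse_tree_add(0xF, 1))
```

For n = 32 the second return value holds seven carries, corresponding to C3 through C27 in the figure.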

Non-critical Sum Generator

• Non-critical path: ripple carry chain
  – Reduced area, energy consumption, leakage
• Generate conditional sums for each bit
• Sparse-tree carry selects appropriate sum

[Figure: 4-bit conditional sum generator. Inputs Pi, Gi+1, (Pi+2, Gi+2), (Pi+3, Gi+3); ripple CM gates seeded with 0 and 1; XOR gates forming the conditional sums Sum_i,0 and Sum_i,1; 2:1 muxes controlled by the carry producing Sum_i, Sum_i+1, Sum_i+2 and Sum_i+3.]

Optimized First-level Carry-merge: conditional carry for Cin=0

• Carry-merge stage reduces to inverter
• Conditional carry_0 = Gi#

[Figure: first-level carry-merge gate CM0 with inputs Pi, Gi and Cin=0, and its reduced form producing C#_0 from Gi.]

Optimized First-level Carry-merge: conditional carry for Cin=1

• Pi & Gi correlated
• Conditional carry_1 = Pi#

Ai    Bi    Pi    Gi    C#_1
0     0     0     0     1
0     1     1     0     0
1     0     1     0     0
1     1     1     1     0

[Figure: first-level carry-merge gate CM1 with inputs Pi, Gi and Cin=1, and its reduced form producing C#_1 from Pi.]
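The two first-level simplifications can be checked exhaustively. The sketch below assumes the carry-merge function C = Gi + Pi·Cin and takes Pi as the OR of the operand bits, which is what the Pi column of the truth table shows; the variable names are illustrative.

```python
from itertools import product

rows = []
for a, b in product((0, 1), repeat=2):
    p = a | b  # propagate, as in the truth table's Pi column
    g = a & b  # generate

    carry_cin0 = g | (p & 0)  # carry-merge with Cin = 0
    carry_cin1 = g | (p & 1)  # carry-merge with Cin = 1

    # Cin = 0: the carry is just Gi, so the inverted output is Gi#.
    assert carry_cin0 == g
    # Cin = 1: Gi implies Pi, so the carry is just Pi and the output is Pi#.
    assert carry_cin1 == p

    rows.append((a, b, p, g, 1 - carry_cin1))

if __name__ == "__main__":
    print("Ai Bi Pi Gi C#_1")
    for row in rows:
        print(*row)
```

In both cases the first-level gate collapses to an inverter on a single input, which is why the slides describe the stage as reducing to an inverter.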

Optimized Sum Generator

• Optimized 1st-level carry-merge
• Optimized non-critical path: 4 stages

[Figure: 4-bit conditional sum generator with the first-level carry-merge gates removed. Inputs Pi, Gi+1, (Pi+2, Gi+2), (Pi+3, Gi+3); remaining CM gates, XOR gates forming Sum_i,0 and Sum_i,1, and 2:1 muxes controlled by the carry producing Sum_i through Sum_i+3.]

Adder Core Critical Path

• Critical path: 7 gate stages, same as KS
• Sparse-tree: single-rail dynamic
• Exploit non-criticality of sum generator
  – Convert to static logic
  – Semi-dynamic design

[Figure: critical path from the adder inputs to Sum31. Single-rail dynamic sparse-tree path: PG, GG1, GG3, GG7, GG15, GG27, producing C27, clocked by clk, clk2 and clk3. Static sum generator: CM0, latch, CM1, XOR, producing Sum31_0 and Sum31_1.]

1st-level Carry-merge: Static Latch

• Holds state in pre-charge phase
• Prevents pre-charging of static stages

[Figure: clocked static latch carry-merge gate with inputs Pi, Gi, Gi-1 and clk, and output C#i.]

Domino-Static Interface

• Sum = Sum0 during pre-charge
• Mux output resolves during evaluation

[Figure: domino-static interface shown for clk=0 and clk=1. Labels: Pi, Gi, Gi-1, P#i, G#i, G#i-1, Carry#i, 2P and 2N gates, and a clocked mux selecting Sum_i from Sum0_i and Sum1_i.]

Sparse-tree Architecture

• Performance impact (20% speedup):
  – 33-50% reduced G/P fanouts
  – 80% reduced wiring complexity
  – 30% reduction in maximum interconnect
• Power impact (56% reduction):
  – 73% fewer carry-merge gates
  – 50% reduction in average transistor size

Energy-delay Space

• 20% speedup over Kogge-Stone
• 56% worst-case energy reduction
  – Scales with activity factor

[Figure: worst-case energy (pJ, 0 to 100) vs. delay (ps, 140 to 280) for the dynamic Kogge-Stone and semi-dynamic sparse-tree designs, annotated with the 20% and 56% improvements and the 4GHz design point. 130nm CMOS, 1.2V, 110°C simulation.]

Semi-dynamic Design

• Static sum generators: low switching activity
• 71% lower average energy at 10% activity

[Figure: average energy (pJ, 0 to 40) vs. activity factor (0 to 0.5) for the dynamic Kogge-Stone and semi-dynamic sparse-tree designs, annotated with the 71% difference.]

Dual-Vt Allocation

• Exploit non-criticality of sidepaths → use high-Vt devices
• 0% performance penalty
• 56% reduction in active leakage energy

                    Low-Vt    Dual-Vt
Delay               152ps     152ps
Switching energy    36pJ      34pJ (-6%)
Leakage energy      0.9pJ     0.4pJ (-56%)

130nm CMOS, 1.2V, 110°C simulation

Scaling Performance

• Average transistor size = 3.5µm → reduces impact of increasing leakage
• 80% reduction in wiring complexity → reduces impact of wire resistance
• 33% delay scaling, 50% energy reduction

                    130nm    100nm
Delay               152ps    102ps (-33%)
Switching energy    36pJ     18pJ (-50%)
Leakage energy      0.9pJ    0.7pJ (-23%)

A 6.5GHz, 130nm Single-ended Dynamic ALU

[M. Anders et al, ISSCC 2002]
[S. Vangal et al, JSSC November 2002]

32-bit ALU/Scheduler Loop

• Performance-critical execution core loop

[Figure: two ALUs (ALU 0, ALU 1), each with a scheduler and 5:1 input muxes fed from the RF and FIFO; 32-bit sum0/sum0# and sum1/sum1# outputs to the RF and FIFO; 8-bit sched0/sched0# and sched1/sched1# signals.]

Han-Carlson ALU Organization

• Single-rail dynamic 9-stage low-Vt design

[Figure: ALU organization over bits 31 ... 0 with RF and FIFO operands and 5:1 mux control. Stages: 5:1 mux; propagate/generate/partial sum (dynamic); carry merge 0 (static); carry merge 1 (dynamic); carry merge 2 (static); carry merge 3 (dynamic); carry merge 4 (static); carry merge 5 (CSG) / sum. Sum and Sum# outputs return on an 84µm loopback bus.]

Odd-bits CSG Sum Generation

• Final carry-merge CSG (dual-rail carry output) → pass-transistor sum XOR

[Figure: odd-bit CSG carry merge and sum generation circuit with inputs gi#, gi-1#, pi# and Psum_i, intermediate Carry_i and Carry#_i, and outputs Sum_i and Sum#_i.]

Even-bits CSG Sum Generation

• Domino-compatible sum
• Dual-rail sum from single-ended g inputs

[Figure: even-bit CSG carry merge and sum generation circuit with inputs gi# and Psum_i, intermediate Carry_i and Carry#_i, and outputs Sum_i and Sum#_i.]

Die Micro-photograph

• 130nm 6-metal dual-Vt CMOS
• Scheduler: 210μm x 210μm
• ALU: 84μm x 336μm

[Figure: die micro-photograph with the scheduler and ALU marked.]

Delay and Power Measurements

• 6.5GHz at 1.1V, 25ºC
• Power: 120mW total, 15mW leakage
• Scalable to 10GHz at 1.7V, 25ºC

[Figure: two plots vs. supply voltage (0.8 to 1.5V) at 25ºC: power and leakage power (mW, 0 to 450), and Fmax (GHz, 0 to 9), with the design target marked.]

Improvements Over Dual-rail Domino

Area                   50%
Performance (delay)    10%
Active leakage         40%
Robustness             equal

• Leakage reduced by eliminating dual-rail logic
• Robustness not compromised
• CSG improves both area and performance

Summary

• 4GHz AGU in 1.2V, 130nm technology
  – Sparse-tree adder architecture described
  – 20% speedup and 56% energy reduction
  – Semi-dynamic design: energy scales with switching activity
  – Dual-Vt non-critical paths: low active leakage energy
• 6.5GHz ALU and scheduler loop at 1.1V, 25ºC
  – Scalable to 10GHz at 1.7V, 25ºC