Circuits and Architectures to Deliver Low Power and High Speed Systems

By: Jabulani Nyathi, Washington State University, School of EECS, April 30, 2009

Outline

CMOS scaling: its benefits and the challenges it brings about

Various techniques for limiting leakage currents, and their shortfalls

Bridging the speed-power gap: the Tunable Body Biasing (TBB) scheme

Emerging devices and technologies

Concluding remarks

CMOS Scaling and Its Benefits

Aggressive CMOS scaling has been a very positive development, allowing:

Fast-switching devices, and thus high-speed computing

Massive integration due to miniaturization

No longer do we need multiple chips to implement a microprocessor and its peripherals.

In fact, we can now place multiple computing elements on a single die, resulting in a system on a chip (SoC).

CMOS Scaling and Its Challenges

CMOS scaling results in increased leakage currents (5X per node) and increased dynamic power dissipation.

The interconnect does not scale as fast as the transistor; thus:

Highly integrated designs require elaborate clock distribution schemes.

IP blocks within a system on a chip would be difficult to synchronize with a single clock source.
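The 5X/node leakage figure compounds quickly. A minimal sketch, assuming (as a simplification the slide does not make explicitly) that the same 5X factor applies at every node transition:

```python
# Hypothetical projection: assumes the 5X/node leakage growth quoted on the
# slide compounds uniformly from one technology node to the next.
LEAKAGE_GROWTH_PER_NODE = 5.0


def relative_leakage(nodes_scaled: int, growth: float = LEAKAGE_GROWTH_PER_NODE) -> float:
    """Leakage current relative to the starting node after `nodes_scaled` node transitions."""
    if nodes_scaled < 0:
        raise ValueError("nodes_scaled must be non-negative")
    return growth ** nodes_scaled


if __name__ == "__main__":
    for n in range(4):
        print(f"after {n} node(s): {relative_leakage(n):.0f}x leakage")
```

Under this assumption, three generations of scaling multiply leakage by 125.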

Scaling Implications

[Figure: Module 1 and Module 2 connected by local and global interconnects, shown before and after scaling]

Dynamic Vs Leakage Power

Research Motivation

Desire to bridge the speed-power gap by exploring the feasibility of optimizing devices to operate effectively at both sub-threshold and above-threshold voltages.

Emerging ultra-low-power technologies can benefit from increased speed: wearable computers, sensor networks, and implantable medical technology.

Emphasis on design for energy efficiency.

Existing Low Power Design Approaches

These approaches solve the energy dissipation problem from a region-of-operation standpoint.

Sub-threshold design:

DTMOS: shows a 5.5 times increase in current; the dynamic threshold provides energy efficiency.

SBB: 4.4 times frequency increase.

Above-threshold (super-threshold) design:

MTCMOS: high- and low-threshold devices.

VT scheme: reduces power by 50% using ABB and “sleep”/“active” modes.

Architectural design:

Gating techniques: 45% of total power.

DTMOS/SBB Output Voltage Clamping

[Figure: output voltage of a traditional design compared with SBB, DTMOS, and TBB designs; labels show 600 mV and 1.8 V]

Proposed Approach

Change the approach to include all possible operating regions: Tunable Body Biasing (TBB).

Sub-threshold and super-threshold operation are bridged: ultra-low energy and low speed, or high energy and high speed.

Utilize body biasing to improve the performance of sub-threshold operation.

Target increased performance at sub-threshold and slightly above threshold.

Save energy by eliminating idle time and processing continuously with variable power supplies (just-in-time task completion).

Target applications:

Mobile, battery-operated (power-constrained), variable-processing devices: cell phones, PDAs, notebooks, wireless sensors, embedded systems, ASICs, medical technology, etc.

TBB Implementation Goals

Attain ON-state current gain while minimizing the increase in OFF-state leakage current.

Highlight the advantages of sub-threshold operation while allowing super-threshold operation if needed.

Control the bulk terminals to tunable potentials depending on VDD and the desired region of operation.

MOS bulk control circuits:

Multiplexer-based approach

Two transistors per bulk control circuit

Utilizes Vthn0

TBB Bulk Control Circuits

Relies on the good/poor logic “1” and logic “0” passing properties of pass transistors.

Requires the external control signals SubVt and SubVt_b.

TBB MOS bulk control signals:

VDD range            pMOS bulk      nMOS bulk
VSS < VDD ≤ Vthn0    VSS            VDD
VDD > Vthn0          VDD – Vthn0    Vthn0
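The bulk control table maps directly onto a selection function. This is a behavioral sketch only, not a circuit model: it assumes VSS = 0 V, and the 0.3762 V value used for Vthn0 in the checks is an assumption (376.2 mV appears as a supply voltage elsewhere in these slides), not a stated device parameter.

```python
from typing import NamedTuple


class BulkBias(NamedTuple):
    pmos_bulk: float  # volts applied to the pMOS bulk
    nmos_bulk: float  # volts applied to the nMOS bulk


def tbb_bulk_voltages(vdd: float, vthn0: float, vss: float = 0.0) -> BulkBias:
    """Bulk potentials selected by the TBB bulk control circuits, per the slide's table."""
    if vdd <= vss:
        raise ValueError("VDD must be above VSS")
    if vdd <= vthn0:
        # Sub-threshold: pMOS bulk pulled to VSS, nMOS bulk pulled to VDD.
        return BulkBias(pmos_bulk=vss, nmos_bulk=vdd)
    # Super-threshold: bulks sit Vthn0 away from the rails (degraded pass-transistor levels).
    return BulkBias(pmos_bulk=vdd - vthn0, nmos_bulk=vthn0)


if __name__ == "__main__":
    for vdd in (0.3, 1.8):
        print(vdd, tbb_bulk_voltages(vdd, vthn0=0.3762))  # Vthn0 value is hypothetical
```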

TBB Bulk Control Circuit Simulation

Sub-threshold: pBulk = 0 V

Super-threshold: pBulk = VDD – Vthn0

Device Optimization

TBB encourages varying supply voltages.

How will devices be sized for optimal operation at any supply voltage?

Maintain symmetric switching.

Examine the inverter at varying supply voltages.

Device Optimization (Switching Point)

VDD        Ideal inverter threshold   Simulated inverter threshold   Percent variation
1.8 V      900 mV                     900 mV                         0.0%
1.0 V      500 mV                     498 mV                         0.4%
376.2 mV   188.1 mV                   198.7 mV                       5.6%
188.1 mV   94.05 mV                   108.6 mV                       13.4%

Sub-threshold Noise Margins

Noise margins are significant for maintaining proper logic levels.

TBB and traditional static CMOS inverters have comparable noise margins:

TBB VIH is 12.5% worse.

TBB VIL is 14.3% better.
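As a reminder of how VIL and VIH are commonly extracted, the sketch below finds them as the unity-gain (slope = −1) points of a voltage transfer curve, then forms NML = VIL − VOL and NMH = VOH − VIH. The tanh transfer curve, the 0.3762 V supply, and the gain value are hypothetical stand-ins, not the TBB or traditional simulation data.

```python
import math

VDD = 0.3762  # hypothetical supply (V), illustrative only
GAIN = 20.0   # hypothetical steepness of the model transfer curve (1/V)


def vtc(vin: float) -> float:
    """Idealized symmetric inverter transfer curve (tanh model, not simulation data)."""
    return VDD / 2 * (1 - math.tanh(GAIN * (vin - VDD / 2)))


def noise_margins(n: int = 20001) -> tuple[float, float, float, float]:
    """Return (VIL, VIH, NML, NMH) using the unity-gain points of the transfer curve."""
    step = VDD / (n - 1)
    vins = [i * step for i in range(n)]
    # Keep input voltages where the central-difference slope is -1 or steeper.
    steep = [
        vins[i]
        for i in range(1, n - 1)
        if (vtc(vins[i + 1]) - vtc(vins[i - 1])) / (2 * step) <= -1.0
    ]
    if not steep:
        raise ValueError("transfer curve never reaches unity gain")
    vil, vih = steep[0], steep[-1]
    voh, vol = vtc(0.0), vtc(VDD)
    return vil, vih, vil - vol, voh - vih


if __name__ == "__main__":
    vil, vih, nml, nmh = noise_margins()
    print(f"VIL = {vil * 1e3:.1f} mV, VIH = {vih * 1e3:.1f} mV")
    print(f"NML = {nml * 1e3:.1f} mV, NMH = {nmh * 1e3:.1f} mV")
```

Because this model curve is symmetric, NML and NMH come out nearly equal; an asymmetric real inverter shifts VIL and VIH in opposite directions, which is the kind of trade the slide reports for TBB.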

[Figure: average switching delay (ns) of static CMOS at Vdd = Vthn0 with traditional, SBB, TBB, and DTMOS body biasing, for a transmission gate, inverter, two-input NAND, two-input NOR, and two-input XOR]

Propagation Delay

Gate   Traditional delay   TBB delay   % decrease
TG     98 ns               14 ns       86
Inv    125 ns              20 ns       84
NAND   133 ns              18 ns       86
NOR    163 ns              25 ns       85
XOR    289 ns              40 ns       89
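Speedup is the ratio of traditional to TBB delay. A short sketch over the propagation delay table; the per-gate factors it prints can be compared with the "up to 7x" figure quoted in the review:

```python
# Propagation delays (ns) from the Propagation Delay table: gate -> (traditional, TBB).
DELAYS_NS = {
    "TG": (98, 14),
    "Inv": (125, 20),
    "NAND": (133, 18),
    "NOR": (163, 25),
    "XOR": (289, 40),
}


def speedup(traditional_ns: float, tbb_ns: float) -> float:
    """How many times faster TBB is: ratio of traditional to TBB delay."""
    return traditional_ns / tbb_ns


if __name__ == "__main__":
    for gate, (trad, tbb) in DELAYS_NS.items():
        print(f"{gate:5s} {speedup(trad, tbb):.1f}x faster with TBB")
```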

Review of Sub-Vth Circuit Benefits

So far, the presentation has shown:

TBB requires control of the MOS bulks to span the operating regions of interest; the implementation is successful.

The study of simple logic gates showed that:

TBB gives a dramatic speed increase (up to 7x).

The static CMOS design style is suitable for both sub-threshold and super-threshold operation.

Efficient device sizing for the TBB approach is possible.

However, how will a complex system perform?

Design using previous knowledge (logic style, sizing).

Analyze post-layout simulations.

Complex System-on-Chip Design Using TBB

This work addresses the challenges of global interconnect delays, clock distribution, synchronization of unrelated clocks, and power dissipation.

Conclusion

A TBB scheme has been devised to span all regions of operation, from ultra-low power to high speed.

A new kind of body biasing:

Forward biasing causes an exponential sub-threshold current gain.

This leads to a 7 times frequency increase in simple logic gates.

Focus on sub-threshold and slightly above threshold to utilize leakage.

The bulk control circuits are effective, with a 4% area increase and an 8.9% power dissipation increase.

Static CMOS is the ideal overall design style.

Device sizing at either sub-threshold or super-threshold allows efficient operation with variable supply voltages.
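The exponential gain from forward body bias follows from two textbook relations: the sub-threshold current is proportional to exp((VGS − Vth)/(n·kT/q)), and the body effect shifts the threshold by γ(√(2φF − VBS) − √(2φF)), which is negative for forward bias (VBS > 0). A minimal first-order sketch; the parameter values (γ, 2φF, n) are hypothetical, not extracted from the processes used in this work, and effects such as DIBL and junction current are ignored.

```python
import math

# Hypothetical long-channel parameters, chosen only for illustration.
GAMMA = 0.4          # body-effect coefficient (V^0.5)
TWO_PHI_F = 0.8      # surface potential term 2*phi_F (V)
N_SLOPE = 1.4        # sub-threshold slope factor
V_THERMAL = 0.02585  # kT/q at about 300 K (V)


def threshold_shift(vbs: float) -> float:
    """Change in nMOS threshold for body-to-source bias vbs (positive = forward bias)."""
    if vbs >= TWO_PHI_F:
        raise ValueError("model only meaningful for vbs < 2*phi_F")
    return GAMMA * (math.sqrt(TWO_PHI_F - vbs) - math.sqrt(TWO_PHI_F))


def subthreshold_current_gain(vbs: float) -> float:
    """I_sub(vbs) / I_sub(0), using I_sub proportional to exp((VGS - Vth) / (n * kT/q))."""
    return math.exp(-threshold_shift(vbs) / (N_SLOPE * V_THERMAL))


if __name__ == "__main__":
    for vbs in (0.0, 0.1, 0.2, 0.3):
        print(f"VBS = {vbs:.1f} V -> current gain {subthreshold_current_gain(vbs):.2f}x")
```

The threshold shift grows roughly linearly with VBS for small biases, and because it sits in an exponent, the current gain grows exponentially.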

Concluding Remarks

Tunable operation allows the designer to choose the operating point (kHz, MHz, or GHz), which in turn affects energy dissipation. Other schemes do not offer this flexibility.

TBB can lead to significant energy savings.

LFSR results show that TBB gives:

A maximal 5.7 times speed increase (sub-threshold)

Comparable energy at super-threshold and favorable energy at sub-threshold

Favorable EDP in all operating regions

The same speed with less energy dissipation

Idle-state leakage current can be minimized by collapsing the supply voltage.

Integrating Research into Instruction

[Figure: router chip, with data path circuits, memory design, and sub-system]

Incorporating Research into Instruction

A long-term objective is to place some of the integrated chips on development boards such as those Digilent Inc. produces.

The integrated chips then become part of a system and can be used in some of our lower-level courses.

Most important is the use of these programmable boards to showcase the research outcomes, particularly to visiting prospective students.

[Figure: a sample development board]

Questions and Comments Welcome!

Multiple Clock Domain Synchronization

[Figure: six computational modules forming synchronous islands that communicate isochronously over a micro-network; clock relationships of the form f_fast = n · f_slow are shown for equal clocks (n = 1), rational clocks, and arbitrary clocks]

Reducing Interconnect Delays

Improved latency and bandwidth: global interconnects are pipelined at or near the rate of computation.
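Pipelining a global interconnect at the rate of computation amounts to inserting enough register stages that each wire segment fits in one clock period. A sketch under simple assumptions (the total wire delay divides evenly across segments; register setup and clock-to-q overheads are ignored). The 600 ps wire delay is hypothetical; the 100 ps period corresponds to the 10 GHz transfer rate quoted in the contributions.

```python
import math


def pipeline_stages(wire_delay_ps: float, clock_period_ps: float) -> int:
    """Minimum number of register stages so each wire segment fits in one clock period.

    Assumes the wire delay splits evenly across segments and ignores register overheads.
    """
    if wire_delay_ps <= 0 or clock_period_ps <= 0:
        raise ValueError("delays must be positive")
    return math.ceil(wire_delay_ps / clock_period_ps)


if __name__ == "__main__":
    period_ps = 100.0  # 10 GHz
    stages = pipeline_stages(600.0, period_ps)  # hypothetical 600 ps global wire
    print(f"{stages} stages; latency {stages * period_ps:.0f} ps, one word per cycle")
```

Pipelining trades latency (six cycles here) for bandwidth: a new word can enter the wire every cycle instead of every 600 ps.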

Sources of Power Consumption

$$P_{total} = P_{static} + P_{dynamic} + P_{short\text{-}circuit}$$

$$P_{static} = P_{leakage} + P_{DC}$$

$$P_{dynamic} = C_{load}\, f_{clk}\, V_{swing}\, V_{DD}$$

$$P_{short\text{-}circuit} = I_{short\text{-}circuit,\,avg}\, V_{swing}$$

The most straightforward method to reduce power consumption from any source is to reduce VDD.

Controlling frequency directly manipulates dynamic power.

Controlling the device threshold manipulates leakage current, affecting leakage and short-circuit power.
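A sketch of the power model above: P_total = P_static + P_dynamic + P_short-circuit, with P_static = P_leakage + P_DC, P_dynamic = C_load·f_clk·V_swing·V_DD, and P_short-circuit = I_short-circuit,avg·V_swing. It assumes, since the slide does not say, that leakage and DC power are each a current times V_DD. All numeric values are hypothetical, and the currents are held fixed as VDD changes, which real devices do not do.

```python
from dataclasses import dataclass


@dataclass
class PowerModel:
    """Sketch of P_total = P_static + P_dynamic + P_short-circuit, in SI units."""

    c_load: float     # switched load capacitance (F)
    f_clk: float      # clock frequency (Hz)
    v_dd: float       # supply voltage (V)
    v_swing: float    # output voltage swing (V); equals v_dd for full-swing CMOS
    i_leakage: float  # leakage current (A)
    i_dc: float       # DC current, e.g. from ratioed logic (A)
    i_sc_avg: float   # average short-circuit current (A)

    def static(self) -> float:
        # P_static = P_leakage + P_DC; assumes both currents flow across V_DD.
        return (self.i_leakage + self.i_dc) * self.v_dd

    def dynamic(self) -> float:
        # P_dynamic = C_load * f_clk * V_swing * V_DD
        return self.c_load * self.f_clk * self.v_swing * self.v_dd

    def short_circuit(self) -> float:
        # P_short-circuit = I_short-circuit,avg * V_swing
        return self.i_sc_avg * self.v_swing

    def total(self) -> float:
        return self.static() + self.dynamic() + self.short_circuit()


def example_block(vdd: float) -> PowerModel:
    """Hypothetical full-swing block: 1 pF switched at 1 GHz, currents held fixed."""
    return PowerModel(c_load=1e-12, f_clk=1e9, v_dd=vdd, v_swing=vdd,
                      i_leakage=1e-6, i_dc=0.0, i_sc_avg=1e-5)


if __name__ == "__main__":
    full, half = example_block(1.0), example_block(0.5)
    print(f"total power at 1.0 V: {full.total() * 1e3:.3f} mW")
    print(f"total power at 0.5 V: {half.total() * 1e3:.3f} mW")
    print(f"dynamic power ratio: {full.dynamic() / half.dynamic():.1f}x")
```

With full-swing logic (V_swing = V_DD), halving VDD cuts dynamic power by 4x. This quadratic dependence is why reducing VDD is the most direct lever.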

Distributed FIFO Control Circuitry

Traditional vs. Tunable Body Biasing

Columns per design are LocalClock2 delay (ps) / frequency (GHz) / current (uA); the last column is the Tunable BB % difference in frequency / current.

Vdd (V)   Traditional BB            Tunable BB                Tunable BB % diff
1         111.2 / 9 / 3100          103.1 / 9.7 / 2988        7.8 / -3.6
0.7       172.55 / 5.8 / 1240       177.7 / 5.6 / 1042        -3.4 / -16
0.35      1354.5 / 0.7383 / 71      1438 / 0.6954 / 72.9      -5.8 / -2.7
0.2       96700 / 0.0103 / 2.81     16640 / 0.0601 / 5.051    483 / 79.8

The synchronizer/buffer shows an increase in performance at sub-threshold voltages when using tunable body biasing.
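The percentage columns compare tunable against traditional body biasing. Each frequency is the reciprocal of the corresponding delay (for example, 1/111.2 ps ≈ 9 GHz). A short sketch that recomputes the frequency percentage column from the tabulated frequencies:

```python
# LocalClock2 frequency (GHz) from the table: Vdd -> (traditional, tunable).
FREQ_GHZ = {
    1.0: (9.0, 9.7),
    0.7: (5.8, 5.6),
    0.35: (0.7383, 0.6954),
    0.2: (0.0103, 0.0601),
}


def percent_change(traditional: float, tunable: float) -> float:
    """Relative change of the tunable-body-biasing value versus traditional, in %."""
    return (tunable - traditional) / traditional * 100.0


if __name__ == "__main__":
    for vdd, (trad, tbb) in FREQ_GHZ.items():
        print(f"Vdd = {vdd:4.2f} V: frequency change {percent_change(trad, tbb):+6.1f}%")
```

The large positive value at 0.2 V reflects the sub-threshold speedup; the small negative values at 0.7 V and 0.35 V show a slight penalty there.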

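Tunable Body Biasing

Current is in uA and power in uW, each given as peak / average / idle.

Design                     Vdd (V)   Max freq (GHz)   Current (uA)             Power (uW)
Traditional body biasing   1         4                5597 / 2382 / 8.696      5597 / 2382 / 8.696
Traditional body biasing   0.7       2                2222 / 803.4 / 4.873     1555.4 / 562.38 / 3.411
Traditional body biasing   0.35      0.125            131.1 / 35.58 / 1.468    45.885 / 12.453 / 0.514
Traditional body biasing   0.2       0.01             7.452 / 2.895 / 1.349    1.49 / 0.579 / 0.27
Tunable body biasing       1         4                5140 / 2460 / 9.54       5140 / 2460 / 9.54
Tunable body biasing       0.7       2                2050 / 833 / 4.423       1435 / 583.1 / 3.096
Tunable body biasing       0.35      0.167            132 / 39.8 / 1.589       46.2 / 13.93 / 0.556
Tunable body biasing       0.2       0.015            9.468 / 4.03 / 1.239     1.894 / 0.806 / 0.248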

Pursuit of Low Power Operation

It is likely that not all IP blocks in a SoC need to operate at high speed.

Power dissipation for those IP blocks could be reduced by operating them at a lower voltage.

TBB offers the possibility to dynamically operate at either sub-threshold or super-threshold voltages.

Variable Voltage SoC

Consider a SoC with 50 IP blocks, each requiring communication at a rate of 10 MHz.

Each IP block could operate at sub-threshold voltages.

The channel could operate at super-threshold voltages while the IP blocks are in sub-threshold.

[Figure: computational modules forming synchronous islands, each with its own supply (Vdd1 to Vdd5), communicating isochronously over a micro-network]
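One way to read this scenario against the measured data is to pick, for each part of the system, the lowest supply in the Tunable Body Biasing table that meets its rate (maximum frequency 15 MHz at 0.2 V, 167 MHz at 0.35 V, 2 GHz at 0.7 V, 4 GHz at 1 V). The sketch assumes, which the slide does not state, that all 50 blocks share one channel, so the channel must carry the aggregate 500 MHz.

```python
# Maximum frequency (Hz) per supply (V) for the TBB design, from the
# Tunable Body Biasing table earlier in this presentation.
TBB_MAX_FREQ_HZ = {0.2: 15e6, 0.35: 167e6, 0.7: 2e9, 1.0: 4e9}


def lowest_supply(required_hz: float) -> float:
    """Lowest listed supply whose maximum frequency meets the required rate."""
    for vdd in sorted(TBB_MAX_FREQ_HZ):
        if TBB_MAX_FREQ_HZ[vdd] >= required_hz:
            return vdd
    raise ValueError(f"no listed supply reaches {required_hz:.3g} Hz")


N_IP_BLOCKS = 50
IP_RATE_HZ = 10e6

# Assumption (not stated on the slide): all blocks share one channel,
# so the channel must carry the aggregate rate.
channel_rate_hz = N_IP_BLOCKS * IP_RATE_HZ

if __name__ == "__main__":
    print(f"each IP block: {lowest_supply(IP_RATE_HZ)} V")
    print(f"shared channel ({channel_rate_hz / 1e6:.0f} MHz): {lowest_supply(channel_rate_hz)} V")
```

Under these assumptions the IP blocks can sit at 0.2 V while the channel needs 0.7 V, consistent with the slide's split between sub-threshold blocks and a super-threshold channel.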

Idle vs. Operating Power

Vdd (V)   Idle current (uA)   Idle power (uW)   Operating current (uA)   Operating power (uW)
1         16.9                16.9              2988                     2988
0.7       5.3                 3.71              1042                     729.4
0.35      1.5                 0.525             72.9                     25.52
0.2       0.925               0.185             5.051                    1.01

During idle periods, it is advantageous to reduce leakage current by reducing the power supply voltage or by increasing the threshold voltage (e.g., through bulk voltage manipulation).
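The power columns are Vdd × current (for example, 0.7 V × 5.3 uA = 3.71 uW). A sketch that recomputes them and quantifies the benefit of collapsing the supply while idle:

```python
# (Vdd [V], idle current [uA], operating current [uA]) from the Idle vs. Operating table.
ROWS = [
    (1.0, 16.9, 2988.0),
    (0.7, 5.3, 1042.0),
    (0.35, 1.5, 72.9),
    (0.2, 0.925, 5.051),
]


def power_uw(vdd_v: float, current_ua: float) -> float:
    """P = V * I; with V in volts and I in microamps the result is in microwatts."""
    return vdd_v * current_ua


if __name__ == "__main__":
    for vdd, idle, active in ROWS:
        print(f"{vdd:4.2f} V: idle {power_uw(vdd, idle):8.3f} uW, "
              f"operating {power_uw(vdd, active):8.2f} uW")
    full, collapsed = ROWS[0], ROWS[-1]
    ratio = power_uw(full[0], full[1]) / power_uw(collapsed[0], collapsed[1])
    print(f"idling at {collapsed[0]} V instead of {full[0]} V cuts idle power {ratio:.0f}x")
```

Collapsing the supply from 1 V to 0.2 V during idle periods lowers idle power by roughly 91x, because both the leakage current and the voltage it flows across shrink.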

Speed at Varying VDD

[Figure: minimum clock period (ns, log scale) vs. supply voltage (V) for TBB and traditional designs]

TBB is 5.7x faster at 376.2 mV and 20% faster at 1.8 V.

Energy-delay Product

The EDP of TBB outperforms that of the traditional design in all operating regions, significantly so in super-threshold.

Regions of Operation

3.9 MHz with 0.6 fJ/cycle

222.2 MHz with 103 fJ/cycle

1.1 GHz with 3.85 nJ/cycle

Contributions of This Work

The proposed scheme alleviates the communication bottleneck and offers a way to synchronize multiple SoC clocks.

It performs data transfers at up to 10 GHz.

The proposed scheme maintains high performance under the influence of any clock skew: 6.5 GHz for any process corner and any skew.

It is a low-power FIFO scheme with a small impact on area when used in SoCs with many modules.

Contributions of This Work

Process corners have a minor impact on performance, resulting in a 10% reduction in speed.

The optimal voltage for minimum energy consumption per transaction is 2Vth.

Introduction of TBB to address leakage and dynamic power dissipation:

500% increase in performance at sub-threshold voltages with a modest 80% increase in power

5-10% less power dissipation than traditional body biasing

Summary of Proposed FIFO Scheme

A linear FIFO scheme that addresses:

Signal propagation across the communication channel

Sustained throughput over long distances

Successful synchronization:

Synchronizes equal, rational, and arbitrary clocks

6.5 GHz sustained performance after process corner analysis, using 3 stages

Compared to the CN scheme:

Fewer devices per stage and fewer stages needed

25% higher performance and 12% lower power

Operates at both super- and sub-threshold voltages:

Lower instantaneous power demands from local clocks (less di/dt)

Optimal energy per transaction at 0.7 V in a 65 nm process

Sub-threshold operation reduces power by 3 orders of magnitude

Tunable Body Biasing provides 50% increased performance in sub-threshold while maintaining super-threshold operation

TBB Scalability

Body biasing and operating region   180 nm: total avg. power   180 nm: static [%]   90 nm: total avg. power   90 nm: static [%]
Traditional in sub-threshold        193 pW                     0.1%                 13.1 nW                   1.8%
Traditional in super-threshold      39.6 μW                    negligible           22.1 μW                   negligible
TBB in sub-threshold                1430 pW                    25.2%                20.4 nW                   6.1%
TBB in super-threshold              39.4 μW                    0.000034%            22.1 μW                   0.0025%

At 180 nm, the TBB sub-threshold static power percentage is large, and the total TBB sub-threshold power is large.

At 90 nm, the percentage difference is much smaller, and the total TBB sub-threshold power is not as large.

LFSR Energy vs. Frequency

[Figure: TBB and traditional LFSR energy dissipation (fJ) vs. frequency (MHz)]

TBB Implementation Cont.

TBB Implementation Cont.

Logic Gate Analysis (Power)

[Figure: power dissipation (nW, log scale) vs. supply voltage (0.25, 0.3762, 0.75, and 1.8 V) for traditional and TBB CMOS]

Inverter Power Dissipation

VDD [V]   Energy dissipation [fJ]   Average power [nW]   Maximum frequency [MHz]   Period [ns]
0.3262    8.27                      3.5                  0.416                     2400.0
0.4262    11.41                     30.0                 2.6                       380.0
0.5643    15.64                     651.6                41.7                      24.0
1.8       82.30                     68.60                833.3                     1.2

VDD [V]   Energy dissipation [fJ]   Average power [nW]   Maximum frequency [MHz]   Period [ns]
0.3262    8.52                      22.4                 2.6                       380.0
0.4262    13.00                     259.8                20.0                      50.0
0.5643    15.13                     2102.0               138.9                     7.2
1.8       81.47                     81.5                 1000.0                    1.0

Logic Gate Analysis (Energy)

[Figure: energy dissipation (fJ) vs. supply voltage (0.25, 0.3762, 0.75, and 1.8 V) for traditional and TBB CMOS]

Logic Gate Analysis (EDP)

[Figure: EDP (fJ·ns) vs. supply voltage (0.25, 0.3762, 0.75, and 1.8 V) for traditional and TBB CMOS]

Logic Gate Analysis (Fan-in)

[Figure: propagation delay (ns) vs. number of inputs (one to four) for traditional and TBB NAND and NOR gates]

Logic Gate Analysis (Logic Styles)

[Figure: energy dissipated (fJ) vs. supply voltage (0.5·Vthn, 0.75·Vthn, Vthn − 50 mV, Vthn, Vthn + 50 mV, and 1.5·Vthn) for traditional and TBB pseudo-nMOS]

LFSR Power Dissipation

[Figure: average power dissipation (uW) vs. supply voltage (V) for TBB and traditional LFSRs]

Device Optimization (Optimal Region)

[Figure: TBB clock period (ns) and TBB energy dissipation (fJ) vs. supply voltage (0.3262 to 1.8 V)]

Regions of Operation

Each region lists delay (ns) / energy (fJ).

Design              Super-threshold (1.8 V, GHz)   Sub-threshold (250 mV, kHz)   Optimal (750 mV, MHz)
Traditional LFSR    0.7 / 437.6                    20000 / 105                   7 / 74.1
TBB LFSR            0.6 / 437                      4500 / 22.8                   4.5 / 73.6
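The energy-delay product (EDP) is energy × delay. A sketch that computes it for both LFSR designs from the Regions of Operation table:

```python
# (delay [ns], energy [fJ]) for each LFSR design and operating region.
LFSR = {
    "Traditional": {
        "super-threshold (1.8 V)": (0.7, 437.6),
        "sub-threshold (250 mV)": (20000.0, 105.0),
        "optimal (750 mV)": (7.0, 74.1),
    },
    "TBB": {
        "super-threshold (1.8 V)": (0.6, 437.0),
        "sub-threshold (250 mV)": (4500.0, 22.8),
        "optimal (750 mV)": (4.5, 73.6),
    },
}


def edp(delay_ns: float, energy_fj: float) -> float:
    """Energy-delay product in fJ*ns; lower is better."""
    return energy_fj * delay_ns


if __name__ == "__main__":
    for region in LFSR["TBB"]:
        trad = edp(*LFSR["Traditional"][region])
        tbb = edp(*LFSR["TBB"][region])
        print(f"{region:24s} traditional {trad:11.1f}  TBB {tbb:11.1f}  (fJ*ns)")
```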

Logic Gate Results Highlights

TBB, SBB, and DTMOS increase speed by up to 7 times in sub-threshold.

Static CMOS has the best overall logic style performance.

Pseudo-nMOS, domino, and pass-transistor logic are still valuable in niche situations.

TBB and traditional noise margins are comparable.