Curing the Ailments of Nanometer CMOS through Self-Healing and Resiliency

Jan M. Rabaey
Director, Gigascale Silicon Research Center
Co-Director, Berkeley Wireless Research Center
University of California at Berkeley

SOCC, Sept. 25, 2006




The Silicon Age Still on a Roll, But …

Some Major Hurdles on The Way! (2003 ITRS Roadmap)

High Volume Manufacturing: 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018
Technology Node (nm): 90, 65, 45, 32, 22, 16, 11, 8
Integration Capacity (BT): 2, 4, 8, 16, 32, 64, 128, 256
Delay = CV/I scaling: 0.7, ~0.7, >0.7, then delay scaling will slow down
Energy/Logic Op scaling: >0.35, >0.5, >0.5, then energy scaling will slow down
Bulk Planar CMOS: high probability at first, low probability later
Alternate, 3G etc: low probability at first, high probability later
Variability: medium, then high, then very high
ILD (K): ~3, <3, then reduce slowly towards 2-2.5
RC Delay: 1, 1, 1, 1, 1, 1, 1, 1
Metal Layers: 6-7, 7-8, 8-9, then 0.5 to 1 layer per generation


The Challenges of the Next Decade(s)

•The Physics and Manufacturing Challenges

– A whole slew of static and dynamic variations and error mechanisms

•The Design Introduction Challenge

– Complexity, risk, time, cost

•The n-furcation of the Market


Variations Becoming Pronounced

[Figure: lithography wavelength (365 nm, 248 nm, 193 nm, 13 nm EUV) and technology generation (180 nm, 130 nm, 90 nm, 65 nm, 45 nm, 32 nm) versus year, 1980 to 2020, with the gap between them marked]

Design becoming “statistical”
• makes verification substantially harder
• challenging synchronization strategies
• “error-free” design untenable

[Figure: normalized frequency versus normalized leakage (Isb) at 130 nm, annotated 30% and 5X]

[Figure: temperature (C) over die X-Y position, scale 40 to 110]

Courtesy: Shekhar Borkar, Intel


Just One Example of Where We are Going

[Figure: VT Variation – Long/Wide]

[Figure: VT Variation – Short/Narrow]

Courtesy: Colin McAndrew, Freescale


Variations Come in Many Different Flavors

Also, local versus global, correlated versus random, temporal versus spatial

Different sources lead to different solutions


Variations Become Indistinguishable from Failure

Source: K. Nowka, IBM


Failures Becoming More Prominent

• Electromigration (weak or defective interconnects)
• Manufacturing defects that escape testing (inefficient burn-in testing)
• Time-Dependent Dielectric Breakdown (TDDB) (ultra-thin gate oxides)
• Transient faults due to cosmic rays and alpha particles (increase exponentially with number of devices on chip)
• Thermal runaway: higher transistor leakage, higher power dissipation, increased heating
• + just more complexity

[Figure: transistor reliability versus transistor lifetime (years), now and future]

Courtesy: T. Austin


Failures Becoming More Prominent

Erratic bit failures in memories caused by temporary trapped charges


Dealing with variations and faults

[Figure: complexity versus time (2000, 2005, 2010, beyond, the far beyond), with successive approaches: fully structured and regular fabrics, self-healing, error-resiliency, embracing randomness]


Curing the Nanometer Ailments

• Regularity and Structure

• Self-Healing

• Error-Resiliency

• Embracing Randomness

Regular implementation fabrics

Absolutely required for manufacturability

Driven by photo-lithography and eventually self-assembly constraints

Also for variability, reliability, and time-to-market


Regular Fabrics – A Plethora of Choices

• FPGA
• VPGA (CMU)
• River PLA (Berkeley)
• Structured ASIC (e.g. LSI RapidChip)

Trade-off between area, performance, power and time-to-market (factors 5 to 10)


Regular Fabrics - Example

CMU Regular Logic Bricks

Standard-cell library with fewer (~10), coarser, configurable (w/ vias), micro-regular brick layouts…

…that exhibit macro-regularity when assembled at chip-level

[Figure: 2-D FFT plots of poly-Si patterns, comparing ASIC “spatial” regularity with brick “spatial” regularity]

[Courtesy: Larry Pileggi, Andrzej Strojwas, CMU – C2S2]


CMU Regular Logic Bricks

[Courtesy: Larry Pileggi, Andrzej Strojwas, CMU – C2S2]


Curing the Nanometer Ailments

• Regularity and Structure

• Self-Healing

• Error-Resiliency

• Embracing Randomness

Self-Healing Architectures
• On-chip test and diagnostics used to correct for variations and stress
• Static and dynamic


Self-Healing

• Introduce sensors that monitor key aspects of system

– Manufacturing and environmental conditions

Process variations, temperature, voltage, activity, etc

– Key properties that accelerate failure mechanisms

• Employ system-level intelligent control to reduce stress

– Temperature control via resource assignment

– Active management of voltage-reliability trade-offs

• Utilize tuning and healing to alleviate reliability threats

– NBTI reversal

– In-field clock tuning

Courtesy: T. Austin
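The bullet on temperature control via resource assignment can be illustrated with a minimal sketch. It assumes per-core temperature sensors that report degrees Celsius; the function name `assign_task`, the core names, and the temperature limit are hypothetical and do not come from the presentation.

```python
def assign_task(core_temps_c, limit_c):
    """Pick the coolest core below limit_c, or None if every core is too hot.

    core_temps_c: dict mapping a core name to its sensed temperature (assumed sensor data).
    """
    core, temp = min(core_temps_c.items(), key=lambda kv: kv[1])
    return core if temp < limit_c else None


# Example: the scheduler steers new work away from the hot core c0.
print(assign_task({"c0": 85.0, "c1": 62.0, "c2": 71.0}, limit_c=80.0))
```

Returning `None` when every core exceeds the limit leaves the throttling decision to the caller, which is one possible policy among many.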


Test Moving On-Line

• On-chip resources used to minimize test cost
• Also available for dynamic re-evaluation and adaptation

[Figure: low-cost tester connected through a bus interface master wrapper (VCI) to the on-chip bus, the CPU, and on-chip memory holding the diagnostic test program and response map; the result is a logic failure map]

[Figure: on-chip noise samplers and on-chip leakage sensor, 90 nm Itanium]


Adaptive Biasing Using On-Line Test

[Figure: energy-performance trade-off, Eswitching (fJ) versus path delay (ps), comparing worst case without Vth tuning and nominal with Vth tuning; adaptive tuning annotated 10x]

[Figure: module and test module sharing Vdd, Vbb and Tclock, with test inputs and responses]

Dynamically adjust supply and threshold design parameters to center the design in the presence of process variations!

Easier Again in Regular Fabrics

Courtesy: K. Cao, Berkeley
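The idea of tuning the threshold until a test module meets the clock period can be sketched as a simple closed loop. This is a toy illustration only: the alpha-power-law delay model, its constants, the assumption that each bias step lowers Vth by a fixed amount, and the names `path_delay_ps` and `tune_body_bias` are all assumptions, not taken from the slide.

```python
def path_delay_ps(vdd, vth, k=100.0, alpha=1.5):
    """Toy delay model (assumed): delay ~ k * Vdd / (Vdd - Vth)^alpha."""
    return k * vdd / (vdd - vth) ** alpha


def tune_body_bias(vdd, vth_nominal, vth_shift, t_clock_ps, step=0.01, max_steps=50):
    """Lower the effective Vth step by step until the test path meets t_clock_ps.

    vth_shift models the process variation of this die. Each step of forward
    body bias is assumed to lower Vth by `step` volts.
    Returns (vth_effective, steps_used), or None if the target is unreachable.
    """
    vth = vth_nominal + vth_shift
    for steps in range(max_steps + 1):
        if path_delay_ps(vdd, vth) <= t_clock_ps:
            return vth, steps
        vth -= step
    return None


# A slow die (Vth 50 mV above nominal) is tuned back to a 175 ps clock period.
print(tune_body_bias(vdd=1.0, vth_nominal=0.30, vth_shift=0.05, t_clock_ps=175.0))
```

A real controller would also adjust Vdd and weigh the leakage cost of a lower threshold; the sketch covers only the timing side of the trade-off.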


Adaptive (Body) Biasing Impact

Courtesy: P. Gelsinger and S. Borkar, Intel (DAC04)

[Figure: die photographs, 4.5 mm by 5.3 mm, with multiple subsites]


Dynamic Resource Allocation

In the Multi-Processor Space
Compiler combines load assignment with DVS
mdl group at PSU

[Figure: normalized energy (40 to 100) versus number of processors (2, 4, 8, 16, 32) for 3D, DFE, LU, SPLAT, MGRID and WAVE5]

More savings with more processors!

In the Interconnect Space
Use routing throttling to perform thermal management
ThermalHerd (L.S. Peh, Princeton)


Rejuvenation

Negative Bias Temperature Instability

Source: D. Blaauw, UMich


Curing the Nanometer Ailments

• Regularity and Structure

• Self-Healing

• Error-Resiliency

• Embracing Randomness

Redundancy Galore

The only way to provide true error-resiliency!

With billions of transistors, overhead factors of 2 to 3 are reasonable if leading to 100% yield, supreme performance, or new applications.


Error-Resilient Systems

Incorporate facilities to push through system faults

• Error detection technologies

– Systems checkers, online testing, continuous functional verification

• Fault diagnosis

– Fine-grained testing, online testing

• System state recovery

– Microarchitectural checkpointing, algorithmic tolerance

• Physical repair

– Sparing, TMR

Courtesy: T. Austin
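Triple modular redundancy (TMR), listed under physical repair, runs three copies of a unit and takes the majority result. The following is a minimal sketch; the function names `tmr_vote` and `run_with_tmr` and the injected fault are hypothetical and serve only as illustration.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the majority value of three redundant results, or raise if all differ."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three replicas disagree")
    return value


def run_with_tmr(replicas, x):
    """Run three replicas of the same function on x and vote on the outputs."""
    return tmr_vote(*(f(x) for f in replicas))


# One replica has a stuck bit; the vote masks it.
good = lambda x: x * x
faulty = lambda x: (x * x) ^ 1
print(run_with_tmr([good, faulty, good], 7))
```

The voter masks any single faulty replica, at the cost of roughly three times the hardware plus the voter itself.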


A Gradual Introduction Process

Example: Aggressive Deployment using “Razor”

A “pseudo-synchronous” approach to address process variations and power minimization with minimal overhead by combining circuit and architectural techniques

[Figure: “razored pipeline” with IF, ID, EX, MEM (read-only) and WB (reg/mem) stages separated by Razor FFs, a stabilizer FF, the PC, and flush control, connected by error, bubble, recover and flushID signals]

[Figure: Razor flip-flop: main FF (D, Q, clk) with a shadow latch clocked by clk_del and an error comparator producing Error_L]

[Figure: energy versus supply voltage, showing processor energy, recovery energy and total energy, with the optimal voltage marked]

Courtesy: T. Austin, D. Blaauw, Michigan
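The Razor flip-flop compares what the main flip-flop captured at the clock edge with what a shadow latch captures slightly later. A behavioural sketch under simplifying assumptions follows: the data input makes a single transition, timing is expressed in arbitrary units, and the name `razor_sample` and its parameters are hypothetical.

```python
def razor_sample(old_value, new_value, arrival_time, t_clk, t_delay):
    """Model one Razor flip-flop capture.

    The data input switches from old_value to new_value at arrival_time.
    The main flip-flop samples at t_clk, the shadow latch at t_clk + t_delay.
    Returns (q, error): q is the value after any recovery, error is the
    comparator output. The data must arrive before the shadow latch samples.
    """
    if arrival_time > t_clk + t_delay:
        raise ValueError("data misses the shadow latch too; Razor cannot recover")
    main = new_value if arrival_time <= t_clk else old_value
    shadow = new_value
    error = main != shadow
    return (shadow if error else main), error


print(razor_sample(0, 1, arrival_time=0.9, t_clk=1.0, t_delay=0.3))  # data on time
print(razor_sample(0, 1, arrival_time=1.1, t_clk=1.0, t_delay=0.3))  # late data, recovered
```

Lowering the supply voltage pushes `arrival_time` later, so errors become more frequent and each one costs recovery energy. That is the trade-off the energy-versus-voltage plot describes.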


The Memory Data-Retention Voltage (DRV)

Example 2: Minimizing standby leakage in SRAMs

[Figure: six-transistor SRAM cell (M1 to M6) with internal nodes V1 and V2, supply VDD, and leakage currents]

[Figure: VTC of SRAM cell inverters, V2 (V) versus V1 (V) from 0 to 0.4, showing VTC1 and VTC2 at VDD = 0.4 V and VDD = 0.18 V]

When Vdd scales down to DRV, the Voltage Transfer Curves (VTC) of the internal inverters degrade to such a level that Static Noise Margin (SNM) of the SRAM cell reduces to zero.

DRV Condition:

$$\left.\frac{\partial V_1}{\partial V_2}\right|_{\text{Left inverter}} = \left.\frac{\partial V_1}{\partial V_2}\right|_{\text{Right inverter}}, \quad \text{when } V_{DD} = \mathrm{DRV}$$

Source: Huifang Qin, ISQED 2004


The Impact of Process Variations

DRV Spatial Distribution (256*128 Cells), 130 nm CMOS

[Figure: histogram of 32K SRAM cells (0 to 6000) versus DRV (mV), 100 to 400]


Supply based tradeoff

[Figure: data in at t = 0, SRAM with error control code at supply vS, data out at t = Tst]

Goal: Minimize power/bit


Power tradeoff with ECC

ECC saves standby power
• Hamming [31, 26, 3] achieves 33% power saving
• Reed-Muller [256, 219, 8] achieves 35% power saving

At the expense of time and area overhead

Minimum standby time to achieve power savings
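A Hamming [31, 26, 3] code stores 26 data bits in a 31-bit word and corrects any single flipped bit, which is what lets the SRAM tolerate a few cells losing data at a reduced standby voltage. The sketch below uses the textbook construction with parity bits at power-of-two positions; it is not taken from the prototype, and the function names are hypothetical.

```python
def hamming_encode(data_bits, n=31):
    """Encode data bits into an n-bit Hamming codeword (positions 1..n).

    Parity bits sit at power-of-two positions; for n = 31 that leaves 26 data bits.
    """
    code = [0] * (n + 1)              # index 0 is unused
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):           # not a power of two: data position
            code[pos] = next(it)
    p = 1
    while p <= n:
        # parity bit p covers every position whose index has bit p set
        code[p] = sum(code[i] for i in range(1, n + 1) if i & p and i != p) % 2
        p <<= 1
    return code[1:]


def hamming_decode(codeword):
    """Correct up to one flipped bit; return (data_bits, error_position or 0)."""
    n = len(codeword)
    code = [0] + list(codeword)
    syndrome = 0
    for i in range(1, n + 1):
        if code[i]:
            syndrome ^= i             # XOR of the indices of all set bits
    if syndrome:
        code[syndrome] ^= 1           # the syndrome is the position of the flipped bit
    data = [code[i] for i in range(1, n + 1) if i & (i - 1)]
    return data, syndrome


data = [1, 0] * 13                    # 26 data bits
word = hamming_encode(data)
word[9] ^= 1                          # one cell loses its value in standby
print(hamming_decode(word) == (data, 10))
```

The five parity bits per 26 data bits correspond to about 19% storage overhead (5/26).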


Prototype Design

• Error tolerant SRAM optimized for ultra-low voltage standby
• Selected implementation Hamming [31, 26, 3]
• 50% cell design overhead
• 19% parity overhead
• Tapeout: May 2006

[Figure: chip layout, 1.1 mm by 1.1 mm, with the original 1024x26 memory, the customized 1024x31 memory, encoder (enc) and decoder (dec)]


“Aggressive” Deployment At the Algorithm Level

[Figure: input x[n] feeds the Main Block, producing y_a[n], and the Estimator, producing y_e[n]; a | · | > Th comparison selects the final output ŷ[n]]

[Figure: power versus voltage, both normalized to 1.0, showing Pmain, PEC and PTOT and the resulting energy savings]

Voltage overscale Main Block.

Correct errors using Estimator.

Power savings ≥ 3X!

Courtesy: N. Shanbhag, Illinois
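The block diagram suggests a per-sample decision between the main block output and the estimator output. The sketch below assumes the usual algorithmic noise tolerance rule, in which the estimate replaces the main output whenever the two differ by more than a threshold. The rule as coded, the function names and the sample values are assumptions and are not read off the slide.

```python
def ant_correct(y_a, y_e, threshold):
    """Algorithmic noise tolerance decision for one sample.

    y_a: output of the voltage-overscaled main block (occasional large errors)
    y_e: output of the low-complexity estimator (small but persistent error)
    Returns y_a when it agrees with the estimate to within threshold, else y_e.
    """
    return y_e if abs(y_a - y_e) > threshold else y_a


def ant_filter(main_outputs, estimates, threshold):
    """Apply the ANT decision sample by sample."""
    return [ant_correct(a, e, threshold) for a, e in zip(main_outputs, estimates)]


# The third main-block sample suffers a timing error in a high-order bit.
print(ant_filter([10, 12, 1035, 9], [11, 11, 12, 10], threshold=4))
```

Timing errors under voltage overscaling tend to hit the most significant bits, so they are large and easy to detect, while the estimator's error stays small. That asymmetry is what lets the threshold test separate the two.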


Leveraging resiliency to increase value

[Figure: three images: error-free, with errors, error-corrected]

Low power motion estimation architecture using Algorithmic Noise Tolerance (Shanbhag, UIUC)

Up to 71% energy reduction demonstrated


Moving the Verification on the Chip

Self-checking processor

• Core function validated by checker
• Checker relaxes burden of correctness on core processor
• Core does the heavy lifting, removes hazards that could slow the simple checker

[Figure: performance core (IF, ID, REN, REG, SCHEDULER, EX/MEM) passes speculative instructions, in-order with PC, inst, inputs and addr, to the correctness checker (CHK, CT)]

[Figure: Alpha 21264 core, 205 mm2, with REMORA checker, 12 mm2]

Courtesy: Todd Austin, Univ. of Michigan
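The division of labour between a complex core and a simple in-order checker can be sketched as follows. This is a toy model under stated assumptions: the three-operation instruction set, the tuple format, and the name `check_and_commit` are hypothetical, and a real checker would also verify memory accesses and control flow.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def check_and_commit(retired, regfile):
    """In-order checker: recompute each retired instruction and commit the verified result.

    retired: list of (op, dst, src1, src2, core_result) tuples from the core.
    regfile: dict of architectural register values, updated in place.
    Returns the number of core results that had to be corrected.
    """
    corrected = 0
    for op, dst, src1, src2, core_result in retired:
        expected = OPS[op](regfile[src1], regfile[src2])
        if core_result != expected:
            corrected += 1            # a real design would also flush and restart the core
        regfile[dst] = expected       # only checked values reach architectural state
    return corrected


regs = {"r1": 2, "r2": 3, "r3": 0}
stream = [("add", "r3", "r1", "r2", 5),    # core result is correct
          ("mul", "r1", "r3", "r2", 16)]   # core result is wrong, 5 * 3 = 15
print(check_and_commit(stream, regs), regs["r1"])
```

Because the checker sees instructions in order with their inputs already resolved, it needs no scheduling or speculation logic, which is why it can stay small and simple.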


“On-Line X” (X = Verification, Test, Tuning, Reliability, Resource, Power and Leakage Management)

From Design time to Run Time Yield Improvement!

“Turning lemons into lemonade” (T. Austin)


Coordinated Forward Error Recovery

Runtime Validation of Multithreaded Processors

Correctness properties of multithreaded execution:
• Inter-thread communication
• Inter-thread synchronization
• Intra-thread data flow
• Intra-thread control flow

[Figure: SMT processor with register file and memory dispatches per-thread retired instructions to the runtime monitoring hardware: context status register, hardware synchronization unit, and DIVA checker processors]

[Figure: bar chart for FFT, LU, CHOLESKY, BARNES, FMM, WATER-NSQUARED and WATER-SPATIAL, values between 0.99 and 1.06, comparing the runtime validation configuration at fault rates of 1/1K and 1/1M]

Courtesy: S. Malik, Princeton


BulletProof Silicon – The Next Generation

Goal: Single-defect tolerance for 5% area overhead

Key ideas:
• No expensive computation checking
• Protect computation and test Hw
• Repair by disabling redundant parts

Approach:
1. Execute and protect state
2. Test concurrently when Hw idle
3. If test fails → roll back state → disable component → restart

[Figure: µprocessor pipeline (IF, ID, EX, MEM, WB) with checkers + BIST. Circuit envelope: logic-level testing and reconfiguration. Architectural envelope: check-pointing and epoch restore, with speculative and non-speculative state separated at epoch boundaries]

Courtesy: Austin, Bertacco, U. Mich
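The execute, test, roll back, disable and restart sequence can be sketched as an epoch loop. This is a simplified model under stated assumptions: state is a plain dictionary, redundant units are interchangeable, the self-test is an oracle function, and all names (`run_epochs`, `test_unit`, `alu0`, `alu1`) are hypothetical.

```python
def run_epochs(epochs, units, state, test_unit):
    """Execute epochs with checkpointing; on a failed self-test roll back, disable, retry.

    epochs: list of functions (state, unit) -> new state
    units: list of redundant unit names, assumed interchangeable
    test_unit: function unit -> bool, the concurrent self-test (True means healthy)
    Returns (final_state, units_still_enabled).
    """
    active = list(units)
    for epoch in epochs:
        while True:
            if not active:
                raise RuntimeError("no healthy unit left")
            unit = active[0]
            checkpoint = dict(state)              # protect state at the epoch boundary
            candidate = epoch(dict(state), unit)  # speculative execution of the epoch
            if test_unit(unit):
                state = candidate                 # test passed: commit the epoch
                break
            state = checkpoint                    # test failed: roll back
            active.remove(unit)                   # disable the faulty component, then retry
    return state, active


increment = lambda s, unit: {**s, "x": s["x"] + 1}
healthy = lambda unit: unit != "alu0"             # alu0 has a defect
print(run_epochs([increment] * 3, ["alu0", "alu1"], {"x": 0}, healthy))
```

Results are committed only at epoch boundaries, so a defect found by the test within an epoch cannot have corrupted committed state. This is why no separate check of each computation is needed.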


BulletProof Router

• Exploit the properties of the CMP switch design to provide end-to-end error detection and recovery
– Enhance switch output channels with CRC checkers
– Split flits into two parts and route them independently using different resources
– Add a Recovery Pointer to each input buffer
– On Error Detection:
- All CRC checkers drop outgoing packets
- Switch pipeline is flushed
- Head pointers are set to recovery pointers
- Restart execution

[Figure: interconnect switch with input buffers, routing logic, VC state, switch arbiter, cross-bar and cross-bar controller, protected by CRC checkers on the routed flits, a buffer checker, recovery logic, an error detection signal, and system diagnosis]

[Figure: input buffer holding flits a to e with Tail, Head and Recovery Head pointers. a: correctly routed flit; b, c: in the switch pipeline; d: next flit to be routed; e: last flit buffered]
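The recovery pointer mechanism can be sketched with a small buffer model. This is a simplified illustration: the switch pipeline is modelled as one flit deep, flits are not split into two parts, and the class and function names are hypothetical. The CRC is computed with Python's standard `zlib.crc32`.

```python
import zlib


class InputBuffer:
    """Input buffer with a head pointer and a recovery pointer (names assumed)."""

    def __init__(self, flits):
        self.flits = list(flits)  # each flit is a (payload, crc) pair
        self.head = 0             # next flit to enter the switch pipeline
        self.recovery = 0         # oldest flit not yet confirmed as correctly routed

    def send(self):
        flit = self.flits[self.head]
        self.head += 1
        return flit

    def confirm(self):
        self.recovery += 1        # the flit passed the CRC check at the output

    def recover(self):
        self.head = self.recovery  # replay everything still unconfirmed


def route_all(buf, fault_at=None):
    """Route every flit; inject one transient fault at send number fault_at."""
    delivered, sends = [], 0
    while buf.recovery < len(buf.flits):
        payload, crc = buf.send()
        if sends == fault_at:
            payload = b"?" + payload[1:]      # transient corruption inside the switch
        sends += 1
        if zlib.crc32(payload) == crc:
            delivered.append(payload)
            buf.confirm()
        else:
            buf.recover()                     # drop the packet, flush, replay
    return delivered, sends


flits = [(p, zlib.crc32(p)) for p in (b"flit-a", b"flit-b", b"flit-c")]
print(route_all(InputBuffer(flits), fault_at=1))
```

A flit stays in the buffer until the output CRC check confirms it, so a transient fault costs one replay and no data is lost.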


Towards malleable, resilient architectures

The Quest: Scaleable (hard and soft) architectures that provide flexible redundancy to accommodate systematic and random, static and dynamic errors while avoiding brittleness!


Curing the Nanometer Ailments

• Regularity and Structure

• Self-Healing

• Error-Resiliency

• Embracing Randomness

Maintaining a purely deterministic Boolean abstraction ultimately becomes untenable!

Maintaining our abstractions == Slowly abandon them!!


The Search for (New) Scaleable and Stackable Abstractions

Allow devices to make errors and use models-of-computation that tolerate them (signal processing, communication, coding, information theory)

An Interesting Case Study: The “Neural Network” MOC

Properties:
• Works well on noisy signals
• Uses “soft” decisions
• Operates in the presence of failures of components and interconnections

Challenge: Limited scope
Works mostly for classification problems

[Figure: artificial neuron]


Exploring the Yellow Brick Road

Humans
• 10-15% of terrestrial animal biomass
• 10^9 Neurons/“node”
• Since 10^5 years ago

Ants
• 10-15% of terrestrial animal biomass
• 10^5 Neurons/“node”
• Since 10^8 years ago

Easier to make ants than humans
“Small, simple, swarm”

Courtesy: D. Petrovic, UCB


Inspired by the Sensor Network Paradigm

Artificial Skin

Communication Backplanes Real-time Health Monitoring

Smart Surfaces


Example: Collaborative Networks

• Large number of states/nodes

• Bi-directional, non-linear, non-deterministic links

• Local coupling with globally emergent behavior

• Inherently redundant and resilient to failure

Sensor Network-on-a-chip

Source: N. Shanbhag, D. Jones


SN-on-a-chip – A simple example

Estimators need to be independent for this scheme to be effective

A simple study: 2 different adders with voltage over-scaling

Source: N. Shanbhag, UIUC


Distributed Collaborative Systems on a Chip

Example: A configurable radio architecture based on collaborative autonomous entities

Source: J. Roychowdhury, J. Rabaey

Array of locally-coupled cheap low-power oscillator-based units
• Known to exhibit complex, spontaneous pattern formation
• Operation mode selected through choice of coupling factors and operational nodes

[Figure: emerging pattern as a function of coupling factor]


The Mechanical Radio

The Ultimate ULP Tunable Wireless Transceiver?

[Figure: wine-glass disk resonator, R = 32 μm, with support beams, anchor, input electrode, output electrode and coupling beam]

Source: C. Nguyen, UC Michigan

9 wine-glass disc oscillator-based GSM-compliant oscillator


Transitioning to the Post-Silicon Age

Implementation platforms that work under very low SNR, are non-deterministic, unpredictable and unreliable…

Molecular

Organic

NanoOptics

Nanotube


Some Concluding Remarks

Formidable challenges over the next decades to dramatically alter design paradigms

Variability and reliability to lead to novel micro-architectures and computational models

Regularity and redundancy central tenets

The opportunities:

Use the abundance of transistors to move the burden from pre- or post-manufacturing evaluation to on-line activities

Gradual incorporation of error-resilient computational models



The GSRC System-Design Roadmap

[Figure: complexity versus time, from now to the 2020's, with stages labelled Core, Concurrent, Resilient and Alternative]

GSRC: The best answer to formidable challenges is a critical mass response


The GSRC Agenda

[Figure: GSRC themes: Concurrent Systems, Resilient Systems, Alternative Computational Systems, System Design, Core Framework, Design Driver. Names shown: W.M. Hwu, T. Austin, N. Shanbhag, ASV, J. Wawrzynek, J. Rabaey, S. Malik, K. Lutz]

Structured along the line of big challenges rather than technologies

Provokes multi-disciplinary out-of-the-box thinking

41 Faculty, 17 Institutions


Thank you!

“Creativity is the ability to introduce order into the randomness of nature”

― Eric Hoffer

The contributions of all the GSRC faculty to this presentation are greatly appreciated, as is the funding by the MARCO member companies and the US Government.