1 Robust System Design to Overcome CMOS Reliability Challenges Subhasish Mitra Kevin Brelsford, Young Moon Kim, Kelin Lee, Yanjing Li Robust Systems Group Dept. of EE & Dept. of CS Stanford University Acknowledgment: Students & Collaborators


Robust System Design

to Overcome CMOS Reliability Challenges

Subhasish Mitra

Kevin Brelsford, Young Moon Kim, Kelin Lee, Yanjing Li

Robust Systems Group

Dept. of EE & Dept. of CS

Stanford University

Acknowledgment: Students & Collaborators


Robust System Challenges

Technology reliability limits – today’s focus

Soft errors, early-life failures, aging, variability, …

System complexity

Design bugs, defects

Malfunctions can be disastrous

Health, transport, finance, …

“It’s ridiculous. I’ve got a $300,000 server that doesn’t work. The thing should be bullet-proof.”


Robust System Design

Perform correctly

Despite complexity & disturbances

Thorough test & validation

Tolerate imperfect hardware

Beyond silicon-CMOS: imperfection-immune logic


“Low-Cost” Error Detection Most Important

Concurrent error detection (CED) expensive

Crashes vs. silent errors

Belief: Logic parity inexpensive

Reality: Can be expensive

Logic sharing, complex routing

Belief: Software CED inexpensive

Reality: Only for some apps (matrix, FFT)


Design Principles to Achieve “Low Cost”

Discover

Failure mode signatures

Utilize

Application characteristics

Globally Optimize

Software orchestration

Reconfigurable resilience

Spend some area, minimize power costs


Low-Cost Resilience

[Figure: failure rate vs. time, with early-life failures (ELF), lifetime, and wearout regions]

Early-life failures (ELF): burn-in difficult, Iddq ineffective

Wearout (transistor aging): guardbands expensive

Soft Error Resilience

BISER + LEAP: Errors reduced: 2,000X

Global optimization software-orchestrated

Circuit Failure Prediction: On-line Self-Test & Diagnostics

New ELF signature: Delay shifts over time


Outline

Introduction

Soft error resilience

Circuit failure prediction

On-line self-test & diagnostics

Conclusion


Who Cares About Soft Errors?

20K-processor server farm

1 major flip-flop error every 20 days

Silent data corruption

$20K → $3,616 bank deposit

Downtime: $100K - $10M / hr.

Memory ECC routinely used

[Figure: soft error rate contributions from flip-flops, unprotected memory, and combinational logic]

System error rates increasing
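As a rough worked example, the server-farm figure above can be converted into a per-processor rate. This is only a sketch: it assumes the errors are spread evenly across processors, and it expresses the result in FIT (failures per 10^9 device-hours), a standard unit that the slide itself does not use.

```python
# Back-of-the-envelope conversion of the slide's server-farm figure.
# Assumption: errors are spread evenly across all processors.
PROCESSORS = 20_000        # "20K processors server farm"
DAYS_BETWEEN_ERRORS = 20   # "1 major flip-flop error every 20 days"


def per_processor_fit(processors: int, days_between_errors: float) -> float:
    """Return errors per 10^9 processor-hours (FIT) for a single processor."""
    processor_hours_per_error = processors * days_between_errors * 24
    return 1e9 / processor_hours_per_error


fit = per_processor_fit(PROCESSORS, DAYS_BETWEEN_ERRORS)
print(f"{fit:.0f} FIT per processor")
```

Under that assumption, a rate that looks negligible for one processor still produces a visible error every few weeks at farm scale.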


BISER: Built-In Soft Error Resilience

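[Figure: BISER flip-flop schematic. A latch and a redundant latch (reused from scan test & debug) drive inputs A and B of a C-element; a weak keeper holds OUT. Other labels: IN, Clock, D, C, Q, comb. logic]

C-element truth table:

A B | C-element (A, B)
0 0 | 1
1 1 | 0
0 1 | Previous value retained
1 0 | Previous value retained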

[Mitra IEEE Computer 05, ITC 06, Zhang TVLSI 06]
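The C-element behaviour listed on this slide can be sketched in a few lines. The function name `c_element` and the `previous` argument (standing in for the weak keeper) are illustrative choices, not names from the cited papers.

```python
def c_element(a: int, b: int, previous: int) -> int:
    """Inverting C-element per the slide's truth table.

    When A and B agree the output is driven (00 -> 1, 11 -> 0);
    when they disagree the previous value is retained.
    """
    if a == b:
        return 1 - a
    return previous


# Both latches hold 1, so the output settles to 0.
out = c_element(1, 1, previous=1)

# A particle strike flips only one latch: the inputs now disagree,
# so the output keeps its earlier value and the upset is filtered.
out_after_upset = c_element(0, 1, previous=out)
print(out, out_after_upset)
```

The sketch models a single upset only; if both latches flip together the inputs agree again and the wrong value propagates.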


Architecture-Aware BISER Insertion

[Figure: cumulative error coverage (0% to 100%) vs. cumulative latch coverage (0% to 100%), Alpha 21264 error injection]

10X chip-level protection: 9% chip-level power penalty

2X: 2.5% power penalty

Ack: Prof. S.J. Patel, UIUC for error injector

Optimized BISER insertion: verification-guided?
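One simple way to picture architecture-aware insertion is a greedy ranking: protect the flip-flops that account for the most injected errors first, and stop once a target error coverage is reached. The sketch below is an illustration under that assumption, not the procedure used in the study; the function name and the error counts are made up.

```python
def select_for_protection(error_counts: dict, target: float) -> list:
    """Greedily pick flip-flops with the most observed errors until the
    chosen set covers at least `target` (a fraction) of all errors."""
    total = sum(error_counts.values())
    ranked = sorted(error_counts.items(), key=lambda kv: kv[1], reverse=True)
    chosen, covered = [], 0
    for name, count in ranked:
        if covered >= target * total:
            break
        chosen.append(name)
        covered += count
    return chosen


# Made-up error-injection results: system-level errors traced to each flip-flop.
counts = {"ff_a": 50, "ff_b": 30, "ff_c": 10, "ff_d": 6, "ff_e": 4}
picked = select_for_protection(counts, target=0.85)
print(picked, f"{len(picked) / len(counts):.0%} of flip-flops protected")
```

Because error contributions are usually skewed, a small fraction of flip-flops can cover most errors, which is what keeps the power penalty low.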


Reconfigurable BISER: Economy Mode


Integrated design quality

Soft error correction, scan test, post-silicon debug

[Figure: reconfigurable BISER cell, shown with Scan Clock B = 1 and Capture = 0. A scan / checking flip-flop and a system flip-flop feed a C-element with keeper that drives System Output. Signals: Scan Data, Scan Clock A, Scan Clock B, Capture, Update, System Data, System Clock, Scan Output]

45nm BISER Results

Radiation experiment results

Alpha particles: > 1,000X improvement feasible

Neutrons: > 100X improvement feasible

More reduction possible

[Seifert, Intel, IOLTS 08 Invited Speech]

Single Soft Error Resilience Not Enough

Single Event Multiple Upsets increasing

Node separation too expensive

New idea: LEAP

Special layout

New properties of single event transients

Layout by Error-Aware Transistor Positioning

2,000X fewer errors vs. D flip-flop

5X fewer soft errors vs. DICE

Same DICE circuit

3% power, 1% delay, 40% area costs

[Figure: DICE schematic (transistors M1 to M8, nodes n1 to n8) next to the LEAP-DICE layout]

Key LEAP Idea

[Figure: inverter with in = 1 (PMOS, NMOS, nodes n1 and n2, VDD, GND; one transistor ON, the other OFF) under a particle strike; a plot of V(out) vs. time between logic 0 and logic 1 shows a reduced single event transient]

Resilient Flip-flops Alone Not Enough

Design optimization essential

Case study

Given: flip-flops to be protected

Find: lowest-cost solution

Scenario 1: BISER only

Scenario 2: Flip-flop parity only

or

Scenario 3: BISER + flip-flop parity

[Mitra DATE 10]

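Scenario 3: BISER + Flip-flop Parity

Question: Which flip-flops for BISER?

[Figure: combinational logic outputs Y1, …, Ys feed flip-flops q1, …, qs clocked by CK. Flip-flops q1 to qk-2 are covered by a parity flip-flop p, with Parity = Y1 ⊕ … ⊕ Yk-2, and a parity checker that raises Error; BISER cells are shown among the remaining flip-flops (qk-1, qk, …, qs)]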
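The parity half of Scenario 3 can be sketched as follows. The parity bit is computed from the logic outputs and captured alongside them, and the checker compares it with the parity of the stored values. The helper names and bit values are illustrative only.

```python
from functools import reduce
from operator import xor


def parity(bits: list) -> int:
    """XOR of all bits: 1 when an odd number of bits are 1."""
    return reduce(xor, bits, 0)


def parity_checker(q: list, p: int) -> int:
    """Error = 1 when the stored bits no longer match the stored parity bit."""
    return parity(q) ^ p


y = [1, 0, 1, 1]      # made-up logic outputs Y1 .. Yk-2
q = list(y)           # values captured by flip-flops q1 .. qk-2
p = parity(y)         # parity bit captured in flip-flop p

error_before = parity_checker(q, p)
q[2] ^= 1             # a soft error flips one stored bit
error_after = parity_checker(q, p)
print(error_before, error_after)
```

Parity detects any odd number of flipped bits in a group but corrects nothing, whereas BISER corrects a single upset locally; that difference is why the split between the two has to be optimized.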


Optimized BISER + Parity: SimpleSPI Core

[Figure: two plots of power cost vs. % of selected flip-flops protected with parity (BISER for rest), each with 50% of flip-flops selected at random. Experiment 1: power cost axis 0% to 30%. Experiment 2: power cost axis 0% to 20%]

Optimization a MUST



Circuit Failure Prediction Early Indicator

Failure Prediction   | Error Detection
Before errors appear | After errors appear
+ No corruption      | – Corrupt data & states
+ Low cost           | – High cost
+ Self-diagnosis     | – Limited diagnosis

Applicability: Early-life failures, circuit aging

[Agarwal VTS 07, Li IEEE Design & Test 09]

Early-Life Failures (ELF)

Weak chips – caused by defects

Fail early in field (a.k.a. infant mortality)

Gate-oxide defects important

Burn-in ELF screen

Major test cost

Reduced effectiveness

Burn-in alternatives difficult: Iddq, VLV test


2001 Burn-in & Test Socket Workshop


Gate-Oxide ELF: Failure Prediction Example

New signature: Delay SHIFTs over time

Before functional failure

Distinct from NBTI, PBTI, hot carriers

[Chen VTS 08, IRPS 09, Kim VTS 10, VLSI Circuits 10]


Large-Scale Gate-Oxide ELF Experiments

48K transistor arrays

W = 0.2 µm

[Figure: Ids [µA] after 5340 min. stress vs. fresh Ids [µA], with outliers marked]

[Figure: X-Y map of outlier locations; outlier locations random in 0.2 µm array]

[Figure: standard normal quantile vs. Ids [µA] for fresh devices and after 10 min., 240 min., and 5340 min. stress, with outliers marked]

952 pairs: Ids outliers, 11.6% of entire population

952 pairs: largest Ig increase, 11.6% of entire population

885 pairs: 92%

Gate-Oxide ELF Test Structure

Emulate single gate-oxide ELF using stress

NMOS or PMOS

Thin-oxide NMOS under stress

Thick-oxide

[Kim VLSI Circuits 10]

Gate-oxide ELF Stress: Delay Shifts

[Figure: measured delay SHIFT (ps, 0 to 150) vs. stress time (a.u., 200 to 800), labelled 1V. The gate-oxide ELF delay shift (increased gate leakage) grows under stress until functional failure]

Big Question: How to Detect Delay Shifts?

Existing Techniques              | Why inadequate?
Delay fault detection flip-flops | Delay shift ≠ delay fault
Canary circuits                  | ELF defects not detectable
Concurrent error detection       | Expensive

Solution: On-Line Self-Test and Diagnostics

Configurable launch-capture delay

[Figure: timeline of Task 1, Task 2, …, Task N, Task N+1, Task N+2, …, Task M, with on-line self-test & diagnostics sessions in between. Each session uses Scan Enable, scan-in & scan-out, and launch-capture at delays Di, Di+1, …, Dj, where Di > Di+1 > Dj]

Monotonic Launch-Capture Delay Control

[Figure: measured launch-capture delay (ps, 200 to 1,000) vs. delay configuration (0 to 120) for 129 delay configurations, with a phase change marked]

Fine control of less than 20ps

On-Line Delay Shift Detection Results

[Figure: gate-oxide ELF delay shift vs. stress time (a.u.) up to functional failure, together with gate leakage Ig (A, 10^-7 to 10^-5) vs. stress time. Delay levels marked on the plot and their delay configurations:]

Delay (ps) | Delay config
1,012      | 72
974        | 67
967        | 66
903        | 60
885        | 58
866        | 56
851        | 54
603        | 29
581        | 27

[Kim VLSI Circuits 10]
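The detection idea can be sketched as a sweep over launch-capture delay configurations. The configuration-to-delay pairs below are the ones listed in the results; the pass criterion, the path delays, and the 50 ps margin are assumptions made for illustration, not values from [Kim VLSI Circuits 10].

```python
from typing import Optional

# Delay configuration -> measured launch-capture delay (ps), from the results.
CONFIG_DELAY_PS = {27: 581, 29: 603, 54: 851, 56: 866, 58: 885,
                   60: 903, 66: 967, 67: 974, 72: 1012}


def passes(path_delay_ps: float, config: int) -> bool:
    """Assumed pass criterion: the path settles within the launch-capture delay."""
    return path_delay_ps <= CONFIG_DELAY_PS[config]


def smallest_passing_config(path_delay_ps: float) -> Optional[int]:
    """Sweep from the longest delay to the shortest (Di > Di+1 > Dj) and
    return the last configuration that still passes, or None if none does."""
    best = None
    for config in sorted(CONFIG_DELAY_PS, reverse=True):
        if not passes(path_delay_ps, config):
            break
        best = config
    return best


def shift_detected(baseline: int, current: int, margin_ps: float) -> bool:
    """Flag a shift when the smallest passing delay grew by at least margin_ps."""
    return CONFIG_DELAY_PS[current] - CONFIG_DELAY_PS[baseline] >= margin_ps


fresh = smallest_passing_config(590)      # made-up path delay when fresh
stressed = smallest_passing_config(860)   # made-up path delay after stress
print(fresh, stressed, shift_detected(fresh, stressed, margin_ps=50))
```

Repeating the sweep in each self-test session and comparing against the stored baseline turns a gradual delay shift into a prediction signal before any functional failure occurs.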


On-line Self-Test & Diagnostics

Failure prediction, detection, self-healing

Challenges

Very high test coverage

Stuck-at not enough, delay tests required

No visible system downtime

Minimal costs & design flow impact

Existing Logic BIST difficult


CASP On-line Self-Test & Diagnostics

Concurrent with system operation

Autonomous

Stored Patterns: off-chip FLASH

Test compression: X-Compact

High coverage & upgradeable

Comparable or better than production tests

Special system architecture support for CASP

[Li DATE 08, VTS 10]

Robust Uncore Essential

[Figure: OpenSPARC T2 SoC (8 cores, 64 threads; © opensparc.net) breakdown: uncore 12%, processor cores 12%, memories 76%]

New on-line self-test & diagnostics for uncore

Naïve stall-and-test too expensive

New Uncore CASP Principles

I. Resource reallocation and sharing (RRS)

II. No-performance-impact testing

III. Smart backup

1% area, 1% power, 3% performance impact

Very low cost vs. concurrent error detection

[Figure: OpenSPARC T2 SoC, © opensparc.net]

200 MBytes off-chip FLASH

[Li VTS 10]

Hardware-Only CASP Inefficient

Non-trivial hardware modification

I/O packet drop, interrupts

Visible application performance impact

Solutions

VAST – Virtualization Assisted CASP Self-Test

OS migration

CASP-aware OS scheduling

VAST Demonstration Platform (NEC)

CPUs          | OS          | Virtualization s/w
ARM: MP11 x 4 | Linux 2.6.7 | NEC in-house

[Figure: coverage vs. efficiency for Logic BIST, CASP, and VAST + CASP-aware OS scheduling; high test quality]

[Inoue ITC 08]

CASP-Aware Software Orchestration

Workload: Firefox

Platform: Dual quad-core Xeon, Linux 2.6.25.9 scheduler modified

[Figure: response time for hardware-only CASP vs. CASP-aware OS scheduling, in three bins: < 200ms (No Effect), > 200ms and < 500ms, > 500ms (UNACCEPTABLE)]

[Li ICCAD 09]
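A toy model shows what CASP-aware scheduling means in practice: before a core goes into self-test, its runnable tasks move to cores that are not under test, so interactive work never waits behind a test session. This is a simplified sketch, not the Linux scheduler change from [Li ICCAD 09]; all names and the queue contents are hypothetical.

```python
def pick_core(load: dict, under_test: set) -> int:
    """Return the least-loaded core that is not currently running self-test."""
    available = {core: n for core, n in load.items() if core not in under_test}
    if not available:
        raise RuntimeError("all cores are under test")
    return min(available, key=lambda core: (available[core], core))


def migrate_off(core: int, run_queues: dict, under_test: set) -> None:
    """Mark `core` as under test and move its runnable tasks elsewhere."""
    under_test.add(core)
    while run_queues[core]:
        task = run_queues[core].pop()
        load = {c: len(q) for c, q in run_queues.items()}
        run_queues[pick_core(load, under_test)].append(task)


# Hypothetical run queues for a four-core system.
run_queues = {0: ["firefox", "x"], 1: [], 2: ["cron"], 3: []}
under_test = set()
migrate_off(0, run_queues, under_test)
print(run_queues)
```

Hardware-only CASP has no such view of the run queues, which is consistent with the slide's observation that it produces unacceptable response times for interactive workloads.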


Error Resilient System Architecture (ERSA)

[Figure: ERSA multicore with one Super Reliable Core and Relaxed Reliability Cores RRC1, RRC2, RRC3, …, RRCN, each with an L1 $, plus L2 $ Bank 1, Bank 2, …, Bank N; configurable {Vdd, fclk, protection}]

Asymmetric & configurable resilience: Application-aware

Killer probabilistic apps: Recognition, Mining, Synthesis (RMS)

RMS on ERSA: highly resilient, accuracy 90%+, minimal runtime impact

[Leem DATE 10]


Post-Silicon Validation Critical

New approach: IFRA + QED

Intel® Nehalem + Core™ i7 results

Improved error detection latencies: 10^6X

Higher error coverage: 4X

Highly accurate bug localization: 90 – 96%

“Post-silicon cost & complexity rising faster than design cost” – S. Yerramilli, V.P., Intel

[Park DAC 08, TCAD 09, DAC 10, Hong ITC 10]


Carbon Nanotube (CNT) FETs: Big Promise

Major barriers: inherent imperfections at nano-scale

Mis-positioned & metallic CNTs

Imperfection-immune design a MUST

New solutions: robust CNT VLSI

First demo: Adder sum, Latches, Monolithic 3D IC

[Figure: fabricated CNT circuits (20 µm scale bars, VDD and GND rails) and a monolithic 3D IC whose 1st and 2nd layers are connected by a conventional via, NOT TSV (terminals VDD, OUT, IN, BIAS, GND)]

Collaborator: Prof. H.-S.P. Wong, EE, Stanford

Conclusion

New solutions: elegantly simple, highly effective

[Figure: failure rate vs. time, with early-life failures (ELF), lifetime, and wearout regions]

Early-life failures (ELF): burn-in difficult, Iddq ineffective

Wearout (transistor aging): guardbands expensive

Soft Error Resilience

BISER + LEAP: Errors reduced: 2,000X

Circuit Failure Prediction: On-line Self-Test & Diagnostics

New ELF signature: Delay shifts over time