Robust System Design
to Overcome CMOS Reliability Challenges
Subhasish Mitra
Kevin Brelsford, Young Moon Kim, Kelin Lee, Yanjing Li
Robust Systems Group
Dept. of EE & Dept. of CS
Stanford University
Acknowledgment: Students & Collaborators
2
Robust System Challenges
Technology reliability limits – today’s focus
Soft errors, early-life failures, aging, variability, …
System complexity
Design bugs, defects
Malfunctions can be disastrous
Health, transport, finance, …
“It’s ridiculous. I’ve got a $300,000 server that
doesn’t work. The thing should be bullet-proof.”
Robust System Design
Perform correctly
Despite complexity & disturbances
Thorough test & validation
Tolerate imperfect hardware
Beyond silicon-CMOS: imperfection-immune logic
3
“Low-Cost” Error Detection Most Important
Concurrent error detection (CED) expensive
Crashes vs. silent errors
Belief: Logic parity inexpensive
Reality: Can be expensive
Logic sharing, complex routing
Belief: Software CED inexpensive
Reality: Only for some apps (matrix, FFT)
4
Design Principles to Achieve “Low Cost”
Discover
Failure mode signatures
Utilize
Application characteristics
Globally Optimize
Software orchestration
Reconfigurable resilience
Spend some area, minimize power costs
5
Low-Cost Resilience
[Figure: bathtub curve of failure rate vs. time, with early-life failure (ELF), lifetime, and wearout regions]
Early-life failures (ELF): burn-in difficult, Iddq ineffective
Wearout: transistor aging, guardbands expensive
Soft Error Resilience
BISER + LEAP: Errors reduced: 2,000X
Global optimization software-orchestrated
Circuit Failure Prediction: On-line Self-Test & Diagnostics
New ELF signature: Delay shifts over time
6
Outline
Introduction
Soft error resilience
Circuit failure prediction
On-line self-test & diagnostics
Conclusion
7
Who Cares About Soft Errors?
20K processors server farm
1 major flip-flop error every 20 days
Silent data corruption
$20K → $3,616 bank deposit
Downtime: $100K - $10M / hr.
Memory ECC routinely used
Soft error rate contributions:
Flip-flop
Unprotected memory
Comb. logic
System error rates increasing
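The farm-level figure above can be turned into a per-processor number with simple arithmetic. The sketch below assumes errors are spread evenly and independently across processors, which the slide does not state; the function names are illustrative only.

```python
# Rough arithmetic for the server-farm figures quoted on the slide.
# Assumption (not from the slide): errors are uniform and independent
# across processors.

PROCESSORS = 20_000        # processors in the server farm
DAYS_BETWEEN_ERRORS = 20   # one major flip-flop error every 20 days, farm-wide
DAYS_PER_YEAR = 365.25


def per_processor_interval_years(processors, days_between_errors):
    """Mean time between major flip-flop errors for a single processor."""
    processor_days = processors * days_between_errors
    return processor_days / DAYS_PER_YEAR


def farm_errors_per_year(days_between_errors):
    """Expected number of major flip-flop errors per year across the farm."""
    return DAYS_PER_YEAR / days_between_errors


years = per_processor_interval_years(PROCESSORS, DAYS_BETWEEN_ERRORS)
per_year = farm_errors_per_year(DAYS_BETWEEN_ERRORS)
print(f"one processor: about one error per {years:.0f} years")
print(f"whole farm: about {per_year:.1f} errors per year")
```

An error that is negligible for one machine (roughly once per thousand years under these assumptions) becomes a recurring event at farm scale.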
8
BISER: Built-In Soft Error Resilience
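[Figure: BISER flip-flop. The combinational logic output (IN) drives a latch and a redundant latch (reused for scan test & debug), both controlled by Clock; their Q outputs feed inputs A and B of a C-element with a weak keeper, which produces OUT.]
C-element truth table:
A B    C-element (A, B)
0 0    1
1 1    0
0 1    Previous value retained
1 0    Previous value retained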
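The C-element truth table can be modeled in a few lines of Python. This is a behavioral sketch only: it ignores timing and the analog behavior of the weak keeper, and it assumes an upset flips exactly one of the two latches. The class and method names are made up for illustration.

```python
def c_element(a, b, prev):
    """Inverting C-element, following the slide's truth table."""
    if a == b:
        return 1 - a      # 00 -> 1, 11 -> 0
    return prev           # 01 or 10 -> previous value retained


class BiserCell:
    """Latch plus redundant latch, combined through a C-element (sketch)."""

    def __init__(self, value=0):
        self.latch = value
        self.redundant = value
        self.out = 1 - value          # both latches agree at start

    def write(self, value):
        """Normal operation: both latches capture the same data."""
        self.latch = value
        self.redundant = value
        self._settle()

    def upset(self, which):
        """Model a soft error that flips exactly one latch."""
        if which == "latch":
            self.latch ^= 1
        elif which == "redundant":
            self.redundant ^= 1
        else:
            raise ValueError(which)
        self._settle()

    def _settle(self):
        self.out = c_element(self.latch, self.redundant, self.out)


cell = BiserCell()
cell.write(1)
before = cell.out
cell.upset("latch")       # latches now disagree
after = cell.out          # output is held at its previous value
print(before, after)
```

Because the two latches disagree after a single upset, the C-element holds its previous output and the error does not propagate.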
[Mitra IEEE Computer 05, ITC 06, Zhang TVLSI 06]
9
Architecture-Aware BISER Insertion
[Figure: cumulative error coverage (0% to 100%) vs. cumulative latch coverage (0% to 100%)]
Alpha 21264 error injection
10X chip-level protection: 9% chip-level power penalty
2X: 2.5% power penalty
Ack: Prof. S.J. Patel, UIUC for error injector
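One way to read the coverage curve: if protected latches are assumed to contribute no errors, an N-fold error reduction requires the protected set to cover 1 - 1/N of the injected errors (50% for 2X, 90% for 10X). The sketch below picks latches greedily by error contribution. The error counts and latch names are invented for illustration and are not Alpha 21264 data; the greedy policy is an assumption, not necessarily the method used in the talk.

```python
def select_latches(error_counts, reduction):
    """Greedily protect the latches that cause the most errors.

    error_counts: dict mapping latch name -> errors attributed to it
                  (for example from error injection); hypothetical data.
    reduction:    target error reduction factor (2 for 2X, 10 for 10X).
    Returns (chosen latch names, fraction of errors covered).
    Assumes a protected latch contributes no errors at all.
    """
    total = sum(error_counts.values())
    ranked = sorted(error_counts.items(), key=lambda kv: kv[1], reverse=True)
    covered = 0
    chosen = []
    for name, count in ranked:
        # Stop once the remaining errors are at most total / reduction.
        if (total - covered) * reduction <= total:
            break
        chosen.append(name)
        covered += count
    return chosen, covered / total


# Hypothetical, skewed distribution: a few latches dominate.
counts = {"latch_a": 50, "latch_b": 25, "latch_c": 15,
          "latch_d": 6, "latch_e": 3, "latch_f": 1}

print(select_latches(counts, 2))
print(select_latches(counts, 10))
```

With a skewed distribution like this one, a small fraction of the latches delivers most of the protection, which is the point of architecture-aware insertion.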
Optimized BISER insertion: verification-guided?
10
Reconfigurable BISER: Economy Mode
11
Integrated design quality
Soft error correction, scan test, post-silicon debug
[Figure: reconfigurable BISER cell built from a system flip-flop and a scan/checking flip-flop, combined through a C-element with keeper. Signals: System Data, System Clock, System Output, Scan Data, Scan Clock A, Scan Clock B = 1, Capture = 0, Update, Scan Output.]
45nm BISER Results
Radiation experiment results
Alpha particles: > 1,000X improvement feasible
Neutrons: > 100X improvement feasible
More reduction possible
[Seifert, Intel, IOLTS 08 Invited Speech]
12
Single Soft Error Resilience Not Enough
Single Event Multiple Upsets increasing
Node separation too expensive
New idea: LEAP
Special layout
New properties of single event transients
13
Layout by Error-Aware Transistor Positioning
2,000X fewer errors vs. D flip-flop
5X fewer soft errors vs. DICE
Same DICE circuit
3% power, 1% delay, 40% area costs
[Figure: DICE circuit (transistors M1 to M8, nodes n1 to n8) next to the LEAP-DICE layout]
14
Key LEAP Idea
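[Figure: inverter with in = 1 (one transistor ON, the other OFF) hit by a particle strike that affects both the PMOS node n1 and the NMOS node n2 between VDD and GND. The V(out) vs. time plot, between logic 0 and logic 1, shows a reduced single event transient.]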
15
Resilient Flip-flops Alone Not Enough
Design optimization essential
Case study
Given: flip-flops to be protected
Find: lowest-cost solution
Scenario 1: BISER only
Scenario 2: Flip-flop parity only
or
Scenario 3: BISER + flip-flop parity
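16
[Mitra DATE 10]
Scenario 3: BISER + Flip-flop Parity
[Figure: combinational logic outputs Y1, …, Ys drive D flip-flops q1, …, qs clocked by CK. Flip-flops q1, …, qk-2 are covered by a parity flip-flop p, with Parity = Y1 ⊕ … ⊕ Yk-2, and a parity checker that raises Error; some of the remaining flip-flops are BISER cells.]
Question: Which flip-flops for BISER?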
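The flip-flop parity scheme in the figure can be sketched as follows: the parity of the flip-flop inputs is stored in an extra flip-flop p at the clock edge, and the checker later compares it with the parity of the stored outputs. This is a behavioral sketch with illustrative function names; it says nothing about how the parity tree is implemented or how flip-flops are grouped.

```python
from functools import reduce
from operator import xor


def parity(bits):
    """XOR of all bits (0 for an empty group)."""
    return reduce(xor, bits, 0)


def clock_edge(y):
    """Capture inputs Y into the flip-flops and the predicted parity into p."""
    q = list(y)
    p = parity(y)
    return q, p


def parity_error(q, p):
    """Checker output: 1 if stored bits no longer match the stored parity."""
    return parity(q) ^ p


q, p = clock_edge([1, 0, 1, 1])
print(parity_error(q, p))   # no upset

q[2] ^= 1                   # single soft error in one flip-flop
print(parity_error(q, p))   # detected

q[0] ^= 1                   # second upset in the same parity group
print(parity_error(q, p))   # an even number of flips escapes detection
```

Parity detects an odd number of flipped bits per group but does not correct them, which is one reason the mix of parity and BISER per flip-flop becomes an optimization problem.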
17
Optimized BISER + Parity: SimpleSPI Core
[Figure: two plots of power cost (%) vs. % of selected flip-flops protected with parity (BISER for the rest). Experiment 1 and Experiment 2 each use 50% of the flip-flops, selected at random.]
Optimization a MUST
18
Outline
Introduction
Soft error resilience
Circuit failure prediction
On-line self-test & diagnostics
Conclusion
19
Low-Cost Resilience
[Figure: bathtub curve of failure rate vs. time, with early-life failure (ELF), lifetime, and wearout regions]
Early-life failures (ELF): burn-in difficult, Iddq ineffective
Wearout: transistor aging, guardbands expensive
Soft Error Resilience
BISER + LEAP: Errors reduced: 2,000X
Global optimization software-orchestrated
Circuit Failure Prediction: On-line Self-Test & Diagnostics
New ELF signature: Delay shifts over time
20
Circuit Failure Prediction: Early Indicator
Failure Prediction        Error Detection
Before errors appear      After errors appear
+ No corruption           – Corrupt data & states
+ Low cost                – High cost
+ Self-diagnosis          – Limited diagnosis
21
[Agarwal VTS 07, Li IEEE Design & Test 09]
Applicability: Early-life failures, circuit aging
Early-Life Failures (ELF)
Weak chips – caused by defects
Fail early in field (a.k.a. infant mortality)
Gate-oxide defects important
Burn-in ELF screen
Major test cost
Reduced effectiveness
Burn-in alternatives difficult: Iddq, VLV test
22
2001 Burn-in & Test Socket Workshop
23
Gate-Oxide ELF: Failure Prediction Example
New signature: Delay SHIFTs over time
Before functional failure
Distinct from NBTI, PBTI, hot carriers
[Chen VTS 08, IRPS 09, Kim VTS 10, VLSI Circuits 10]
24
Large-Scale Gate-Oxide ELF Experiments
[Figure: scatter plot of Ids after 5340 min. stress vs. fresh Ids, with outliers marked]
[Figure: X-Y map of outlier positions. Outlier locations random in 0.2 µm array (W = 0.2 µm)]
[Figure: standard normal quantile plot of Ids for fresh devices and after 10 min., 240 min., and 5340 min. stress, with outliers marked]
952 pairs: Ids outliers (11.6% of entire population)
885 pairs: 92%
952 pairs: largest Ig increase (11.6% of entire population)
48K transistor arrays
Gate-Oxide ELF Test Structure
Emulate single gate-oxide ELF using stress
NMOS or PMOS
Thin-oxide NMOS under stress
Thick-oxide
[Kim VLSI Circuits 10]
25
Gate-oxide ELF Stress: Delay Shifts
[Figure: measured delay shift (ps, 0 to 150) vs. stress time (a.u., 0 to 800) for a gate-oxide ELF (increased gate leakage), ending in functional failure. Stress levels of 0 and 1V are labeled.]
26
Big Question: How to Detect Delay Shifts?
Existing Techniques                 Why inadequate?
Delay fault detection flip-flops    Delay shift ≠ delay fault
Canary circuits                     ELF defects not detectable
Concurrent error detection          Expensive
27
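Solution: On-Line Self-Test and Diagnostics
[Figure: timeline of Task 1, Task 2, …, Task N, Task N+1, Task N+2, …, Task M, with on-line self-test & diagnostics interleaved between tasks. Each test uses Scan Enable for scan-in & scan-out and launch-capture at delays Di, Di+1, …, Dj, where Di > Di+1 > Dj.]
Configurable launch-capture delay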
28
Monotonic Launch-Capture Delay Control
[Figure: measured launch-capture delay (ps, 200 to 1,000) vs. delay configuration (0 to 120), with a phase change marked]
129 delay configurations
29
Fine control of less than 20ps
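Because the launch-capture delay grows monotonically with the configuration index, the smallest configuration at which a delay test still passes can be found by binary search instead of trying all 129 settings. The sketch below assumes a test passes at every configuration at or above some threshold; `passes` is a hypothetical callback standing in for one launch-capture test run, not an interface from the talk.

```python
def min_passing_config(passes, num_configs=129):
    """Smallest delay configuration for which passes(config) is True.

    passes: callable taking a configuration index and returning True when
            the launch-capture test passes at that setting (hypothetical).
    Relies on monotonicity: if a test passes at one configuration, it also
    passes at every larger one. Returns None if no configuration passes.
    """
    lo, hi = 0, num_configs          # hi == num_configs means "none found"
    while lo < hi:
        mid = (lo + hi) // 2
        if passes(mid):
            hi = mid                 # a passing setting: try shorter delays
        else:
            lo = mid + 1             # failing setting: need a longer delay
    return lo if lo < num_configs else None


# Example: a path that (hypothetically) needs configuration 54 or higher.
calls = []

def example_test(config):
    calls.append(config)
    return config >= 54

print(min_passing_config(example_test), "found in", len(calls), "test runs")
```

Tracking this minimum passing configuration over time gives a direct reading of how the slowest tested path is shifting.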
On-Line Delay Shift Detection Results
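[Figure: gate-oxide ELF delay shift detected on-line vs. stress time (a.u.), up to functional failure. The vertical axis pairs each delay with its delay configuration:]
Delay (ps)    Delay config
1,012         72
974           67
967           66
903           60
885           58
866           56
851           54
603           29
581           27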
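The delay/configuration pairs on this slide are enough to sketch how a delay shift could be flagged before functional failure. In the sketch below the mapping is taken from the slide, while the measurement history and the 50 ps alarm threshold are invented for illustration.

```python
# Delay configuration -> measured launch-capture delay in ps (from the slide).
DELAY_PS = {27: 581, 29: 603, 54: 851, 56: 866, 58: 885,
            60: 903, 66: 967, 67: 974, 72: 1012}


def delay_shifts(configs):
    """Delay shift (ps) of each measurement relative to the first one."""
    baseline = DELAY_PS[configs[0]]
    return [DELAY_PS[c] - baseline for c in configs]


def first_alarm(configs, threshold_ps):
    """Index of the first measurement whose shift reaches the threshold."""
    for index, shift in enumerate(delay_shifts(configs)):
        if shift >= threshold_ps:
            return index
    return None


# Hypothetical history: smallest passing configuration seen at successive
# on-line self-test runs of the same path.
history = [54, 54, 56, 58, 60, 66]

print(delay_shifts(history))
print(first_alarm(history, threshold_ps=50))
```

The alarm fires while the circuit still passes at a longer launch-capture delay, which is what separates failure prediction from error detection.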
30
[Kim VLSI Circuits 10]
On-Line Delay Shift Detection Results
[Figure: the gate-oxide ELF delay-shift plot from the preceding slide (delay in ps and delay configuration vs. stress time in a.u., up to functional failure), with an added plot of Ig (A), 10^-7 to 10^-5, vs. stress time (a.u.)]
31
[Kim VLSI Circuits 10]
On-line Self-Test & Diagnostics
Failure prediction, detection, self-healing
Challenges
Very high test coverage
Stuck-at not enough, delay tests required
No visible system downtime
Minimal costs & design flow impact
Existing Logic BIST difficult
32
CASP On-line Self-Test & Diagnostics
Concurrent with system operation
Autonomous
Stored Patterns: off-chip FLASH
Test compression: X-Compact
High coverage & upgradeable
Comparable or better than production tests
Special system architecture support for CASP
[Li DATE 08, VTS 10]
33
Robust Uncore Essential
[Figure: OpenSPARC T2 SoC (8 cores, 64 threads) with the uncore highlighted. Breakdown: uncore 12%, processor cores 12%, memories 76%. © opensparc.net]
New on-line self-test & diagnostics for uncore
Naïve stall-and-test too expensive
34
New Uncore CASP Principles
I. Resource reallocation and sharing (RRS)
II. No-performance-impact testing
III. Smart backup
1% area, 1% power, 3% performance impact
Very low cost vs. concurrent error detection
[Figure: OpenSPARC T2 SoC, © opensparc.net]
35
[Li VTS 10]
200 MBytes off-chip FLASH
Hardware-Only CASP Inefficient
Non-trivial hardware modification
I/O packet drop, interrupts
Visible application performance impact
Solutions
VAST – Virtualization Assisted CASP Self-Test
OS migration
CASP-aware OS scheduling
VAST Demonstration Platform (NEC)
CPUs: ARM: MP11 x 4
OS: Linux 2.6.7
Virtualization s/w: NEC in-house
[Figure: coverage vs. efficiency chart placing Logic BIST, CASP, and VAST + CASP-aware OS scheduling]
High test quality
36
[Inoue ITC 08]
37
CASP-Aware Software Orchestration
Workload: Firefox
Platform: Dual quad-core Xeon, Linux 2.6.25.9 scheduler modified
[Figure: response time for hardware-only CASP vs. CASP-aware OS scheduling, in three bins: < 200ms (no effect), > 200ms and < 500ms, > 500ms (UNACCEPTABLE)]
[Li ICCAD 09]
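The idea behind CASP-aware scheduling can be illustrated with a toy model: before a core goes off-line for self-test, its tasks are moved to cores that are not under test, so interactive work is not stalled. The placement policy below (least-loaded core first) and all names are assumptions made for illustration; the policy in [Li ICCAD 09] may differ.

```python
def pick_core(load, excluded):
    """Least-loaded core that is not excluded (ties go to the lowest id)."""
    candidates = [core for core in load if core not in excluded]
    if not candidates:
        raise RuntimeError("no core available outside test")
    return min(candidates, key=lambda core: (load[core], core))


def migrate_before_test(tasks_on_core, core_to_test):
    """Return a new placement with core_to_test emptied for self-test.

    tasks_on_core: dict mapping core id -> list of task names.
    The input dict is left unchanged.
    """
    placement = {core: list(tasks) for core, tasks in tasks_on_core.items()}
    load = {core: len(tasks) for core, tasks in placement.items()}
    for task in placement[core_to_test]:
        target = pick_core(load, {core_to_test})
        placement[target].append(task)
        load[target] += 1
    placement[core_to_test] = []
    return placement


# Hypothetical task placement on a four-core system.
before_test = {0: ["firefox", "a"], 1: ["b"], 2: [], 3: ["c", "d"]}
print(migrate_before_test(before_test, core_to_test=0))
```

A hardware-only scheme would instead stall whatever is running on the core under test, which matches the response-time penalty the slide labels unacceptable.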
Error Resilient System Architecture (ERSA)
[Figure: ERSA multi-core architecture with one Super Reliable Core and Relaxed Reliability Cores RRC1, RRC2, RRC3, …, RRCN, each with its own L1 $, sharing L2 $ Bank 1, Bank 2, …, Bank N]
[Leem DATE 10]
{Vdd, fclk, protection}
Killer probabilistic apps:
Recognition, Mining, Synthesis
Asymmetric & configurable resilience: Application-aware
Highly resilient
Accuracy: 90%+
Minimal runtime impact
RMS on ERSA
38
Outline
Introduction
Soft error resilience
Circuit failure prediction
On-line self-test & diagnostics
Conclusion
39
40
Post-Silicon Validation Critical
New approach: IFRA + QED
Intel® Nehalem + Core™ i7 results
Improved error detection latencies: 10^6X
Higher error coverage: 4X
Highly accurate bug localization: 90 – 96%
“Post-silicon cost & complexity rising faster than design cost” – S. Yerramilli, V.P., Intel
[Park DAC 08, TCAD 09, DAC 10, Hong ITC 10]
Carbon Nanotube (CNT) FETs: Big Promise
Collaborator: Prof. H.-S.P. Wong, EE, Stanford
41
Major barriers: inherent imperfections at nano-scale
Mis-positioned & metallic CNTs
Imperfection-immune design a MUST
New solutions → robust CNT VLSI
[Figure: micrographs of fabricated CNT circuits with VDD and GND rails (20 µm scale bars)]
First demo: Adder sum, Latches, Monolithic 3D IC
[Figure: monolithic 3D IC with 1st and 2nd layers connected by a conventional via, NOT TSV; terminals VDD, OUT, IN, BIAS, GND]
Conclusion
New solutions: elegantly simple, highly effective
42
[Figure: bathtub curve of failure rate vs. time, with early-life failure (ELF), lifetime, and wearout regions]
Early-life failures (ELF): burn-in difficult, Iddq ineffective
Wearout: transistor aging, guardbands expensive
Soft Error Resilience
BISER + LEAP: Errors reduced: 2,000X
Circuit Failure Prediction: On-line Self-Test & Diagnostics
New ELF signature: Delay shifts over time