New Approaches to Fault-Tolerant Systems Design
Andreas Steininger
Vienna University of Technology
A. Steininger page 2
My contact data
Andreas Steininger
Vienna University of Technology
Faculty of Informatics
Institute of Computer Engineering
Embedded Computing Systems Group
Treitlstrasse 3
A-1040 Vienna
Austria
http://ti.tuwien.ac.at/ecs
Main Contributors to this Material
Dr. Thomas Kottke R. Bosch AG / EADS
Dr. Peter Tummeltshammer R. Bosch AG / Thales
Dr. Christoph Scherrer Alcatel / Thales
Dr. Eric Armengaud DecomSys / VirtualVehicle
Dr. Karl Thaller DecomSys / Elektrobit Austria
Dr. Martin Horauer UAT Technikum Wien
Paul Milbredt AUDI AG
Outline
• Fault tolerance – some (very) basics
• Automotive electronics: the specific situation
• Design of a cost-efficient fault-tolerant node
  – Basic architecture
  – Temporal diversity
  – Treatment of common cause faults
  – Switching performance mode / safety mode
  – Fault-tolerance validation by fault injection
Faults, Errors and Failures
[Figure: a computer that outputs “1” where “0” is expected; labels: fault, error, failure]
Error Detection
[Figure: computer with labels fault, error, failure]
Fault detection: usually too difficult (too many possibilities)
Error Detection
[Figure: computer with labels error, failure]
Failure detection: too late, we want to prevent the failure!
Error Detection
[Figure: computer with erroneous output “1”]
To decide that “1” is wrong we need a reference. Where to get this reference from?
Option 1: Perform the same computation a second time (hopefully the fault is gone by then…)
Time redundancy
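Time redundancy can be sketched in a few lines of Python. This is a minimal illustration only: `run_twice`, `MismatchError` and the flaky adder are hypothetical names introduced here, and the fault is assumed to be transient (it hits exactly one call).

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class MismatchError(Exception):
    """Raised when two runs of the same computation disagree."""


def run_twice(compute: Callable[[], T]) -> T:
    """Time redundancy: execute the computation twice and compare the results."""
    first = compute()
    second = compute()  # hopefully the transient fault is gone by now
    if first != second:
        raise MismatchError(f"results differ: {first!r} vs {second!r}")
    return first


def make_flaky_adder(fail_on_call: int) -> Callable[[], int]:
    """Hypothetical computation 2 + 3 that suffers a bit flip on one given call."""
    calls = {"n": 0}

    def add() -> int:
        calls["n"] += 1
        result = 2 + 3
        if calls["n"] == fail_on_call:
            result ^= 1  # transient fault: lowest result bit flipped
        return result

    return add
```

A permanent fault would corrupt both runs in the same way and pass the comparison, which is why this option relies on the fault having disappeared before the second run.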
Error Detection
[Figure: computer with erroneous output “1”]
To decide that “1” is wrong we need a reference. Where to get this reference from?
Option 2: Use a second computer in parallel (hopefully this one works well…)
Space redundancy
Error Detection
[Figure: computer with erroneous output “1”]
To decide that “1” is wrong we need a reference. Where to get this reference from?
Option 3: Add additional information (hopefully not affected as well…)
Information redundancy
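A single even-parity bit is probably the simplest form of information redundancy. The sketch below is illustrative; the function names are chosen here and do not come from the slides.

```python
from typing import Tuple


def parity_bit(word: int) -> int:
    """Even parity: returns 1 if the number of set bits in `word` is odd."""
    return bin(word).count("1") & 1


def encode(word: int) -> Tuple[int, int]:
    """Attach the redundant parity bit to the data word."""
    return word, parity_bit(word)


def check(word: int, parity: int) -> bool:
    """True if the stored parity still matches the data word."""
    return parity_bit(word) == parity
```

Flipping one bit of an encoded word makes `check` return `False`; flipping two bits goes unnoticed, which is the usual limitation of a single parity bit.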
Achieving Fault Tolerance
Fail safe: system can be safely stopped when an error is detected
  example: train
  [Figure: two computers with error detection (ED)]
Fail operational: system must keep on working when an error is detected
  example: autopilot in airplane
  [Figure: three computers with error detection (ED)]
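The difference between the two strategies can be sketched as follows. The slides only show an ED block; using a plain comparison for the duplex case and a majority vote for the triplex case is an assumption made here for illustration, and the function names are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def duplex_fail_safe(a: int, b: int) -> Optional[int]:
    """Two replicas plus comparison: return the value, or None to request a safe stop."""
    return a if a == b else None


def triplex_fail_operational(outputs: Sequence[int]) -> int:
    """Three replicas plus majority vote: a single wrong replica is outvoted."""
    value, votes = Counter(outputs).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: more than one replica disagrees")
    return value
```

With two replicas a mismatch can only be detected, not resolved, so the system must be able to stop safely. The third replica is what allows operation to continue.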
Electronics in Cars – some Facts
• high proportion of value: up to 30%
• high development potential: more than 80% of the innovations
• high number of Electronic Control Units (ECUs): up to 70
• complex distributed system: different networks & topologies
Electronics in Cars - Benefits
• cheap alternative to existing mechanical solutions
  – lighter, smaller, cheaper, more flexible,…
• enabler for further optimizations
  – electronic ignition, motor management, …
• key to new functionality
  – safety: ESP, active suspension, crash sensing…
  – comfort: air conditioning, infotainment,…
  – security: immobilizer, alarm, electronic key, GPS tracking,…
  – autonomy: anticipatory braking, lane keeping,…
Key Demands
Safety
Real-Time
Low Cost
Robustness
Testability
Key Demands: Safety
– high risk potential (energy!)
– high public awareness
– no safe state (in general)
– certification required (EN 61508, ISO 26262)
– high complexity of system & application
– legal issues (liability)
Key Demands: Real-Time
– engine: 6000 rpm = 1 revolution per 10 ms
– VDM: 100 km/h = 28 cm per 10 ms
– need to synchronize distributed activities
– real-time communication
– image processing tasks
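The two timing figures above follow from simple unit conversions. The helper names below are chosen here for illustration.

```python
def revolution_period_ms(rpm: float) -> float:
    """Duration of one engine revolution in milliseconds."""
    return 60_000.0 / rpm  # 60,000 ms per minute


def distance_cm(speed_kmh: float, interval_ms: float) -> float:
    """Distance travelled in `interval_ms` at constant speed, in centimetres."""
    speed_cm_per_ms = speed_kmh * 100_000.0 / 3_600_000.0  # cm per km / ms per hour
    return speed_cm_per_ms * interval_ms
```

At 6000 rpm one revolution takes 10 ms, and at 100 km/h the car moves about 27.8 cm in 10 ms, which the slide rounds to 28 cm.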
Key Demands: Low Cost
– extreme competition
– high cost inhibits introduction
– tailored safety concepts: minimum degree of replication, use structural redundancies
– generic solutions: scalable, configurable, flexible
– marginal costs beat NRE (non-recurring engineering costs)
Current Status
• fail safe functions realized:
  – shut off upon error
  – mechanical fall-back system assumes control
• no true “by wire” functions
  – single-channel solutions sufficient
• tolerance against random faults
  – avoid design faults by field experience => no diversity
  – avoid common cause faults by design (?)
• single fault assumption
  – keep faults rare (shielding, etc.)
A Fault Tolerant Node
• mission: make a node (processor) fault tolerant
• need to consider CPU and memory
• aim is “fail safe” (but keep option for fail op in mind)
  – simplex unit with error detection capabilities
  – duplication and comparison
  – hybrid approach
Options for the CPU Core
Single core + ED | Dual core + cmp | Superscalar proc. + cmp + ED
Single core + ED: modify custom CPU core
– parity for buses
– two-rail coding for signals
– self-checking implementation of simple units
– duplicate & compare for complex units
– careful layout
Options for the CPU Core
Single core + ED | Dual core + cmp | Superscalar proc. + cmp + ED
Dual core + cmp: duplicate custom CPU core
– master/checker operation
– shared (safe) memory
– validity check for inputs
– self-checking comparator checks equality of outputs
– option: clock delay
– option: mode switch
Solution Example “Dual Core Frame”
• benefits
  – can use custom core without modifications
  – safety analysis valid for other cores as well
  – promises high ED coverage with moderate effort
  – CPU is hard to protect otherwise
• crucial points
  – enable easy recovery (=> keep outage short)
  – eliminate single points of failure
  – detect common cause faults
Protection in the Dual Core Frame
[Figure: Core 1 (Master) and Core 2 (Checker) both connect to the instruction memory and the data memory (“safe memories”) via Instr. Addr., Instr., Data Addr., Data out and Data in. Three comparators (=?) check the cores against each other and drive Error_Sig. Protection mechanisms: parity for buses, dual-rail coding, self-checking comparators.]
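Dual-rail coding and a comparator built on it can be illustrated with the textbook two-rail checker cell. This is a sketch under assumptions: the slides do not give the comparator's internal structure, and `two_rail_cell`, `is_valid` and `compare_outputs` are names introduced here. A dual-rail pair is valid when its two wires carry complementary values.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # dual-rail signal; valid code words are (0, 1) and (1, 0)


def two_rail_cell(x: Pair, y: Pair) -> Pair:
    """Textbook two-rail checker cell: output is valid only if both inputs are valid."""
    x0, x1 = x
    y0, y1 = y
    z0 = (x0 & y0) | (x1 & y1)
    z1 = (x0 & y1) | (x1 & y0)
    return z0, z1


def is_valid(p: Pair) -> bool:
    """A dual-rail pair is valid when its two rails differ."""
    return p[0] != p[1]


def compare_outputs(master: int, checker: int, width: int) -> Pair:
    """Compare two words of `width` >= 1 bits.

    Bit i of the master is paired with the inverted bit i of the checker,
    so each pair is a valid code word exactly when the two bits are equal.
    """
    pairs: List[Pair] = [
        ((master >> i) & 1, ((checker >> i) & 1) ^ 1) for i in range(width)
    ]
    result = pairs[0]
    for p in pairs[1:]:
        result = two_rail_cell(result, p)  # chain of checker cells
    return result
```

Equal words yield a valid output pair, while any mismatch yields (0, 0) or (1, 1). Because the error indication is itself dual-rail, a stuck output wire also shows up as an invalid pair sooner or later.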
Potential for Common Cause Faults
• identical input data
• identical clock (lock step)
• shared clock generator
• shared power supply
• both processors on same die (physical proximity; thermal & mechanical coupling)
Temporal Diversity
• operate checker with a delay against master
  – same fault hits at different point of computation
  – therefore different effect => detect by comparison
  – different critical paths emerge
• store master output for comparison
• choose delay of 1 / 1.5 / 2 clock cycles
  – larger delay causes high effort for little gain (=> experiments)
  – error detection latency is equal to the delay
  – need to delay memory write and outputs by this amount
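The effect of the delay can be reproduced with a toy simulation. Everything below is a simplified assumption: both cores run the same hypothetical `step` function, a common cause fault flips the same state bit in both cores in the same clock cycle, and only whole-cycle delays are modelled (not the 1.5-cycle option).

```python
from collections import deque
from typing import Optional


def step(state: int) -> int:
    """Hypothetical deterministic computation executed by both cores."""
    return (state * 5 + 1) % 256


def run_lockstep(cycles: int, delay: int, fault_cycle: int) -> Optional[int]:
    """Return the clock cycle in which the comparator sees a mismatch, or None."""
    master_state = checker_state = 0
    fifo = deque()  # stored master outputs waiting for the delayed checker
    for clk in range(cycles):
        flip = 0x10 if clk == fault_cycle else 0  # fault hits both cores at once
        master_state = step(master_state) ^ flip
        fifo.append(master_state)
        if clk >= delay:  # checker starts `delay` cycles after the master
            checker_state = step(checker_state) ^ flip
            if fifo.popleft() != checker_state:
                return clk
    return None
```

Without delay the fault corrupts the same computation step in both cores and the outputs still agree. With a delay of one or two cycles the fault corrupts different steps, so the comparison of corresponding outputs fails.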
Temporal Diversity: Implementation
[Figure: Core #1 (Master) and Core #2 (Checker) with instruction memory and data memory as in the dual core frame; comparators (=?) generate the Error signal; blocks labelled DT mark the delay.]
Fail Safe Dual Core Frame – Summary
• safe memories for instructions and data
• comparison of all core outputs
• parity protection for buses (data, address)
• dual rail coding for single signals (int, rst, err)
• totally self-checking comparators
• temporal diversity
How safe is the proposed solution?
Assessment of the Solution’s Quality
• How to measure quality? (Aim is fail safe)
  – error detection coverage => detect all errors
  – error detection latency => detect them quickly
• Which method to choose?
  – theoretical analysis / modelling
  – experimental fault injection
  – field observation
Fault Injection Experiment
• 2 SPEAR cores in fail safe frame (= DUT)
• synthesized to EDIF netlist
• faults injected one by one into netlist
• exhaustive list of stuck-at-1 and stuck-at-0 faults
• download to FPGA, application run
• “golden device” as reference (= REF)
• upon mismatch (DUT ≠ REF) => check comparator
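The campaign can be mimicked in software on a toy circuit. This is only an illustration of the procedure: the two-gate netlist, the signal names and the `campaign` function are invented here, and the actual experiment used SPEAR cores on an FPGA.

```python
from itertools import product
from typing import Dict, Optional, Tuple

# Toy circuit: out = (a AND b) OR c, given as (output net, operation, input nets).
NETLIST = [("n1", "and", ("a", "b")), ("out", "or", ("n1", "c"))]
INPUTS = ("a", "b", "c")
Fault = Tuple[str, int]  # (net name, stuck-at value)


def evaluate(vector: Dict[str, int], fault: Optional[Fault] = None) -> int:
    """Evaluate the circuit; if a fault is given, that net is forced to its value."""
    def force(name: str, value: int) -> int:
        return fault[1] if fault and fault[0] == name else value

    nets = {name: force(name, vector[name]) for name in INPUTS}
    for out, op, (x, y) in NETLIST:
        value = nets[x] & nets[y] if op == "and" else nets[x] | nets[y]
        nets[out] = force(out, value)
    return nets["out"]


def campaign() -> Dict[Tuple[str, Fault], Tuple[bool, bool]]:
    """Inject every stuck-at fault into master or checker; record (effect, detected)."""
    all_nets = list(INPUTS) + [gate[0] for gate in NETLIST]
    faults = [(n, v) for n in all_nets for v in (0, 1)]  # exhaustive fault list
    vectors = [dict(zip(INPUTS, bits)) for bits in product((0, 1), repeat=3)]
    report = {}
    for copy in ("master", "checker"):
        for fault in faults:
            effect = detected = False
            for vec in vectors:
                ref = evaluate(vec)  # golden device (REF)
                master = evaluate(vec, fault if copy == "master" else None)
                checker = evaluate(vec, fault if copy == "checker" else None)
                effect |= master != ref        # DUT output deviates from REF
                detected |= master != checker  # comparator raises the error
            report[(copy, fault)] = (effect, detected)
    return report
```

Faults injected into the checker copy never change the DUT output but are still flagged by the comparator, which corresponds to the "detected, no effect" class of such an experiment.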
Results of FI Experiment

                                 master   slave   frame   overall
detected      no effect             204   51170    3517     54891
              before effect       19047      98     734     19879
              during effect  RD       0       0       0         0
                             WR     559       0     921      1480
              after effect   RD   31455       0      87     31542
                             WR       0       0       0         0
not detected  no effect            4269    4276    1073      9618
              with effect             0       0       0         0
overall                           55534   55544    6332    117410

• No change of memory contents in case of error
• Erroneous read access is uncritical
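The overall column of these results can be condensed into a single coverage figure. Defining coverage as the share of effective faults (faults with an effect) that were detected is an assumption made here. The slide itself only reports the raw counts.

```python
# Overall column of the fault injection results (number of faults per class).
detected = {
    "no effect": 54891,
    "before effect": 19879,
    "during effect (WR)": 1480,
    "after effect (RD)": 31542,
}
not_detected = {"no effect": 9618, "with effect": 0}

total = sum(detected.values()) + sum(not_detected.values())

# Faults that changed the behaviour of the DUT at all.
effective = total - detected["no effect"] - not_detected["no effect"]

# Share of effective faults that the frame detected.
coverage = (effective - not_detected["with effect"]) / effective
```

With these numbers `total` is 117410, `effective` is 52901 and `coverage` is 1.0, since no fault with an effect escaped detection.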
Enabling fast Recovery
• error signal (dual rail)
  – notifies external component / memory
  – turns any further WR into RD (error confinement)
  – triggers processor interrupt
• status register (memory mapped)
  – updated by HW
  – indicates source of error (data parity, address mismatch,…)
• recovery
  – can build on uncorrupted status
  – can benefit from detailed status information
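Error confinement by masking writes can be modelled as follows. The class, its method names and the bit assignment of the status register are hypothetical. The slides only state that writes turn into reads and that hardware records the error source.

```python
DATA_PARITY = 0       # hypothetical bit positions in the status register
ADDRESS_MISMATCH = 1


class ConfinedMemory:
    """Toy memory whose write port is masked once the error signal is active."""

    def __init__(self, size: int) -> None:
        self.cells = [0] * size
        self.error = False
        self.status = 0  # one bit per error source, set by "hardware"

    def raise_error(self, source_bit: int) -> None:
        """Activate the error signal and record the source of the error."""
        self.error = True
        self.status |= 1 << source_bit

    def access(self, addr: int, write: bool, data: int = 0) -> int:
        """Perform a read or write; after an error a write degrades to a read."""
        if write and not self.error:
            self.cells[addr] = data
        return self.cells[addr]
```

After `raise_error` the memory contents stay frozen, so a recovery routine can rely on the data written before the error and read the cause from `status`.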
Why is fast Recovery important?
• application specific fault-tolerance time
  – application can “survive” without computer, even in fail-operational case
  – typ. some 10 ms for car (recall: 100 km/h = 28 cm per 10 ms)
• meaning of fast recovery
  – if failed computer recovers within FT time, no need for hot standby => COST!
  – re-booting after failure is pragmatic, safe, expensive!
Fail Safe Dual Core – Summary 1
• duplicate & compare
  – generic approach, applicable to any core type
  – covers all (local) errors
  – need to carefully eliminate single points of failure
  – need to complement with protection for signals & buses
• temporal diversity
  – mitigates (many) common cause failures
  – requires output delay to ensure error confinement
Possible Sources of CCFs
• Design & process
  – design fault or (latent) process deficiency
• Thermal coupling
  – hot spot affects both replicas in the same way
• Mechanical defect
  – affects both replicas symmetrically
• Electrical coupling
  – wire bound (shared lines: VDD, reset, clock)
  – wireless (EMI)
Why use Single Die then?
• cheaper and faster
  – use two instances of same design
  – fast & comprehensive comparison
• CCFs on single die
  – intuitively higher threat
  – quantification of threat?
  – mitigation techniques?
The Actual Problem with CCFs
One fault event affects both replicas
AND
is not detected by the comparator, i.e. leads to a “symmetric” fault effect
AND
produces an erroneous output, i.e. does not crash the cores
Possible Countermeasures for CCFs
Sources:
• Design & process: design fault or (latent) process deficiency
• Thermal coupling: hot spot affects both replicas in the same way
• Mechanical defect: affects both replicas symmetrically
• Electrical coupling: wire bound (shared lines: VDD, reset, clock), wireless (EMI)
Countermeasures:
• diversity, burn-in, fault avoidance
• asymmetric propagation paths
• asymmetric critical paths
• asymmetric antennas (?)
Propagation Speed Comparison
Thermal & mechanical propagation are relatively slow
10,000s of clock cycles within 1 ms
Experimental Assessment
Evaluation Experiments
1) single corresponding points with offset t
2) multiple corresp. points with offset t
3) single non-corresp. points, no offset
[Figure: for each experiment, fault locations in Core 1 and Core 2. Setup: Master and Checker with compare unit, plus a Golden Node observing Data, Addr, Iaddr and We. Question checked: erroneous write access?]
Symmetry Requirements for CCF
even a small offset…
fault multiplicity …
asymmetry of impact …
…improve detection coverage
[Figure labels: RF (7028), ExVecTab (8202), ALU (2472), PSW (308), DEC (152), P2 (158), PC+P1 (182)]
Squeezing out more Efficiency
• dual core is expensive
  – normally yields performance improvement
  – would be welcome here as well: increasing performance demand @ limited clock rates
  – but: exclusively dedicated to safety here
• observation: not all tasks are safety critical
• enable flexible switching between “safety mode” and “performance mode”
Operation in Performance Mode
• cores execute different instruction streams in parallel
• both cores have direct access to memory / peripherals
• instruction caches introduced to minimize penalties from conflicting access
• temporal diversity disabled
• comparator disabled
Requirements on the Mode Switching
• coherent operation in safety mode
  – internal states of cores must be aligned before switching to safety mode (register file, cache)
• safe operation in safety mode
  – switching must not introduce safety leakage
  – no corruption of safety-relevant data in perform. mode
• low performance penalty for mode switching
  – slow or complicated switching would spoil the anticipated performance gain
Implementation of the Split Core Frame
[Figure: block diagram. Core 1 and Core 2, each with an instruction cache and a Mode-Switch Detect block; a common Mode-Switch unit; Instruction RAM with control (safe instruction memory) and Data RAM with control (safe data memory). Signals per core: instruction address, instruction, data address, data out, data in, clk, wait signal, interrupt, mode switch. Buses carry address with parity, data with parity, address parity and instruction parity.]
Mode Switch: Safety => Performance
[Timing diagram: clk, clk_core2, core1 signal, core2 signal, message1, message2, wait1, wait2 and status safety mode during the transition from safety mode to performance mode]
Instruction sequence:
  LDL r1, 248       load ID reg address
  LDH r1, 255
  mode switching    mode switch instr => core1 wait => core2 wait => clk align => switch mode
  LDW r2, r1        load & check ID bit
  BTEST r2, 1
  JMPI_CT           => cond branch core2
Mode Switch: Performance => Safety
[Timing diagram: clk, clk_core2, core1 signal, core2 signal, message1, message2, wait1, wait2 and status safety mode during the transition from performance mode to safety mode]
• core1 encounters mode switch instr
  => trigger MSU (core1 signal)
  => halt core1 (wait1)
  => interrupt core2 (message2)
• core2 encounters interrupt
  => save context
  => jump to mode switch instr
• core2 executes mode switch
  => halt core2 & switch clock
  => resume core1
  => resume core2 after delay
Fault Injection in Safety Mode

                              master   slave   frame   overall
detected      no effect         1029   56962    5334     63325
              before effect     5026       0    1324      6350
              within 1.5 cy    50956       0     569     51525
              later                0       0       0         0
not detected  no effect         7055    7102    4275     18432
              with effect          0       0       0         0
overall                        64066   64064   11502    139632

Delayed WR still ensures error confinement
Fault Injection in Performance Mode
fault injected in performance mode, then switch to safety mode

Columns: detection in perf mode (early / late / stuck), detection in safety mode (≤1.5 cy / >1.5 cy), never
Rows (effect in), values in the order of the original table:
  perf only:    1149  423  25617  34583  458
  both modes:   --  --  --  0  0  0
  safety only:  --  --  --  9654  0  0
  none:         1473  47715  18560

• No undetected effects / late detections in safety mode
• Watchdog important to prevent hang-up in perf mode
We still need a “Safe Memory”
• detect bit flips in storage cells
  – parity (or EDC/ECC)
• detect erroneous address decoding
  – special decoder logic design
• protect interfaces
  – parity for data, address and control buses
• prevent illegal WR access
  – provide mask input for write enable
Why not duplicate & compare?
Possible Address Decoder Errors
• correct behavior: any given address activates exactly one assigned memory cell
• erroneous behaviors:
  – an address activates no memory cell at all
  – an address activates more than one memory cell
  – an address activates a wrong memory cell
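The three error classes can be told apart by looking at the decoder's select lines. The function below is a test-bench style classification written for illustration only. Its name and return strings are invented here, and it is not the hardware checker itself.

```python
from typing import Sequence


def classify_decoder(address: int, select_lines: Sequence[int]) -> str:
    """Classify decoder behaviour from the cell select lines (1 = cell activated)."""
    active = [i for i, line in enumerate(select_lines) if line]
    if not active:
        return "no cell activated"
    if len(active) > 1:
        return "multiple cells activated"
    if active[0] != address:
        return "wrong cell activated"
    return "correct"
```

For address 2 of a four-cell array, `[0, 0, 1, 0]` is the only correct pattern. An all-zero pattern, a pattern with two ones, or a single one in another position each falls into one of the error classes listed above.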
Checking the Address Decoder
• large decoders built from cascade of smaller ones
• re-check parity behind cell array: OR over even cells parity ?
• check for missing or multiple cell activations: XOR(upper half) XOR(lower half) ?
[Figure labels: address inputs A0, A1, A2 and AP; AND gates (&) driving the memory cell array; XOR gates; two dual-rail checkers; output pe]
Summary
• the automotive domain has its own laws and rules
  – need “extremely cost-effective robust solutions for safety-critical real-time applications, versatile and custom tailored”
• on node level
  – different redundancy concepts applicable
  – example: dual core CPU and memory with protection mech’s
  – on-line testing for memory may be required
• on system level
  – crucial role of communication infrastructure
  – advantages of time triggered approach
  – insufficient suitability of structural testing
Related Projects
• STEACS (Systematic Test of Embedded Automotive Communication Systems)
  http://embsys.technikum-wien.at/projects/steacs/index.html
• EXTRACT (Exploiting Synchrony for Transparent Communication Services Testing)
  http://ti.tuwien.ac.at/ecs/research/projects/extract
• DARTS (Distributed Algorithms for Robust Tick Synchronization)
  http://ti.tuwien.ac.at/ecs/research/projects/DARTS