
New Approaches to Fault-Tolerant Systems Design

Andreas Steininger

Vienna University of Technology

A. Steininger page 2

My contact data

Andreas Steininger
Vienna University of Technology

Faculty of Informatics
Institute of Computer Engineering

Embedded Computing Systems Group

Treitlstrasse 3
A-1040 Vienna

Austria

[email protected]

http://ti.tuwien.ac.at/ecs


Main Contributors to this Material

Dr. Thomas Kottke R. Bosch AG / EADS

Dr. Peter Tummeltshammer R. Bosch AG / Thales

Dr. Christoph Scherrer Alcatel / Thales

Dr. Eric Armengaud DecomSys / VirtualVehicle

Dr. Karl Thaller DecomSys / Elektrobit Austria

Dr. Martin Horauer UAT Technikum Wien

Paul Milbredt AUDI AG


Outline

• Fault tolerance – some (very) basics
• Automotive electronics: the specific situation
• Design of a cost efficient fault tolerant node
  – Basic architecture
  – Temporal diversity
  – Treatment of common cause faults
  – Switching performance mode / safety mode
  – Fault-tolerance validation by fault injection


Faults, Errors and Failures

[Figure: a computer delivering "1" instead of "0"; chain fault => error => failure]


Error Detection

Fault detection: usually too difficult (too many possibilities)


Error Detection

Failure detection: too late, we want to prevent the failure!

Error Detection

To decide that „1“ is wrong we need a reference. Where to get this reference from?

Option 1: Time redundancy
Perform the same computation a second time (hopefully the fault is gone by then…)
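Time redundancy can be sketched in a few lines: run the same computation twice and treat any disagreement as a detected error. The names `run_twice_and_compare` and `make_flaky_add`, as well as the bit-flip fault model, are illustrative assumptions and not part of the original material.

```python
from typing import Callable, Set, Tuple, TypeVar

T = TypeVar("T")


def run_twice_and_compare(compute: Callable[[], T]) -> Tuple[bool, T]:
    """Execute `compute` twice; report an error if the two results differ."""
    first = compute()
    second = compute()
    return (first != second, first)


def make_flaky_add(a: int, b: int, faulty_runs: Set[int]) -> Callable[[], int]:
    """Hypothetical adder whose result is corrupted in the given run numbers."""
    state = {"run": 0}

    def add() -> int:
        state["run"] += 1
        result = a + b
        if state["run"] in faulty_runs:
            result ^= 1  # the fault flips the lowest result bit
        return result

    return add


# Transient fault during the first run only: the second run disagrees.
error_detected, _ = run_twice_and_compare(make_flaky_add(2, 3, {1}))
# Fault present in both runs: both results are equally wrong, nothing is seen.
missed, _ = run_twice_and_compare(make_flaky_add(2, 3, {1, 2}))
```

The second call illustrates the "hopefully the fault is gone by then" caveat: a fault that persists across both runs produces two identical wrong results and escapes the comparison.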


Error Detection

Option 2: Space redundancy
Use a second computer in parallel (hopefully this one works well…)


Error Detection

Option 3: Information redundancy
Add additional information (hopefully not affected as well…)
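A single parity bit is probably the simplest instance of information redundancy. The sketch below uses even parity; the helper names and the example word are assumptions chosen for illustration.

```python
from typing import Tuple


def parity_bit(word: int) -> int:
    """Even parity: returns 1 if the number of set bits in `word` is odd."""
    return bin(word).count("1") % 2


def encode(word: int) -> Tuple[int, int]:
    """Store the word together with its parity bit."""
    return word, parity_bit(word)


def check(word: int, parity: int) -> bool:
    """True if the word is consistent with the stored parity bit."""
    return parity_bit(word) == parity


stored_word, stored_parity = encode(0b10110010)
single_flip = stored_word ^ 0b00000100  # one bit corrupted
double_flip = stored_word ^ 0b00000110  # two bits corrupted

ok_clean = check(stored_word, stored_parity)    # consistent
ok_single = check(single_flip, stored_parity)   # inconsistent: error detected
ok_double = check(double_flip, stored_parity)   # consistent again: error missed
```

The double flip shows the limit of a single parity bit: any even number of bit errors goes unnoticed, which is why stronger codes (EDC/ECC) are used where that matters.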

Achieving Fault Tolerance

Fail safe: system can be safely stopped when an error is detected
example: train
[Figure: two computers with error detection (ED)]

Fail operational: system must keep on working when an error is detected
example: autopilot in airplane
[Figure: three computers with error detection (ED)]
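The difference between the two strategies can be illustrated with a small sketch. It assumes that the fail-operational variant with three computers uses majority voting, which is one common realization; the slide itself only shows three computers plus error detection. Function names are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def fail_safe(outputs: Sequence[int]) -> Optional[int]:
    """Two replicas: deliver the value only if both agree, else stop (None)."""
    a, b = outputs
    return a if a == b else None


def fail_operational(outputs: Sequence[int]) -> Tuple[Optional[int], bool]:
    """Three replicas: majority vote; returns (value, error_detected)."""
    value, votes = Counter(outputs).most_common(1)[0]
    if votes >= 2:
        # A majority exists; an error is flagged if one replica deviates.
        return value, votes < len(outputs)
    # No two replicas agree: no trustworthy value is available.
    return None, True
```

With two replicas a mismatch can only be detected, so the system has to stop. With three replicas a single deviating replica is outvoted and operation continues.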


Electronics in Cars – some Facts

high proportion of value: up to 30%

high development potential: more than 80% of the innovations

high number of Electronic Control Units (ECUs): up to 70

complex distributed system: different networks & topologies


Electronics in Cars - Benefits

cheap alternative to existing mechanical solutions
– lighter, smaller, cheaper, more flexible, …

enabler for further optimizations
– electronic ignition, motor management, …

key to new functionality
– safety: ESP, active suspension, crash sensing, …
– comfort: air conditioning, infotainment, …
– security: immobilizer, alarm, electronic key, GPS tracking, …
– autonomy: anticipatory braking, lane keeping, …


Key Demands

Safety

Real-Time

Low Cost

Robustness

Testability


Key Demands: Safety

– high risk potential (energy!)

– high public awareness

– no safe state (in general)

– certification required (EN 61508, ISO 26262)

– high complexity of system & application

– legal issues (liability)


Key Demands: Real-Time

– engine: 6000 rpm = 1 revolution per 10 ms

– VDM: 100 km/h = 28 cm per 10 ms

– need to synchronize distributed activities

– real-time communication

– image processing tasks
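The two timing figures on this slide follow from plain unit conversions; written out (values rounded):

```latex
\[
6000\ \mathrm{rpm} = \frac{6000\ \mathrm{rev}}{60\ \mathrm{s}} = 100\ \mathrm{rev/s}
\quad\Rightarrow\quad 1\ \mathrm{rev}\ \text{per}\ 10\ \mathrm{ms}
\]
\[
100\ \mathrm{km/h} = \frac{100\,000\ \mathrm{m}}{3600\ \mathrm{s}} \approx 27.8\ \mathrm{m/s}
\quad\Rightarrow\quad 27.8\ \mathrm{m/s}\cdot 0.01\ \mathrm{s} \approx 28\ \mathrm{cm}\ \text{per}\ 10\ \mathrm{ms}
\]
```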


Key Demands: Low Cost

– extreme competition

– high cost inhibits introduction

– tailored safety concepts
  - minimum degree of replication
  - use structural redundancies

– generic solutions
  - scalable, configurable, flexible

– marginal (per-unit) costs beat NRE (non-recurring engineering) costs


Current Status

fail safe functions realized:
– shut off upon error
– mechanical fall-back system assumes control

no true “by wire” functions
– single-channel solutions sufficient

tolerance against random faults
– avoid design faults by field experience => no diversity
– avoid common cause faults by design (?)

single fault assumption
– keep faults rare (shielding, etc.)


A Fault Tolerant Node

mission: make a node (processor) fault tolerant

need to consider CPU and memory

aim is “fail safe” (but keep option for fail op in mind)
– simplex unit with error detection capabilities
– duplication and comparison
– hybrid approach


Options for the CPU Core

Single core + ED
Dual core + cmp
Superscalar proc. + cmp + ED

modify custom CPU core
– parity for buses
– two-rail coding for signals
– self-checking implementation of simple units
– duplicate & compare for complex units
– careful layout


Options for the CPU Core

duplicate custom CPU core
– master/checker operation
– shared (safe) memory
– validity check for inputs
– self-checking comparator checks equality of outputs
– option: clock delay
– option: mode switch


Solution Example “Dual Core Frame”

benefits
– can use custom core without modifications
– safety analysis valid for other cores as well
– promises high ED coverage with moderate effort
– CPU is hard to protect otherwise

crucial points
– enable easy recovery (=> keep outage short)
– eliminate single points of failure
– detect common cause faults


Protection in the Dual Core Frame

[Figure: Core 1 (Master) and Core 2 (Checker) connected to Instr. Mem and Data Mem („Safe memories“); signals Instr. Addr., Instr., Data Addr., Data out, Data in; three comparators (=?) driving Error_Sig. Protection mechanisms labeled: parity for buses, dual-rail coding, self-checking comparators.]


Potential for Common Cause Faults

identical input data
identical clock (lock step)
shared clock generator
shared power supply
both processors on same die (physical proximity; thermal & mechanical coupling)


Temporal Diversity

operate checker with a delay against master
– same fault hits at different point of computation
– therefore different effect => detect by comparison
– different critical paths emerge

store master output for comparison

choose delay of 1 / 1.5 / 2 clock cycles
– larger delay causes high effort for little gain (=> experiments)
– error detection latency is equal to the delay
– need to delay memory write and outputs by this amount
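The effect of the delay can be illustrated with a toy simulation. It assumes a common cause glitch that flips the same output bit of both cores in one absolute clock cycle, and it models only whole-cycle delays (the 1.5 cycle option is not covered). The function `golden` is a hypothetical stand-in for the fault-free core output; none of the names come from the original design.

```python
from typing import Dict, List


def golden(step: int) -> int:
    """Hypothetical fault-free core output for a given program step."""
    return (step * 37 + 11) & 0xFF


def run_pair(steps: int, delay: int, glitch_cycle: int) -> List[int]:
    """Return the program steps at which the comparator sees a mismatch.

    A common cause glitch in clock cycle `glitch_cycle` flips bit 0 of
    whatever each core outputs in that cycle.
    """
    master_buffer: Dict[int, int] = {}  # master outputs wait for the checker
    mismatches = []
    for cycle in range(steps + delay):
        glitch = 1 if cycle == glitch_cycle else 0
        if cycle < steps:  # master executes step `cycle`
            master_buffer[cycle] = golden(cycle) ^ glitch
        checker_step = cycle - delay  # checker lags by `delay` cycles
        if checker_step >= 0:
            checker_out = golden(checker_step) ^ glitch
            if master_buffer.pop(checker_step) != checker_out:
                mismatches.append(checker_step)
    return mismatches


lock_step = run_pair(steps=8, delay=0, glitch_cycle=3)  # symmetric effect
diverse = run_pair(steps=8, delay=2, glitch_cycle=3)    # asymmetric effect
```

In lock step both cores are corrupted at the same program step and the comparator sees identical (wrong) values. With a delay the same glitch hits the two cores at different program steps, so the comparison fails at both affected steps.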


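Temporal Diversity: Implementation

[Figure: Core #1 (Master) and Core #2 (Checker) with Instr. Mem and Data Mem; comparators (=?) drive the Error signal; blocks labeled DT are added to the dual core frame]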


Fail Safe Dual Core Frame – Summary

safe memories for instructions and data
comparison of all core outputs
parity protection for buses (data, address)
dual rail coding for single signals (int, rst, err)
totally self-checking comparators
temporal diversity

How safe is the proposed solution?


Assessment of the Solution’s Quality

How to measure quality? (Aim is fail safe)
– error detection coverage => detect all errors
– error detection latency => detect them quickly

Which method to choose?
– theoretical analysis / modelling
– experimental fault injection
– field observation


Fault Injection Experiment

2 SPEAR cores in fail safe frame (= DUT)

– synthesized to EDIF netlist
– faults injected one by one into netlist
– exhaustive list of stuck-at-1 and stuck-at-0 faults
– download to FPGA, application run
– “golden device” as reference (= REF)

upon mismatch (DUT ≠ REF) => check comparator


Results of FI Experiment

                                    master    slave    frame   overall
detected      no effect                204    51170     3517     54891
              before effect          19047       98      734     19879
              during effect   RD         0        0        0         0
                              WR       559        0      921      1480
              after effect    RD     31455        0       87     31542
                              WR         0        0        0         0
not detected  no effect               4269     4276     1073      9618
              with effect                0        0        0         0
overall                              55534    55544     6332    117410

No change of memory contents in case of error
Erroneous read access is uncritical
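The figures of this experiment can be cross-checked with a few lines of code. The numbers are copied from the results table; the grouping into "faults with effect" and the derived coverage value are one plausible reading of that table, not a metric stated on the slide.

```python
COLUMNS = ("master", "slave", "frame")

# (detection status, effect class) -> fault counts per injection location
RESULTS = {
    ("detected", "no effect"):        (204, 51170, 3517),
    ("detected", "before effect"):    (19047, 98, 734),
    ("detected", "during effect RD"): (0, 0, 0),
    ("detected", "during effect WR"): (559, 0, 921),
    ("detected", "after effect RD"):  (31455, 0, 87),
    ("detected", "after effect WR"):  (0, 0, 0),
    ("not detected", "no effect"):    (4269, 4276, 1073),
    ("not detected", "with effect"):  (0, 0, 0),
}

# Column sums should reproduce the "overall" row of the table.
column_totals = tuple(
    sum(counts[i] for counts in RESULTS.values()) for i in range(len(COLUMNS))
)
total_faults = sum(column_totals)

# Faults that changed the behaviour at all, and how many of those were caught.
with_effect = sum(
    sum(counts) for (_, effect), counts in RESULTS.items() if effect != "no effect"
)
detected_with_effect = sum(
    sum(counts)
    for (status, effect), counts in RESULTS.items()
    if status == "detected" and effect != "no effect"
)
coverage = detected_with_effect / with_effect
```

Under this reading every injected fault that had an effect was detected, which matches the zero entries in the "not detected / with effect" row.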


Enabling fast Recovery

error signal (dual rail)
– notifies external component / memory
– turns any further WR into RD (error confinement)
– triggers processor interrupt

status register (memory mapped)
– updated by HW
– indicates source of error (data parity, address mismatch, …)

recovery
– can build on uncorrupted status
– can benefit from detailed status information
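The error confinement idea (every write after an error is turned into a read) can be modeled in software as follows. The class `ConfinedMemory` and its interface are hypothetical; in the design described here this behavior is implemented in hardware.

```python
from typing import List, Optional


class ConfinedMemory:
    """Toy memory model: once the error signal is raised, writes become reads."""

    def __init__(self, size: int) -> None:
        self.cells: List[int] = [0] * size
        self.error: bool = False
        self.status: Optional[str] = None  # stands in for the status register

    def raise_error(self, source: str) -> None:
        """Latch the error and record its source (first error wins)."""
        if not self.error:
            self.error = True
            self.status = source

    def write(self, addr: int, value: int) -> int:
        """Write `value`, unless an error is pending; returns the cell content."""
        if self.error:
            return self.cells[addr]  # WR turned into RD: memory stays untouched
        self.cells[addr] = value
        return value


mem = ConfinedMemory(4)
mem.write(1, 42)
mem.raise_error("address mismatch")
mem.write(1, 99)  # suppressed, cell 1 keeps its last good value
```

Because memory content and status information stay uncorrupted, a recovery routine can start from a known-good state.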


Why is fast Recovery important?

application specific fault-tolerance time
– application can “survive” without computer even in fail-operational case
– typ. some 10 ms for car (recall: 100 km/h = 28 cm/10 ms)

meaning of fast recovery
– if failed computer recovers within FT time, no need for hot standby => COST!
– re-booting after failure is
  - pragmatic
  - safe
  - expensive!


Fail Safe Dual Core – Summary 1

duplicate & compare
– generic approach, applicable to any core type
– covers all (local) errors
– need to carefully eliminate single points of failure
– need to complement with protection for signals & buses

temporal diversity
– mitigates (many) common cause failures
– requires output delay to ensure error confinement


Possible Sources of CCFs

Design & process
– design fault or (latent) process deficiency

Thermal coupling
– hot spot affects both replicas in the same way

Mechanical defect
– affects both replicas symmetrically

Electrical coupling
– wire bound (shared lines: VDD, reset, clock)
– wireless (EMI)


Why use Single Die then?

cheaper and faster
– use two instances of same design
– fast & comprehensive comparison

CCFs on single die
– intuitively higher threat
– quantification of threat?
– mitigation techniques?


The Actual Problem with CCFs

One fault event affects both replicas

AND

is not detected by comparator
i.e. leads to “symmetric” fault effect

AND

produces an erroneous output
i.e. does not crash the cores


Possible Countermeasures for CCFs

Design & process

diversity, burn-in, fault avoidance

asymmetric propagation paths

asymmetric critical paths

asymmetric antennas (?)


Propagation Speed Comparison

Thermal & mechanical propagation are relatively slow

10000s of clock cycles within 1 ms
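As a rough plausibility check of the cycle count, the number of clock cycles elapsing in an interval is the clock frequency times the interval. The 50 MHz clock used below is purely an assumed value for illustration; the slides do not state the clock frequency.

```latex
\[
N = f_{\mathrm{clk}} \cdot \Delta t
  = 50 \cdot 10^{6}\ \mathrm{s^{-1}} \cdot 1 \cdot 10^{-3}\ \mathrm{s}
  = 5 \cdot 10^{4}\ \text{clock cycles}
\]
```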


Experimental Assessment

Evaluation Experiments
1) single corresponding points with offset t
2) multiple corresp. points with offset t
3) single non-corresp. points, no offset

[Figure: fault injection points in Core 1 and Core 2 for each of the three experiments]

[Figure: test setup with Master, Checker, compare unit and golden node; the signals Data, Addr, Iaddr and We are compared against the golden node to answer: erroneous write access?]


Symmetry Requirements for CCF

even a small offset…

fault multiplicity …

asymmetry of impact …

…improve detection coverage

[Figure: results per core module; labels: RF (7028), ExVecTab (8202), ALU (2472), PSW (308), DEC (152), P2 (158), PC+P1 (182)]


Squeezing out more Efficiency

dual core is expensive
– normally yields performance improvement
– would be welcome here as well: increasing performance demand @ limited clock rates
– but: exclusively dedicated to safety here

observation: not all tasks are safety critical

enable flexible switching between “safety mode” and “performance mode”


Operation in Performance Mode

– cores execute different instruction streams in parallel
– both cores have direct access to memory / peripherals
– instruction caches introduced to minimize penalties from conflicting access
– temporal diversity disabled
– comparator disabled


Requirements on the Mode Switching

coherent operation in safety mode
– internal states of cores must be aligned before switching to safety mode (register file, cache)

safe operation in safety mode
– switching must not introduce safety leakage
– no corruption of safety-relevant data in perform. mode

low performance penalty for mode switching
– slow or complicated switching would spoil the anticipated performance gain


Implementation of the Split Core Frame

[Figure: block diagram of the split core frame. Labeled blocks: Core 1 and Core 2, each with an instruction cache and a Mode-Switch Detect unit; a Mode-Switch unit (signals: mode switch, clk, wait signal, interrupt); safe instruction memory (instruction RAM + control) and safe data memory (data RAM + control). Labeled buses: instruction address, instruction, data address, data out, data in; address with parity, data with parity, address parity, instruction parity.]


Mode Switch: Safety => Performance

[Timing diagram: clk, clk_core2, core1 signal, core2 signal, wait1, wait2, message1, message2 and status safety mode during the transition from safety mode to performance mode]

Instruction sequence and its effect:

LDL r1, 248
LDH r1, 255          load ID reg address

mode switching       mode switch instr => core1 wait => core2 wait => clk align => switch mode

LDW r2, r1
BTEST r2, 1
JMPI_CT              load & check ID bit => cond branch core2


Mode Switch: Performance => Safety

[Timing diagram: clk, clk_core2, core1 signal, core2 signal, wait1, wait2, message1, message2 and status safety mode during the transition from performance mode to safety mode]

core1 encounters mode switch instr
=> trigger MSU (core1 signal)
=> halt core1 (wait1)
=> interrupt core2 (message2)

core2 encounters interrupt
=> save context
=> jump to mode switch instr

core2 executes mode switch
=> halt core2 & switch clock
=> resume core1
=> resume core2 after delay
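The hand-shake for switching from performance mode to safety mode can be written down as a small state model. The class `ModeSwitchUnit`, its method names and the log entries are assumptions made for illustration; they mirror the steps listed on the slide but are not taken from the actual hardware design.

```python
from typing import Dict, List


class ModeSwitchUnit:
    """Toy model of the performance => safety hand-shake between two cores."""

    def __init__(self) -> None:
        self.mode: str = "performance"
        self.halted: Dict[str, bool] = {"core1": False, "core2": False}
        self.log: List[str] = []

    def core1_mode_switch_instr(self) -> None:
        """core1 reaches the mode switch instruction."""
        self.log.append("core1 signal")  # trigger the mode switch unit
        self.halted["core1"] = True      # wait1: core1 is halted
        self.log.append("message2")      # core2 is interrupted

    def core2_mode_switch_instr(self) -> None:
        """core2 has saved its context and reaches the mode switch instruction."""
        if not self.halted["core1"]:
            raise RuntimeError("core1 has not requested the switch yet")
        self.halted["core2"] = True      # wait2: halt core2, switch its clock
        self.mode = "safety"
        self.halted["core1"] = False     # core1 resumes first
        self.log.append("resume core1")
        self.halted["core2"] = False     # core2 resumes after the delay
        self.log.append("resume core2 (delayed)")


msu = ModeSwitchUnit()
msu.core1_mode_switch_instr()
core1_waiting = msu.halted["core1"]
msu.core2_mode_switch_instr()
```

The check in `core2_mode_switch_instr` reflects that core1 initiates the switch in this sequence; the resume order (core1 first, core2 after a delay) corresponds to the temporal offset between master and checker in safety mode.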


Fault Injection in Safety Mode

                                 master    slave    frame   overall
detected      no effect            1029    56962     5334     63325
              before effect        5026        0     1324      6350
              within 1.5 cy       50956        0      569     51525
              later                   0        0        0         0
not detected  no effect            7055     7102     4275     18432
              with effect             0        0        0         0
overall                           64066    64064    11502    139632

Delayed WR still ensures error confinement


Fault Injection in Performance Mode

fault injected in performance mode, then switch to safety mode

column groups: detection in perf mode / detection in safety mode
column labels (in source order): never, early, late, stuck, ≤1.5cy, >1.5cy

effect in perf only:     1149  423  25617  34583  458
effect in both modes:    --  --  --  0  0  0
effect in safety only:   --  --  --  9654  0  0
effect in none:          1473  47715  18560

No undetected effects / late detections in safety mode
Watchdog important to prevent hang-up in perf mode


We still need a “Safe Memory”

detect bit flips in storage cells
– parity (or EDC/ECC)

detect erroneous address decoding
– special decoder logic design

protect interfaces
– parity for data, address and control buses

prevent illegal WR access
– provide mask input for write enable

Why not duplicate & compare?


Possible Address Decoder Errors

correct behavior:
any given address activates exactly one assigned memory cell

erroneous behaviors:
– an address activates no memory cell at all
– an address activates more than one memory cell
– an address activates a wrong memory cell
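The three error classes can be expressed as a small reference model that inspects the select lines of a decoder. This is a classification aid under the assumption that the expected cell is known (as in a simulation or a fault injection experiment); it is not the checker circuit itself, and all names are hypothetical.

```python
from typing import List, Sequence


def one_hot_decoder(address: int, n_cells: int) -> List[int]:
    """Fault-free decoder: exactly the addressed select line is active."""
    return [1 if i == address else 0 for i in range(n_cells)]


def classify_decoder(select_lines: Sequence[int], expected_cell: int) -> str:
    """Classify the decoder output against the expected one-hot pattern."""
    active = [i for i, line in enumerate(select_lines) if line]
    if not active:
        return "no cell activated"
    if len(active) > 1:
        return "multiple cells activated"
    if active[0] != expected_cell:
        return "wrong cell activated"
    return "ok"


good = one_hot_decoder(5, 8)
nothing = [0] * 8
double = [1 if i in (2, 5) else 0 for i in range(8)]
wrong = one_hot_decoder(4, 8)
```

Missing and multiple activations can be recognized from the select lines alone, whereas a wrong single activation is only visible with additional information about the intended address.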


Checking the Address Decoder

large decoders built from cascade of smaller ones

[Figure: address decoder built from AND gates (&) on address bits A0, A1, A2 and AP, driving the memory cell array; XOR gates and two dual-rail checkers monitor the decoder; signal label: pe]

re-check parity behind cell array: OR over even cells parity ?

check for missing or multiple cell activations: XOR(upper half) XOR(lower half) ?


Summary

the automotive domain has its own laws and rules
– need “extremely cost-effective robust solutions for safety-critical real-time applications, versatile and custom tailored”

on node level
– different redundancy concepts applicable
– example: dual core CPU and memory with protection mech’s
– on-line testing for memory may be required

on system level
– crucial role of communication infrastructure
– advantages of time triggered approach
– insufficient suitability of structural testing


Hungry for more?

http://ti.tuwien.ac.at/ecs

[email protected]


Related Projects

STEACS (Systematic Test of Embedded Automotive Communication Systems)
http://embsys.technikum-wien.at/projects/steacs/index.html

EXTRACT (Exploiting Synchrony for Transparent Communication Services Testing)
http://ti.tuwien.ac.at/ecs/research/projects/extract

DARTS (Distributed Algorithms for Robust Tick Synchronization)
http://ti.tuwien.ac.at/ecs/research/projects/DARTS