Fault Tolerance in Embedded Systems

Daniel [email protected]

site.uottawa.ca/~dshap092

Fault Tolerance

• This presentation is based upon [1]

• Focus is on the basics as applied to embedded systems with processors

• This presentation does not rely on Wikipedia.

• See Byzantine fault tolerance on wiki

Overview

1. Trends Problems
2. Fault Tolerance Definitions
3. Fault Hiding
4. Fault Avoidance
5. Error Models
6. # Simultaneous Errors
7. Fault Tolerance Metrics
8. Error Detection
9. Error Recovery
10. Fault Diagnosis
11. Self-Repair

Trends Problems

• Fault Tolerance

• Goal = safety + liveness

• Safe: keep faults from hurting the user, even in failure

• Live: performs the desired task

• Better to fail than to do harm

Cosmic rays and alpha particles

Trends Problems

• More devices per processor means more units can fail
  – Think CISC vs. RISC

• More complex designs mean more failure cases exist
  – Think AVX vs. MMX

• Cache faults and, more generally, memory faults
  – Recharging DRAM is “easier” than reloading a destroyed cache line

Fault Tolerance Definitions

• Fault
  – Physical faults
  – Software faults

• May manifest as an error

• A masked fault does not show up as an error

• Errors may also be masked

• Otherwise the error results in a failure

• Logical mask – the erroneous bit is ANDed with 0

• Architectural mask – the error lands in the destination register field of a NOP

• Application mask – a silent fault, like writing garbage to an unused address, produces no failure
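A minimal sketch of logical masking, assuming an 8-bit operand and a fixed AND mask. The names `flip_bit`, `MASK`, and `DATA` are made up for illustration. A flipped bit that is ANDed with 0 never reaches the output, while a flipped bit that is ANDed with 1 becomes an error.

```python
def flip_bit(value, bit):
    """Model a single-bit fault by inverting one bit of value."""
    return value ^ (1 << bit)

MASK = 0b00001111      # upper nibble of the AND input is 0
DATA = 0b10101010      # fault-free operand (illustrative value)
GOLDEN = DATA & MASK   # fault-free result

# Fault in bit 6: that bit is ANDed with 0, so the fault is logically masked.
masked_result = flip_bit(DATA, 6) & MASK

# Fault in bit 1: that bit is ANDed with 1, so the fault becomes an error.
visible_result = flip_bit(DATA, 1) & MASK

print("bit 6 fault masked: ", masked_result == GOLDEN)
print("bit 1 fault masked: ", visible_result == GOLDEN)
```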

Fault Hiding

• Some faults are already recovered automatically: branch misprediction recovery also repairs a faulty prediction

• Dangerous cases are the faults that are NOT masked

• Goal: mask all faults
  – E.g. HDD faults are common but hidden

• Transient fault – signal glitch

• Permanent fault – wire burns

• Intermittent fault – cold soldered wire

• Fault tolerance scheme – design a system for masking the expected fault type (transient/permanent/intermittent)

Fault Avoidance

• Fault avoidance is just as good as fault tolerance

• Error detection and correction is the alternative

• Permanent faults
  – Physical wear-out
  – Fabrication defects
  – Design bugs

Error Models

• We only care about errors, since masked faults are innocuous

• Error models
  – For improving fault tolerance
  – E.g. the stuck-at-0/1 model tells us that there is a potential error
  – Many, many stuck-at-0 errors can mean that there is NO PROBLEM
  – Reduces the need to evaluate all sources of error. Design space size ↓↓

• 3 main error model parameters

• Type of error – bridging/coupling error (e.g. short, cross-talk), stuck-at error, fail-stop error, delay error

• Error duration – transient, intermittent, permanent

• # simultaneous errors – errors are rare, how many wars can you fight at once?

# Simultaneous Errors

• Maybe 1 error hides another error

• E.g. a 2-bit flip gets past a parity checker

• Reasons for resolving:
  – Mission critical
  – High error rate
  – Latent errors (undetected and lingering) may overlap with other errors. Think about an incorrectly stored word: the error occurs upon the NEXT read of the word

• Better to detect the first error AND to have double error correction since the error rate trends are against us.
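The parity example can be sketched as follows, assuming a single even-parity bit stored next to an 8-bit word. The helper names `parity`, `store`, and `check` are illustrative. One flipped bit changes the parity and is caught; two flipped bits cancel out and slip through.

```python
def parity(word):
    """Even-parity bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") & 1

def store(word):
    """Return the word together with its parity bit."""
    return word, parity(word)

def check(word, stored_parity):
    """True if the word still matches its stored parity bit."""
    return parity(word) == stored_parity

word, p = store(0b10110010)

one_flip = word ^ 0b00000100    # single-bit error
two_flips = word ^ 0b00010100   # two simultaneous bit errors

print("1 flip detected: ", not check(one_flip, p))
print("2 flips detected:", not check(two_flips, p))
```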

Fault Tolerance Metrics

• Availability
  – 99.999% = five nines of availability

• Reliability
  – P(still no failure at time t)
  – Most errors are not failures

• Mean != probability

• Variance (2 and 20 vs. 11 and 12)

• MTTF – Mean Time To Failure

• MTTR – Mean Time To Repair

• MTBF = MTTF + MTTR
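A small worked example of these metrics. The slides give MTBF = MTTF + MTTR; the steady-state relation availability = MTTF / MTBF is the standard textbook definition and is an assumption here, since the slides do not state it. The function names and numbers are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def mtbf(mttf_hours, mttr_hours):
    """Mean time between failures, as defined on the slide."""
    return mttf_hours + mttr_hours

def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up (standard steady-state relation)."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)

def downtime_minutes_per_year(avail):
    """Expected downtime per (non-leap) year for a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR

# Hypothetical system: fails every 999 hours on average, 1 hour to repair.
print(availability(999.0, 1.0))
# Five nines leaves only a few minutes of downtime per year.
print(downtime_minutes_per_year(0.99999))
```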

Fault Tolerance Metrics

• Failures in Time (FIT)
  – Rate
  – # failures / 1 billion hours
  – Additive
  – ∝ 1/MTTF
  – Arbitrary
  – Raw rate includes masked failures
  – Effective rate excludes masked failures

• Effective FIT = FIT × AVF
  – Helps locate transient error vulnerability
  – Shown to be a good lower bound on reliability

• Architectural Vulnerability Factor (AVF)
  – Architecturally Correct Execution = ACE state
  – Otherwise = un-ACE state
  – E.g. PC state = ACE; branch predictor = un-ACE
  – Fraction of time in ACE state

• Component AVF = avg # ACE bits per cycle / # state bits

• If many ACE bits reside in a structure for a long time, that structure is highly vulnerable. Large AVF
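The FIT and AVF definitions above can be turned into a short calculation. This is a sketch with invented numbers: the structure size, the per-cycle ACE bit counts, and the raw FIT values are all assumptions for illustration.

```python
HOURS_PER_FIT_WINDOW = 1e9   # FIT counts failures per 1 billion hours

def component_avf(ace_bits_per_cycle, state_bits):
    """Average number of ACE bits per cycle divided by the number of state bits."""
    avg_ace = sum(ace_bits_per_cycle) / len(ace_bits_per_cycle)
    return avg_ace / state_bits

def effective_fit(raw_fit, avf):
    """Effective FIT = raw FIT * AVF (masked failures excluded)."""
    return raw_fit * avf

def mttf_hours(fit):
    """FIT is proportional to 1/MTTF: MTTF in hours = 1e9 / FIT."""
    return HOURS_PER_FIT_WINDOW / fit

# Hypothetical 64-bit structure observed over four cycles.
avf = component_avf([16, 32, 48, 32], state_bits=64)
eff = effective_fit(raw_fit=100.0, avf=avf)

# FIT is additive, so a system rate is the sum of its component rates.
system_fit = eff + effective_fit(raw_fit=20.0, avf=1.0)

print(avf, eff, system_fit, mttf_hours(eff))
```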

Error Detection

• Helps to provide safety

• Without redundancy we cannot detect errors

• What kind of redundancy do we need?

• Redundancy
  – Physical (majority gate = TMR, dual modular redundancy = DMR, NMR where N is odd and > 3)
  – Temporal (run twice & compare results)
  – Information (extra bits like parity)

• Boeing 777 uses “triple-triple” modular redundancy, 2 levels of triple voting, where each vote is from a different architecture
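A minimal sketch of physical redundancy in software terms, assuming each replica returns an integer result. `dmr_compare` and `tmr_vote` are illustrative names. DMR can only flag a mismatch; the TMR majority gate also produces the correct value when one replica is wrong.

```python
def dmr_compare(a, b):
    """Dual modular redundancy: return (value, error_detected).

    A mismatch is detected, but there is no way to tell which copy is right.
    """
    return a, a != b

def tmr_vote(a, b, c):
    """Triple modular redundancy: bitwise majority gate over three replicas."""
    return (a & b) | (a & c) | (b & c)

good = 0b1010

# One replica has a flipped bit.
print(dmr_compare(good, 0b0010))          # error flagged, not corrected
print(bin(tmr_vote(good, good, 0b0010)))  # majority restores the good value

# Two replicas are corrupted in different bit positions; the bitwise
# voter still recovers every bit because each bit has two good copies.
print(bin(tmr_vote(good, 0b1011, 0b0010)))
```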

DMR

Error Detection

• Physical Redundancy

• Heterogeneous hardware units can provide physical redundancy
  – E.g. watchdog timer
  – E.g. Boeing 777: different architectures running the same program and then voting on the results
  – Design diversity

• Unit replication
  – Gate level
  – Register level
  – Core level

• Wastes lots of area & power

• NMR impractical for PCs

• False error reporting becomes more likely

• Using different hardware for the voters avoids the possibility of design bugs

Error Detection

Temporal Redundancy

• Twice the active power but not twice the area

• Can find transient but not permanent errors

• Smart pipelining can have the votes arrive 1 cycle apart, but wastes pipeline slots

Information Redundancy

• Error-Detecting Code (EDC)

• Words mapped to code words, like checksums and CRC

• Hamming Distance (HD)

• Single-Error Correcting (SEC), Double-Error Detecting (DED) code with HD of 4
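One way to get a SEC-DED code with HD of 4 is an extended Hamming(8,4) code: a Hamming(7,4) code plus one overall parity bit. The sketch below assumes 4-bit data words and a list-of-bits representation chosen for readability; the function names and bit layout are illustrative, not taken from [1].

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming code word.

    c[1], c[2], c[4] are Hamming parity bits, c[3], c[5], c[6], c[7] hold
    the data, and c[0] is the overall parity bit.
    """
    c = [0] * 8
    c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in range(4)]
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) & 1
    return c

def decode(code_word):
    """Return (nibble, status); nibble is None for an uncorrectable word."""
    c = list(code_word)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos          # XOR of the positions of all 1 bits
    overall = sum(c) & 1             # 0 for a valid code word
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        c[syndrome] ^= 1             # single error; syndrome 0 means c[0]
        status = "corrected"
    else:
        return None, "double error detected"
    nibble = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return nibble, status

word = encode(0b1011)
word[5] ^= 1                         # inject one bit flip
print(decode(word))
word[2] ^= 1                         # inject a second bit flip
print(decode(word))
```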

Error Detection

• For an ALU we can compare the bit count of the inputs and outputs, but this is not common

• Many other techniques exist, like BIST or calculating a known quantity and comparing it to a ROM with the answer in it.

• REcomputing with Shifted Operands (RESO) finds permanent errors.

• Redundant multithreading: use empty slots to run redundant threads

• Checking invariant conditions

• Anomaly detection like behavioural antivirus (look at data and/or traces)

• Error Detection by Duplicated Instructions (EDDI) – let software look into the hardware using randomly inserted dummy code

• Way way more stuff about caches, CAMs, consistency, and more.
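A rough sketch of why RESO catches a permanent error that plain run-twice temporal redundancy misses. The faulty adder is a toy model with one result bit stuck at 0; `STUCK_BIT`, `faulty_add`, `run_twice`, and `reso_add` are hypothetical names. Shifting the operands moves the computation onto different bit positions, so the stuck bit corrupts the two runs differently.

```python
STUCK_BIT = 2   # assumed permanent fault: result bit 2 is stuck at 0

def faulty_add(a, b):
    """Toy adder whose output bit STUCK_BIT always reads 0."""
    return (a + b) & ~(1 << STUCK_BIT)

def run_twice(a, b):
    """Plain temporal redundancy: same operands, same unit, compare."""
    r1 = faulty_add(a, b)
    r2 = faulty_add(a, b)
    return r1, r1 != r2              # (result, error_detected)

def reso_add(a, b):
    """RESO: second run uses operands shifted left by 1, result shifted back."""
    r1 = faulty_add(a, b)
    r2 = faulty_add(a << 1, b << 1) >> 1
    return r1, r1 != r2              # (result, error_detected)

# 3 + 1 = 4 = 0b100, so the stuck bit corrupts the unshifted result.
print(run_twice(3, 1))   # both runs agree on the same wrong answer
print(reso_add(3, 1))    # the shifted run disagrees, exposing the fault
```

If neither run exercises the stuck bit with a 1, the fault is masked and both results agree on the correct sum.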

Error Recovery

• Safety from detection, but what about liveness?

• Forward Error Recovery
  – FER
  – Once detected, the error is seamlessly corrected

• FER implemented using physical, information, or temporal redundancy

• More HW needed to correct than detect
  – E.g. DMR can detect but TMR or triple-triple can correct (spatial)

• HD=k (information redundancy)
  – Detects up to k-1 bit errors
  – Corrects up to ⌊(k-1)/2⌋ bit errors
  – (HD, detect, correct)
    • (5,4,2)

• TMR by repetition (temporal)
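The (HD, detect, correct) rule can be checked with a few lines. This sketch assumes code words are given as integers; the 5-bit repetition code is chosen only because its Hamming distance is 5, which reproduces the (5,4,2) example.

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of bit positions in which two code words differ."""
    return bin(a ^ b).count("1")

def code_distance(code_words):
    """Minimum Hamming distance over all pairs of code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code_words, 2))

def capability(hd):
    """Return (HD, detectable bit errors, correctable bit errors)."""
    return hd, hd - 1, (hd - 1) // 2

repetition_code = [0b00000, 0b11111]
print(capability(code_distance(repetition_code)))
print(capability(4))   # a SEC-DED code with HD of 4
```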

Error Recovery

• Backward Error Recovery
  – BER
  – Rollback / safe point
  – Restore point
  – Recovery line for multicore (cool!)
  – How do we model communication in MP w/ caches??
  – Just log everything? Nope, save it distributed and in the caches. Possibly use software.
  – Way more crazy algorithm selection magic….

• The Output Commit Problem
  – Sphere of recoverability
  – Don’t let bad data out
  – Wait for error detection hardware to complete
  – Latency is usually hidden
  – Processor state is difficult to store/restore
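A software-level sketch of backward error recovery with output commit, assuming the whole recoverable state fits in one Python dict. The class `RecoverableTask` and its methods are hypothetical. Output is buffered inside the sphere of recoverability and only released at a commit, so a rollback can discard it.

```python
import copy

class RecoverableTask:
    """Keeps one checkpoint and holds output back until it is committed."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)
        self._pending_output = []      # still inside the sphere of recoverability
        self.committed_output = []     # visible to the outside world

    def emit(self, value):
        """Buffer output instead of releasing it immediately."""
        self._pending_output.append(value)

    def commit(self):
        """Error detection passed: take a new checkpoint and release output."""
        self._checkpoint = copy.deepcopy(self.state)
        self.committed_output.extend(self._pending_output)
        self._pending_output.clear()

    def rollback(self):
        """Error detected: restore the checkpoint and drop unreleased output."""
        self.state = copy.deepcopy(self._checkpoint)
        self._pending_output.clear()

task = RecoverableTask({"acc": 0})

task.state["acc"] += 5
task.emit(5)
task.commit()                # checks passed, output 5 leaves the sphere

task.state["acc"] = 999      # pretend an error corrupted the state
task.emit(999)
task.rollback()              # error detected before commit

print(task.state, task.committed_output)
```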

Error Recovery

FER when DRAM module fails – RAID-M/chipkill

Fault Diagnosis

• Diagnosis hardware
  – FER and BER do not solve livelock
  – E.g. mult fails, recover, mult again... livelock

• Idea: be smart, figure out what components are toast

• BIST
  – Compare boundary scan data or stored tests to a ROM with the right answers

• Run BIST at fixed intervals or at end of context switch

• Commit changes if error free, otherwise restore

• Try to test all components in system, ideally all gates in the system

• MPs/NoC typically have dedicated diagnosis hardware
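The stored-test flavour of BIST can be sketched as a table of test vectors with known answers, standing in for the ROM. The vectors, the `bist` helper, and the broken multiplier model (result bit 3 stuck at 0) are assumptions for illustration.

```python
# Stand-in for the ROM: (operands, expected product) pairs.
GOLDEN_MULT_TESTS = [
    ((0, 0), 0),
    ((1, 7), 7),
    ((3, 5), 15),
    ((12, 12), 144),
    ((255, 255), 65025),
]

def bist(unit, golden_tests):
    """Return True if the unit reproduces every stored answer."""
    return all(unit(*operands) == expected for operands, expected in golden_tests)

def healthy_mult(a, b):
    return a * b

def broken_mult(a, b):
    """Toy permanent fault: result bit 3 is stuck at 0."""
    return (a * b) & ~0b1000

print("healthy unit passes:", bist(healthy_mult, GOLDEN_MULT_TESTS))
print("broken unit passes: ", bist(broken_mult, GOLDEN_MULT_TESTS))
```

A failing BIST marks the unit as dead, which is what breaks the recover-and-retry livelock.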

Self-Repair

• BIST can tell you what broke, but not how to fix it.

• i7 can respond to errors on the on-chip busses at runtime. Partial bus shorts do not kill the system. Data is transferred like a packet (NoC)

– Because of all the prediction, lanes, and issue logic, superscalar has much more redundancy than RISC

– For RISC just steal a core from the grid and mark the old core dead

– CISC has some very crazy metrics for triggering self-repair

• Remember the infinite loop mult we diagnosed?

• Alternative: notice that mult is dead, use shift-add (Booth)

• Another cool idea: if shift breaks, use the mult with power-of-2 inputs (hot spare)

• A cold spare would be a fully dedicated redundant unit

– CellBE only uses 7 cores and has an 8th cold spare SPE! So cool!
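The hot-spare idea can be sketched in both directions. The code below assumes unsigned operands and uses plain shift-and-add rather than Booth recoding, which is a simplification. `POWERS_OF_TWO`, `shift_add_mult`, `shift_left_via_mult`, and `multiply` are illustrative names, and the `mult_ok` flag stands in for a diagnosis result.

```python
def shift_add_mult(a, b):
    """Multiply unsigned integers using only shift and add (no multiplier)."""
    result = 0
    while b:
        if b & 1:
            result += a
        a <<= 1
        b >>= 1
    return result

# Power-of-2 table built by repeated doubling, so no shifter is needed.
POWERS_OF_TWO = [1]
for _ in range(15):
    POWERS_OF_TWO.append(POWERS_OF_TWO[-1] + POWERS_OF_TWO[-1])

def shift_left_via_mult(x, n):
    """Left shift by n bits using the multiplier with a power-of-2 input."""
    return x * POWERS_OF_TWO[n]

def multiply(a, b, mult_ok):
    """Use the multiplier when it is healthy, else fall back to the hot spare."""
    return a * b if mult_ok else shift_add_mult(a, b)

print(multiply(13, 11, mult_ok=True))
print(multiply(13, 11, mult_ok=False))
print(shift_left_via_mult(5, 3))
```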

Conclusions

• Things are getting a bit crazy in error detection and correction

• Multicore and caches complicated everything

• Although up until now this fault stuff was known, it is only now entering the PC market because the error rate is increasing with process technology

• As in the Byzantine generals problem, we start to worry about whom to trust in the running but broken chip

• Voting works best for transient errors. For permanent errors too, but land the plane or you will end up crashing.

• You can prove that it is easier to detect a problem than fix it.

References

[1] Daniel J. Sorin, “Fault Tolerant Computer Architecture,” Synthesis Lectures on Computer Architecture, 2010.

Questions?
