Fault Tolerance in Embedded Systems
Daniel Shapiro
[email protected]
http://site.uottawa.ca/~dshap092
Fault Tolerance
• This presentation is based upon [1]
• Focus is on the basics as applied to embedded systems with processors
• This presentation does not rely on Wikipedia
• See Byzantine fault tolerance on wiki
Overview
1. Trends Problems
2. Fault Tolerance Definitions
3. Fault Hiding
4. Fault Avoidance
5. Error Models
6. # Simultaneous Errors
7. Fault Tolerance Metrics
8. Error Detection
9. Error Recovery
10. Fault Diagnosis
11. Self-Recovery
Trends Problems
• Fault Tolerance
• Goal = safety + liveness
• Safe: keep faults from hurting the user, even in failure
• Live: performs the desired task
• Better to fail than to do harm
Cosmic rays and alpha particles
Trends Problems
• More devices/processor means more units can fail
– Think CISC vs. RISC
• More complex designs mean more failure cases exist
– Think AVX vs. MMX
• Cache faults and, more generally, memory faults
– Recharging DRAM is “easier” than reloading a destroyed cache line
Fault Tolerance Definitions
• Fault
– Physical faults
– Software faults
• A fault may manifest as an error
• A masked fault does not show up as an error
• Errors may also be masked
• Otherwise the error results in a failure
• Logical mask – the error bit is ANDed with 0
• Architectural mask – the error hits the destination register of a NOP
• Application mask – silent fault, like writing garbage to an unused address; produces no failure
Fault Hiding
• Some faults are already recovered automatically: branch prediction can recover from faulty branches
• Dangerous cases are the faults that are NOT masked
• Goal: mask all faults
– E.g. HDD faults are common but hidden
• Transient fault – signal glitch
• Permanent fault – wire burns
• Intermittent fault – cold soldered wire
• Fault tolerance scheme – design a system for masking the expected fault type (transient/permanent/intermittent)
Fault Avoidance
• Fault avoidance is just as good as fault tolerance
• Error detection and correction is the alternative
• Permanent faults
– Physical wear-out
– Fabrication defects
– Design bugs
Error Models
• We only care about errors, since masked faults are innocuous
• Error models
– For improving fault tolerance
– E.g. the stuck-at-0/1 model tells us that there is a potential error
– Many many stuck-at-0 errors can mean that there is NO PROBLEM
– Reduces the need to evaluate all sources of error. Design space size ↓↓
• 3 main error model parameters
• Type of error – bridging/coupling error (e.g. short, cross-talk), stuck-at error, fail-stop error, delay error
• Error duration – transient, intermittent, permanent
• # simultaneous errors – errors are rare, how many wars can you fight at once?
# Simultaneous Errors
• Maybe 1 error hides another error
• E.g. a 2-bit flip fools a parity checker
• Reasons for resolving:
– Mission critical
– High error rate
– Latent errors (undetected and lingering) may overlap with other errors. Think about an incorrectly stored word: the error occurs upon the NEXT read of the word
• Better to detect the first error AND to have double error correction, since the error rate trends are against us.
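The parity example above can be sketched in a few lines. This is a minimal illustration, assuming a single even-parity bit over an 8-bit word; the names `even_parity` and `parity_check` and the data value are made up for the example.

```python
def even_parity(word: int) -> int:
    """Return the even-parity bit (0 or 1) of a non-negative integer."""
    return bin(word).count("1") % 2


def parity_check(word: int, stored_parity: int) -> bool:
    """True when the word still matches the parity bit stored with it."""
    return even_parity(word) == stored_parity


data = 0b1011_0010
parity = even_parity(data)

one_flip = data ^ 0b0000_0100    # one bit in error
two_flips = data ^ 0b0001_0100   # two simultaneous bit errors

single_detected = not parity_check(one_flip, parity)
double_detected = not parity_check(two_flips, parity)
print(single_detected, double_detected)
```

The single flip changes the parity and is caught; the second flip restores the parity, so the two errors hide each other.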
Fault Tolerance Metrics
• Availability
– 99.999% = five nines of availability
• Reliability
– P(no failure up to time t)
– Most errors are not failures
• Mean != probability
• Variance (2 and 20 vs. 11 and 12)
• MTTF – Mean Time To Failure
• MTTR – Mean Time To Repair
• MTBF = MTTF + MTTR
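A small worked example may help connect these metrics. It assumes the usual steady-state relation availability = MTTF / (MTTF + MTTR), which the slide does not state explicitly; the function names and the sample numbers are illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """MTBF = MTTF + MTTR, as on the slide."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability (assumed relation): uptime / total time."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected minutes of downtime per year for a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


print(availability(999.0, 1.0))            # hypothetical unit: 999 h up, 1 h repair
print(downtime_minutes_per_year(0.99999))  # five nines
```

Five nines leaves only a little over five minutes of downtime per year, which is why repair time matters as much as failure rate.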
Fault Tolerance Metrics
• Failures in Time (FIT)
– Rate
– # failures / 1 billion hours
– Additive
– ∝ 1/MTTF
– Arbitrary
– Raw rate includes masked failures
– Effective rate excludes masked failures
• Effective FIT = FIT × AVF
– Helps locate transient error vulnerability
– Shown to be a good lower bound on reliability
• Architectural Vulnerability Factor (AVF)
– Architecturally Correct Execution = ACE state
– Otherwise = un-ACE state
– E.g. PC state = ACE; branch pred = un-ACE
– Fraction of time in ACE state
• Component AVF = avg # ACE bits per cycle / # state bits
• If many ACE bits reside in a structure for a long time, that structure is highly vulnerable: large AVF
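The FIT and AVF formulas above can be tried out numerically. The sketch below is only an illustration: the component names, raw FIT values, and ACE-bit traces are invented, and the helper names are not from the source.

```python
from typing import Sequence

HOURS_PER_FIT_WINDOW = 1e9  # FIT counts failures per one billion hours


def fit_from_mttf(mttf_hours: float) -> float:
    """FIT is proportional to 1/MTTF: failures per 1e9 hours."""
    return HOURS_PER_FIT_WINDOW / mttf_hours


def component_avf(ace_bits_per_cycle: Sequence[int], state_bits: int) -> float:
    """AVF = average number of ACE bits per cycle / number of state bits."""
    average_ace = sum(ace_bits_per_cycle) / len(ace_bits_per_cycle)
    return average_ace / state_bits


def effective_fit(raw_fit: float, avf: float) -> float:
    """Effective FIT = raw FIT x AVF (masked faults are excluded)."""
    return raw_fit * avf


# Hypothetical components: (raw FIT, AVF)
components = {
    "register_file": (100.0, component_avf([16, 32, 48, 32], 64)),
    "branch_predictor": (50.0, 0.0),  # un-ACE state: errors never reach the result
}

# FIT rates are additive, so the chip-level effective rate is a plain sum.
chip_effective_fit = sum(effective_fit(fit, avf) for fit, avf in components.values())
print(chip_effective_fit)
```

The branch predictor contributes nothing to the effective rate even though its raw FIT is nonzero, which matches the ACE / un-ACE distinction on the slide.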
Error Detection
• Helps to provide safety
• Without redundancy we cannot detect errors
• What kind of redundancy do we need?
• Redundancy
– Physical (majority gate = TMR, dual modular redundancy = DMR, NMR where N is odd and > 3)
– Temporal (run twice & compare results)
– Information (extra bits like parity)
• Boeing 777 uses “triple-triple” modular redundancy: 2 levels of triple voting, where each vote is from a different architecture
DMR
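As a rough software model of physical redundancy, DMR is a comparison and TMR is a majority vote. The sketch below assumes the replicas simply hand back integer results; `dmr_agree`, `tmr_vote`, and `bitwise_majority` are names chosen for this example.

```python
from collections import Counter


def dmr_agree(a: int, b: int) -> bool:
    """DMR: two replicas can detect a mismatch but cannot tell who is right."""
    return a == b


def tmr_vote(a: int, b: int, c: int) -> int:
    """TMR: return the value at least two replicas agree on."""
    value, votes = Counter([a, b, c]).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: all three replicas disagree")
    return value


def bitwise_majority(a: int, b: int, c: int) -> int:
    """Gate-level view of the voter: majority of each bit position."""
    return (a & b) | (a & c) | (b & c)


print(dmr_agree(7, 9))    # mismatch detected, no way to correct
print(tmr_vote(7, 7, 9))  # the faulty replica is outvoted
print(bin(bitwise_majority(0b1100, 0b1010, 0b1001)))
```

Note that the voter itself is a single point of failure in this sketch; the slides address that with a second level of voting and design diversity.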
Error Detection
• Physical Redundancy
• Heterogeneous hardware units can provide physical redundancy
– E.g. Watchdog timer
– E.g. Boeing 777: different architectures running the same program and then voting on results
– Design Diversity
• Unit replication
– Gate level
– Register level
– Core level
• Wastes lots of area & power
• NMR impractical for PCs
• False error reporting becomes more likely
• Using different hardware for the voters avoids a common design bug affecting all of them
Error Detection
Temporal Redundancy
• Twice the active power but not twice the area
• Can find transient but not permanent errors
• Smart pipelining can have the votes arrive 1 cycle apart, but wastes pipeline slots
Information Redundancy
• Error-Detecting Code (EDC)
• Words mapped to code words like checksums and CRC
• Hamming Distance (HD)
• Single-Error Correcting (SEC) Double-Error Detecting (DED) with HD of 4
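An error-detecting code can be illustrated with a CRC appended to a payload. This sketch uses Python's standard `zlib.crc32`; the frame layout (payload followed by a 4-byte big-endian CRC) and the helper names are assumptions made for the example.

```python
import zlib


def protect(payload: bytes) -> bytes:
    """Map a word to a code word: payload plus 4 redundant CRC bytes."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the stored one."""
    payload, stored = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == stored


frame = protect(b"sensor=42")

corrupted = bytearray(frame)
corrupted[0] ^= 0x01          # flip a single payload bit

print(check(frame))           # clean frame passes
print(check(bytes(corrupted)))  # the bit flip is detected, but not corrected
```

A CRC like this only detects; correcting the error needs either a code with enough Hamming distance or a retry.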
Error Detection
Error Detection
• For an ALU we can compare the bit count of inputs and outputs, but this is not common
• Many other techniques exist, like BIST or calculating a known quantity and comparing to a ROM with the answer in it
• Re-Execution with Shifted Operands (RESO) finds permanent errors
• Redundant multithreading: use empty slots to run redundancy threads
• Checking invariant conditions
• Anomaly detection like behavioural antivirus (look at data and/or traces)
• Error Detection by Duplicated Instructions (EDDI) – let software check the hardware using duplicated instructions inserted by the compiler, whose results are compared
• Way way more stuff about caches, CAMs, consistency, and more.
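To see why RESO catches a permanent error that plain re-execution would miss, consider an adder with one output bit stuck at 0. The sketch below is a toy model under that assumed fault; `faulty_add` and `reso_add` are hypothetical names, and a real design would need an extra bit of datapath width for the shifted operands.

```python
from typing import Optional, Tuple


def faulty_add(a: int, b: int, stuck_bit: Optional[int] = None) -> int:
    """Adder model; if stuck_bit is given, that output bit is stuck at 0."""
    result = a + b
    if stuck_bit is not None:
        result &= ~(1 << stuck_bit)
    return result


def reso_add(a: int, b: int, stuck_bit: Optional[int] = None) -> Tuple[int, bool]:
    """Add twice on the same unit: once plainly, once with operands shifted left.

    The stuck bit lands on a different bit of the logical result in the
    second run, so the two results usually disagree when the unit is broken.
    """
    plain = faulty_add(a, b, stuck_bit)
    shifted = faulty_add(a << 1, b << 1, stuck_bit) >> 1
    return plain, plain == shifted


print(reso_add(5, 6))               # healthy adder: results agree
print(reso_add(5, 6, stuck_bit=1))  # stuck-at-0 on bit 1: mismatch flags the error
```

Running the same addition twice without the shift would give the same wrong answer both times. The check can still miss a fault when the affected bits happen to be 0 in both runs.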
Error Recovery
• Safety from detection, but what about liveness?
• Forward Error Recovery
– FER
– Once detected, the error is seamlessly corrected
• FER implemented using physical, information, or temporal redundancy
• More HW needed to correct than detect
– E.g. DMR can detect but TMR or triple-triple can correct (spatial)
• HD=k (information redundancy)
– k-1 bit errors detection
– ⌊(k-1)/2⌋ error correction
– (HD, detect, correct)
• E.g. (5,4,2)
• TMR by repetition (temporal)
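The (HD, detect, correct) rule can be checked on a small code. The sketch below computes the minimum Hamming distance of a set of code words and derives the two limits; the 5-bit repetition code is picked only because it reproduces the (5,4,2) example, and the function names are illustrative.

```python
from itertools import combinations
from typing import Sequence, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two code words differ."""
    return bin(a ^ b).count("1")


def min_distance(codewords: Sequence[int]) -> int:
    """Minimum Hamming distance over all pairs of code words."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def capability(hd: int) -> Tuple[int, int, int]:
    """(HD, detectable bit errors, correctable bit errors) for distance hd."""
    return hd, hd - 1, (hd - 1) // 2


repetition5 = [0b00000, 0b11111]   # each data bit repeated five times
print(capability(min_distance(repetition5)))
print(capability(4))               # SEC-DED distance from the earlier slide
```

The detect and correct figures are upper limits for using the code in one mode or the other; a decoder that corrects up to 2 errors cannot also flag every 4-bit error.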
Error Recovery
• Backwards Error Recovery
– BER
– Rollback / Safe point
– Restore point
– Recovery line for multicore (cool!)
– How do we model communication in MP w/ caches??
– Just log everything? Nope, save it distributed and in the caches. Possibly use software.
– Way more crazy algorithm selection magic….
• The Output Commit Problem
– Sphere of recoverability
– Don’t let bad data out
– Wait for error detection hardware to complete
– Latency is usually hidden
– Processor state is difficult to store/restore
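Rollback and the output commit problem can be shown together in a small software model. This is a sketch, assuming the whole recoverable state fits in one Python object and that an external error detector decides when to roll back; `RecoverableTask` and its methods are invented for the example.

```python
import copy


class RecoverableTask:
    """Toy backward error recovery: checkpoint, buffer outputs, roll back."""

    def __init__(self, state: dict):
        self.state = state
        self._checkpoint = copy.deepcopy(state)  # the restore point
        self._pending_output = []                # still inside the sphere of recoverability
        self.committed_output = []               # released to the outside world

    def emit(self, value) -> None:
        """Hold output back until error detection has had time to complete."""
        self._pending_output.append(value)

    def commit(self) -> None:
        """No error detected: take a new checkpoint and release the outputs."""
        self._checkpoint = copy.deepcopy(self.state)
        self.committed_output.extend(self._pending_output)
        self._pending_output.clear()

    def rollback(self) -> None:
        """Error detected: restore the safe point and drop unverified outputs."""
        self.state = copy.deepcopy(self._checkpoint)
        self._pending_output.clear()


task = RecoverableTask({"total": 0})
task.state["total"] += 5
task.emit(5)
task.commit()

task.state["total"] += 999   # pretend a fault corrupted this update
task.emit(999)
task.rollback()              # the (hypothetical) detector fired

print(task.state, task.committed_output)
```

The bad value 999 never leaves the task, because outputs are only released at commit time.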
Error Recovery
FER when DRAM module fails – RAID-M/chipkill
Fault Diagnosis
• Diagnosis hardware
– FER and BER do not solve livelock
– E.g. mult fails, recover, mult again... livelock
• Idea: be smart, figure out what components are toast
• BIST
– Compare boundary scan data or stored tests to a ROM with the right answers
• Run BIST at fixed intervals or at end of context switch
• Commit changes if error free, otherwise restore
• Try to test all components in system, ideally all gates in the system
• MPs/NoC typically have dedicated diagnosis hardware
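The BIST idea of comparing stored tests against a ROM of right answers can be modeled as a table lookup. The test vectors, the `GOLDEN_ROM` table, and the stuck-at fault in `broken_mult` are all made up for this sketch.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical ROM contents: operand pairs and their known-good products.
GOLDEN_ROM: Dict[Tuple[int, int], int] = {
    (3, 4): 12,
    (7, 0): 0,
    (255, 255): 65025,
    (1, 129): 129,
}


def good_mult(a: int, b: int) -> int:
    return a * b


def broken_mult(a: int, b: int) -> int:
    """Multiplier model with output bit 2 stuck at 0."""
    return (a * b) & ~0x4


def bist(unit: Callable[[int, int], int]) -> List[Tuple[int, int]]:
    """Run every stored test on the unit; return the operand pairs that failed."""
    return [ops for ops, expected in GOLDEN_ROM.items() if unit(*ops) != expected]


print(bist(good_mult))    # no failures: commit
print(bist(broken_mult))  # at least one failure: mark the multiplier as dead
```

Only one of the four vectors exposes the stuck bit, which is why the slide asks for tests that ideally cover every gate.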
Self-Repair
• BIST can tell you what broke, but not how to fix it.
• i7 can respond to errors on the on-chip busses at runtime. Partial bus shorts do not kill the system. Data is transferred like a packet (NoC)
– Because of all the prediction, lanes, and issue logic, superscalar has much more redundancy than RISC
– For RISC just steal a core from the grid and mark the old core dead
– CISC has some very crazy metrics for triggering self-repair
• Remember the infinite loop mult we diagnosed?
• Alternative: notice that mult is dead, use shift-add Booth
• Another cool idea: if shift breaks, use the mult with power-of-2 inputs (hot spare)
• A cold spare would be a fully dedicated redundant unit
– CellBE only uses 7 cores and has an 8th cold spare SPE! So cool!
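The multiplier fallback can be sketched as follows. Note the swap: this uses plain shift-and-add rather than Booth recoding, to keep the example short, and it handles non-negative operands only. The `Alu` class and its `mult_dead` flag are hypothetical stand-ins for whatever the diagnosis hardware would set.

```python
def shift_add_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers using only shifts and adds."""
    result = 0
    while b:
        if b & 1:        # add the shifted multiplicand for each set multiplier bit
            result += a
        a <<= 1
        b >>= 1
    return result


class Alu:
    """Toy ALU that reroutes multiplies once the multiplier is marked dead."""

    def __init__(self) -> None:
        self.mult_dead = False   # would be set by diagnosis, e.g. a failed BIST

    def multiply(self, a: int, b: int) -> int:
        if self.mult_dead:
            return shift_add_multiply(a, b)  # slower path on the shifter and adder
        return a * b                         # stands in for the hardware multiplier


alu = Alu()
alu.mult_dead = True
print(alu.multiply(6, 7))
print(shift_add_multiply(13, 11))
```

The fallback trades speed for liveness: the multiply still completes, so the recover-and-retry livelock from the diagnosis slide is avoided.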
Conclusions
• Things are getting a bit crazy in error detection and correction
• Multicore and caches complicated everything
• Although up until now this fault stuff was known, it is only now entering the PC market because the error rate is increasing with process technology
• Like the byzantine generals problem, we start to worry about who to trust in the running but broken chip
• Voting works best for transient errors. For permanent errors too, but land the plane or you will end up crashing.
• You can prove that it is easier to detect a problem than fix it.
References
[1] Daniel J. Sorin, “Fault Tolerant Computer Architecture (Synthesis Lectures on Computer Architecture),” 2010.
Questions?