2. Fault Tolerance
Reliable System Design 2010, by: Amir M. Rahmani
2 matlab1.ir
Fault - Error - Failure
Fault = physical defect or flaw occurring in some component (hardware or software)
Error = incorrect behavior caused by a fault
– manifestation of a fault
Failure = inability of the system to perform its specified service
Latent Fault = a fault which has not yet produced an error
Latent Error = an error which has not yet produced a failure
Fault - Error - Failure
Note: the presence of a fault does not ensure that an error will occur, e.g. a memory cell stuck-at-0 causes no error as long as a 0 is stored in it
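The stuck-at-0 note can be made concrete with a small sketch. The class `StuckAtZeroCell` and its methods are hypothetical names chosen for illustration; the point is only that the fault is always present, while the error appears only for some data.

```python
class StuckAtZeroCell:
    """One-bit memory cell whose output is permanently stuck at 0 (the fault)."""

    def __init__(self):
        self.intended = 0  # value the system meant to store

    def write(self, bit):
        self.intended = bit

    def read(self):
        return 0  # the fault: the cell always reads back 0

    def has_error(self):
        # An error exists only when the value read differs from the intended one.
        return self.read() != self.intended


cell = StuckAtZeroCell()
cell.write(0)
latent = not cell.has_error()  # fault present, no error yet: latent fault
cell.write(1)
active = cell.has_error()      # the same fault now manifests as an error
print(latent, active)
```

Whether the error then becomes a failure depends on whether the wrong bit ever reaches the service delivered by the system.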
Origin of Defects in Objects (HW/SW)
Good object wearing out with age
– Hardware (software can age too)
– Incorrect maintenance/operation
Good object, unforeseen hostile environment
– Environmental fault
Marginal object: occasionally fails in target environment
– Tight design/bad inputs
Implementation mistakes
Specification mistakes
Note: From top to bottom -> increasing human responsibility
Bathtub Curve
Three phases of system lifetime
– Infant mortality
– Normal lifetime
– Wear-out period
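One common way to sketch a bathtub-shaped failure rate is to add three Weibull hazard terms: a decreasing one (infant mortality), a constant one (normal lifetime) and an increasing one (wear-out). The slides do not give a model, so the functions and all parameter values below are assumptions chosen only to produce the shape.

```python
def weibull_hazard(t, shape, scale):
    """Failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_rate(t):
    # Illustrative parameters only (assumed, not taken from the slides).
    infant = weibull_hazard(t, shape=0.5, scale=100.0)     # decreasing rate
    constant = weibull_hazard(t, shape=1.0, scale=1000.0)  # constant rate
    wearout = weibull_hazard(t, shape=5.0, scale=2000.0)   # increasing rate
    return infant + constant + wearout


for t in (1, 10, 500, 1000, 2500, 3000):
    print(t, round(bathtub_rate(t), 5))
```

With these values the rate is high early on, flattens in mid-life and rises again late, which is the qualitative curve the three phases describe.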
Life-time of a Software system (1)
Life-time of a Software system (2)
Software failure rate during useful life depends on the following factors:
• 1 software process used to develop the design and code,
• 2 complexity of software,
• 3 size of software,
• 4 experience of the development team,
• 5 percentage of code reused from a previous stable project,
• 6 depth of testing at test/debug (I) phase.
Faults Characteristics
1- Cause
– Specification errors
  • very dangerous
  • generic fault
– Implementation errors
  • very hard to formally verify
– Random component faults
  • random, not manufacturing defects
– External disturbance
  • noise, EMP, vibration, radiation
  • much like random component faults
Faults Characteristics
2- Origin
– software or hardware
  • Physical device level (HW)
  • Logic level (HW)
  • Chip level (HW)
  • System level (HW/SW)
    – interfacing, specifications, …
– don’t care, except:
  • hardware can be analog
  • indeterminate voltage level
Faults Characteristics
3- Duration
– Permanent fault
  • occurs and doesn’t go away
  • easiest to diagnose
– Transient fault
  • occurs once and disappears
  • about 10 times as likely as a permanent fault
– Intermittent fault
  • occurs occasionally
  • may appear to be transient (if long period)
  • hard and expensive to detect
Faults Characteristics
4- Extent
– Global
  • a power supply fault
– Local
  • a memory fault
5- Value
– Determinate
  • memory stuck-at-0
– Indeterminate
  • a fault sensitive to data or time
What to do about Faults
Finding & identifying faults:
• Fault detection: is a fault there?
• Fault location: where?
• Fault diagnosis: which fault is it?
Automatic handling of faults:
• Fault containment: blocking error flow
  – Fault masking: fault has no effect
• Fault recovery: back to correct operation
System Response to faults
Error on output: may be acceptable in non-critical systems if it happens only rarely
Fault masking: output correct even when a fault from a specific class occurs
– Critical applications: air/space/manufacturing
Fault-secure: output correct or error indication
– Retryable: banking, telephony
Fail safe: output correct or in safe state
– Flashing red traffic light, disabled ATM
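A minimal sketch of the fault-secure response, assuming a duplicate-and-compare scheme (one possible way to obtain it, not necessarily the one intended in the slides). The names `fault_secure` and `ErrorIndication` are hypothetical.

```python
class ErrorIndication(Exception):
    """Raised instead of delivering a possibly wrong output."""


def fault_secure(replica_a, replica_b, x):
    """Duplicate-and-compare: return a result only if both replicas agree."""
    a, b = replica_a(x), replica_b(x)
    if a != b:
        # The caller gets an explicit error indication and may retry.
        raise ErrorIndication(f"replicas disagree: {a!r} vs {b!r}")
    return a


def faulty_abs(v):
    return v  # hypothetical faulty replica: forgets to drop the sign


print(fault_secure(abs, abs, -3))
try:
    fault_secure(abs, faulty_abs, -3)
except ErrorIndication as exc:
    print("error indication:", exc)
```

The scheme only covers faults that make the two replicas differ; a fault that corrupts both in the same way would pass the comparison.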
What is Fault-Tolerance?
A fault-tolerant system is one that continues to perform at the desired level of service, according to its specification, in the presence of faults.
Ideally, there are no failures in a fault-tolerant system: the faults it is designed to tolerate do not lead to failures.
Fault-tolerance is the ability of a system to provide a service complying with the specification in spite of faults.
A better title might have been Dependable or Reliable or Available computing
Fault Tolerance
In the physical universe:
- Fault detection
- Fault location
- Fault containment
- Fault recovery
- Continue servicing
In the informational universe:
- Error detection
- Error location
- Error containment
- Error recovery
- Continue servicing
Fault Recovery
How quickly is the fault detected?
How soon can recovery begin?
– Does it require human intervention?
– How is the system admin notified?
How long does recovery take?
– Restore from backup?
– Purchase new HW?
Fault Coverage (C)
Measure of system’s ability to perform:
– fault detection
– fault location
– fault containment
– (and/or fault recovery)
C = P (fault detection | fault occurrence)
C = P (fault recovery | fault occurrence)
Note:
– recovery implies that the system as a whole is operational
  • this does not imply that a repair occurred
– e.g. a duplex system with a benign fault can recover to continue operation on one non-faulty processor
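As a worked example of the conditional probability above, coverage can be estimated as the fraction of injected faults that the system handled. The function name `fault_coverage` and the campaign numbers are hypothetical.

```python
def fault_coverage(outcomes):
    """Estimate C = P(handled | fault occurred) from fault-injection outcomes.

    `outcomes` holds one boolean per injected fault: True if the system
    detected (or recovered from) that fault, False otherwise.
    """
    if not outcomes:
        raise ValueError("need at least one injected fault")
    return sum(outcomes) / len(outcomes)


# Hypothetical campaign: 200 injected faults, 190 of them handled.
results = [True] * 190 + [False] * 10
print(fault_coverage(results))
```

The same function serves for detection coverage or recovery coverage; only the meaning of "handled" changes.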
Design Philosophies to Combat Faults
Fault avoidance (off-line)
• Attempts to prevent faults from occurring in the first place:
  • Design review
  • Component selection
  • Quality control
  • Shielding
  • Testing
Fault masking (on-line)
• Attempts to prevent a fault in a system from introducing errors
  • Error correcting memory
  • Majority voting
Fault tolerance (on-line)
• Attempts to enable a system to continue performing its expected tasks after the occurrence of faults
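Majority voting, listed under fault masking, can be sketched as three replicas feeding a voter. The module functions below are hypothetical stand-ins; the voter simply returns the value that a strict majority agrees on.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, else raise."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: fault cannot be masked")
    return value


def good_module(x):
    return x * x


def faulty_module(x):
    return x * x + 1  # hypothetical faulty replica


modules = [good_module, faulty_module, good_module]
result = majority_vote([m(4) for m in modules])
print(result)
```

A single faulty replica is outvoted, so no error reaches the output. Two replicas failing in the same way would win the vote, which is why masking is defined for a specific class of faults.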
Design Philosophies to Combat Faults
Fault avoidance Fault masking Fault tolerance
Fault Avoidance vs. Tolerance
Fault avoidance: eliminate problem sources
– Remove defects: Testing and debugging
– Robust design: reduce probability of defects
– Minimize environmental stress: Radiation shielding etc.
– Impossible to avoid faults completely
Fault tolerance: add redundancy to mask effect
– Additional resources needed (more later)
– Examples:
  • Error correction coding
  • Backup storage
  • Spare tire etc.
Fault Forecasting vs. Tolerance
Fault Tolerance
• Execution-time techniques that deal with the effects of faults
Fault Forecasting
• Estimate current number, future incidence and likely consequences of faults
• You can’t tolerate what you don’t expect
• But if we expected it, we would avoid or eliminate the fault!
• In general:
  • We can itemize the classes of faults that can occur
  • We can define what we want done if the fault occurs and if the error is detected
• Example: Automobile tire
  • Expect it to lose air
  • Do not expect it to experience electrical overload
Fault Tolerant computing
Deterministic approaches
– Based on simplifying assumptions: “fault model”
– Obtain methods using the models: test generation
– Evaluation of effectiveness
– Used for testing & combinatorial fault-tolerance
Probabilistic approaches
– We can’t predict exactly when a person will die, but we can still get “life expectancy = 77.2”, if we have data
– Used for evaluating, achieving and optimizing reliability
– Random testing
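The "life expectancy" idea carries over to components as a mean time to failure estimated from data. The lifetimes below are made-up numbers, and the exponential reliability function assumes a constant failure rate, which is an added modelling assumption rather than something stated in the slides.

```python
import math
import statistics

# Hypothetical observed lifetimes (hours) of identical components.
lifetimes = [900.0, 1100.0, 1000.0, 950.0, 1050.0]

# The component "life expectancy": mean time to failure (MTTF).
mttf = statistics.mean(lifetimes)


def reliability(t, mttf):
    """P(component survives to time t), assuming a constant failure rate."""
    return math.exp(-t / mttf)


print(mttf, reliability(500.0, mttf))
```

No single failure time is predicted; the data only supports statements about averages and probabilities.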
Fault Tolerance vs. Performance
There are many fault-tolerance approaches that sacrifice performance to tolerate faults
Ex. 1:
– Periodically stop the system and checkpoint its state to disk.
  * If a fault occurs, recover state from checkpoint and resume
Ex. 2:
– Log all changes made to system state in case recovery is needed
  * During recovery, undo the changes from the log
Ex. 3:
– Run two identical systems in parallel, compare their results before using them
Ex. 4:
– Run software with lots of error checking
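Ex. 1 (checkpoint and resume) can be sketched with an in-memory toy system. The class `CheckpointedCounter`, the `run` helper and the use of a negative input as the "fault" are all assumptions for illustration; a real system would write the checkpoint to stable storage.

```python
import copy


class CheckpointedCounter:
    """Toy system whose state can be checkpointed and restored."""

    def __init__(self):
        self.state = {"processed": 0, "total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # In a real system this copy would go to stable storage (e.g. disk).
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)

    def process(self, value):
        if value < 0:
            raise ValueError("simulated fault")
        self.state["processed"] += 1
        self.state["total"] += value


def run(system, inputs, interval=2):
    for i, value in enumerate(inputs, start=1):
        try:
            system.process(value)
        except ValueError:
            system.rollback()    # recover the last saved state and resume
            continue
        if i % interval == 0:
            system.checkpoint()  # the periodic pause is the performance cost
    return system.state


print(run(CheckpointedCounter(), [1, 2, 3, -1, 5]))
```

In this run the work done on the input `3` is lost, because it happened after the last checkpoint. A shorter interval loses less work per fault but pauses the system more often.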
Fault Tolerance vs. Cost
There are many fault-tolerance approaches that sacrifice cost to tolerate faults
Ex. 1:
– Replicate the hardware 3 times and vote to determine correct output
Ex. 2:
– Mirror the disks (RAID-1) to tolerate disk failures
Ex. 3:
– Use multiple independent versions of software to tolerate bugs (called N-version programming)
Fault Tolerance vs. Power
There are many fault-tolerance approaches that sacrifice power to tolerate faults
Ex. 1, 2 & 3 (same as previous slide)
– Replicate the hardware 3 times and vote to determine correct output
– Mirror the disks (RAID-1) to tolerate disk failures
– Use multiple independent versions of software to tolerate bugs
Ex. 4
– Add continuously running checking hardware to system
Ex. 5
– Add extra code to check for software faults
Need for Fault Tolerance: Universal
Natural objects:
• Fat deposits in body: survival in starvation
• Duplication of eyes: graceful degradation upon failure
Man-made objects:
• Redundancy in ordinary text
• Asking for password twice during initial set-up
• Duplicate tires in trucks
Forms of Redundancy
Hardware redundancy
– add extra hardware for detecting or tolerating faults
Software redundancy
– add extra software for detecting and possibly tolerating faults
Information redundancy
– extra information, i.e. codes
Time redundancy
– extra time for performing tasks for fault tolerance
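A single even-parity bit is about the smallest example of information redundancy: one extra bit makes any single-bit error detectable. The function names are hypothetical, and parity is just one code among many.

```python
def add_even_parity(bits):
    """Append one parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(codeword):
    """True if the codeword still contains an even number of 1s."""
    return sum(codeword) % 2 == 0


word = add_even_parity([1, 0, 1, 1])
corrupted = word.copy()
corrupted[2] ^= 1  # simulate a single-bit fault
print(parity_ok(word), parity_ok(corrupted))
```

Parity detects the error but cannot locate or correct it, and it misses any even number of flipped bits. Correction needs more redundant bits.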
Redundancy base
[Figure: redundancy in time (Try, Retry, Retry) vs. redundancy in space (Try, Try, Try)]
Time redundancy
Suppose the data is transmitted over a parallel bus
• 1- At time t0, the original data is transmitted.
• 2- Then, the data is complemented and re-transmitted at time t0 + ΔD.
• 3- The two results are compared to check whether they are complements of each other.
• 4- Any disagreement indicates a fault.
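The four steps can be sketched for an 8-bit bus. The bus width, the `bus` fault model (some lines stuck at 0) and the function names are assumptions for illustration.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def bus(word, stuck_at_zero_mask=0):
    """Model a parallel bus; lines set in the mask are stuck at 0 (assumed fault)."""
    return word & ~stuck_at_zero_mask & MASK


def send_with_time_redundancy(data, stuck_at_zero_mask=0):
    first = bus(data, stuck_at_zero_mask)           # step 1: send data at t0
    second = bus(~data & MASK, stuck_at_zero_mask)  # step 2: send complement later
    fault_detected = first != (~second & MASK)      # steps 3-4: must be complements
    return first, fault_detected


print(send_with_time_redundancy(0b10110010))
print(send_with_time_redundancy(0b10110010, stuck_at_zero_mask=0b10))
```

A stuck line carries the wrong value in exactly one of the two transmissions, whichever bit is sent, so the complement check exposes a permanent fault that sending the same word twice would miss.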
Time redundancy - Example
The alternating logic concept can be used to detect faults in logic circuits that implement self-dual functions.
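A function f is self-dual when complementing all inputs complements the output. The sketch below checks that property by brute force and then uses it as an alternating-logic test; all function names, and the stuck-output fault, are hypothetical.

```python
from itertools import product


def alternates(f, bits):
    """Alternating-logic check: apply the input, then its complement."""
    return f(*[1 - b for b in bits]) == 1 - f(*bits)


def is_self_dual(f, n):
    """True if the output alternates for every n-bit input."""
    return all(alternates(f, bits) for bits in product((0, 1), repeat=n))


def majority(a, b, c):
    return (a & b) | (a & c) | (b & c)


def and2(a, b):
    return a & b


def stuck_at_zero(*_):
    return 0  # hypothetical faulty circuit with its output stuck at 0


print(is_self_dual(majority, 3), is_self_dual(and2, 2))
print(alternates(stuck_at_zero, (1, 0, 1)))
```

The 3-input majority function is self-dual, while 2-input AND is not. For a self-dual circuit, an output that fails to alternate between the two passes indicates a fault.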