SW_FT_1, Slide 1ECE 442/CS 436, SPR 2004
Design of Reliable Systems and Networks
ECE 442/CS 436
Software Fault Tolerance
N-Version Programming and Recovery Blocks
Prof. Ravi K. Iyer
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
http://www.crhc.uiuc.edu/DEPEND
SW_FT_1, Slide 2ECE 442/CS 436, SPR 2004
Outline
• Recovery blocks
• N-version programming
• Practice of high-availability system design
– IBM mainframe example
SW_FT_1, Slide 3ECE 442/CS 436, SPR 2004
Motivation for Software Fault Tolerance
• Can increase software reliability via fault avoidance using software engineering and testing methodologies
• In large and complex systems, fault avoidance alone is not successful
• Redundancy in software may be needed to detect, isolate, and recover from software failures
• Both static redundancy and dynamic redundancy models exist
• Hardware fault tolerance easier to assess
• Software is difficult to prove correct
   HARDWARE FAULTS                    SOFTWARE FAULTS
1. Faults time-dependent              Faults time-invariant
2. Duplicate hardware detects         Duplicate software not effective
3. Random failure is main cause       Complexity is main cause
SW_FT_1, Slide 4ECE 442/CS 436, SPR 2004
Consequences of Software Failure
• General Accounting Office reports $4.2 million lost annually due to software errors
• Launch failure of Mariner I (1962)
• Destruction of French satellite (1988)
• Problems with Space Shuttle and Apollo missions
• STAR WARS (SDI) = funding billions of dollars for “correct” software development
• AT&T blockages (error in recovery-recognition software)(1990)
• SS7 (signaling system) protocol implementation - untested patch (mistyped character) (1997)
• Therac 25 (overdose of medical radiation 1000’s of rads in excess of prescribed dosage)
SW_FT_1, Slide 5ECE 442/CS 436, SPR 2004
Experiences with Current Software
• Many computer crashes are due to software
• Although recent data shows HW transients and design errors on the increase
• Even though one expects software to be correct, it never is
• Mature software exhibits fairly constant failure frequency
• Number of failures is correlated with
– Execution time
– Code density
– Software timing, synchronization points
SW_FT_1, Slide 6ECE 442/CS 436, SPR 2004
Experiences with Current Software (cont.)
Parameter                        Symbol   Value      Unit
Defect Detection Time Constant   s        17.2       Weeks
Defect Repair Time Constant      t        4.7        Weeks
Code Delivery                             589810     Lines
Initial Error Density                     0.00387    Defects per Line
Defect Reintroduction Rate                33         Percent
Deployment Time                  T        100        Week
Estimated Remaining Defects      ERDT     664        Defects
Estimated Current Defects        ECDT     445        Defects
Testing Process Quality          TPQT     90         Percent
Testing Process Efficiency       TPET     60         Percent
Key parameters and variables (with defect reintroduction)
[Figure: Cumulative test failures (t). Horizontal axis: number of tests t (1000 to 5000). Vertical axis: cumulative test failures (0 to 300). Plot label: FAILRATE.]
SW_FT_1, Slide 7ECE 442/CS 436, SPR 2004
Difficulties
• Improvements in software development methodologies reduce the incidence of faults, yielding fault avoidance
• Need for test and verification
• Formal verification techniques, such as proof of correctness, can be applied to rather small programs
• Potential exists of faulty translation of user requirements
• Conventional testing is hit-or-miss. “Program testing can show the presence of bugs but never show their absence.” (Dijkstra, 1972)
• There is a lack of good fault models.
SW_FT_1, Slide 8ECE 442/CS 436, SPR 2004
Approaches to Software Fault Tolerance
• ROBUSTNESS: The extent to which software continues to operate despite introduction of invalid inputs.
Example: 1. Check input data
=>ask for new input
=>use default value and raise flag
2. Self checking software
• FAULT CONTAINMENT: Faults in one module should not affect other modules.
Example: Reasonableness checks
Watchdog timers
Overflow/divide-by-zero detection
Assertion checking
• FAULT TOLERANCE: Provides uninterrupted operation in presence of program fault through multiple implementations of a given function
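The robustness example above (check the input, then substitute a default value and raise a flag) can be sketched as follows. This is only an illustration: the names `CheckedInput` and `check_input` and the numeric range are assumptions, not part of the course material.

```python
from dataclasses import dataclass


@dataclass
class CheckedInput:
    value: float
    defaulted: bool  # the "flag" raised when the default was substituted


def check_input(raw, low, high, default):
    """Return raw as a float if it parses and lies in [low, high].

    Otherwise return the default value with the flag set (hypothetical policy).
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return CheckedInput(default, True)
    if not (low <= value <= high):  # this comparison also rejects NaN
        return CheckedInput(default, True)
    return CheckedInput(value, False)
```

A caller would inspect `defaulted` before trusting `value`. Asking for new input instead of defaulting is the other option listed on the slide.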
SW_FT_1, Slide 9ECE 442/CS 436, SPR 2004
Recovery Blocks/N-Version Programming
• Two relevant observations from 1830:
– Anon.: “The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers; and this check is rendered still more decisive if they make their computations by different methods.”
– Charles Babbage: “When the formula to be computed is very complicated, it may be algebraically arranged for computation in two or more totally distinct ways, and two or more sets of cards may be made. If the same constants are now employed with each set, and if under these circumstances the results agree, we may be quite secure of the accuracy of them all.”
SW_FT_1, Slide 10ECE 442/CS 436, SPR 2004
N-Version Programming Basic Model
[Figure: the j-th N-version software unit. Versions 1, 2, and 3 execute in Lanes A, B, and C within the Executive Environment (EE), each with its own input and output. The EE provides the EE executive support functions and the decision algorithm, which delivers the consensus results. These additions are labeled "Software Unit Enhancements for Fault-Tolerant Execution".]
The N-version software (NVS) model with n=3
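A decision algorithm for the n=3 model can be sketched as a strict majority vote over the version outputs. This is a minimal illustration, assuming outputs are hashable and compared for exact equality; the three `version_*` functions are hypothetical stand-ins for independently developed versions, and the third is deliberately faulty.

```python
from collections import Counter


def majority_vote(results):
    """Return the value produced by a strict majority of versions."""
    if not results:
        raise RuntimeError("no version results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if 2 * count > len(results):
        return value
    raise RuntimeError("no consensus among versions")


def run_n_versions(versions, x):
    """Run every version on the same input and vote on the outputs."""
    return majority_vote([version(x) for version in versions])


# Hypothetical versions of an absolute-value routine.
def version_a(x):
    return abs(x)


def version_b(x):
    return x if x >= 0 else -x


def version_c(x):
    return x  # faulty: mishandles negative inputs


consensus = run_n_versions([version_a, version_b, version_c], -3)
```

Here the single faulty version is outvoted. If two versions shared the same fault, the voter would deliver the wrong value as the consensus.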
SW_FT_1, Slide 11ECE 442/CS 436, SPR 2004
Recovery Blocks Basic Model
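[Figure: the j-th recovery block software unit within the Execution Environment (EE). The input is processed by the current alternate (Alternate 1, Alternate 2, ...) and the result is checked by the acceptance test. Yes: the accepted results are delivered. No: take the next alternate. The execution support functions, recovery cache, and acceptance test are labeled "Software Unit Enhancements for Fault-Tolerant Execution".]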
The Recovery Block (RB) Model
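The control flow of the RB model can be sketched as below. This is an illustrative sketch, not the course's implementation: the state is assumed to be a dictionary, a deep copy stands in for the recovery cache, and the sorting alternates and acceptance test are hypothetical.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in order until one passes the acceptance test."""
    checkpoint = copy.deepcopy(state)  # stand-in for the recovery cache
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):
                return state
        except Exception:
            pass  # an exception counts as a failed alternate
        # Restore the state saved on entry before taking the next alternate.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("all alternates failed the acceptance test")


# Hypothetical example: the primary alternate loses an element while sorting.
def faulty_primary(state):
    state["data"] = sorted(state["data"])[:-1]


def simple_alternate(state):
    state["data"] = sorted(state["data"])


def accept(state):
    data = state["data"]
    in_order = all(a <= b for a, b in zip(data, data[1:]))
    return in_order and len(data) == state["n"]


result = recovery_block({"data": [3, 1, 2], "n": 3},
                        [faulty_primary, simple_alternate], accept)
```

The acceptance test checks plausibility (ordering and length), not absolute correctness, which keeps its run-time cost low.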
SW_FT_1, Slide 12ECE 442/CS 436, SPR 2004
Execution Models for Software Fault-Tolerance Approaches
Recovery blocks
[State diagram: start software execution → primary alternate execution → end primary alternate execution → acceptance test execution. Acceptance test passed → end software execution. Acceptance test not passed → alternate selection. If a further alternate (2, ..., N) is selected, it is executed and the acceptance test runs again; if no alternate is available any more → failed software.]
N-version programming
[State diagram: start software execution → Version #1, Version #2, ..., Version #N execution → end of execution of each version → gathering versions' results → start decision algorithm execution → decision algorithm execution. Acceptable result provided → end software execution. No acceptable result provided → failed software.]
SW_FT_1, Slide 13ECE 442/CS 436, SPR 2004
Execution Models for Software Fault-Tolerance Approaches (cont.)
N self-checking programming
[State diagram: start software execution → self-checking components #1, #2, ..., #N execute; each either provides an acceptable result or provides no acceptable result. Result selected → end software execution. No result selected → failed software.]
SW_FT_1, Slide 14ECE 442/CS 436, SPR 2004
Software Fault-Tolerance Approaches and Their Equivalent Hardware Counterparts
• RB is equivalent to the stand-by sparing (of passive dynamic redundancy) in HW fault-tolerant architectures
• NVP is equivalent to N-modular redundancy (static redundancy) in HW fault-tolerant architectures
• NSCP is equivalent to active dynamic redundancy
– A self-checking component results either from:
• The association of an acceptance test to a version
• The association of two variants with a comparison algorithm
– Fault-tolerance is provided by the parallel execution of N ≥ 2 self-checking components
SW_FT_1, Slide 15ECE 442/CS 436, SPR 2004
Concepts of N-Version Programming
• N ≥ 2 versions of functionally equivalent programs
• “Independent” generations of programs carried out by N groups of individuals who do not talk to each other with respect to programming process (different algorithms, different programming languages, translation)
• Initial specification formally done in some formal spec. language
– states unambiguously the functional requirements
– leaves widest possible choice of implementation
• By making the development process diverse it is hoped that the versions will contain diverse faults
• The inventors of NVP emphasized that:
– “the definition of NVP has never postulated an assumption of independence and that NVP is a rigorous process of software development”
SW_FT_1, Slide 16ECE 442/CS 436, SPR 2004
An Assumption of Independence in N-Version Programming ?
• Do the N versions of a program fail independently (similar to hardware)? Are faults unrelated?
Does Prob(failure of N-version system) = [Prob(failure of one version)]^N ?
– If so, then the system reliability can be very high
• Why might such an assumption be false?
– People make same mistakes, e.g. incorrect treatment of boundary conditions
– Some parts of a problem more difficult than others
• statistics show similarity in programmer’s view of “difficult” regions
SW_FT_1, Slide 17ECE 442/CS 436, SPR 2004
Observation from Experiments
• Assumption of independence of failures of versions DOES NOT hold
• This does not mean N-version programming is not useful
• The reliability of the system will not be as high as in the case when the faults in different versions are independent
• Example: PODS (Project on Diverse Software)
– All faults were caused by omissions and ambiguities in the requirement specifications
– Two common faults were found in two versions
– Three different versions of software with failure rates 1.5 × 10^-6, 0.8 × 10^-3, and 0.8 × 10^-3 resulted in a failure rate of 0.8 × 10^-3 after majority voting
– The common/coincident faults could not be excluded by majority voting
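To see how far the PODS result is from the independence assumption, the figures above can be plugged into the two-out-of-three failure formula. This is a back-of-the-envelope sketch that assumes the quoted failure rates can be read as per-demand failure probabilities and that the voted system fails exactly when at least two versions fail; neither assumption is stated in the source.

```python
def two_of_three_failure(p1, p2, p3):
    """P(at least two of three versions fail), assuming independent failures."""
    return (p1 * p2 * (1 - p3)
            + p1 * p3 * (1 - p2)
            + p2 * p3 * (1 - p1)
            + p1 * p2 * p3)


predicted = two_of_three_failure(1.5e-6, 0.8e-3, 0.8e-3)  # roughly 6.4e-7
observed = 0.8e-3  # failure rate after majority voting, from the slide
ratio = observed / predicted  # more than a thousand times worse than predicted
```

Under independence the voted system would be about three orders of magnitude better than what was observed, which is consistent with the two weaker versions failing on the same inputs.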
SW_FT_1, Slide 18ECE 442/CS 436, SPR 2004
Limitation of N-Version Programming
• All N versions originate from the same initial specification, whose correctness, completeness, and unambiguity must be assumed
– Use formal correctness proofs on specs, rather than proofs on implementations
– Exhaustive validation
• Based on an assumption that software faults are distinguishable:
– faults that will cause disagreement between versions at specified voting points might be a result of independent programming efforts to remove identical software defects
SW_FT_1, Slide 19ECE 442/CS 436, SPR 2004
Concepts of Recovery Blocks
• Characteristics:
– Incorporates general solution to the problem of switching to spare
– Explicitly structures a software system so that extra software for spares and error detection does not reduce system reliability
– First to consider a single sequential process; later extended to
• Multiple processes within one system
• Multiple processes in multiple systems => distributed recovery blocks
• Can view progress as sequences of basic operations, assignments to stored variable
• Structured program has BLOCKS of code to simplify understanding of the functional description
• Choose blocks as units for error detection and recovery.
SW_FT_1, Slide 20ECE 442/CS 436, SPR 2004
Alternates
• Primary alternate is the one that is to be used normally
• Other alternates attempt less desirable options
• One source of alternates is earlier releases of the primary alternate
• Gracefully degraded alternates
SW_FT_1, Slide 21ECE 442/CS 436, SPR 2004
Acceptance Tests
• Function: ensure the operation of recovery blocks is satisfactory
• Should access variables in the program, NOT local to the recovery block, since these cannot have effect after exit. Also, different alternates use different local variables.
• Need not check for absolute “correctness” - cost/complexity trade-off
• Run-time overheads should be LOW
• NO RESIDUAL EFFECTS should be present, since variables, if updated, might result in passing of successive alternates
SW_FT_1, Slide 22ECE 442/CS 436, SPR 2004
Restoration of System State
• Restoring system state is automatic
• Taking a copy of entire system state on entry to each recovery block is too costly
• Use Recovery Caches or “Recursive” Caches
• When a process is to be backed up, it is to a state just before entry to primary alternate
• Only NONLOCAL variables that have been MODIFIED have to be reset
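The idea of saving only the nonlocal variables that are actually modified can be sketched as a small write-through cache. This is an illustration under stated assumptions: nonlocal variables live in a dictionary, all updates go through `write`, and values are replaced rather than mutated in place. The class and method names are hypothetical.

```python
class RecoveryCache:
    """Records the prior value of each nonlocal variable on its first write."""

    _MISSING = object()  # marks variables that did not exist on entry

    def __init__(self, variables):
        self.variables = variables  # dict holding the nonlocal variables
        self.saved = {}             # name -> value on entry to the block

    def write(self, name, value):
        if name not in self.saved:  # only the first write saves the old value
            self.saved[name] = self.variables.get(name, self._MISSING)
        self.variables[name] = value

    def restore(self):
        """Back up to the state just before entry to the primary alternate."""
        for name, old in self.saved.items():
            if old is self._MISSING:
                self.variables.pop(name, None)
            else:
                self.variables[name] = old
        self.saved.clear()

    def commit(self):
        """Discard saved values once the acceptance test has passed."""
        self.saved.clear()
```

The cost of a checkpoint is then proportional to the number of variables modified, not to the size of the whole system state.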
SW_FT_1, Slide 23ECE 442/CS 436, SPR 2004
Process Conversations
• A systematic methodology for extending recovery blocks across processes by taking process interactions into consideration (considers time/space)
SW_FT_1, Slide 24ECE 442/CS 436, SPR 2004
Process Conversations (cont.)
• Recovery block spanning two or more processes is called a conversation
• Within a conversation, processes communicate among themselves, NOT with others
• Operations of a conversation
– Within a conversation, communication is only among participants, not external
– On entry, a process establishes a checkpoint
– If an error is detected by any process, then all processes restore their checkpoints
– Next, ALL processes execute their next available alternate
– All processes leave the conversation together (perform their acceptance tests just prior to leaving)
• At the end of the conversation, ALL processes must satisfy their respective acceptance tests, and none may proceed otherwise
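The conversation rules above can be sketched as follows. This is a simplified single-threaded illustration: inter-process communication is not modelled, every participant is assumed to have alternates indexed in the same order, and the names `Participant` and `run_conversation` are hypothetical.

```python
import copy


class Participant:
    def __init__(self, state, alternates, acceptance_test):
        self.state = state
        self.alternates = alternates          # callables that update a state
        self.acceptance_test = acceptance_test


def run_conversation(participants):
    """Return the index of the alternate with which all participants passed."""
    # On entry, every process establishes a checkpoint.
    checkpoints = [copy.deepcopy(p.state) for p in participants]
    attempts = min(len(p.alternates) for p in participants)
    for k in range(attempts):
        try:
            for p in participants:
                p.alternates[k](p.state)
            # Processes leave together only if ALL acceptance tests pass.
            if all(p.acceptance_test(p.state) for p in participants):
                return k
        except Exception:
            pass  # an exception is treated as a detected error
        # Any detected error makes ALL processes restore their checkpoints.
        for p, saved in zip(participants, checkpoints):
            p.state = copy.deepcopy(saved)
    raise RuntimeError("conversation failed: no alternates left")


def increment(state):
    state["x"] += 1


def bad_y(state):
    state["y"] = -1


def good_y(state):
    state["y"] = 1


p1 = Participant({"x": 0}, [increment, increment], lambda s: s["x"] == 1)
p2 = Participant({"y": 0}, [bad_y, good_y], lambda s: s["y"] > 0)
used = run_conversation([p1, p2])
```

Note that `p1` is rolled back and re-executed even though its own acceptance test passed on the first attempt, because no participant may proceed unless all of them pass.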
SW_FT_1, Slide 25ECE 442/CS 436, SPR 2004
Nested Conversations
[Figure: nested conversations among processes P1, P2, and P3. Legend: inter-process communication, conversation boundary, checkpoint, acceptance test.]
SW_FT_1, Slide 26ECE 442/CS 436, SPR 2004
Comparison of Recovery Blocks vs. N-Version Programming
• Advantages of Recovery Block
– Most software systems evolve by replacement of some modules by new ones - can be used as alternates
– Nice hierarchical design - structured approach
• Disadvantages of Recovery Block
– System state must be saved before entry to recovery block -- excessive storage
– Difficult to handle multiple processes -- might have domino effect
– Difficult to undo effects in real-time systems
– Effectiveness of acceptance test
– Higher coverage is more complex
– Lack of formal method to check
SW_FT_1, Slide 27ECE 442/CS 436, SPR 2004
Comparison of Recovery Block vs. N-Version Programming (cont.)
• Advantages of N-Version Programming
– Immediate masking of software faults -- no delay in operation
– Self-checking (acceptance tests) not required
– Conventional fault-tolerant systems (HW and SW) already have redundant hardware, e.g., TMR, so it is easier to include N-version software on the redundant hardware
• Disadvantages of N-Version Programming
– How to get N-versions?
• Impose design diversity, since randomness does not give uncorrelated software faults
– Extremely dependent on input specifications (formal correctness proofs…)
SW_FT_1, Slide 28ECE 442/CS 436, SPR 2004
High-Availability System Design
IBM Mainframe 30xx
SW_FT_1, Slide 29ECE 442/CS 436, SPR 2004
IBM 30xx Simplified System Model
[Figure: block diagram with the following components: expanded storage, central memory, system controller, CPUs, processor controller, channel control, channel adapters and servers, power distribution and cooling.]
SW_FT_1, Slide 30ECE 442/CS 436, SPR 2004
Fault Isolation Using Hardware Checkers
• Error checker placement determined by Fault Isolation Domains (FID)
• Checkers define the boundary of fault containment
• If checkers detect all faults, fault isolation consists of stopping the processor and identifying the triggered checker, i.e., the relevant FID and possibly the Field Replaceable Unit (FRU)
SW_FT_1, Slide 31ECE 442/CS 436, SPR 2004
Fault Isolation Using Hardware Checkers
[Figure: circuit with memory array 1, register A, drivers, cable, memory array 2, register B, register C, and decoder, monitored by checkers 1, 2, and 3 and packaged in FRU 1 through FRU 5. Red = Fault Isolation Domain; Blue = Field Replaceable Unit.]
If checker 2 is triggered and register C is the input to register B, the implicated set of FRUs is {3, 4}
SW_FT_1, Slide 32ECE 442/CS 436, SPR 2004
Mapping of Fault Isolation Domains to Field Replaceable Units
Function          FID   FRU   Syndrome
Memory array 1    1     1     C1
Register A        1     1     C1
Checker 1         1     1     C1
Drivers           2     1     C2
Cable             2     2     C2
Memory array 2    2     3     C2
Register B        2     3     C2
Checker 2         2     3     C2
Register C        2     4     C2
Decoder           3     5     C3
Checker 3         3     5     C3
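The mapping table can be read as a lookup from a checker syndrome to candidate replaceable units. The sketch below encodes the rows as given and collects every FRU that hosts a function in the triggered checker's domain; the function name is hypothetical, and any further narrowing based on knowledge of the data path is not modelled.

```python
# Rows of the table: (function, FID, FRU, syndrome).
FID_TABLE = [
    ("Memory array 1", 1, 1, "C1"),
    ("Register A", 1, 1, "C1"),
    ("Checker 1", 1, 1, "C1"),
    ("Drivers", 2, 1, "C2"),
    ("Cable", 2, 2, "C2"),
    ("Memory array 2", 2, 3, "C2"),
    ("Register B", 2, 3, "C2"),
    ("Checker 2", 2, 3, "C2"),
    ("Register C", 2, 4, "C2"),
    ("Decoder", 3, 5, "C3"),
    ("Checker 3", 3, 5, "C3"),
]


def candidate_frus(syndrome):
    """Return the sorted FRUs hosting any function that maps to the syndrome."""
    return sorted({fru for _, _, fru, syn in FID_TABLE if syn == syndrome})


def functions_in_domain(fid):
    """Return the functions inside one fault isolation domain."""
    return [name for name, domain, _, _ in FID_TABLE if domain == fid]
```

From the table alone, syndrome C1 points at a single FRU, while C2 spans a domain that crosses several FRUs, so extra information is needed to narrow the replacement candidates.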
SW_FT_1, Slide 33ECE 442/CS 436, SPR 2004
IBM 30XX Data Path Overview
[Figure: data path overview showing expanded storage and central storage (each with a storage controller with hardware-assisted memory tester), the system controller, the CPU (cache, instruction fetch/decode, instruction execution, vector execution, control storage parity), the I/O processor, the channel control element (primary data stager, secondary data stagers), the channel adapter, the channel server, and the processor controller. Paths and units are marked as protected by ECC or by parity. P = parity; LSSG = Logic Support Station Group.]
SW_FT_1, Slide 34ECE 442/CS 436, SPR 2004
Hardware-Based Retry
[Flowchart: during instruction execution, operands are placed into retry buffers and the result is tested. Error detected? No: execution continues. Yes: freeze execution and check whether retry is permitted. Retry permitted: get instructions/data from the retry buffers and restart execution (instruction retry). No retry or threshold crossed: signal the OS for SW recovery. The instruction and execution elements stop on error, restore operands, and communicate back to the processor controller through the LSS.]
Errors are detected by parity checks on register contents and on data buses and by pattern validity checks in control logic circuits.
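The retry flow can be mimicked in software as a bounded retry loop that re-executes from saved operands and escalates once a threshold is crossed. This is only an analogy to the hardware mechanism: the exception classes, the threshold value, and the simulated transient fault are all assumptions made for illustration.

```python
class HardwareError(Exception):
    """Stands in for an error raised by a parity or pattern checker."""


class SoftwareRecoveryRequired(Exception):
    """Stands in for signalling the OS for software recovery."""


def execute_with_retry(instruction, operands, retry_threshold=3):
    retry_buffer = tuple(operands)  # operands saved before execution
    for _ in range(retry_threshold + 1):  # first try plus permitted retries
        try:
            return instruction(*retry_buffer)  # restart from saved operands
        except HardwareError:
            continue  # freeze, restore operands, retry
    raise SoftwareRecoveryRequired("retry threshold crossed")


def make_transient_adder(faults):
    """Build an adder that fails the first `faults` times it is executed."""
    remaining = [faults]

    def add(a, b):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise HardwareError("parity error")
        return a + b

    return add


total = execute_with_retry(make_transient_adder(2), (2, 3))
```

A transient fault disappears within the retry budget, whereas a persistent fault exhausts it and is handed to the next recovery level.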
SW_FT_1, Slide 35ECE 442/CS 436, SPR 2004
Checkers in The Central Processor
• Byte parity on data path registers
• Parity checks on input/output of adders
• Parity on microstore
• Parity on microstore addresses
• Encoder/decoder checks
• Single-bit error detection in cache for data received from memory
• Additional illegal pattern checks
SW_FT_1, Slide 36ECE 442/CS 436, SPR 2004
Levels of Error Recovery
[Flowchart: a machine check interruption during system operation triggers up to four levels of recovery; each level is entered only if the previous one is unsuccessful.]
1 Functional recovery: perform instruction retry. Successful: system continues.
2 System recovery: terminate affected task and continue system operation. Successful: system continues, task terminated.
3 System-supported restart: restart system operation, stop for repair not required. Successful: system reloaded.
4 System repair: notify operator, external repair. Stop, repair, & restart.
SW_FT_1, Slide 37ECE 442/CS 436, SPR 2004
Recovery Processing Overview Handling Hardware and Software Errors
[Figure: recovery processing components: control program, recovery termination manager, termination routines, retry routines, recovery routines, ABEND (abnormal termination).]
SW_FT_1, Slide 38ECE 442/CS 436, SPR 2004
System Level Facilities for Error Detection and Recovery
• Installation error detection capability
• Tools to build “profiles” of system software modules and inspect correct usage of system resources.
• Software facilities to detect the occurrences of selected events, e.g., appendages allow user control of I/O; SLIP (serviceability level indication processing) aids in error detection and diagnosis (e.g., access to traps that cause a program interruption).
• User defines detection mechanisms to detect programmer-defined exceptions, e.g., incorrect address or attempting privileged instructions.
• The operator detects evident error conditions, e.g., loop conditions, endless wait states
• The data management and supervisor routines ensure valid data is processed and non-conflicting requests are made