SW_FT_1, Slide 1
ECE 442/CS 436, SPR 2004

Design of Reliable Systems and Networks
ECE 442/CS 436

Software Fault Tolerance
N-Version Programming and Recovery Blocks

Prof. Ravi K. Iyer
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign


http://www.crhc.uiuc.edu/DEPEND


Outline

• Recovery blocks

• N-version programming

• Practice of high-availability system design

– IBM mainframe example


Motivation for Software Fault Tolerance

• Can increase software reliability via fault avoidance using software engineering and testing methodologies

• For large and complex systems, fault avoidance alone is not successful

• Redundancy in software may be needed to detect, isolate, and recover from software failures

• Both static redundancy and dynamic redundancy models exist

• Hardware fault tolerance easier to assess

• Software is difficult to prove correct

   HARDWARE FAULTS                    SOFTWARE FAULTS
1. Faults time-dependent              Faults time-invariant
2. Duplicate hardware detects         Duplicate software not effective
3. Random failure is main cause       Complexity is main cause


Consequences of Software Failure

• General Accounting Office reports $4.2 million lost annually due to software errors

• Launch failure of Mariner I (1962)

• Destruction of French satellite (1988)

• Problems with Space Shuttle and Apollo missions

• STAR WARS (SDI) = funding billions of dollars for “correct” software development

• AT&T blockages (error in recovery-recognition software) (1990)

• SS7 (signaling system) protocol implementation: untested patch with a mistyped character (1997)

• Therac-25 (overdoses of medical radiation, thousands of rads in excess of the prescribed dosage)


Experiences with Current Software

• Many computer crashes are due to software

• Although recent data shows HW transients and design errors on the increase

• Even though one expects software to be correct, it never is

• Mature software exhibits fairly constant failure frequency

• Number of failures is correlated with

– Execution time

– Code density

– Software timing, synchronization points


Experiences with Current Software (cont.)

Parameter                       Symbol   Value
Defect Detection Time Constant  s        17.2 Weeks
Defect Repair Time Constant     t        4.7 Weeks
Code Delivery                            589810 Lines
Initial Error Density                    0.00387 Defects per Line
Defect Reintroduction Rate               33 Percent
Deployment Time                 T        Week 100
Estimated Remaining Defects     ERDT     664 Defects
Estimated Current Defects       ECDT     445 Defects
Testing Process Quality         TPQT     90 Percent
Testing Process Efficiency      TPET     60 Percent

Key parameters and variables (with defect reintroduction)

[Figure: Cumulative test failures (t). Vertical axis: FAILRATE / cumulative test failures, 0 to 300. Horizontal axis: number of tests t, 1000 to 5000.]


Difficulties

• Improvements in software development methodologies reduce the incidence of faults, yielding fault avoidance

• Need for test and verification

• Formal verification techniques, such as proof of correctness, can be applied to rather small programs

• Potential exists for faulty translation of user requirements

• Conventional testing is hit-or-miss. “Program testing can show the presence of bugs but never show their absence” (Dijkstra, 1972).

• There is a lack of good fault models.


Approaches to Software Fault Tolerance

• ROBUSTNESS: The extent to which software continues to operate despite introduction of invalid inputs.

Example: 1. Check input data
            => ask for new input
            => use default value and raise flag
         2. Self-checking software

• FAULT CONTAINMENT: Faults in one module should not affect other modules.

Example: Reasonableness checks
         Watchdog timers
         Overflow/divide-by-zero detection
         Assertion checking

• FAULT TOLERANCE: Provides uninterrupted operation in presence of program fault through multiple implementations of a given function
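The robustness example above (check input data, then fall back to a default value and raise a flag) can be sketched as follows. This is a minimal illustration only: the function name `read_speed`, the default value, and the valid range are hypothetical and do not come from the slides.

```python
DEFAULT_SPEED = 0.0  # hypothetical safe default, not from the slides


def read_speed(raw, lo=0.0, hi=300.0):
    """Return (value, flag); flag is True when the default was substituted."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # Input is not a number at all: use the default and raise the flag.
        return DEFAULT_SPEED, True
    # Reasonableness check: NaN and out-of-range values are rejected.
    if not (lo <= value <= hi):
        return DEFAULT_SPEED, True
    return value, False


if __name__ == "__main__":
    for raw in ["42.5", "abc", 1e9, None]:
        print(repr(raw), "->", read_speed(raw))
```

The caller decides what to do with the flag (log it, ask for new input, or degrade service); the check itself only guarantees that an invalid value never propagates.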


Recovery Blocks/N-Version Programming

• Two relevant observations from 1830:

– Anon.: “The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers; and this check is rendered still more decisive if they make their computations by different methods.”

– Charles Babbage: “When the formula to be computed is very complicated, it may be algebraically arranged for computation in two or more totally distinct ways, and two or more sets of cards may be made. If the same constants are now employed with each set, and if under these circumstances the results agree, we may be quite secure of the accuracy of them all.”


N-Version Programming Basic Model

[Figure: The N-version software (NVS) model with n = 3. The j-th N-version software unit runs Version 1, Version 2, and Version 3 in Lane A, Lane B, and Lane C inside the Executive Environment (EE). The EE executive support functions and the decision algorithm combine the three lane outputs into the consensus results. The EE and decision algorithm are the software unit enhancements for fault-tolerant execution.]
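A minimal sketch of the NVS model, assuming the decision algorithm is a strict majority vote over exactly equal results. The slides do not specify the decision algorithm, so the voter, the function name `n_version_execute`, and the three toy versions are illustrative assumptions.

```python
from collections import Counter


def n_version_execute(versions, x):
    """Run every version on input x and return the majority result.

    Raises RuntimeError when no result is backed by a strict majority
    of the versions (a crashed version contributes no vote).
    """
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass  # a failed lane simply does not vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(versions) // 2:
            return value
    raise RuntimeError("no consensus among versions")


# Three hypothetical versions of "square x"; the third contains a fault.
def v1(x):
    return x * x


def v2(x):
    return x ** 2


def v3(x):
    return x * x + 1


if __name__ == "__main__":
    print(n_version_execute([v1, v2, v3], 4))
```

Exact-equality voting only works for results that are hashable and comparable bit for bit; floating-point outputs would need a tolerance-based comparison instead.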


Recovery Blocks Basic Model

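[Figure: The Recovery Block (RB) model. The j-th recovery block software unit runs inside the Execution Environment (EE). The input passes to the primary alternate; the acceptance test checks its result. On "Yes" the accepted results are delivered; on "No" the next alternate (Alternate 1, Alternate 2, ...) is taken. The execution support functions, the recovery cache, and the acceptance test are the software unit enhancements for fault-tolerant execution.]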
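The control flow of the RB model can be sketched as below. This is a simplified illustration: a full deep copy of the state stands in for the recovery cache, and the names `recovery_block`, `faulty_sort`, `simple_sort`, and `accept` are hypothetical.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in order on `state` (a dict) until one is accepted."""
    checkpoint = copy.deepcopy(state)  # stands in for the recovery cache
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):
                return state
        except Exception:
            pass  # a crash is treated like a failed acceptance test
        # Roll back to the state on entry before taking the next alternate.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("all alternates failed the acceptance test")


def faulty_sort(state):
    # Primary alternate with a deliberate bug: drops the largest item.
    state["data"] = sorted(state["data"])[:-1]


def simple_sort(state):
    state["data"] = sorted(state["data"])


def accept(state):
    # Cheap check, not a proof of correctness: length kept, order ascending.
    d = state["data"]
    return len(d) == state["n"] and all(a <= b for a, b in zip(d, d[1:]))


if __name__ == "__main__":
    result = recovery_block({"data": [3, 1, 2], "n": 3},
                            [faulty_sort, simple_sort], accept)
    print(result["data"])
```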


Execution Models for Software Fault-Tolerance Approaches

[Figure: Execution model for recovery blocks. States and transitions: start software execution; primary alternate execution; end primary alternate execution; acceptance test execution; acceptance test passed (end software execution) or acceptance test not passed; alternate selection; alternate #i selected, i = 2, ..., N; alternate #i execution; end alternate #i execution (back to acceptance test); no alternate any more available (failed software).]

[Figure: Execution model for N-version programming. States and transitions: version #1, version #2, ..., version #N execution in parallel; end execution of each version; gathering versions' results; start decision algorithm execution; decision algorithm execution; acceptable result provided (end software execution) or no acceptable result provided (failed software).]


Execution Models for Software Fault-Tolerance Approaches (cont.)

[Figure: Execution model for N self-checking programming. Self-checking components #1, #2, ..., #N execute in parallel; each either provides an acceptable result or provides no acceptable result. If a result is selected, software execution ends; if no result is selected, the software has failed.]


Software Fault-Tolerance Approaches and Their Equivalent Hardware Counterparts

• RB is equivalent to the stand-by sparing (of passive dynamic redundancy) in HW fault-tolerant architectures

• NVP is equivalent to N-modular redundancy (static redundancy) in HW fault-tolerant architectures

• NSCP is equivalent to active dynamic redundancy

– A self-checking component results either from:

• The association of an acceptance test to a version

• The association of two variants with a comparison algorithm

– Fault-tolerance is provided by the parallel execution of N ≥ 2 self-checking components
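A minimal sketch of NSCP using the first construction above: each self-checking component is a version paired with an acceptance test. The selection rule (deliver the first self-validated result) and all names are illustrative assumptions; the components run sequentially here, although they are conceptually parallel.

```python
import math


def make_self_checking(version, acceptance_test):
    """Pair a version with an acceptance test; the component returns (ok, result)."""
    def component(x):
        try:
            result = version(x)
        except Exception:
            return False, None
        return acceptance_test(x, result), result
    return component


def nscp_execute(components, x):
    """Run all components and deliver the first result whose self-check passed."""
    outcomes = [component(x) for component in components]
    for ok, result in outcomes:
        if ok:
            return result
    raise RuntimeError("no self-checking component delivered an acceptable result")


def accept_isqrt(x, r):
    # r is the integer square root of x iff r*r <= x < (r+1)*(r+1).
    return isinstance(r, int) and r >= 0 and r * r <= x < (r + 1) ** 2


# One correct and one deliberately faulty hypothetical version.
good = make_self_checking(math.isqrt, accept_isqrt)
bad = make_self_checking(lambda x: x // 2, accept_isqrt)

if __name__ == "__main__":
    print(nscp_execute([bad, good], 16))
```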


Concepts of N-Version Programming

• N ≥ 2 versions of functionally equivalent programs

• “Independent” generations of programs carried out by N groups of individuals who do not talk to each other with respect to programming process (different algorithms, different programming languages, translation)

• Initial specification is written formally in some formal specification language

– States unambiguously the functional requirements

– Leaves widest possible choice of implementation

• By making the development process diverse it is hoped that the versions will contain diverse faults

• The inventors of NVP emphasized that:

– “the definition of NVP has never postulated an assumption of independence and that NVP is a rigorous process of software development”


An Assumption of Independence in N-Version Programming ?

• Do the N versions of a program fail independently (similar to hardware)? Are faults unrelated?

Does Prob(failure of N-version system) = [Prob(failure of one version)]^N ?

– If so, then the system reliability can be very high

• Why might such an assumption be false?

– People make same mistakes, e.g. incorrect treatment of boundary conditions

– Some parts of a problem more difficult than others

• statistics show similarity in programmer’s view of “difficult” regions


Observation from Experiments

• Assumption of independence of failures of versions DOES NOT hold

• This does not mean N-version programming is not useful

• The reliability of the system will not be as high as in the case when the faults in different versions are independent

• Example: PODS (Project on Diverse Software)

– All faults were caused by omissions and ambiguities in the requirement specifications

– Two common faults were found in two versions

– Three different versions of the software, with failure rates of 1.5 × 10^-6, 0.8 × 10^-3, and 0.8 × 10^-3, resulted in a failure rate of 0.8 × 10^-3 after majority voting

– The common/coincident faults could not be excluded by majority voting
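To see how far the PODS result is from the independence assumption, the sketch below computes what 2-out-of-3 majority voting would yield if the three versions failed independently. It assumes, for illustration only, that the quoted failure rates can be treated as per-demand failure probabilities; the slides do not state the units.

```python
def majority_failure_prob(p1, p2, p3):
    """P(at least two of three versions fail), assuming independent failures."""
    return (p1 * p2 * (1 - p3)
            + p1 * (1 - p2) * p3
            + (1 - p1) * p2 * p3
            + p1 * p2 * p3)


# Figures quoted for PODS on the slide.
predicted = majority_failure_prob(1.5e-6, 0.8e-3, 0.8e-3)
observed = 0.8e-3

if __name__ == "__main__":
    print(f"predicted under independence: {predicted:.2e}")
    print(f"observed after voting:        {observed:.2e}")
    print(f"observed / predicted:         {observed / predicted:.0f}x")
```

Under these assumptions the observed value is more than a thousand times the independent prediction, which is consistent with the two less reliable versions sharing common faults that the voter cannot mask.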


Limitation of N-Version Programming

• All N -versions originate from the same initial specifications whose correctness, completeness, and unambiguity should be assumed

– Use formal correctness proofs on specs, rather than proofs on implementations

– Exhaustive validation

• Based on an assumption that software faults are distinguishable:

– faults that will cause disagreement between versions at specified voting points might be a result of independent programming efforts to remove identical software defects


Concepts of Recovery Blocks

• Characteristics:

– Incorporates general solution to the problem of switching to spare

– Explicitly structures a software system so that extra software for spares and error detection does not reduce system reliability

– First considered for a single sequential process; later extended to

• Multiple processes within one system

• Multiple processes in multiple systems => distributed recovery blocks

• Can view the progress of a program as a sequence of basic operations, i.e., assignments to stored variables

• Structured program has BLOCKS of code to simplify understanding of the functional description

• Choose blocks as units for error detection and recovery.


Alternates

• Primary alternate is the one that is to be used normally

• Other alternates attempt less desirable options

• One source of alternates is earlier release of primary alternates

• Gracefully degraded alternates


Acceptance Tests

• Function: ensure the operation of recovery blocks is satisfactory

• Should access variables in the program that are NOT local to the recovery block, since local variables cannot have any effect after exit. Also, different alternates use different local variables.

• Need not check for absolute “correctness” - cost/complexity trade-off

• Run-time overheads should be LOW

• NO RESIDUAL EFFECTS should be present, since variables, if updated, might result in passing of successive alternates


Restoration of System State

• Restoring system state is automatic

• Taking a copy of entire system state on entry to each recovery block is too costly

• Use Recovery Caches or “Recursive” Caches

• When a process is to be backed up, it is to a state just before entry to primary alternate

• Only NONLOCAL variables that have been MODIFIED have to be reset
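The recovery cache idea (save the prior value of a nonlocal variable only the first time it is modified inside the block) can be sketched as follows. The class `RecoveryCache` and its methods are hypothetical names, and a Python dict stands in for the nonlocal variables.

```python
class RecoveryCache:
    """Records the prior value of each nonlocal variable on its first write."""

    _MISSING = object()  # marks variables that did not exist on entry

    def __init__(self, store):
        self.store = store  # dict holding the nonlocal variables
        self.saved = {}     # name -> value held on entry to the block

    def write(self, name, value):
        # Only the first modification saves the old value; later writes
        # to the same variable cost nothing extra.
        if name not in self.saved:
            self.saved[name] = self.store.get(name, self._MISSING)
        self.store[name] = value

    def restore(self):
        """Back up to the state just before entry to the primary alternate."""
        for name, old in self.saved.items():
            if old is self._MISSING:
                self.store.pop(name, None)
            else:
                self.store[name] = old
        self.saved.clear()


if __name__ == "__main__":
    variables = {"a": 1, "b": 2}
    cache = RecoveryCache(variables)
    cache.write("a", 10)
    cache.write("a", 20)
    cache.write("c", 3)
    cache.restore()
    print(variables)
```

Variable `b` is never copied because it is never modified, which is the saving over checkpointing the entire system state.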


Process Conversations

• A systematic methodology for extending recovery blocks across processes by taking process interactions into consideration (considers time/space)


Process Conversations (cont.)

• Recovery block spanning two or more processes is called a conversation

• Within a conversation, processes communicate among themselves, NOT with others

• Operations of a conversation

– Within a conversation, communication is only among participants, not external

– On entry, a process establishes a checkpoint

– If an error is detected by any process, then all processes restore their checkpoints

– Next, ALL processes execute their next available alternate

– All processes leave the conversation together (perform their acceptance tests just prior to leaving)

• At the end of the conversation, ALL processes must satisfy their respective acceptance tests, and none may proceed otherwise
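The conversation rules above can be sketched as a single coordinator loop. This is a deliberately simplified model: inter-process communication is not simulated, every participant is assumed to have the same number of alternates, and the function `run_conversation` and the dictionary layout are hypothetical.

```python
import copy


def run_conversation(processes):
    """Each process is a dict with keys 'state', 'alternates', and 'accept'.

    Returns True if some round of alternates satisfies every acceptance test.
    """
    # On entry, every participant establishes a checkpoint.
    checkpoints = [copy.deepcopy(p["state"]) for p in processes]
    rounds = min(len(p["alternates"]) for p in processes)
    for i in range(rounds):
        for p in processes:
            p["alternates"][i](p["state"])
        # All processes must pass before any of them may leave.
        if all(p["accept"](p["state"]) for p in processes):
            return True
        # One failure forces ALL participants back to their checkpoints.
        for p, saved in zip(processes, checkpoints):
            p["state"].clear()
            p["state"].update(copy.deepcopy(saved))
    return False


def make_process(values):
    """Hypothetical process whose i-th alternate stores values[i] in its state."""
    return {
        "state": {"v": 0},
        "alternates": [lambda s, v=v: s.update(v=v) for v in values],
        "accept": lambda s: s["v"] > 0,
    }


if __name__ == "__main__":
    p1 = make_process([-1, 1])  # primary alternate produces a bad value
    p2 = make_process([5, 6])   # primary is fine, but p1's failure rolls it back
    print(run_conversation([p1, p2]), p1["state"], p2["state"])
```

Note that the second process is rolled back even though its own acceptance test passed, which is the price of keeping the participants mutually consistent.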


Nested Conversations

[Figure: nested conversations among processes P1, P2, and P3, showing inter-process communication, conversation boundaries, checkpoints, and acceptance tests.]


Comparison of Recovery Blocks vs. N-Version Programming

• Advantages of Recovery Block

– Most software systems evolve by replacement of some modules by new ones - can be used as alternates

– Nice hierarchical design - structured approach

• Disadvantages of Recovery Block

– System state must be saved before entry to recovery block -- excessive storage

– Difficult to handle multiple processes -- might have domino effect

– Difficult to undo effects in real-time systems

– Effectiveness of acceptance test

• Higher coverage is more complex

• Lack of formal method to check


Comparison of Recovery Blocks vs. N-Version Programming (cont.)

• Advantages of N-Version Programming

– Immediate masking of software faults -- no delay in operation

– Self-checking (acceptance tests) not required

– Conventional fault-tolerant systems (HW and SW) already have redundant hardware, e.g., TMR, so it is easier to include N-version software on the redundant hardware

• Disadvantages of N-Version Programming

– How to get N-versions?

• Impose design diversity, since randomness does not give uncorrelated software faults

– Extremely dependent on input specifications (formal correctness proofs…)


High-Availability System Design
IBM Mainframe 30xx


IBM 30xx Simplified System Model

[Figure: simplified IBM 30xx system model. Components: expanded storage, central memory, system controller, CPUs, processor controller, channel control, channel adapters and servers, power distribution and cooling.]


Fault Isolation Using Hardware Checkers

• Error checker placement determined by Fault Isolation Domains (FID)

• Checkers define the boundary of fault containment

• If the checkers detect all faults, fault isolation reduces to: stop the processor, identify the triggered checker (i.e., the relevant FID), and possibly the Field Replaceable Unit (FRU)


Fault Isolation Using Hardware Checkers (cont.)

[Figure: memory array 1, register A, drivers, cable, memory array 2, register B, register C, and decoder, monitored by checker 1, checker 2, and checker 3 and packaged into FRU 1 through FRU 5. Red outlines = Fault Isolation Domains; blue outlines = Field Replaceable Units.]

If checker 2 is triggered and register C is the input to register B, the implicated set of FRUs is {3, 4}


Mapping of Fault Isolation Domains to Field Replaceable Units

Function         FID   FRU   Syndrome
Memory array 1   1     1     C1
Register A       1     1     C1
Checker 1        1     1     C1
Drivers          2     1     C2
Cable            2     2     C2
Memory array 2   2     3     C2
Register B       2     3     C2
Checker 2        2     3     C2
Register C       2     4     C2
Decoder          3     5     C3
Checker 3        3     5     C3
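The mapping table can be used directly as a lookup from a checker syndrome to the candidate FRUs. The sketch below copies the rows from the table; the function name `implicated_frus` is hypothetical. It returns every FRU that intersects the triggered FID, so narrowing the set further (as in the checker 2 example on the previous slide) would need structural knowledge that the table alone does not carry.

```python
# (function, FID, FRU, syndrome) rows copied from the slide's mapping table.
FID_TABLE = [
    ("Memory array 1", 1, 1, "C1"),
    ("Register A", 1, 1, "C1"),
    ("Checker 1", 1, 1, "C1"),
    ("Drivers", 2, 1, "C2"),
    ("Cable", 2, 2, "C2"),
    ("Memory array 2", 2, 3, "C2"),
    ("Register B", 2, 3, "C2"),
    ("Checker 2", 2, 3, "C2"),
    ("Register C", 2, 4, "C2"),
    ("Decoder", 3, 5, "C3"),
    ("Checker 3", 3, 5, "C3"),
]


def implicated_frus(syndrome):
    """Return the sorted FRU numbers that contain any function in the triggered FID."""
    return sorted({fru for _, _, fru, syn in FID_TABLE if syn == syndrome})


if __name__ == "__main__":
    for syndrome in ("C1", "C2", "C3"):
        print(syndrome, "->", implicated_frus(syndrome))
```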


IBM 30XX Data Path Overview

[Figure: data path of the IBM 30XX with the checking applied on each element. Expanded storage and central storage, each with a storage controller with hardware-assisted memory tester (ECC); system controller; I/O processor; primary data stager and secondary data stagers; processor controller; channel adapter (LSSG); channel server; channel control element; CPU with cache, instruction fetch/decode, instruction execution, vector execution, and control storage parity.]

P = parity

LSSG = Logic Support Station Group


Hardware-Based Retry
Instruction Execution

[Figure: instruction retry flowchart. Operands go into retry buffers; the instruction executes; a test asks "error detected?". If no, execution continues. If yes, execution is frozen (stop on error and restore operands) and a second test asks "retry permitted?". If yes, instructions/data are fetched from the retry buffers and execution restarts (instruction retry). If no (no retry or threshold crossed), the OS is signaled for SW recovery. The instruction and execution elements communicate back to the processor controller through the LSS.]

Errors are detected by parity checks on register contents and on data buses and by pattern validity checks in control logic circuits.
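The retry flow in the figure can be mimicked in software as a rough analogy. In the sketch below, `HardwareError`, `execute_with_retry`, and the default threshold of 3 are illustrative assumptions; the real mechanism is implemented in hardware and its threshold is not given on the slide.

```python
class HardwareError(Exception):
    """Hypothetical stand-in for an error flagged by a hardware checker."""


def execute_with_retry(instruction, operands, retry_threshold=3):
    """Execute `instruction`, retrying from saved operands when an error is detected."""
    retry_buffer = tuple(operands)  # operands are saved before execution
    for _attempt in range(1 + retry_threshold):
        try:
            # Every (re)start reads its operands from the retry buffer.
            return instruction(*retry_buffer)
        except HardwareError:
            continue  # freeze, restore operands, and retry
    # Threshold crossed: hand the problem to the next recovery level.
    raise RuntimeError("retry threshold crossed: signal OS for software recovery")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_add(a, b):
        # Simulates a transient fault on the first two attempts.
        calls["n"] += 1
        if calls["n"] <= 2:
            raise HardwareError("parity error")
        return a + b

    print(execute_with_retry(flaky_add, [2, 3]), "after", calls["n"], "attempts")
```

Retry only helps with transient faults; a permanent fault exhausts the threshold and must be escalated.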


Checkers in The Central Processor

• Byte parity on data path registers

• Parity checks on input/output of adders

• Parity on microstore

• Parity on microstore addresses

• Encoder/decoder checks

• Single-bit error detection in cache for data received from memory

• Additional illegal pattern checks


Levels of Error Recovery

[Figure: a machine check interruption triggers up to four recovery levels; each level is tried only when the previous one is unsuccessful.]

1. Functional recovery: perform instruction retry (if successful, system operation continues)

2. System recovery: terminate affected task and continue system operation (task terminated, system continues)

3. System-supported restart: restart system operation, stop for repair not required (system reloaded); notify operator

4. System repair: stop, repair, and restart (external repair)
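The four levels form an escalation ladder, which can be sketched as below. The callback `try_level` and the function `recover` are hypothetical names; the sketch only models the order of escalation, not what each level actually does.

```python
# Levels that the system attempts on its own, in order of increasing impact.
AUTOMATIC_LEVELS = [
    "functional recovery",       # 1: perform instruction retry
    "system recovery",           # 2: terminate affected task, system continues
    "system-supported restart",  # 3: restart system operation, no repair stop
]


def recover(try_level):
    """Escalate through the levels; `try_level(name)` returns True on success.

    Returns the name of the level that finally handled the error. Level 4
    (stop, repair, and restart) is the fallback when everything else fails.
    """
    for name in AUTOMATIC_LEVELS:
        if try_level(name):
            return name
    return "system repair"


if __name__ == "__main__":
    # Example: instruction retry fails, terminating the task succeeds.
    print(recover(lambda name: name == "system recovery"))
```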


Recovery Processing Overview Handling Hardware and Software Errors

[Figure: recovery processing overview. Boxes: control program; recovery termination manager; termination routines; retry routines; recovery routines; ABEND (abnormal termination).]


System Level Facilities for Error Detection and Recovery

• Installation error detection capability

• Tools to build “profiles” of system software modules and inspect correct usage of system resources.

• Software facilities to detect the occurrences of selected events, e.g., appendages allow user control of I/O; SLIP (serviceability level indication processing) aids in error detection and diagnosis (e.g., access to traps that cause a program interruption).

• User defines detection mechanisms to detect programmer-defined exceptions, e.g., incorrect address or attempting privileged instructions.

• The operator detects evident error conditions, e.g., loop conditions, endless wait states

• The data management and supervisor routines ensure valid data is processed and non-conflicting requests are made
