Copyright 2004 Koren & Krishna ECE655/SWFT .1
Fall 2006
UNIVERSITY OF MASSACHUSETTS, Dept. of Electrical & Computer Engineering
Fault Tolerant Computing, ECE 655
Software Fault Tolerance I
Causes of Software Errors
Designing and writing software is very difficult - there are both essential and accidental causes of software errors
Essential difficulties:
Understanding a complex application and operating environment
Having to construct a structure comprising an extremely large number of states, with very complex state-transition rules
Software is subject to frequent modification - new features are added to adapt to changing application needs
Hardware and operating-system platforms can change over time - the software has to adjust appropriately
Software is often used to paper over incompatibilities between interacting system components
Accidental difficulties - human mistakes
Cost considerations - use of Commercial Off-the-Shelf (COTS) software, which is not designed for high-reliability applications
Techniques to Reduce Error Rate
Software almost inevitably contains defects
Do everything possible to reduce the fault rate
Use fault-tolerance techniques to deal with software faults
One approach - a formal proof that the software is correct - not practical for large pieces of software
Acceptance tests - used in wrappers and in recovery blocks - important fault-tolerant mechanisms
Example: If a thermometer reads -40ºC on a midsummer day - suspect malfunction
Timing Checks: Set a watchdog timer to the expected run time
If timer goes off, assume a hardware or software failure - can be used in parallel with other acceptance tests
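The timing check above can be sketched in Python. This is an illustrative sketch, not production code: the function name, the 1.5x slack factor, and the use of `threading.Timer` are all assumptions, and this version flags an overrun only after the task returns (a production watchdog would run the task in a separate process so a hung task can be killed).

```python
import threading

def run_with_watchdog(task, expected_runtime, slack=1.5):
    # Arm a timer for slack * expected_runtime seconds; if it fires,
    # flag a possible hardware or software failure.
    timed_out = threading.Event()
    timer = threading.Timer(slack * expected_runtime, timed_out.set)
    timer.start()
    try:
        result = task()
    finally:
        timer.cancel()
    if timed_out.is_set():
        raise TimeoutError("watchdog expired: suspect a failure")
    return result

# A task that finishes well within its time budget passes the check
print(run_with_watchdog(lambda: sum(range(1000)), expected_runtime=1.0))  # 499500
```

As the slide notes, such a timer is typically used in parallel with other acceptance tests, since it catches only timing anomalies.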
Acceptance tests
Verification of Output: sometimes an acceptance test suggests itself naturally
Examples: sorting; square root; factorization of large numbers; solution of equations
Probabilistic checks - Example: multiply n x n integer matrices, C = A x B
The naive approach to verifying C takes O(n³) time
Freivalds' randomized check: pick at random an n-element vector of integers R; compute M1 = A(BR) and M2 = CR
If M1 ≠ M2 - an error has occurred
If M1 = M2 - high probability of correctness
May repeat by picking another vector
Complexity - O(m n²), where m is the number of checks
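A minimal sketch of this randomized matrix-product check, using random 0/1 vectors (a common choice; the slide leaves the distribution of R open):

```python
import random

def freivalds_check(A, B, C, trials=10):
    """Probabilistic check that C == A x B for n x n integer matrices.
    Each trial costs O(n^2) multiplications; a wrong product escapes
    detection with probability at most 2**-trials for random 0/1 vectors."""
    n = len(A)
    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]
        br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]    # B*r
        abr = [sum(A[i][j] * br[j] for j in range(n)) for i in range(n)]  # A*(B*r) = M1
        cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]    # C*r = M2
        if abr != cr:
            return False   # a mismatch proves an error occurred
    return True            # equal on every trial: correct with high probability

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]          # the true product A x B
print(freivalds_check(A, B, C))   # True
```

Each trial costs only two matrix-vector products plus one more for C, hence the O(m n²) total cost quoted on the slide.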
Range Checks
Set acceptable bounds for the output; if the output falls outside the bounds - declare a fault
Bounds - either preset or a simple function of the inputs (the probability of the test software itself being faulty should be low)
Example: a remote-sensing satellite taking thermal imagery of Earth - set bounds on the temperature range
Further, use spatial correlations - flag excessive differences between temperatures in adjacent areas
Balance sensitivity and specificity:
Sensitivity - conditional probability that the test fails, given that the output is erroneous
Specificity - conditional probability that there is indeed an error, given that the acceptance test flags one
Narrower bounds - increase sensitivity but also increase the false-alarm rate and decrease specificity
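The two range-check ideas above can be sketched as follows. The bound values and grid data are invented for illustration; real bounds would come from the application specification.

```python
def range_check(value, low, high):
    # Acceptance test: the output must fall within preset bounds
    return low <= value <= high

def spatial_check(grid, max_delta):
    # Acceptance test using spatial correlation: flag excessive
    # differences between adjacent cells of a 2-D temperature grid
    rows, cols = len(grid), len(grid[0])
    for i in range(rows):
        for j in range(cols):
            for di, dj in ((0, 1), (1, 0)):   # right and down neighbors
                ni, nj = i + di, j + dj
                if ni < rows and nj < cols and \
                        abs(grid[i][j] - grid[ni][nj]) > max_delta:
                    return False
    return True

LOW, HIGH = -90.0, 60.0   # hypothetical bounds on surface temperature, deg C
print(range_check(25.3, LOW, HIGH))     # True: plausible reading
print(range_check(-120.0, LOW, HIGH))   # False: declare a fault
print(spatial_check([[10.1, 10.4], [10.2, 10.3]], max_delta=2.0))  # True
```

Tightening `max_delta` or the `(LOW, HIGH)` bounds raises sensitivity at the cost of more false alarms, as the slide notes.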
Wrappers
Robustness-enhancing interfaces for software modules
Examples: operating system kernel, middleware, applications software
Inputs are intercepted by the wrapper, which passes them or signals an exception
Similarly, outputs are filtered by the wrapper
Example: using COTS software for high-reliability applications
COTS components are wrapped to reduce their failure rate - prevent inputs (1) outside the specified range or (2) known to cause failures
Similarly, outputs pass an acceptance test
[Figure: a wrapper surrounding the wrapped software, intercepting its inputs and outputs]
Example 1: Dealing with Buffer Overflow
The C language does not perform range checking for arrays - this can cause accidental or malicious damage
Writing a large string into a small buffer causes a buffer overflow - memory outside the buffer is overwritten
Accidentally - this causes a memory fault
Maliciously - it can overwrite portions of the program stack or heap - a well-known hacking technique
Stack-smashing attack: a process with root privileges stores its return address on the stack; the malicious input overwrites this return address, and control flow is redirected to a memory location where the attacker has stored the attacking code
The attacking code now has root privileges and can destroy the system
Wrapper to Protect against Buffer Overflow
All malloc calls from the wrapped program are intercepted by wrapper
Wrapper keeps track of the starting position of allocated memory and size
Writes are intercepted, to verify that they fall within allocated bounds
If not, wrapper does not allow the write to proceed and instead flags an overflow error
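The bookkeeping such a wrapper performs can be sketched in Python (the real wrapper would sit below a C program and intercept actual `malloc` calls; the class and method names here are invented for illustration):

```python
class BoundsCheckingWrapper:
    """Sketch of a wrapper that records each allocation's start address
    and size, and refuses writes that fall outside allocated bounds."""

    def __init__(self):
        self.allocations = {}   # start address -> size
        self.next_addr = 0

    def malloc(self, size):
        # Intercepted allocation: record the starting position and size
        addr = self.next_addr
        self.allocations[addr] = size
        self.next_addr += size
        return addr

    def write(self, addr, offset, length):
        # Intercepted write: find the allocation containing addr,
        # then verify the write stays within its bounds
        for start, size in self.allocations.items():
            if start <= addr < start + size:
                if addr - start + offset + length <= size:
                    return True   # write allowed to proceed
                raise MemoryError("buffer overflow blocked by wrapper")
        raise MemoryError("write to unallocated memory blocked")

w = BoundsCheckingWrapper()
buf = w.malloc(16)
print(w.write(buf, 0, 16))   # True: fits exactly
try:
    w.write(buf, 0, 17)      # one byte past the end
except MemoryError as e:
    print(e)                 # buffer overflow blocked by wrapper
```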
Example 2: Checking correctness of scheduler
Wrapper around task scheduler in a fault-tolerant, real-time system
Such schedulers do not use round-robin scheduling; may use Earliest Deadline First (EDF) - execute task with earliest deadline among tasks ready to run
Subject to preemptability constraints (i.e., tasks in certain parts of execution may not be preemptable)
A wrapper can verify that scheduling algorithm is being correctly executed, i.e., the ready task with earliest deadline was picked, and that any arriving task with an earlier deadline preempts the executing task (assuming the latter is preemptable)
Wrapper needs information about the tasks which are ready to run and their deadlines, and also whether the currently executing task is preemptable
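A sketch of the check such a wrapper might perform, assuming (as an illustration, not from the slides) that tasks are exposed to the wrapper as (name, deadline) pairs and the running task additionally carries a preemptability flag:

```python
def edf_choice_is_correct(ready_tasks, chosen, running=None):
    """Verify the scheduler picked the ready task with the earliest
    deadline, honoring preemptability.  ready_tasks: list of
    (name, deadline); running: (name, deadline, preemptable) or None."""
    if running is not None:
        name, deadline, preemptable = running
        if not preemptable:
            return chosen == name   # non-preemptable task must continue
        candidates = ready_tasks + [(name, deadline)]
    else:
        candidates = ready_tasks
    earliest = min(candidates, key=lambda t: t[1])
    return chosen == earliest[0]

ready = [("t1", 50), ("t2", 20), ("t3", 35)]
print(edf_choice_is_correct(ready, "t2"))                     # True: earliest deadline
print(edf_choice_is_correct(ready, "t2", ("t0", 10, False)))  # False: t0 must keep running
```

The wrapper only verifies the scheduler's decision; it does not schedule anything itself, which keeps it simple enough to trust.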
Example 3: Using software with known bugs
Found through testing or field reports that software fails for a certain set of inputs, S
Wrapper to intercept the inputs to that software and check if inputs are in set S
If they are not - forward the input to the software module for execution
If they are - return a suitable exception to the system
Alternatively - redirect the input to alternative, custom-written code that handles these inputs
Example 4: Using a wrapper to check for correct output
Includes an acceptance test to filter output If the output passes test, it is forwarded outside If not, exception is raised, and system has to
deal with a suspicious output
Factors in Successful Wrapping
(a) Quality of acceptance tests: application-dependent - has a direct impact on the ability of the wrapper to stop faulty outputs
(b) Availability of necessary information from the wrapped component: if the wrapped component is a "black box" (i.e., the wrapper observes only the response to a given input), the wrapper's checks will be somewhat limited
Example: a scheduler wrapper is impossible without information about status of tasks waiting to run
(c) Extent to which the wrapped software module has been tested: extensive testing identifies inputs for which the software fails
Software Rejuvenation
Example: rebooting a PC
As a process executes, it acquires memory and file locks without properly releasing them; memory space also tends to become increasingly fragmented
The process can become faulty and stop executing
To head this off, proactively halt the process, clean up its internal state, and then restart it
Rejuvenation can be time-based or prediction-based
Time-based rejuvenation - performed periodically
Rejuvenation period - balance benefits against cost
Time-Based Rejuvenation - Analytical Model
N(t) - expected number of failures in the interval [0,t]
K_f - cost of a failure
K_r - cost of each rejuvenation
P - rejuvenation period
Adding the costs due to rejuvenation and failure, the overall expected cost over one rejuvenation period is C = N(P) K_f + K_r
Rate of rejuvenation cost: C_rate = [ N(P) K_f + K_r ] / P
Analytical Model - Examples
(1) N(P) = λP - constant failure rate λ over time
C_rate = λ K_f + K_r / P
Optimal P is P* = ∞ - no software rejuvenation
Rejuvenation heads off the increased failure rate as the software ages - there is no aging in this case
(2) N(P) = λP²
C_rate = λ P K_f + K_r / P
Optimal P: P* = sqrt( K_r / (λ K_f) )
(3) N(P) = λPⁿ (n > 1)
C_rate = λ P^(n-1) K_f + K_r / P
Optimal P: P* = [ K_r / ((n-1) λ K_f) ]^(1/n)
Conclusions:
P* increases with K_r / K_f; the rate of increase goes down with n
P* decreases with λ; the rate of decrease goes down with n
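The closed-form optimum for case (3) can be checked numerically; the parameter values below are invented for illustration.

```python
# Optimal rejuvenation period for N(P) = lam * P**n (case (3) above)
lam, n, Kf, Kr = 0.001, 2, 100.0, 5.0   # illustrative values only

def cost_rate(P):
    # C_rate = lam * P**(n-1) * K_f + K_r / P
    return lam * P**(n - 1) * Kf + Kr / P

# Closed-form optimum: P* = [K_r / ((n-1) * lam * K_f)]**(1/n)
P_star = (Kr / ((n - 1) * lam * Kf)) ** (1 / n)
print(round(P_star, 3))                              # 7.071 for these values
# The cost rate at P* is no larger than at nearby periods
print(cost_rate(P_star) <= cost_rate(P_star * 0.9))  # True
print(cost_rate(P_star) <= cost_rate(P_star * 1.1))  # True
```

Setting the derivative of C_rate to zero gives (n-1) λ K_f P^(n-2) = K_r / P², i.e. Pⁿ = K_r / ((n-1) λ K_f), which is the formula used above.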
Optimal Rejuvenation Period
Estimating the parameters K_r, K_f, and N(t) can be done experimentally - by running simulations on the software
The system can be made adaptive - start with default initial values and adjust them as more statistics are gathered
Prediction-Based Rejuvenation
Monitor system characteristics - the amount of memory allocated, the number of file locks held, etc. - to predict when the system will fail
Example - a process consumes memory at a certain rate; the system estimates when it will run out of memory, and rejuvenation can take place just before the predicted crash
The software that implements prediction-based rejuvenation must have access to enough state information to make such predictions
If prediction software is part of operating system - such information is easy to collect
If it is a package that runs atop operating system with no special privileges - constrained to using interfaces provided by operating system
Example - Unix Operating System
Unix provides the following utilities for collecting status information:
vmstat - provides information about processor utilization, memory and paging activity, traps, and I/O
iostat - outputs percentage CPU utilization at user and system levels, as well as a report on usage of each I/O device
netstat - indicates network connections, routing tables and a table of all network interfaces
nfsstat - provides information about network file server kernel statistics
Least Squares Failure Time Prediction
Once status information is collected - trends must be identified and failure time predicted
Example: tracking the allocation of memory to a process
Points (t_i, m_i) are given - m_i is the memory allocated at time t_i
We can do a least-squares fit of a polynomial f so that Σ_i [ f(t_i) - m_i ]² is minimized
A straight line f(t) = mt + c is the simplest such fit
This polynomial can be used to extrapolate into the future and predict when the process will run out of memory
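A sketch of the straight-line case, with synthetic memory measurements (the data and the 1024 MB limit are invented for illustration):

```python
# Fit f(t) = slope*t + intercept by least squares to (t_i, m_i) samples,
# then predict when allocated memory reaches a limit
ts = [0, 10, 20, 30, 40]          # sample times
mem = [100, 152, 198, 251, 300]   # MB allocated at each time

n = len(ts)
t_bar = sum(ts) / n
m_bar = sum(mem) / n
# Standard least-squares formulas for a line
slope = sum((t - t_bar) * (y - m_bar) for t, y in zip(ts, mem)) / \
        sum((t - t_bar) ** 2 for t in ts)
intercept = m_bar - slope * t_bar

LIMIT = 1024.0                    # MB available to the process
t_exhaust = (LIMIT - intercept) / slope   # extrapolate to the limit
print(round(slope, 3), round(intercept, 3))   # 4.99 100.4
print(round(t_exhaust, 1))                    # 185.1: predicted exhaustion time
```

Rejuvenation would then be scheduled shortly before `t_exhaust`.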
Weighted Least Squares
In a standard least-squares fit, each observed point (t_i, m_i) has equal weight in determining the fit
Variation: different weights - select weights w_1,…,w_p, then find the coefficients of f(t) that minimize Σ_i w_i [ f(t_i) - m_i ]²
Weights allow greater emphasis on certain points
Example: w_1 < w_2 < ... < w_p - more recent data influences the fit more than older data
Curve fitting is vulnerable to the impact of a few points that are unusually high or low - distorting the fit and the prediction
Techniques are available to make the fit more robust by reducing the impact of such points
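For the straight-line case, the weighted fit reduces to solving a 2x2 system of weighted normal equations. A sketch with invented data and increasing weights (so recent points count more):

```python
# Weighted least-squares line fit: minimize sum_i w_i*(slope*t_i + c - m_i)^2
ts = [0, 10, 20, 30, 40]
mem = [100, 150, 200, 260, 330]
w = [1, 2, 3, 4, 5]               # w_1 < ... < w_p: recent points count more

# Weighted sums for the normal equations
Swtt = sum(wi * t * t for wi, t in zip(w, ts))
Swt = sum(wi * t for wi, t in zip(w, ts))
Sw = sum(w)
Swty = sum(wi * t * y for wi, t, y in zip(w, ts, mem))
Swy = sum(wi * y for wi, y in zip(w, mem))

# Solve [[Swtt, Swt], [Swt, Sw]] * [slope, c] = [Swty, Swy] by Cramer's rule
det = Swtt * Sw - Swt * Swt
slope = (Swty * Sw - Swt * Swy) / det
intercept = (Swtt * Swy - Swt * Swty) / det
print(round(slope, 4), round(intercept, 2))
```

Setting all weights equal recovers the standard least-squares fit of the previous slide.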
Combined Approach
Prediction-based rejuvenation can be combined with a timer that is reset on each rejuvenation
If the timer goes off - rejuvenation is done regardless of when the next failure is predicted to happen
Rejuvenation Level
Rejuvenation is done at either the application or the node level - depending on where resources have degraded or become exhausted
Rejuvenation at the application level - suspending an individual application, cleaning up its state (e.g., by garbage collection, re-initialization of data structures, etc.), and then restarting
Rejuvenation at the node level - rebooting node - affects all applications running on that node
N-Version Programming
N independent teams of programmers develop software to the same specifications - the N versions are run in parallel and their output is voted on
If the programs are developed independently - it is very unlikely that they will fail on the same inputs
Assumption - failures are statistically independent; probability of failure of an individual version = p
Probability of exactly i failures out of N versions: P(N,i,p) = C(N,i) p^i (1-p)^(N-i)
Probability of system failure (N odd): Σ from i=(N+1)/2 to N of P(N,i,p)
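The majority-failure sum above is easy to compute directly; a minimal sketch:

```python
from math import comb

def nvp_failure_prob(N, p):
    # Probability that a majority of the N versions fail (N odd),
    # assuming independent failures with per-version probability p:
    # sum over i = (N+1)/2 .. N of C(N,i) * p**i * (1-p)**(N-i)
    return sum(comb(N, i) * p**i * (1 - p)**(N - i)
               for i in range((N + 1) // 2, N + 1))

print(nvp_failure_prob(3, 1e-4))   # about 3e-8: p**3 + 3*p**2*(1-p)
```

For N = 3 this reduces to p³ + 3p²(1-p), the expression used in the version-dependence example later in these slides.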
Consistent Comparison Problem
N-version programming is not simple to implement
Even if all versions are correct - reaching a consensus is difficult
Example: V_1,…,V_N - N independently written versions for computing a quantity x and comparing it to a constant C
x_i - the value of x computed by version V_i
B_i - a Boolean variable: B_i = (x_i > C)
The comparison with C is said to be consistent if B_i = B_j for all i,j = 1,...,N
Consistency Requirement
Example: a function of pressure and temperature, f(p,t), is calculated; action A1 is taken if f(p,t) ≤ C, and action A2 is taken if f(p,t) > C
Each version outputs the action to be taken - ideally all versions are consistent and output the same action
The versions are written independently and use different algorithms to compute f(p,t) - their values will differ slightly
C = 1.0000; N = 3
All three versions operate correctly and output the values 0.9999, 0.9998, 1.0001
B_1 = B_2 = FALSE - action is A1; B_3 = TRUE - action is A2
Not consistent, although all versions are correct
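The slide's numerical example can be replayed directly:

```python
# Three correct versions compute f(p,t) with slightly different
# algorithms; the values and threshold C are the ones on the slide
C = 1.0000
outputs = {"V1": 0.9999, "V2": 0.9998, "V3": 1.0001}

# Each version compares its own value with C to choose an action
decisions = {v: ("A2" if x > C else "A1") for v, x in outputs.items()}
print(decisions)                          # V1, V2 pick A1; V3 picks A2
print(len(set(decisions.values())) == 1)  # False: comparison is not consistent
```

All three outputs are within 0.0002 of each other, yet the voter sees a 2-to-1 split over which action to take.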
Consistency problem
Theorem: Except for the trivial algorithm that maps every input to the same number, there exists no algorithm which will guarantee that two n-bit integers that differ by less than 2^k will be mapped to the same m-bit output, where m + k ≤ n
Proof: We proceed by contradiction. Suppose such an algorithm does exist and does not map every input to the same number. Pick the case k = 1: 0 and 1 differ by less than 2^1, so both must be mapped to the same number, say L. Similarly, 1 and 2 must be mapped to L. Continuing, 3, 4, ... will all be mapped by this algorithm to L. This contradicts the hypothesis that the algorithm is not trivial.
Exercise: Show that a similar result holds for real numbers that differ even slightly from one another
Consensus Comparison Problem
If the versions do not agree - the disagreeing versions may or may not be faulty
Multiple failed versions can produce identical wrong outputs due to a correlated fault - the system will then select the wrong output
Can bypass the problem by having versions decide on a consensus value of the variable
Before checking if x>C, the versions agree on a value of x to use - adding the requirement: specify order of comparisons for multiple comparisons
This can reduce version diversity, increasing potential for correlated failures
Can also degrade performance - versions that complete early would have to wait
Confidence Signals
Each version calculates |x - C|; if |x - C| < ε for some given ε, the version announces low confidence in its output
The voter gives lower weight to low-confidence versions
Problem: if a functional version has |x - C| < ε, there is a high chance that this will also be true of other versions, whose outputs will then be devalued by the voter
The frequency of this problem arising, and length of time it lasts, depend on nature of application
E.g., in applications where calculation depends only on latest inputs and not on past values - consensus problem may occur infrequently and go away quickly
Version Dependence Correlated failures between versions can
increase overall failure probability by orders of magnitude
Example: N = 3, can tolerate up to one failed version for any input; p = 10^-4 - an incorrect output once every ten thousand runs
If the versions are stochastically independent - failure probability of the 3-version system = p³ + 3p²(1-p) ≈ 3 x 10^-8
Suppose instead there is stochastic dependence: one fault common to two versions, exercised once every million runs - causing system failure
The failure probability of the 3-version system then increases to over 10^-6 - more than 30 times the failure probability of the uncorrelated system
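The arithmetic behind these numbers, treating the shared fault as simply adding its rate to the system failure probability (the simplification implicit in the slide):

```python
# 3-version system: independent failures vs. one fault shared by two versions
p = 1e-4                                 # per-version failure probability
independent = p**3 + 3 * p**2 * (1 - p)  # majority-failure probability
shared_fault = 1e-6                      # common fault hit once per 10**6 runs
correlated = independent + shared_fault

print(f"{independent:.3e}")            # 3.000e-08
print(f"{correlated:.3e}")             # 1.030e-06
print(correlated / independent > 30)   # True: over 30x worse
```

A single rare correlated fault thus dominates the system's failure probability even though each version individually remains just as reliable.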
Version Correlation Model
The input space is divided into regions: an input from each region has a different probability of causing a version to fail
Example: an algorithm may have numerical instability in some input subspace - there, the failure rate is greater than average
Assumption: versions are stochastically independent within each given subspace S_i:
Prob{both V1 and V2 fail | input from S_i} = Prob{V1 fails | input from S_i} x Prob{V2 fails | input from S_i}
Unconditional probability of failure of a version:
Prob{V1 fails} = Σ_i Prob{V1 fails | input from S_i} x Prob{input from S_i}
Unconditional probability that both fail:
Prob{V1 and V2 fail} = Σ_i Prob{V1 and V2 fail | input from S_i} x Prob{input from S_i} ≠ Prob{V1 fails} x Prob{V2 fails} (in general)
Numerical Examples: Example 1
Two input subspaces S1, S2 - probability 0.5 each
Conditional failure probabilities:
Version   S1     S2
V1        0.01   0.001
V2        0.02   0.003
Unconditional failure probabilities:
P(V1 fails) = 0.01 x 0.5 + 0.001 x 0.5 = 0.0055
P(V2 fails) = 0.02 x 0.5 + 0.003 x 0.5 = 0.0115
If the versions were independent, the probability of both failing on the same input = 0.0055 x 0.0115 = 6.33 x 10^-5
The actual joint failure probability is somewhat higher: P(V1 & V2 fail) = 0.01 x 0.02 x 0.5 + 0.001 x 0.003 x 0.5 ≈ 1.02 x 10^-4
The two versions are positively correlated: both are more prone to failure in S1 than in S2
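Example 1 can be reproduced with a few lines, applying the subspace model above:

```python
# Example 1: two equiprobable input subspaces S1, S2, with the
# conditional failure probabilities from the table above
p_S = [0.5, 0.5]          # Prob{input from S_i}
pV1 = [0.01, 0.001]       # Prob{V1 fails | input from S_i}
pV2 = [0.02, 0.003]       # Prob{V2 fails | input from S_i}

# Unconditional failure probabilities (law of total probability)
P_V1 = sum(p * s for p, s in zip(pV1, p_S))   # 0.0055
P_V2 = sum(p * s for p, s in zip(pV2, p_S))   # 0.0115
# Joint failure probability, assuming independence within each subspace
P_joint = sum(a * b * s for a, b, s in zip(pV1, pV2, p_S))

print(f"{P_V1 * P_V2:.3e}")   # 6.325e-05: product if fully independent
print(f"{P_joint:.3e}")       # 1.015e-04: the actual joint probability is higher
```

The same code with Example 2's table (`pV2 = [0.003, 0.02]`) yields a joint probability of 2.5 x 10^-5, below the independent product, illustrating negative correlation.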
Numerical Examples: Example 2
Conditional failure probabilities:
Version   S1      S2
V1        0.010   0.001
V2        0.003   0.020
Unconditional failure probabilities - same as in Example 1
Joint failure probability: P(V1 & V2 fail) = 0.01 x 0.003 x 0.5 + 0.001 x 0.02 x 0.5 = 2.5 x 10^-5
Much less than the previous joint probability, and less than the product of the individual probabilities
The tendencies to failure are negatively correlated: V1 is better in S1 than in S2, and the opposite holds for V2 - V1 and V2 make up for each other's deficiencies
Ideally - multiple versions are negatively correlated
In practice - positive correlation is common, since the versions are solving the same problem
Causes of Version Correlation
Common specifications - errors in specifications will propagate to software
Intrinsic difficulty of problem - algorithms may be more difficult to implement for some inputs, causing faults triggered by same inputs
Common algorithms - algorithm itself may contain instabilities in certain regions of input space - different versions have instabilities in same region
Cultural factors - Programmers make similar mistakes in interpreting ambiguous specifications
Common software and hardware platforms - if same hardware, operating system, and compiler are used - their faults can trigger a correlated failure
Achieving Version Independence - Incidental Diversity
Forcing developers of different modules to work independently of one another
Teams working on different modules are forbidden to directly communicate
Questions regarding ambiguities in specifications or any other issue have to be addressed to some central authority who makes any necessary corrections and updates all teams
Inspection of software must be carefully coordinated so that inspectors of one version do not directly or indirectly leak information about another version
Achieving Version Independence - Methods for Forced Diversity
Diverse specifications
Versions with differing capabilities
Diverse programming languages
Diverse development tools and compilers
Diverse hardware and operating systems
Diverse Specifications
Most software failures are traced back to the requirements specification
Diversity can begin at the specification stage - specifications may be expressed in different formalisms
Specification errors will then not coincide across versions - each specification will trigger a different implementation fault profile
Versions With Differing Capabilities
Example:
One rudimentary version provides less accurate but still acceptable output
A second version uses a simpler, less fault-prone, and more robust algorithm
If the two do not agree - a third version can help determine which of the two disagreeing versions is correct
If the third version is very simple, formal methods may be used to prove its correctness
Diverse Programming Languages
Programming language affects software quality Examples:
Assembler - more error-prone than a higher-level language
The nature of the errors differs - in C programs it is easy to overflow allocated memory, which is impossible in a language that strictly manages memory
No faulty use of pointers in Fortran - it has no pointers
Lisp is a more natural language for some artificial-intelligence (AI) algorithms than C or Fortran
Diverse programming languages may have diverse libraries and compilers - will have uncorrelated (or even better, negatively-correlated) failures
Choice of Programming Language
Problem - should all versions use the best language for the problem or should some versions be written in other less suited languages?
If all versions use the same language - lower individual fault rates, but positively correlated failures
If different languages are used - some individual fault rates may be greater, but the overall failure rate of the N-version system may be smaller if there are fewer correlated failures
Tradeoff is difficult to resolve - no analytical model exists - extensive experimental work is necessary
Diverse Development Tools and Compilers
May make possible "notational diversity," reducing the extent of positive correlation between failures
Using diverse tools and compilers (which may themselves be faulty) for different versions may allow for greater reliability
Diverse Hardware and Operating Systems
Output of system depends on interaction between the application software and its platform - operating system and processor
Both processors and operating systems are notorious for the bugs they contain
It is therefore a good idea to complement software design diversity with hardware and operating system diversity - running each version on a different processor type and operating system
Back-to-Back Testing
Comparing intermediate variables or outputs of the versions for the same input - helps identify non-coincident faults
Intermediate variables provide increased observability into the behavior of the programs
But defining intermediate variables constrains the developers to producing those variables - reducing program diversity and independence
Single Version vs. N Versions
Assumption: developing N versions is N times as expensive as developing a single version
Some parts of the development process may be common, e.g., if all versions use the same specifications, only one set of specifications needs to be developed
Management of an N-version project imposes additional overheads
Costs can be reduced - identify the most critical portions of the code and develop multiple versions only for those
Given a total time and money budget - two choices: (a) develop a single version using the entire budget, or (b) develop N versions
No good model exists to choose between the two
Experimental Results
Few experimental studies of effectiveness of N-version programming
Published results exist only for work done in universities
One study was conducted at the Universities of Virginia and California at Irvine
27 students wrote code for an anti-missile application; some had no prior industrial experience, while others had over ten years
All versions were written in Pascal
93 correlated faults were identified by standard statistical hypothesis-testing methods; if the versions had been stochastically independent, no more than 5 would be expected
No correlation was observed between the quality of the programs produced and the experience of the programmer