

Software Fault Tree Analysis *

Nancy G. Leveson

Peter R. Harvey

With the increased use of software in safety critical systems, software safety has become an important factor in system quality. This paper describes a technique, software fault tree analysis, for the safety analysis of software. The technique interfaces with hardware fault tree analysis to allow the safety of the entire system to be maximized. Experience with the technique and its practicality are discussed.

INTRODUCTION

In recent years, advances in computer technology have drastically changed the design of electromechanical devices. As more computers are used in critical real-time applications, the need increases for a methodology to develop dependable software. Computers currently control reactions in nuclear power plants. They track airplane positions in air traffic control systems. Hospitals use computers to monitor patients. Space vehicles use computers to deal with the complexities of space flight. And the military uses computers to enhance the capabilities of their weaponry. It is estimated, for example, that 80% of the U.S. weapons are computer controlled, and that by 1985, that figure will be 95% [3].

Justifiably, there is growing concern that these computer systems contain bugs which, by the nature of the instruments they control, constitute grave dangers to human life or property. Traditional areas of Computer Science such as reliability, fault tolerance, software engineering, testing, etc., have labored for years to demonstrate a satisfactory methodology for creating and/or demonstrating programs that will behave as expected. Though there have been numerous advances in the state of the art within these disciplines, the problem of knowing exactly what a program will do remains unsolved.

*This work was supported in part by a contract with Hughes Aircraft Company (7-656146-T-DS) and by a joint MICRO grant with the University of California and Hughes Aircraft.

Address correspondence to Nancy Leveson, Computer Science Department, University of California, Irvine, Irvine, CA 92717.

The Journal of Systems and Software 3, 173-181 (1983)

© Elsevier North Holland, Inc., 1983

Moreover, simple improvements in these methodologies will still not solve the problem. Even the best design strategy depends upon some initial determination of what the program is expected to do. The critical problem is our inability, as humans, to flawlessly describe complex systems [2]. Unfortunately, software analysis techniques, such as proofs and testing, depend upon the correct specification of the program for their results to have any meaning. So, even if it were possible to prove correctness, or to test all permutations of a critical program, this would not serve to demonstrate the safety of the software.

A second shortcoming of traditional methods is that almost none of them can focus upon the more important aspects of a program while ignoring less important ones. Specifications development currently tries to manage all of the program details, formalizing them so that no small items slip by. Test cases are usually developed in order to test the functional aspects of the entire program, and there is currently no way to determine which are the safety critical cases. Proofs, by nature, must consider details and hence get hopelessly drowned in the minutiae of large systems.

Computer science is currently at much the same stage that engineering was some time ago. Much is known about how to design better software, but little is known about how to make software perfectly dependable. Faced with the same problem in mechanical systems, safety analysts developed tools for safety measurement, such as fault tree analysis. With such tools, engineers are able to ensure the quality of their designs.

Since computers are now replacing mechanical devices in many of these safety critical systems, a first logical step toward safety in software systems is the application of existing safety tools wherever possible. The purpose of this paper is to show the software application for one such tool, namely fault tree analysis.

In this paper, the area of software safety is first defined and distinguished from other areas of computer science. Second, safety engineering procedures are briefly outlined, since these methods are, for the most part, quite applicable to software safety. Finally, software fault tree analysis is described and examples of its results given.

SOFTWARE SAFETY

Software safety is a relatively new concern of software research [6]. In this approach, a program is analyzed as to what it will not do, rather than what the program will do. More than a trick in semantics, the change represents the fact that there is a very large number of erroneous results from a given program, while typically there are many fewer results which would have catastrophic consequences.

For example, in a guided missile application one may be more concerned that the missile does not detonate prematurely, thereby endangering friendly forces, than whether the missile hits its target. While other approaches try to demonstrate the functionality of the whole system, this effort focuses upon only those parts of the system which have safety ramifications.

In the software area, safety has long been confused with reliability. In certain applications such as in digital flight controllers aboard aircraft and spacecraft, the reliable operation of the computer software is in part the same as the safe operation of the software. A failure of the system to continue operation while the craft is aloft would certainly be a safety failure. But while the reliable operation of the computer might include the control of inconsequential panel instrumentation and the like, the safe operation does not. Instrument control is not as important as flying the craft.

Safety and reliability clearly diverge in other applications such as controlling a nuclear reactor, where the reliable operation of the system is in certain ways contrary to its safe operation. Should a fault in the software be found, a reliable design would attempt to continue operation by calling a second routine to calculate the value a different way, such as in the recovery block scheme of Randell [7]. On the other hand, a completely safe design might cause an immediate reactor shutdown, since the goal of avoiding hazards would be more important than continuing operations. While the reliable design could possibly cause a catastrophe, the completely safe design might not allow the reactor to operate.

Thus the goal of software safety is to produce safe designs in the context of achieving the desired system performance. This is accomplished by separating the safety and functional aspects of the design, by analyzing the effects of failures in safety related parts, and by deemphasizing functional requirements in the face of possible safety failures.

SAFETY ENGINEERING

The safety analysis of electromechanical devices is a structured methodology beginning with a procedure called preliminary hazard analysis (PHA) [4]. Using system design information, it attempts to find all hazards posed by the system and to categorize them according to their consequences.

Once the hazards have been determined and the design of the system is underway, detailed analysis can proceed. Fault tree analysis (FTA) is employed from this point forward, sometimes along with other analysis methods, to review the design in progress. The objective is to give probability estimates for the major hazards determined by the PHA, as well as to point out single point failure modes and guide further design in the most fruitful direction for hazard elimination and reduction.

SAFETY ANALYSIS OF SOFTWARE

The safety analysis of software proceeds in a manner similar to that of hardware. This analysis requires a representation of the program logic such as a detailed design and a list of safety failures to be analyzed. These failures can be derived from the safety requirements. The analysis attempts to look at the logic from a different perspective than those who are interested in the functional aspects of the program. The goal is to find failure modes or failure scenarios which are or could lead to the specified safety failures, or alternatively, to show that the logic contained in the design is not likely to produce any safety failures. The method essentially assumes that any procedure, value, or input has the potential to be incorrect. The purpose of the analysis is to determine which incorrect values, functions, or combinations of these will cause a catastrophe. Information regarding these possible failure modes can then be used to guide the further design of the software, to include additional safety features, and to facilitate thorough testing by pinpointing critical functions and test cases.

Software Fault Tree Analysis

The basic procedure of fault tree analysis is (1) assume that the system has failed in the way described by the hazard analysis, and (2) work backwards to determine the set of possible causes for the condition to occur. At the root of a fault tree is the event which is to be analyzed, the “loss event.” Necessary preconditions are described at the next level with either an AND or an OR relationship. Each subnode is expanded in a similar way until all leaves describe events of calculable probability or are unable to be analyzed for some reason. Thus, a fault tree is a representation of the internal conditions of the instrument and the inputs necessary for an event to occur. Figure 1(a) is an example of a high-level system fault tree.

Tree reductions. Once a fault tree is built, it can be converted to an equivalent form for easier analyses. Typically used is a grammar consisting of the symbols * and + (for AND and OR) and the labels that have been assigned to each node. The strings generated by this grammar form what are called “cause statements” in safety engineering. Each such string represents a failure sequence or scenario, i.e., the set of circumstances which together can cause the loss event or catastrophe. False redundancies are removed by applying Boolean logic to these “cause statements,” creating “primal cause statements.”
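As a rough illustration of this reduction step, the following Python sketch expands a small AND/OR tree into cause statements and then removes redundant ones by Boolean absorption (a + a*b = a). The tuple encoding of the tree, the function names, and the node labels a, b, c are assumptions made for the example; they do not come from the paper.

```python
from itertools import product

# A tree is either a leaf label (str) or a tuple (gate, [children]),
# where gate is "AND" or "OR". This encoding is hypothetical.

def cause_statements(node):
    """Return the set of cause statements, each a frozenset of leaf labels."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cause_statements(child) for child in children]
    if gate == "OR":
        # Any single child's scenario is enough to cause the event.
        return set().union(*child_sets)
    if gate == "AND":
        # One scenario from every child must hold at the same time.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate}")

def primal_cause_statements(statements):
    """Drop every statement that strictly contains another one (absorption)."""
    return {s for s in statements if not any(t < s for t in statements)}

# Hypothetical tree: (a * b) + (a * (b + c)) + (a * b * c)
tree = ("OR", [("AND", ["a", "b"]),
               ("AND", ["a", ("OR", ["b", "c"])]),
               ("AND", ["a", "b", "c"])])
primal = primal_cause_statements(cause_statements(tree))
```

Representing each statement as a set makes the idempotence rule (a * a = a) automatic, so only absorption has to be applied explicitly.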

Sensitivity analysis. Once the primal cause statements are known, a numerical analysis can be applied to determine the most likely causative agents of the loss event. Sensitivity analysis measures the effect of each event on the loss event statistic [1]. By observing which events in the tree make the most dramatic changes in the loss statistic, it can be determined which parts of the design are most sensitive to failure and, hence, which parts should receive the most time and effort in design and testing.
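One simple way to make this concrete is sketched below in Python. It assumes statistically independent basic events, and it measures sensitivity as the change in loss probability when one event goes from impossible to certain; this is only one possible measure and is not claimed to be the one used in [1]. The cause statements and probabilities are hypothetical.

```python
from itertools import product

def loss_probability(cut_sets, p):
    """Probability that at least one cause statement occurs, assuming the
    basic events are independent (an assumption of this sketch)."""
    events = sorted(p)
    total = 0.0
    # Enumerate every combination of events occurring / not occurring.
    for outcome in product([False, True], repeat=len(events)):
        state = dict(zip(events, outcome))
        if any(all(state[e] for e in cs) for cs in cut_sets):
            weight = 1.0
            for e in events:
                weight *= p[e] if state[e] else 1.0 - p[e]
            total += weight
    return total

def sensitivity(cut_sets, p, event):
    """Change in loss probability when `event` goes from impossible to certain."""
    return (loss_probability(cut_sets, {**p, event: 1.0})
            - loss_probability(cut_sets, {**p, event: 0.0}))

# Hypothetical primal cause statements a*b + a*c with made-up probabilities.
cut_sets = [{"a", "b"}, {"a", "c"}]
p = {"a": 0.01, "b": 0.1, "c": 0.2}
ranking = sorted(p, key=lambda e: sensitivity(cut_sets, p, e), reverse=True)
```

Here event a ranks first because it appears in every cause statement, which suggests it would deserve the most design and testing effort. The exhaustive enumeration is exponential in the number of events and is meant only for small examples.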

Symbols. The symbols used in software fault trees are a subset of those currently in use in their hardware counterparts [see Figure 1(b)]. It is clearly advantageous to use a consistent set of symbols so that hardware and software trees can be linked together at their interfaces. This allows entire systems to be analyzed with FTA.

The rectangle is used, as in hardware trees, for events which must be further analyzed in the tree. The circle is used for primary failures of routines. If a routine deviates from its specification, then a primary failure has occurred and a circle is used to represent this fact. The failure probability for the node may be obtained from testing, operational experience, or by any other software reliability measurement techniques.

The house indicates the nonfailure of a routine, i.e., the correct operation of that part of the system. The probability for this node is simply the reliability of the routine.

The diamond represents events which are not primary failures and which are not further analyzed. A routine which fails because of an improper environment is an example. This obviates the need to analyze the set of all environments under which the routine could fail, and thus reduces the amount of work required.

The oval indicates a necessary condition of the program’s internal state. For example, the condition that a variable has a particular value, or range of values, is represented with an oval. It is neither a success nor failure in itself. It is simply a condition.

Finally, the “AND” and “OR” gate symbols are used in exactly the same way as in hardware trees. The AND gate requires all input events and conditions to produce the output event. The OR gate requires one or more of the inputs to generate the output event.

Procedure. The major difference between hardware and software fault trees, as one might expect, is in their procedures. While hardware components are connected by simple wires, software components are arranged in a variety of ways using many different language constructs. Since hardware fault trees are built by working backwards through a design wire by wire, software fault trees must be built by working backwards through a program statement by statement.

Software fault tree analysis begins with an assumption that the loss event has occurred. Thus, the code responsible for the output is the starting place for the analysis. Working backwards, one deduces both how the program got to this part of the code and what the current values of variables must be.

In this paper we will consider the following program constructs: IF-THEN-ELSE statements, ASSIGNMENT statements, FUNCTION calls, and WHILE statements. Other constructs can be inferred from these, i.e., input and output are essentially assignment, IF-THEN statements are similar to IF-THEN-ELSE statements, etc. For the sake of generality, the event is left unnamed, and the failures and conditions are referred to by syntactic position in the statement.

IF-THEN-ELSE statements. Figure 2(a) describes the general format for an IF-THEN-ELSE statement and assumes that the event occurred within the bounds of that statement. In other words, executing the statement in some environment caused the event, so we wish to build a tree which describes that environment.

The syntax of the IF-THEN-ELSE described in Figure 2(a) is:

IF (COND) THEN (THEN-PART) ELSE (ELSE-PART),

where (COND) is a simple conditional, i.e., not a Boolean function call. This simplifies the tree considerably, yet no power is lost.

Figure 1. (a) Firewheel spin control, top level fault tree. (b) Relevant fault tree symbols from MIL-STD-882A.

[Figure 1(a): fault tree diagram; legible node labels include “AT MAXIMUM,” “GAS ON,” and “MOTORS OFF.”]

The rectangle indicates an event to be analyzed further.

The circle represents a basic fault event or primary failure of a component. It requires no further development, and its probability of occurrence is derived from the generic rate of the part.

The house is used for events which normally occur in the system. It represents the continued operation of the component, and its probability is the reliability of the part.

The diamond is used for non-primal events which are not developed further for lack of information or insufficient consequences.

The oval is used to indicate a condition. It defines the state of the system that permits a fault sequence to occur. It may be normal or result from failures.

The AND gate serves to indicate that all input events are required in order to cause the output event.

The OR gate indicates that one or more of the input events are required to produce the output event.

Figure 2. (a) General fault tree format for an IF-THEN-ELSE. (b) Example of IF-THEN structure.

[Figure 2(a) node labels: “CONDITION TRUE, THEN-PART CAUSES EVENT”; “CONDITION FALSE, ELSE-PART CAUSES EVENT.” Figure 2(b) node labels: “A > B PRIOR TO IF-THEN-ELSE”; “X := F(X) CAUSES X > 100”; “A <= B PRIOR TO IF-THEN-ELSE”; “X := 10 CAUSES X > 100.”]

To analyze an IF-THEN-ELSE with a Boolean function for a conditional, transform it into

B := F(X);

IF B THEN (THEN-PART) ELSE (ELSE-PART),

where F(X) is the Boolean function call and B is a Boolean variable.

Generating the fault tree for the IF-THEN-ELSE begins by noting that there are only two places, namely the THEN-PART and the ELSE-PART, which can operate or cause things to happen. If the event occurred in the THEN-PART, we know from the control semantics of the IF-THEN-ELSE that the COND must have been TRUE prior to the execution of the statement. Further, the THEN-PART must be capable of causing the event. This can be either a YES/NO answer, or the subject of more fault tree analysis. The ELSE-PART is analyzed in a manner symmetric with the THEN-PART.

For example, a simple statement

IF A > B THEN X := F(X) ELSE X := 10

when analyzed for the event “X > 100” yields a fault tree of the form given in Figure 2(b). Note that node G5 describes an impossible situation, hence it is immediately assigned a zero probability. The subtree of which it is a part is also awarded a zero probability, since it is an input to an AND gate with one of the nodes equal to zero.

The fault tree basically transforms this problem into two problems operating on one less line of code. To solve the original problem or to provide the failure scenario for the IF-THEN-ELSE is to analyze the software which operates prior to that statement for the events “A > B” and “F(X) > 100.” Both are necessary conditions for the initial loss event. And in turn, any and all conditions which are needed for these subevents are also preconditions for the initial loss event. Thus the recursive application of fault tree analysis will generate the full set of faults and conditions necessary for the loss event.

Figure 3. Assignments to variables cause goal replacement.

[Figure 3 node labels: “FIRE CALLED”; “P,Q CAUSE A < B”; “P CAUSES A < .01”; “RAN(0) < .01.”]

The conditional nodes can be applied to decreasing the test set for nonanalyzed functions. Cause statements for this example will have a node “A > B” AND’d to “F(X) > 100.” The test set for F(X) need only include cases in which A is greater than B, since the failure occurred with that as a known attribute of the environment.
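The IF-THEN-ELSE template can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical tuple encoding of gates and string labels for sub-goals; a branch that cannot cause the event is passed as None and pruned, which corresponds to assigning it zero probability.

```python
def if_then_else_tree(cond, then_goal, else_goal):
    """Build the fault tree template for IF cond THEN ... ELSE ...

    then_goal / else_goal are sub-goal labels still to be analyzed, or
    None when that branch cannot cause the event.
    """
    branches = []
    if then_goal is not None:
        # The THEN-PART ran, so the condition held before the statement.
        branches.append(("AND", [f"{cond} prior to IF", then_goal]))
    if else_goal is not None:
        # The ELSE-PART ran, so the condition was false before the statement.
        branches.append(("AND", [f"NOT ({cond}) prior to IF", else_goal]))
    return ("OR", branches)

# Event "X > 100" for: IF A > B THEN X := F(X) ELSE X := 10
# The ELSE-PART assigns 10, and "10 > 100" is impossible, so it is pruned.
else_goal = "10 > 100" if 10 > 100 else None
tree = if_then_else_tree("A > B", "F(X) > 100", else_goal)
```

The surviving branch pairs the condition “A > B” with the sub-goal “F(X) > 100,” so both remain as goals for the statements that precede the IF.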

Assignment statements. An assignment statement is the mechanism by which a variable is associated with a value. Throughout software fault trees, variables will be referenced and conditions placed on ranges of their values. The assignment statement is therefore the end point, i.e., the ultimate goal statement, for a backward search to determine whether or not the variable can have the prescribed range of values.

Most assignment statements have the effect of simple goal replacement. Suppose, for example, the original goal is to find “A > X” in a sequence of statements P1...Pn, where Pn is an assignment statement

A := expression.

The new goal is to find the event “expression > X” in sequence P1...Pn-1. The variable A is removed from further consideration.

Expressions having more than a simple term can be treated with AND or OR conditionals. For example, if the statement “A := B/C” is analyzed for the event “A TOO LOW,” then the preconditions for this are “C TOO HIGH” or “B TOO LOW.” The analysis continues one statement above the assignment.
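Goal replacement is essentially textual substitution of the assigned expression for the variable in the goal. The Python sketch below illustrates the idea using the standard `ast` module (`ast.unparse` requires Python 3.9 or later). Python expression syntax stands in for the paper's notation, and the helper names are hypothetical.

```python
import ast

class _Substitute(ast.NodeTransformer):
    """Replace every occurrence of one variable name by an expression."""

    def __init__(self, name, replacement):
        self.name = name
        self.replacement = replacement

    def visit_Name(self, node):
        return self.replacement if node.id == self.name else node

def replace_goal(goal, variable, expression):
    """Return the goal that must hold just before `variable := expression`."""
    goal_tree = ast.parse(goal, mode="eval")
    expr_tree = ast.parse(expression, mode="eval").body
    new_tree = _Substitute(variable, expr_tree).visit(goal_tree)
    # ast.unparse re-inserts parentheses where operator precedence needs them.
    return ast.unparse(ast.fix_missing_locations(new_tree))

new_goal = replace_goal("A > X", "A", "B + C")
```

After the substitution the variable A no longer appears in the goal, so the backward search continues with the statements preceding the assignment.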

A simple example of goal replacement is given by this three line program, analyzed for the event “FIRE CALLED”:

P: A := RAN(0);

Q: B := .01;

IF A < B THEN FIRE;

The fault tree for this is given in Figure 3.

Function calls. There are two possible reasons for analyzing a function call. One is that the value it returns is part of the event of interest. The other is that the routine has side effects which concern the event. A goal of current software engineering practice is to minimize the occurrence of side effects, and this paper will not consider the side effect problem at length. Briefly stated, however, if any of the function’s side effects can influence the event of interest, then the conditions under which those side effects occur must be OR’d together with the normal function tree.

The fault tree for a function call considers both parameter and algorithm failures. Figure 4 shows the general format for the function subtree. The failure of the function can be either left as a primary failure node, or it can be further analyzed using fault trees. Each parameter is analyzed for its possible role in the event, under the assumption that the algorithm works. What is needed to continue the investigation is a specification about what each parameter is supposed to be and how each relates to the returned value of the function. A violation of that specification is then the event to be analyzed in the program prior to the function call.

For example, consider the very simple function for square roots “SQRT.” Analyzing a call of “SQRT(X)” with an event “SQRT(X) < 100,” we conclude that either (1) the SQRT routine could fail low, (2) X could be in violation of the specification for SQRT parameters, namely X negative, or (3) SQRT could function properly and therefore “X < 10000” would cause the event.
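The three cases in the SQRT example follow a fixed pattern, which the short Python sketch below writes out. The tuple encoding and the wording of the node labels are hypothetical and serve only as an illustration.

```python
def function_call_tree(call, spec_violations, goal_if_correct):
    """OR together the ways in which a function call can produce the event."""
    # (1) The routine itself deviates from its specification.
    children = [f"{call} fails (primary failure)"]
    # (2) A parameter violates the specification of the routine.
    children += [f"parameter violates spec: {v}" for v in spec_violations]
    # (3) The routine works, so the event depends on the parameter values.
    children.append(f"{call} correct and {goal_if_correct}")
    return ("OR", children)

# Event "SQRT(X) < 100" at a call of SQRT(X).
tree = function_call_tree("SQRT(X)", ["X < 0"], "X < 10000")
```

Cases (2) and (3) become new goals for the code preceding the call, while case (1) can stay a primary failure node or be expanded by analyzing the routine itself.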

Figure 4. Function calls must consider both parameter and algorithm failures.

[Figure 4 node labels: “F(P1, P2, ...) CAUSES EVENT”; “P1 OR P2 ... CAUSE F TO CAUSE EVENT”; “F FAILS CAUSING THE EVENT.”]

WHILE loops. Figure 5(a) illustrates the general format for fault trees which consider an event to have occurred at a while statement or before. The left subtree is concerned with the possibility that the event occurred prior to the while statement. The right part analyzes the statement part of the while loop, and only looks at the one iteration in which the event took place.

Figure 5. (a) Format for WHILE statements. (b) Example of WHILE statement.

Suppose we would like to analyze the while statement

WHILE B > X DO

BEGIN B := B - 1;

Z := Z + 10;

END;

for the event “Z > 100.” The left subtree of Figure 5(b) supposes that the WHILE statement never executed. This implies that Z had to be greater than 100 initially and B less than or equal to X. The right subtree examines the modification of Z within the body of the WHILE. Letting n represent the unknown number of times the loop will iterate, and letting $Z_n$ be the value of Z on the nth iteration, we can say that “$Z_n > 100$” is the event within the loop. We now must find an expression for n and $Z_n$ in terms of $Z_0$, the original Z before the loop.

The assignment statement for Z yields the expression $Z_n = Z_0 + 10n$. The loop iterator n is defined by the assignment B := B - 1. After an assumed n iterations, $B_n = B_0 - n$. Since the iteration stopped at n times, we know that the condition of the WHILE became false. Hence $B_n \geq X$, $B_0 - n \geq X$, and $B_0 - X \geq n$. Combining expressions, we deduce $100 < Z_n \leq Z_0 + 10(B_0 - X)$. Now we have an event to analyze again in terms of variables before the WHILE statement. This process is essentially the same as that employed in proof of correctness in which a loop invariant is determined. However, the invariant in SFTA need involve only the critical variables of the loop and the loop control mechanism.
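The closed-form expressions can be checked numerically. The Python sketch below runs the example loop for a range of integer starting values and verifies that the final Z equals the initial Z plus 10n and never exceeds the initial Z plus 10 times the initial difference B - X. Integer-valued variables are an assumption of this sketch, and the function names are hypothetical.

```python
def run_loop(b0, x, z0):
    """Execute WHILE B > X DO BEGIN B := B - 1; Z := Z + 10 END and return
    the final Z together with the number of iterations n."""
    b, z, n = b0, z0, 0
    while b > x:
        b -= 1
        z += 10
        n += 1
    return z, n

def bound_holds(b0, x, z0):
    """Check Z_n = Z_0 + 10n and Z_n <= Z_0 + 10(B_0 - X) for one start state."""
    z_n, n = run_loop(b0, x, z0)
    if n == 0:
        return True  # the body never ran; the left subtree covers this case
    return z_n == z0 + 10 * n and n <= b0 - x and z_n <= z0 + 10 * (b0 - x)

# Exhaustive check over a small, arbitrary grid of integer start states.
all_ok = all(bound_holds(b0, x, z0)
             for b0 in range(-5, 15)
             for x in range(-5, 15)
             for z0 in range(0, 120, 7))
```

For example, starting from B = 12, X = 2, and Z = 5, the loop runs ten times and ends with Z = 105, so the event “Z > 100” reduces to a condition on the values of Z, B, and X before the WHILE statement.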

Analyses of software fault trees. Reducing software fault trees to grammars proceeds as described earlier, but a numerical analysis of the loss statements is not as straightforward as in hardware. The basic difference between hardware and software trees is that hardware components fail independently of each other, whereas software component failures are typically correlated. Although modern software design techniques aim toward independence of modules, it is unlikely that numerical analyses of software trees will ever be as precise as they are for hardware trees.


This is not to say, however, that a numerical study cannot yield information, since the loss expression can be treated as a relative probability of the loss event. That is, the figures can be used for comparing alternative designs. Moreover, by ranking the component failures according to their effect on the loss statistic, it is possible to determine on which routines to spend the most effort in design and testing. Furthermore, the failures which are most likely to cause a safety failure of the system can be identified and test cases tailored to accurately determine the probabilities of these failures.

Experience with SFTA. In order to test the procedure, a number of different levels of software were viewed. Programs, after all, are represented in different ways at different stages of the life cycle, and it is important to discover just what the technique could contribute at each stage. The most dramatic results were found when the technique was applied to a complex program, i.e., one which controls a spacecraft [5].

The program chosen consists of over 1250 lines of Intel 8080 assembly code and controls the flight and telemetry for a U.C. Berkeley spacecraft, designed for launch from a mother ship called FIREWHEEL (NASA/ESA). It is a real-time system of sufficient complexity to show the analytic power of fault trees in larger projects.

The mission of the spacecraft was to sample electric fields in the earth’s magnetotail using wire booms deployed by the microprocessor after launch. Electrosensitive spheres at the end of these booms were continuously sampled by the microprocessor, which transmitted the information in the telemetry format. Other instruments were controlled and readings taken by the same program. Due to cost considerations, there was no earth-satellite command control. Hence the microprocessor had to make all its own decisions.

During the deployment of the booms, it was posited that the spacecraft could fail such that the spin rate would cause excessive centrifugal force on the spheres. Beyond 25 times gravity, the wires would be ripped off the spacecraft. Yet it was known that early in the deployment, the force would be at least 20 times gravity. The question was, how likely is it that the spacecraft will rip its booms apart?

Applying fault tree analysis to the spacecraft program, a critical failure scenario was located rather deep in the code. The scenario indicated that two sun-interrupts within 64 msec of each other were capable of crashing the microprocessor, and if coupled with certain conditions of the deployment sequence, would result in destroying the booms. Though it was nearly impossible for real sun pulses to occur so close together, one of them need only be a gamma-ray-induced spike, which is a common occurrence in space electronics. A simple check, or “blocking element,” could have been inserted to prevent this condition.

The analysis required a close look at about 12% of the FIREWHEEL program and took only a couple of days, most of which was spent translating the program from assembly code into Pascal and drawing diagrams. By comparison, the analysis effort required to find such a failure through proofs or testing would surely dwarf that which was needed for fault tree generation. In fact, substantial testing by an independent group had failed to reveal the software problem that the above analysis found.

Placement in the life cycle. The placement of software FTA in the life cycle is a question which has both a theoretical and a practical part. First, how much information about the design is required in order for the analysis to make any predictions? And second, how much work does it take to modify fault trees when the design changes?

Experience gained in this project indicates that fault tree analysis could be effectively applied as early in the software life cycle as the detailed design stage. The analysis needs to know which functions will be used, what variables each function will use, and what variables each will change. It is not important to know exactly how the low-level functions perform their tasks. Failures of these functions could be left as simple nodes in the tree and perhaps analyzed later after the code is written.
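To illustrate why this design-level information is enough to begin, the sketch below records, for each function, the variables it uses and changes, and then works backward from a variable of interest to the functions that could influence it. It is only an illustration under stated assumptions: the function and variable names are invented, and the backward trace is a simple data-flow search rather than the analysis procedure itself.

```python
# Hypothetical detailed-design records: function -> variables read and written.
DESIGN = {
    "read_sun_sensor": {"reads": set(), "writes": {"sun_pulse"}},
    "update_spin_rate": {"reads": {"sun_pulse"}, "writes": {"spin_rate"}},
    "deploy_booms": {"reads": {"spin_rate"}, "writes": {"boom_length"}},
    "format_telemetry": {"reads": {"boom_length", "field_sample"},
                         "writes": {"telemetry_frame"}},
}


def functions_affecting(variable, design):
    """Return the functions that can influence `variable`, either by
    writing it or by writing something a relevant function reads."""
    relevant = set()
    seen_vars = set()
    pending = [variable]
    while pending:
        var = pending.pop()
        if var in seen_vars:
            continue
        seen_vars.add(var)
        for name, info in design.items():
            if var in info["writes"] and name not in relevant:
                relevant.add(name)
                pending.extend(info["reads"])
    return relevant


relevant = functions_affecting("boom_length", DESIGN)
print(sorted(relevant))  # ['deploy_booms', 'read_sun_sensor', 'update_spin_rate']
```

In this made-up design, `format_telemetry` never enters the search, and the internals of the three remaining functions are not needed to see that they belong in the tree.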

Use of fault tree analysis (or any analysis method for that matter) too early in the life cycle would, of course, result in wasted effort. Repeated transformations of the design in the life cycle will unfortunately cause corresponding changes to the fault tree representation. The desire to build fault trees early in the life cycle will therefore depend upon how easy it is to modify the trees once they are built. Modular programming might have a modularizing effect upon the trees and thus changes to code and to the trees should be proportional, but there is no evidence to support this yet. Of course, the longer SFTA is delayed, the more difficult it will be to make the changes in the design of the software which the SFTA procedure shows as necessary.

The cost of SFTA. A second and very important concern is one of cost. The use of SFTA is clearly dependent upon practicality. As the FIREWHEEL example shows, each analysis involves only those parts of the system that are needed to represent the event. Only 12% of the program was involved in making the trees for the FIREWHEEL example. However, only one event was analyzed. Since a tree represents the analysis of a given event, analyzing multiple events necessitates the generation of multiple trees. Thus, for a set of events, one may have to analyze the program multiple times.


Fortunately, there are a number of methods one can employ to reduce the fault tree analysis effort. First, a branch that is an input to an AND gate can be pruned from the analysis if some other input to the same gate can be assured to have zero probability. In general, one should therefore analyze the easiest of the nodes in an AND gate before all others. This will cut the analysis to the minimal amount necessary.
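The AND-gate pruning rule can be sketched as a small tree walk. The example is illustrative only: the `Node` structure, the `effort` figures (an assumed relative cost of analyzing each leaf), and the event names are hypothetical and are not taken from the FIREWHEEL trees.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    gate: Optional[str] = None      # "AND", "OR", or None for a leaf event
    children: List["Node"] = field(default_factory=list)
    impossible: bool = False        # leaf already shown to have zero probability
    effort: int = 1                 # assumed relative cost of analyzing this leaf


def subtree_effort(node):
    """Total assumed analysis cost of a node and everything below it."""
    if node.gate is None:
        return node.effort
    return sum(subtree_effort(child) for child in node.children)


def analyze(node, visited):
    """Return True if the event can occur; append each analyzed leaf to `visited`."""
    if node.gate is None:
        visited.append(node.name)
        return not node.impossible
    if node.gate == "AND":
        # Cheapest input first: one impossible input rules out the whole gate,
        # so the remaining (costlier) inputs are never analyzed.
        for child in sorted(node.children, key=subtree_effort):
            if not analyze(child, visited):
                return False
        return True
    # OR gate: examine every input so that no failure path is missed.
    results = [analyze(child, visited) for child in node.children]
    return any(results)


costly = Node("costly branch", gate="OR", children=[
    Node("leaf a1", effort=5),
    Node("leaf a2", effort=8),
])
cheap = Node("cheap leaf b", impossible=True, effort=1)
top = Node("loss event", gate="AND", children=[costly, cheap])

visited = []
possible = analyze(top, visited)
print(possible, visited)  # False ['cheap leaf b']
```

Because the cheap input is examined first and found impossible, the costly branch is never expanded.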

Second, any conditional nodes in the tree which are outside of functional bounds can be removed by assertions. That is, if there is a node “X>100” in the tree and the functional requirements of the program do not need such values of X, then using an assertion for X in the program can remove this node from consideration. If this node is part of an AND, then all other input events can be removed from further consideration.
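A minimal sketch of this second method follows, assuming the assertion is implemented as a run-time guard at the point where X is assigned. The names `checked_x` and `prune_and_gate` and the sibling events "valve open" and "timer expired" are hypothetical.

```python
X_MAX = 100  # bound taken from the "X>100" node in the text


def checked_x(value):
    """Hypothetical guard placed where X is assigned; rejects out-of-bounds values."""
    if value > X_MAX:
        # An explicit raise is used because plain `assert` statements
        # are skipped when Python runs with the -O flag.
        raise AssertionError(f"X={value} exceeds the functional bound {X_MAX}")
    return value


def prune_and_gate(inputs, guarded):
    """Return the AND-gate inputs that still need analysis.

    If any input is ruled out by a guard, the gate cannot fire,
    so none of its inputs needs further analysis.
    """
    if any(event in guarded for event in inputs):
        return []
    return list(inputs)


# Hypothetical AND gate: all three events must hold for the loss event.
and_inputs = ["X>100", "valve open", "timer expired"]
remaining = prune_and_gate(and_inputs, guarded={"X>100"})
untouched = prune_and_gate(["valve open", "timer expired"], guarded={"X>100"})
print(remaining)   # []
print(untouched)   # ['valve open', 'timer expired']
```

The pruning is only justified if the program's response to a failed assertion is itself safe, which the sketch does not address.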

A last resort to reducing the amount of fault tree analysis is simply to stop analyzing. Fault tree analysis need not analyze every function or procedure in order to yield loss event information. Any function or procedure can be left as a node in the tree, with testing or another analysis method used to determine its failure statistic. The point at which fault tree analysis should stop and testing should begin will have to be studied.

Finally, the use of a software tool to aid in the production and analysis of fault trees will obviously have a positive effect on the practicality of using SFTA. Such a tool is currently under development.

SUMMARY

In summary, the fault tree analysis technique seems well suited for safety analysis of software. It has the focus needed to give priority to catastrophic events. It is capable of completely analyzing a design for its propensity to cause a given event, with the results of the analysis being the failure scenarios. It yields information about the most important and the least important functions in the design. It details where to place effective assertions in a program as well as how to properly test the program for safety. And it can directly interface with hardware fault tree analysis so that both the hardware and software of computer-controlled systems can be analyzed. In all, it appears to be an understandable and practical technique for analyzing software safety. Future work is aimed at getting more practical experience in using SFTA.