SAS_08_Preventing_Eliminating_SWfaults_Goseva-Popstojanova
Preventing and Eliminating Software Faults through the Life Cycle
PI: Katerina Goseva-Popstojanova
Student: Margaret Hamill
Lane Dept. of Computer Science and Electrical Engineering
West Virginia University, Morgantown, WV
E-mail: [email protected]
Problem
• NASA spends time and effort to track problem reports/change data for every project. These repositories are rich, underused sources of information about the ways software systems fail and the software faults that cause those failures.
• Our goal: based on systematic and thorough analysis of the available empirical data, build quantitative and qualitative knowledge that contributes to improving software quality by
– preventing the introduction of faults into the system
– eliminating faults more efficiently throughout the life cycle
– compiling lessons learned & recommendations to be used by the pilot project and throughout the Agency
Approach

– Explore change tracking systems & automate data extraction
– Propose an appropriate classification scheme and refine it as needed
– Classify fault/failure data from the pilot project
– Identify the most frequent fault types & most frequent failure types
– Quantify multiple dimensions for each fault type
– Identify common patterns and unusual dependencies
– Compile a check list to support avoidance & elimination of different fault types
– Compile the Lessons Learned & Recommendations document
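The data-extraction and classification steps above can be sketched as a small tally over exported SCR records. This is a minimal illustration only: the record layout and the field names ("source", "activity") are hypothetical, since the project's actual change-tracking schema is not shown in this deck.

```python
from collections import Counter

# Hypothetical SCR records as they might look after export from a change
# tracking system. Field names and values are illustrative assumptions.
scrs = [
    {"id": 1, "source": "requirements fault", "activity": "analysis"},
    {"id": 2, "source": "coding fault", "activity": "dry run testing"},
    {"id": 3, "source": "coding fault", "activity": "integration testing"},
    {"id": 4, "source": "data problem", "activity": "analysis"},
]

def fault_type_distribution(records):
    """Tally the 'source' field and return the percentage per fault type."""
    counts = Counter(r["source"] for r in records)
    total = sum(counts.values())
    return {fault: 100.0 * n / total for fault, n in counts.items()}

dist = fault_type_distribution(scrs)
print(dist["coding fault"])  # coding faults are 2 of 4 records: 50.0
```

The same tally, applied per discovery activity or per CSCI, yields the distributions reported later in the deck.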
Pilot study: Basic facts
• We used a large NASA mission as a pilot study
– 21 Computer Software Configuration Items (CSCIs)
– millions of lines of code
– over 8,000 files
– developed at two different locations
• We analyzed over 2,800 Software Change Requests (SCRs) entered due to non-conformance with requirements
– collected through the software life cycle (i.e., development, testing, and on-orbit)
– over a period of almost 10 years
• To the best of our knowledge, this is the largest dataset considered so far in the published literature
Data quality
• We collaborate closely with both the IV&V team and the project team, which provide invaluable support to our work
– domain knowledge
– feedback on data quality
– guidance & verification of database queries
– insights into the significance and impact of the research results
• Our analysis is based only on data whose quality has been confirmed by the IV&V and project teams
• We report results to, and incorporate feedback from, both teams as soon as results become available (at least quarterly)
Major accomplishments
• Sources of failures (i.e., types of faults)
– Identified the most common fault types
– Showed both the internal and external validity of the results
• Activities when the problem was discovered (e.g., inspection, testing, analysis, on-orbit)
– Only 3% of SCRs are on-orbit
– Identified dominant fault types during Development & testing and On-orbit
• Severity
– Only around 8% of SCRs are safety critical (less than 1% On-orbit)
– Analyzed severity of different fault types
• Compiled internal document on lessons learned & recommendations for product and process improvement
Source of failures
• Terminology
– A fault is introduced when a human error results in a mistake in some software artifact (e.g., requirements, design, source code)
– A failure is a departure of the system behavior from its required or expected behavior
• The fault-failure relationship is a cause-effect relationship
– SCRs entered due to non-conformance with requirements throughout the life cycle (i.e., development, testing, and on-orbit) are indications of failures
– The 'source' field of an SCR categorizes the fault(s) that caused the failure
• 93% of non-conformance SCRs identify the source of failure
Correlation of number of SCRs & size of CSCIs

[Scatter plot: number of non-conformance SCRs vs. number of files per CSCI]

Larger CSCIs tend to have more non-conformance SCRs: statistically significant correlation of 0.79. Outliers were discussed with the IV&V and project teams.
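The reported correlation between CSCI size and SCR count is a plain Pearson coefficient, which can be computed with the standard library alone. The (files, SCRs) pairs below are a small illustrative subset, not the full 21-CSCI dataset, so the resulting value differs from the 0.79 reported for the full data.

```python
import math

# Hypothetical (files, SCRs) pairs per CSCI, for illustration only.
files = [207, 200, 287, 599, 1368]
scrs = [392, 24, 9, 121, 861]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(files, scrs)
print(round(r, 2))  # correlation for these illustrative values: 0.83
```

A significance test (e.g., a t-test on r) would be needed to confirm statistical significance, as the study did for the full dataset.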
Source of failures: Most common fault types
Most common sources of failure for all 21 CSCIs grouped together:
– Requirements faults (incorrect, changed & missing requirements): 33%
– Coding faults: 33%
– Data problems: 14%
Source of failures: Early vs. Late life cycle activities
• Distribution of sources of failures (i.e., fault types)
– Requirements & Design: 38.25%
• Requirements faults: 32.65%
• Design faults: 5.60%
– Coding, Interface & Integration: 48.57%
• Coding faults: 32.58%
• Data problems: 13.72%
• Integration faults: 2.27%
– 'Other': 5.80% and 'Not given': 7.38%

• This distribution of faults across life cycle activities contradicts the common belief that the majority of faults are introduced during early life cycle activities, i.e., requirements and design, a belief that dates back to some of the earliest empirical studies [Boehm et al. 1975, Endres 1975, Basili et al. 1984]
• Important question: Internal & External validity of our results
Source of failures: Internal validity

• CSCIs have different maturity
• We compared the distribution of fault types across groups of CSCIs based on the number of current releases

CSCI id   # of releases   # of files   # non-conf SCRs   non-conf SCRs per file
1         1               207          392               1.89
2         2               200          24                0.12
3         2               287          9                 0.03
4         2               228          16                0.07
5         2               269          13                0.05
6         2               321          22                0.07
7         2               289          77                0.27
8         2               270          107               0.40
9         2               125          21                0.17
10        3               356          27                0.08
11        3               444          43                0.10
12        3               277          12                0.04
13        3               599          121               0.20
14        3               280          83                0.30
15        3               81           108               1.33
16        3               169          135               0.80
17        4               747          201               0.27
18        4               587          211               0.36
19        5               552          263               0.48
20        7               1368         861               0.63
21        7               415          112               0.27
Total     --              8071         2858              0.35
Source of failures: Internal validity
CSCIs with 3 releases
Consistent results: requirements faults (34%), coding faults (25%) and data problems (17%) are the most common sources of failures
Source of failures: Internal validity
CSCIs grouped by the number of releases

# of releases   % files   % SCRs   % CSCIs   % requirements faults   % coding faults   % data problems
1               2.56      13.72    4.76      31.12                   39.54             13.01
2               24.64     10.11    38.10     39.79                   34.60             11.42
3               27.33     18.51    33.33     33.65                   25.14             16.82
4               16.53     14.42    9.52      28.40                   43.45             9.47
5               6.84      9.20     4.76      31.94                   24.71             11.03
7               22.09     34.04    9.52      32.58                   30.73             15.52
• The same three most common sources consistently dominate the fault types, accounting for 68% to 86% of the SCRs in each group
– Requirements faults: 28% - 40%
– Coding faults: 25% - 43%
– Data problems: 9% - 17%
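The internal-validity check above can be sketched as a per-group computation of the combined share of the three dominant fault types. The per-group counts below are hypothetical (the deck reports only percentages); only the computation pattern is illustrated.

```python
# Hypothetical SCR counts by fault type for two release groups; the real
# study used counts derived from the 2,858 non-conformance SCRs.
groups = {
    "3 releases": {"requirements": 178, "coding": 133, "data": 89, "other": 129},
    "7 releases": {"requirements": 317, "coding": 299, "data": 151, "other": 206},
}

def dominant_share(counts):
    """Combined percentage of requirements, coding, and data faults."""
    dominant = counts["requirements"] + counts["coding"] + counts["data"]
    return round(100.0 * dominant / sum(counts.values()), 1)

shares = {name: dominant_share(counts) for name, counts in groups.items()}
print(shares["3 releases"], shares["7 releases"])  # 75.6 78.8
```

If each group's share lands in a narrow band, as in the study's 68% to 86% range, the dominance of the three fault types is consistent across maturity levels.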
Source of failures: External validity
– 199 anomaly reports, 7 JPL unmanned spacecraft [Lutz et al. 2004]
– 600 software faults from several releases, switching system [Yu 1998]
– 427 pre- and post-release modification requests, optical network element [Leszak et al. 2002]
– 668 faults, 12 open source projects [Duraes et al. 2006]
– 408 faults, IBM operating system [Christmansson et al. 1996]
• Consistent trend across different domains, languages, development processes & organizations
Percentage of problems reported due to coding, interface and integration faults together is approximately the same or even higher than the percentage of faults due to early life cycle activities (i.e., requirements and design)
Activity when the problem was discovered
The activity being performed when the problem was discovered is identified for 99% of the non-conformance SCRs
– analysis: 48.71%
– dry run testing: 22.64%
– FQT test: 6.68%
– inspection/audit: 6.37%
– integration testing: 4.41%
– other testing: 3.99%
– on orbit: 3.01%
– other: 1.43%
– stage testing: 1.29%
– n/a: 1.01%
– regression testing: 0.24%
– acceptance testing: 0.17%
– simulation: 0.03%

Only 3% of SCRs were discovered on-orbit; 39% were discovered by testing activities.
Source of failures distribution: Development & testing vs. On-orbit

[Bar chart: number of non-conformance SCRs (logarithmic Y axis) for each source of failure (simulation problem, fabrication/manufacturing fault, i/o problem, compiler/linker/sfw dev or test tool error, procedural noncompliance, integration faults, process problem, design faults, data problem (read/write/value), requirements faults, coding faults), shown separately for Development & testing and On-orbit]

Only 3% of SCRs are on-orbit. Note the logarithmic scale of the Y axis.
Source of failures distribution: Development & testing vs. On-orbit
[Pie charts: distribution of sources of failure for Development & testing and for On-orbit SCRs, across requirements faults, coding faults, data problems (read/write/value), design faults, process problems, integration faults, procedural noncompliance, compiler/linker/sfw dev or test tool errors, i/o problems, fabrication/manufacturing faults, and simulation problems]

The contributions of coding, design & integration faults increase on-orbit, while the contributions of requirements faults & data problems decrease
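The phase-by-phase comparison behind this slide amounts to computing the fault-type distribution separately for SCRs discovered on-orbit and during development & testing. The records and field names below are hypothetical; the snippet only illustrates the split-and-tally pattern.

```python
from collections import Counter

# Hypothetical SCR records; "activity" marks where the problem was found.
scrs = [
    {"source": "coding fault", "activity": "on orbit"},
    {"source": "design fault", "activity": "on orbit"},
    {"source": "requirements fault", "activity": "analysis"},
    {"source": "coding fault", "activity": "dry run testing"},
    {"source": "requirements fault", "activity": "analysis"},
    {"source": "data problem", "activity": "integration testing"},
]

def distribution(records, on_orbit):
    """Fault-type percentages for on-orbit or development & testing SCRs."""
    subset = [r for r in records if (r["activity"] == "on orbit") == on_orbit]
    counts = Counter(r["source"] for r in subset)
    total = sum(counts.values())
    return {k: round(100.0 * v / total, 1) for k, v in counts.items()}

dev = distribution(scrs, on_orbit=False)
orbit = distribution(scrs, on_orbit=True)
print(dev["coding fault"], orbit["coding fault"])  # 25.0 50.0
```

Comparing the two dictionaries side by side shows which fault types gain or lose share on-orbit, as the slide's caption summarizes for the real data.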
Severity
• Severity is assigned by the review board when deciding whether the SCR needs to be addressed
– Sev 1: results in loss of a safety critical function
– Sev 1N: Sev 1 with an established workaround
– Sev 2: results in loss of a critical mission support capability
– Sev 2N: Sev 2 with an established workaround
– Sev 3: perceivable by the operator but neither Sev 1 nor Sev 2
– Sev 4: discrepancy not perceivable to the FSW user and usually an insignificant violation of FSW requirements
– Sev 5: not perceivable to the FSW user and usually a case where a programming standard is violated
Around 8% of all SCRs are safety critical (Development & testing 7% and On-orbit 1%)
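Bucketing the review-board codes into the critical/non-critical/unclassified classes used in the severity charts can be sketched as below. Note the mapping is our assumption inferred from the definitions (treating Sev 1 and Sev 1N as safety critical); the project's actual bucketing is not spelled out in the deck.

```python
def severity_bucket(sev: str) -> str:
    """Map a review-board severity code to a chart bucket.

    Assumption (not confirmed by the project): 'safety critical' covers
    Sev 1 and 1N; anything else with a valid code is non-critical; SCRs
    with no severity assigned are unclassified.
    """
    if sev in ("1", "1N"):
        return "critical"
    if sev in ("2", "2N", "3", "4", "5"):
        return "non-critical"
    return "unclassified"

print(severity_bucket("1N"))  # critical
print(severity_bucket("3"))   # non-critical
print(severity_bucket(""))    # unclassified
```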
Development & testing vs. On-orbit severity

– Development & testing: critical 7.00%, non-critical 52.99%, unclassified 40.01%
– On-orbit: critical 32.56%, non-critical 54.65%, unclassified 12.79%

A larger percentage of On-orbit SCRs are safety critical
Severity distribution across types of faults

[Stacked bar chart: % of SCRs per severity class (critical, non-critical, unclassified) for each source of failure: other, integration faults, process and procedure, design faults, data problems, coding faults, requirements faults]

The highest percentage of safety critical SCRs comes from coding faults (3.60%), followed by requirements faults, design faults & data problems (3.35% total)
Benefit to the pilot project
• Based on our results and the feedback from the IV&V team and project team, we compiled a document for internal use that summarizes the Lessons Learned & Recommendations for Product and Process Improvement
– Prevent the introduction of faults and improve the effectiveness of detection
• Example: increase the effort spent on design and implementation of the data repository used to share data between CSCIs
– Improve the quality of the data & change tracking process
• Example: ensure that changes to software artifacts (e.g., requirements, code) made to fix a problem are recorded and can be easily associated with a specific SCR
Broader benefit to NASA
• Understanding why, how, and when faults manifest as failures is essential to determining how their introduction into software systems can be prevented, and when and how they can be eliminated most effectively
– For the pilot project and many other NASA missions that undergo incremental development and require sustained engineering over a long period of time, these results can be used to improve system quality
– The internal and external validity of our results indicate that several observed trends are not project specific; rather, they appear to be intrinsic characteristics of software faults and failures that apply across projects
– Parts of our Lessons Learned document related to improvement of problem/change tracking systems and data quality can be used by newer initiatives such as Constellation, thus proactively avoiding common pitfalls and leading to more accurate data and more cost-efficient improvement of software quality
Technical challenges
• Assuring data quality is an important step of any empirical research effort
– Inaccurate data may lead to misleading observations
– Both the IV&V team and the project team have been extremely valuable in helping us understand the change tracking system, determine the meaning of different attributes, and verify the quality of the data
• The research approach and analysis techniques can be used by any project that tracks problem reports/change data
– However, due to the lack of a unified change tracking system, some unique work on exploring the data format and automating data extraction may be needed
Future work [FY08-10]
• Classify faults and failures using several additional attributes
• Conduct more complex, multivariable analysis
• Continually update the Lessons Learned & Recommendations for Improvement document
• Explore the best ways to prevent and eliminate the most common faults and failures throughout the life cycle; compile the results in a check list
• Increase awareness of our work so other projects within NASA can benefit from it
Acknowledgements
We thank the following NASA civil servants and contractors for their valuable support!
• Jill Broadwater
• Pete Cerna
• Susan Creasy
• Randolph Copeland
• James Dalton
• Bryan Fritch
• Nick Guerra
• John Hinkle
• Lynda Kelsoe
• Debbie Miele
• Lisa Montgomery
• Don Ohi
• Chad Pokryzwa
• David Pruett
• Timothy Plew
• Scott Radabaugh