SAS_08_Preventing_Eliminating_SWfaults_Goseva-Popstojanova
Preventing and Eliminating Software Faults through the Life Cycle
PI: Katerina Goseva-Popstojanova
Student: Margaret Hamill
Lane Dept. of Computer Science and Electrical Engineering
West Virginia University, Morgantown, WV
E-mail: [email protected]
Problem
• NASA spends time and effort to track problem reports/change data for every project. These repositories are rich, underused sources of information about the ways software systems fail and the software faults that cause those failures.
• Our goal: based on systematic and thorough analysis of the available empirical data, build quantitative and qualitative knowledge that contributes to improving software quality by
– preventing the introduction of faults into the system
– eliminating faults more efficiently throughout the life cycle
– compiling lessons learned & recommendations to be used by the pilot project and throughout the Agency
Approach

– Explore change tracking systems & automate data extraction
– Propose an appropriate classification scheme and refine it as needed
– Classify fault/failure data from the pilot project
– Identify the most frequent fault types & most frequent failure types
– Quantify multiple dimensions for each fault type
– Identify common patterns and unusual dependencies
– Compile a check list to support avoidance & elimination of different fault types
– Compile the Lessons Learned & Recommendations document
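The data-extraction and classification steps above can be sketched as a small tally over exported SCR records. This is a minimal illustration only: the record layout and the field names ("source", "activity") are hypothetical, since the project's actual change-tracking schema is not shown in this deck.

```python
from collections import Counter

# Hypothetical SCR records as they might look after export from a change
# tracking system. Field names and values are illustrative assumptions.
scrs = [
    {"id": 1, "source": "requirements fault", "activity": "analysis"},
    {"id": 2, "source": "coding fault", "activity": "dry run testing"},
    {"id": 3, "source": "coding fault", "activity": "integration testing"},
    {"id": 4, "source": "data problem", "activity": "analysis"},
]

def fault_type_distribution(records):
    """Tally the 'source' field and return the percentage per fault type."""
    counts = Counter(r["source"] for r in records)
    total = sum(counts.values())
    return {fault: 100.0 * n / total for fault, n in counts.items()}

dist = fault_type_distribution(scrs)
print(dist["coding fault"])  # coding faults are 2 of 4 records: 50.0
```

The same tally, applied per discovery activity or per CSCI, yields the distributions reported later in the deck.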
Pilot study: Basic facts
• We used a large NASA mission as a pilot study
– 21 Computer Software Configuration Items (CSCIs)
– millions of lines of code
– over 8,000 files
– developed at two different locations
• We analyzed over 2,800 Software Change Requests (SCRs) entered due to non-conformance with requirements
– collected through the software life cycle (i.e., development, testing, and on-orbit)
– over a period of almost 10 years
• To the best of our knowledge, this is the largest dataset considered so far in the published literature
Data quality
• We collaborate closely with both the IV&V team and the project team, which provide invaluable support to our work
– domain knowledge
– feedback on data quality
– guidance & verification of database queries
– insights into the significance and impact of the research results
• Our analysis is based only on data whose quality has been confirmed by the IV&V and project teams
• We report results to, and incorporate feedback from, both teams as soon as results become available (at least quarterly)
Major accomplishments
• Sources of failures (i.e., types of faults)
– Identified the most common fault types
– Showed both the internal and external validity of the results
• Activities when the problem was discovered (e.g., inspection, testing, analysis, on-orbit)
– Only 3% of SCRs are on-orbit
– Identified dominant fault types during Development & testing and On-orbit
• Severity
– Only around 8% of SCRs are safety critical (less than 1% On-orbit)
– Analyzed severity of different fault types
• Compiled internal document on lessons learned & recommendations for product and process improvement
Source of failures
• Terminology
– A fault is introduced when a human error results in a mistake in some software artifact (e.g., requirements, design, source code)
– A failure is a departure of the system behavior from its required or expected behavior
• The fault-failure relationship is a cause-effect relationship
– SCRs entered due to non-conformance with requirements throughout the life cycle (i.e., development, testing, and on-orbit) are indications of failures
– The 'source' field of an SCR categorizes the fault(s) that caused the failure
• 93% of non-conformance SCRs identify the source of failure
Correlation of number of SCRs & size of CSCIs

[Scatter plot: number of non-conformance SCRs vs. number of files per CSCI]

Larger CSCIs tend to have more non-conformance SCRs: statistically significant correlation of 0.79. Outliers were discussed with the IV&V and project teams.
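The reported correlation between CSCI size and SCR count is a plain Pearson coefficient, which can be computed with the standard library alone. The (files, SCRs) pairs below are a small illustrative subset, not the full 21-CSCI dataset, so the resulting value differs from the 0.79 reported for the full data.

```python
import math

# Hypothetical (files, SCRs) pairs per CSCI, for illustration only.
files = [207, 200, 287, 599, 1368]
scrs = [392, 24, 9, 121, 861]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(files, scrs)
print(round(r, 2))  # correlation for these illustrative values: 0.83
```

A significance test (e.g., a t-test on r) would be needed to confirm statistical significance, as the study did for the full dataset.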
Source of failures: Most common fault types
Most common sources of failure for all 21 CSCIs grouped together:
– Requirements faults (incorrect, changed & missing requirements): 33%
– Coding faults: 33%
– Data problems: 14%
Source of failures: Early vs. Late life cycle activities
• Distribution of sources of failures (i.e., fault types)
– Requirements & Design: 38.25%
• Requirements faults: 32.65%
• Design faults: 5.60%
– Coding, Interface & Integration: 48.57%
• Coding faults: 32.58%
• Data problems: 13.72%
• Integration faults: 2.27%
– 'Other': 5.80% and 'Not given': 7.38%

• This distribution of faults across life cycle activities contradicts the common belief that the majority of faults are introduced during early life cycle activities, i.e., requirements and design, a belief that dates back to some of the earliest empirical studies [Boehm et al. 1975, Endres 1975, Basili et al. 1984]
• Important question: Internal & External validity of our results
Source of failures: Internal validity

• CSCIs have different maturity
• We compared the distribution of fault types across groups of CSCIs based on the number of current releases

CSCI id   # of releases   # of files   # non-conf SCRs   non-conf SCRs per file
1         1               207          392               1.89
2         2               200          24                0.12
3         2               287          9                 0.03
4         2               228          16                0.07
5         2               269          13                0.05
6         2               321          22                0.07
7         2               289          77                0.27
8         2               270          107               0.40
9         2               125          21                0.17
10        3               356          27                0.08
11        3               444          43                0.10
12        3               277          12                0.04
13        3               599          121               0.20
14        3               280          83                0.30
15        3               81           108               1.33
16        3               169          135               0.80
17        4               747          201               0.27
18        4               587          211               0.36
19        5               552          263               0.48
20        7               1368         861               0.63
21        7               415          112               0.27
Total     --              8071         2858              0.35
Source of failures: Internal validity
CSCIs with 3 releases
Consistent results: requirements faults (34%), coding faults (25%) and data problems (17%) are the most common sources of failures
Source of failures: Internal validity
CSCIs grouped by the number of releases

# of releases   % files   % SCRs   % CSCIs   % requirements faults   % coding faults   % data problems
1               2.56      13.72    4.76      31.12                   39.54             13.01
2               24.64     10.11    38.10     39.79                   34.60             11.42
3               27.33     18.51    33.33     33.65                   25.14             16.82
4               16.53     14.42    9.52      28.40                   43.45             9.47
5               6.84      9.20     4.76      31.94                   24.71             11.03
7               22.09     34.04    9.52      32.58                   30.73             15.52
• The same three most common sources consistently dominate the fault types, accounting for 68% to 86% of the SCRs in each group
– Requirements faults: 28% - 40%
– Coding faults: 25% - 43%
– Data problems: 9% - 17%
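The internal-validity check above can be sketched as a per-group computation of the combined share of the three dominant fault types. The per-group counts below are hypothetical (the deck reports only percentages); only the computation pattern is illustrated.

```python
# Hypothetical SCR counts by fault type for two release groups; the real
# study used counts derived from the 2,858 non-conformance SCRs.
groups = {
    "3 releases": {"requirements": 178, "coding": 133, "data": 89, "other": 129},
    "7 releases": {"requirements": 317, "coding": 299, "data": 151, "other": 206},
}

def dominant_share(counts):
    """Combined percentage of requirements, coding, and data faults."""
    dominant = counts["requirements"] + counts["coding"] + counts["data"]
    return round(100.0 * dominant / sum(counts.values()), 1)

shares = {name: dominant_share(counts) for name, counts in groups.items()}
print(shares["3 releases"], shares["7 releases"])  # 75.6 78.8
```

If each group's share lands in a narrow band, as in the study's 68% to 86% range, the dominance of the three fault types is consistent across maturity levels.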
Source of failures: External validity
– 199 anomaly reports, 7 JPL unmanned spacecraft [Lutz et al. 2004]
– 600 software faults from several releases, switching system [Yu 1998]
– 427 pre- and post-release modification requests, optical network element [Leszak et al. 2002]
– 668 faults, 12 open source projects [Duraes et al. 2006]
– 408 faults, IBM operating system [Christmansson et al. 1996]
• Consistent trend across different domains, languages, development processes & organizations
Percentage of problems reported due to coding, interface and integration faults together is approximately the same or even higher than the percentage of faults due to early life cycle activities (i.e., requirements and design)
Activity when the problem was discovered
The activity being performed when the problem was discovered is identified for 99% of the non-conformance SCRs
– analysis: 48.71%
– dry run testing: 22.64%
– FQT test: 6.68%
– inspection/audit: 6.37%
– integration testing: 4.41%
– other testing: 3.99%
– on orbit: 3.01%
– other: 1.43%
– stage testing: 1.29%
– n/a: 1.01%
– regression testing: 0.24%
– acceptance testing: 0.17%
– simulation: 0.03%

Only 3% of SCRs were discovered on-orbit; 39% were discovered by testing activities.
Source of failures distribution: Development & testing vs. On-orbit

[Bar chart: number of non-conformance SCRs (logarithmic Y axis) for each source of failure (simulation problem, fabrication/manufacturing fault, i/o problem, compiler/linker/sfw dev or test tool error, procedural noncompliance, integration faults, process problem, design faults, data problem (read/write/value), requirements faults, coding faults), shown separately for Development & testing and On-orbit]

Only 3% of SCRs are on-orbit. Note the logarithmic scale of the Y axis.
Source of failures distribution: Development & testing vs. On-orbit
[Pie charts: distribution of sources of failure for Development & testing and for On-orbit SCRs, across requirements faults, coding faults, data problems (read/write/value), design faults, process problems, integration faults, procedural noncompliance, compiler/linker/sfw dev or test tool errors, i/o problems, fabrication/manufacturing faults, and simulation problems]

The contributions of coding, design & integration faults increase on-orbit, while the contributions of requirements faults & data problems decrease
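The phase-by-phase comparison behind this slide amounts to computing the fault-type distribution separately for SCRs discovered on-orbit and during development & testing. The records and field names below are hypothetical; the snippet only illustrates the split-and-tally pattern.

```python
from collections import Counter

# Hypothetical SCR records; "activity" marks where the problem was found.
scrs = [
    {"source": "coding fault", "activity": "on orbit"},
    {"source": "design fault", "activity": "on orbit"},
    {"source": "requirements fault", "activity": "analysis"},
    {"source": "coding fault", "activity": "dry run testing"},
    {"source": "requirements fault", "activity": "analysis"},
    {"source": "data problem", "activity": "integration testing"},
]

def distribution(records, on_orbit):
    """Fault-type percentages for on-orbit or development & testing SCRs."""
    subset = [r for r in records if (r["activity"] == "on orbit") == on_orbit]
    counts = Counter(r["source"] for r in subset)
    total = sum(counts.values())
    return {k: round(100.0 * v / total, 1) for k, v in counts.items()}

dev = distribution(scrs, on_orbit=False)
orbit = distribution(scrs, on_orbit=True)
print(dev["coding fault"], orbit["coding fault"])  # 25.0 50.0
```

Comparing the two dictionaries side by side shows which fault types gain or lose share on-orbit, as the slide's caption summarizes for the real data.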
Severity
• Severity is assigned by the review board when deciding whether the SCR needs to be addressed
– Sev 1: results in loss of a safety critical function
– Sev 1N: Sev 1 with an established workaround
– Sev 2: results in loss of a critical mission support capability
– Sev 2N: Sev 2 with an established workaround
– Sev 3: perceivable by the operator but neither Sev 1 nor Sev 2
– Sev 4: discrepancy not perceivable to the FSW user and usually an insignificant violation of FSW requirements
– Sev 5: not perceivable to the FSW user and usually a case where a programming standard is violated
Around 8% of all SCRs are safety critical (Development & testing 7% and On-orbit 1%)
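Bucketing the review-board codes into the critical/non-critical/unclassified classes used in the severity charts can be sketched as below. Note the mapping is our assumption inferred from the definitions (treating Sev 1 and Sev 1N as safety critical); the project's actual bucketing is not spelled out in the deck.

```python
def severity_bucket(sev: str) -> str:
    """Map a review-board severity code to a chart bucket.

    Assumption (not confirmed by the project): 'safety critical' covers
    Sev 1 and 1N; anything else with a valid code is non-critical; SCRs
    with no severity assigned are unclassified.
    """
    if sev in ("1", "1N"):
        return "critical"
    if sev in ("2", "2N", "3", "4", "5"):
        return "non-critical"
    return "unclassified"

print(severity_bucket("1N"))  # critical
print(severity_bucket("3"))   # non-critical
print(severity_bucket(""))    # unclassified
```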
Development & testing vs. On-orbit severity

– Development & testing: critical 7.00%, non-critical 52.99%, unclassified 40.01%
– On-orbit: critical 32.56%, non-critical 54.65%, unclassified 12.79%

A larger percentage of On-orbit SCRs are safety critical
Severity distribution across types of faults

[Stacked bar chart: % of SCRs per severity class (critical, non-critical, unclassified) for each source of failure: other, integration faults, process and procedure, design faults, data problems, coding faults, requirements faults]

The highest percentage of safety critical SCRs comes from coding faults (3.60%), followed by requirements faults, design faults & data problems (3.35% total)
Benefit to the pilot project
• Based on our results and the feedback from the IV&V team and project team, we compiled a document for internal use that summarizes the Lessons Learned & Recommendations for Product and Process Improvement
– Prevent the introduction of faults and improve the effectiveness of detection
• Example: increase the effort spent on design and implementation of the data repository used to share data between CSCIs
– Improve the quality of the data & change tracking process
• Example: ensure that changes to software artifacts (e.g., requirements, code) made to fix a problem are recorded and can be easily associated with a specific SCR
Broader benefit to NASA
• Understanding why, how, and when faults manifest as failures is essential to determining how their introduction into software systems can be prevented, and when and how they can be eliminated most effectively
– For the pilot project and many other NASA missions that undergo incremental development and require sustained engineering over a long period of time, these results can be used to improve system quality
– The internal and external validity of our results indicate that several observed trends are not project specific; rather, they appear to be intrinsic characteristics of software faults and failures that apply across projects
– Parts of our Lessons Learned document related to improvement of problem/change tracking systems and data quality can be used by newer initiatives such as Constellation, thus proactively avoiding common pitfalls and leading to more accurate data and more cost-efficient improvement of software quality
Technical challenges
• Assuring data quality is an important step of any empirical research effort
– Inaccurate data may lead to misleading observations
– Both the IV&V team and the project team have been extremely valuable in helping us understand the change tracking system, determine the meaning of different attributes, and verify the quality of the data
• The research approach and analysis techniques can be used by any project that tracks problem reports/change data
– However, due to the lack of a unified change tracking system, some unique work on exploring the data format and automating data extraction may be needed
Future work [FY08-10]
• Classify faults and failures using several additional attributes
• Conduct more complex, multivariable analysis
• Continually update the Lessons Learned & Recommendations for Improvement document
• Explore the best ways to prevent and eliminate the most common faults and failures throughout the life cycle; compile the results in a check list
• Increase awareness of our work so other projects within NASA can benefit from it
Acknowledgements
We thank the following NASA civil servants and contractors for their valuable support!
• Jill Broadwater
• Pete Cerna
• Susan Creasy
• Randolph Copeland
• James Dalton
• Bryan Fritch
• Nick Guerra
• John Hinkle
• Lynda Kelsoe
• Debbie Miele
• Lisa Montgomery
• Don Ohi
• Chad Pokryzwa
• David Pruett
• Timothy Plew
• Scott Radabaugh