
Page 1: A Framework for Early Reliability Assessment

Research Heaven, West Virginia

Bojan Cukic, Erdogan Gunel, Harshinder Singh, Lan Guo, Dejan Desovski
West Virginia University

Carol Smidts, Ming Li
University of Maryland

(WVU UI: Integrating Formal Methods and Testing in a Quantitative Software Reliability Assessment Framework, 2003)

Page 2: Overview

• Introduction and motivation.
• Software reliability corroboration approach.
• Case studies.
• Applying Dempster-Shafer inference to NASA datasets.
• Summary and further work.

Page 3: Introduction

• Quantification of the effects of V&V activities is always desirable.
• Is software reliability quantification practical for safety/mission critical systems?
  – Time and cost considerations may limit its appeal.
• Reliability growth models are applicable only to integration testing, the tail end of V&V.
• Estimation of operational usage profiles is rare.

Page 4: Is SRE Impractical for NASA IV&V?

• Most IV&V techniques are qualitative in nature.
  – Mature software reliability estimation methods are based exclusively on testing.
• Can IV&V techniques be utilized for reliability?
  – Requirements readings, inspections, problem reports and tracking, unit level tests…

[Diagram: lifecycle phases (Requirements, Design, Code, Test: Unit/Integration/Acceptance) with lifecycle-long IV&V spanning all phases, while traditional software reliability assessment techniques cover only the tail-end testing.]

Page 5: Contribution

• Develop software reliability assessment methods that build on:
  – Stable and mature development environments.
  – Lifecycle-long IV&V activities.
  – All relevant available information: static (SIAT), dynamic, requirements problems, severities.
  – Qualitative (formal and informal) IV&V methods.
• Strengthening the case for IV&V across the NASA enterprise:
  – Accurate, stable reliability measurement and tracking.
  – Available throughout the development lifecycle.

Page 6: Assessment vs. Corroboration

• Current thinking:
  – Software reliability is "tested into" the product through integration and acceptance testing.
• Our thinking:
  – Why "waste" the results of all the qualitative IV&V activities?
  – Testing should corroborate that the lifecycle-long IV&V techniques are giving the "usual" results, i.e., that the project follows the usual quality patterns.

Page 7: Approach

[Diagram: across the software development lifecycle, software quality measures (SQM1 … SQMj) feed reliability prediction systems (RPS1, RPS2, …, RPSk, …, RPSm); RPS outputs are combined (experience, learning, Dempster-Shafer, …) into a null hypothesis H0 and an alternative hypothesis Ha for BHT software reliability corroboration testing, yielding a trustworthy software reliability measure.]

Page 8: Software Quality Measures (roots)

• The following were used in the experiments (see the sketch below):
  – Lines of code.
  – Defect density: the number of defects that remain unresolved after testing, divided by LOC.
  – Test coverage: LOC_tested / LOC_total.
  – Requirements traceability: RT = #_requirements_implemented / #_original_requirements.
  – Function points.
  – …
• In principle, any available measure could/should be taken into account by defining an appropriate Reliability Prediction System (RPS).
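To make the root measures concrete, here is a minimal Python sketch of them as straight ratio computations; the function names and example counts are illustrative, not the instrumentation used in the experiments.

    # Minimal sketch of the root software quality measures listed above.
    # Function names and example counts are illustrative only.

    def defect_density(unresolved_defects: int, loc: int) -> float:
        """Defects remaining unresolved after testing, divided by LOC."""
        return unresolved_defects / loc

    def test_coverage(loc_tested: int, loc_total: int) -> float:
        """LOC_tested / LOC_total."""
        return loc_tested / loc_total

    def requirements_traceability(implemented: int, original: int) -> float:
        """RT = #_requirements_implemented / #_original_requirements."""
        return implemented / original

    print(defect_density(4, 5000))             # 0.0008 defects per LOC
    print(test_coverage(4200, 5000))           # 0.84
    print(requirements_traceability(47, 50))   # 0.94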

Page 9: Reliability Prediction Systems

• An RPS is a complete set of measures from which software reliability can be predicted.
• The bridge between an RPS and software reliability is a MODEL.
• Therefore, select (and collect) those measures that have the highest relevance to reliability.
  – Relevance to reliability ranked from expert opinions [Smidts 2002].

Page 10: RPS for Test Coverage

RPS model (the equation was garbled in extraction; the following is a plausible reconstruction from the notation below):

  $C_1 = \frac{LOC_T}{LOC_I + k \cdot FP_M}$
  $C_0 = a_0 \ln\left(1 + a_1\left(e^{a_2 C_1} - 1\right)\right)$
  $N = N_0 \frac{1 - C_0}{C_0}$
  $p_s = e^{-K N \tau / T_L}$

Root measure: test coverage.
Support measures:
  – Implemented LOC (LOC_I)
  – Tested LOC (LOC_T)
  – The number of defects found by test (N_0)
  – Missing function points (FP_M)
  – Backfiring coefficient (k)
  – Defects found by test (D_T)
  – Linear execution time (T_L)
  – Execution time per demand (τ)
  – Fault exposure ratio (K)

Notation:
  C_0 — defect coverage
  C_1 — test coverage (statement coverage)
  a_0, a_1, a_2 — coefficients
  N_0 — the number of defects found by test
  N — the number of defects remaining
  K — fault exposure ratio
  T_L — linear execution time
  τ — the average execution time per demand
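Below is a small Python sketch of this RPS pipeline as reconstructed above. The coefficients a0, a1, a2 and the environment constants K, T_L, and τ are placeholders (in practice they are calibrated for the environment), so the output is illustrative only.

    import math

    # Sketch of the test-coverage RPS as reconstructed above.
    # All coefficient values used below are placeholders, not calibrated constants.

    def adjusted_test_coverage(loc_t, loc_i, fp_m, k):
        """C1: tested LOC over implemented LOC plus missing function
        points converted to LOC via the backfiring coefficient k."""
        return loc_t / (loc_i + k * fp_m)

    def defect_coverage(c1, a0, a1, a2):
        """C0 = a0 * ln(1 + a1 * (exp(a2 * C1) - 1))  (reconstructed form)."""
        return a0 * math.log(1.0 + a1 * (math.exp(a2 * c1) - 1.0))

    def remaining_defects(n0, c0):
        """N = N0 * (1 - C0) / C0: defects left if testing exposed N0 of them."""
        return n0 * (1.0 - c0) / c0

    def per_demand_reliability(n, K, tau, t_l):
        """p_s = exp(-K * N * tau / T_L)."""
        return math.exp(-K * n * tau / t_l)

    c1 = adjusted_test_coverage(loc_t=4200, loc_i=5000, fp_m=3, k=53)
    c0 = defect_coverage(c1, a0=0.1, a1=0.3, a2=5.0)   # placeholder coefficients
    n = remaining_defects(n0=25, c0=c0)
    print(per_demand_reliability(n, K=4.2e-7, tau=0.1, t_l=1.0))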

 

Page 11: Approach

[Approach diagram repeated (SQM → RPS → RPS combination → BHT corroboration); at this stage the output is labeled "Software Reliability Measure".]

Page 12: Reliability "worthiness" of different RPS

32 measures ranked by five experts (excerpt):

  Measure/RPS                                 Relevance to reliability
  Failure rate                                0.98
  Test coverage                               0.83
  Mutation score                              0.75
  Fault density                               0.73
  Requirements specification change request   0.64
  Class coupling                              0.45
  Number of class methods                     0.45
  Man-hours per defect detected               0.45
  Function point analysis                     0.00

Page 13: Combining RPS

• Weighted sums used in initial experiments (see the sketch below):
  – RPS results weighted by the expert opinion index.
  – Removing inherent dependencies/correlations.
• Dempster-Shafer (D-S) belief network approach developed:
  – Network automatically built from datasets by the induction algorithm.
• Existence of suitable NASA datasets?
  – Pursuing leads with several CMM level 5 companies.
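One plausible shape for the weighted-sum combination above, sketched in Python: each RPS failure-rate prediction is weighted by the expert-opinion relevance index of its underlying measure and the weighted average is taken. The dependency/correlation-removal step is omitted, so this is a simplification of the actual scheme.

    # Sketch: relevance-weighted combination of RPS failure-rate predictions.
    # Dependency/correlation removal between RPS is omitted here.

    def combine_rps(predictions, relevances):
        """Weighted average of RPS predictions; weights are the
        expert-opinion relevance indices of the underlying measures."""
        total = sum(relevances)
        return sum(p * w for p, w in zip(predictions, relevances)) / total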

Page 14: Approach

[Approach diagram repeated; at this stage the output is labeled "Software Reliability Prediction".]

Page 15: Bayesian Inference

• Allows for the inclusion of an imprecise (subjective) probability of failure.
• The subjective estimate reflects beliefs.
• A hypothesis on the event occurrence probability is combined with new evidence, which may change the degree of belief.

Page 16: Bayesian Hypothesis Testing (BHT)

• The hypothesized reliability H0 comes as a result of RPS combination.
• Based on the level of (in)experience, a degree of belief P(H0) is assigned.
• Corroboration testing then looks for evidence in favor of the hypothesized reliability:
  – H0: λ ≤ λ0 (null hypothesis)
  – H1: λ > λ0 (alternative hypothesis)

Page 17: The number of corroboration tests according to BHT theory

  λ0       P(H0)   n0     n1      n2
  0.01     0.01    457    476     497
  0.001    0.01    2378   2671    2975
  0.0001   0.01    6831   10648   14501
  0.01     0.1     228    258     289
  0.001    0.1     636    1017    1402
  0.0001   0.1     853    3157    6150
  0.01     0.4     90     128     167
  0.001    0.4     138    411     739
  0.0001   0.4     146    1251    3260
  0.01     0.6     50     87      126
  0.001    0.6     63     269     552
  0.0001   0.6     65     827     2458
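The table reads: given a hypothesized failure rate λ0 and a prior degree of belief P(H0), n is the number of failure-free corroboration tests required before the posterior belief in H0 reaches a target confidence (n0, n1, n2 presumably correspond to increasing confidence levels). The Python sketch below uses one plausible formulation with uniform priors on λ within each hypothesis; that prior shape is an assumption of the sketch, so it approximates rather than reproduces the table.

    # BHT sketch: smallest number n of failure-free tests such that the
    # posterior probability of H0 (lambda <= lambda0) reaches `confidence`.
    # Assumed prior: P(H0) = p0 with lambda ~ Uniform(0, lambda0) under H0
    # and lambda ~ Uniform(lambda0, 1) under H1 (an assumption of this sketch).

    def corroboration_tests(lambda0: float, p0: float, confidence: float) -> int:
        n = 0
        while True:
            # Marginal likelihood of n failure-free demands under each hypothesis.
            m0 = (1.0 - (1.0 - lambda0) ** (n + 1)) / (lambda0 * (n + 1))
            m1 = (1.0 - lambda0) ** n / (n + 1)
            posterior_h0 = p0 * m0 / (p0 * m0 + (1.0 - p0) * m1)
            if posterior_h0 >= confidence:
                return n
            n += 1

    # Returns 458 under these assumptions -- close to n0 = 457 in the table's first row.
    print(corroboration_tests(lambda0=0.01, p0=0.01, confidence=0.99))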

Page 18: Controlled Experiments

• Two independently developed versions of PACS (smart-card based access control).
  – Controlled requirements document (NSA specs).

[UML class diagram of PACS: the PACS class (fields val: Validator, driver: Comm, userLCD/officerLCD: Display, audit: Auditor, name, ssn, pin; operations run(), init(), failure(), seeOfficer(), openDoor(), readCard(), readPIN()) composed with Validator (getPIN()), Display (write(), clear()), Auditor (writeEntry()), and Comm (readRegister(), writeRegister(), readCardData(), waitForOne(), waitForZero()); Exception (print()) specialized by IOException and TimeOutException.]

Page 19: RPS Experimentation

RPS predictions of system failure rates:

  Measure                           Relevance to reliability   Predicted failure rate
  Code defect density               0.85                       0.078
  Test coverage                     0.83                       0.092
  Requirements traceability         0.45                       0.078
  Function point analysis           0.00                       0.0020
  Bugs per line of code (Gaffney)   0.00                       0.000028

  Predicted failure rate (combined): 0.084
  Actual failure rate: 0.09
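Feeding the table's numbers into the relevance-weighted average sketched on the "Combining RPS" page comes out near the reported combined prediction of 0.084, which suggests the combination is essentially a relevance-weighted average (an inference from the numbers, not a statement from the slide):

    # Relevance-weighted average of the five predictions in the table above.
    predictions = [0.078, 0.092, 0.078, 0.0020, 0.000028]
    relevances  = [0.85,  0.83,  0.45,  0.00,   0.00]

    combined = sum(p * w for p, w in zip(predictions, relevances)) / sum(relevances)
    print(combined)  # ~0.0835, close to the reported 0.084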

Page 20: Reliability Corroboration

• Accurate predictors appear adequate:
  – Low levels of trust in the prediction accuracy.
  – No experience with repeatability at this point in time.

  λ0     P(H0)   # corroboration tests
  0.09   0.01    72
  0.09   0.1     47
  0.09   0.2     39
  0.09   0.5     25
  0.09   0.6     20

Page 21: "Research Side Products"

• A significant amount of time was spent studying and developing Dempster-Shafer inference networks.
• "No hope" of demonstrating this work within the scope of integrating RPS results:
  – Availability of suitable datasets.
• But some datasets are available, so use them for a D-S demo!
  – Predicting fault-prone modules in two NASA projects (KC2, JM1).
  – KC2 contains over 3,000 modules, 520 of research interest:
    • 106 modules have errors, ranging from 1 to 13.
    • 414 modules are error-free.
  – JM1 contains 10,883 modules:
    • 2,105 modules have errors, ranging from 1 to 26.
    • 8,778 modules are error-free.
  – Each dataset contains 21 software metrics, mainly McCabe and Halstead.

Page 22: How D-S Networks Work

• Combining distinct sources of evidence by the D-S scheme (see the sketch below).
• Building D-S networks by prediction logic:
  – Nodes connected by implication rules.
  – Each implication rule is assigned a specific weight.
• Updating belief for the corresponding nodes:
  – Propagating the updated belief to the neighboring nodes, and throughout the entire network.
• A D-S network can be tuned for a wide range of verification requirements.
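A minimal Python sketch of the D-S evidence-combination step for a single module over the frame {fault-prone, not fault-prone}; the mass values stand in for weighted implication rules and are illustrative, and the network construction and belief propagation machinery are omitted.

    from itertools import product

    # Dempster's rule of combination over the frame {fault-prone, not fault-prone}.
    FP, OK = "fault-prone", "not-fault-prone"
    THETA = frozenset({FP, OK})  # the whole frame: mass here means "uncommitted"

    def combine(m1, m2):
        """Dempster's rule: multiply masses of intersecting focal elements,
        discard conflicting (empty-intersection) mass, renormalize."""
        combined, conflict = {}, 0.0
        for (s1, w1), (s2, w2) in product(m1.items(), m2.items()):
            inter = s1 & s2
            if inter:
                combined[inter] = combined.get(inter, 0.0) + w1 * w2
            else:
                conflict += w1 * w2
        return {s: w / (1.0 - conflict) for s, w in combined.items()}

    # Two illustrative evidence sources, e.g., two metric-based implication rules:
    m_rule1 = {frozenset({FP}): 0.6, THETA: 0.4}
    m_rule2 = {frozenset({FP}): 0.5, frozenset({OK}): 0.2, THETA: 0.3}

    print(combine(m_rule1, m_rule2))
    # Belief in "fault-prone" rises to ~0.77 after combining both rules.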

Page 23: D-S Networks vs. ROCKY

[Two charts, KC2 (left) and JM1 (right): percent (%) vs. experiment no. (1-5), comparing Effort-DS vs. Effort-Ro, Acc-DS vs. Acc-Ro, and PD-DS vs. PD-Ro.]

Page 24: D-S Networks vs. See5

[Two charts, KC2 (left) and JM1 (right): percent (%) for the See5 classifiers (DecisionTree, RuleSet, Boosting), comparing PD-C5 vs. PD-DS and Acc-C5 vs. Acc-DS.]

Page 25: D-S Networks vs. WEKA

[Chart, KC2 dataset: percent (%) across WEKA classifiers, comparing PD-WEKA vs. PD-DS and Acc-WEKA vs. Acc-DS.]

Page 26: D-S Networks vs. WEKA

[Chart, JM1 dataset: percent (%) across WEKA classifiers (Logistic, KernelDensity, NaiveBayesSimple, J48, IBk, IB1, VotedPerceptron, VFI, HyperPipes), comparing PD-WEKA vs. PD-DS and Acc-WEKA vs. Acc-DS.]

Page 27: Status and Perspectives

• Software reliability corroboration allows:
  – Inclusion of IV&V quality measures and activities in the reliability assessment.
  – A significant reduction in the number of (corroboration) tests.
  – Software reliability of safety/mission critical systems to be assessed with reasonable effort.
• Research directions:
  – Further experimentation (data sets, measures, repeatability).
  – Defining RPS based on the "formality" of the IV&V methods.