
Page 1: Cukic Promise08 V3

HASL High Assurance Systems Lab

Comparing Design and Code Metrics for Software Quality Prediction

Y. Jiang, B. Cukic, T. Menzies
Lane Department of CSEE

West Virginia University

PROMISE 2008

Page 2: Cukic Promise08 V3


Predicting Faults Earlier Matters

• Boehm observed that removing a fault during the design phase costs 50 to 200 times less than removing it after deployment.

• NASA research shows that a fault introduced in the requirements phase, which leaks into the design, code, test, integration, and operational phases, incurs correction cost factors of 5, 10, 50, 130, and 368, respectively.

• Therefore, the earlier we can identify fault-prone artifacts, the better.

Page 3: Cukic Promise08 V3


How early?

• Do requirements metrics correlate with fault-proneness? [Jiang et al., ISSRE 2007]

Page 4: Cukic Promise08 V3


Predicting From Design Metrics?

• It has been successfully demonstrated.

• Ohlsson and Alberg ('96) demonstrated that design metrics predict fault-prone modules effectively.
  – "Design metrics are better predictors than code size (LOC)"
  – Telephone switching domain

• Basili et al. validated the so-called CK object-oriented (design) metrics using eight student-developed systems.

• Nagappan, Ball & Zeller confirmed Ohlsson's findings using OO design metrics on five Microsoft systems.

Page 5: Cukic Promise08 V3


Goal of This Study

• Thorough comparison of fault prediction models which utilize:

1. Design metrics

2. Static code metrics

3. Combination of both

• Evaluated over a large number of projects, and a large number of modules within each project.

Page 6: Cukic Promise08 V3


Metrics Description (1)

• Code metrics

Page 7: Cukic Promise08 V3


Metrics Description (2)

• Design metrics

Page 8: Cukic Promise08 V3


Experimental Design

For each metrics set (Code, Design, All):

• Classification with 10x10 cross-validation
• Illustrate results using ROC curves
• Evaluate results using AUC (trapezoid rule; see the sketch below)
• Visualize using boxplot diagrams
• Compare using nonparametric statistical tests
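A minimal sketch of the ROC/AUC step, assuming Python with NumPy and scikit-learn (the talk does not name any tooling); the labels and scores below are hypothetical placeholders, not NASA MDP data.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical module labels (1 = fault-prone) and classifier scores;
# placeholders only, not MDP data.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70, 0.30, 0.90])

fpr, tpr, _ = roc_curve(y_true, scores)  # ROC operating points (FPR, TPR)
auc = np.trapz(tpr, fpr)                 # area under the curve, trapezoid rule
print(f"AUC = {auc:.3f}")
```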

Page 9: Cukic Promise08 V3


Datasets: NASA MDP

• Used every dataset that offered both design- and code-level metrics.

Page 10: Cukic Promise08 V3


Experimental Design (2)

• 5 classification algorithms
  – Random forest, bagging, boosting, logistic regression, naive Bayes

• 10x10 cross-validation
  – Each 10-fold experiment generates one ROC curve => 10 ROCs => 10 AUCs (see the sketch after this list)

• We analyzed 1950 experiments!
  – 13 [data sets] x 3 [metrics sets] x 5 [classifiers] x 10 [CV repetitions]

• We only show the best model from each metrics set in each data set (project).
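A sketch of the 10x10 cross-validation loop, assuming scikit-learn implementations of the five classifiers listed above (the slides do not state the actual tooling); `X` and `y` are synthetic placeholders standing in for one project's metrics and fault labels.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Placeholder data standing in for one MDP project's metrics and fault labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 2, size=500)

classifiers = {
    "random forest": RandomForestClassifier(),
    "bagging": BaggingClassifier(),
    "boosting": AdaBoostClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
}

# 10 repetitions of 10-fold CV: each repetition pools its held-out predictions
# into one ROC curve, so every classifier ends up with 10 AUC values.
for name, clf in classifiers.items():
    aucs = []
    for repeat in range(10):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=repeat)
        proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
        aucs.append(roc_auc_score(y, proba))
    print(f"{name:20s} mean AUC = {np.mean(aucs):.3f}")
```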

Page 11: Cukic Promise08 V3


Analysis example: PC5 data set

[Figures: ROC curves and AUC boxplots for the PC5 models]

• Mean AUC
  – All: 0.979
  – Code: 0.967
  – Design: 0.956
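A sketch of the boxplot visualization, assuming matplotlib; the AUC lists are illustrative values chosen only to resemble the reported means, not the actual PC5 results.

```python
import matplotlib.pyplot as plt

# Ten AUC values per metrics group, e.g. collected from a 10x10 CV run;
# illustrative placeholders, not the real PC5 numbers.
auc_all    = [0.975, 0.981, 0.978, 0.980, 0.979, 0.977, 0.982, 0.978, 0.980, 0.980]
auc_code   = [0.965, 0.968, 0.966, 0.969, 0.967, 0.964, 0.970, 0.967, 0.968, 0.966]
auc_design = [0.954, 0.957, 0.955, 0.958, 0.956, 0.953, 0.959, 0.956, 0.957, 0.955]

plt.boxplot([auc_all, auc_code, auc_design])
plt.xticks([1, 2, 3], ["All", "Code", "Design"])
plt.ylabel("AUC")
plt.title("PC5: AUC per metrics group (illustrative values)")
plt.show()
```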

Page 12: Cukic Promise08 V3


Typical Results

Page 13: Cukic Promise08 V3


Not So Typical Results

Page 14: Cukic Promise08 V3


Atypical Results

Page 15: Cukic Promise08 V3


Test Statistical Significance

• Use the procedure recommended by Demsar for each of the 13 data sets.

1. The Friedman test checks whether performance differs amongst the design, code, and all experiments.
   • If no, no further test is necessary.
   • If yes, then:
2. Use pairwise nonparametric tests (typically the Wilcoxon test or the Mann-Whitney test) to determine which group of metrics is the best.

• 95% confidence level used in all experiments (see the sketch below).
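A sketch of the Friedman step with SciPy (the talk does not name a statistics package); the three AUC samples are placeholders standing in for the 10 cross-validation AUCs of one data set.

```python
from scipy.stats import friedmanchisquare

# Ten cross-validation AUCs per metrics group for one data set (placeholders).
aucs_all    = [0.975, 0.981, 0.978, 0.980, 0.979, 0.977, 0.982, 0.978, 0.980, 0.980]
aucs_code   = [0.965, 0.968, 0.966, 0.969, 0.967, 0.964, 0.970, 0.967, 0.968, 0.966]
aucs_design = [0.954, 0.957, 0.955, 0.958, 0.956, 0.953, 0.959, 0.956, 0.957, 0.955]

stat, p = friedmanchisquare(aucs_all, aucs_code, aucs_design)
if p < 0.05:   # 95% confidence level
    print(f"p = {p:.5g}: the groups differ; proceed to pairwise tests")
else:
    print(f"p = {p:.5g}: no significant difference; stop here")
```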

Page 16: Cukic Promise08 V3


Pairwise comparison

• Test the following hypotheses for pairwise comparison of two experiments A and B (see the sketch after this list).

H0: There is no difference in the performance of models built from group A metrics and group B metrics;

H1: The performance of the group A metrics is better than that of the group B metrics;

H2: The performance of the group A metrics is worse than that of the group B metrics.
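A sketch of the pairwise step, assuming SciPy's paired Wilcoxon signed-rank test; the Mann-Whitney test (`scipy.stats.mannwhitneyu`) can be run the same way on unpaired samples as a cross-check, as the talk does. The decision logic below is only an illustration of H0/H1/H2, not the authors' exact procedure.

```python
from scipy.stats import wilcoxon

def compare_groups(auc_a, auc_b, alpha=0.05):
    """Return "A > B", "A < B", or "A = B" for two AUC samples.

    H1 (A better) and H2 (A worse) are tested with one-sided Wilcoxon tests;
    if neither rejects H0 at the 95% confidence level, A and B are treated
    as equivalent.
    """
    _, p_better = wilcoxon(auc_a, auc_b, alternative="greater")
    _, p_worse = wilcoxon(auc_a, auc_b, alternative="less")
    if p_better < alpha:
        return "A > B"
    if p_worse < alpha:
        return "A < B"
    return "A = B"

# Example: compare the "all" and "code" AUC samples from the Friedman step.
# print(compare_groups(aucs_all, aucs_code))
```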

Page 17: Cukic Promise08 V3


The Result of Hypothesis Test (1)

• Friedman's test
  – Average p-value = 0.00003604 (< 0.05)
  – Strongly suggests a statistically significant difference amongst the models from all, code, and design metrics over all 13 datasets.

• The two pairwise nonparametric tests (the Wilcoxon test and the Mann-Whitney test) agree in all cases but one.
  – PC2: the Mann-Whitney test has all > code, but the Wilcoxon test has all = code.
  – This discrepancy does not affect the overall trend.

Page 18: Cukic Promise08 V3


Page 19: Cukic Promise08 V3

HASLHASL High Assurance Systems Lab

Findings

• Statistical significance tests utilized AUC for model comparison

• In 7 datasets: all = code.
• In 6 datasets: all > code.
• In all 13 datasets: all > design.
• In 12 datasets: code > design.
• The only exception is the KC4 project, where design > code.

Page 20: Cukic Promise08 V3


Summary of Observations

• The performance of models is influenced
  – MORE by the metrics
  – THAN by the classification algorithms.

• The combination of design AND code metrics provides better models than code or design metrics alone.

• Models from code metrics generally perform better than those built from design metrics only.

• Design metrics are useful for predicting fault-prone modules earlier.

• A clear indication that integrating metrics from different phases of development is useful.

Page 21: Cukic Promise08 V3


Threats to Validity

• Noise in the metrics data sets.
  – Would feature selection change some outcomes?

• Generality of the NASA datasets.

• Design metrics were reengineered from code.
  – They reflect the code base more accurately than metrics computed from design documentation.

• The "All" metrics data contains a few independent variables that appear in neither the Code nor the Design group.
  – This needs correction, but the results are unlikely to change.

Page 22: Cukic Promise08 V3


Ensuing Research

• Software fault prediction can be improved.
  – Improvement is unlikely to come from applying more off-the-shelf data mining algorithms.
  – Accounting for a project's "business context" may contribute to improvement.

• Metrics from different development stages add information not available from the code.

• Evaluation of effectiveness should be tailored to project-specific (subsystem/module-specific) risks.
  – Requires reliable metrics collection.