
Page 1: Cukic Promise08 V3

HASL High Assurance Systems Lab

Comparing Design and Code Metrics for Software Quality Prediction

Y. Jiang, B. Cukic, T. Menzies
Lane Department of CSEE

West Virginia University

PROMISE 2008

Page 2: Cukic Promise08 V3


Predicting Faults Earlier Matters

• Boehm observed that removing a fault during the design phase costs 50 to 200 times less than removing it after deployment.

• NASA research shows that a fault introduced in the requirements phase, which leaks into the design, code, test, integration, and operational phases, incurs correction cost factors of 5, 10, 50, 130, and 368, respectively.

• Therefore, the earlier we can identify fault-prone artifacts, the better.

Page 3: Cukic Promise08 V3


How early?

• Do requirements metrics correlate with fault-proneness? [Jiang et al., ISSRE 2007]

Page 4: Cukic Promise08 V3


Predicting From Design Metrics?

• It has been successfully demonstrated.

• Ohlsson and Alberg ('96) demonstrated that design metrics predict fault-prone modules effectively.
  – "Design metrics are better predictors than code size (LOC)"
  – Telephone switching domain

• Basili et al. validated the so-called CK object-oriented (design) metrics using eight student-developed systems.

• Nagappan, Ball & Zeller confirmed Ohlsson's findings using OO design metrics on five Microsoft systems.

Page 5: Cukic Promise08 V3


Goal of This Study

• Thorough comparison of fault prediction models which utilize:

1. Design metrics

2. Static code metrics

3. Combination of both

• Evaluated over a large number of projects, and a large number of modules within each project.

Page 6: Cukic Promise08 V3


Metrics Description (1)

• Code metrics

Page 7: Cukic Promise08 V3


Metrics Description (2)

• Design metrics

Page 8: Cukic Promise08 V3


Experimental Design

For each metrics set (Code, Design, All):

• Classification with 10x10 cross-validation
• Illustrate results using ROC curves
• Evaluate results using AUC (trapezoid rule; see the sketch below)
• Visualize using boxplot diagrams
• Compare using nonparametric statistical tests
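A minimal sketch of the ROC/AUC step, assuming Python with NumPy and scikit-learn (the talk does not name any tooling); the labels and scores below are hypothetical placeholders, not NASA MDP data.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical module labels (1 = fault-prone) and classifier scores;
# placeholders only, not MDP data.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70, 0.30, 0.90])

fpr, tpr, _ = roc_curve(y_true, scores)  # ROC operating points (FPR, TPR)
auc = np.trapz(tpr, fpr)                 # area under the curve, trapezoid rule
print(f"AUC = {auc:.3f}")
```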

Page 9: Cukic Promise08 V3


Datasets: NASA MDP

• Used every dataset that offered both design- and code-level metrics.

Page 10: Cukic Promise08 V3


Experimental Design (2)

• 5 classification algorithms
  – Random forest, bagging, boosting, logistic regression, naive Bayes

• 10x10 cross-validation
  – Each 10-fold experiment generates one ROC curve => 10 ROCs => 10 AUCs (see the sketch after this list)

• We analyzed 1950 experiments!
  – 13 [data sets] x 3 [metrics sets] x 5 [classifiers] x 10 [CV repetitions]

• We only show the best model from each metrics set in each data set (project).
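A sketch of the 10x10 cross-validation loop, assuming scikit-learn implementations of the five classifiers listed above (the slides do not state the actual tooling); `X` and `y` are synthetic placeholders standing in for one project's metrics and fault labels.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Placeholder data standing in for one MDP project's metrics and fault labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 2, size=500)

classifiers = {
    "random forest": RandomForestClassifier(),
    "bagging": BaggingClassifier(),
    "boosting": AdaBoostClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
}

# 10 repetitions of 10-fold CV: each repetition pools its held-out predictions
# into one ROC curve, so every classifier ends up with 10 AUC values.
for name, clf in classifiers.items():
    aucs = []
    for repeat in range(10):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=repeat)
        proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
        aucs.append(roc_auc_score(y, proba))
    print(f"{name:20s} mean AUC = {np.mean(aucs):.3f}")
```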

Page 11: Cukic Promise08 V3


Analysis example: PC5 data set

[Figures: ROC curves and AUC boxplots for the PC5 models]

• Mean AUC
  – All: 0.979
  – Code: 0.967
  – Design: 0.956
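A sketch of the boxplot visualization, assuming matplotlib; the AUC lists are illustrative values chosen only to resemble the reported means, not the actual PC5 results.

```python
import matplotlib.pyplot as plt

# Ten AUC values per metrics group, e.g. collected from a 10x10 CV run;
# illustrative placeholders, not the real PC5 numbers.
auc_all    = [0.975, 0.981, 0.978, 0.980, 0.979, 0.977, 0.982, 0.978, 0.980, 0.980]
auc_code   = [0.965, 0.968, 0.966, 0.969, 0.967, 0.964, 0.970, 0.967, 0.968, 0.966]
auc_design = [0.954, 0.957, 0.955, 0.958, 0.956, 0.953, 0.959, 0.956, 0.957, 0.955]

plt.boxplot([auc_all, auc_code, auc_design])
plt.xticks([1, 2, 3], ["All", "Code", "Design"])
plt.ylabel("AUC")
plt.title("PC5: AUC per metrics group (illustrative values)")
plt.show()
```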

Page 12: Cukic Promise08 V3


Typical Results

Page 13: Cukic Promise08 V3


Not So Typical Results

Page 14: Cukic Promise08 V3


Atypical Results

Page 15: Cukic Promise08 V3


Test Statistical Significance

• Use the procedure recommended by Demsar for each of the 13 data sets.

1. The Friedman test checks whether performance differs amongst the design, code, and all experiments.
   • If no, no further test is necessary.
   • If yes, then:
2. Use pairwise nonparametric tests (typically the Wilcoxon test or the Mann-Whitney test) to determine which group of metrics is the best.

• 95% confidence level used in all experiments (see the sketch below).
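A sketch of the Friedman step with SciPy (the talk does not name a statistics package); the three AUC samples are placeholders standing in for the 10 cross-validation AUCs of one data set.

```python
from scipy.stats import friedmanchisquare

# Ten cross-validation AUCs per metrics group for one data set (placeholders).
aucs_all    = [0.975, 0.981, 0.978, 0.980, 0.979, 0.977, 0.982, 0.978, 0.980, 0.980]
aucs_code   = [0.965, 0.968, 0.966, 0.969, 0.967, 0.964, 0.970, 0.967, 0.968, 0.966]
aucs_design = [0.954, 0.957, 0.955, 0.958, 0.956, 0.953, 0.959, 0.956, 0.957, 0.955]

stat, p = friedmanchisquare(aucs_all, aucs_code, aucs_design)
if p < 0.05:   # 95% confidence level
    print(f"p = {p:.5g}: the groups differ; proceed to pairwise tests")
else:
    print(f"p = {p:.5g}: no significant difference; stop here")
```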

Page 16: Cukic Promise08 V3


Pairwise comparison

• Test the following hypotheses for pairwise comparison of two experiments A and B (see the sketch after this list).

H0: There is no difference in the performance of models built from group A metrics and group B metrics;

H1: The performance of the group A metrics is better than that of the group B metrics;

H2: The performance of the group A metrics is worse than that of the group B metrics.
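A sketch of the pairwise step, assuming SciPy's paired Wilcoxon signed-rank test; the Mann-Whitney test (`scipy.stats.mannwhitneyu`) can be run the same way on unpaired samples as a cross-check, as the talk does. The decision logic below is only an illustration of H0/H1/H2, not the authors' exact procedure.

```python
from scipy.stats import wilcoxon

def compare_groups(auc_a, auc_b, alpha=0.05):
    """Return "A > B", "A < B", or "A = B" for two AUC samples.

    H1 (A better) and H2 (A worse) are tested with one-sided Wilcoxon tests;
    if neither rejects H0 at the 95% confidence level, A and B are treated
    as equivalent.
    """
    _, p_better = wilcoxon(auc_a, auc_b, alternative="greater")
    _, p_worse = wilcoxon(auc_a, auc_b, alternative="less")
    if p_better < alpha:
        return "A > B"
    if p_worse < alpha:
        return "A < B"
    return "A = B"

# Example: compare the "all" and "code" AUC samples from the Friedman step.
# print(compare_groups(aucs_all, aucs_code))
```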

Page 17: Cukic Promise08 V3


The Result of Hypothesis Test (1)

• Friedman's test
  – Average p-value = 0.00003604 (< 0.05)
  – Strongly suggests a statistically significant difference amongst the models from all, code, and design metrics over all 13 datasets.

• The two pairwise nonparametric tests (the Wilcoxon test and the Mann-Whitney test) agree in all cases but one.
  – PC2: the Mann-Whitney test has all > code, but the Wilcoxon test has all = code.
  – This discrepancy does not affect the overall trend.

Page 18: Cukic Promise08 V3


Page 19: Cukic Promise08 V3

HASLHASL High Assurance Systems Lab

Findings

• Statistical significance tests utilized AUC for model comparison

• In 7 datasets: all = code.
• In 6 datasets: all > code.
• In all 13 datasets: all > design.
• In 12 datasets: code > design.
• The only exception is the KC4 project, where design > code.

Page 20: Cukic Promise08 V3


Summary of Observations

• The performance of models is influenced
  – MORE by the metrics
  – THAN by the classification algorithms.

• The combination of design AND code metrics provides better models than code or design metrics alone.

• Models from code metrics generally perform better than those built from design metrics only.

• Design metrics are useful for predicting fault-prone modules earlier.

• A clear indication that integrating metrics from different phases of development is useful.

Page 21: Cukic Promise08 V3


Threats to Validity

• Noise in the metrics data sets.
  – Would feature selection change some outcomes?

• Generality of the NASA datasets.

• Design metrics were reengineered from code.
  – They reflect the code base more accurately than metrics computed from design documentation.

• The "All" metrics data contains a few independent variables that appear in neither the Code nor the Design group.
  – This needs correction, but the results are unlikely to change.

Page 22: Cukic Promise08 V3


Ensuing Research

• Software fault prediction can be improved.
  – Improvement is unlikely to come from applying more off-the-shelf data mining algorithms.
  – Accounting for a project's "business context" may contribute to improvement.

• Metrics from different development stages add information not available from the code.

• Evaluation of effectiveness should be tailored to project-specific (subsystem/module-specific) risks.
  – Requires reliable metrics collection.