Improving Software Quality Prediction Using Intelligent Computing Techniques
PhD Thesis
Zeeshan Ali Rana
2004-03-0061
Advisors: Dr. Shafay Shamail
Dr. Mian M. Awais
Department of Computer Science
School of Science and Engineering
Lahore University of Management Sciences
May 20, 2016
Dedicated to my family
Lahore University of Management Sciences
School of Science and Engineering
CERTIFICATE
We hereby recommend that the thesis prepared under our supervision by Zeeshan Ali Rana titled
Improving Software Quality Prediction Using Intelligent Computing Techniques be accepted in
partial fulfillment of the requirements for the degree of PhD.
Dr. Shafay Shamail (Advisor)
Dr. Mian M. Awais (Advisor)
Acknowledgements
A Malayan proverb says: "One can pay back the loan of gold, but one dies forever in debt to
those who are kind." Realizing this fact, I would like to pay special thanks to my advisors Dr.
Shafay Shamail and Dr. Mian M. Awais for providing me with every help possible. Their sincere
cooperation and guidance have helped me throughout the course of this research and made it possible
for me to complete this task. I express my gratitude to Dr. Naveed Arshad, who has been kind
enough to help me whenever I requested.
My gratitude will be meaningless if I am not grateful to Allah Almighty for His kindness and
blessings. His benevolence made me capable to achieve this milestone. All praises are for the
Almighty.
I would like to thank my family and friends, who have always supported me. My father has
supported me in all the hard times I faced during the course of this study. My mother has prayed
for me continuously to help me achieve this goal. My wife has been patient enough when I could
not give her time. I thank my son and daughter for being the latest motivation to complete my work,
and my brothers for their selfless support and prayers. My friends have made it look easy many
times. I would especially like to thank Malik Jahan Khan, Saqib Ilyas, Umar Suleman, Mirza Mubashar
Baig, and Junaid Akhtar for keeping me motivated and providing feedback on my work
whenever I requested it. In addition to the above, the following have made this experience memorable
for me: Khurram Nazir Junejo, Aadil Zia Khan, Fahad Javed, Khawaja Fahd, Malik Tahir Hassan,
Kamran Nishat, Khalid Mahmood Aamir, and Umar Faiz.
I am also thankful to the Higher Education Commission (HEC), Pakistan, and the Lahore University of
Management Sciences (LUMS), Pakistan, for funding this research.
List of Publications
Publications
Journal
1. Zeeshan Ali Rana, Mian M Awais, Shafay Shamail, Improving Recall in Software Defect
Models using Association Mining, Knowledge-Based Systems (KBS), Volume 90, December
2015, Pages 1-13, Elsevier.
2. Zeeshan Ali Rana, Mian Muhammad Awais, Shafay Shamail, Nomenclature Unification of
Software Product Measures, IET Software, Volume 5, Issue 1, pp. 83-102, IET Digital Library,
February 2011. IET.
Conferences and Workshops
1. Zeeshan Ali Rana, Mian M Awais, Shafay Shamail, Impact of Using Information Gain in
Software Defect Prediction Models, Intelligent Computing Theory, Lecture Notes in Computer
Science Volume 8588, 2014, pp. 637-648, 10th International Conference, ICIC 2014,
August 3-6, 2014, Taiyuan, China. Springer-Verlag.
2. Zeeshan Ali Rana, Sehrish Abdul Malik, Shafay Shamail, Mian M Awais, Identifying Asso-
ciation between Longer Itemsets and Software Defects, Lecture Notes in Computer Science
(LNCS), In Proceedings of The 20th International Conference on Neural Information Pro-
cessing (ICONIP’13), November 03-07 2013, Daegu South Korea. Springer-Verlag.
3. Hafsa Zafar, Zeeshan Ali Rana, Shafay Shamail, Mian M Awais, Finding Focused Itemsets
from Software Defect Data, In Proceedings of The 15th International Multi Topic Confer-
ence (INMIC’12), December 13-15, 2012, Islamabad Pakistan. IEEE. (Best Paper Award).
4. Zeeshan Ali Rana, Mian Muhammad Awais, Shafay Shamail, An FIS for Early Detection
of Defect Prone Modules, Lecture Notes in Computer Science (LNCS), Vol. 5755/2009. In
Proceedings of the 5th International Conference on Intelligent Computing 2009 (ICIC’09),
September 16-19, 2009, Ulsan South Korea. Springer Berlin / Heidelberg.
5. Zeeshan Ali Rana, Shafay Shamail, Mian Muhammad Awais, Ineffectiveness of Use of Soft-
ware Science Metrics as Predictors of Defects in Object Oriented Software, In Proceedings
of World Congress on Software Engineering 2009 (WCSE’09), May 19-21, 2009, Xiamen
China. IEEE.
6. Zeeshan Ali Rana, Shafay Shamail, Mian Muhammad Awais, Towards a Generic model
for software quality prediction, In Proceedings of 6th Workshop on Software Quality 2008
(WoSQ’08) in 30th ICSE, May 10-18, 2008, Leipzig Germany. ACM.
Abstract
Software Quality Prediction (SQP) has been an area of interest for the last four decades. The aim
of quality prediction is to identify the defect prone modules in software so that they can be
improved at early stages of software development. SQP is done using models that predict the
defect prone modules, and these predictions are based on software metrics. Software metrics and
defect related information are recorded in the form of datasets. These defect datasets contain
instances of defect prone and not-defect prone modules.
The major motive behind quality prediction is to identify defect prone modules correctly in the
early phases of development. Imbalanced datasets and late predictions are two problems that work
against this motive. In most of the datasets, the instances of not-defect prone modules far outnumber
the instances of defect prone modules, which creates imbalance in the datasets. Due to this
imbalance, defect prone modules are not identified effectively, and predicting them well enough to
achieve high Recall on the public datasets becomes a challenging task. Predictions based on code
metrics are considered late: the majority of the metrics in the datasets are code metrics, which
means that accurate predictions can be made only once the code metrics become available. Another
issue in the domain of software quality and metrics is that the software metrics used so far have
inconsistent nomenclature, which makes it difficult to study certain software metrics.
In this thesis an association mining (AM) based approach is proposed that improves the prediction
of defect prone modules. The proposed approach modifies the data in such a manner that a prediction
model learns defect prone modules better even if there are few instances of them. We use
Recall to measure the performance of the model developed after the proposed preprocessing. The
issue of late predictions has been handled by using a model that can work with imprecise values
of software metrics: this thesis proposes a Fuzzy Inference System (FIS) based model that helps
predict defect prone modules when exact values of code metrics are not available. To handle the
issue of inconsistent nomenclature, this thesis provides a unification and categorization framework
that works on the principle of chronological use of metric names. The framework has been used to
identify the same metrics with different names as well as different metrics with the same name.
The association mining based approach has been tested on public datasets using the Naive Bayes
classifier, which is the simplest classifier and is considered one of the best performers. The proposed
approach has increased the Recall of the Naive Bayes classifier by up to 40%. The performance
of the proposed Fuzzy Inference System (FIS) model, used to handle the issue of late predictions,
has been compared with models such as neural networks, classification trees, and linear regression
based classifiers. The FIS model has performed as well as the other models, and up to 10%
improvement in Recall has been observed for the FIS model. The nomenclature of approximately
140 metrics has been unified using the proposed unification framework. Of these 140 metrics,
approximately 6% are different metrics that have been used with the same name in the literature;
their naming issues have been resolved based on the chronological use of the names.
Achieving better Recall using the proposed approach can help avoid the costs incurred when a
defect prone module is identified late in the software lifecycle, when the cost of fixing defects
becomes higher. The proposed FIS model can be used for rough early estimates; later, more
accurate estimates can be made when code metrics become available.
CONTENTS
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Software Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Software Quality Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Software Metrics for Software Quality Prediction . . . . . . . . . . . . . . 6
1.2.2 Software Quality Prediction Models . . . . . . . . . . . . . . . . . . . . . 8
1.2.3 Prediction Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.1 Data Related . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2 Model Related . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.3 Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.1 Preprocessing Imbalanced Datasets . . . . . . . . . . . . . . . . . . . . . 16
1.4.2 Early Prediction of Defects . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.3 Resolving Software Metrics Nomenclature Issues . . . . . . . . . . . . . . 17
1.5 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.6 Thesis Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2. Background Study and Literature Survey . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1 Software Defect Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Software Defect Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Measuring Performance of SDP Models . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.2 ROC Curves and AUC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4 Software Defect Prediction Studies . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.1 Factor and Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.2 Discriminant Analysis and Principal Component Analysis . . . . . . . . . 35
2.4.3 Bayesian Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.4 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4.5 Classification/Regression Trees . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.6 Case-based Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.4.7 Fuzzy Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.4.8 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4.9 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4.10 Association Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.11 Ensemble Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.4.12 Other Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5 Performance Evaluation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.6 Studies to Remove Inconsistencies in Software Measurement Terminology . . . . . 55
2.7 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3. Software Defect Prediction Models: A Comparison . . . . . . . . . . . . . . . . . . . 61
3.1 Description of Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.1.1 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.1.2 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.1.3 Composite Hypercubes on Iterated Random Projections (CHIRP) . . . . . 63
3.1.4 Decision Table - Naive Bayes (DTNB) . . . . . . . . . . . . . . . . . . . 64
3.1.5 Fuzzy Unordered Rule Induction Algorithm (FURIA) . . . . . . . . . . . 65
3.2 Comparison Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.5 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.5.1 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4. Increasing Recall in Software Defect Prediction . . . . . . . . . . . . . . . . . . . . . 76
4.1 Proposed Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.1.1 Discretization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1.2 Horizontal Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1.3 Generating Frequent Itemsets . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1.4 Selecting Focused and Indifferent Itemsets . . . . . . . . . . . . . . . . . 80
4.1.5 Modifying Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.1.6 Time Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2 Developing Defect Prediction Model . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3.1 Identifying Performance Measure . . . . . . . . . . . . . . . . . . . . . . 85
4.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.4 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5. Early Predictions using Imprecise Data . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.1 Imprecise Inputs and Defect Prediction . . . . . . . . . . . . . . . . . . . . . . . . 106
5.2 Proposed Model based on Imprecise Inputs . . . . . . . . . . . . . . . . . . . . . 108
5.2.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.2.2 FIS Based Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.2.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.4 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6. Resolving Issues in Software Defect Datasets . . . . . . . . . . . . . . . . . . . . . . 121
6.1 Issues related to Software Defect Data . . . . . . . . . . . . . . . . . . . . . . . . 123
6.1.1 Inconsistencies in Software Product Metrics Nomenclature . . . . . . . . 123
6.1.2 Ineffective use of Software Science Metrics . . . . . . . . . . . . . . . . . 126
6.2 Proposed Approaches to Handle the Issues related to Software Defect Data . . . . 128
6.2.1 Metric Unification and Categorization (UnC) Framework . . . . . . . . . . 128
6.2.2 Proposed Approach to Show Ineffective use of SSM . . . . . . . . . . . . 131
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.3.1 Application of the UnC Framework . . . . . . . . . . . . . . . . . . . . . 134
6.3.2 Ineffective use of SSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.4 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.4.1 UnC Framework Resolves Nomenclature Issues . . . . . . . . . . . . . . . 138
6.4.2 Use of SSM Deteriorates Performance . . . . . . . . . . . . . . . . . . . . 142
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7. Conclusions and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
A. PraTo: A Practical Tool for SDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
A.1 Collecting and Combining Defect Prediction Models . . . . . . . . . . . . . . . . 171
A.2 Tool Architecture and Working . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
A.2.1 A Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
A.2.2 Dataset Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
A.2.3 Unified Metrics Database . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
A.2.4 Models Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
A.2.5 Input Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
A.2.6 Context Specification and Model Selection . . . . . . . . . . . . . . . . . 178
A.2.7 Model Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
A.2.8 Output Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
A.3 Salient Features of PraTo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
A.4 Application of PraTo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
A.4.1 Scenario Specification: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
A.4.2 Input Selection and Preprocessing: . . . . . . . . . . . . . . . . . . . . . . 183
A.5 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
A.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
B. List of Unified and Categorized Software Product Metrics . . . . . . . . . . . . . . . . 190
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
LIST OF FIGURES
1.1 A Software Quality Prediction Model . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Typical Phases in Lifecycle of a Software . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Challenges addressed in this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4 Our approach to address challenges . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.5 Our contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1 Software Defect Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Confusion Matrix of a Defect Prediction Classifier . . . . . . . . . . . . . . . . . 32
2.3 ROC curve of three classifiers with best performance of C1 . . . . . . . . . . . . . 34
3.1 CHIRP Working in training and testing . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2 Nemenyi’s Critical Difference Diagram for evaluation using AUC . . . . . . . . . 71
4.1 Major Steps of Our Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2 Preprocessing Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Questionnaire Results Showing Industry’s Response Regarding Recall . . . . . . . 92
4.4 Trend of Recall across five datasets (Continued on next page) . . . . . . . . . . . . 97
4.4 (Continued from previous page) Trend of Recall across five datasets . . . . . . . . 98
4.5 Percentage Change in Recall across 5 datasets with and without the proposed pre-
processing (Continued on next page) . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.5 (Continued from previous page) Percentage Change in Recall across 5 datasets
with and without the proposed preprocessing . . . . . . . . . . . . . . . . . . . . 102
4.6 Percentage Change in FPRate across five datasets with and without the proposed
preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.1 Accuracy of estimation as project progresses (Shari L. PFleeger, 2010). . . . . . . 107
5.2 Frequency distribution of all input metrics for kc1-classlevel data (Continued on
Next Page) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.2 (Continued from Previous Page) Frequency distribution of all input metrics for
kc1-classlevel data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3 Output of phase 1: clusters and membership functions for each input of kc1-
classlevel. (The plot of the membership functions for each input appear in the
same order as the distribution of each input appears in Figure 5.2) . . . . . . . . . 115
5.4 Frequency distribution of all input metrics for jEdit bug data (Continued on Next
Page) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.4 Frequency distribution of all input metrics for jEdit bug data . . . . . . . . . . . . 116
5.5 Output of phase 1: clusters and membership functions for each input of jEdit.
(The plot of the membership functions for each input appear in the same order as
the distribution of each input appears in Figure 5.4) . . . . . . . . . . . . . . . . . 117
5.6 ROC point for each model in training and testing . . . . . . . . . . . . . . . . . . 118
6.1 Role of UMD in the Generic Approach for Software Quality Prediction . . . . . . 122
6.2 SPdM Type I and Type II Inconsistencies . . . . . . . . . . . . . . . . . . . . . . 124
6.3 SPdM Unification and Categorization Framework . . . . . . . . . . . . . . . . . . 128
6.4 SPdM Unification and Categorization Framework: Detailed Design . . . . . . . . 129
6.5 Unification and Categorization of B . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.6 SPdM Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
A.1 The Generic Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
A.2 Input Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
A.3 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
A.4 Main Screen of PraTo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
A.5 A Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
A.6 Specifying a Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
A.7 Adding New Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
A.8 AHP based Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
LIST OF TABLES
1.1 Typical Techniques Used for Software Quality Prediction . . . . . . . . . . . . . . 10
2.1 Selected Datasets and their Characteristics . . . . . . . . . . . . . . . . . . . . 26
2.2 Selected Software metrics reported in the datasets used . . . . . . . . . . . . . . . 27
2.3 Datasets and their attributes used in this study . . . . . . . . . . . . . . . . . . . . 28
2.4 Two Major Views in Software Defect Prediction . . . . . . . . . . . . . . . . . . . 57
3.1 Results of Classifiers over Selected Data Sets in Terms of AUC . . . . . . . . . . . 67
3.2 Mean AUC and Std. Dev. Over the Complete Range of Tuning Parameters . . . . . 68
4.1 Questions asked from the Software Industry . . . . . . . . . . . . . . . . . . . . . 84
4.3 Top 5 1-Itemsets and their Support_i in each partition . . . . . . . . . . . . . . 86
4.2 Minimum Support Thresholds and Itemset Counts for Each Dataset . . . . . . . . 93
4.4 τ_t and τ_f, used in this study, for each dataset . . . . . . . . . . . . . . . . . . 94
4.5 Top 3 2-Itemsets and their Support_i in each partition . . . . . . . . . . . . . . 95
4.6 Performance of Decision Tree Model in terms of Recall . . . . . . . . . . . . . . 98
4.7 Performance of NB classifier on different number of bins with and without pro-
posed approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1 Metrics Used for this Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2 Evaluation Parameters Used for Comparison . . . . . . . . . . . . . . . . . . . . . 111
5.3 Testing Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.1 Examples of metrics with Type I inconsistency . . . . . . . . . . . . . . . . . . . 124
6.2 Examples of metrics with Type II inconsistency . . . . . . . . . . . . . . . . . . . 127
6.3 List of classification models used from WEKA (Witten et al., 2008). . . . . . . . 132
6.4 Results of numeric classification with and without SSM (Halstead, 1977). . . . . . 137
6.5 Results of numeric classification with and without SSM (Halstead, 1977). . . . . . 138
6.6 Percentage of Preserved Labels. . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.7 Distribution of Software Product Metrics in Software Development Paradigm with
Overlap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.8 Categorization with respect to Software Lifecycle Phase. . . . . . . . . . . . . . . 141
6.9 Effectiveness of SSM reported by all models . . . . . . . . . . . . . . . . . . . . . 143
A.1 Datasets List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
A.2 List of Models in Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
A.3 Winners in Terms of Recall and Acc . . . . . . . . . . . . . . . . . . . . . . . . . 185
B.1 Frequently Used Software Measures, Their Use and Applicability . . . . . . . . . 190
1. INTRODUCTION
Quality is generally considered a measure of goodness. Some authors state that the quality of an
entity is the extent to which the entity meets the needs of its users (Kan, 2002). When describing
the quality of an entity, adjectives like excellent, good, and poor are attached to the word quality.
Entities with excellent quality are considered better than entities with good or poor quality, and
entities with good quality are considered better than entities with poor quality. Organizations
working on the development of entities wish to develop entities of good and excellent quality. Generally,
quality is also considered a relative term (i.e. it varies from customer to customer, user to
user, and stakeholder to stakeholder) and is defined in different ways. The International Organization for
Standardization defines the term in ISO 9000 (ISO, 2005) as: “degree to which a
set of inherent characteristics fulfils requirements (i.e. needs or expectations that are stated, or
generally applied)”.
The British Standards Institution (BSI) (BSI, 2008) defines quality as “The totality of features and
characteristics of a product or service that bear on its ability to satisfy stated or implied needs”.
The definitions of quality by Phil Crosby and Joseph Juran also focus on conformance to requirements.
Crosby's definition (Crosby, 1979) is “Quality is conformance to requirements or specifications”,
and Juran's definition is “Quality is necessary measurable element of a product or service and is
achieved when expectations or requirements are met”.
An IEEE standard (IEEE Std. 610.12-1990) (IEEE, 1990) defines quality as
• “The degree to which a system, component or process meets specified requirements.
• The degree to which a system, component or process meets customer or user needs or expec-
tations”.
1.1 Software Quality
Software quality relates to conformance to the requirements and expectations of users or customers
(Kan, 2002). It is commonly recognized as the lack or absence of defects in software (Kan, 2002,
Jones, 2008, Maier and Rechtin, 2000). A defect is considered to be a failure to meet the user's or
customer's requirements. The perception of quality as a lack of defects corresponds to the basic
definition, i.e. conformance to requirements (Kan, 2002). Software quality has been defined in
different ways; some of these definitions are provided below:
• Caper Jones (Jones, 2008) defines software quality as “the absence of defects that would
make software either stop completely or produce unacceptable results.”
• Tom McCabe defines software quality as “high levels of user satisfaction and
low defect levels, often associated with low complexity” (Jones, 2008).
• According to John Musa, software quality is a combination of “low defect levels, adherence
of software functions to user needs, and high reliability” (Jones, 2008).
• Barry Boehm considers software quality to be “achieving high levels of user satisfaction,
portability, maintainability, robustness, and fitness for use” (Jones, 2008).
• Edward Deming thinks of software quality as “striving for excellence in reliability and func-
tions by continuous improvement in the process of development, supported by statistical
analysis of the causes of failure” (Jones, 2008).
• Watts Humphrey says that quality is “achieving excellent levels of fitness for use, confor-
mance to requirements, reliability, and maintainability” (Jones, 2008).
• James Martin says that software quality means “being on time, within budget, and meeting
user needs” (Jones, 2008).
IEEE Standard (IEEE Std. 729-1983) defines software quality as:
• “The totality of features and characteristics of a software product that bear on its ability to
satisfy given needs: for example, conform to specifications.
• The degree to which software possesses a desired combination of attributes.
• The degree to which a customer or user perceives that software meets his or her composite
expectations.
• The composite characteristics of software that determine the degree to which the software in
use will meet the expectations of the customer.”
These definitions show that software quality is an intangible concept with multiple aspects, has
more than one “standard” definition, and is considered in terms of attributes such as reliability,
maintainability, and usability. Jones (Jones, 2008) states that a working definition of software
quality must meet the following criteria:
1. Quality must be measurable when it occurs.
2. Quality should be predictable before it occurs.
The lack or absence of defects in software is one aspect of software quality (Kan, 2002,
Jones, 2008, Maier and Rechtin, 2000). This thesis uses the definition of quality that corresponds
to the lack or absence of defects. The selected definition meets the above mentioned criteria of
measurability and predictability (Jones, 2008).
Measurement of the quality highlights the potential areas of improvement in the software prod-
uct itself as well as in the process adopted to develop the product. Measurement of quality also
lets us compare one software product with another one. Therefore, software quality becomes an
important area of interest for the organizations responsible for development of software products.
The software development organizations not only desire to build a product with good intrinsic
quality but also aspire to have minimum variation in their process of development such that all the
developed software products have similar quality. Quality of software can be measured in different
ways. For example, a software product can be considered of good quality if users face fewer defects,
its number of defects per million lines of source code is low, or its number of failures per operational
hour is low. All these examples relate to the definition of software quality that the
software meets its requirements and the number of defects in its functionality is low.
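The example measures above can be expressed directly; a minimal sketch, in which the function names and the figures used are hypothetical:

```python
# Illustrative defect-based quality measures (names are hypothetical).

def defects_per_kloc(defects: int, loc: int) -> float:
    """Defects per thousand lines of source code."""
    return defects / (loc / 1000)

def failures_per_hour(failures: int, operational_hours: float) -> float:
    """Failure intensity over an operational period."""
    return failures / operational_hours

# Example: 120 defects in 400,000 lines of code -> 0.3 defects per KLOC
print(defects_per_kloc(120, 400_000))   # 0.3
print(failures_per_hour(6, 1000))       # 0.006
```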
Measuring defect potential and defect removal effectiveness is critical for the software industry
(Jones, 2008). The best way to reduce costs and shorten schedules is to reduce the quantity of defects
during development (Jones, 2008). The software community uses numerous methods to cut defect
potentials such as Team Software Process (TSP), Personal Software Process (PSP), Capability
Maturity Model (CMM and CMMI), ISO/IEC 9126, Software Process Improvement and Capabil-
ity dEtermination (SPICE), McCall’s model etc. (Shari L. PFleeger, 2010, Jones, 2008). Presence
of such models and special focus on quality by companies like IBM and HP over past decades
shows the significance of software quality (Shari L. PFleeger, 2010, Kan, 2002, Jones, 2008).
Studies show that using TSP and PSP has benefitted organizations developing applications of
large size (≥ 10,000 function points), which are complex and difficult to build, by cutting their
defect potentials by 50% (Jones, 2008). Jones (Jones, 2008) reports empirical evidence that software
quality correlates with CMM/CMMI maturity level. When compared with similar projects being
developed at CMM/CMMI Level 1, projects developed at Levels 3 to 5 show better quality and
higher productivity. However, the use of CMM/CMMI is recommended for projects with function
points ≥ 10,000. Organizations also use McCall's quality model, which considers correctness,
reliability, efficiency, integrity, usability, maintainability, testability, flexibility, portability,
reusability, and interoperability as attributes of quality that can be considered when
measuring the quality of a software product (Shari L. PFleeger, 2010). These attributes are
also known as quality factors or quality attributes. There are other attributes which are consid-
ered when measuring quality. For example IBM measures quality of its software products in terms
of: capability (functionality), usability, performance, reliability, installability, maintainability, doc-
umentation/information, service, and overall. These quality attributes measured by IBM are also
known as CUPRIMDSO (Kan, 2002). Similarly HP measures quality as FURPS (functionality, us-
ability, reliability, performance, and serviceability) (Kan, 2002). Studies on large software systems
by corporations like IBM, AT&T, and HP have highlighted the significance of software metrics.
Literature suggests the collection of software metrics as a step to measure and control software
quality (Shari L. PFleeger, 2010, Kan, 2002, Jones, 2008). These corporations use metrics to mea-
sure quality of their systems. Studies of these systems reveal that defects are not randomly
distributed in the system; rather, they are concentrated in a small set of modules known as "error-
prone modules". As a rule of thumb, 50% of defect reports in a large system may belong to 5% of
the modules. Identification of such modules helps in correcting the problems (Jones, 2008).
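The rule of thumb above can be sketched as a ranking over per-module defect counts; the counts below are hypothetical:

```python
# Sketch: rank modules by reported defects and find the smallest set of
# modules accounting for a given share of all defect reports.

def error_prone_modules(defects_by_module: dict, share: float = 0.5) -> list:
    total = sum(defects_by_module.values())
    ranked = sorted(defects_by_module, key=defects_by_module.get, reverse=True)
    selected, covered = [], 0
    for module in ranked:
        if covered >= share * total:
            break
        selected.append(module)
        covered += defects_by_module[module]
    return selected

reports = {"m1": 50, "m2": 20, "m3": 10, "m4": 10, "m5": 5, "m6": 5}
print(error_prone_modules(reports))  # ['m1'] -- one module holds 50 of 100 reports
```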
1.2 Software Quality Prediction
As mentioned earlier measuring defect potentials is one of the most effective methods to evaluate
software quality. Two important concepts required to improve software quality are:
1. Defect Prevention
2. Defect Removal
Methodologies used for defect prevention focus on lowering defect potentials and reducing number
of defects (Jones, 2008). Methods used for defect removal concentrate on improving testing effi-
ciency and defect removal efficiency. Formal design and code inspections are considered the most
effective activities to prevent and remove defects (Jones, 2008). Inspections, however, should not be
adopted merely because they are good methods to evaluate quality (Shari L. PFleeger, 2010).
Prediction of defects is another method to evaluate the quality of a software product; the choice of
evaluation technique should depend on its appropriateness to the development environment (Shari L. PFleeger, 2010).
Quality of software can be predicted beforehand (i.e. before the software becomes operational)
and defect potentials can be reduced using the prediction information. This prediction is done using
a prediction model. As shown in Figure 1.1, a quality prediction model takes software metrics as
input and indicates the predicted quality (for example number of defects). A defect prediction
model either classifies a software module as Defect-prone (D) / Not Defect-prone (ND) or predicts
the number of defects in the software. The predictions can be made either on the basis of historical
data collected during implementation of the same or similar projects (Wang et al., 2007), or using
the design metrics collected during the design phase. Both are measurement-based
prediction methods since they involve software metrics in the prediction process. Use of software
metrics in various Software Defect Prediction (SDP) studies has been effective (Ganesan et al.,
2000, Shen et al., 1985, Thwin and Quah, 2002). Mostly two kinds of metrics have been used
Fig. 1.1: A Software Quality Prediction Model
to build SDP models; Software Product Metrics (SPdMs) and Software Process Metrics (SPcM).
These types of metrics have been discussed in Section 1.2.1.
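The interface of such a model (Fig. 1.1) can be sketched as a minimal classifier taking metrics in and returning a defect-proneness label; the metric names and thresholds below are invented for illustration and not calibrated on any dataset:

```python
# Minimal sketch of the model in Fig. 1.1: software metrics in, a
# defect-proneness label out. Thresholds are hypothetical.

def classify_module(metrics: dict) -> str:
    """Return 'D' (defect-prone) or 'ND' based on simple metric thresholds."""
    if metrics.get("loc", 0) > 500 or metrics.get("cyclomatic", 0) > 10:
        return "D"
    return "ND"

print(classify_module({"loc": 820, "cyclomatic": 7}))   # D
print(classify_module({"loc": 120, "cyclomatic": 4}))   # ND
```

A real model would learn such decision boundaries from historical data rather than fix them by hand.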
Software Quality Prediction (SQP) helps software project managers in controlling the software
project and planning the resources and tests (Munson and Khoshgoftaar, 1990). These factors
facilitate development of better-quality software, resulting in higher customer satisfaction and a
good return on investment. SQP is beneficial during all phases of software development. However, the
literature suggests early prediction of defect-prone modules in order to improve software process
control, reduce defect correction effort and cost, and achieve high software reliability (Kaur et al.,
2009, Fenton et al., 2008, Sandhu et al., 2010, Khoshgoftaar et al., 1996, Jiang et al., 2007, Xing
et al., 2005, Quah and Thwin, 2003, Wang et al., 2004, Yang et al., 2007). Prediction of quality (in
terms of defects) earlier in software lifecycle reduces software costs by allowing the identification
and mitigation of risks (Khoshgoftaar et al., 1996, Xing et al., 2005, Briand et al., 1993, Gokhale
and Lyu, 1997, Wang et al., 1998, Yuan et al., 2000). Early prediction further helps in preparation
of better resource allocation (Wang et al., 2004, Yuan et al., 2000) and test plans (Yuan et al.,
2000, Khosgoftaar and Munson, 1990, Ottenstein, 1979, Mohanty, 1979). SQP also helps the
maintenance of software (Yuan et al., 2000, Jensen and Vairavan, 1985).
1.2.1 Software Metrics for Software Quality Prediction
Measuring characteristics of software is helpful in engineering of software (Fenton and Pfleeger,
1998). For example, it helps managers know whether the requirements are consistent and complete,
whether the code is ready to be tested, and whether the complexity of the code is high. Successful
project managers measure attributes of process and product to gauge quality of the product (Fenton
and Pfleeger, 1998). The software defect prediction studies mostly use two kinds of metrics:
1. Software Product Metrics (SPdMs)
2. Software Process Metrics (SPcM).
Software Product metrics are measurements of attributes associated with software itself, for ex-
ample size, complexity, number of defects in the software and relationship between the software
components. Software Process Metrics are measurements of attributes of the processes which
are carried out during the lifecycle of the software such as type of development model, effec-
tiveness of development process followed, number of meetings during a phase, performance of
testing, and budget overrun (Fenton and Pfleeger, 1998). Some metrics are directly collected for
a particular entity (for example a class or method) while others are derived i.e. they are originally
collected for a certain sub-entity (such as methods in a class) but represent the parent entity (i.e. the
class). A few examples of such metrics are sumLOC_TOTAL, avgHALSTEAD_EFFORT, and
sumCYCLOMATIC_COMPLEXITY (Koru and Liu, 2005a), which are aggregations of LOC, E, and V(G)
respectively. These have been originally collected at method level but represent the respective
classes.
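The derived metrics described above are simple aggregations of method-level values; a sketch with hypothetical numbers, using the sum/avg prefix convention:

```python
# Method-level measurements aggregated to represent the parent class
# (values are hypothetical).

methods = [
    {"LOC": 30, "HALSTEAD_EFFORT": 1200.0, "CYCLOMATIC_COMPLEXITY": 4},
    {"LOC": 45, "HALSTEAD_EFFORT": 2800.0, "CYCLOMATIC_COMPLEXITY": 7},
]

class_metrics = {
    "sumLOC_TOTAL": sum(m["LOC"] for m in methods),
    "avgHALSTEAD_EFFORT": sum(m["HALSTEAD_EFFORT"] for m in methods) / len(methods),
    "sumCYCLOMATIC_COMPLEXITY": sum(m["CYCLOMATIC_COMPLEXITY"] for m in methods),
}
print(class_metrics)
# {'sumLOC_TOTAL': 75, 'avgHALSTEAD_EFFORT': 2000.0, 'sumCYCLOMATIC_COMPLEXITY': 11}
```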
The initial work in the domain of software quality prediction was an inspiration from software
science metrics (Halstead, 1977). That work was limited to predicting the number of errors. After-
wards different organizations like Commission of the European Communities’ Strategic Program
for Research in Information Technology (Brocklehurst and Littlewood, 1992), Swedish National
Board for Industrial and Technical Development (Ohlsson and Alberg, 1996, Fenton and Ohlsson,
2000), Northern Telecom Limited, USA (Khoshgoftaar et al., 1996), Nortel, USA (Khoshgof-
taar and Allen, 1999b) started work in the area of classifying fault prone modules (Munson and
Khosgoftaar, 1992). This strategy of classifying fault-prone modules was better than the earlier
approach of predicting the quality of the whole software, for example predicting the number of defects
(Ottenstein, 1979, 1981, Schneider, 1981). It was better in the sense that it pinpointed the modules
Fig. 1.2: Typical Phases in Lifecycle of a Software
which were more fault prone and needed more attention. Software science metrics (SSM) (Hal-
stead, 1977), proposed by Halstead, are based on number of operators, operands and their usage
and have been proposed by keeping procedural paradigm in mind. These metrics are indicators
of software size and complexity (for example program length N and effort E measure size and
complexity respectively). Earlier studies have found a correlation of software size and complexity
with number of defects (Khosgoftaar and Munson, 1990, Ottenstein, 1979) and have used size and
complexity metrics as predictors of defects. Studies have used SSM for defect prediction and clas-
sification of defect prone software modules as well (Xing et al., 2005, Briand et al., 1993, Gokhale
and Lyu, 1997, Ottenstein, 1979, Jensen and Vairavan, 1985, Koru and Liu, 2005a, Munson and
Khosgoftaar, 1992, Khosgoftaar et al., 1994, Khoshgoftaar and Allen, 1999c, Khoshgoftaar and
Seliya, 2003, 2004, Li et al., 2006, Seliya and Khoshgoftaar, 2007). Fenton et al. (Fenton and
Neil, 1999) have criticized the use of SSM and other size and complexity metrics in defect prediction
models because 1) the relationship between complexity and defects is not entirely causal, and
2) defects are not simply a function of size. The majority of prediction models make these two
assumptions (Fenton and Neil, 1999). Despite the critique, various studies have used SSM to study
software developed in procedural paradigm (Xing et al., 2005, Khoshgoftaar and Allen, 1999c,
Khoshgoftaar and Seliya, 2003) as well as object oriented paradigm (Koru and Liu, 2005a, Seliya
and Khoshgoftaar, 2007, Challagulla et al., 2005).
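Halstead's measures mentioned above can be computed directly from operator and operand counts; the following sketch uses the standard formulas, with hypothetical counts:

```python
import math

# Halstead's software science metrics from operator/operand counts:
# n1/n2 = distinct operators/operands, N1/N2 = total occurrences.

def halstead(n1, n2, N1, N2):
    N = N1 + N2                      # program length
    n = n1 + n2                      # vocabulary
    V = N * math.log2(n)             # volume
    D = (n1 / 2) * (N2 / n2)         # difficulty
    E = D * V                        # effort
    return {"length": N, "volume": V, "difficulty": D, "effort": E}

m = halstead(n1=10, n2=20, N1=60, N2=80)
print(round(m["volume"], 1), round(m["effort"], 1))
```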
Since the public availability of software defect data at PROMISE repository (Menzies et al.,
2012), the number of studies on defect prediction has increased. Despite the ceiling effect on
performance of defect prediction models (Menzies et al., 2010), and certain quality issues in the
promise repository data (Shepperd et al., 2013) studies are being done to further investigate the
relationship between the product metrics and software defects (He et al., 2015, Okutan and Yildiz,
2014).
1.2.2 Software Quality Prediction Models
The typical life cycle of a software product has phases like Requirements Gathering and Analysis, Design,
Coding, Testing, and Maintenance as shown in Figure 1.2. Software quality prediction studies use
software product metrics in different stages of the software development lifecycle (SDLC) (Wang
et al., 2007, Ganesan et al., 2000, Quah and Thwin, 2003, Yang et al., 2007, Gokhale and Lyu, 1997,
Khosgoftaar and Munson, 1990, Ottenstein, 1979, Jensen and Vairavan, 1985, Munson and Khos-
goftaar, 1992, Schneider, 1981, Khoshgoftaar and Seliya, 2003, Challagulla et al., 2005, Grosser
et al., 2003, Khoshgoftaar et al., 1992, Weyuker et al., 2008, Bouktif et al., 2006). Ganesan et
al. (Ganesan et al., 2000) employed case-based reasoning to predict design faults. Yang et al.
(Yang et al., 2007) have extracted rules useful in early stages of the software lifecycle to predict defect
proneness and reliability; they suggest the use of a fuzzy self-adaptive
learning control network (FALCON) which can predict quality based on fuzzy inputs. More-
over, defect proneness of software modules has been predicted earlier in the development phase
using discriminant analysis (Khoshgoftaar et al., 1996). In order to collect the defect proneness
information as early as possible, studies have been conducted using requirements metrics (Jiang
et al., 2007). Xing et al. (Xing et al., 2005) have employed Support Vector Machines for early
quality prediction using design and static code metrics and have achieved correct classification
rates of up to 90%. Later, a study comparing design and code metrics suggests that design metrics
combined with static code metrics can help get better defect prediction results (Jiang et al., 2008c).
Grosser et al. (Grosser et al., 2003) suggested a technique which was suitable for object oriented
(OO) software only. Company specific (Li et al., 2006) and domain specific (such as for telecom-
munication systems (Khosgoftaar et al., 1994)) studies have also been presented. Because the work
has been done in various dimensions, the need for a generic model of software quality has been felt
(Bouktif et al., 2004, Wagner, 2007, Wagner and Deissenboeck, 2007, Winter et al., 2007). But
the application- and context-specific nature of existing models makes it difficult to take full ad-
vantage of the existing work. Fenton et al. (Fenton and Neil, 1999) have presented a critique of
existing models and highlighted their weaknesses. Models generic for a
certain quality factor, such as usability (Winter et al., 2007), have also been suggested. Bouktif
Tab. 1.1: Typical Techniques Used for Software Quality Prediction
Technique Used in
Bayesian Belief Network (BBN) (Fenton et al., 2007b, Fenton and Neil, 1999)
Naive Bayes (NB) (Turhan and Bener, 2009)
Linear Discriminant Analysis (LDA) (Ohlsson et al., 1998)
Regression Analysis (RA) (Gokhale and Lyu, 1997)
Case Based Reasoning (CBR) (Khoshgoftaar et al., 2006)
Classification Trees (Khoshgoftaar and Allen, 1999a)
Genetic Algorithms (GA) (Bouktif et al., 2004)
Neural Networks (NN) (Wang et al., 2004, Mahaweerawat et al., 2004)
Support Vector Machine (SVM) (Xing et al., 2005)
et al. (Bouktif et al., 2004) have suggested a technique for selecting an appropriate model from
a set of existing models. Their approach reused existing models but it was restricted to selection
of an appropriate model only. Their major focus was on facilitating a company in adapting object
oriented software quality predictors to a particular context. Unavailability of large data repositories
is considered an obstacle to generalizing, validating, and reusing existing models (Bouktif et al., 2004).
Software datasets are available at PROMISE website (Menzies et al., 2012) to validate existing
models, and we have used kc1 for our initial study. Though Wagner (Wagner, 2007) has suggested an
approach to reduce the effort of applying a prediction model, some other issues still need to be
addressed. It is difficult to avoid the context-specific behavior of a software quality predictor. To address this
issue models have been divided in different dimensions of specificity. Wagner et al. (Wagner and
Deissenboeck, 2007) have argued that software quality models differ along six dimensions namely
purpose, view, attribute, phase, technique and abstraction. Some examples of techniques used for
defect prediction models are given in Table 1.1.
The prediction models have been used to predict defects in future iterations/releases of a prod-
uct as well as in later phases of development based on data collected in earlier iterations/releases
or phases respectively (Wang et al., 2007, Ganesan et al., 2000, Xing et al., 2005, Ottenstein, 1979,
Mohanty, 1979, Ottenstein, 1981, Schneider, 1981, Bouktif et al., 2004, Xing and Guo, 2005,
Khoshgoftaar and Allen, 1999a, Brun and Ernst, 2004). For example, based on defect informa-
tion from a previous iteration, a quality prediction model for iteratively developed software can
improve after the delivery of each iteration. With the help of the empirical model an organization
can at least roughly predict the quality of the next iteration. Based on these predictions, new goals can
be set and better quality may be achieved. Also, if the metrics for a particular phase are calculated
and known, then the quality of the next phase can be determined and better resource management can
be done based on these predictions. The prediction model can use information from earlier phases
to predict defects in later phases, for example a rule-based model which is developed on the basis
of metrics like cyclomatic complexity and number of decision points in the software. If a software
component has a large number of decision points, it is likely to contain more programming errors.
Similarly, a Support Vector Machine (SVM) based technique (Xing et al., 2005) has been employed
to predict software quality in early stages of development. Hence prediction models can be devel-
oped when historical data for previous releases is available, as well as in scenarios where a significant
number of metrics are not available in earlier stages of the SDLC.
1.2.3 Prediction Performance
Performance of the models used to predict defectiveness of software modules is measured using
parameters like accuracy, recall, false positive rate, area under the Receiver Operating
Characteristic (ROC) curve, F-measure, etc. All these parameters have their own significance. For example,
accuracy of a model reflects its ability to correctly classify defect-prone as well as
not-defect-prone (or defect-free) modules. Models with high accuracy are considered good. Recall,
on the other hand, focuses on correct classification of defect-prone modules; it is desired that a
model has high recall. False positive rate indicates the incorrectly classified defect-free modules
and relates to extra testing. Therefore, lower values of false positive rate are desired if extra testing
is not affordable. Area under the ROC curve ranges from 0 to 1, and values closer to 1
are desired. Further details about these performance parameters can be found in Chapter 2.
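As a sketch of how these parameters relate, the following computes them from a binary confusion matrix, treating defect-prone (D) as the positive class; the counts are hypothetical:

```python
# Performance measures from a binary confusion matrix
# (tp/fp/tn/fn counts are hypothetical).

def performance(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)            # defect-prone modules caught
    fp_rate = fp / (fp + tn)           # defect-free modules flagged (extra testing)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, recall, fp_rate, f_measure

acc, rec, fpr, f1 = performance(tp=40, fp=20, tn=130, fn=10)
print(round(acc, 2), round(rec, 2), round(fpr, 2), round(f1, 2))
```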
Many studies have focused on maximizing area under the curve (Lessmann et al., 2008, Jiang
et al., 2008a,c, 2007). It has been reported in the literature that the limits of the standard goal of
optimizing area under the curve have been reached (Menzies et al., 2010). Maximizing area under
the ROC curve sometimes results in scenarios where higher values of recall must be
compromised, because high recall is obtained at the cost of a high false positive rate.
Therefore, when the objective is to predict as many defect-prone modules as possible, recall is
preferred over the area under the curve. For this reason, the present study does not focus on
optimizing the area under the curve. Instead, this thesis proposes a preprocessing
technique that works to increase the recall of defect prediction models.
1.3 Challenges
Most defect prediction models use software product metrics for prediction. One reason
for product-based models is the availability of public domain defect data (Menzies et al., 2012).
Existing studies, those based on public domain data in particular and others in general, have certain
limitations. Major limitations of the existing work have been categorized and discussed in the rest of
the section.
1.3.1 Data Related
Benchmark Datasets
In existing studies, sometimes data used for the study is not publicly available (Khoshgoftaar et al.,
1996, Quah and Thwin, 2003, Wang et al., 2004). Availability of benchmark data has long been
considered important, and benchmarking organizations like Gartner Group (GG), David
Consulting Group (DCG), International Software Benchmarking Group (ISBG), and PRedictOr Mod-
els In Software Engineering (PROMISE) have been developing software data repositories (Jones,
2008, Menzies et al., 2012). The NASA Metrics Data Program (MDP) (Facility) has been another
venture to make software data available to researchers. In 1997 ISBG was created so that the
benchmark data would also become available on CD, unlike the case of older groups (i.e. GG
and DCG). This data can now be purchased commercially from ISBG. In order to make the
data publicly available in soft form, the PROMISE repository was created in early 2004-05.
Researchers and practitioners can now obtain and use data without purchasing it commercially.
Literature suggests that experiments conducted using the public data can be replicated and that
studies reporting such experiments are more useful (Menzies et al., 2010, 2008).
Imbalanced Data
Available public datasets are imbalanced and have a larger number of defect-free modules. Imbal-
anced data is considered a problem for classification of the rare class in other domains as well (Wang
et al., 2005, Alshomrani et al., 2015, García et al., 2014). The imbalance in datasets prevents the
models from predicting defect-prone modules with better performance. The literature suggests the use
of Area Under the ROC Curve (AUC) to assess the performance of a model. AUC does not allow recall
to rise above a certain threshold because maximizing the area includes the cost of the FP rate.
Recall is another performance measure used to assess prediction performance. Recall focuses
on correct classification of defect-prone modules, which is more important for software companies
since they aim to deliver fewer defective modules.
Quality Issues in Public Datasets
There are issues related to quality and information in the public domain datasets. The datasets
sometimes contain single valued attributes and in some cases they have repeated rows (Shepperd
et al., 2013). The datasets include examples with conflicting class labels: for example, there are
software modules which have the same values for software metrics but one is
labeled as defect-prone while the others are labeled as not-defect-prone. Researchers have used such
datasets as they are, even though these issues deteriorate the performance of prediction models (Koru
and Liu, 2005a, Seliya and Khoshgoftaar, 2007, Challagulla et al., 2005).
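The checks described above can be sketched as follows, using toy rows with hypothetical metric values:

```python
# Sketch of the quality checks mentioned above: single-valued attributes,
# repeated rows, and conflicting class labels (toy data).

rows = [
    {"loc": 100, "v(g)": 5, "label": "D"},
    {"loc": 100, "v(g)": 5, "label": "ND"},   # same metrics, conflicting label
    {"loc": 40,  "v(g)": 2, "label": "ND"},
    {"loc": 40,  "v(g)": 2, "label": "ND"},   # exact repeated row
]

features = lambda r: (r["loc"], r["v(g)"])

# single-valued attributes: every row carries the same value
single_valued = [a for a in ("loc", "v(g)") if len({r[a] for r in rows}) == 1]

# repeated rows: identical features and label
seen, repeats = set(), 0
for r in rows:
    key = features(r) + (r["label"],)
    repeats += key in seen
    seen.add(key)

# conflicting labels: same features, different labels
labels_by_features = {}
for r in rows:
    labels_by_features.setdefault(features(r), set()).add(r["label"])
conflicts = sum(1 for v in labels_by_features.values() if len(v) > 1)

print(single_valued, repeats, conflicts)  # [] 1 1
```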
1.3.2 Model Related
Models’ Performance
The metrics in public data have been extensively used to propose new models. However, the
performance achieved has been stagnant for the last few years. Though data mining and
intelligent computing techniques are being used to get better performance, either the standard
goal needs modification or there is a need to collect different data with new information. This
ceiling effect (Menzies et al., 2008) on performance is observed when optimizing Area Under the
ROC Curve. Davis et al. (Davis and Goadrich, 2006) argue that use of ROC curves may not be
useful in case of imbalanced datasets. One reason for the ceiling effect could be the use of ROC
for the models developed using imbalanced datasets. Better performance can be achieved in terms
of other metrics such as Recall.
Late Predictions
Studies use code metrics for more accurate predictions because measurements collected from soft-
ware product code perform better than the design measures (Jiang et al., 2008c). The models based
on code metrics use certain code attributes that are not necessarily applicable in certain develop-
ment paradigms. In other cases the predictions are made at a later stage in the SDLC as they use
code metrics, thus reducing the benefits of early predictions (Xing et al., 2005, Jiang et al., 2007).
Such predictions are considered late. Jiang et al. (Jiang et al., 2007) have tested early prediction of
software quality using requirements metrics without any promising results. Design and code met-
rics have been used to identify the defect prone modules with a high success rate (Khoshgoftaar
et al., 1996, Jiang et al., 2007, Xing et al., 2005, Wang et al., 2004, Jiang et al., 2008c). Jiang et al.
(Jiang et al., 2008c) have compared the prediction models based on design and code metrics and
have asserted that the models based on code metrics usually perform better than the models based
on design metrics. The combination of the design and code metrics gives better prediction results
(Jiang et al., 2008c) but delays the defect prediction until the code metrics are available. Early
prediction models based on design and code metrics are difficult to develop because precise values
of the model inputs are not available. Also, in initial phases of the SDLC, data for prediction of the
same release is not easily available for development of a prediction model. Conventional prediction
techniques require exact inputs, therefore such models cannot always be used for early predictions
when exact data is not available. Innovative prediction methods that use imprecise inputs, however,
can be applied to overcome the requirement of exact inputs.
Appropriate Model Selection
In the presence of many software prediction techniques with variable range of their applications in
SDLC and programming paradigms, selection of a prediction model becomes difficult for an orga-
nization (or project manager) (Bouktif et al., 2004). Selection of a prediction model is generally
based on a number of parameters such as software metrics available, phase of software devel-
opment lifecycle, software development paradigm, domain of the software (Jones, 2008), quality
attribute to be predicted, product based and value based views of the model (Rana et al., 2008,
Wagner and Deissenboeck, 2007) and so on. Selecting a model on the basis of so many parameters
poses a problem and results in subjective selection of a prediction model. In order to reduce this
subjectivity, a generic approach is needed which can help in objectively selecting a model.
1.3.3 Other
Metrics Nomenclature
The software metrics being used in defect prediction studies have issues in their nomenclature
(Khosgoftaar and Munson, 1990, Jensen and Vairavan, 1985, Jiang et al., 2008c, Vaishnavi et al.,
2007, García et al., 2009). They have been used with inconsistent names and interpretations. For
example, the model by Guo et al. (Guo and Lyu, 2000) refers to lines of code as TC (Total Code
lines), whereas many other models (like (Khosgoftaar and Munson, 1990)) refer to it as LOC. On
the other hand, models by Khoshgoftaar et al. (Khoshgoftaar et al., 1996, Khoshgoftaar and Allen,
1999b) refer to TC as 'Total calls to other modules'. Similarly, other than some well-known metrics
like Halstead's program volume (V) and effort (E) (Halstead, 1977), and total Lines of Code, many
metrics have been used with different names by different researchers. Such inconsistencies make
the existing models difficult to compare, and the significance of uniform naming and definitions has
been highlighted in the literature (Vaishnavi et al., 2007, García et al., 2009).
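One way to reconcile such naming inconsistencies is an alias table that maps author-specific names to one canonical name; the table below is a hypothetical sketch drawn from the examples above, and its canonical names are a chosen convention:

```python
# Hypothetical alias table: author-specific metric names -> canonical names.

ALIASES = {
    "TC": "LOC",                 # Guo and Lyu: Total Code lines
    "Total Code lines": "LOC",
    "lines of code": "LOC",
    "program volume": "V",       # Halstead
    "effort": "E",               # Halstead
}

def canonical(name: str) -> str:
    return ALIASES.get(name, name)

print(canonical("TC"), canonical("program volume"), canonical("V"))  # LOC V V
```

Note that a flat table alone cannot resolve genuinely ambiguous names such as TC, which some authors use for 'Total calls to other modules'; resolving those requires knowing the source model, which is part of what a unified metrics database must record.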
Causality of Defects
A major objection to the models based on software product metrics has been that these models do not
focus on the causes of defects, and hence do not find a causal relationship between the software metrics
and defective software modules (Fenton and Neil, 1999). Studies to find the causal relationship
have also been conducted, for example (Fenton et al., 2008, Fenton and Neil, 1999, Fenton et al.,
2007a,c,b). However, use of product metrics for this purpose has not decreased. The studies to find
causal relationships require expert judgements which are difficult to obtain. At the same time,
the data required for causal analysis is not available in the manner the data of product metrics is
available. The available public data has resolved the issue of lack of data but has introduced certain
new challenges such as imbalanced classes and poor data quality.
Generality of Approaches
Though some attempts have been made to develop generic approaches to predict software quality
(Rana et al., 2008, Bouktif et al., 2006), there are limitations due to different interpretations,
nomenclature, and representations of model input parameters. For example, numerous product met-
rics like program Volume (V) and total Lines of Code (LOC), have been used with different names
by different researchers which has generated inconsistency (Dick and Kandel, 2003, Jensen and
Vairavan, 1985, Khosgoftaar and Munson, 1990, Koru and Liu, 2005a, Li et al., 2006, Ottenstein,
1979). The generic approach presented in (Rana et al., 2008) requires such inconsistencies to be
removed.
1.4 Our Approach
From the challenges described in the previous section, this thesis mainly addresses three limitations
shown in Figure 1.3. The first limitation is handled by finding associations between software
metrics and software defects. The second challenge is addressed by using fuzzy logic and the third
issue is resolved by standardizing naming of software metrics. This thesis focuses on software
Fig. 1.3: Challenges addressed in this thesis
product metrics. The thesis does not discuss software process metrics. It also does not discuss
product metrics which are derived.
1.4.1 Preprocessing Imbalanced Datasets
Association between software metrics and software defects can be found and used for better pre-
diction models in the presence of imbalanced datasets. This study uses Association Rule Mining
(ARM) to study the relationship between software metrics and software defects using public do-
main datasets (Menzies et al., 2012) as shown in Figure 1.4. Association Rule Mining (ARM) is an
important data mining technique and is employed for discovering interesting relationships between
variables in large databases (Sha and Chen, 2011). It aims to extract interesting correlations, fre-
quent patterns, associations or causal structures among sets of items in the transaction databases
or other data repositories (Sotiris and Dimitris, 2006). The study identifies the attributes and their
ranges that co-occur with defects. These ranges, termed focused itemsets, can be used for better
planning of resources and testing, and can further help study the relationship between metric
ranges and software defects. The thesis also shortlists certain attributes and ranges that do not nec-
essarily cause defects. Frequency of attributes with critical ranges is calculated across all datasets
under study to find the importance of each attribute. Attributes with critical ranges are more important
than attributes with indifferent ranges. The focused itemsets have been used to preprocess five
public datasets. Performance of Naive Bayes (NB), the best defect prediction model over these five
preprocessed datasets, has increased by up to 40%. The approach has not decreased the performance
of the model at any instance.
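A minimal sketch of this kind of association mining, with hypothetical metric-range transactions and thresholds (not the thesis's actual datasets or parameters):

```python
# Find metric-range items that co-occur with defects above
# support/confidence thresholds (transactions are hypothetical).

from itertools import combinations

transactions = [
    {"loc=high", "v(g)=high", "defective"},
    {"loc=high", "v(g)=high", "defective"},
    {"loc=high", "v(g)=low"},
    {"loc=low", "v(g)=low"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def rules_to_defect(min_support=0.3, min_confidence=0.6):
    """Rules of the form {metric ranges} -> defective."""
    items = {i for t in transactions for i in t} - {"defective"}
    found = []
    for k in (1, 2):
        for antecedent in combinations(sorted(items), k):
            a = set(antecedent)
            s = support(a | {"defective"})
            if s >= min_support:
                conf = s / support(a)
                if conf >= min_confidence:
                    found.append((antecedent, round(conf, 2)))
    return found

print(rules_to_defect())
```

The high-confidence antecedents play the role of focused itemsets: metric ranges in whose presence defects are frequent.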
1.4.2 Early Prediction of Defects
In order to address the issue of late predictions, this thesis presents a model that works with impre-
cise inputs. This model is a fuzzy inference system (FIS) that predicts defect proneness in software
using vague inputs defined as fuzzy linguistic variables and applies the model to real datasets, as
mentioned in Figure 1.4. The model can be used for early rough estimates when exact values of
software measurements are not available. Performance analysis in terms of recall, accuracy, mis-
classification rate, and other measures shows the usefulness of the FIS application. Predictions by
the FIS model at an early stage have been compared with conventional prediction methods (i.e.
classification trees, linear regression and neural networks) that use exact inputs. For the FIS
model, the maximum and the minimum performance shortfalls were noticed for true negative rate
(TNRate) and F-measure respectively. For recall, however, the FIS model performed better
than the other models even with the imprecise inputs. Work by Yang et al. (Yang et al., 2007) is
similar to ours in the sense that they also intend to predict the quality when exact values of software
metrics are not known and the domain knowledge and experience of the project managers can be
used to approximate the metrics values. Furthermore, they also intend to find the rule set which has
the capability to reason under uncertainty, which is a limitation of most of the quality prediction
models (Fenton and Neil, 1999). This study also extracts some rules in an attempt to overcome the
ceiling effect problem (Menzies et al., 2008) in the defect prediction models.
1.4.3 Resolving Software Metrics Nomenclature Issues
The present work removes inconsistencies found in the naming of software product metrics and unifies them by introducing a Unified Metrics Database (UMD). The UMD will be helpful for software managers who need to identify which datasets are similar to their problem domain or decide which metrics to use for their projects. It can further support future studies on software product metrics and the development of prediction models based on these metrics. As part of the development of the UMD, this thesis identifies two types of inconsistencies in the naming of attributes and presents a resolution framework. The suggested framework resolves these inconsistencies on the basis of the definition of each metric and the chronological usage of metric labels. The thesis also identifies metrics inappropriately used for defect prediction and shows that these metrics are ineffective for predicting defects in the object-oriented paradigm (Rana et al., 2009b).

Fig. 1.4: Our approach to address challenges
1.5 Our Contributions
Approach presented in the previous section has addressed the challenges in the domain of study
and has provided encouraging results. Figure 1.5 indicates our contributions when addressing
the aforementioned challenges. The association mining based preprocessing has increased perfor-
mance (Recall) of an SDPM by 40%. Fuzzy inference system developed for early prediction of
defects gives performance comparable to models that need precise inputs. The third challenge has
been addressed by developing a database of metrics with unified names. As mentioned earlier the
software defect datasets are imbalanced. This imbalance affects prediction of defect-prone mod-
ules. This thesis suggests that association between software metrics and software defects can be
used to improve prediction of defect-prone modules. The thesis proposes a preprocessing approach
that has resulted in improved Recall of Naive Bayes based model by 40% as indicated in Figure
19
1.5. These results have been obtained by experimenting with 5 datasets.
The issue of late predictions has been addressed by developing a model that can work when exact values of software metrics are difficult to collect but rough estimates of the metric values can be made. These rough estimates can be used to predict software defects using the proposed fuzzy inference system, which is based on Sugeno's fuzzy inference principle, as shown in Figure 1.5. The model has been developed for 2 datasets.
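A zero-order Sugeno system of the kind described above can be sketched as follows. The membership functions, rule base, and crisp consequents here are illustrative assumptions, not the calibrated model developed in this thesis; the inputs are normalized linguistic variables ("low"/"high" size and complexity) and the output is a weighted average of the rule consequents.

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b over the interval [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def sugeno_defect_score(size, complexity):
    """Zero-order Sugeno inference on two inputs in [0, 1].

    Returns a defect-proneness score in [0, 1]. Membership functions and
    rule consequents are illustrative choices, not fitted parameters.
    """
    low = lambda x: tri(x, -0.5, 0.0, 0.6)
    high = lambda x: tri(x, 0.4, 1.0, 1.5)
    # Each rule: (firing strength via the min t-norm, crisp consequent).
    rules = [
        (min(low(size), low(complexity)), 0.1),    # small & simple -> unlikely defect
        (min(low(size), high(complexity)), 0.6),
        (min(high(size), low(complexity)), 0.5),
        (min(high(size), high(complexity)), 0.9),  # large & complex -> likely defect
    ]
    num = sum(w * z for w, z in rules)
    den = sum(w for w, _ in rules)
    return num / den if den else 0.0
```

A manager's rough estimate such as "size is high, complexity is high" maps to `sugeno_defect_score(0.9, 0.9)`, which returns a high score, while "low, low" returns a low one.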
The issue of differing nomenclature for software metrics has been resolved by suggesting and applying a framework that preserves, for each metric, the name with the earliest chronological association. More than 140 software metrics have been categorized using this framework and a unified database has been developed, as specified in Figure 1.5.
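The chronological part of the resolution framework can be illustrated with a small sketch: among labels that share a metric definition, the earliest label is kept as canonical. The tuple format and the `definition_key` that groups labels with matching definitions are hypothetical simplifications; in practice, deciding that two definitions match requires the manual analysis described later in the thesis.

```python
def unify_metrics(usages):
    """Map each metric alias to a canonical label.

    usages: list of (year, label, definition_key) tuples, where
    definition_key identifies labels that share the same definition
    (an assumption for this sketch). The canonical label for a
    definition is the chronologically earliest label used for it.
    Returns {alias: canonical_label}.
    """
    earliest = {}  # definition_key -> (year, label)
    for year, label, key in usages:
        if key not in earliest or year < earliest[key][0]:
            earliest[key] = (year, label)
    return {label: earliest[key][1] for _, label, key in usages}
```

For example, if `CC` and `cyclo` were later labels for the metric first published as `v(G)`, all three aliases resolve to `v(G)`.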
In addition to the above contributions, a tool for practitioners and researchers has been developed that helps its users predict defects using 11 models. The tool also allows users to aggregate model results so that more than one model can be benefitted from. Users can compare the performance of different models on benchmark datasets, specify their scenario, and select the most suitable model for it. Users can also provide a dataset and find the public domain dataset most similar to theirs.
1.6 Thesis Layout
The rest of the thesis is organized as follows:
Chapter 2 presents the background of the software defect prediction domain, the metrics and datasets used for prediction, and a survey of software prediction and related studies. The survey includes studies with models based on data mining, machine learning, and statistical techniques. The models used for defect prediction include linear regression, pace regression, support vector regression, logistic regression, neural networks, naive Bayes, Bayesian networks, instance-based learning, J48 trees, linear and quadratic discriminant analysis, k-nearest neighbors, support vector machines, classification and regression trees, and random forests. The chapter also describes the prediction studies based on these models and their limitations.
Chapter 3 presents the comparative study performed to select the best model from the existing ones. It presents, in detail, a comparative study of defect prediction models and the criteria to select the best model for improving Recall. This (replicated) study is inspired by the work of Lessmann et al. (Lessmann et al., 2008) and uses their method and statistics for comparison. The study presented in Chapter 3 compares three additional models with the two models they adjudged the best performing in defect prediction.

Fig. 1.5: Our contributions
Chapter 4 presents an association mining based approach to preprocess data and improve the Recall of defect prediction models. Most of the public datasets (Menzies et al., 2012) have a significantly larger number of examples corresponding to Not Defect-prone (ND) modules than examples corresponding to Defect-prone (D) modules, which prevents a model from learning D modules effectively. The preprocessing approach presented in this chapter allows models to learn D modules in imbalanced defect datasets. The chapter presents the results of developing a Naive Bayes (NB) classifier (one of the best classifiers, along with Random Forests, in the field of defect prediction (Menzies et al., 2010, Lessmann et al., 2008)) for five datasets. Up to 40% performance gain has been observed in terms of Recall. The stability of the approach has been tested by running the algorithm with different numbers of bins.
A fuzzy logic based model for early prediction of defect-prone modules is given in Chapter 5. It is useful to predict defect-prone modules before code metrics are available, since defects caught later in the lifecycle have higher costs of fixing. This chapter proposes a model that works with imprecise inputs and can be used for early rough estimates when exact values of software measurements are not available.
Chapter 6 addresses limitations due to differing interpretations, nomenclature, and representation of model input parameters. The chapter identifies two types of inconsistencies in the naming of software product metrics and presents a resolution framework that resolves them on the basis of the definition of each metric and the chronological usage of its name. A Unified Metrics Database (UMD) is also introduced in this chapter. Moreover, the chapter identifies metrics that are not necessarily applicable in certain development paradigms (for example, the object-oriented paradigm) and shows that these metrics degrade the performance of prediction models.
Chapter 7 presents the conclusion and future directions.
Appendix A presents a tool developed during the current study. The prototype tool, named PraTo, has been developed to address the aforementioned limitations and is envisaged to increase the utility of prediction models for the software industry. PraTo has a collection of over 15 defect prediction techniques and a number of public datasets. The public datasets are available in original as well as preprocessed form to improve the performance of the models to be built. PraTo also enables a user to select a suitable model for a given scenario by asking the user to provide a context; based on the context, the most suitable model is used for prediction. A user may specify a scenario in the following ways:
1. Compare a dataset with the public datasets. The closest public dataset defines the user's scenario, and the model with the best performance, in terms of recall, on that public dataset is selected.
2. Provide a set of constraints in terms of Human Resource, Budget and Time. Based on the
severity of the constraints, values of the performance measures are determined. The model
with the best values of the performance measures is selected.
3. Select three model performance measures. Provide relative importance of the measures. An
Analytic Hierarchical Processing (AHP) based technique is applied to select a model.
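The AHP step in the third option can be sketched as follows. The pairwise-comparison matrix, the geometric-mean approximation of the priority vector, and the weighted-sum ranking are standard AHP ingredients but illustrative simplifications here, not PraTo's actual implementation.

```python
def ahp_weights(pairwise):
    """Priority weights from a reciprocal pairwise-comparison matrix,
    using the geometric-mean (logarithmic least squares) approximation."""
    n = len(pairwise)
    geo_means = []
    for row in pairwise:
        prod = 1.0
        for v in row:
            prod *= v
        geo_means.append(prod ** (1.0 / n))
    total = sum(geo_means)
    return [g / total for g in geo_means]

def rank_models(measure_weights, model_scores):
    """Pick the model whose weighted sum of (normalized) measures is best."""
    return max(model_scores,
               key=lambda m: sum(w * s for w, s in zip(measure_weights,
                                                       model_scores[m])))
```

For example, if recall is judged 3 times as important as accuracy and 5 times as important as precision, the matrix `[[1, 3, 5], [1/3, 1, 3], [1/5, 1/3, 1]]` yields weights near (0.64, 0.26, 0.10), and the model scoring best under those weights is selected.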
Appendix A also suggests an approach composed of existing models (called component models). Given a software system described by its collected metrics, the approach selects the most appropriate applicable model to predict the desired quality factor value. Like the integrated approach suggested by Wagner and Deissenboeck (Wagner and Deissenboeck, 2007), our approach also requires prior analysis of the important quality factors. The proposed approach automates the selection of an appropriate prediction model and calculates the desired quality factor on the basis of the selected model. To propose this approach we have studied the techniques applied for software quality prediction in the structural and object-oriented programming paradigms, and we have presented a logical grouping of the dimensions along which a prediction technique should be selected. This multidimensional model helps quality managers in the selection of a prediction method.
Appendix B lists the software metrics as they appear in a Unified Metrics Database (UMD) and
includes base as well as derived metrics.
1.7 Summary
Software quality has multiple facets, and the defectiveness of a software module is one of them. Early information regarding the defectiveness of software modules helps in resource planning and in focusing on potentially problematic areas of the software. Defect prediction models are used to obtain this defectiveness information beforehand: they classify software modules as defect-prone or not-defect-prone based on software metrics. The availability of public datasets in the PROMISE repository (Menzies et al., 2012) has resulted in numerous defect prediction studies.
There are many challenges faced by the defect prediction community such as:
1. Public domain datasets being used to validate defect prediction models are imbalanced.
Models developed using these datasets are unable to achieve very high Recall.
2. Most of the studies use code metrics available in public datasets. Predictions based on code
metrics are late.
3. Available datasets and software defect prediction studies use different nomenclature for cer-
tain metrics.
4. Relationship between software metrics and defects is unknown.
5. Existence of numerous models has made it difficult to select an appropriate prediction model.
6. There are quality issues in the public domain datasets: they sometimes contain single-valued attributes and, in some cases, repeated rows.

7. The datasets include examples with conflicting class labels; for example, more than one software module may have the same values for all software metrics while one is labeled defect-prone and the others not-defect-prone.
The above list of challenges in the field of software quality prediction in general, and software defect prediction in particular, is not exhaustive. Other issues are being addressed by numerous studies that mine software repositories to improve software quality. The present work is one such study; it addresses the first three limitations and presents remedies in the following chapters.
2. BACKGROUND STUDY AND LITERATURE SURVEY
A large number of Software Defect Prediction (SDP) models have been developed and reported in
literature. This chapter briefly describes software defect prediction and discusses datasets used for
development of the SDP models. This chapter also presents a survey of various software quality
prediction studies. Furthermore, studies related to evaluation of defect prediction models and
handling inconsistencies in software measurements terminology are also discussed. The chapter
also characterizes viewpoints found in the literature, and highlights limitations and issues in the
domain of software defect prediction.
2.1 Software Defect Prediction
Software defect prediction is the estimation of the defect proneness of a software system (or its modules). The prediction is made using a model that takes software metrics as input and provides defect proneness information as output, as shown in Figure 2.1. Software metrics are data collected about the software during its lifecycle, for example data about the structural complexity of the software, coupling between software modules, and cohesion of software modules. These data are sometimes made available to support defect prediction studies; many software development organizations choose not to share their software related data publicly, and hence their data is used in a limited number of defect prediction studies only. Prediction models are typically based on artificial intelligence techniques. Software defect prediction models are usually evaluated using a confusion matrix (Jiawei and Micheline, 2002); the area under the Receiver Operating Characteristic (ROC) curve is another measure used to gauge their performance.
The following sections first discuss software defect datasets, then explain the model evaluation parameters, and afterwards present a summary of defect prediction studies.
Fig. 2.1: Software Defect Prediction
2.2 Software Defect Data
Software defect datasets provide software measurements and defectiveness information as attributes. Measurements are represented as software metrics, where each metric becomes an attribute of the dataset, and the presence or absence of defects is provided as an additional attribute. A dataset is a collection of records, each representing a software system or a software module.

The datasets used in defect prediction studies describe the defectiveness of various software modules. Each instance in a dataset represents a software module, and the attributes are the software metrics calculated for that module. The class attribute in each dataset indicates whether the module is defect-prone (D) or not-defect-prone (ND). Given the number of available datasets and metrics, it is not possible to describe all of them in a single study. We have selected 17 datasets based on their use in the literature. The selected datasets have similar attributes, which makes comparison relatively easy. Table 2.1 lists these datasets and their characteristics, such as the number of instances in each dataset, the percentage of (instances of) ND modules, and the development language of the software. The percentages show that the datasets are dominated by ND modules.
Tab. 2.1: Selected Datasets and their Characteristics

Dataset          Language  No. of Instances  ND Modules (Approx. %)  No. of Attributes  Group
CM1              C              498                90                       22             1
JM1              C            10885                81                       22             1
KC1              C++           2109                85                       22             1
KC2              C++            522                80                       22             1
PC1              C             1109                93                       22             1
PC3              C             1563                90                       38             2
PC4              C             1458                88                       38             2
MW1              C++            403                92                       38             2
AR1              C              121                93                       30             3
AR4              C              107                81                       30             3
AR6              C              101                85                       30             3
PC5              C++          17186                97                       39             4
MC1              C++           9466                99                       39             4
KC3              Java           458                91                       40             5
KC1-class-level  C++            145                59                       95             6
jEdit-Bug-Data   Java           274                                          8             6
These datasets differ from each other in at least two ways: they use different attributes (software metrics), and they have different numbers of attributes. The datasets have been divided into six groups based on the number of common attributes, as shown in Table 2.1. Group 6 consists of 2 datasets that have no metrics in common with the other datasets and use metrics from the object-oriented paradigm. This grouping is used again when evaluating the suggested preprocessing approach in Chapter 4. Certain variants of the datasets have also been used, but their use is not part of the main thesis. Another grouping, and the preparation of variants of the datasets, can be found in Appendix A.
Table 2.2 provides a list of the metrics used in these datasets; descriptions of other metrics can be found in Appendix B. Table 2.3 provides information on the metrics used in each dataset. A cross (×) in a cell indicates the presence of a metric in the corresponding dataset. Metrics information for datasets not listed in this table can be found in the respective chapters.
Tab. 2.2: Selected Software metrics reported in the datasets used
Software Metric Name
v(G) McCabe’s Cyclomatic Complexity
ev(G) Essential Complexity
iv(G) Design Complexity
LOC McCabe’s Lines of Code
V Halstead Program Volume
L Halstead Program Level
D Halstead Program Difficulty
I Halstead Intelligence Content
E Halstead Effort
B Halstead Error Estimate
T Halstead Programming Time
N Halstead Program Length in terms of Total Op and Total Opnd
LOCode Halstead’s Lines of Code
LOComment Lines of Comment
LOBlank Blank Lines
LOCodeAndComment Lines of Code and Comment
Uniq Op Unique Operators
Uniq Opnd Unique Operands
Total Op Total Operators
Total Opnd Total Operands
BranchCount Total Branches in Program
Tab. 2.3: Datasets and their attributes used in this study

Attribute  CM1 JM1 KC1 KC2 KC3 PC1 PC3 PC4 PC5 MC1 MW1 AR1 AR4 AR6

loc × × × × × × × × × × × × × ×
v(g) × × × × × × × × × × × × × ×
ev(g) × × × × × × × × × × ×
iv(g) × × × × × × × × × × × × × ×
CALL PAIRS × × × × × × ×
CONDITION COUNT × × × × × × × × ×
CYCLOMATIC DENSITY × × × × × × × × ×
DECISION COUNT × × × × × × × × ×
DECISION DENSITY × × × × × × ×
DESIGN DENSITY × × × × × × × × ×
EDGE COUNT × × × × × ×
ESSENTIAL DENSITY × × × × × ×
LOC EXECUTABLE × × × × × × × × ×
PARAMETER COUNT × × × × × ×
GLOBAL DATA COMPLEXITY × × ×
GLOBAL DATA DENSITY × × ×
n × × × × × × × × × × × × × ×
v × × × × × × × × × × × × × ×
l × × × × × × × × × × × × × ×
d × × × × × × × × × × × × × ×
i × × × × × × × ×
e × × × × × × × × × × × × × ×
b × × × × × × × × × × × × × ×
t × × × × × × × × × × × × × ×
lOCode × × × × × × × ×
lOComment × × × × × × × × × × × × × ×
lOBlank × × × × × × × × × × × × × ×
locCodeAndComment × × × × × × × × × × × × × ×
MAINTENANCE SEVERITY × × × × × ×
MODIFIED CONDITION COUNT × × × × × ×
MULTIPLE CONDITION COUNT × × × × × × × × ×
NODE COUNT × × × × × ×
NORMALIZED CYCLOMATIC COMPLEXITY × × × × × × × × ×
uniq Op × × × × × × × × × × × × × ×
uniq Opnd × × × × × × × × × × × × × ×
total Op × × × × × × × × × × × × × ×
total Opnd × × × × × × × × × × × × × ×
branchCount × × × × × × × × × × × × × ×
PERCENT COMMENTS × × × × × ×
defects × × × × × × × × × × × × × ×
Collecting and extracting data is a complex process, and the extracted data is often of poor quality: there may be features or data points that affect the results of a data mining procedure. The use of public datasets has nevertheless been frequent in defect prediction studies. These datasets have been used despite critiques of their quality (Gray et al., 2011, Shepperd et al., 2013) because they are publicly available in the PROMISE repository (Menzies et al., 2012). One major problem in these data sets is repeated data points, i.e. observations of module metrics and their corresponding fault data occurring more than once. This may lead to over-fitting of classification models: as the proportion of repeated information increases, so does the apparent performance of the classifier. It is worth pointing out that the severity of repeated data points is algorithm specific; Naive Bayes classifiers, for example, have been reported to be fairly resilient to duplicates (Kolcz et al., 2003). A very simple oversampling technique is to duplicate minority class data points during training; however, these duplicated points should not be included in the testing data set. Another problem is imbalanced data sets: the class distribution of faulty and faultless modules is skewed towards faultless modules, although only some classifiers (like C4.5) are strongly affected by the class imbalance problem. Then there is the problem of inconsistent patterns, which occur when the same set of metric values describes both a 'defective' module and a 'non-defective' module. They typically introduce an upper bound on the achievable level of performance but are otherwise harmless. The presence of constant attributes does not seem to pose problems to the learners, and only data set KC4 contains the problem of repeated attributes. Missing values may or may not be problematic depending on the type of classification method used. It is not possible to check for correlated attributes, since the source code for the data sets is not available.
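The remedies mentioned above, dropping exact duplicates and constant attributes and then oversampling the minority class (on training data only), can be sketched as follows; the function and its interface are hypothetical, not part of any cited study.

```python
import random

def clean_and_balance(rows, labels, seed=0):
    """Drop duplicate rows and single-valued attributes, then duplicate
    minority-class rows until classes are balanced (training data only)."""
    # Remove exact duplicates (metric vector + label).
    seen, X, y = set(), [], []
    for row, lab in zip(rows, labels):
        key = (tuple(row), lab)
        if key not in seen:
            seen.add(key)
            X.append(list(row))
            y.append(lab)
    # Drop constant (single-valued) attributes.
    keep = [j for j in range(len(X[0])) if len({r[j] for r in X}) > 1]
    X = [[r[j] for j in keep] for r in X]
    # Random oversampling of the minority class by duplication.
    rng = random.Random(seed)
    minority = min(set(y), key=y.count)
    pool = [(r, lab) for r, lab in zip(X, y) if lab == minority]
    while y.count(minority) < len(y) - y.count(minority):
        r, lab = rng.choice(pool)
        X.append(list(r))
        y.append(lab)
    return X, y
```

Duplicated minority rows must never leak into the test split, for the reason given above: repeated information inflates the apparent performance of the classifier.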
Fig. 2.2: Confusion Matrix of a Defect Prediction Classifier
2.3 Measuring Performance of SDP Models
This section describes the performance indicators used to evaluate defect prediction models. In the rest of the section, the confusion matrix is described, the process of ROC curve analysis is presented, and AUC is explained as an evaluation measure. Furthermore, the definition and importance of Recall as a measure are also provided.
2.3.1 Confusion Matrix
Binary classification is a two-class prediction problem in which the outcomes are labeled as either positive or negative. For defect prediction, binary classification maps every instance of a data set containing defective and non-defective modules to the defective or non-defective class. Four outcomes are possible. If a defective module is classified as defective, it is called a true positive (TP); if a non-defective module is classified as defective, it is called a false positive (FP). Similarly, if a non-defective module is classified as non-defective, it is called a true negative (TN); if a defective module is classified as non-defective, it is called a false negative (FN). Figure 2.2 shows a confusion matrix based on these outcomes.
The probability of detection (PD) is defined as the probability that a defective module is correctly classified as defective. It is also known as sensitivity, recall, true positive rate (TPR), rate of detection, and hit rate.

PD = TP / (TP + FN)    (2.1)
The probability of false alarm (PF) is defined as the proportion of non-defective modules that are incorrectly classified as defective. It is also known as the false positive rate (FPR) and the false alarm rate, and equals one minus the specificity.

PF = FP / (FP + TN)    (2.2)
Both these measures and other similar measures (e.g. precision, accuracy etc.) are derived
from the confusion matrix and tell a one-sided story (Lessmann et al., 2008, Jiang et al., 2008a,
Metz, 1978). Some measures like J-coefficient, F-measure and G-mean, are also derived from the
confusion matrix and are more useful as reported in (Jiang et al., 2008a). However, they still do
not aid in simple classifier comparison and selection.
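As a minimal illustration, the indicators discussed in this section can be derived directly from the four confusion-matrix counts; the dictionary keys below follow the names used in this chapter.

```python
def confusion_measures(tp, fp, tn, fn):
    """Common performance indicators from confusion-matrix counts
    (PD and PF as in Equations 2.1 and 2.2)."""
    pd = tp / (tp + fn)               # probability of detection (recall)
    pf = fp / (fp + tn)               # probability of false alarm
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * pd / (precision + pd)
    return {"PD": pd, "PF": pf, "precision": precision,
            "accuracy": accuracy, "F": f_measure}
```

For example, a classifier that finds 40 of 60 defective modules while raising 10 false alarms on 90 non-defective ones has PD = 0.67 and PF = 0.11, illustrating the one-sided story each single measure tells.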
2.3.2 ROC Curves and AUC
As opposed to the performance parameters mentioned in Section 2.3.1, Receiver Operating Characteristic (ROC) curves enable an objective and simple comparison between algorithms, and the results can be compared with other studies that also use ROC curve analysis for model evaluation. Many classifiers let users alter a threshold parameter so that a family of models can be generated: a higher PD can be obtained, but at the cost of a higher PF, and vice versa. The (PF, PD) pairs obtained by altering the threshold form an ROC curve with PF on the x-axis and PD on the y-axis, as shown in Figure 2.3. Every ROC curve passes through the points (0, 0) and (1, 1): the former represents a classifier that always predicts a module as non-defective, while the latter represents a classifier that always predicts a module as defective. The ultimate goal is the upper-left corner (0, 1), which represents a model that identifies all defective modules without raising any false alarm. The advantage of ROC curves is that they are robust to imbalanced data sets and to changing or asymmetric misclassification costs, features characteristic of software prediction tasks; therefore, ROC curves are particularly well suited to this task.
The Area Under the ROC Curve (AUC) is a common scalar measure of classifier accuracy derived from ROC curves built on the same data. Higher AUC values mean that the curve lies closer to the upper-left portion of the ROC graph. If AUC = 1, the classifier is perfect; in general, a value well above 0.5 indicates that the model is effective and gives valuable advice as to which modules should be focused on during testing and debugging of the respective software system.

Fig. 2.3: ROC curve of three classifiers with best performance of C1
Source: http://gim.unmc.edu/dxtests/ROC3.htm

The AUC measure is especially useful for imbalanced data sets. Its use can greatly improve convergence across defect prediction studies because it measures only the accuracy of prediction and does not take into account operating conditions such as class and cost distributions; it thus represents an objective and general measure for reporting prediction performance. It also has a clear statistical meaning: it is the probability that the classifier ranks a randomly chosen defective module higher than a randomly chosen non-defective one. For more information on ROC curves and AUC in software defect prediction, the reader is referred to (Lessmann et al., 2008, Jiang et al., 2008a, Demsar, 2006); for ROC curves in general, (Metz, 1978) provides a comprehensive analysis.
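The statistical meaning of AUC suggests a direct, if quadratic-time, way to compute it from classifier scores, counting ties as half; this sketch is illustrative, whereas practical implementations integrate the ROC curve or use a rank-sum formulation.

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen defective module
    receives a higher score than a randomly chosen non-defective one
    (ties count as half a win)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

A classifier that scores every defective module above every non-defective one attains AUC = 1.0; one that scores them identically attains 0.5, matching the interpretation given above.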
2.4 Software Defect Prediction Studies
Initial work in the area of quality prediction was limited to the relationship between software science metrics (Halstead, 1977) and the number of errors in the software, for example (Ottenstein, 1979, 1981, Schneider, 1981). In 1977, Halstead (Halstead, 1977) introduced software complexity metrics, also called software science metrics, which laid a base for the study of software metrics and quality.
Early work on quality prediction (Khosgoftaar and Munson, 1990, Ottenstein, 1979, 1981) was of an empirical nature. It was inspired by mathematical models and laid a base for the use of statistics and probability in quality prediction.

In the 1990s, the focus of the studies changed from predicting the number of defects to classification and/or identification of fault-prone modules (Briand et al., 1993, Ohlsson and Alberg, 1996, Munson and Khosgoftaar, 1992, Khosgoftaar et al., 1994, Khoshgoftaar et al., 1997a). Predictions can be made with the help of software metrics based quality models (Wang et al., 2007). These models either predict a certain quality factor, such as the number of errors, or classify modules as fault-prone or not fault-prone (Wang et al., 2007). Models that output a quality factor are considered prediction models (Wang et al., 2007), whereas the latter are termed classification models (Wang et al., 2007, Gokhale and Lyu, 1997). Various artificial intelligence based and mathematically and statistically inspired techniques, for example neural networks, discriminant analysis, and optimized set reduction, have been used for prediction (Munson and Khosgoftaar, 1992, Briand et al., 1993, Khosgoftaar et al., 1994, Khoshgoftaar et al., 1996, Khoshgoftaar and Allen, 1999a, Ganesan et al., 2000, Khoshgoftaar and Allen, 1999b). The rest of the section provides a survey of different defect prediction studies.
2.4.1 Factor and Regression Analysis
Khoshgoftaar and Munson (Khosgoftaar and Munson, 1990) have developed prediction models on the basis of factor and regression analysis. Their model captured the relationship between program error measures and software complexity metrics. They investigated the relationship between program error measures and the metrics shortlisted (on the basis of factor analysis) from the software science metrics (Halstead, 1977). Factor analysis was employed to reduce the set of all collected metrics to a smaller set of highly related metrics, and regression analysis was then performed on that smaller set to estimate the number of errors a software module might contain.
2.4.2 Discriminant Analysis and Principal Component Analysis
Munson and Khoshgoftaar (Munson and Khosgoftaar, 1992) have classified program modules as fault-prone or not fault-prone using discriminant analysis, categorizing programs on the basis of uncorrelated complexity and software metrics. Earlier studies had shown that software complexity metrics and errors during the software life-cycle are related (Endres, 1975, Basili and Perricone, 1984, Chmura et al., 1990), so they developed an assignment model based on complexity information. They used historical development metrics and quality data to build the predictive model and then used it to categorize modules according to their metric profiles. A Principal Component Analysis (PCA) based approach was first employed to extract uncorrelated metrics, and discriminant analysis was then applied on the basis of those uncorrelated metrics to classify the modules.
Khoshgoftaar et al. (Khoshgoftaar et al., 1996) have used PCA and discriminant analysis to
find the set of most significant metrics and to classify modules as fault-prone or not fault-prone.
They kept the process and product metrics as independent variables; the class of a module, either
fault-prone or not fault-prone, was the dependent variable. They have conducted their analysis on
data from a large telecommunication system. They performed PCA on the process and product
metrics to determine which metrics are significant for quality and which are not. The PCA, in their
model, separates the highly correlated data from the uncorrelated data. The uncorrelated data
forms the principal components, which represent the same data in a new coordinate system. These
principal components, which serve as domain metrics, are then input to the classification model,
a non-parametric discriminant analysis based model. They developed two models, for which
misclassification rates of up to 31.1% and 22.6% were observed respectively.
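The two-stage pipeline described above, PCA to decorrelate the metrics followed by a discriminant rule on the principal components, can be sketched as follows. The metric values and labels are synthetic illustrations, and a simple nearest-class-mean rule stands in for the non-parametric discriminant analysis of the original study.

```python
import numpy as np

def pca(X, k):
    """Project metric vectors onto the top-k principal components."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]     # keep the k largest components
    return Xc @ vecs[:, order]

def nearest_mean_classify(Z, y, Znew):
    """Toy discriminant rule: assign each module to the class whose
    mean, in principal-component space, is closest."""
    means = {c: Z[y == c].mean(axis=0) for c in np.unique(y)}
    return np.array([min(means, key=lambda c: np.linalg.norm(z - means[c]))
                     for z in Znew])

# Hypothetical module metrics: [LOC, cyclomatic complexity, fan-out, churn]
X = np.array([[120, 4, 2, 10], [150, 5, 3, 12], [110, 3, 2, 8],
              [900, 30, 15, 90], [950, 35, 14, 95], [880, 28, 16, 85]], float)
y = np.array([0, 0, 0, 1, 1, 1])   # 0 = not fault-prone, 1 = fault-prone

Z = pca(X, 2)                      # uncorrelated "domain metrics"
preds = nearest_mean_classify(Z, y, Z)
print(preds)
```

The two principal components here retain almost all of the variance of the four correlated metrics, which is why the trivial discriminant rule separates the two groups cleanly.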
2.4.3 Bayesian Models
Abe et al. (Abe et al., 2006) have applied a Bayesian classifier to project metrics to predict the
quality of a completed software project. To select the metrics to which the Bayesian classifier is
applied, they have suggested and used the Wilcoxon rank sum test (Mann and Whitney, 1947) on
each metric. This test compares the locations of two populations and, based on that comparison,
indicates whether one population differs from the other. They have applied this test to find
statistical differences between the distributions of successful and unsuccessful projects; such
differences can single out the metrics which affect quality. After applying this test they selected 10
metrics, out of 29, which can contribute towards the quality of the software. They then applied the
classifier to predict whether a given software project is successful (of good quality) or not.
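The metric-selection step can be illustrated with a small sketch. The rank-sum statistic below uses the standard normal approximation, and the per-project metric values are hypothetical; the original study's 29 metrics and project data are not reproduced here.

```python
import math

def rank_sum_z(a, b):
    """Normal-approximation z statistic of the Wilcoxon rank sum test:
    rank the pooled sample and compare the rank sum of group `a`
    against its expectation under the null hypothesis."""
    pooled = sorted((v, g) for g, vals in ((0, a), (1, b)) for v in vals)
    ranks, i = {}, 0
    while i < len(pooled):                 # assign average ranks to ties
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2.0
        i = j
    w = sum(ranks[k] for k, (v, g) in enumerate(pooled) if g == 0)
    n1, n2 = len(a), len(b)
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (w - mean) / sd

# Hypothetical metric values for successful vs. unsuccessful projects.
successful = {"review_coverage": [0.9, 0.8, 0.85, 0.95], "team_size": [5, 6, 5, 7]}
failed     = {"review_coverage": [0.2, 0.3, 0.25, 0.1],  "team_size": [6, 5, 7, 5]}

# Keep only metrics whose distributions differ significantly (|z| > 1.96).
selected = [m for m in successful
            if abs(rank_sum_z(successful[m], failed[m])) > 1.96]
print(selected)
```

Only the metrics that survive this filter would then be fed to the Bayesian classifier.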
Li et al. (Li et al., 2006) have shared empirical results of a quantitative defect prediction
approach. Their major focus has been on the improvement of testing and resource allocation. Their
work has categorized the metrics available before release for field defect prediction and hence helps
in future improvement. They have compared seven prediction models, including a clustering
algorithm as well as regression models, using rank correlation. On the basis of this comparison
they have identified the important predictors for a certain type of software. They have also used the
Bayesian Information Criterion (BIC) to determine important and prioritized areas for product
testing.
Bouktif et al. (Bouktif et al., 2006) have presented an approach for improving software quality
prediction. Their approach was to reuse and adapt already available quality prediction models.
They have proposed running a simulated annealing algorithm on top of a Bayesian classifier
approach. They have generalized the selection of a prediction model by treating existing models as
quality experts, modeling each expert's expertise as a Bayesian classifier, and running their
suggested algorithm over the set of experts. The algorithm outputs the best subset of expertise from
the set of all available expertise; they have thus tackled the problem of finding an optimal subset as
an optimization problem.
Khoshgoftaar et al. (Khoshgoftaar et al., 1997a) have introduced process-based metrics to
predict software quality. They have used reliability indicators to predict the quality of the software,
and have emphasized that the quality of a process reflects the quality of the product itself. According
to them, early prediction of reliability indicators motivates the use of reliability improvement
techniques prior to module integration. Their core work has been on the improvement of integration
and testing processes, which eventually improves the software quality as well. They have suggested
that product metrics, as used by various quality prediction models, are not good tools for this task
in systems which evolve with each iteration (for example, systems developed using the spiral life
cycle). They have classified modules as fault-prone or not fault-prone, and this classification helps
them focus on the modules which are more prone to faults. Their classification model was
probability based and exhibited a misclassification rate of approximately 35.5%.
Khoshgoftaar et al. (Khoshgoftaar and Allen, 1999a) have suggested a logistic regression-based
model for software quality prediction and for classifying modules as fault-prone or not fault-prone.
Unlike other regression modeling techniques, they have incorporated the prior probabilities of
misclassification and the costs of misclassification into the classification rule of the logistic
regression-based model. They have performed the analysis on a subsystem of a military real-time
system.
Mockus et al. (Mockus et al., 2005), using regression analysis, have predicted customer-perceived
quality by measuring service interactions such as defect reports, requests for assistance, field
technician patches and other parameters in a large telecommunication software system. They have
investigated the impact of problem occurrence and of the frequency of problem occurrence, and
discovered that both negatively affect customer-perceived quality. Mockus et al. (Mockus et al.,
2005) have used the data collected through automated project monitoring and control and found
that deployment schedule, hardware configurations and software platforms can affect the probability
of a software failure. Furthermore, their findings have suggested that these factors equally affect
the customer-perceived quality. Their work has been more helpful in the planning of customer
support processes, unlike other prediction models which help the planning of development.
Nagappan et al. (Nagappan and Ball, 2005b) have employed code churn metrics for predicting
system defect density. They have used eight relative metrics, such as the ratio of changed LOC to
total LOC and of deleted LOC to total LOC, and applied statistical regression models to predict
defect density. They have found that absolute measures of code churn are not good predictors of
defect density.
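The distinction between absolute and relative churn can be sketched as follows. The eight relative measures of the original study are not reproduced here; two illustrative ratios are computed from hypothetical per-module version-control counts.

```python
# Hypothetical per-module churn counts collected from version control.
modules = {
    "parser.c": {"churned_loc": 450, "deleted_loc": 120, "total_loc": 1500},
    "lexer.c":  {"churned_loc":  30, "deleted_loc":  10, "total_loc":  900},
    "driver.c": {"churned_loc": 800, "deleted_loc": 300, "total_loc": 1600},
}

def relative_churn(m):
    """Two relative churn measures of the kind discussed above:
    churned LOC and deleted LOC, each normalized by total LOC."""
    return {
        "churned_ratio": m["churned_loc"] / m["total_loc"],
        "deleted_ratio": m["deleted_loc"] / m["total_loc"],
    }

features = {name: relative_churn(m) for name, m in modules.items()}
for name, f in features.items():
    print(name, round(f["churned_ratio"], 3), round(f["deleted_ratio"], 3))
```

Normalizing by module size is what makes the measures comparable across modules; the absolute counts alone would mostly reflect module size rather than change activity.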
Neil et al. (Neil and Fenton, 1996) have highlighted the importance of factors such as contextual
information in improving the prediction of software defects. They have used a combination of
software process and product metrics to build a Bayesian Belief Network (BBN). Their expert-driven
BBN was capable of making better predictions on the basis of contextual information.
Fenton et al. (Fenton and Neil, 1999) have criticized existing models for being unable to take a
holistic view while predicting defects. For example, certain models predict defects using size and
complexity metrics whereas others use testing metrics only, consequently ignoring the potential
predictors like process metrics. Moreover, these models do not take into account the relation-
ship between software metrics and defects. Fenton et al. (Fenton et al., 2002) have presented a
BBN to overcome the aforementioned limitations. Their model has shown encouraging results
when used to predict defects in multiple real software projects. The BBN has been improved to
work independent of software development life cycle (Fenton et al., 2007b) and has also been em-
ployed by different studies (Fenton et al., 2008, Amasaki et al., 2003, Dabney et al., 2006, Pai and
Bechta Dugan, 2007) to predict software quality.
A causal relationship between software metrics and defects is important for understanding and
improving software processes (Card, 2004, 2006, Chang and Chu, 2008). Thapaliyal et al. (Thapaliyal
and Verma, 2010) have shown that in the object-oriented paradigm, ‘Class Size’ is a metric that
has a strong positive relationship with defects, whereas ‘Coupling between Objects’ and ‘Weighted
Methods per Class’ are not significant enough to be used to predict defects. This study employed a
weighted least squares model for the empirical analysis. Similarly, the use of contextual information
has been reported to be ineffective when compared to the use of code metrics alone (Bell et al., 2011).
2.4.4 Neural Networks
Khoshgoftaar et al. (Khosgoftaar et al., 1994) have suggested a neural network based classification
model to identify the modules which have a high risk of error, and have applied it to quality
evaluation and resource planning for telecommunication software. They have compared the
classification results with those of discriminant analysis based classification. Their finding was that
the neural network based approach is better in terms of providing management support in software
engineering. Neural network based techniques were also simpler and easier to use, and produced
more accurate models, as compared to discriminant analysis.
Pizzi et al. (Pizzi et al., 2002) have performed Neural Network (NN) based classification of
software modules and predicted their quality accordingly. They have classified the modules (objects)
on the basis of quality attributes like maintainability, extensibility, efficiency and clarity. Upon
classification, the low quality objects may be reviewed by the project manager for improvement.
Their approach asks an expert software architect to determine the class labels and then uses neural
networks to classify the objects. Since the labeling of classes involves a human, subjectivity cannot
be avoided in this approach. To compensate for such imprecision they have performed preprocessing
using median-adjusted class labeling. They have applied their approach to 19 different metrics, and
based on the values of those metrics the class labels were set by the expert architect.
Thwin et al. (Thwin and Quah, 2002) have conducted an empirical study and proposed a NN-based
approach to estimate software quality. A similar study was conducted afterwards by Quah et
al. (Quah and Thwin, 2003). They have used two different neural network models, the Ward NN
(WNN) model and the General Regression NN (GRNN) model, for estimating the number of defects
(reliability) per class and the number of line changes (maintainability) per class. They have used
object-oriented metrics, such as inheritance-related metrics, cohesion metrics and coupling metrics,
as independent variables, and have also used traditional complexity metrics for prediction purposes.
The WNN model was based on a Gaussian function, while the GRNN estimates continuous dependent
variables. After applying both approaches they have found that GRNN predicts the quality more
accurately. Unlike Quah et al. (Quah and Thwin, 2003), Thwin et al. (Thwin and Quah, 2002) have
taken into account the reliability metric (i.e. number of defects) only.
Wang et al. (Wang et al., 2004) have also used a neural network based approach for quality
prediction, but they have extracted rules from the prediction model and then used those rules to
detect fault-prone software modules. They have used genetic algorithms and a rule-based approach
on top of the NN-based approach. They have discouraged the use of NN as a black box and have
introduced an interpretable NN-based model for software quality prediction.
2.4.5 Classification/Regression Trees
Khoshgoftaar et al. (Khoshgoftaar and Allen, 1999b) have used Classification and Regression
Trees (CART) to predict fault-prone software components in embedded systems. The goal of their
prediction was to improve the reliability of embedded systems. They have used measurement and
fault data from previous releases to develop a prediction model. This model was applied to modules
currently under development in order to predict which software modules carry a high risk of faults
being discovered later on. They have taken software metrics as the independent (predictor) variables
and treated the fault-prone and not fault-prone classes of modules as the dependent variable.
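The elementary operation that CART repeats recursively, finding the metric threshold that best separates the two classes, can be sketched with a single-split "stump". The complexity values and fault labels below are synthetic, and Gini impurity is used as the split criterion, one of the standard choices for classification trees.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(values, labels):
    """Find the threshold on a single metric that minimizes the weighted
    Gini impurity of the two branches -- the step CART applies
    recursively to grow a full tree."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left  = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical cyclomatic complexity per module; 1 = fault-prone.
complexity = [3, 4, 5, 6, 20, 25, 30, 40]
faulty     = [0, 0, 0, 0,  1,  1,  1,  1]
threshold, impurity = best_split(complexity, faulty)
print(threshold, impurity)
```

On this toy data the split at complexity <= 6 separates the classes perfectly, driving the weighted impurity to zero; a regression tree would use a variance criterion instead to predict fault counts.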
Gokhale et al. (Gokhale and Lyu, 1997) have used regression tree modeling for the prediction
of software quality. The independent variables in their prediction model were software complexity
metrics; the total number of faults and the classification of the modules were the dependent
variables. They have compared their approach with the defect density model for prediction and
found that their approach had a lower misclassification rate: 14.86% for their tree modeling against
20.6% for the defect density technique. In addition, their approach was robust to the presence of
outliers and was capable of handling missing values as well. Their approach also catered for highly
uncorrelated data and performed stably.
2.4.6 Case-based Reasoning
Ganesan et al. (Ganesan et al., 2000) have targeted better reliability and given a Case-based
Reasoning (CBR) approach for quality prediction. In contrast to other approaches, which use
qualitative metrics and CBR to predict quality by identifying similar cases, their approach predicts
quantitative metrics of quality. They have narrated a case study in which CBR was applied for
software quality modeling in a family of full-scale industrial software systems. While predicting
faults, the accuracy of the CBR system was found to be better than that of a corresponding multiple
linear regression model. The CBR system gave better results for approximately 67% of the datasets
and equivalent results for the remaining 33%.
Grosser et al. (Grosser et al., 2003) have presented a case-based reasoning based stability
prediction model for object-oriented systems. Their approach was to record the stability values of a
certain set of components; these stability values are already known and correct, and the stored
values act as reference values for new instances of components. A newly arrived component is
compared with the stored components, and the set of components to which the arriving component
is most similar decides the stability of the new component, i.e. the stability value of the nearest set
of components becomes the stability value of the newly arrived component. They have used
successive versions of the Java Development Kit (JDK) as validation data for their model.
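The nearest-case retrieval step of such a CBR model can be sketched as a k-nearest-neighbour vote. The case base, metric vectors and similarity measure below are illustrative assumptions, not the similarity function used in the original study.

```python
import math
from collections import Counter

def knn_stability(case_base, query, k=3):
    """Case-based reasoning sketch: the stability of a new component is
    the majority stability among its k most similar stored components
    (Euclidean distance over the metric vector as similarity)."""
    ranked = sorted(case_base, key=lambda c: math.dist(c["metrics"], query))
    votes = Counter(c["stable"] for c in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical case base: [methods, subclasses, coupling] per class,
# with stability known from past versions.
case_base = [
    {"metrics": [10, 1, 3],  "stable": True},
    {"metrics": [12, 0, 2],  "stable": True},
    {"metrics": [9,  2, 4],  "stable": True},
    {"metrics": [40, 8, 20], "stable": False},
    {"metrics": [55, 9, 25], "stable": False},
]

print(knn_stability(case_base, [11, 1, 3]))   # query resembles the stable cases
```

Retrieval quality here depends entirely on the distance function and on how representative the stored cases are, which is why the original work validated against several successive JDK versions.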
2.4.7 Fuzzy Approaches
Yuan et al. (Yuan et al., 2000) have applied subtractive fuzzy clustering to classify modules
as fault-prone or not fault-prone and to predict the number of faults. They cluster the data by
identifying potential cluster centers and selecting the data point with the highest potential as a
center. This process of selecting centers for further clusters continues until the potential values of
the remaining data points fall below a certain threshold. Rules are then generated for each class
using the membership functions of the data points, with the Sugeno principle used for fuzzy
inference, and these rules are used to predict the number of faults. They perform clustering on
product as well as process metrics. The approach of Yuan et al. (Yuan et al., 2000) differs from that
of Khoshgoftaar et al. (Khoshgoftaar et al., 1997a) in that Yuan et al. have used fuzzy clustering
while Khoshgoftaar et al. have used a probability based technique for classification.
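The center-selection loop described above can be sketched as follows. The potential function and the radius and stopping parameters follow the usual form of subtractive clustering, but the specific parameter values and the two-dimensional metric points are illustrative only.

```python
import math

def subtractive_clustering(points, ra=1.0, eps=0.15):
    """Subtractive clustering sketch: each point receives a 'potential'
    reflecting the density of its neighbours; the highest-potential
    point becomes a cluster center, potentials near it are suppressed,
    and the process repeats until the best remaining potential falls
    below a fraction `eps` of the first center's potential."""
    alpha = 4.0 / ra ** 2
    beta = 4.0 / (1.5 * ra) ** 2          # wider suppression radius
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    pot = [sum(math.exp(-alpha * d2(p, q)) for q in points) for p in points]
    centers, first = [], max(pot)
    while max(pot) >= eps * first:
        c = pot.index(max(pot))
        centers.append(points[c])
        pc = pot[c]
        pot = [p - pc * math.exp(-beta * d2(points[i], points[c]))
               for i, p in enumerate(pot)]
    return centers

# Two well-separated groups of module-metric points (synthetic).
points = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
          (5.0, 5.1), (5.1, 5.0), (5.2, 5.1)]
print(subtractive_clustering(points))
```

In a fuzzy classifier of the kind described above, each recovered center would seed a membership function from which the classification rules are generated.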
Dick et al. (Dick and Kandel, 2003) have employed fuzzy clustering based on software metrics,
together with PCA on the resulting clusters, to identify the modules that are more significant in
terms of the effort and resources needed. Their approach identified the few relatively complex
modules in the whole software system, which helps in mitigating the risk associated with those
modules. They have performed fuzzy cluster analysis of three datasets and obtained impressive
results, using 11 metrics for each data set. Different numbers of clusters were considered optimal
for different datasets. Once the clusters were made, Dick et al. (Dick and Kandel, 2003) identified
that modules with high metric values carry high risks. They have used PCA to examine, for each
dataset, linearity and homogeneous ordering.
Attempts have been made to identify rules useful in the early stages of the software life cycle for
predicting defect proneness and reliability (Yang et al., 2007). Yang et al. (Yang et al., 2007) suggest
the use of a fuzzy adaptive learning control network (FALCON) which can predict the quality based
on fuzzy inputs. They intend to predict quality when the exact values of software metrics are not
known and the domain knowledge and experience of the project managers can be used to
approximate the metric values. Furthermore, they also intend to find a rule set which has the
capability to reason under uncertainty, a limitation of most quality prediction models (Fenton and
Neil, 1999).
Fuzzy Unordered Rule Induction Algorithm (FURIA) (Huehn and Huellermeier, 2009) is a
modified version of the rule-based classifier RIPPER (Cohen, 1996). Apart from rule learning and
simple, comprehensible rule sets like those of RIPPER, FURIA adds several extensions. Whereas
RIPPER learns conventional rules organized in rule lists, FURIA learns more general fuzzy rules
organized in unordered rule sets. Unordered rule sets may leave some examples uncovered; the
FURIA algorithm deals with these through a rule stretching method which generalizes the existing
rules until they cover the uncovered examples.
2.4.8 Support Vector Machine
Xing et al. (Xing et al., 2005) have used Support Vector Machines (SVM) for the early prediction
of software quality. Being SVM-based, their proposed model does not need large amounts of data.
The SVM-based predictor gave correct classifications 89% of the time. They also suggested a new
SVM-based approach named Transductive SVM (TSVM), which achieved a 90.03% correct
classification rate (CCR); this improved CCR is achieved at the expense of more training time than
other methods. In other research, Xing et al. (Xing and Guo, 2005) have presented a software
reliability growth model (SRGM) based on support vector regression (SVR). That technique has
been classified as a statistical and probabilistic technique and is discussed later in this chapter.
Brun et al. (Brun and Ernst, 2004) have employed a machine learning based approach to predict
latent code errors. They worked on identifying the program properties which are good indicators of
errors in the later phases of the SDLC. They have generated machine learning models of those
program properties which are exhibited when errors are present. They have then applied those
models to already developed code to find and rank the properties which are more fault-revealing.
They have used SVM and decision tree learning tools for the classification of such properties.
2.4.9 Genetic Algorithms
Bouktif et al. (Bouktif et al., 2004) have suggested genetic algorithm based improvement of
quality prediction techniques for object-oriented systems by combining a set of existing models.
Their approach helps an interested organization select an appropriate quality model. The quality
factor they have addressed is the stability of software developed in the object-oriented paradigm.
They believe that a stability prediction model should be used to assess and reduce the
implementation cost of new requirements. But this prediction model needs some input from
previous versions, so it cannot be used until a few versions have been developed. For this reason
they collect, for each class, the metrics which contribute towards its stability. These metrics include
Coh, NPPM (number of public and protected methods) and stress. Based on these metrics they
build a decision tree classifier that predicts whether a given class is stable or not. They encode this
classifier as a chromosome to be used in the genetic algorithm.
Wang et al. (Wang et al., 2007) have introduced a genetic algorithm based approach to select
the relevant metrics from the set of all metrics collected during the development of a certain
software system. This approach helps in designing more accurate predictors of software quality,
since not all of the collected metrics contribute towards the quality of the software; only a certain
subset of them does. Therefore all of the collected metrics should not be used as input to the
prediction model; instead only the appropriate software features should contribute to the prediction.
They have selected these relevant features through a genetic algorithm based feature selection
model. They have suggested that the prediction models usually built are not adaptive learners:
when the trained models are given an outlier as input, they fail to classify it correctly. But avoiding
outliers in real-world software is not practically possible, so they have suggested an outlier detection
technique which prunes the small clusters of software modules that might degrade the performance
of the prediction model. They have achieved both of these goals, appropriate feature selection and
software clustering for outlier detection, by combining them in the fitness function. They have
validated their model on data collected over four releases of two large telecommunication systems.
The systems were developed using the high level languages Ruby and Sapphire. They have used 14
file-level metrics and 18 routine-level metrics to test their idea. During this validation activity, they
observed an overall misclassification rate of up to 24.6%.
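The basic shape of genetic-algorithm feature selection, a bit-vector chromosome over the candidate metrics evolved under a fitness function, can be sketched as follows. The per-metric relevance scores and the simple score-minus-cost fitness are toy stand-ins; the actual study's fitness combined classification quality with clustering-based outlier pruning.

```python
import random

random.seed(7)

# Hypothetical relevance score per candidate metric (higher = more useful).
relevance = [0.9, 0.1, 0.8, 0.05, 0.7, 0.1, 0.05, 0.6]
COST = 0.3   # penalty per selected metric, discouraging large subsets

def fitness(bits):
    """Toy fitness: total relevance of selected metrics minus a size cost."""
    return sum(r for r, b in zip(relevance, bits) if b) - COST * sum(bits)

def mutate(bits, rate=0.1):
    return [1 - b if random.random() < rate else b for b in bits]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

pop = [[random.randint(0, 1) for _ in relevance] for _ in range(20)]
history = []
for gen in range(40):
    pop.sort(key=fitness, reverse=True)
    history.append(fitness(pop[0]))
    elite = pop[:4]                       # elitism keeps the best subsets
    children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                for _ in range(len(pop) - len(elite))]
    pop = elite + children

best = max(pop, key=fitness)
print(best, round(fitness(best), 2))
```

Because the four best chromosomes survive each generation unchanged, the best fitness in `history` never decreases, which is the usual motivation for elitism in this setting.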
2.4.10 Association Mining
The use of association mining for defect prediction has also been reported. Association Rule
Mining (ARM) has been useful for predicting software defect correction effort and for determining
associations among software defects (Song et al., 2006, Priyadarshin, 2008, Karthik and Manikandan,
2010). An association rule based classifier, CBA2, has been empirically evaluated for predicting
software defects (Baojun et al., 2011); the accuracy and comprehensibility of CBA2 have been
comparable to C4.5 and RIPPER, two recognized defect classification models. Kamei et al. (Kamei
et al., 2008) have presented a hybrid approach to classify software modules as fault-prone or not
fault-prone. The hybrid of ARM and logistic regression has performed better in terms of lift;
however, its performance has been inferior in terms of support and confidence when compared with
the individual models based on logistic regression, linear discriminant analysis and classification
trees. Association rules have also been employed to identify the action patterns that may cause
defects (Chang et al., 2009). Each rule represents actions as the antecedent and the number of
defects as the consequent. The antecedents can be of numeric or categorical type, whereas the
consequent is discretized as low, medium or high. Actions co-occurring with defects are used to
avoid future defects. The proposed approach has shown promising results when used for a business
project.
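The support and confidence measures mentioned above can be made concrete with a small sketch. The transactions, action names and the single-antecedent rule shape are hypothetical illustrations of "action -> defect" rules, not data from any of the cited studies.

```python
# Hypothetical change-action transactions; "defect" marks changes that
# later triggered a fault report.
transactions = [
    {"edit_core", "no_review", "defect"},
    {"edit_core", "no_review", "defect"},
    {"edit_core", "review"},
    {"edit_ui", "review"},
    {"edit_ui", "no_review", "defect"},
    {"edit_ui", "review"},
]

def support(itemset):
    """Fraction of transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) estimated from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# Mine single-antecedent rules "action -> defect".
actions = {a for t in transactions for a in t} - {"defect"}
rules = {a: (support({a, "defect"}), confidence({a}, {"defect"}))
         for a in actions}
for a, (s, c) in sorted(rules.items()):
    print(f"{a} -> defect  support={s:.2f} confidence={c:.2f}")
```

In this toy data the rule "no_review -> defect" has confidence 1.0, the kind of high-confidence action pattern such approaches flag for process improvement.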
2.4.11 Ensemble Models
Ensemble models work on the principle of creating multiple learning models and combining their
outputs to obtain the desired output. Random Forests (Rnd For) (Breiman, 2001) is a decision tree-based
ensemble classifier consisting of many decision trees. The trees are constructed using the
following strategy: the root node of each tree contains a bootstrap sample of the data that is the
same size as the original, and each tree has a different bootstrap sample. At each node, a subset of
variables is randomly selected from all the input variables to split the node, and the best split is
adopted. This splitting is repeated for each new node, without pruning, until the tree is grown to
the largest extent possible. When all trees are grown, a new instance or set of instances is run down
all the trees, and the mode of the trees' votes is selected as the prediction for the new instance(s)
(Jiang et al., 2008a).
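The bootstrap-sample-plus-random-feature-plus-majority-vote recipe can be sketched with one-level trees (stumps) standing in for fully grown trees. The module metrics and labels are synthetic, and real random forests grow each tree to full depth with a feature subset drawn at every node rather than once per tree.

```python
import random
from collections import Counter

random.seed(1)

def train_stump(X, y, feat):
    """Best single-feature threshold split by misclassification count."""
    best = (0.0, 0, 1, len(y) + 1)            # threshold, left, right, errors
    for t in sorted({row[feat] for row in X}):
        for left, right in ((0, 1), (1, 0)):
            err = sum(yi != (left if row[feat] <= t else right)
                      for row, yi in zip(X, y))
            if err < best[3]:
                best = (t, left, right, err)
    return feat, best[0], best[1], best[2]

def forest_predict(trees, row):
    """Majority vote (mode) over all trees' predictions."""
    votes = Counter(left if row[feat] <= t else right
                    for feat, t, left, right in trees)
    return votes.most_common(1)[0][0]

# Synthetic module metrics [size, coupling]; 1 = fault-prone.
X = [[100, 2], [120, 3], [90, 2], [800, 9], [750, 8], [900, 10]]
y = [0, 0, 0, 1, 1, 1]

trees = []
for _ in range(15):
    # bootstrap sample of the same size, plus a random feature per tree
    idx = [random.randrange(len(X)) for _ in X]
    feat = random.randrange(len(X[0]))
    trees.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feat))

preds = [forest_predict(trees, row) for row in X]
print(preds)
```

Each individual stump is weak and trained on a distorted sample, but the vote across many diverse stumps recovers a much more stable classifier, which is the point of the ensemble.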
Composite Hypercubes on Iterated Random Projections (CHIRP) (Wilkinson et al., 2011) is a
new covering algorithm, not a modified version of an existing algorithm. It is a non-parametric
ensemble model. It deals with the problems of the "curse of dimensionality", computational
complexity and separability that are usually faced by supervised classifiers. CHIRP addresses these
obstacles by using three stages, projection, binning and covering, in an iterative sequence. To use
this model, prior knowledge of the structure of the data is not required, since it does not need to be
customized for each data set.
The classifier DTNB is a combination of a Decision Table and Naïve Bayes (Hall and Frank,
2008). At each point in the search, the algorithm evaluates the merit of splitting the attributes into
two groups: one for Naïve Bayes and the other for the decision table. The resulting probability
estimates from both models are combined to give the result of this hybrid classifier. Initially, all
attributes are modeled by the decision table. A forward selection search is used to select attributes;
at each step, the selected attributes are modeled by Naïve Bayes and the remainder by the decision
table. At each step, the algorithm also considers dropping an attribute entirely from the model.
2.4.12 Other Studies
Ottenstein (Ottenstein, 1979) has suggested a mathematical model for predicting the number of
defects in a system before testing starts. This effort can help in the planning and testing phases
of the SDLC. Ottenstein investigated the relationship between the software science metrics
(Halstead, 1977) and the number of bugs. In short, this model was based on the study of software
science metrics and was tested on the data available in the literature at the time. The proposed
model was useful for estimating the time needed for testing, generating better schedules, estimating
the amount of computer time needed to perform testing, etc. Furthermore, this model could lead to
improved reliability estimates. The model relating Halstead Volume (V) to the number of bugs had
the lowest error rate, 12.3%, and this error rate was consistent across the several datasets used.
Later, Ottenstein improved the work and presented a model (Ottenstein, 1981) predicting the total
number of bugs during the development of a project. The difference was that the previous model
predicted the number of bugs in the system at the validation stage of the software life cycle, while
the new model estimated the total number of bugs during the development of a project. The model
was validated on projects developed by professional programmers and projects developed by
students. The projects developed by students could be further grouped into projects with few errors
and projects with a large number of errors; both groups had approximately equal numbers of
programs. This model did not give promising results for the projects with a large number of errors.
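Ottenstein's exact formulae are not reproduced here, but the flavour of a software-science bug model can be illustrated with Halstead's own classic delivered-bug estimate, B ≈ V/3000, applied to hypothetical operator and operand counts for one module.

```python
import math

def halstead_volume(n1, n2, N1, N2):
    """Halstead software-science volume: V = N * log2(eta), where
    N = N1 + N2 is the total number of operator and operand
    occurrences and eta = n1 + n2 is the vocabulary size
    (distinct operators plus distinct operands)."""
    return (N1 + N2) * math.log2(n1 + n2)

# Hypothetical counts for one module: 15 distinct operators, 25 distinct
# operands, 180 operator occurrences, 230 operand occurrences.
V = halstead_volume(15, 25, 180, 230)
est_bugs = V / 3000      # Halstead's classic delivered-bug estimate
print(round(V, 1), round(est_bugs, 2))
```

Models of this family share the weakness noted above: a single fitted relation between volume and bugs generalizes poorly to programs with unusually high error counts.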
Mohanty (Mohanty, 1979) has discussed the assessment of various aspects which affect software
quality, for example accessibility and testability. Mohanty has also discussed statistical methods
for estimating software reliability and found that the estimates for reliability and mean time to
failure were highly correlated with the real values. Mohanty's work targeted various phases of the
SDLC and multiple means of assessing software quality at different phases, for example
methodologies for design evaluation through an entropy function, estimation of development effort
through software science metrics, and test effectiveness measurement through suitable test plans.
This work contributed significantly in affirming that quantifiable attributes of software can help
control and manage software projects.
Schneider (Schneider, 1981) has presented estimators on the basis of an experimental study.
These estimators were formulae for estimating the number of software problem reports. The
estimators were validated using collected data and were found consistent with it.
Jensen et al. (Jensen and Vairavan, 1985) have conducted an experimental study of software
metrics and their relationships with each other for real-time systems developed in Pascal. Their
model was also an empirical, mathematical model, but they did not validate the model for studying
the relationship between errors and metrics. Their study of the relationship between errors and
software metrics revealed that a new metric (namely NF) is a better estimator of program length:
approximately 91% of the programs they tested suggested that NF is a better approximation than
NH, Halstead's program length metric (Halstead, 1977).
Brocklehurst et al. (Brocklehurst and Littlewood, 1992) have suggested a statistically inspired
model which is a good candidate for an almost generic model of reliability prediction, though with
some limitations. They have observed that the ability to depict the past correctly does not guarantee
the ability to predict the future accurately; so the models already discussed in the literature at the
time were, by their claim, not true predictors. They have presented a new concept of detecting
systematic differences between the predicted and actual values. They have used the u-plot, very
similar to the concept of bias in statistics, to assess predictive accuracy. They validated their model
on three datasets and found promising results.
Ohlsson et al. (Ohlsson and Alberg, 1996) have provided empirical evidence that fault-prone
software modules can be identified before coding starts. This prediction was aided by design
metrics and some complexity metrics. They carried out the study at Ericsson Telecom AB and
developed a tool named ERIMET which helps in calculating certain metrics for only those modules
which are affected by a change. Their study has supported the observation that a small number of
software modules is responsible for a major portion of the total faults encountered in a system.
They have validated the model accuracy using a technique called the Alberg diagram (a slight
variation of the Pareto diagram), introduced by them in the same paper. They intended to find the
relationship between design metrics and function test failures, and for this reason they have
modeled their predictor on the basis of design metrics. Their prediction model also incorporated
some complexity and size metrics, and the finding was that the size metric does not give any better
results than the four design metrics used by them. Their model was developed on
telecommunication data and showed promising results.
Wang et al. (Wang et al., 1998) have not directly addressed software quality, but they present a
model for software reliability which eventually helps in estimating software quality. They use
Markov chain properties for estimation in their model. They estimate the reliability of the
components independently, and the components are then mapped into state diagrams for further
use; the transitions between states are treated as a Markov process. They have used different
reliability models for different architectural styles.
Guo et al. (Guo and Lyu, 2000) have proposed a statistical technique for the early prediction of
software faults. Their approach does not need prior knowledge of the number of faults in the
modules to be classified as fault-prone or not fault-prone, the kind of information that is usually
not available in the early stages of software development. They have suggested that software size
and complexity metrics can be used to develop a model for software quality prediction. They select
the appropriate class for a module based on the values given by the Akaike Information Criterion
(AIC) (Akaike, 1974); the AIC places a software module in the class of modules with similar
characteristics.
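The AIC itself is a simple penalized-likelihood score. The sketch below uses the standard least-squares form of the criterion; the residual sums of squares, sample size and parameter counts for the two competing fault models are hypothetical.

```python
import math

def aic_gaussian(rss, n, k):
    """AIC for a least-squares model with Gaussian errors:
    AIC = n * ln(RSS / n) + 2k, where k is the number of fitted
    parameters. Lower AIC means a better trade-off between
    goodness of fit and model complexity."""
    return n * math.log(rss / n) + 2 * k

# Hypothetical fits of two fault models to the same n = 30 modules:
# model A (2 parameters) vs. model B (6 parameters, only slightly better fit).
n = 30
aic_a = aic_gaussian(rss=12.0, n=n, k=2)
aic_b = aic_gaussian(rss=11.5, n=n, k=6)
chosen = "A" if aic_a < aic_b else "B"
print(round(aic_a, 2), round(aic_b, 2), chosen)
```

The more complex model B fits marginally better but pays a larger 2k penalty, so AIC prefers the simpler model, the same trade-off that underlies AIC-based class assignment.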
Xing et al. (Xing and Guo, 2005) have presented a software reliability growth model (SRGM)
based on support vector regression (SVR). Their proposed SRGM proved to be a better predictor
of reliability than other suggested reliability predictors on the data they used. They have identified
and rectified a problem with conventional SRGMs: the conventional SRGMs made unrealistic
assumptions about the fault distribution, which restrains these models from analyzing actual failure
data. They have compared their model with three different reliability growth models, and the
results have shown that their technique is better than the compared techniques.
Nagappan et al. (Nagappan and Ball, 2005a) have advocated that static analysis tools can be
good predictors of pre-release defect density and, consequently, of the quality of the software. They
have used a statistical regression technique to build the prediction model but do not provide the
regression equations, for the sake of protecting proprietary data. The task of the static analysis
tools was to detect low-level programming errors and errors usually not uncovered by conventional
testing. They have stated that static analysis defect density has a positive correlation with
pre-release defect density; therefore one can use static analysis defect density as an indicator of
pre-release defect density, and as a result decisions on testing, code inspections, redesign, etc. can
be improved.
Liu et al. (Liu et al., 2005) have presented a performance prediction method for component-
based applications. Their approach is applicable at the design level, i.e., before development of a
significant prototype version of the application. Moreover, they have designed the technique
specifically for component-based server-side applications. This design-level technique helps
system architects decide on an appropriate application architecture and improve the design before
significant implementation has been done. They have employed their approach to build a
quantitative performance model for the application once its design is available. The independent
input variables of the model include the design description of the application and the performance
profile of the platform on which the application is to be developed. They have implemented two
different architectures on different implementations of Enterprise Java Beans (EJBs). The
performance predictions from the model were then validated by measuring the performance of
these implementations of the two architectures.
Nagappan et al. (Nagappan et al., 2006) have conducted an empirical study showing that
failure-prone entities correlate with code complexity metrics and that, based on the values of
complexity metrics, component failures can be predicted. However, no single metric or set of
metrics is the best defect predictor. They have used principal component analysis on different
code metrics and have built a regression model to predict post-release defects.
Briand et al. (Briand et al., 1993) have presented an application of Optimized Set Reduction
(OSR) to construct a model that can identify high-risk software components. They have evaluated
the accuracy of their stochastic model on Ada components. The motivation behind the OSR-based
model is that in software engineering data is usually incomplete and may not be comprehensive
enough to fulfill the needs of a model, so they have suggested a robust model that reliably
classifies the high-risk components. OSR is a technique partially based on machine learning
principles and univariate statistics and was developed at the University of Maryland (Briand
et al., 1993, 1992). The output of OSR is logical expressions representing patterns in the
dataset, and the OSR classification model works on the basis of these logical expressions.
Reussner et al. (Reussner et al., 2003) have suggested a parameterized reliability prediction
model for component-based software architectures. They have emphasized that the architectures
should be parameterized by the component usage profile and the reliability of the required
components. Their empirical technique therefore collects usage information and models the
dependencies between service reliabilities as state machines. In this way they have overcome a
problem of earlier models, which neglected component usage information and context-related
information.
Grassi et al. (Grassi and Patella, 2006) have suggested a reliability prediction methodology for
service-oriented computing environments. They associate a flow graph with each running service
in order to obtain its internal failure probability and usage profile. They have presented steps
to build this flow graph; once such a graph is built and associated with a specific service, it
portrays reliability information for that service.
Kim et al. (Kim et al., 2007) have presented an innovative approach to fault prediction. They
argue that a changed entity, a newly added entity, an entity that introduced a fault most recently,
and entities logically coupled to it tend to introduce faults soon. To exploit these localities they
keep track of recent faults and their locations (the change history of a software project) and
predict the most fault-prone entities. A cache keeps the current list of the most fault-prone
entities. If a fault is introduced after a certain revision, a cache hit is counted if the
fault-introducing entity is already in the cache; otherwise it is counted as a cache miss.
Similarly, when a bug in an entity is fixed, the presence of that entity is also checked in the
cache, and it is counted as a hit if the entity is already there. They suggest a cache size of
10% of the total number of entities and use Least Recently Used (LRU) as the replacement
algorithm. The hit rate tells how accurately the prediction model is working: a good hit rate
means that the cache keeps the most fault-prone entities, which makes the approach dynamic.
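The hit/miss bookkeeping described above can be sketched as follows. The entity names and fault history are hypothetical, and the sketch only illustrates the caching idea (10% capacity, LRU replacement, hit-rate accounting), not the authors' tool.

```python
from collections import OrderedDict

class FaultCache:
    """Minimal sketch of the cache-based locality idea: keep the most recently
    fault-involved entities and count hits when new fault events arrive."""

    def __init__(self, total_entities, fraction=0.10):
        self.capacity = max(1, int(total_entities * fraction))  # suggested 10%
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def on_fault(self, entity):
        if entity in self.cache:
            self.hits += 1
            self.cache.move_to_end(entity)       # refresh recency on a hit
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # evict least recently used
            self.cache[entity] = True

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Hypothetical fault history over a project with 20 entities (cache size 2).
cache = FaultCache(total_entities=20)
for entity in ["a.c", "b.c", "a.c", "a.c", "c.c", "b.c"]:
    cache.on_fault(entity)
print(round(cache.hit_rate(), 2))
```

A high hit rate would indicate that recent fault locality is indeed predicting where the next faults appear.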
2.5 Performance Evaluation Studies
Studies that investigate methods to build and evaluate defect prediction models have also been
conducted. These studies compare different prediction models, assess the impact of using different
metrics, and discuss the evaluation parameters that should be used to gauge the performance of
the models. Examples of such studies include (Lessmann et al., 2008, Menzies et al., 2010,
Arisholm et al., 2010). In this chapter we conduct a study similar to (Lessmann et al., 2008) and
include the new models proposed after 2008.
Lessmann et al. (Lessmann et al., 2008) proposed a framework for evaluating different classi-
fiers. The framework is tested on 22 classifiers using 10 data sets from the PROMISE Repository
(Menzies et al., 2012). A split-sample setup is used, that is, the data sets are randomly partitioned
into 1/3 for learning (model building) and 2/3 for performance estimation (testing). For models
having hyperparameters, a set of candidate values is selected for each hyperparameter and all
combinations are checked experimentally using 10-fold cross validation on the training data. The
combination of hyperparameters that produces the best predictive result is chosen and is used to
build a model on the whole training data set. The split-sample setup and 10-fold cross validation
have been used in other classifier evaluation experiments as well (Mende, 2010, Menzies et al.,
2007, Koru and Liu, 2005b). The results are evaluated and compared using ROC curve analysis,
more specifically, using AUC.
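The evaluation setup just described can be sketched with scikit-learn on synthetic data. The PROMISE data sets and the original 22 classifiers are not reproduced here; the random forest, the parameter grid, and the synthetic class imbalance are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for a defect data set (~80% clean modules).
X, y = make_classification(n_samples=600, n_features=10, weights=[0.8],
                           random_state=0)

# Split-sample setup: 1/3 for model building, 2/3 for performance estimation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=1 / 3, random_state=0, stratify=y)

# Candidate hyperparameter values are checked with 10-fold CV on the training
# data; the best combination is then refit on the whole training set.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [10, 50],
                                "max_depth": [3, None]},
                    cv=10, scoring="roc_auc", refit=True)
grid.fit(X_train, y_train)

# The final comparison uses AUC on the held-out 2/3.
auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print(round(auc, 3))
```

The same skeleton (split, inner cross-validation for tuning, AUC on the held-out portion) is what later replications of this framework reuse.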
Based on previous works on defect predictors built from static code features (Lessmann et al.,
2008, Menzies et al., 2008, 2007), Menzies et al. (Menzies et al., 2010) point out that “better
data mining technology is not leading to better defect predictors.” They call this the ceiling effect.
Due to this ceiling effect, they claim that researchers have reached “the limits of the standard
goal” of optimizing AUC (pd, pf), and in their work they explore the effects of changing the
standard goal. For this, they propose and evaluate WHICH, a meta-learner that can be customized
according to varying objectives depending on the domain. They also claim that varying the
standard goal will help break through the ceiling. Another major use, complementing the idea
of customizable goals, is that WHICH can help different businesses achieve different goals:
considering the resources available to them, different businesses have different goals when it
comes to defect prediction (Menzies et al., 2010). Menzies et al. (Menzies et al., 2010) also
include details of the debate on the usefulness of static code metrics and argue in favor of these
metrics.
Fenton et al. (Fenton and Neil, 1999) present a critical review of the existing literature on soft-
ware metrics and the statistical models developed using them. They report that prediction models
are based on size and complexity metrics, testing metrics, process quality data, and multivariate
approaches. In particular, the Pareto principle, “a very small proportion of the defects in a system
will lead to almost all the observed failures in a given period of time,” implies that removing a
large number of defects in a system may not actually improve its reliability. In their critique of
current approaches to defect prediction, they identify problems related to these approaches and
suggest a framework based on Bayesian networks as a solution. They argue that using only
complexity metrics as indicators of defects is neither sufficient nor appropriate. They point out
that statistical methodologies should be used with a major focus on data quality and the method
of evaluation. Finally, they stress the importance of identifying the link(s) between defects and
failures.
One of the aims of any software engineering experiment is that it should be repeatable and
its results generalizable. Mende (Mende, 2010) performs two replications and reports the
experience gained and the problems faced. Such studies highlight the information needed for
replication of a defect prediction experiment. Replication of defect prediction experiments may
not produce exact results, since many of the statistical techniques used are inherently
non-deterministic (Mende, 2010), but the results should at least be consistent. Mende (Mende,
2010) notes that the details given and the evaluation experiment performed by Lessmann et al.
(Lessmann et al., 2008) seem to be up to the mark. For the same reason the techniques used by
Lessmann et al. (Lessmann et al., 2008) have also been adopted by Jiang et al. (Jiang et al.,
2008a) and Mende et al. (Mende and Koschke, 2009).
Jiang et al. (Jiang et al., 2008a) evaluate a variety of performance measures based on different
scenarios on eight data sets from the PROMISE repository (Menzies et al., 2012). A comparison
of several performance measures is also given, including AUC and lift charts. Cost curves are
introduced along with their merits. The experiment is conducted using six classifiers implemented
in WEKA (Witten et al., 2008). The conclusion is that different performance measures are
appropriate for different software systems, because different systems have different requirements
when it comes to defect prediction (Menzies et al., 2010).
Mende et al. (Mende and Koschke, 2009) consider the evaluation of a defect prediction model
with respect to the effort or cost of quality control activities for each module. A trivial model is
compared with five other classifiers, first using only the measure LOC and then an effort-sensitive
performance measure. For the former, the trivial classifier performed well, but for the latter it
performed the worst. The authors claim that the cost of quality assurance of a module depends to
some degree on the size of the module, which is also supported by Koru et al. (Koru and Liu,
2005b). This work also closely follows the experimental setups of Lessmann et al. (Lessmann
et al., 2008) and Jiang et al. (Jiang et al., 2008a) in the selection of data sets, algorithms, and
evaluation methodology. However, for evaluation, it considers two additional measures, both
based on the cost associated with each module.
Menzies et al. (Menzies et al., 2007) emphasize the need for convergence of studies and how
the use of publicly available data sets can enable researchers to compare their techniques. The
paper advocates the use of static code attributes to learn defect predictors and shows that the
subset of attributes used by a particular classifier for a particular data set matters more than any
single best set of attributes, since different classifiers perform differently on different data sets.
It also proposes the use of Naïve Bayes with logNums and, using WEKA, shows how the method
proposed in the paper performs better than three WEKA algorithms: OneR, J48, and Naïve
Bayes. For model building, it uses a split-sample setup and 10-fold cross validation. To evaluate
the classifiers, it uses ROC curve analysis, and to show results it makes use of performance
deltas.
One useful way of surveying the techniques, tools, and models used in a particular field or
problem domain is to consult literature review(s) on that subject, which help identify major
components and trends. Hall et al. (Hall et al., 2011) give a broad overview of the tools and
techniques used in defect prediction and the current trends, and report limitations as well. They
report that the impact on model performance of specific context variables such as maturity,
application area, and programming language has not been studied extensively. They also report
that the performance of models increases when the size of the system increases (there is more
data), and they examine the types of independent variables used for defect prediction. Three
main categories have been identified: process metrics (e.g., previous change and fault data),
product metrics (e.g., static code metrics), and metrics relating to developers. Fault severity is
another aspect that has not been studied in detail; however, severity is difficult to measure, and
Menzies et al. (Menzies et al., 2007) describe it as too vague a concept to investigate reliably.
Hall et al. (Hall et al., 2011) also discuss the quality of defect prediction studies and the NASA
data sets (Menzies et al., 2012) in particular. Catal et al. (Catal and Banu, 2009) have also
performed a systematic study of existing defect prediction models.
Jiang et al. (Jiang et al., 2008b) explore the performance of different classifiers based on mis-
classification costs: “the ratio of costs for false positives to the costs of false negatives.” They
assume that the misclassification costs for every module are the same. Under this assumption,
they confirm that different misclassification costs have an immense effect on the selection of
suitable classifiers. Assigning misclassification costs is used to bias the classification models.
This technique is useful in analyzing domain-specific learning of classifiers, but the results of
such studies are difficult to converge on for comparisons between classifiers (Lessmann et al.,
2008).
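One common way to act on such a cost ratio is to choose, for each module, the label with the lower expected misclassification cost given a classifier's estimated defect probability. The sketch below illustrates this idea only; it is not the authors' exact procedure, and the probabilities and costs are invented.

```python
def cost_sensitive_label(p_defective, cost_fp, cost_fn):
    """Pick the label with the lower expected misclassification cost.

    Predicting 'defective' risks a false positive, weighted by the probability
    the module is actually clean; predicting 'clean' risks a false negative,
    weighted by the probability the module is defective.
    """
    expected_cost_if_flagged = (1 - p_defective) * cost_fp
    expected_cost_if_passed = p_defective * cost_fn
    if expected_cost_if_flagged < expected_cost_if_passed:
        return "defective"
    return "clean"

# With equal costs, a module with a 0.3 defect probability is passed as clean;
# if missing a defect is five times as costly, the same module is flagged.
print(cost_sensitive_label(0.3, cost_fp=1, cost_fn=1))  # clean
print(cost_sensitive_label(0.3, cost_fp=1, cost_fn=5))  # defective
```

This makes concrete why the chosen cost ratio can change which classifier, and which decision threshold, looks best.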
Demsar (Demsar, 2006) presents a set of guidelines for carrying out a statistical analysis that
is as accurate as possible when comparing a set of models over multiple data sets. A group of
non-parametric statistical methods is proposed for comparing classifiers under conditions in
which parametric statistical analysis is not suitable. An analysis of the performance of the
recommended statistics on classification tasks is provided, and the proposed statistics are shown
to be more convenient than parametric techniques. The work examines the new recommendations
and introduces the Nemenyi test for making all pairwise comparisons. After empirical evaluations,
it also proposes the use of the Wilcoxon signed-ranks test and the Friedman test over others.
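The two recommended tests are available in SciPy. The sketch below runs them on hypothetical AUC scores of three classifiers over six data sets (the scores are invented for illustration).

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical AUC of three classifiers on the same six data sets.
auc_a = [0.78, 0.81, 0.74, 0.79, 0.83, 0.77]
auc_b = [0.72, 0.75, 0.70, 0.74, 0.76, 0.71]
auc_c = [0.74, 0.78, 0.71, 0.76, 0.79, 0.73]

# Friedman test: do the classifiers differ across data sets at all?
friedman_stat, friedman_p = friedmanchisquare(auc_a, auc_b, auc_c)
print(friedman_p < 0.05)

# Wilcoxon signed-ranks test for one pairwise comparison (A vs. B).
wilcoxon_stat, wilcoxon_p = wilcoxon(auc_a, auc_b)
print(wilcoxon_p < 0.05)
```

In practice the Friedman test is run first, and only if it rejects the null hypothesis are post-hoc pairwise tests (e.g., Nemenyi) applied.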
Koru et al. (Koru and Liu, 2005b) discuss how module size affects prediction results. The
variation in module size is due to the multiple definitions of a module. Defect prediction experi-
ments on data sets that take functions and methods as modules (e.g., JM1 and KC2) report a low
probability of detecting defective modules, whereas studies that consider larger program units as
modules (e.g., a set of source files) report successful defect prediction results. This is because
for large modules the static attribute values show more variation, which makes it easier for a
statistical algorithm to differentiate between a large defective and a large non-defective module,
whereas it is harder to differentiate between a small defective and a small non-defective module.
By testing class-level and method-level modules of the KC1 data, Koru et al. (Koru and Liu,
2005b) conclude that defect prediction should be done at a higher level of granularity (e.g., class
level) as opposed to a lower one (e.g., method level). Like Menzies et al. (Menzies et al., 2007),
they also advocate choosing different classification techniques for different data sets.
Becker et al. (Becker et al., 2006) have suggested guidelines for selecting an appropriate per-
formance prediction method, for component-based systems, from a larger set of prediction
methods. In order to improve the prediction capabilities of a model, they have presented a set of
characteristics that a prediction model should possess: accuracy, adaptability, cost effectiveness,
compositionality, scalability, analyzability, and universality. They have presented a comparison
framework based on these characteristics and compared the existing performance prediction
techniques, mentioning the inherent weaknesses and strengths of each. After the comparison they
give guidelines for selecting a prediction model for component-based systems.
2.6 Studies to Remove Inconsistencies in Software Measurement Terminology
Various studies have attempted to standardize software metrics (Purao and Vaishnavi, 2003,
Vaishnavi et al., 2007), remove inconsistencies in software measurement terminology (García
et al., 2006, 2009), and establish a uniform vocabulary of software measurement terminology and
principles (IEEE, 1998, 2008, ISO/IEC, 2001). Purao et al. (Purao and Vaishnavi, 2003) have
presented a uniform and formal representation which aggregates and unifies the metrics dealing
with related aspects of object-oriented (OO) software. They have noticed that fragmented work
on OO software metrics has resulted in proposals of similar metrics. Vaishnavi et al. (Vaishnavi
et al., 2007) have studied OO product metrics and have suggested a generic framework for them.
The framework has been formalized based on the structural attributes of the underlying metrics
and the relatedness of the metrics to each other. García et al. (García et al., 2006, 2009) have
recognized the problem of inconsistencies in the software metrics literature and have suggested
an ontology-based approach to establish a consistent terminology for software measurement.
They have identified inconsistencies among existing research, among standards, and even within
a single standard. Their work includes the semantic relationships of software metrics and the
development of a concept glossary. A number of standards have also been developed to establish
a uniform vocabulary of software measurement terminology, principles, and methods (IEEE,
1998, 2008, ISO/IEC, 2001). IEEE Std 1061-1998 (IEEE, 1998) presents a framework for
software quality metrics that allows a metric to be divided into subfactors. This approach can be
used to study existing metrics, find their commonalities, and identify which metrics can be
considered sub-metrics of a certain metric. ISO/IEC 9126-1 (ISO/IEC, 2001) focuses on uniform
terminology for product quality related terms. An adoption of ISO/IEC 15939:2007 (IEEE, 2008)
defines a measurement process for software and systems engineering. The standard describes a
detailed method to carry out the measurement process and can be applied to select appropriate
metrics on the basis of an organization's information needs. Based on the process outlined in the
standard, an organization may collect, store, analyze, and interpret measurement data according
to its information needs.
In certain scenarios, an organization may not be following any standard to carry out the mea-
surement process but may still want to use existing quality prediction models. Such an
organization will certainly have some data with which to build a prediction model. It would be
very helpful for the organization if there existed a dataset similar to its own, so that the defect
prediction information regarding that dataset could be used to analyze the organization's data.
But the metrics used by the existing datasets have inconsistent labels. In order to match two
datasets, at least their metric labels should be consistent. The above standards do not mention
the problem of inconsistent labeling of software metrics. So far, the software product metrics
used in quality prediction have also not been collected in one place.
2.7 Analysis and Discussion
From the preceding sections we can observe that software defect prediction studies present two
major views. These views differ in focus, nature, use of static code metrics, and availability of
data, as summarized in Table 2.4, which also provides examples of studies from each view.
View 1 emphasizes the significance of the causes of defects and understanding the relationship
between software metrics and defects. View 2 includes studies that have used public datasets and
focused on the selection of software metrics that are important for finding defects, on
improvements in the classification of defects, and on the identification of software metrics that
do not help in classifying defect-prone modules. View 1 requires expert opinion to be
incorporated and works with probabilistic approaches, whereas the effort made in support of
View 2 is based on empirical evidence. View 1 suggests that static code metrics are insignificant
unless process metrics or additional information is also used, whereas View 2 emphasizes that
defects in software can be predicted through software code metrics. View 1 highlights the
importance of investigating the causal relationship between software metrics and defects, while
View 2 leads to building classification models. Because of the public availability of software
defect data, a larger number of empirical studies belong to View 2; studies that belong to View 1
are relatively few. Although the significance of View 1 cannot be denied, such studies cannot be
performed without data. Association mining can potentially bridge the gap between the two
views by giving associations between software metrics and defects. These associations do not
directly identify causes of defects but can be investigated further to understand the relationship
between software metrics and defects.
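The kind of metric-to-defect association such mining would surface can be illustrated with a support/confidence computation over module records. The records and the rule below are invented for illustration; a real study would mine rules from an actual defect dataset.

```python
# Each record lists the discretized properties observed for one module.
modules = [
    {"high_complexity", "high_loc", "defective"},
    {"high_complexity", "defective"},
    {"high_complexity"},
    {"high_loc"},
    {"low_complexity"},
]

def confidence(antecedent, consequent, records):
    """Confidence of the rule antecedent -> consequent,
    i.e. P(consequent | antecedent) over the records."""
    with_antecedent = [r for r in records if antecedent <= r]
    if not with_antecedent:
        return 0.0
    return sum(1 for r in with_antecedent if consequent <= r) / len(with_antecedent)

rule_conf = confidence({"high_complexity"}, {"defective"}, modules)
print(rule_conf)
```

A high-confidence rule does not establish causation, but it points to a metric-defect relationship worth investigating further, which is exactly the bridging role suggested above.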
It is not easy to develop an empirical approach in the initial phases of the SDLC, since data is
needed to build a model. For example, statistical techniques like Discriminant Analysis
(Khoshgoftaar et al., 1996, Munson and Khosgoftaar, 1992) and Factor Analysis (Khosgoftaar
and Munson, 1990) need previous data, or data from similar projects, on which to build the
prediction model. Based on the metrics calculated from these data, the quality of the next phase
can be determined. This prediction is very helpful in iterative development approaches, where
the organization can at least roughly predict the quality of the next iteration. The statistical
techniques can be employed either after the release or right after the design, when everything
about the number of classes and the total number of functions is known. Mohanty (Mohanty,
1979) also discussed statistical models for estimating software reliability and error content, but
those models were applicable only once the software had been deployed.

Certain approaches have the potential to predict quality in early stages. Some of them can be
applied even if a very small amount of data is available in the initial phases. For example, during
the design phase we do not have a number of software metrics (like Lines of Code or Number of
Bugs), only some design metrics. In such scenarios SVM and rule-based systems can be
employed for predicting quality. The AI techniques based on machine learning (Thwin and
Quah, 2002, Pizzi et al., 2002) also need previous or similar data to first perform the learning and
then to classify based on the learnt model. So, like the statistical techniques, the NN-based
models (Quah and Thwin, 2003, Wang et al., 2004, Pizzi et al., 2002), Case-Based Reasoning
(Ganesan et al., 2000), and other classification techniques (Khoshgoftaar and Allen, 1999b, Dick
and Kandel, 2003) depend on data from previous releases or from similar projects.

Tab. 2.4: Two Major Views in Software Defect Prediction

                              View 1                                    View 2
  Focus                       Causes of defects; relationship           Important metrics; classification
                              between software metrics and defects      models based on correlation etc.
  Nature                      Expert-opinion based                      Empirical
  Use of static code metrics  Suggest incorporating expert opinion      Use as they are
  Availability of data        Data mostly not publicly available        Data mostly publicly available
  Studies in each group       (Neil and Fenton, 1996, Fenton and        (Menzies et al., 2007, Peng et al.,
                              Neil, 1999, Fenton et al., 2002, 2008,    2011, Wang et al., 2011, Bell et al.,
                              Pai and Bechta Dugan, 2007, Klas          2011, Song et al., 2011, Sun et al.,
                              et al., 2010)                             2012, Bishnu and Bhattacherjee, 2012,
                                                                        Wang et al., 2013, He et al., 2015,
                                                                        Okutan and Yildiz, 2014)
No distinction has been noticed in the use of a prediction approach for a particular development
paradigm; any prediction approach can be used for any paradigm. The choice of approach
depends on the kind of data available to build the prediction model and/or the goal of the
prediction. The use of metrics related to one paradigm for predicting defects in another paradigm
has also been observed.
The targeted software quality factor also plays a significant role in the overall quality prediction
model. Whenever software quality is predicted, a basic objective is set, for example whether we
need to predict the stability or the reliability of the software. Khan et al. (Khan et al., 2006) also
mention that each prediction model has its own objectives in this regard. For example, Becker et
al. (Becker et al., 2006) only predict the performance of the software, whereas Xing et al. (Xing
and Guo, 2005) predict software reliability. These objectives should also be kept in mind while
selecting a technique appropriate for a software system. A few examples of quality factors are
number of errors, performance (Becker et al., 2006, Liu et al., 2005), reliability (Reussner et al.,
2003, Xing and Guo, 2005, Grassi and Patella, 2006), stability (Grosser et al., 2003, Bouktif
et al., 2004), dependability (Grassi, 2004), and customer perception (Mockus et al., 2005).

The literature survey revealed that many metrics have been used with the same definition but
different names, and sometimes different metrics have been given the same name.
2.8 Summary
Software metrics, specifically code metrics, are used to develop software defect prediction
models. The literature suggests that code metrics are often used in SDP models: their collection
is easy and they are reported to give good predictions. Though code metrics predict well, they
cannot be used for early predictions because code becomes available late in the software
lifecycle. The literature suggests that requirements and design metrics are used for early
predictions. All these metrics predict the occurrence of defects and not the causes of defects, the
reason being that data on causal relationships is not available in code metrics.

Defect data for prediction is also publicly available. The public datasets likewise consist of
code metrics. These datasets are dominated by examples of defect-free modules, which makes
the prediction of defect-prone modules difficult. Software defect prediction studies have also
been done using data which is not publicly available (Khoshgoftaar et al., 1996, Quah and
Thwin, 2003, Wang et al., 2004). There are also studies that identify the causes of defects, unlike
the studies that predict the occurrence of defects based on the size and complexity of software.
The literature further suggests that experiments conducted using public data can be replicated,
and studies reporting such experiments are more useful.
A variety of software defect prediction models exist in the literature, using a range of techniques
such as neural networks, evolutionary algorithms, Bayesian belief networks, regression-based
techniques, decision trees, decision tables, and Naive Bayes classifiers. Empirical studies have
also been performed to investigate the relationships between software metrics and defects.
Different studies have investigated the causes of defects, selected the software metrics that are
important for finding defects, and identified the software metrics that do not help in classifying
defect-prone modules. Further, the use of association mining for defect prediction has also been
reported. Bayesian Belief Networks (BBN) have been widely used to discover causes of defects.
Software process metrics have been combined with software product metrics to overcome
serious problems implicit in defect prevention, detection, and complexity. The significance of
causal analysis in software quality prediction has been highlighted, and holistic models for
software defect prediction using BBN have been presented as alternatives to the single-issue
models proposed in the literature. Bayesian networks have also been used for accurate predictions
of software defects in a range of real projects, without commitment to a particular development
life cycle. Association rule mining has also been employed to discover the patterns of actions
that are likely to cause defects in the software. Defect causal analysis has been duly
acknowledged in software process improvement techniques as well.
A variety of measures are available to assess the performance of prediction models: Accuracy,
Precision, J-Coefficient, F-Measure, G-Mean, Recall, and Area Under the ROC Curve (AUC)
are a few of them. Existing studies prefer AUC over other performance measures. In the software
domain, detection of a defective module is more important than detection of a defect-free
module, suggesting that Recall is important in the domain of software defects. In the literature,
Accuracy and Precision are not considered good performance measures compared to AUC and
Recall in the case of imbalanced datasets. Also, the performance achieved using a number of
data mining techniques suffers from a ceiling effect, meaning that the performance of new defect
prediction models is not improving significantly.
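The imbalance argument can be made concrete with a small computation. The confusion-matrix numbers below are invented for illustration: a predictor that misses most defective modules can still report high accuracy.

```python
def classification_scores(tp, fp, fn, tn):
    """Accuracy, Precision, and Recall from a confusion matrix,
    taking 'defective' as the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Imbalanced data: 10 defective vs. 990 clean modules. The model finds only
# 2 of the 10 defects, yet accuracy stays high because the clean majority
# dominates -- which is why Recall (and AUC) matter more in this domain.
accuracy, precision, recall = classification_scores(tp=2, fp=5, fn=8, tn=985)
print(round(accuracy, 3), round(recall, 2))
```

Here accuracy is 0.987 while recall is only 0.2, illustrating why accuracy alone is considered misleading on imbalanced defect data.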
A significant number of studies have been conducted to evaluate the performance of existing
defect prediction models. These studies use public datasets for the performance evaluation so
that the comparisons between the models can be refuted and/or replicated. Lessmann et al.
(Lessmann et al., 2008) have compared the performance of 22 defect prediction models.

In the following chapter we present a comparative study of models reported after 2008, using
Lessmann's comparison framework and steps.
3. SOFTWARE DEFECT PREDICTION MODELS: A COMPARISON
This chapter presents a comparative analysis of existing models, which helps select the prediction
model with the best performance so that the performance of the selected model can then be
improved using the association mining based preprocessing approach. This comparative analysis
follows the steps of Lessmann et al. (Lessmann et al., 2008): selecting classifiers, selecting
datasets, comparing the performance of the classifiers on multiple datasets using statistical tests,
and presenting the statistical difference in performance of the selected classifiers.
To conduct the comparative study of defect prediction models, we select 12 datasets from the
list of datasets provided in Chapter 2. These 12 datasets have been selected based on a similar
study with novel findings on the performance of classifiers (Lessmann et al., 2008). The
difference between their work and the present work is that they used 10 datasets whereas the
present study uses 12 datasets for the comparison of classifiers. Lessmann et al. compared 22
classifiers; the present work compares 5 classifiers, namely Random Forests, Naive Bayes,
CHIRP, DTNB, and FURIA. The first two of these were evaluated by Lessmann et al.
(Lessmann et al., 2008), whereas the remaining three have been reported in the literature after
their study. The scale of the critical difference in performance shown in the present study differs
from the one used by Lessmann et al. (Lessmann et al., 2008) because the numbers of datasets
and models are different.
The study by Lessmann et al. (Lessmann et al., 2008) provides a significant framework for com-
paring the performances of defect prediction models. Defect prediction models reported in the
literature after 2008 should be compared with the previous models to see whether a significant
change in prediction performance has been achieved. Therefore, the present study compares the
aforementioned 5 models using the framework provided by Lessmann et al. (Lessmann et al., 2008).
Random Forests and Naive Bayes have been selected from the existing comparison because Random
Forests was the best performer in their study (Lessmann et al., 2008) and NB has been used
extensively (Lessmann et al., 2008, Menzies et al., 2010, Mende, 2010, Jiang et al.,
2008a, Mende and Koschke, 2009, Menzies et al., 2007, Demsar, 2006) in the defect prediction
literature. For reference, we provide a brief description of the 5 models.
3.1 Description of Models
3.1.1 Naive Bayes
The Naive Bayes (NB) classifier is a simple probabilistic classifier based on Bayes' Theorem (Jiawei
and Micheline, 2002). NB classifiers are called "naive" because they assume independence among
the features. The classifier accepts a data sample in the form of an n-dimensional feature vector X,
where each dimension corresponds to a measurement made on one of the n attributes of the sample.
For a feature vector X = (x1, x2, x3, ..., xn), each xi measures an attribute Ai of the data sample.
Given a data point X without a class label, the NB classifier predicts the class of X.
In case of defect prediction, each vector represents a software module where each attribute is
a metric value corresponding to an attribute of the software module (for example size, cyclomatic
complexity, essential complexity, number of children etc.). The number of classes in our case
is 2 which are Defect-Prone (D) and Not-Defect-Prone (ND). For the 2 class problem the NB
classifier maximizes the posterior probability of a class Cj conditioned on X . The classifier uses
the Bayes Theorem:
P(Cj|X) = P(X|Cj) P(Cj) / P(X),   Cj ∈ {D, ND}   (3.1)
The NB classifier assigns a class label Cj to an unknown sample X if and only if P(Cj|X) >
P(Ck|X) for j ≠ k. This shows that the classifier works on the principle of maximizing P(Cj|X).
Using the Bayes Theorem in Equation 3.1, only P(X|Cj)P(Cj) is maximized because P(X) remains
constant for both classes. Generally, the class probabilities are assumed to be equally likely by this
classifier and the prior probabilities of the classes are estimated as P(Cj) = sj/s, where sj
is the number of training samples belonging to class j and s is the total number of samples. The
NB classifier performs well compared to many classifiers in spite of the "naive" assumption mentioned
earlier in the section. The classifier also follows a smart mechanism to deal with missing values.
During the training phase, the classifier does not consider an instance having attributes with missing
values in the calculation of the class prior probabilities. When performing classification, the attributes
with missing values are dropped from the calculations. Given our proposed preprocessing approach of
introducing missing values into the dataset (presented in Chapter 4), this elegant treatment
of missing values makes NB a good candidate for selection.
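The training and scoring behaviour described above can be sketched as follows. This is a minimal illustrative sketch, not the thesis implementation: the tiny dataset, the `train_nb`/`classify_nb` helpers, and the smoothing are hypothetical; missing metric values are represented by `None` and skipped, as described.

```python
from collections import Counter, defaultdict

def train_nb(samples, labels):
    """Count class frequencies and per-attribute value frequencies,
    skipping missing (None) values."""
    priors = Counter(labels)
    # counts[cls][attr_index][value] = frequency
    counts = defaultdict(lambda: defaultdict(Counter))
    for x, y in zip(samples, labels):
        for i, v in enumerate(x):
            if v is not None:              # drop missing values
                counts[y][i][v] += 1
    return priors, len(labels), counts

def classify_nb(x, priors, total, counts):
    """Pick the class maximizing P(Cj) * prod P(xi|Cj); missing attributes
    are dropped from the product, as described in the text."""
    best_cls, best_score = None, float("-inf")
    for cls, prior in priors.items():
        score = prior / total              # P(Cj) = sj / s
        for i, v in enumerate(x):
            if v is None:                  # missing attribute: dropped
                continue
            attr = counts[cls][i]
            n = sum(attr.values())
            # Laplace-style smoothing to avoid zero probabilities
            score *= (attr[v] + 1) / (n + len(attr) + 1)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

# Hypothetical discretized metric vectors (e.g. binned LOC, complexity)
X = [("hi", "hi"), ("hi", "lo"), ("lo", "lo"), ("lo", None)]
y = ["D", "D", "ND", "ND"]
model = train_nb(X, y)
print(classify_nb(("hi", "hi"), *model))
```

Note how an instance with a `None` attribute still contributes its observed attributes to the counts; the missing value simply never enters a likelihood product.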
3.1.2 Random Forests
Random Forests (Rnd For) (Breiman, 2001) is an aggregation of decision tree based classifiers.
Each tree is specialized for a part of the training set and its vote is counted towards the final
classification. The trees are developed using different initial samples of the data, but the size of the
initial sample remains the same for all trees. At each node, a set of candidate attributes is randomly
selected and the best split among them is adopted for further processing. Training of the trees in a
Random Forest is based on tree bagging. Given a training set X = x1, ..., xn with responses
Y = y1, ..., yn, and B the number of trees in the forest, bagging repeatedly selects a random sample,
with replacement, from the training set and fits a tree to each sample:
For i = 1, ..., B :
1. Sample, with replacement, n training examples from X, Y ; call these Xi, Yi.
2. Train a decision or regression tree fi on Xi, Yi.
Once all trees are trained, a test instance (or a set of test instances) is classified by all the
trees. The vote from each tree is taken with equal weight. A test instance x′ is classified by a
simple majority vote or, for regression, by averaging the predictions of the individual trees on x′:

f = (1/B) Σ_{i=1}^{B} f_i(x′)
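The two-step bagging loop above can be sketched as follows. A one-level decision stump stands in for a full decision tree, and the data and helper names are hypothetical; this is not Breiman's implementation (which additionally randomizes attribute selection at each node).

```python
import random
from collections import Counter

def train_stump(X, y):
    """One-level 'tree': pick the threshold on feature 0 with fewest errors."""
    best_t, best_err, best_labels = None, len(y) + 1, None
    for t in set(x[0] for x in X):
        left  = [c for x, c in zip(X, y) if x[0] <= t]
        right = [c for x, c in zip(X, y) if x[0] >  t]
        l_lab = Counter(left).most_common(1)[0][0] if left else y[0]
        r_lab = Counter(right).most_common(1)[0][0] if right else y[0]
        err = sum(c != l_lab for c in left) + sum(c != r_lab for c in right)
        if err < best_err:
            best_t, best_err, best_labels = t, err, (l_lab, r_lab)
    return lambda x: best_labels[0] if x[0] <= best_t else best_labels[1]

def bagged_forest(X, y, B=25, seed=1):
    rng = random.Random(seed)
    n, trees = len(X), []
    for _ in range(B):
        # Step 1: sample, with replacement, n training examples (Xi, Yi)
        idx = [rng.randrange(n) for _ in range(n)]
        Xi, Yi = [X[i] for i in idx], [y[i] for i in idx]
        # Step 2: fit a tree (here a stump) f_i on Xi, Yi
        trees.append(train_stump(Xi, Yi))
    def predict(x):
        # Equal-weight majority vote over all trees
        return Counter(t(x) for t in trees).most_common(1)[0][0]
    return predict

# Hypothetical one-metric modules: small values ND, large values D
X = [(1,), (2,), (3,), (10,), (11,), (12,)]
y = ["ND", "ND", "ND", "D", "D", "D"]
forest = bagged_forest(X, y)
print(forest((2,)), forest((11,)))
```

Even when an individual bootstrap sample is unrepresentative, the equal-weight vote over all B trees keeps the aggregate prediction stable.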
3.1.3 Composite Hypercubes on Iterated Random Projections (CHIRP)
Composite Hypercubes on Iterated Random Projections (CHIRP) (Wilkinson et al., 2011) is an
ensemble classifier that classifies test instances based on majority vote obtained by m runs of
CHIRP. CHIRP iteratively employs three stages namely projection, binning and covering for each
class as shown in Figure 3.1. The scoring is done based on Composite Hypercube Description
Regions (CHDR) (which can also be considered as sets of rectangles) (Wilkinson et al., 2012). A
Hypercube Description Region (HDR) represents the set of points less than a fixed distance from
the center (another point), whereas a CHDR is the set of points in union of zero or more HDRs.
The CHDRs are used to identify if a set of points forming a region belongs to any class or not
(Wilkinson et al., 2012). When scoring a given test instance, CHIRP transforms and re-scales the
test point, passes it through the list of CHDRs, and projects the test point for each CHDR.
The first rectangle in a CHDR that encloses the projected point determines the class of the test point.
In case no rectangle encloses the test point, the nearest rectangle determines the class.
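The scoring rule just described (first enclosing rectangle wins, otherwise the nearest rectangle) can be illustrated with one-dimensional "rectangles" (intervals). This is only a toy sketch of the CHDR lookup, not the actual CHIRP algorithm; the class labels and intervals are made up.

```python
def classify_chdr(chdrs, point):
    """chdrs: list of (class_label, [(lo, hi), ...]) in priority order."""
    # Rule 1: the first rectangle that encloses the point decides the class.
    for label, rects in chdrs:
        for lo, hi in rects:
            if lo <= point <= hi:
                return label
    # Rule 2: otherwise the nearest rectangle decides.
    def dist(rect):
        lo, hi = rect
        return max(lo - point, point - hi, 0)
    label, _ = min(((lbl, dist(r)) for lbl, rects in chdrs for r in rects),
                   key=lambda t: t[1])
    return label

# Hypothetical CHDRs: class D covers two intervals, class ND covers one
chdrs = [("D", [(0.0, 1.0), (5.0, 6.0)]), ("ND", [(2.0, 3.0)])]
print(classify_chdr(chdrs, 0.5))   # inside the first D rectangle
print(classify_chdr(chdrs, 3.6))   # nearest rectangle belongs to ND
```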
3.1.4 Decision Table - Naive Bayes (DTNB)
The Decision Table - Naive Bayes (DTNB) classifier (Hall and Frank, 2008) works in the same manner
as a decision table does, except that the selection of discriminative attributes is different. Instead of
the standard mechanism of selecting the discriminative attributes by maximizing cross-validation
performance, the attributes are split into two groups: one for Naive Bayes and the other for the
Decision Table. One split is modeled by NB and the other by DT. Overall class probability esti-
mates are generated by combining the probability estimates given by each model.

Fig. 3.1: CHIRP working in training and scoring

The overall
class probability estimates are calculated as follows (Hall and Frank, 2008):
P(y|X) = α × PDT(y|XDT) × PNB(y|XNB) / P(y)   (3.2)
where XDT and XNB are sets of attributes in DT and NB respectively, P (y) is the prior probability
of the class, α is a normalization constant, PDT (y|XDT ) is the class probability estimate obtained
from DT and PNB(y|XNB) is the class probability estimate computed for NB.
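Equation 3.2 can be evaluated directly once the two component estimates are available. Below is a small sketch with hypothetical probability values; α is recovered by normalizing over the two classes.

```python
def combine_dtnb(p_dt, p_nb, prior):
    """Combine DT and NB class probability estimates as in Eq. 3.2.
    p_dt, p_nb, prior: dicts mapping class -> probability."""
    raw = {y: p_dt[y] * p_nb[y] / prior[y] for y in prior}
    alpha = 1.0 / sum(raw.values())          # normalization constant
    return {y: alpha * v for y, v in raw.items()}

# Hypothetical component estimates for classes D and ND
p = combine_dtnb(p_dt={"D": 0.6, "ND": 0.4},
                 p_nb={"D": 0.7, "ND": 0.3},
                 prior={"D": 0.3, "ND": 0.7})
print(p)  # combined estimates sum to 1; class D dominates here
```

Dividing by the prior P(y) is what prevents the class prior from being counted twice, since it already enters both component estimates.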
3.1.5 Fuzzy Unordered Rule Induction Algorithm (FURIA)
Fuzzy Unordered Rule Induction Algorithm (FURIA) (Huehn and Huellermeier, 2009) is an ex-
tension of the rule learner RIPPER (Cohen, 1996). Rule sets generated by FURIA are simple and
comprehensible, as they are in the case of RIPPER. Unlike RIPPER (which learns conventional rules
and ordered rule lists), FURIA learns fuzzy rules and unordered rule sets. Handling fuzzy rule sets
makes the decisions made by FURIA more general and less abrupt compared to conventional
(non-fuzzy) models, which have sharp decision boundaries and abrupt transitions between classes. In
order to deal with the systematic bias (of conventional models) in favor of the class that is to be
predicted, FURIA works on the principle of separating one class from all the others.
The problems introduced by the fuzzy approach are well anticipated by FURIA. FURIA
resolves the conflict that may arise when an instance is equally well covered by rules from
different classes, although this conflict is unlikely to arise. Cases in which an instance is not covered
by any rule are handled by a rule stretching method, which generalizes the existing rules until they
cover the uncovered examples.
3.2 Comparison Framework
To compare the performance of multiple algorithms over multiple data sets, the use of statistical
measures is common. Researchers propose the use of the Friedman test followed by the corre-
sponding post-hoc Nemenyi test (Lessmann et al., 2008, Jiang et al., 2008a, Demsar, 2006). The
Friedman test is a non-parametric counterpart of ANOVA and is applied to ranked data, rather
than the actual values, which makes it less susceptible to outliers. ANOVA
was specifically designed to test mean accuracies (significance of difference between multiple
means) across different data sets. However, the Friedman test is preferred over ANOVA (Less-
mann et al., 2008, Demsar, 2006). The procedure of performing the test is similar to that of other
hypothesis tests. The hypothesis being tested in this setting is:
H0: The performance of each pair of defect prediction models has no significant difference.
vs.
H1: At least two models have significantly different performance.
All models are ranked according to their AUC values for each data set, with the best-performing model receiving rank 1. The
mean rank, ARi, is calculated for each model i over all data sets. The test statistic of the Friedman
test is calculated as:
χ²_F = [12K / (L(L+1))] [ Σ_{i=1}^{L} AR_i² − L(L+1)²/4 ],   (3.3)

AR_i = (1/K) Σ_{j=1}^{K} r_i^j,   (3.4)
where K is the total number of data sets, L is the total number of classifiers, and r_i^j is the rank
of classifier i on data set j.
However, the χ²_F statistic is quite conservative (Demsar, 2006); therefore the following statistic
is preferred:

F_F = (K−1) χ²_F / (K(L−1) − χ²_F).   (3.5)
The test statistic is distributed according to the F-distribution with L−1 and (L−1)(K−1)
degrees of freedom. In our experiments, the measure of interest is the mean AUC estimated over
10-fold cross validation, using a 95% confidence level (α = 0.05) as the threshold to judge
significance. Using these values, the critical value of the test is obtained. If the calculated test
statistic is greater than the critical value, H0 is rejected, which implies that the performance
difference between at least two classifiers is significant.
If H0 is rejected, the post-hoc Nemenyi test is applied to detect which specific pair of classifiers
differ significantly. For every two models, it tests the null hypothesis that their mean ranks are
statistically the same. For this, the critical difference (CD) is calculated. The null hypothesis is
rejected if the difference between the mean ranks of the models exceeds CD, which is calculated
as:

CD = q_{α,∞,L} √( L(L+1) / (6K) ).   (3.6)
The value q_{α,∞,L} is based on the Studentized range statistic (Lessmann et al., 2008).
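Equations (3.3) to (3.6) can be checked numerically for this study's setting (L = 5 classifiers, K = 12 data sets, mean ranks taken from Table 3.1). The sketch below reproduces the values derived in Section 3.5.

```python
import math

K, L = 12, 5
# Mean ranks AR_i from Table 3.1: NB, Rnd For, CHIRP, DTNB, FURIA
AR = [2.21, 1.42, 4.50, 3.00, 3.88]

# Friedman statistic, Eq. (3.3)
chi2_F = (12 * K) / (L * (L + 1)) * (sum(a * a for a in AR)
                                     - L * (L + 1) ** 2 / 4)
# F-distributed variant, Eq. (3.5)
F_F = (K - 1) * chi2_F / (K * (L - 1) - chi2_F)

# Nemenyi critical difference, Eq. (3.6), with q_{0.05, inf, 5} = 2.73
q = 2.73
CD = q * math.sqrt(L * (L + 1) / (6 * K))

print(round(chi2_F, 2), round(F_F, 2), round(CD, 2))   # → 29.78 17.98 1.76
```

The printed values match the hand calculations in the analysis section, confirming that the flattened equations above were reconstructed consistently.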
Tab. 3.1: Results of Classifiers over Selected Data Sets in Terms of AUC
CM1 JM1 KC1 KC3 MC1 MC2 MW1 PC1 PC2 PC3 PC4 PC5 AR
NB 0.57 0.64 0.70 0.56 0.72 0.76 0.79 0.80 0.55 0.71 0.90 0.71 2.21
Rnd For 0.76 0.78 0.74 0.63 0.75 0.65 0.86 0.78 0.77 0.83 0.95 0.78 1.42
CHIRP 0.53 0.55 0.56 0.49 0.50 0.69 0.65 0.52 0.50 0.50 0.63 0.62 4.50
DTNB 0.35 0.64 0.68 0.57 0.66 0.72 0.75 0.76 0.85 0.63 0.78 0.68 3.00
FURIA 0.64 0.60 0.58 0.60 0.50 0.58 0.57 0.64 0.49 0.50 0.83 0.68 3.88
3.3 Experiment
The goal of this experiment is to compare the performance of the five classifiers described
in Section 3.1. The performance of each classifier (in terms of AUC) is obtained on a test set that is a
randomized version of the actual data set. This process is repeated for each data set and is also
called the split-sample setup. As mentioned earlier, the complete procedure in this section has been
adopted from (Lessmann et al., 2008).
All data sets are randomized and partitioned into 2/3 for training sets (model building) and 1/3
for test sets (performance estimation). The merits of this scheme are that it provides an objective
estimate of a model’s generalized performance and it enables easy replication. Also, it is used
extensively in defect prediction studies as reported in (Lessmann et al., 2008).
Tab. 3.2: Mean AUC and Std. Dev. Over the Complete Range of Tuning Parameters
NB Rnd For CHIRP DTNB FURIA
CM1 0.568 ± 0.000 0.747 ± 0.024 0.531 ± 0.000 0.482 ± 0.076 0.583 ± 0.037
JM1 0.640 ± 0.000 0.700 ± 0.037 0.550 ± 0.003 0.638 ± 0.010 0.583 ± 0.013
KC1 0.695 ± 0.000 0.728 ± 0.018 0.562 ± 0.002 0.688 ± 0.005 0.604 ± 0.017
KC3 0.555 ± 0.000 0.633 ± 0.033 0.502 ± 0.017 0.577 ± 0.020 0.582 ± 0.045
MC1 0.717 ± 0.000 0.738 ± 0.025 0.500 ± 0.000 0.648 ± 0.016 0.504 ± 0.017
MC2 0.762 ± 0.000 0.655 ± 0.012 0.635 ± 0.036 0.722 ± 0.002 0.630 ± 0.029
MW1 0.786 ± 0.000 0.822 ± 0.043 0.646 ± 0.003 0.761 ± 0.011 0.751 ± 0.073
PC1 0.799 ± 0.000 0.783 ± 0.032 0.529 ± 0.012 0.715 ± 0.045 0.620 ± 0.040
PC2 0.550 ± 0.000 0.733 ± 0.065 0.500 ± 0.000 0.609 ± 0.142 0.490 ± 0.004
PC3 0.707 ± 0.000 0.821 ± 0.024 0.505 ± 0.002 0.594 ± 0.039 0.536 ± 0.043
PC4 0.897 ± 0.000 0.938 ± 0.017 0.614 ± 0.013 0.689 ± 0.049 0.830 ± 0.023
PC5 0.713 ± 0.000 0.773 ± 0.014 0.621 ± 0.006 0.713 ± 0.021 0.675 ± 0.017
Except for NB, all the other models have tuning parameters, also known as hyper-parameters.
These hyper-parameters enable a model to be adapted to a particular problem. Since each data
set represents a different problem, each of the models with hyper-parameters has to be tuned for
each particular data set to acquire a characteristic assessment of that classifier's performance. To
this end, a grid-search technique is used in the classifier selection step: for each tuning parameter,
a set of candidate values is selected. Each model is empirically evaluated over all possible
combinations of its hyper-parameters using 10-fold cross validation on its training data, i.e. the
data is split into 10 partitions and 10 iterations are performed in which each partition serves as the
test set once while the other 9 partitions are used as the training set. This is a common scheme in
statistical learning. Since AUC is used for classifier comparison, it is also used to guide the selection of the
tuning parameters during the search. Thus, the model that achieves the maximal performance with
a particular combination of hyper-parameters is used on the test data, and the results are recorded
and reported.
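The grid-search procedure above can be sketched as follows. The `evaluate_auc` scorer is a hypothetical stand-in for training a classifier and measuring AUC on one cross-validation fold; the grid mirrors the FURIA candidate values only as an example.

```python
from itertools import product

def grid_search(param_grid, evaluate_auc, folds=10):
    """Evaluate every hyper-parameter combination by mean AUC over
    the cross-validation folds; return the best combination."""
    names = list(param_grid)
    best_params, best_auc = None, -1.0
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        auc = sum(evaluate_auc(params, f) for f in range(folds)) / folds
        if auc > best_auc:
            best_params, best_auc = params, auc
    return best_params, best_auc

# Hypothetical scorer: pretends performance peaks at folds=10, minNo=2
def fake_auc(params, fold):
    return 0.7 - 0.01 * abs(params["folds"] - 10) \
               - 0.005 * (params["minNo"] - 2)

grid = {"folds": [3, 5, 10, 15, 20], "minNo": [2, 10, 15]}
best, auc = grid_search(grid, fake_auc)
print(best)   # the combination with the highest mean AUC
```

In the actual experiment, `evaluate_auc` would train the WEKA classifier with the given parameters on 9 partitions and compute AUC on the held-out partition.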
As mentioned above, NB does not require any parameter setting. The parameter settings for
Rnd For have been taken from (Lessmann et al., 2008), where settings for two parameters were
considered. For the number of trees, the values [10, 50, 100, 250, 500, 1000] were assessed, and for
the number of randomly selected attributes per tree the values [0.5, 1, 2]·√M were assessed, where M
is the number of attributes of the data set. For CHIRP, the default value numVoters = 7 in WEKA is
the same as the one mentioned in (Wilkinson et al., 2011) for the corresponding parameter m. How-
ever, we tuned the model over the values [7, 10, 15] to capture performance improvements. Similarly,
parameter tuning was performed for DTNB and FURIA even though a default parameter setting was
mentioned for both. DTNB has one tuning parameter, crossVal, whose default is leave-one-out
cross-validation (crossVal = 1); we use the values [1, 3, 5, 10, 15]. For
FURIA the parameters and the respective candidate values used are: folds = [3, 5, 10, 15, 20],
minNo = [2, 10, 15] and optimizations = [2, 5, 10, 15]. All algorithms were run using
version 3.7.5 of WEKA.
3.4 Results
In this section we present and analyze the results obtained from the experiment described earlier.
Table 3.1 shows the performances obtained in terms of AUC of each of the classifiers on each of
the data sets. The last column displays the mean rank ARi of each classifier over all data sets. The
mean rank is used in conducting the Friedman test. For each data set, the best AUC value obtained
over all classifiers is highlighted in boldface.
To get a better understanding of the overall performance of the models, the mean and standard
deviation of their performances in terms of AUC are reported, across all values of hyper-parameter
combinations, in Table 3.2. By comparing the two tables, we can see if there are any outliers
in Table 3.1 and we can also observe the amount of performance variations for each model over
the selected hyper-parameter settings. However, for models with no hyper-parameters (like NB),
this check on performance reporting cannot be applied. But in our case, we have a better idea
of the performance of NB on similar data sets from previous studies. Based on those studies,
we can safely assume that the values obtained here are not outliers. Another problem with this
approach is that models with a larger set of candidate values over all hyper-parameters intuitively
give more accurate estimates of the mean and standard deviation, so we cannot get objective results
unless the overall number of candidates is held constant for all models. The latter may not be
desirable, as different models require different numbers of candidate values per hyper-parameter to
capture performance gains. A better way is to perform the experiment more than once using a
different randomized version of the original data sets for each run, take the mean of performances
over all runs, and report the best mean and its standard deviation for each model on each data set.
But this is more time consuming and not ideal under resource constraints.
From Table 3.1 it is evident that the classifiers mostly achieve AUC results above 0.5. Only one
AUC value falls far below 0.5, that of DTNB for data set CM1; on average, however, DTNB has
values above 0.6. Apart from this, only a couple of AUC values of FURIA and CHIRP are 0.5 or below.
PC4 shows the overall best performance of the models for a data set (with an average AUC of 0.82
over all models). If we consider the sizes of the data sets, it becomes evident that size does not
appear to impact the accuracy of the models directly. This can be observed by comparing the
performances on large data sets, such as JM1 (7782 modules), with those on smaller data sets such
as CM1 (327 modules) and MC2 (125 modules). It can be seen that there is no pattern suggesting
that models perform better or worse on large or small data sets. The same is the case with the
number of attributes. In terms of efficiency, NB performed quite
well, since it did not require any parameter adjustment. With higher values of tuning parameters,
Rnd For, CHIRP and DTNB were not as efficient. CHIRP’s parameter tuning did not seem to
produce a significant difference in performance in terms of AUC. This is also evident from Table
3.2 where the standard deviation of AUC for CHIRP over most data sets does not exceed 0.01. In
fact, it is the only model that has standard deviation for some data sets equal to 0. Another point to
be noted for CHIRP is that it was mentioned in (Wilkinson et al., 2011) that increasing the value of
m (numVoters) will increase the accuracy of the model. But that is not the case.

Fig. 3.2: Nemenyi's Critical Difference Diagram for evaluation using AUC

An initial check with m = 7 and m = 20 was done on randomized subsets of data sets CM1 and PC5. For CM1,
the claim holds true, with AUC = 0.52 and AUC = 0.55 for m = 7 and m = 20, respectively.
Also, since it is the same data set and the same model with only one variable (the tuning parameter)
changing, the percentage of correctly identified instances is a reasonable measure for checking accuracy.
With m = 7 and m = 20 this percentage is 84.72 and 85.19, confirming our previous result. But
with PC5 we have, for m = 7 and m = 20, AUC = 0.52 and AUC = 0.55 and percentages 76.88
and 76.79, respectively. This shows that with a greater value of m, performance in terms of correctly
identified instances actually decreased (even if the difference may not be significant).
3.5 Analysis and Discussion
To compare the performances of each classifier and rank them in order of overall performance, it is
important to check whether there is a significant statistical difference among their AUC values. The
test statistics are calculated using Table 3.1. The Friedman test tests whether there is a difference
in the performance among the 5 classification models over the 12 data sets. For this experiment,
there are L−1 = 5−1 = 4 and (L−1)(K−1) = (5−1)(12−1) = 44 degrees of freedom. With α = 0.05
the critical value of F-Distribution is 2.58. For the more conservative Chi Square distribution, the
critical value at α = 0.05 is 9.49. The test statistics are calculated, using (3.3) and (3.5), as follows:
χ²_F = [ (12 × 12) / (5(5+1)) ] × [ 2.21² + 1.42² + 4.50² + 3.00² + 3.88² − 5(5+1)²/4 ] = 29.78,

F_F = (12 − 1) × 29.78 / (12(5 − 1) − 29.78) = 17.98.
Since 17.98 > 2.58, and even for the more conservative test 29.78 > 9.49, the null
hypothesis (H0), which states that the performances of these 5 models over the 12 data sets are
statistically equal, is rejected. Consequently, we proceed to Nemenyi's post-hoc test with α = 0.05,
i.e. we perform pair-wise comparisons for every two models and check whether the difference
in their performances surpasses the critical difference. The critical value q_{α,∞,L} is 2.73 (Demsar,
2006). The critical difference is calculated, using (3.6), as:

CD = 2.73 × √( 5(5+1) / (6 × 12) ) = 1.76.
The results of the pair-wise comparison are depicted in Figure 3.2, utilizing a revised version
of the significance diagrams introduced in (Demsar, 2006). All classifiers are plotted on the y-axis
and their corresponding average ranks on the x-axis, along with a line whose length is
equal to the critical difference CD. The right end of the line shows which models, whose mean
ranks lie further to the right, are significantly outperformed by the particular classifier. Thus,
all classifiers whose lines do not overlap in this plot perform significantly differently. Considering this,
we can see that even though Rnd For performs better than all the other algorithms, its performance is
not significantly different from that of NB and DTNB. CHIRP performs the worst and its performance
is significantly different from all other models except FURIA and DTNB. Based on these results,
we do not recommend CHIRP for defect prediction. FURIA's performance was significantly lower
than that of Rnd For, but it statistically performs as well as NB. So this model may perform better
than other models used in defect prediction studies, like CART, RBF net and K-Star. However,
DTNB gives respectable results and, based on the critical difference, performs as well as NB and
Rnd For. Furthermore, the results observed in this empirical evaluation confirm the conclusions of
previous works regarding the accuracy of Rnd For and NB for defect prediction using static code
features.
Comparing Table 3.1 and Table 3.2, we can see that the lowest value in Table 3.1 (DTNB's
performance for CM1) is an outlier, since this AUC value is more than a couple of standard
deviations below the mean. However, even in Table 3.2, this model does not perform better than
any other model for CM1. In fact, the ranking for Table 3.2 is the same as that for Table 3.1, which
means that similar test statistics would be obtained if they were applied to Table 3.2.
This observation further strengthens confidence in the results and findings of this empirical study.
It should be noted that Nemenyi’s statistics tests the null hypothesis that performance of two
models does not differ (same has been reported earlier (Lessmann et al., 2008)). But if H0 is not
rejected, it does not guarantee that the null hypothesis is true. For example, Nemenyi’s test is
unable to reject H0 for Rnd For and DTNB i.e. the hypothesis that they have the same mean ranks.
So the difference in their performances may only be due to chance. But this difference can also
occur because of Type II error i.e. that there is a significant difference between the two models but
the test is unable to detect it with α = 0.05. This implies that rejecting the null hypothesis only
means that there is a high probability that two models differ significantly (where this probability
= 1− α).
Considering this, an overall conclusion that can be reached from this experiment is that predictive
performance alone is not sufficient to assess the worth of a classifier and has to be supple-
mented by other criteria. This argument is also supported by Mende et al. (Mende and Koschke,
2009), whose tests show that the average performance in terms of AUC of a trivial model turned
out to be the same as that of NB. So other criteria, like computational efficiency and transparency,
should also be considered.
So far we have evaluated the performance of 5 different classifiers based on their average AUC
values over 12 data sets, which were cleaned versions of data sets from the PROMISE repository
(Menzies et al., 2012). The significance of the results is measured using statistical tests. Two of
the classifiers, NB and Rnd For, are used as base models since their performance stands out in a
number of other defect prediction studies. The other three are models proposed in the years
preceding this study: DTNB (2008), FURIA (2009) and CHIRP (2011).
We evaluated whether these new models are useful for defect prediction studies by comparing
their performances with those of the baseline models. It turns out that only one model, DTNB,
gives a reasonable performance. CHIRP, due to its low accuracy, is not recommended for use
in defect prediction studies, and the performance of FURIA is also questionable. We have also
provided comprehensive details of the characteristics of the data sets used.
Defect prediction from static code metrics is a fairly recent and active area of research and quite
a lot of improvements can be made in classifying faulty modules based on static code metrics. One
of these includes the comparison of performance between different classifiers for predicting de-
fects in software systems. This constitutes one of the least studied areas in empirical software
engineering (Mende and Koschke, 2009). Studies like these will help in improving the understand-
ing of existing models and will also help incorporate new and novel classification models for this
problem domain.
3.5.1 Threats to Validity
This section focuses on some threats to the validity of the comparative study presented in this chapter.
Problems related to data set characteristics have already been given in an earlier section. Another
apparent problem regarding the data sets is that this study covers a limited number of data sets from
only one source. There is a debate about how representative these data sets are (Lessmann
et al., 2008, Jiang et al., 2008a, Mende and Koschke, 2009, Menzies et al., 2007). Lessmann et al.
(Lessmann et al., 2008) favor the use of these data sets and refer to several other studies that have
argued in favor of the selected data sets in terms of their generic characteristics and suitability for
software defect prediction. On the other hand, Mende et al. (Mende and Koschke, 2009) take
the opposing stance that these data sets may not be representative, and so the results of their study
cannot necessarily be generalized.
These data sets might require pre-processing (Gray et al., 2011), depending on the models used
and on problems related to, or particular characteristics of, the data sets. These
problems have not been empirically investigated for the new models in this study. As mentioned
in (Lessmann et al., 2008), classification is only one of the steps within a multistage data mining
process. Other steps can increase the performance of certain classifiers. Keeping this in mind, pre-
processing of the data sets for CHIRP might have improved the accuracy of the results for this model
(Wilkinson et al., 2011). This applies to the other models as well, including NB and Rnd For.
Further and more extensive parameter tweaking for each of the new models may produce better
results. However, a set of representative candidate values for each hyper-parameter needs to be
identified first. This may require exhaustive testing of each model on data sets similar to the ones
used for defect prediction. Tuning of parameters affects both the accuracy and the efficiency of these
models.
Another possible problem is related to the sampling procedure, which might bias the results (Less-
mann et al., 2008, Menzies et al., 2007). However, in response to this problem, Lessmann et al.
(Lessmann et al., 2008) state that the split-sample setup is an acknowledged and common ap-
proach for classification experiments. Also, the size of the chosen data sets appears sufficiently
large to substantiate this setting.
3.6 Summary
This chapter described the process followed for the comparative study of existing models and
presented the results of the study. The datasets used are from the PROMISE repository and are
described in Chapter 2. From those datasets, 12 have been used to compare the performance of
defect prediction models, following a similar comparative study (Lessmann et al., 2008), so that the
performance of the new models could be compared with that of the older ones on common datasets.
The five models compared on these datasets are the Naive Bayes classifier (NB), Random
Forests (Rnd For), CHIRP, DTNB, and FURIA. Based on the statistical ranking of the 5 models, NB
and Rnd For emerge as the winners. This study further selects one of them to test the preprocessing
approach proposed in Chapter 4. Since the data being used has quality issues, like duplicate data
points, the tree-based Rnd For is considered prone to over-fitting. Secondly, the proposed
preprocessing introduces missing values into the data, and the mechanism used by NB to handle
missing values is simpler than that used by Rnd For. Therefore, this study selects NB for the
further steps.
4. INCREASING RECALL IN SOFTWARE DEFECT PREDICTION
Publicly available defect prediction data is dominated by Not-Defect-prone (ND) modules. In
such scenarios, identifying the software metric values that associate with the scarce Defect-
prone (D) modules is a challenging task. The non-availability of a sufficient number of D modules (as
training examples) is one of the reasons that models do not achieve high Recall. This chapter
presents an association mining based approach that allows models to learn defect-prone mod-
ules in imbalanced defect datasets. As shown in Figure 4.1, the datasets are preprocessed using
the proposed approach, a defect prediction model is developed using the preprocessed data, and a
performance analysis of the model is carried out with Recall as the performance measure. The
preprocessing step partitions the data and finds important itemsets such that the prediction of D mod-
ules can be improved. Afterwards, the preprocessed datasets are used for model development.
As discussed in the previous chapter, the Naive Bayes (NB) classifier is used as a test case to
evaluate the proposed preprocessing. The performance of the NB classifier is analyzed as a next step.
The significance of Recall for performance analysis is highlighted through a questionnaire distributed
in the software industry. The results show that the algorithm either improved the Recall of the NB
classifier (by up to 40%) or left it unchanged.

Fig. 4.1: Major Steps of Our Methodology (Preprocessing, Model Development, Performance Analysis, Performance Measure)

Fig. 4.2: Preprocessing Step (Equi-Frequency Binning; Horizontal Partitioning, DS = Pt ∪ Pf; Generation of Frequent Itemsets; Itemset Support Calculation; Selection of Focused and Indifferent Itemsets; ranges having strong/weak association with defects; Set as 'Missing' Value in Partition Pf; Un-Partition Data)
4.1 Proposed Preprocessing
In the preprocessing step we modify the dataset before development of the NB based classifier.
Activities in the preprocessing step are shown in Figure 4.2. We have used the datasets discussed
in Chapter 2 and discretized the inputs through Equi-Frequency Binning. After the discretization
step, we have partitioned (Wang et al., 2005) the data into two parts: partition Pt includes the
defect-prone instances whereas partition Pf consists of the not-defect-prone instances. Afterwards,
frequent itemsets are generated in each partition using the Apriori algorithm (Jiawei and Micheline,
2002), and the support of each itemset is calculated in the respective partition to gauge its usefulness
and strength. Support of an itemset is defined as the number of instances containing the itemset
divided by the total number of instances. We generate only those itemsets that show an association
between software metrics and D modules, i.e. the itemsets that can be used to further generate
Class Association Rules (CARs) (Liu et al., 1998). CARs are a special type of association rules
with the class attribute as consequent. In the next step, the important itemsets (called focused
itemsets) are selected from partition Pt and their values are set as missing in partition Pf . Both
partitions of the data are combined before the defect prediction model is developed.
Algorithm 1 Focused Itemsets based Approach to Preprocess Data
Require: Dataset, n, αD, αND, β, τ1,τ2
Ensure: Datasetm
1: Data = discretize(Dataset, n)
2: [dataDdataND] = partitionClassWise(Data)
3: for each class c ∈ {D,ND} do
4: FrequentItemsetsc = generateFIusingApriori(datac, αc)
5: SortedItemsetsc = sort(FrequentItemsetsc, DescSupport)
6: end for
7: for each itemset i ∈ {SortedItemsetsD ∩ SortedItemsetsND} do
8: if Supporti > τ1 in SortedItemsetsD AND Supporti > τ1 in SortedItemsetsND then
9: if |Supporti in SortedItemsetsND - Supporti in SortedItemsetsD| < τ2 then
10: Mark the corresponding attribute as Indifferent
11: else
12: Mark the corresponding itemset as Focused
13: end if
14: end if
15: end for
16: for each itemset i ∈ SortedItemsetsD do
17: if Supporti > β then
18: dataModifiedND = setMissing(dataND, i)
19: end if
20: end for
21: Return Datasetm = [dataD dataModifiedND]
The proposed algorithm identifies associations between software metrics and Defect-prone (D)
modules. The metric values strongly associated with D modules (called Focused Itemsets) are
identified using Algorithm 1. Algorithm 1 takes 7 inputs and returns the preprocessed data. The
7 inputs are: the Dataset to be preprocessed; the number n of bins each attribute will be divided
into; αD and αND as minimum support values to generate frequent itemsets in partitions D and
ND respectively; a threshold value β to decide whether a focused itemset should be set missing;
and thresholds τ1 and τ2 that help in marking attributes as indifferent attributes, i.e. the attributes
that do not significantly help in the classification of D modules. The algorithm returns a modified
version of the input data in which the focused itemsets appearing with ND modules are replaced
with missing values. This is done to prevent the prediction model from learning ND modules
associated with the metric values corresponding to the focused itemsets. The rest of the section
describes the preprocessing step in detail.
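The control flow of Algorithm 1 can be sketched in Python for 1-itemsets. This is our own minimal, hypothetical rendition, not the thesis implementation: the helper names (`support_counts`, `preprocess`) and the data shapes are assumptions, and the sketch marks individual itemsets rather than whole attributes as indifferent, following Eqs. 4.2 and 4.3.

```python
def support_counts(rows):
    """Per-partition support (%) of every 1-itemset, i.e. (attribute, bin) pair."""
    counts = {}
    for row in rows:
        for item in row.items():
            counts[item] = counts.get(item, 0) + 1
    return {item: 100.0 * c / len(rows) for item, c in counts.items()}

def preprocess(data, alpha, beta, tau1, tau2):
    """Sketch of Algorithm 1 restricted to 1-itemsets.

    data: list of (attributes, label) pairs, where attributes maps a metric
    name to a discretized bin label and label is 'D' or 'ND'. alpha maps each
    class to its minimum support; beta, tau1, tau2 are the thresholds.
    """
    part = {c: [attrs for attrs, y in data if y == c] for c in ("D", "ND")}
    # Frequent 1-itemsets per class (minimum support filter).
    sup = {c: {i: s for i, s in support_counts(part[c]).items() if s >= alpha[c]}
           for c in ("D", "ND")}
    # Indifferent itemsets: frequent in both partitions with similar support.
    indifferent = {i for i in set(sup["D"]) & set(sup["ND"])
                   if sup["D"][i] > tau1 and sup["ND"][i] > tau1
                   and abs(sup["ND"][i] - sup["D"][i]) < tau2}
    focused = set(sup["D"]) - indifferent
    # Set focused itemsets with support > beta as 'missing' in the ND partition.
    modified_nd = []
    for row in part["ND"]:
        row = dict(row)
        for attr, val in list(row.items()):
            if (attr, val) in focused and sup["D"][(attr, val)] > beta:
                row[attr] = None
        modified_nd.append(row)
    return [(r, "D") for r in part["D"]] + [(r, "ND") for r in modified_nd]
```

On a toy dataset where the bin "hi" of loc co-occurs mostly with D modules, the function blanks that bin out of the ND partition only, leaving the D partition untouched.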
4.1.1 Discretization
The proposed approach needs to find associations between defect-prone modules and certain ranges
of software metric values. Such associations can be found if the attributes are categorical. However,
most of the metrics in the datasets used are quantitative in nature. Therefore, the data needs to
be discretized. We discretize the data by dividing each quantitative attribute into equi-frequency
bins (Jiawei and Micheline, 2002). Essentially, this step divides each attribute into intervals of
metric values.
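As a minimal illustration (our own sketch, not the tool actually used for the experiments), equi-frequency binning can be implemented by ranking the values and cutting the ranking into n equally sized groups:

```python
def equi_freq_bins(values, n):
    """Assign each value a bin index 0..n-1 such that the bins hold
    (roughly) equal numbers of values, i.e. equal-frequency discretization."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    size = len(values) / n
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / size), n - 1)
    return bins
```

For example, `equi_freq_bins([10, 2, 7, 5, 1, 9], 3)` puts two values in each of the three bins regardless of how the raw metric values are spaced, which is the point of equal-frequency (as opposed to equal-width) binning.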
4.1.2 Horizontal Partitioning
The partitioning step horizontally divides a dataset into two parts, Pt and Pf . Partition Pt has
the instances of D modules and partition Pf contains the instances of ND modules. During
this step, each dataset DS is divided into two subsets such that |DS| = |Pt| + |Pf |, DS = Pt ∪ Pf ,
and Pt ∩ Pf = ∅.
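A direct sketch of this step (the field name `defective` is a hypothetical stand-in for the class attribute), with the stated size invariant checked explicitly:

```python
def partition(ds):
    """Horizontally split dataset DS into Pt (defect-prone) and Pf
    (not defect-prone), so that DS = Pt ∪ Pf and Pt ∩ Pf = ∅."""
    pt = [m for m in ds if m["defective"]]
    pf = [m for m in ds if not m["defective"]]
    assert len(ds) == len(pt) + len(pf)   # |DS| = |Pt| + |Pf|
    return pt, pf
```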
4.1.3 Generating Frequent Itemsets
We have applied the Apriori algorithm (Jiawei and Micheline, 2002) on each partition to find
frequent itemsets. An itemset is frequent if it occurs often enough in a partition to satisfy a
minimum Support threshold. As mentioned earlier, an itemset in our case is an interval, so this
step finds those intervals of all metrics that co-occur frequently either with D modules in partition
Pt or with ND modules in partition Pf .
Our approach focuses on special itemsets that do not include the class attribute. Normally, the
Apriori algorithm does not distinguish between class and non-class attributes and uses the
discretized values of a class attribute to generate frequent itemsets; a class attribute is included in
a frequent itemset if it is associated with another item in the data. Our approach requires the
Apriori algorithm to generate only those itemsets which are individually or collectively associated
with the class attribute (i.e. Defects).
Support count Counti of an itemset iseti is the frequency of its occurrence in a partition. If
m is the total number of independent attributes and n is the number of intervals for each attribute,
Support of iseti in a partition Pj is calculated as follows:

Supporti = (Counti / |Pj|) × 100 (4.1)

where i ≤ m × n and j ∈ {t, f}. By convention, the value of Supporti varies from 0% to 100%
instead of 0 to 1.0 (Jiawei and Micheline, 2002). It is pertinent to mention that in each partition,
the itemsets can have different support values as compared to their support in the whole dataset.
Also the itemsets can have different lengths. An Itemset with one item is called 1-Itemset and an
Itemset with k items is known as k-Itemset. A 1-Itemset is essentially one interval from the range
of values an attribute can have (Jiawei and Micheline, 2002).
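Equation 4.1 translates directly into code. The helper below is our own sketch with assumed data shapes (rows represented as attribute-to-bin dictionaries); it handles itemsets of any length, so it covers both 1-itemsets and k-itemsets:

```python
def support(itemset, partition):
    """Support_i = (Count_i / |P_j|) * 100, per Eq. 4.1.

    itemset: iterable of (attribute, bin) pairs; a 1-itemset has one pair.
    partition: list of attribute -> bin dictionaries (the rows of Pt or Pf).
    """
    # Count_i: rows of the partition that contain every item of the itemset.
    count = sum(all(row.get(a) == b for a, b in itemset) for row in partition)
    return 100.0 * count / len(partition)
```

With a two-row partition in which both rows carry `ev(g)=(-inf-1.2]` but only one carries `loc=(65.5-inf)`, the 1-itemset support is 100% and the 2-itemset support is 50%.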
4.1.4 Selecting Focused and Indifferent Itemsets
In order to find the critical ranges for each attribute, we need to identify itemsets in partition Pt
that satisfy a minimum support threshold αt. At the same time, a similar minimum support
threshold αf is applied on itemsets in partition Pf . The itemsets (or intervals) with Support ≥ αt
and Support ≥ αf are itemsets of interest and are called Itemsett and Itemsetf respectively.
Focused itemsets appear in Itemsett only and do not appear in Itemsetf whereas indifferent
itemsets appear in both:
Focused = Itemsett − Indifferent (4.2)
Indifferent = Itemsett ∩ Itemsetf (4.3)
Indifferent itemsets are the itemsets that appear in both partitions Pt and Pf and satisfy the αt
and αf thresholds in the respective partitions. These itemsets do not affect the classification of D
modules and can be ignored when developing a classification model with high Recall. Attributes
with only indifferent bins do not facilitate classification and can be dropped before developing a
defect classification model. Itemsets that appear in partition Pt, satisfy αt, and are not indifferent
are called focused itemsets. The attributes with focused itemsets are good indicators of defects.
These ranges of values need to be studied further in order to gain a better understanding of the
causes of defects.
Identification of focused and indifferent itemsets needs to be validated. For each dataset, we
compare the attributes having focused itemsets against a list of all attributes ranked by their
Information Gain (IG) (Jiawei and Micheline, 2002). The attributes with focused itemsets should
be ranked higher whereas the attributes with indifferent itemsets should be ranked lower by the IG
based ranking approach.
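Equations 4.2 and 4.3 amount to set algebra over the per-partition frequent itemsets. A hypothetical sketch, with supports expressed in percent and the function name our own:

```python
def focused_and_indifferent(sup_t, sup_f, alpha_t, alpha_f):
    """Itemset_t / Itemset_f are the itemsets meeting the minimum support
    in Pt / Pf; then Indifferent = Itemset_t ∩ Itemset_f (Eq. 4.3) and
    Focused = Itemset_t − Indifferent (Eq. 4.2)."""
    itemset_t = {i for i, s in sup_t.items() if s >= alpha_t}
    itemset_f = {i for i, s in sup_f.items() if s >= alpha_f}
    indifferent = itemset_t & itemset_f
    focused = itemset_t - indifferent
    return focused, indifferent
```

Using support values resembling the cm1 row of Table 4.3, `ev(g)=(-inf-1.2]` is frequent in both partitions and comes out indifferent, while `loc=(65.5-inf)` is frequent only with defects and comes out focused.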
4.1.5 Modifying Dataset
After marking the itemsets as focused and indifferent, the algorithm modifies the data such that
the classification of D modules can improve. The algorithm modifies the instances in partition Pf
and thereby affects the ability of the model to learn ND modules. All the focused itemsets that
co-occur with ND modules and have support greater than the threshold β (provided by the user)
are set to missing values. This prevents the prediction model from learning ND modules from the
instances in Pf . The same itemsets appearing in partition Pt remain unchanged. The modified
partition and the unchanged partition are combined together to form the complete dataset. The
resultant modified dataset is returned by the algorithm.
4.1.6 Time Complexity Analysis
Asymptotic time analysis of Algorithm 1 has been done to see how much time the suggested
approach will take when applied on large datasets. If a dataset has n instances, then each of the
lines 1–2 will execute once, and each method call will have a complexity of O(n). The loop at
line 3 will iterate 2 times for a binary class problem (our case) and as many times as the number
of classes in a multi-class problem. The number of classes is a constant, say nc, therefore the loop
will iterate a constant number of times. The time complexity of line 4 is a function of the number
of instances n, the number of attributes in the dataset na, and the number of items nI . In our case
nI = na × nb, where nb is the number of bins for each attribute. The time complexity of line 4 is
bounded by O(n × nI) for itemsets of length 1 (also known as 1-itemsets). For k-itemsets the
running time becomes O(n × nI^k). The sort at line 5 takes O(nI^2). The loop at line 7 iterates
nI times and each statement in the loop takes constant time, so the time complexity of this loop is
O(nI). The time complexity of the next loop (line 16) is also O(nI). Hence the time complexity
of the whole algorithm is as follows when generating k-itemsets:
Time Complexity = O(n) + O(n × nI^k) + O(nI^2) + O(nI) + O(nI)
               = O(n × nI^k)    (4.4)
When k = 1 (i.e. 1-itemsets) the time complexity is O(n × nI). Presently we focus on 1-itemsets
to improve the performance of the NB classifier. It is pertinent to mention that the defect
prediction activity is not performed frequently; it is usually performed prior to the coding and
testing phases in a project.
4.2 Developing Defect Prediction Model
Once the focused itemsets in partition Pf have been set as missing, the data is used to develop the
prediction model. The model is developed with different levels of preprocessing; for example, by
setting only one itemset as missing, then by setting another itemset as missing, and so on. There
is a range of prediction models reported in the literature, Naive Bayes being among the best
(Lessmann et al., 2008, Menzies et al., 2010). In the second phase of our process we develop
Naive Bayes and Decision Tree based models using the modified data with 10 bins for each
attribute and evaluate the models using 10-fold cross validation.
Naive Bayes (NB) classifier is a simple probabilistic classifier based on Bayes' Theorem (Jiawei
and Micheline, 2002):

P (Cj|X) = P (X|Cj)P (Cj) / P (X), Cj ∈ {D,ND} (4.5)

The NB classifier assigns a class label Cj to an unknown sample X if and only if P (Cj|X) >
P (Ck|X) for j ≠ k. This shows that the classifier works on the principle of maximizing P (Cj|X).
Using Bayes' Theorem in Equation 4.5, only P (X|Cj)P (Cj) is maximized because P (X) remains
constant for both classes. Generally, the class probabilities are assumed to be equally likely by
this classifier, and the prior probabilities of the classes are estimated as P (Cj) = sj/s, where
sj is the number of training samples belonging to class j and s is the total number of samples.
The classifier follows a smart mechanism to deal with missing values. In the training phase, the
classifier does not consider an instance having attribute(s) with missing value(s) for the calculation
of class prior probabilities. When performing classification, the attribute(s) with missing value(s)
is/are dropped from the calculations. With our proposed preprocessing approach of introducing
missing values, this elegant treatment of missing values makes NB a good candidate for selection.
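The missing-value behaviour described above can be made concrete with a small categorical NB sketch. This is our own simplified code, not the Weka implementation used in the experiments; Laplace-style smoothing is added (an assumption on our part) to avoid zero probabilities:

```python
import math
from collections import Counter, defaultdict

def nb_train(rows, labels):
    """Train a categorical Naive Bayes. Attribute values of None are treated
    as missing: fully observed rows only contribute to the class priors, and
    missing values are skipped in the per-attribute counts."""
    priors = Counter(y for row, y in zip(rows, labels)
                     if all(v is not None for v in row.values()))
    cond = defaultdict(Counter)   # (class, attribute) -> Counter of bin values
    for row, y in zip(rows, labels):
        for attr, val in row.items():
            if val is not None:
                cond[(y, attr)][val] += 1
    return priors, cond

def nb_classify(priors, cond, row):
    """Pick argmax_j P(C_j) * prod P(x|C_j); attributes missing in the test
    row are simply dropped from the product."""
    best, best_lp = None, -math.inf
    total = sum(priors.values())
    for y, ny in priors.items():
        lp = math.log(ny / total)
        for attr, val in row.items():
            if val is None:
                continue
            counts = cond[(y, attr)]
            # Laplace-style smoothing to avoid log(0) for unseen values.
            lp += math.log((counts[val] + 1) /
                           (sum(counts.values()) + len(counts) + 1))
        if lp > best_lp:
            best, best_lp = y, lp
    return best
```

In a toy training set where the bin "hi" of loc occurs only with D modules, a test module with loc = "hi" is classified D, and an ND training instance whose loc has been set missing by the preprocessing is excluded from the prior counts.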
4.3 Performance Analysis
The preprocessing should not deteriorate the performance of the model developed without
preprocessing. Performance of the developed models is evaluated in terms of Recall. We expect
that the effort to increase Recall will result in an increased FPRate as well. This requires the
proposed approach to take into account the possibility of a high FPRate and its acceptance in the
software industry (the users of the prediction model). To this end we have performed a 2-fold
evaluation of the proposed approach. First, we have collected responses from the software industry
against the questions listed in Table 4.1. Responses to these questions indicate whether the industry
accepts models with a high FPRate given the benefit of increased Recall. If our approach is to be
useful, the industry should accept high Recall at the cost of a high FPRate. Secondly, we have
empirically tested the stability of the approach using 5 public datasets. We have verified that the
Recall of the model developed with the proposed preprocessing does not fall below the Recall of
the model developed without preprocessing. Models are developed with different numbers of bins
for this purpose.
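For reference, both measures used in this analysis reduce to the four confusion-matrix counts. A small hypothetical helper, treating the defect-prone class as positive:

```python
def recall_fprate(actual, predicted, positive="D"):
    """Recall = TP/(TP+FN) and FPRate = FP/(FP+TN), with the defect-prone
    class 'D' as the positive class."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp / (tp + fn), fp / (fp + tn)
```

A model that flags extra ND modules raises FPRate without touching Recall, which is exactly the trade-off the questionnaire probes.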
Tab. 4.1: Questions asked from the Software Industry

Q1. If you have very serious time and budget constraints for testing but there are enough human
resources, which of these will be your acceptable result from a given Defect Prediction Model
(DPM)?
Options: a. High TPRate b. High FPRate c. High Accuracy d. Low Accuracy e. Low FPRate
f. Low TPRate
Rationale: To know the important performance metrics for industry.

Q2. If you have very serious time and budget constraints for testing but there are enough human
resources, which of these results will you NOT accept from a given Defect Prediction Model
(DPM)?
Options: a. High TPRate b. High FPRate c. High Accuracy d. Low Accuracy e. Low FPRate
f. Low TPRate
Rationale: To see what cannot be tolerated by the software industry.

Q3. If you have very few human resources for testing but there is enough time left in project
delivery, which of these results will you accept from a given Defect Prediction Model (DPM)?
Options: a. High TPRate and High FPRate b. Low TPRate but Low FPRate as well c. High
TPRate and Low Accuracy d. None of these
Rationale: To see if the software industry agrees to test extra modules given that defect-prone
modules are predicted better.

Q4. For you, the cost of misclassifying a Defective module as Defect-free is higher than the cost
of misclassifying a Defect-free module as Defective.
Options: a. Strongly Disagree b. Disagree c. Equal Cost d. Agree e. Strongly Agree
Rationale: To confirm that testing extra modules is acceptable if defective modules are predicted
better. The respondent should rate 4 or 5 (Agree or Strongly Agree) if Recall is more important.

Q5. For you, correct detection of a defective module is more important than the correct detection
of defect-free modules.
Options: a. Strongly Disagree b. Disagree c. Equally Important d. Agree e. Strongly Agree
Rationale: To confirm that high Recall is more important than FPRate. The respondent should
rate 4 or 5 (Agree or Strongly Agree) if Recall is more important.

Q6. What are the scenarios where the cost of misclassifying a defect-free module is more than
the cost of misclassifying a defective module?
Options: a. Never b. When time for testing is too short c. When budget constraint is too severe
d. Other
Rationale: To identify the scenarios where a high FPRate is not acceptable.
4.3.1 Identifying Performance Measure
Since an increase in FPRate is expected while improving Recall, we have consulted the software
industry to know whether a performance shortfall in FPRate is acceptable to them and whether
Recall is important to them even at the cost of an increased FPRate. We have received 30
responses from the software industry; the responses to each question are shown in Figure 4.3. The
results from the questionnaire show that the majority of industry personnel prefer Recall over the
other performance measures of defect prediction models. The results also show that project
managers can tolerate FPRate to a certain extent, as shown in Figure 4.3c.
4.3.2 Results
Results are reported in three steps. In the first step the results of preprocessing are presented, in
the second step the development of prediction models with a fixed number of bins is reported,
and in the third step the results for the Naive Bayes model are reported for different numbers of
bins. The third step presents the results on selected datasets. Results reported in this section have
been obtained using Weka (Witten et al., 2008). We have performed the same set of actions on
each dataset and have recorded the performance. As a first step, we have discretized the datasets
and divided each attribute into 10 equi-frequency bins (Jiawei and Micheline, 2002, Cios et al.,
2007). Each dataset is then partitioned into Pt and Pf as discussed in Section 4.1.2. Within each
partition we have applied the Apriori algorithm with the minimum Support values (αt, αf ) shown
in Table 4.2 to generate frequent itemsets. Table 4.2 also shows the number of itemsets of different
lengths. Afterwards, the frequency of occurrence, Supporti, of each of the 1-Itemsets is calculated.
Table 4.3 shows the top five 1-Itemsets and their Supporti in each partition of each dataset. The
itemsets in boldface are focused itemsets whereas the itemsets in italics are indifferent itemsets.
The focused itemsets present the intervals (ranges) that are highly associated with the occurrence
of defects. It is pertinent to mention that there are more focused itemsets than the ones shown in
Table 4.3. Focused and indifferent itemsets with length greater than 1 can be found in (Rana et al.,
2013).
Fig. 4.3: Questionnaire Results Showing Industry's Response Regarding Recall. Panels (a)–(f)
are charts of the percentage of responses chosen for each option of Questions 1–6. For Question 6
(panel f), 22.70% of the respondents chose 'Never', 59.10% chose 'When time for testing is too
short', 4.50% chose 'When budget constraint is too severe', and 13.60% chose 'Other'.
Tab. 4.2: Minimum Support Thresholds and Itemset Counts for Each Dataset
Dataset | Partition Pt: αt, 1-Itemsets, 2-Itemsets, 3-Itemsets | Partition Pf: αf, 1-Itemsets, 2-Itemsets, 3-Itemsets
cm1 15% 53 179 517 10% 81 240 429
jm1 15% 41 193 724 10% 98 298 902
kc1 15% 35 142 395 20% 15 91 312
kc2 20% 30 95 287 20% 16 101 376
kc3 15% 32 113 208 15% 18 97 317
pc1 20% 20 95 298 10% 82 83 77
pc3 20% 40 53 37 20% 23 125 475
pc4 20% 29 48 30 20% 22 124 443
pc5 40% 21 156 620 85% 16 108 399
mc1 15% 10 27 33 15% 17 88 299
mw1 15% 10 27 33 15% 17 88 299
ar1 10% 24 113 298 10% 18 91 273
ar4 25% 122 556 1369 15% 16 87 299
ar6 10% 122 556 1369 15% 16 87 299
Tab. 4.3: Top 5 1-Itemsets and their Supporti in each partition
Partition Pt Partition Pf
Dataset 1-Itemset Supporti 1-Itemset Supporti
cm1
locCodeAndComment
= (-inf-0.5]
97.95% locCodeAndComment
= (-inf-0.5]
99.77%
ev(g)=(-inf-1.2] 63.26% ev(g)=(-inf-1.2] 76.61%
loc=(65.5-inf) 34.69% iv(g)=(-inf-1.2] 50.77%
lOComment=(34.5-
inf)
28.57% lOCode=(-inf-0.5] 46.77%
n=(400.5-inf) 26.53% lOComment=(-inf-0.5] 34.52%
jm1
locCodeAndComment
= (-inf-0.5]
78.96% locCodeAndComment
= (-inf-0.5]
90.13%
ev(g)=(-inf-1.2] 55.17% ev(g)=(-inf-1.2] 71.71%
lOComment=(-inf-0.5] 53.32% lOComment=(-inf-0.5] 70.15%
loc=(90.5-inf) 25.83% iv(g)=(-inf-1.2] 38.96%
l=(0.005-0.035] 23.97% lOBlank=(-inf-0.5] 32.03%
kc1
locCodeAndComment
= (-inf-0.5]
92.94% locCodeAndComment
= (-inf-0.5]
94.16%
ev(g)=(-inf-1.2] 68.09% ev(g)=(-inf-1.2] 89.90%
lOComment=(-inf-0.5] 56.44% lOComment=(-inf-0.5] 80.42%
loc=(49.5-inf) 31.59% iv(g)=(-inf-1.2] 66.85%
uniq Op=(15.5-inf) 30.67% v(g)=(-inf-1.2] 64.21%
kc2
locCodeAndComment
= (-inf-0.5]
71.02% locCodeAndComment
= (-inf-0.5]
90.60%
v(g)=(-inf-1.2] 51.40% ev(g)=(-inf-1.2] 88.91%
uniq Opnd=(36-inf) 39.25% lOComment=(-inf-0.5] 72.77%
total Op=(151.5-inf) 35.14% iv(g)=’(-inf-1.2] 56.62%
loc=(95.5-inf) 34.58% v(g)=(-inf-1.2] 52.04%
kc3
locCodeAndComment=(-
inf-0.5]
65.11% locCodeAndComment=(-
inf-0.5]
95.9%
ev(g)=(-inf-2] 62.79% ev(g)=(-inf-2] 86.02%
essential density=(-
inf-0.5)
62.79% essential density=(-inf-
0.5]
86.02%
Decision Density =(1-
2.03]
55.81% Decision Count=(-inf-
1]
60.48%
HALSTEAD
LEVEL=(0.065-0.075]
16.27% Decision Density
=(-inf-1)
55.9%
pc1
ev(g)=(-inf-1.2] 62.33% locCodeAndComment
= (-inf-0.5]
76.45%
locCodeAndComment
= (-inf-0.5]
46.75% ev(g)=(-inf-1.2] 71.51%
loc=(54.5-inf) 33.76% lOComment=(-inf-0.5] 55.81%
I=(65.25-inf) 32.46% iv(G)=(-inf-1.2] 47.69%
B=(0.565-inf) 31.16% v(g)=(-inf-1.2] 30.52%
pc3
ev(g)=(-inf-1.2] 66.25% ev(g)=(-inf-1.2] 74.70%
ESSENTIAL DENSITY
= (-inf-0.025]
66.25% ESSENTIAL DENSITY
= (-inf-0.025]
74.70%
DECISION DENSITY
=(1-2.01]
49.38% locCodeAndComment
= (-inf-0.5]
69.14%
PARAMETER COUNT
=(0.5-1.5]
45.00% lOComment=(-inf-0.5] 61.15%
uniq Opnd=(44.5-inf) 31.25% PERCENT COMMENTS
=(-inf-0.25]
52.53%
pc4
ESSENTIAL DENSITY
= (-inf-0.5]
96.07% ESSENTIAL DENSITY
= (-inf-0.5]
92.5%
DECISION DENSITY
= (1-2.5]
82.02% ev(g)=(-inf-1.2] 75.94%
ev(g)=(-inf-2] 82.02% locCodeAndComment
=(-inf-0.5]
69.45%
PARAMETER COUNT
=(-inf-0.5]
63.48% CONDITION COUNT
=(-inf-2]
59.69%
CALL PAIRS=(0.5-
1.5]
34.27% DECISION COUNT=(-
inf-1]
59.69%
pc5
PARAMETER COUNT=(-
inf-0.5]
96.12% ev(g)=(-inf-2] 97.17%
n=(44.5-inf) 77.91% ESSENTIAL DENSITY=(-
inf-0.025]
97.17%
v=(201.765-inf) 77.71% iv(g)=(-inf-1.5] 95.55%
CYCLOMATIC DENSITY
=(-inf-0.245]
72.29% PARAMETER COUNT=(-
inf-0.5]
94.70%
l=(0.005-0.105] 70.93% lOComment=(-inf-0.5] 91.88%
mc1
locCodeAndComment=(-inf-0.5] 93.3% locCodeAndComment=(-
inf-0.5]
94.62%
loccomment =(9.5-13] 16.67% PARAMETER COUNT=(-
inf-0.5]
48.11%
ev(g)=(-inf-2] 70.0% ev(g)=(-inf-2] 83.33%
essential density=(-
inf-0.5)
100% essential density=(-inf-
0.5]
94.62%
halstead level=(-inf-
0.045]
16.67% DECISION COUNT=(-
inf-1]
36.29%
mw1
locCodeAndComment=(-
inf-0.5]
93.3% locCodeAndComment=(-
inf-0.5]
94.62%
loccomment =(9.5-13] 16.67% PARAMETER COUNT=(-
inf-0.5]
48.11%
ev(g)=(-inf-2] 70.0% ev(g)=(-inf-2] 83.33%
essential density=(-
inf-0.5)
100% essential density=(-inf-
0.5]
94.62%
halstead level=(-inf-
0.045]
16.67% DECISION COUNT=(-
inf-1]
36.29%
ar1
locCodeAndComment=(-
inf-0.5]
100% locCodeAndComment=(-
inf-0.5]
92.85%
blank loc=(-inf-1] 88.89% blank loc=(-inf-1] 94.64%
formal parameters=(-
inf-0.5)
66.67% formal parameter=(-
inf-0.5]
71.42%
halstead error=(-inf-
0.045]
20% multiple Condition Count=(-
inf-0.5)
50.89%
unique operand
=(8.5-10]
22.2% design Complexity=(-
inf-0.5]
46.42%
ar4
locCodeAndComment=(-inf-0.5] 93.3% locCodeAndComment=(-
inf-0.5]
94.62%
loccomment =(9.5-13] 16.67% PARAMETER COUNT=(-
inf-0.5]
48.11%
ev(g)=(-inf-2] 70.0% ev(g)=(-inf-2] 83.33%
essential density=(-
inf-0.5)
100% essential density=(-inf-
0.5]
94.62%
halstead level=(-inf-
0.045]
16.67% DECISION COUNT=(-
inf-1]
36.29%
ar6
locCodeAndComment=(-
inf-0.5]
73.3% locCodeAndComment=(-
inf-0.5]
94.18%
Blank loc=(-inf-0.5] 80% Blank loc=(-inf-0.5] 91.86%
Decision
Density=(0.635-1.065)
60% Decision Density=(-
inf-0.5]
50%
unique operands=(31.5-
inf)
20% decision density=(-inf-
0.5]
50.0%
Total loc =(53.5-inf] 20% branch Count=(-inf-1] 38.37%
In order to identify the indifferent and focused 1-Itemsets we have used the thresholds given in
Table 4.4. It can be noticed that in nearly all cases τt ≤ τf in Table 4.4. This is because there is
less data regarding defect-prone modules, hence the itemsets associated with defects have low
support. So, in order to pick a reasonable number of focused itemsets, we had to keep τt very low.
Further, we have also identified longer itemsets (i.e. itemsets with length > 1). The top 3 such
itemsets from selected datasets are shown in Table 4.5.
As a next step we have developed defect prediction models to evaluate the effectiveness of the
focused itemsets. The models developed are a J48 decision tree and a Naive Bayes (NB) model.
Both models have been developed for selected datasets. The datasets used for J48 are cm1, jm1,
pc1, kc1, and kc2. Each dataset is discretized into 10 equi-frequency bins. Further, the proposed
preprocessing has been applied on each dataset. After dropping the indifferent attributes, the
longer itemsets identified as focused are relabelled as missing one by one. The relabelling of up
to 3 itemsets has been done and an increase in Recall has been observed, as shown in Table 4.6.
These results have been partially reported in (Rana et al., 2013).

Tab. 4.4: τt and τf , used in this study, for each dataset
Dataset τt τf
cm1 25% 30%
jm1 20% 30%
kc1 25% 25%
kc2 20% 25%
kc3 38% 37%
pc1 25% 30%
pc3 20% 30%
pc4 20% 49%
pc5 55% 87%
mc1 15% 37%
mw1 17% 37%
ar1 20% 47%
ar4 25% 37%
ar6 34% 39%
Tab. 4.5: Top 3 2-Itemsets and their Supporti in each partition
Partition Pt Partition Pf
Dataset 2-Itemset Supporti 2-Itemset Supporti
CM1
ev(g)= ’(-inf-1.2]’
locCodeAndComment=’(-
inf-0.5]’ {Indifferent}
61.22 % ev(g)= ’(-inf-1.2]’
locCodeAndComment=’(-
inf-0.5]’ {Indifferent}
76.61 %
loc=‘(65.5-inf)’ loc-
CodeAndComment
=(-inf-0.5]
34.69 % iv(g)= ’(-inf-1.2]’
locCodeAndComment=’(-
inf-0.5]’
50.78 %
loc=‘ (65.5-inf)’
lOComment=‘(34.5-
inf) ’
28.57 % ev(g)= ’(-inf-1.2]’
iv(g)= ’(-inf-1.2]’
46.77 %
JM1
lOComment=’(-
inf-0.5]’
locCodeAndComment=’(-
inf-0.5]’ {Indifferent}
49.72 % lOComment=’(-
inf-0.5]’
locCodeAndComment=’(-
inf-0.5]’ {Indifferent}
66.76 %
lOBlank=’(-inf-0.5]’
locCodeAndCom-
ment =’(-inf-0.5]’
21.56 % iv(g)=’(-inf-1.2]’
locCodeAndComment=’(-
inf-0.5]’
37.65 %
e=’(48232.24-inf)’
t=’(2679.57-inf)’
21.27 % ev(g)=’(-inf-1.2]’
iv(g)=’(-inf-1.2]’
35.39 %
KC1
ev(g)=’(-inf-1.2]’
locCodeAndComment=’(-
inf-0.5]’ {Indifferent}
65.34 % ev(g)=’(-inf-1.2]’
locCodeAndComment=’(-
inf-0.5]’ {Indifferent}
86.26 %
e=’(14140.38-inf)’
t=’(775.605-inf)’
28.83 % ev(g)=’(-inf-1.2]’
iv(g)=’(-inf-1.2]’
66.74 %
n=’(147.5-inf)’
v=’(795.61-inf)’
28.53 % iv(g)=’(-inf-1.2]’
locCodeAndComment=’(-
inf-0.5]’
65.68 %
KC2
ev(g)=’(-inf-1.2]’
lOCodeAndComment=’(-
inf-0.5]’ {Indifferent}
44.86 % ev(g)=’(-inf-1.2]’
lOCodeAndComment=’(-
inf-0.5]’ {Indifferent}
83.13 %
v=’(1403.34-inf)’
uniq Opnd=’(36-inf)’
34.58 % ev(g)=’(-inf-1.2]’
lOComment=’(-inf-
0.5]’
70.84 %
uniq Opnd=’(36-inf)’
total Op=’(151.5-inf)’
34.58 % lOComment=’(-
inf-0.5]’
lOCodeAndComment=’(-
inf-0.5]’
69.40 %
In the third step the Naive Bayes classifier has been developed and its performance for different
numbers of bins has been observed. The performance of the NB model for up to 20 bins has been
observed and the trend has been studied. Details of the performance of NB for the 5 datasets, i.e.
1 dataset from each of the groups mentioned in Section 4.1, are presented in Table 4.7.

Tab. 4.6: Performance of Decision Tree Model in terms of Recall
                        KC1    KC2    JM1    CM1
No Pre Processing       0.34   0.505  0.13   0.22
Attributes Dropped      0.507  0.527  0.155  0.154
1-Itemsets Relabelled   0.553  0.587  0.258  0.449
2-Itemsets Relabelled   0.553  0.613  0.339  0.449
3-Itemsets Relabelled   0.554  0.613  0.367  0.449

Table 4.7 presents different performance parameters including Recall. The trend of Recall has
been plotted. The performance of the model does not deteriorate for up to 20 bins, as shown in
Figure 4.4, when compared with no preprocessing. It is expected that the model will show similar
results for the rest of the datasets from each group.
Tab. 4.7: Performance of NB classifier on different number of bins with and without proposed approach

PM¹       PAA²   4     6     8     10    12    14    16    18    20
AR4
Recall    No     0.70  0.70  0.55  0.60  0.65  0.55  0.70  0.55  0.70
          Yes    0.75  0.80  0.70  0.80  0.75  0.75  0.75  0.70  0.75
FP Rate   No     0.22  0.15  0.16  0.17  0.16  0.18  0.16  0.18  0.17
          Yes    0.22  0.15  0.17  0.17  0.17  0.18  0.17  0.18  0.19
CM1
Recall    No     0.65  0.63  0.63  0.63  0.57  0.59  0.59  0.57  0.51
          Yes    0.81  0.75  0.71  0.69  0.73  0.73  0.71  0.69  0.71
FP Rate   No     0.34  0.34  0.33  0.32  0.31  0.31  0.29  0.29  0.28
          Yes    0.34  0.34  0.34  0.31  0.31  0.31  0.29  0.29  0.28
KC3
Recall    No     0.86  0.84  0.84  0.84  0.84  0.81  0.84  0.81  0.79
          Yes    0.88  0.88  0.88  0.86  0.86  0.86  0.86  0.86  0.86
FP Rate   No     0.34  0.32  0.32  0.32  0.31  0.30  0.30  0.29  0.28
          Yes    0.35  0.32  0.32  0.33  0.32  0.31  0.31  0.30  0.30
MC1
Recall    No     0.82  0.82  0.82  0.82  0.82  0.82  0.82  0.82  0.82
          Yes    0.84  0.87  0.88  0.87  0.84  0.87  0.87  0.87  0.95
FP Rate   No     0.18  0.16  0.16  0.15  0.14  0.14  0.15  0.14  0.14
          Yes    0.17  0.15  0.15  0.15  0.14  0.14  0.15  0.14  0.14
PC3
Recall    No     0.73  0.74  0.74  0.73  0.74  0.73  0.73  0.73  0.74
          Yes    0.84  0.84  0.84  0.84  0.89  0.89  0.88  0.89  0.89
FP Rate   No     0.30  0.30  0.30  0.30  0.29  0.29  0.29  0.29  0.28
          Yes    0.30  0.29  0.29  0.29  0.28  0.28  0.28  0.28  0.27

¹PM = Performance Measure. ²PAA = Proposed Approach Applied.
4.4 Analysis and Discussion
For the datasets used in this study, very high ranges of the focused itemsets mainly contribute
towards defects. In most cases the intervals bounded by infinity, inf, are found to be highly
associated with defects. For the dataset jm1, the itemset l = (0.005− 0.035] is an exception, but
this itemset has a very low Supporti value and appears only when we keep αt < 23%.
The intervals with high values are the critical ranges and are very important for software
managers and researchers. If, during a software project, the values of the mentioned software
metrics fall in the critical ranges, this should raise an alarm and the project schedules, resource
plans, and testing plans should be adjusted accordingly. Further, a defect prediction model can
use the critical ranges for improved classification of defect-prone modules.
The focused itemsets do not only identify the critical ranges for each software metric, they also
indicate the software metrics that can improve the detection of defect-prone modules. Some
attributes have at least one focused itemset in all datasets, which shows the strength of those
attributes in the prediction of defects. Across five datasets, the majority vote reveals that loc, n, v,
d, i, e, b, t, lOCode, uniq Op, uniq Opnd, total Op and total Opnd consistently contribute to the
occurrence of
Fig. 4.4: Trend of Recall across five datasets. Panels (a) CM1, (b) PC3, (c) MC1, (d) KC3, and
(e) AR4 plot Recall against the number of bins (1–20) for the Recall without preprocessing and
the minimum and maximum Recall with the proposed preprocessing.
(a) CM1  (b) PC3  (c) MC1  (d) KC3  (e) AR4
[Each panel plots the percentage change in Recall against the number of bins (1 to 20).]
Fig. 4.5: Percentage change in Recall across five datasets with and without the proposed preprocessing
defects. In contrast, lOCodeAndComment, ev(g) and lOComment do not necessarily cause defects.
In addition to identifying the focused itemsets, the experiments have also identified some ranges that are neutral to the presence or absence of defects. lOCodeAndComment = (−inf, 0.5], ev(g) = (−inf, 1.2], and lOComment = (−inf, 0.5] are a few examples. lOCodeAndComment = (−inf, 0.5] appears in all datasets without exception. Supporti for this itemset is very high for both partitions, with a minimum Supporti = 46.75% for partition Dt and Supporti = 76.45% for partition Df of dataset pc1. lOCodeAndComment is a count of all lines of code including comments. The interval represented by this itemset covers very low values of this metric. This indicates that when a software module is small in size, measured in terms of lines of code and comments, this metric does not contribute to the occurrence or absence of defects. The low range of ev(g) (i.e. ev(g) = (−inf, 1.2]) also appears as an indifferent itemset in four datasets.
The trends of Recall across the five datasets are encouraging. Figure 4.5 shows that most of the time the change in Recall has been positive. The bars in the figure indicate the percentage increase or decrease in Recall when the proposed preprocessing has been used with varying bin sizes. This improvement in Recall is achieved at the cost of a higher False Positive Rate (FPRate). Figure 4.6 shows the percentage change in FPRate. The percentage increase in false positive rate has remained below 5% for all datasets except AR4, where it has been around 14%. As already discussed in Section 4.3.1, this increase in false positive rate is acceptable and can be tolerated in exchange for better Recall.
4.5 Summary
Software metrics have been investigated over the years for software defect prediction. We have studied the relationship between software product metrics and software defects. We have selected public datasets and discretized the data to study associations between software metrics and defects. From the discretized data we have generated frequent itemsets and identified the 1-itemsets strongly associated with defects. We call these 1-itemsets the focused itemsets. The 1-itemsets that are strongly associated with both the presence and absence of defects are called indifferent itemsets. We have also identified longer itemsets and their association with defect-prone (D) modules. Based on the
(a) CM1  (b) PC3  (c) MC1  (d) KC3  (e) AR4
[Each panel plots the percentage change in FPRate against the number of bins (1 to 20).]
Fig. 4.6: Percentage change in FPRate across five datasets with and without the proposed preprocessing
indifferent attributes and focused itemsets, we have proposed a preprocessing approach that has further been used to develop J48 and Naive Bayes defect prediction models. Up to 3-itemsets have been preprocessed for the J48 model. Further, the stability of the proposed approach has been studied by developing the NB model with different numbers of bins, and the resulting trend of Recall has been analyzed.
Analysis of the focused itemsets across datasets shows that the very high ranges of loc, n, v, d, i, e, b, t, lOCode, uniq_Op, uniq_Opnd, total_Op and total_Opnd consistently contribute to causing defects, whereas lOCodeAndComment, ev(g), iv(g) and lOComment do not necessarily cause defects. Besides identifying the attributes associated with defects, this study also identifies the critical ranges of these attributes. The performance of the J48 and NB models with 10 bins has either increased or remained unchanged when different numbers of itemsets are set to missing. The trend of Recall suggests that the performance of the NB model has also either improved or remained unchanged when the number of bins is increased from 2 to 20. This relabeling of bins to missing values has increased the prediction of D modules by up to 40%.
The discussion so far has been based on static code metrics. A major issue with code metrics is that they are not available in the early phases of the lifecycle and therefore cannot be effectively used for early prediction.
5. EARLY PREDICTIONS USING IMPRECISE DATA
This chapter proposes the use of a Fuzzy Inference System (FIS) for early detection of defect-prone modules. Predicting defect-prone modules before code metrics become available is desirable and useful, since it can help avoid defects later in the lifecycle. Defects caught later are more expensive to fix than defects caught earlier in the lifecycle. To enable software engineers to get defectiveness information in earlier lifecycle phases, we propose a model that works with imprecise inputs. The model can be used for early rough estimates when exact values of software measurements are not available.
5.1 Imprecise Inputs and Defect Prediction
Using code metrics to develop a prediction model gives good predictions (Menzies et al., 2007). However, code is available late in the software lifecycle. The literature suggests that the earlier the prediction, the more useful it is; for example, potential defects can be avoided and better resource plans can be generated in the early stages of development. The literature also suggests that early estimates are less accurate than estimates made later in the software lifecycle, as shown in Figure 5.1 (Pfleeger, 2010). One reason for the imprecise estimates is the lack of details about the software. In the earlier stages, imprecise/fuzzy information can be used to make rough predictions, which can be improved later in the lifecycle when exact details become available. To this end we propose the use of a Fuzzy Inference System (FIS) for early detection of defect-prone modules. The FIS works with imprecise inputs, which are design and code metrics defined in fuzzy terms. The approach adopted here provides an approximate value of the conventional prediction made at the later stages of the Software Development Life Cycle (SDLC). It has been observed that the fuzzy linguistic model reaches a comparable level of accuracy, precision and recall.
Fig. 5.1: Accuracy of estimation as the project progresses (Pfleeger, 2010).
Expressing software metrics in fuzzy linguistic terms is a workable solution because the management responsible for preparing resource plans is usually experienced enough to provide the values of software metrics in linguistic terms such as very low, low, medium, high.
This chapter presents a Fuzzy Inference System (FIS) that uses inputs represented in fuzzy linguistic terms. The fuzzy linguistic inputs have been used to generate an FIS that predicts the defect proneness of various modules of object-oriented software. We validate the model using the kc1 class-level data and the jEdit bug data (Menzies et al., 2012). First, fuzzy c-means clustering (Bezdek, 1981) is applied to the input data to determine: 1) the membership functions for each input and 2) the number of rules to be generated. A Sugeno-type FIS (Kosko, 1997) is generated afterwards. The FIS-based model proposed in this study has been compared with classification tree (CT) (Mitchell, 1997), linear regression (LR) (Bishop, 2006) and neural network (NN) (Haykin, 1994) based prediction models, so that the approximate prediction can be compared with the exact prediction to see the extent of the performance shortfall. Up to 10% shortfall in Accuracy and 0.93% shortfall in Recall have been observed using the proposed technique. The rest of the chapter describes the model and presents the results obtained for public datasets.
5.2 Proposed Model based on Imprecise Inputs
5.2.1 Dataset
The datasets used for this study are the class-level data for kc1 and the jEdit bug data, available at the PROMISE website (Menzies et al., 2012). The details about the instances and parameters of each dataset are given in Table 5.1. A cross (×) against a metric indicates that the metric from the respective dataset has been used. Each instance in a dataset represents a software module (or file). A classification variable is used as output to indicate whether the software module is defect-prone (D) or not defect-prone (ND). There are 95 attributes in the dataset kc1-class-level-data. Except for the class attribute, the remaining 94 parameters are software metrics calculated for one module and are divided into two groups. Group A has 10 parameters and Group B has 84 parameters. Values of the parameters in Group A are originally measured at module level, whereas the values of the parameters in Group B are originally measured at method level and were later transformed to module level before the dataset was made available at the PROMISE website (Menzies et al., 2012). No parameter from Group B (transformed to module level) has been selected for this study. From Group A, the 8 most commonly used parameters identified in (Menzies et al., 2003) have been selected to predict defect proneness.
5.2.2 FIS Based Model
Our FIS-based model is generated through a two-phase process. The first phase performs fuzzy c-means clustering (Bezdek, 1981) to identify the membership functions for each input, and the second phase then generates a fuzzy inference system that models the behavior of the data. Two types of fuzzy reasoning methods, the Mamdani and Sugeno reasoning methods (Kosko, 1997), can be applied in FIS implementations. Sugeno uses constant or linear output functions, whereas Mamdani uses fuzzy membership functions at the output, resulting in higher computational costs (Kosko, 1997). In the present study the Sugeno method has been applied as the reasoning process because it is computationally efficient, and the performance of this method can be further enhanced by applying other optimization and adaptive techniques.
Tab. 5.1: Metrics Used for this Study

Metric                              Abbreviation   kc1-classlevel   jEdit data
Coupling Between Object classes     CBO            ×                ×
Depth of Inheritance Tree           DIT            ×                ×
Lack of Cohesion in Methods         LCOM           ×                ×
Number Of Children                  NOC            ×                ×
Dependence on a descendant          DOC            ×
Count of calls by higher modules    FAN_IN         ×
Response For a Class                RFC            ×                ×
Weighted Methods per Class          WMC            ×                ×
Number of Public Methods            NPM                             ×
Lines of Code                       LOC                             ×
Number of Instances                                145              274
Phase 1: Performing Fuzzy C-Means Clustering
In order to generate the fuzzy rules, it is required to determine the total number of rules to be developed and the number of antecedent membership functions. Fuzzy c-means clustering (FC) (Bezdek, 1981) has been used to determine these two parameters. The total number of clusters given by FC determines the number of rules. The FC algorithm outputs n clusters C1, ..., Cn and the membership value of each antecedent in each of the n clusters.
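As an illustration of phase 1, the following is a minimal from-scratch NumPy sketch of fuzzy c-means on synthetic one-dimensional data. It is not the Matlab implementation used in this work; the data, the cluster count and the fuzzifier m = 2 are assumptions made for the example.

```python
import numpy as np

def fuzzy_c_means(x, c, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means on 1-D data: returns cluster centers and the
    membership value of every point in each of the c clusters."""
    rng = np.random.default_rng(seed)
    u = rng.random((c, x.size))
    u /= u.sum(axis=0)                         # memberships sum to 1 per point
    for _ in range(iters):
        um = u ** m                            # fuzzified memberships
        centers = (um @ x) / um.sum(axis=1)    # membership-weighted means
        d = np.abs(x[None, :] - centers[:, None]) + 1e-9  # point-center distances
        u = d ** (-2.0 / (m - 1.0))            # standard FCM membership update
        u /= u.sum(axis=0)
    return centers, u

# Two well-separated synthetic groups should yield centers near 0.1 and 10.0.
x = np.array([0.0, 0.1, 0.2, 9.8, 10.0, 10.2])
centers, u = fuzzy_c_means(x, c=2)
```

The returned memberships are exactly what phase 2 needs: their spread per cluster suggests the membership functions, and the number of clusters fixes the number of rules.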
Phase 2: Generating FIS
A Sugeno-type FIS has been generated using the Matlab fuzzy toolbox, in which the consequent y of a rule is a crisp number computed as follows:

y = Σ_{i=1}^{k} (αi xi + βi)    (5.1)

where k is the total number of antecedents (parameters predicting defect proneness), and αi and βi are coefficients which can be different for each parameter xi. The generated FIS has n rules and the jth rule takes the form:
Rulej : IF input1 in Ci AND · · · AND inputk in Ci THEN output in Ci

where k is the total number of input parameters and i = 1, ..., n. For a binary class problem, two rule sets, RuleSetD and RuleSetND, are generated for classification of the D and ND modules respectively. When a certain rule, say Rulep, is fired, the following RuleD and RuleND classify the given module into class D and ND respectively:

RuleD : IF Rulep ∈ RuleSetD THEN D
RuleND : IF Rulep ∈ RuleSetND THEN ND

where
RuleSetD = {Ruleq | Ruleq classifies the given module as D},
RuleSetND = {Ruler | Ruler classifies the given module as ND}
and
RuleSetD ∩ RuleSetND = ∅
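The inference described above can be sketched as follows. The Gaussian antecedent parameters, consequent coefficients and inputs are invented for illustration, and rule outputs are combined by the common weighted-average Sugeno aggregation, which is an assumption about the configuration rather than a statement of the exact Matlab setup.

```python
import math

def gauss(x, mean, sigma):
    """Gaussian membership degree of x in a fuzzy set."""
    return math.exp(-((x - mean) ** 2) / (2.0 * sigma ** 2))

def sugeno_predict(inputs, rules):
    """Weighted-average Sugeno inference.

    rules: list of (antecedents, consequent) pairs, where antecedents is a
    list of (mean, sigma) Gaussian parameters, one per input, and the
    consequent is (alphas, betas) giving y = sum_i alpha_i*x_i + beta_i
    as in Eq. 5.1."""
    num = den = 0.0
    for antecedents, (alphas, betas) in rules:
        # Firing strength: the product models the AND over all antecedents.
        w = 1.0
        for x, (mu, sig) in zip(inputs, antecedents):
            w *= gauss(x, mu, sig)
        # Crisp rule output per Eq. 5.1.
        y_rule = sum(a * x + b for x, a, b in zip(inputs, alphas, betas))
        num += w * y_rule
        den += w
    return num / den

# Hypothetical one-input system with two rules.
rules = [
    ([(0.0, 1.0)], ([1.0], [0.0])),  # IF x near 0 THEN y = 1*x
    ([(5.0, 1.0)], ([2.0], [0.0])),  # IF x near 5 THEN y = 2*x
]
y = sugeno_predict((1.0,), rules)    # first rule dominates, so y is near 1.0
```

A binary decision in the style of RuleD/RuleND above would then label the module D if the rule with the largest firing strength belongs to RuleSetD, and ND otherwise.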
Tab. 5.2: Evaluation Parameters Used for Comparison
Evaluation Parameter Abbreviation
True Negative Rate TNRate
True Positive Rate (Recall) TPRate (Recall)
False Positive Rate FPRate
False Negative Rate FNRate
Accuracy Acc
Precision Prec
Misclassification Rate MCRate
F-Measure F
5.2.3 Evaluation
The suggested FIS model has been compared with prediction models such as the classification tree (CT), linear regression (LR) and neural network (NN) based models. The objective is not only to identify the best model for each evaluation parameter but also to highlight the extent of the performance shortfall for each evaluation parameter if the fuzzy prediction model is used instead of the exact prediction models.

The comparison mentioned above has been done in terms of the evaluation parameters listed in Table 5.2. TNRate, TPRate (Recall), FPRate and FNRate are obtained while computing a confusion matrix (Jiawei and Micheline, 2002). Acc, Prec and MCRate are important and widely used model performance measures when a confusion matrix is available. Acc is not considered a good performance measure for unbalanced datasets (Kubat et al., 1998); therefore, to deal with the unbalanced data, the F-measure (Rijsbergen, 1979) is used. It is worthwhile to mention that performance in terms of Recall (i.e. true positive rate) is significant, as it helps in focusing on the problematic areas of the software, and thus better resource planning can be done. Since our goal is to detect the defect-prone modules, we consider Recall as our primary measure of comparison. In addition to these parameters we have determined the position of each model on the relative
operating characteristic (ROC) graph (Egan, 1975) by plotting a ROC point for each model. This visualization is helpful in identifying a better model in terms of Recall. To see the effect of using the fuzzy model, a percentage performance shortfall is calculated for each evaluation parameter. An overall performance shortfall (the maximum shortfall over all the parameters) is also recorded. This percentage is helpful in weighing the benefits of the fuzzy-based prediction against its drawbacks compared with the exact predictions.
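These evaluation parameters and the shortfall percentage follow directly from confusion-matrix counts. In the sketch below the counts are an assumption chosen so that the derived values match the kc1 FIS row of Table 5.3 under a presumed test split of 36 ND and 12 D modules; the thesis does not state the raw counts.

```python
def measures(tp, fn, tn, fp):
    """Evaluation parameters of Table 5.2 from confusion-matrix counts."""
    recall = tp / (tp + fn)                      # TPRate (Recall)
    tnrate = tn / (tn + fp)
    fprate = fp / (fp + tn)
    fnrate = fn / (fn + tp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    prec = tp / (tp + fp)
    mcrate = 1 - acc                             # misclassification rate
    f = 2 * prec * recall / (prec + recall)      # F-measure
    return {"Recall": recall, "TNRate": tnrate, "FPRate": fprate,
            "FNRate": fnrate, "Acc": acc, "Prec": prec,
            "MCRate": mcrate, "F": f}

def shortfall_pct(exact, fuzzy):
    """Percentage shortfall of the fuzzy model against an exact model."""
    return 100.0 * (exact - fuzzy) / exact

# Hypothetical counts: of 12 D and 36 ND test modules, 11 D and 24 ND correct.
fis = measures(tp=11, fn=1, tn=24, fp=12)
```

With these assumed counts the sketch reproduces Recall 0.917, Acc 0.729, Prec 0.478 and F 0.629, matching the kc1 FIS row of Table 5.3.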
5.3 Results
To develop the FIS, the membership functions for each input were assigned based on the frequency distribution of the data. The relationship between the distribution of the input metrics of the kc1-class-level data and the membership functions (or clusters) for each input software metric is shown in Figure 5.2 and Figure 5.3. There are more membership functions in Figure 5.3 for the denser areas of the corresponding input metric in Figure 5.2. A similar relationship between the metrics of the jEdit bug data and their membership functions can be seen in Figures 5.4 and 5.5. For both datasets, we have kept the shape of all the membership functions of the antecedents Gaussian and have used the information obtained from phase 1 to determine the rules that model the data behavior. Linear least squares estimation has been used to find the consequent of each rule. In total, 21 and 7 conjunctive rules were generated for the kc1-class-level and jEdit datasets respectively. The decision whether a module is D or ND is made using RuleD and RuleND respectively. For kc1-class-level, 21 clusters were obtained, with clusters 1 to 11 belonging to the ND class and clusters 12 to 21 belonging to the D class. For jEdit, 7 clusters were obtained, where instances lying in clusters 1 to 4 belong to the ND class and instances in clusters 5, 6 and 7 belong to the D class.
Three models, LR, NN and CT, have been developed for both datasets. The standard R-squared LR model was obtained using the training data. The NN model is a single-hidden-layer feed-forward back-propagation network having one neuron in the hidden layer. The NN model selected for comparison was obtained after exhaustive experimental runs. It was seen during the experiments that more complicated NN models did not perform better than the single-layer
116
−5 0 5 10 15 20 250
5
10
15
20
25
30
35
input 1
frequency
(a) CBO
0 1 2 3 4 5 6 7 80
10
20
30
40
50
60
70
input 2
frequency
(b) DIT
−0.2 0 0.2 0.4 0.6 0.8 1 1.20
50
100
150
input 5
frequency
(c) DOC
−0.5 0 0.5 1 1.5 2 2.5 3 3.50
10
20
30
40
50
60
70
input 6
frequency
(d) FAN IN
Fig. 5.2: Frequency distribution of all input metrics for kc1-classlevel data (Continued on Next Page)
(e) LCOM  (f) NOC  (g) RFC  (h) WMC
[Histograms of each input metric: frequency against metric value.]
Fig. 5.2: (Continued from Previous Page) Frequency distribution of all input metrics for kc1-classlevel data
Fig. 5.3: Output of phase 1: clusters and membership functions for each input of kc1-classlevel. (The plots of the membership functions appear in the same order as the distributions of the inputs in Figure 5.2.)
(a) CBO (b) DIT
(c) NPM (d) LOC
(e) LCOM (f) NOC
Fig. 5.4: Frequency distribution of all input metrics for jEdit bug data (Continued on Next Page)
(g) RFC (h) WMC
Fig. 5.4: (Continued from Previous Page) Frequency distribution of all input metrics for jEdit bug data
single-neuron network for the datasets under study. To train the NN on kc1-class-level, 100 epochs were used to obtain reasonable results. A larger number of epochs did not produce better results with the initial bias of 1 and the initial unit weights at each hidden layer. The NN for the jEdit data took 78 epochs for training with the same initial bias and weights. CTs with 14 and 16 intermediate nodes were developed for the kc1-class-level and jEdit data respectively.
5.4 Analysis and Discussion
Despite being the simplest of the learners, CT has performed better in the training phase. It has the best values for all the evaluation parameters (in the case of kc1), including Recall. The FIS-based model has the third best Recall during training, as shown in Figure 5.6a, whereas it has the best Recall but a higher FPRate in the testing phase, as shown in Figure 5.6b. In the case of the jEdit data, the FIS-based model performs better than the others, as shown in Figures 5.6c and 5.6d. The testing-phase performance reported in Table 5.3 indicates that the LR-based model dominates the test phase (of kc1) in terms of Acc, Prec and MCRate. The NN-based model turns out to be the best classifier for the ND modules. The FIS-based model has the second best F value. It is worthwhile to mention that the results reported in Table 5.3 are for unbalanced test data, which is dominated by ND classes. Hence the best Recall and the second best F value of the FIS-based model are reasonable for a model developed using fuzzy inputs. These values can be improved if the membership functions are further fine-tuned.

The effect of using fuzzy inputs has been measured in terms of performance shortfall. For the kc1 data, a maximum performance shortfall of 24.99% has been observed in the case of TNRate. For the
(a) CBO (b) DIT
(c) NPM (d) LOC
(e) LCOM (f) NOC
(g) RFC (h) WMC
Fig. 5.5: Output of phase 1: clusters and membership functions for each input of jEdit. (The plots of the membership functions appear in the same order as the distributions of the inputs in Figure 5.4.)
(a) kc1-class-level Training  (b) kc1-class-level Testing  (c) jEdit data Training  (d) jEdit data Testing
[Each panel plots a ROC point, FP Rate against TP Rate (Recall), for the FIS, CT, LR and NN models.]
Fig. 5.6: ROC point for each model in training and testing
Tab. 5.3: Testing Performance Measures
Dataset Model TNRate TPRate FPRate FNRate Acc. Prec. MCRate F
(Recall)
kc1- FIS 0.667 0.917 0.333 0.083 0.729 0.478 0.271 0.629
classlevel CT 0.694 0.833 0.306 0.167 0.729 0.476 0.271 0.606
LR 0.861 0.667 0.139 0.333 0.813 0.615 0.188 0.640
NN 0.889 0.500 0.111 0.500 0.792 0.600 0.208 0.546
jEdit FIS 1 0.973 0 0.027 0.979 1 0.021 0.984
Bug CT 0.7 0.877 0.3 0.123 0.839 0.914 0.161 0.895
Data LR 0.983 0.973 0.017 0.027 0.975 0.994 0.025 0.982
NN 0.95 0.982 0.05 0.018 0.975 0.986 0.025 0.984
rest of the evaluation parameters, the performance shortfalls have been 10.25%, 22.27%, 44.42% and 1.81% for Acc, Prec, MCRate and F respectively. At the same time, the FIS model provides a 10% gain in Recall when compared with the second best Recall. For the jEdit data the FIS model performed better in terms of all evaluation parameters except Recall and FNRate. The percentage shortfall in Recall is as low as 0.93%. FNRate seems to have increased by 50%, but its low value of 0.0273 indicates that the increase is not alarming. For the rest of the parameters, the FIS model has increased performance. The observed performance shortfalls are reasonably low given the approximate nature of the inputs used for the FIS prediction model. Therefore the FIS-based prediction can be used to identify possible defect-prone modules at an early stage of the SDLC. To get the exact prediction, the conventional prediction models can be used later in the SDLC when precise values are available.
The FIS model has LOC as an input metric for the jEdit data. LOC is not a design metric and depends on the programming language (Pressman, 2010). Function Points (FP) are an alternative to LOC as an estimate of software size; FP overcomes LOC's limitation of being language dependent. Although the datasets do not provide an FP value for each software module, studies have suggested a relationship between the two (Pressman, 2010). This relationship can be used to translate the LOC-based values in the datasets into FP-based values if required.
A critique of the proposed approach could be that it requires magic to get the correct linguistic labels for the software metrics early in the lifecycle. This critique is valid. However, we expect that the managers making estimates have the appropriate experience to provide good linguistic labels for the input parameters. FP-based estimations face a similar critique, that calculating FP requires sleight of hand (Pressman, 2010). However, this does not stop practitioners from making FP-based estimates.
5.5 Summary
The conventional quality prediction models require exact values of their input parameters. This chapter introduces the novel concept of obtaining an approximate prediction of defect proneness when exact input values are not available, especially during early SDLC phases. These early predictions are needed for better resource planning, cost reduction and test planning. The chapter suggests the use of fuzzy inputs to overcome the need for exact and precise input metric values. The prediction model introduced in this chapter is based on a fuzzy inference system and requires only imprecise estimates of the input software metrics. This imprecision is introduced by defining inputs as fuzzy linguistic variables. The linguistic labels can be obtained from a domain expert. The chapter shows that predictions made through vague inputs are reasonably close to predictions obtained through the application of conventional models based on exact and precise inputs. In light of the analysis conducted in this chapter, one can predict defect proneness at an early stage without requiring exact measurements of the metrics.

This study needs to be extended with validation on more datasets. Another study can be conducted to provide a systematic mechanism to help software managers suggest linguistic labels for input metrics. We also plan to investigate whether rule extraction can be improved to establish a causal relationship between software metrics and the defect proneness of software modules.
6. RESOLVING ISSUES IN SOFTWARE DEFECT DATASETS
Measurements of different characteristics of software, such as size, complexity and the relationships between its components, are collected in the form of software metrics (Baker et al., 1990). Software metrics have been effectively used in various Software Quality Prediction (SQP) studies (Ganesan et al., 2000, Shen et al., 1985, Thwin and Quah, 2002). Mostly two kinds of metrics have been used to build SQP models: Software Product Metrics (SPdMs) and Software Process Metrics (SPcMs). Product metrics correspond to measurements of attributes of the software itself, for example the number of errors in the software and the number of lines of code. Process metrics correspond to measurements of attributes of the processes carried out during the lifecycle of the software, such as the effectiveness of development and the performance of testing. This chapter unifies the names of software product metrics. Further, the chapter also demonstrates that software metrics have been misused in software defect prediction studies.
A number of SQP models based on product metrics have been reported in the literature for different stages of the software development lifecycle (SDLC) and different software development paradigms (SDPs) (Bouktif et al., 2006, Gokhale and Lyu, 1997, Khosgoftaar and Munson, 1990, Munson and Khosgoftaar, 1992, Quah and Thwin, 2003, Wang et al., 2007). Selection of a prediction model is generally based on a number of parameters, such as the software metrics available, the phase of the software development lifecycle, the software development paradigm, the domain of the software (Jones, 2008), the quality attribute to be predicted, product-based and value-based views of the model (Rana et al., 2008, Wagner and Deissenboeck, 2007), and so on. Selecting a model on the basis of so many parameters poses a problem for organizations and results in subjective selection of a prediction model. To reduce this subjectivity, a generic approach is needed which can help in objectively selecting a model. Though some attempts have been made to develop such approaches to predict software quality (Bouktif et al., 2006, Rana et al., 2008), there are limitations due to different interpretations, nomenclature and representations of model input parameters. For example, numerous product metrics, like program Volume (V) and total Lines of Code (LOC), have been used with different names by different researchers, which has generated inconsistency (Dick and Kandel, 2003, Jensen and Vairavan, 1985, Khosgoftaar and Munson, 1990, Koru and Liu, 2005a, Li et al., 2006, Ottenstein, 1979). The generic approach presented in (Rana et al., 2008) requires such inconsistencies to be removed. Figure 6.1 shows the different components of the generic approach: Input Selection, Model Selection, Model Development and Reporting. The Input Selection activity identifies the relevant dataset from a given repository. The Model Selection activity uses the relevant dataset to compare the performance of the models and select an appropriate defect prediction model for the relevant dataset. The Model Development activity selects important metrics in the dataset and develops a prediction model using the selected metrics. The Reporting activity presents the prediction results to the user.

This chapter identifies two types of inconsistencies in the naming of software product metrics and presents a resolution framework. The suggested framework resolves these inconsistencies on the basis of the definition of the metric and the chronological usage of the metric name. A Unified Metrics Database (UMD) is also introduced as part of the Input Selection activity of Figure 6.1. Further details on the role of the UMD can be found in (Rana et al., 2011).
This chapter also studies the role of Software Science Metrics (SSM) (Halstead, 1977) in defect prediction. The chapter shows with the help of experiments that the use of SSM does not significantly
[Diagram: the Generic Approach for Quality Prediction, with components Input Selection (Similar Dataset Selection over a Datasets Repository (R) and the UMD), Model Selection (using models' performance data), Model Development and Reporting, connected by flows such as software measures used, quality objectives, measures' labels, relevant dataset, appropriate model, prediction results and report.]
Fig. 6.1: Role of UMD in the Generic Approach for Software Quality Prediction
contribute to:
1. classifying OO software modules as defect-prone and not defect-prone (binary classification), or
2. predicting the number of defects in OO software (numeric classification).
6.1 Issues related to Software Defect Data
6.1.1 Inconsistencies in Software Product Metrics Nomenclature
Over the years various researchers have worked in software metrics and proposed various suites of
product metrics that directly or indirectly measure the difficulty or complexity in comprehending
and developing a software (Belady, 1980, Chidamber and Kemerer, 1994, Halstead, 1977, Henry
et al., 1981, McCabe, 1976). Later on different researchers have used these suites and sometimes
proposed similar metrics with different nomenclature (Bouktif et al., 2004, Gyimothy et al., 2005,
Khoshgoftaar et al., 1997b, Khoshgoftaar and Seliya, 2002, Olague et al., 2007, Schneider, 1981,
Shen et al., 1985, Zhou and Leung, 2006). The lack of coordination among the researchers and
slower dissemination of information in the past has resulted in many inconsistencies in taxonomy
of the Software Product Metrics (SPdM). By studying the literature, we have identified and tried
to resolve two types of inconsistencies in the product metrics nomenclature. These are:
• Type I Inconsistency: This type of inconsistency arises in the situations where same metric
has been used with more than one names (or labels) as shown in Figure 6.2a. An example
of Type I inconsistency is Jensen’s estimate of program length which has been referred as
NF (Jensen and Vairavan, 1985) and JE (Guo and Lyu, 2000). This can also be termed as
Different Labels for Same Metric (DLSM) inconsistency.
• Type II Inconsistency: This type of inconsistency arises when the same label has been used for
more than one metric as shown in Figure 6.2b. For example, B has been used as the Halstead
error estimate (Jiang et al., 2008c, Khosgoftaar and Munson, 1990, Ottenstein, 1979, 1981)
and B has also been used as the average level of nesting of the control graph of the program (Jensen
and Vairavan, 1985). This can also be called the Same Label for Different Metrics (SLDM)
inconsistency.

Fig. 6.2: SPdM Type I and Type II Inconsistencies. (a) Type I (DLSM): alternate labels for the same SPdM mapped to one preserved label; (b) Type II (SLDM): the same label used for different SPdMs.
These inconsistencies have been removed using the proposed framework and the resultant UMD
has been put in place. This UMD is part of the generic approach suggested in (Rana et al., 2008).
Before presenting the framework to resolve these inconsistencies, we present some examples of
Type I and Type II inconsistencies in the most frequently used metrics.
Tab. 6.1: Examples of metrics with Type I inconsistency

Metric Definition | Labels Used | Used by
Total lines of code (including comments) | LOC | (Brun and Ernst, 2004), (Dick and Kandel, 2003), (Gyimothy et al., 2005), (Khosgoftaar et al., 1994), (Khosgoftaar and Munson, 1990), (Khoshgoftaar and Allen, 1999c), (Khoshgoftaar and Allen, 1999b), (Khoshgoftaar and Seliya, 2003), (Munson and Khosgoftaar, 1992), (Pizzi et al., 2002), (Xing et al., 2005)
| SLOC | (Briand et al., 1993)
| TC | (Gokhale and Lyu, 1997), (Guo and Lyu, 2000)
| Lines of source code | (Dick and Kandel, 2003)
| Lines of code | (Li et al., 2006)
| LOC TOTAL | (Jiang et al., 2008c)
McCabe’s Cyclomatic complexity | V(G) | (Khosgoftaar and Munson, 1990), (Khoshgoftaar and Allen, 1999c), (Munson and Khosgoftaar, 1992), (Xing et al., 2005)
| MC | (Jensen and Vairavan, 1985)
| VG | (Briand et al., 1993), (Khoshgoftaar et al., 1996), (Khoshgoftaar and Seliya, 2002)
| VG1 | (Dick and Kandel, 2003), (Khosgoftaar et al., 1994)
| McC1 | (Ohlsson and Alberg, 1996)
| M | (Gokhale and Lyu, 1997), (Guo and Lyu, 2000)
| CMT | (Dick and Kandel, 2003)
| Strict Cyclomatic Complexity | (Li et al., 2006)
| Complexity | (Nagappan et al., 2006)
| CYCLOMATIC COMPLEXITY, v(G) | (Jiang et al., 2008c)
Total executable statements (all lines of code excluding comments, declarations and blanks) | EX | (Ottenstein, 1979)
| EXE | (Khosgoftaar and Munson, 1990)
| ELOC | (Khosgoftaar et al., 1994), (Khoshgoftaar and Allen, 1999c)
| STMEXE | (Khoshgoftaar and Allen, 1999b), (Khoshgoftaar and Seliya, 2003)
| CL | (Guo and Lyu, 2000)
| Executable lines | (Dick and Kandel, 2003)
| Size1 | (Quah and Thwin, 2003)
| Lines | (Nagappan et al., 2006)
| LOC EXECUTABLE | (Jiang et al., 2008c)
Metrics with Type I Inconsistency
Type I inconsistency appears frequently in the literature. From the variety of software metrics collected
in the UMD, we show inconsistencies in the three most frequently used metrics in Table 6.1.

Lines of Code (LOC) is primarily used to measure product size, but it has also been shown to have
a strong relationship with the number of errors (Khosgoftaar and Munson, 1990). That is
why it has been widely used in studies related to error prediction, as shown in Table 6.1. Because
the definition of cyclomatic complexity (McCabe, 1976) is based on the premise that the complexity of
software depends on the decision structure of a program, it is considered to be highly correlated with
problems in the program, which has led researchers to use this metric for error prediction
(Khosgoftaar and Munson, 1990, Munson and Khosgoftaar, 1992). The number of executable lines
of code (EX) has been used as a predictor of the number of errors because of its strong correlation
with cyclomatic complexity, Halstead’s program volume and the number of errors (Khosgoftaar and
Munson, 1990, Ottenstein, 1979).

There are a number of other metrics that have Type I inconsistency. In order to match two
datasets, at least their metric labels should be consistent. If a single label is not assigned to each of
the metrics, finding the similarity between datasets becomes difficult, which in turn makes the
development of the generic approach difficult.
Metrics with Type II Inconsistency
Type II inconsistency appears when two different studies assign the same label to different product
metrics, as depicted in Figure 6.2b. A few examples of Type II inconsistency are given in Table
6.2. B has been used with three different meanings: bandwidth of a program (Jensen and Vairavan,
1985), number of branches in the program (Khosgoftaar and Munson, 1990) and Halstead’s error
estimate (Ottenstein, 1979). Similarly, CL has been used to represent total code lines by some
studies and total executable statements by others, but CL was associated with the definition
total code lines (Munson and Khosgoftaar, 1992) prior to its association with the definition total
executable statements by Guo et al. (Guo and Lyu, 2000). Another example of Type II
inconsistency is the label D, which represents Halstead’s difficulty (Shen et al., 1985) as well as the number
of decisions (Khosgoftaar and Munson, 1990). FANOut also has more than one definition:
the number of calls of a subroutine (Li et al., 2006), and the number of objects in the calls made by a
function (Ohlsson and Alberg, 1996).

Removal of this type of inconsistency is pivotal, because it can cause misinterpretation of
different metrics and may lead to invalid results while finding the data similarity mentioned above.
The framework presented in this chapter suggests rules to remove this inconsistency as well.
6.1.2 Ineffective use of Software Science Metrics
Software Science Metrics (SSM), proposed by Halstead (Halstead, 1977), are based on the number
of operators and operands and their usage, and were designed for the procedural paradigm. These metrics are indicators of
software size and complexity (for example, program length N and effort E measure size and complexity
respectively). Earlier studies have found a correlation of software size and complexity
with the number of defects (Ottenstein, 1979, Khosgoftaar and Munson, 1990) and have used size and
complexity metrics as predictors of defects. Studies have used SSM for defect prediction and for classification
of defect prone software modules as well (Ottenstein, 1979, Jensen and Vairavan, 1985,
Munson and Khosgoftaar, 1992, Briand et al., 1993, Khosgoftaar et al., 1994, Gokhale and Lyu,
1997, Khoshgoftaar and Allen, 1999c, Khoshgoftaar and Seliya, 2003, 2004, Xing et al., 2005,
Koru and Liu, 2005a, Li et al., 2006, Seliya and Khoshgoftaar, 2007). Fenton et al. (Fenton and
Neil, 1999) have criticized the use of SSM and other size and complexity metrics in defect prediction
models because 1) the relationship between complexity and defects is not entirely causal and 2)
defects are not a function of size, yet the majority of prediction models rest on these two assumptions
(Fenton and Neil, 1999). Despite the critique, various studies have used SSM to study software developed
in the procedural paradigm (Khoshgoftaar and Allen, 1999c, Khoshgoftaar and Seliya, 2003,
Xing et al., 2005) as well as the object oriented paradigm (Koru and Liu, 2005a, Challagulla et al.,
2005, Seliya and Khoshgoftaar, 2007).

Tab. 6.2: Examples of metrics with Type II inconsistency

Label | Definition Used | Used by
B | Bandwidth of a program | (Jensen and Vairavan, 1985)
B | Count of branches | (Khosgoftaar and Munson, 1990)
B | The Halstead error estimate | (Ottenstein, 1979, 1981)
CL | Total code lines | (Gokhale and Lyu, 1997, Munson and Khosgoftaar, 1992)
CL | Total executable statements | (Guo and Lyu, 2000)
D | Halstead’s difficulty | (Jiang et al., 2008c, Li et al., 2006, Shen et al., 1985)
D | Number of decisions | (Khosgoftaar and Munson, 1990)
FANOut | Number of calls of a subroutine | (Li et al., 2006, Nagappan et al., 2006)
FANOut | Number of objects in the calls made by a certain function | (Ohlsson and Alberg, 1996)
Fig. 6.3: SPdM Unification and Categorization Framework. (a) High Level Design: software product measures with Type I and Type II inconsistencies pass through unification rules to become unified measures, which categorization rules then group using the parameters SDP, SDLC phase and frequency. (b) Dimensions of Categorization: SDP (conventional, object oriented), SDLC phase (design, implementation, testing, deployment, maintenance) and frequency (frequently used, occasionally used).

With the shift of paradigm from procedural to object oriented (OO), metrics such as unique
operands η2, total operand occurrences N2, program vocabulary n and program volume V do not
remain effective indicators of the complexity of the software. This is because of the nature of the OO
paradigm, where software consists of many classes and each class has its own operands (attributes).
Having 10 to 15 classes in the software, each with 5 to 10 attributes, might not make the software as
complex as indicated by these operator and operand based metrics. The complexity of OO
software depends instead on the interaction between the objects of the classes and the complexity of the methods
of the classes. So using SSM as predictors of defects in OO software might not be a wise decision.

The results presented here include the application of different classification models on the dataset
kc1 with class level data (Menzies et al., 2012). An analysis of the impact of removing SSM from
the set of independent variables of the classification models is also presented. The experimental
results show that removing SSM from the set of independent variables does not significantly affect
the binary and numeric classification of OO software modules. As compared to the case when
all the collected metrics are used for both classifications, the number of incorrectly classified
instances and the mean absolute error improved in the absence of SSM for binary and numeric
classification respectively.
6.2 Proposed Approaches to Handle the Issues related to Software Defect Data
6.2.1 Metric Unification and Categorization (UnC) Framework
In the previous section it has been highlighted that Type I and Type II inconsistencies exist in
software product metrics nomenclature. In this section we present a Unification and Categorization
(UnC) framework to remove the said inconsistencies as shown in Figure 6.3a. After the unification,
the product metrics are categorized based on the following three dimensions as shown in Figure
6.3b:
• Usage of a software metric in software development paradigms (SDP)
• Availability of a software metric in different phases of software lifecycle (SDLC)
• Frequency of usage of a software metric in studies related to SQP
We first remove the inconsistencies in metrics nomenclature and then categorize the metrics
with respect to software development paradigm and software lifecycle phase. Afterwards, on
the basis of frequency of usage, the metrics are categorized into two groups, frequently used and
occasionally used. The unified metric is then stored in the Unified Metrics Database (UMD). In
the rest of the section, the unification rules are presented, followed by the categorization method for the
unified metrics.
Unification
Once all the labels used for a product metric are collected from the Inconsistent Metrics Database
(IMD), shown in Figure 6.4a, Rule 1 and Rule 2 are applied serially to the SPdM.
Rule 1: Rule 1, shown in Figure 6.4b, resolves the Type I inconsistency by associating the most
frequently used label with the metric under consideration. The frequency of each label is computed,
and there is no clear winner label that can be associated with the software product metric if:

|f_l1 − f_l2| < δ    (6.1)

where f_l1 and f_l2 are the frequencies of the two most frequently used labels and δ, a positive integer, is
a threshold value provided by the user. In such a case the label used earliest for the product metric
is considered the preserved label.

Fig. 6.4: SPdM Unification and Categorization Framework: Detailed Design. (a) An SPdM through the framework: all labels used for the SPdM are picked from the IMD, Type I and Type II inconsistencies are resolved by Rule 1 and Rule 2 respectively, the frequency of usage f is calculated, and the metric is categorized with respect to paradigm (conventional, OO), SDLC phase and frequency (f > α) before being added to the UMD. (b) Rule 1: find the most frequently used (MFU) label; if a clear winner is found using δ, associate it with the SPdM, otherwise associate the earliest used label. (c) Rule 2: for each conflicting measure find the earliest use of the conflicting label; if the label was used earliest for this SPdM, remove the SPdM from the IMD and remove the label from the alternates of the conflicting SPdMs; otherwise associate the MFU label.
Rule 2: Rule 2, shown in Figure 6.4c, is intended to resolve Type II inconsistency in metrics nomenclature.
If the earliest use of the label preserved by Rule 1 was for the software metric under consideration,
then Rule 2 does not modify the decision taken by Rule 1. In addition, Rule 2 ensures
that the conflicting label is removed from the alternate labels of all the conflicting metrics in the IMD,
so that this label cannot be preserved for any other metric in future. Otherwise, if the earliest use
of the label is for another metric, the decision by Rule 1 is altered and the most frequently used
label is preserved. The metric is then removed from the IMD so that it does not conflict again
with the rest of the product metrics when their unification takes place.
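The two rules can be sketched as follows; this is a minimal illustration under the assumption that each study's use of a metric is recorded as a (label, year) pair, and the function names are ours, not from the thesis implementation:

```python
from collections import Counter

def rule1(usages, delta=2):
    """Rule 1: resolve Type I (DLSM) inconsistency for one metric.
    `usages` is a list of (label, year) pairs, one per study that used the
    metric. Returns the most frequently used label, falling back to the
    earliest used label when |f_l1 - f_l2| < delta (Eq. 6.1)."""
    ranked = Counter(label for label, _ in usages).most_common()
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= delta:
        return ranked[0][0]                      # clear winner
    return min(usages, key=lambda u: u[1])[0]    # earliest used label

def rule2(metric, first_use):
    """Rule 2: resolve a Type II (SLDM) conflict on a shared label.
    `first_use` maps each conflicting metric to the year of its earliest
    use of the label; the metric with the earliest use keeps the label."""
    return min(first_use, key=first_use.get) == metric

# Jensen's estimate of program length: NF (1985) vs JE (2000) - no clear winner
print(rule1([("NF", 1985), ("JE", 2000)]))  # NF (earliest used)
# B: Halstead error estimate (1979) predates bandwidth of the program (1985)
print(rule2("halstead_error_estimate",
            {"halstead_error_estimate": 1979, "bandwidth": 1985}))  # True
```

The second call mirrors the B example worked through in Section 6.3.1: the Halstead error estimate keeps the label B, and B is removed from the alternates of the bandwidth of the program.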
Categorization
The categorization of Software Product Metrics (SPdMs) has been done along the three dimensions shown
in Figure 6.3b. While categorizing the SPdMs with respect to each dimension, we only
consider the use of the metric in that dimension rather than the effective use of the metric
for that dimension. Effective use, or statistical importance, of a metric may differ across
scenarios and datasets, and based on the existing literature it is very difficult to state the
statistical importance of software metrics objectively. The different stages of the unification and categorization
process are shown in Figure 6.4a. Each diamond in the figure represents the start of a
categorization stage. These stages are discussed below.
Software Development Paradigm (SDP)
Some metrics are available in the conventional as well as the object oriented paradigm. Use of a software
metric in more than one paradigm shows the capability of the metric to capture characteristics
of different kinds of software. This capability does not guarantee that the use of the metric will
always be effective. Purao et al. (Purao and Vaishnavi, 2003) have termed the use of a conventional
metric in object oriented systems a misuse of the conventional metric and a researcher’s
bias. For example, η has been used to study the characteristics of software developed using conventional
software engineering (Ottenstein, 1979) as well as the characteristics of software developed
in the OO paradigm (Koru and Liu, 2005a), but using it as one of the indicators of software quality
in the OO paradigm does not give improved results (Koru and Liu, 2005a). Metrics specific to the OO
paradigm, such as relationships between objects (modules) and coupling between object classes, are
better indicators of software characteristics in this case. Similarly there are certain metrics, like
B (Belady, 1980), which have been used only in studies related to software developed in the conventional
paradigm.
Software Development Lifecycle Phase
Availability of a software metric in different phases of the software lifecycle indicates the measurement
of a software characteristic in that phase. For example, Weighted Methods per Class (WMC)
(Chidamber and Kemerer, 1994) can be available in the design phase and is an indicator of the design
of the class. It can further indicate the increments made to the software, i.e. whether new and complex
functionality has been added to the software in the next iteration or not. Availability of a metric in a
particular phase further helps software quality prediction by making it clear that predictions based
on this metric can be made in that particular phase of the lifecycle. Therefore, we have categorized
the product metrics based on the possibility of their availability in software lifecycle phases. The
possible categories for the product metrics are Design+, Impl+, Test+, Dep+ and Mntc, which
represent the design, implementation, testing, deployment and maintenance phases respectively. A
‘+’ sign indicates that the metric is available in the corresponding and the following phases.
Frequency of Usage
The metrics cited in this chapter have been divided into two broader categories based on their
frequency of usage f: frequently used and occasionally used. We define frequently used (FU) metrics as
the metrics which have been used by more than α studies on SQP, i.e. f > α, where α is a positive
integer threshold provided by the user. Otherwise the metric is called an occasionally used (OU)
metric.
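Putting the three dimensions together, the category record of a unified metric could be sketched like this; the helper and field names are illustrative assumptions, not the thesis tooling:

```python
def categorize(paradigms, earliest_phase, f, alpha=2):
    """Categorize a unified SPdM along the three dimensions of Figure 6.3b:
    the paradigms it has been used in (SDP), the earliest SDLC phase where
    it becomes available ('+' meaning it remains available in later phases),
    and its frequency of usage f against the user-provided threshold alpha."""
    phase = earliest_phase if earliest_phase == "Mntc" else earliest_phase + "+"
    return {
        "SDP": sorted(paradigms),
        "SDLC": phase,
        "Frequency": "FU" if f > alpha else "OU",
    }

# The Halstead error estimate B: both paradigms, available from implementation, f = 4
print(categorize({"OO", "Conventional"}, "Impl", 4))
```

With α = 2 this reproduces the categorization of B worked through in Section 6.3.1: an OO and conventional paradigm, Impl+, frequently used metric.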
6.2.2 Proposed Approach to Show Ineffective use of SSM
The role of SSM in defect prediction for OO software has been studied using the dataset kc1 (Menzies
et al., 2012), which consists of class level data from a NASA project (Facility). The dataset has 145
instances and each instance has 94 attributes, which are metrics collected for that software instance.
These attributes include object oriented metrics (Chidamber and Kemerer, 1994), metrics derived
from cyclomatic complexity such as sumCYCLOMATIC_COMPLEXITY, and metrics derived from SSM such as
minNUM_OPERANDS and avgNUM_OPERANDS. A few other size metrics like LOC are also part of the 94 attributes. In total, 48
metrics were derived from SSM; we applied the models listed in Table 6.3 first using all of the
94 attributes as input to the models and then applied the same models to the 46 metrics which are
not derived from SSM.

The data is available in two structurally different formats. One format allows binary classification
and the other allows numeric classification. We performed binary classification (BC) of
modules, i.e. defect prone or not defect prone, as well as numeric classification (NC), i.e. prediction of the number
of defects in the modules, using the various classification models available in WEKA (Witten et al.,
2008) and listed in Table 6.3.

Tab. 6.3: List of classification models used from WEKA (Witten et al., 2008).

BC | | NC |
Model Name | Abbr. | Model Name | Abbr.
Bayesian | Bay | Additive Regression | AR
Decision Table | DTb | Decision Tree | DTr
Instance Based | IB | Linear Regression | LR
Logistic | Log | Support Vector Reg. | SVR

The classification is done using:
1. all the metrics present in the dataset.
2. all the metrics except the SSM based metrics.
Because of the structural nature of the data, we applied different models for BC and NC and
recorded different performance measures. Similarly, the impact of removing SSM from the set
of inputs is studied using different effectiveness measures for the two kinds of classification. We
first discuss the measures related to BC and then the measures related to NC. Accuracy is used as
the model performance measure for BC. Accuracy (Acc) is based on the number of Correctly Classified
Instances (CCI) and the number of Incorrectly Classified Instances (ICI), and is defined as follows:

Acc = CCI / (CCI + ICI)    (6.2)
Effectiveness Eff_i is defined to study the impact of removing SSM from the set of inputs to the
ith binary classification model. Eff_i is given by the following equation:

Eff_i = Acc_{i,All} − Acc_{i,NotSSM}    (6.3)

where Acc_{i,All} is the accuracy of model i using all metrics and Acc_{i,NotSSM} is the accuracy of
model i using all metrics except SSM. Use of SSM is considered effective by model i if Eff_i is
above a threshold α = 0.01, i.e. if the accuracy of model i decreases by more than 0.01 when the SSM are removed from the
set of inputs to model i. If Eff_i is negative, the accuracy of model i has
improved on removing SSM from the set of inputs. In order to measure the overall effectiveness of
SSM, Eff_avg is used, which is the average of all the Eff_i values. Use of SSM will be considered effective
only if Eff_avg is a positive number greater than λ = 0.005. On the other hand, SSM cannot
be considered ineffective if Eff_avg does not fall below −λ.
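Under the assumption that the per-model accuracies are collected in parallel lists, Eq. 6.3 and the Eff_avg verdict can be sketched as follows (the function and variable names are ours, not from the thesis implementation; the example accuracies are those reported for the four BC models in Table 6.4):

```python
def ssm_effectiveness(acc_all, acc_not_ssm, lam=0.005):
    """Per-model Eff_i = Acc_{i,All} - Acc_{i,NotSSM} (Eq. 6.3) and the
    overall verdict based on Eff_avg: 'effective' if Eff_avg > lam,
    'ineffective' if Eff_avg < -lam, otherwise 'inconclusive'."""
    effs = [a - b for a, b in zip(acc_all, acc_not_ssm)]
    eff_avg = sum(effs) / len(effs)
    if eff_avg > lam:
        verdict = "effective"
    elif eff_avg < -lam:
        verdict = "ineffective"
    else:
        verdict = "inconclusive"
    return effs, verdict

# Accuracies of the four BC models (Bay, Log, DTb, IB) with and without SSM
effs, verdict = ssm_effectiveness([0.751, 0.683, 0.703, 0.703],
                                  [0.744, 0.717, 0.717, 0.744])
print(verdict)  # ineffective
```

With these values Eff_avg is about −0.02, well below −λ, which matches the chapter's conclusion that SSM are not effective predictors for this dataset.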
The performance measures recorded for the NC models are: Mean Absolute Error (MAE), Root Mean
Square Error (RMSE), Relative Absolute Error (RAE) and Root Relative Square Error (RRSE),
defined by Equations 6.4, 6.5, 6.6 and 6.7 respectively:

MAE = (1/n) Σ_{i=1}^{n} |P_i − A_i|    (6.4)

where n is the total number of instances, P_i is the predicted number of errors in the ith instance, and A_i is the
observed number of errors in the ith instance.

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (P_i − µ)^2 )    (6.5)

where µ is the mean of the actual numbers of errors.

RAE = ( Σ_{i=1}^{n} |P_i − A_i| ) / ( Σ_{i=1}^{n} |A_i − µ| )    (6.6)

RRSE = sqrt( ( Σ_{i=1}^{n} (P_i − A_i)^2 ) / ( Σ_{i=1}^{n} (A_i − µ)^2 ) )    (6.7)
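A direct transcription of Eqs. 6.4 to 6.7; note that, as defined in this chapter, RMSE measures deviations of the predictions from µ rather than from A_i. The function name is ours:

```python
import math

def nc_error_measures(pred, actual):
    """Compute MAE, RMSE, RAE and RRSE as defined in Eqs. 6.4-6.7,
    where mu is the mean of the actual values."""
    n = len(pred)
    mu = sum(actual) / n
    mae = sum(abs(p - a) for p, a in zip(pred, actual)) / n
    rmse = math.sqrt(sum((p - mu) ** 2 for p in pred) / n)
    rae = (sum(abs(p - a) for p, a in zip(pred, actual))
           / sum(abs(a - mu) for a in actual))
    rrse = math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual))
                     / sum((a - mu) ** 2 for a in actual))
    return mae, rmse, rae, rrse

# Tiny illustration: predicted vs observed defect counts for three modules
mae, rmse, rae, rrse = nc_error_measures([2, 0, 5], [1, 1, 4])
print(round(mae, 2))  # 1.0
```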
To study the impact of removing SSM from the set of inputs to the numeric classification model i,
we have defined a measure Err_i based on the MAE of model i as follows:

Err_i = MAE_{i,NotSSM} − MAE_{i,All}    (6.8)

where MAE_{i,NotSSM} is the MAE of model i using all metrics except SSM and MAE_{i,All} is the
MAE of model i using all metrics. Err_i should be greater than δ = 0.1 in order to consider SSM as
an effective predictor of the number of defects using model i. In order to check the overall effectiveness
of SSM in case of numeric classification, the average error Err_avg is defined. SSM are considered
effective if the average of all Err_i is a positive quantity greater than ε = 0.05. SSM cannot be
considered ineffective if Err_avg does not fall below −ε.

Fig. 6.5: Unification and Categorization of B. (a) B passing through the framework: the alternate labels (B, HALSTEAD_ERROR_EST, B) are picked from the IMD, Rule 1 and Rule 2 are applied, the frequency of usage f = 4 is calculated, and B is categorized as an OO and conventional paradigm, Impl+, frequently used metric before being added to the UMD with its definition. (b) Rule 1: with δ = 2 no clear winner is found among the alternate labels, so the earliest used label B is associated with the SPdM. (c) Rule 2: the earliest use of the conflicting label is found for each conflicting measure; the Halstead error estimate (1979) predates the bandwidth of the program (1985), so B is removed from the IMD and from the alternates of the bandwidth of the program.
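Analogously to binary classification, Eq. 6.8 and the Err_avg verdict can be sketched as follows (the names are ours; the example MAE values are those reported for the four NC models in Table 6.5):

```python
def ssm_error_impact(mae_not_ssm, mae_all, eps=0.05):
    """Per-model Err_i = MAE_{i,NotSSM} - MAE_{i,All} (Eq. 6.8) and the
    overall verdict based on Err_avg: 'effective' if Err_avg > eps,
    'ineffective' if Err_avg < -eps, otherwise 'inconclusive'."""
    errs = [n - a for n, a in zip(mae_not_ssm, mae_all)]
    err_avg = sum(errs) / len(errs)
    if err_avg > eps:
        verdict = "effective"
    elif err_avg < -eps:
        verdict = "ineffective"
    else:
        verdict = "inconclusive"
    return errs, verdict

# MAE of the four NC models (AR, DTr, LR, SVR), without and with SSM
errs, verdict = ssm_error_impact([4.37, 4.89, 5.60, 4.66],
                                 [4.58, 4.92, 6.59, 4.39])
print(verdict)  # ineffective
```

With these values Err_avg is −0.24, below −ε, so SSM also fail the effectiveness criterion for numeric classification.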
6.3 Results
6.3.1 Application of the UnC Framework
The suggested framework has been applied to 140 software product metrics. This section shows
the unification and categorization process carried out for two conflicting product metrics. The same
process is used for the unification and categorization of the rest of the product metrics, and a database
of unified metrics has been developed.
The Halstead error estimate B (Halstead, 1977) and the bandwidth of the program BW (Belady, 1980)
are two conflicting product metrics. The suggested framework is applied to B as shown in Figure
6.5a. Multiple labels can be noticed for this metric, indicating a Type I inconsistency. Hence Rule 1,
as shown in Figure 6.5b, is applied here. In order to apply Rule 1, the value of δ is set equal to 2, and
the frequency f_l of each label is calculated and compared. No clear winner
is identified using expression 6.1, therefore the earliest used label B is preserved. Preserving
the label B for the Halstead error estimate generates a Type II inconsistency with another product
metric, i.e. the bandwidth of the program, since both metrics are labeled B. Therefore Rule 2 is
now applied to remove the Type II inconsistency related to B. Figure 6.5c shows the application
of Rule 2, where it can be noticed that B was used for the error estimate in 1979 by
Ottenstein (Ottenstein, 1979) prior to its use as the bandwidth of the program in 1985 by Jensen et al.
(Jensen and Vairavan, 1985). Hence the label B is preserved for the Halstead error estimate. As a
consequence, the label B is removed from the candidate alternate labels of the bandwidth of the
program. The unification process for the Halstead error estimate B is now complete.

The next step after unification is to categorize the metric B. A metric is categorized with
respect to three dimensions: software development paradigm, software lifecycle phase and frequency
of use. In the context of development paradigm, this metric has been used in the conventional
paradigm (Ottenstein, 1979) as well as in the OO paradigm (Jiang et al., 2008c), and is therefore categorized
as a metric for both the OO and conventional paradigms. As far as the software lifecycle phase
is concerned, B is not available in the design phase as it is based on code metrics. Therefore, B is
categorized as an Impl+ metric as per its usage. Further, B is a frequently used metric because
its frequency f is greater than α, which is set equal to 2. It is, therefore, categorized as a
frequently used product metric. The unification and categorization process for the Halstead error
estimate ends when the metric is saved in the UMD with B as its preserved label.

The bandwidth of the program BW is now left with other candidate labels such as Band
and NDAV. Since BW is the most frequently used label for this metric, BW is now
Type I and Type II consistent. The categorization process groups this metric as a conventional
paradigm, Impl+ and frequently used metric.
Tab. 6.4: Results of binary classification with and without SSM (Halstead, 1977).

Model | Input Metrics | CCI | ICI | Acc
Bay | All | 109 | 36 | 0.751
Bay | Without SSM | 108 | 37 | 0.744
Log | All | 99 | 46 | 0.683
Log | Without SSM | 104 | 41 | 0.717
DTb | All | 102 | 43 | 0.703
DTb | Without SSM | 104 | 41 | 0.717
IB | All | 102 | 43 | 0.703
IB | Without SSM | 108 | 37 | 0.744
Each metric passes through this UnC process when it is added to the UMD using the interface
discussed in (Rana et al., 2011).
6.3.2 Ineffective use of SSM
Table 6.4 shows the results of binary classification of software modules. Using SSM along with the other
available metrics to classify defect prone modules does not help in the case of any model except the
Bayesian classifier. Rather, dropping SSM as predictors improves the Correctly Classified Instances
(CCI) and model accuracy for the dataset under study. Correspondingly, the Incorrectly Classified
Instances (ICI) decreased for all these models on dropping SSM from the input set, which is
a better performance as compared to the case when classification was done using all metrics including
SSM. When all SSM were removed from the input of the Bayesian classifier, which has the
highest accuracy among all four models, the number of ICI increased by 1 and the accuracy of the model
decreased by 0.007. Instance based learning with 1 nearest neighbor (IB) showed the
highest gain in accuracy, about 0.04, when SSM were not part of the input to the classifier.
Results of the numeric classification of modules are presented in Table 6.5, where the MAE of all
the models decreased in the absence of SSM from the classifier inputs, except in the case of
support vector regression. SVR had the lowest MAE among all the NC models in the presence of
SSM, and the increase in MAE for SVR is an interesting observation. Linear regression (LR) showed
a significant decrease of 0.99 in MAE in the absence of SSM from the set of input metrics. The other
three performance measures for NC showed the same pattern as MAE, i.e. for all the models
except SVR, the values of RMSE, RAE and RRSE decreased in the absence of SSM.

Tab. 6.5: Results of numeric classification with and without SSM (Halstead, 1977).

Model | Input | MAE | RMSE | RAE | RRSE
AR | All | 4.58 | 10.32 | 75.89% | 94.59%
AR | Without SSM | 4.37 | 9.78 | 72.44% | 89.61%
DTr | All | 4.92 | 11.23 | 81.51% | 102.84%
DTr | Without SSM | 4.89 | 10.77 | 81.05% | 98.71%
LR | All | 6.59 | 11.17 | 109.08% | 102.35%
LR | Without SSM | 5.60 | 9.42 | 92.77% | 86.25%
SVR | All | 4.39 | 7.42 | 72.64% | 67.96%
SVR | Without SSM | 4.66 | 8.96 | 77.16% | 82.13%
6.4 Analysis and Discussion
This section presents an analysis of the application of the Unification and Categorization (UnC)
framework to 140 Software Product Metrics (SPdMs). An analysis of the performance of the prediction
models in the absence of SSM is also presented here.
6.4.1 UnC Framework Resolves Nomenclature Issues
In order to present the results, we first report the categorization of SPdMs in two classes: frequently
used (FU) and occasionally used (OU). The percentages of FU and OU software product metrics
are 34.29% and 65.71% respectively. Frequency of usage of a metric does not necessarily indicate
its importance; it only shows how many times a certain metric has been used.

Fig. 6.6: SPdM Categorization (Type I and Type II inconsistencies, frequency of usage, and the SDP and SDLC-phase distributions of the SPdMs, with percentages for the FU, OU and overall categories)
The results of the application of the UnC framework are presented under the Frequently Used and
Occasionally Used categorization. The combined results for Frequently and Occasionally Used
metrics are reported as overall metrics. The unification process revealed that 100% of the frequently
used metrics have been Type I inconsistent whereas 16.67% have been Type II inconsistent,
as shown in Figure 6.6. Among the occasionally used metrics, 13.04% have been Type I inconsistent, and there
was no Type II inconsistency. Overall, 42.85% of the product
metrics under study have been Type I inconsistent and 5.71% have been Type II inconsistent. To
remove these inconsistencies, 75% of the frequently used metrics have been assigned their earliest
used labels whereas 25% of the metrics preserve their frequently used label. In the occasionally
used category, the earliest used label has been preserved for 100% of the product metrics. Overall,
91.43% of the product metrics have been assigned the earliest used label and 8.57% have been given the
frequently used label. Table 6.6 presents these statistics.

Tab. 6.6: Percentage of Preserved Labels.

 | Earliest Used Label | Frequently Used Label
FU Metrics | 75% | 25%
OU Metrics | 100% | 0%
Overall | 91.42% | 8.57%

Tab. 6.7: Distribution of Software Product Metrics in Software Development Paradigm with Overlap.

 | OO with Overlap | Conv. with Overlap
FU Metrics | 83.33% | 85.41%
OU Metrics | 60.87% | 60.87%
Overall | 68.57% | 69.28%
During the categorization of the product metrics with respect to development paradigm, some metrics have been found to be used in both the conventional and OO paradigms. This overlap has been 68.75% and 21.74% for the frequently used and occasionally used metrics respectively, as shown in Figure 6.6. The overlap in the FU metrics has been considerably high, which indicates that most of the FU metrics are considered important for quality assessment in both paradigms. Figure 6.6 reports the overlap for both categories of the product metrics combined to be 37.86%.
In addition to the use of some product metrics in both paradigms, other product metrics have been used exclusively for the conventional (e.g., BW, S) or the OO paradigm (e.g., DIT, RFC). The distribution of these metrics in both paradigms is presented in Figure 6.6. Among the frequently used metrics, 16.67% have been used exclusively for the conventional paradigm and 14.58% exclusively for the OO paradigm, whereas the percentage of exclusively conventional and exclusively OO metrics among the occasionally used metrics has been 39.13% each. Figure 6.6 also shows that overall 31.43% and 30.71% of the metrics have been categorized as exclusively conventional and exclusively OO paradigm metrics respectively.
Tab. 6.8: Categorization with respect to Software Lifecycle Phase.
Design Implementation Test Deployment Maintenance
FU Metrics 54.17% 45.83% 0% 0% 0%
OU Metrics 52.17% 29.35% 6.52% 1.08% 10.86%
The overall percentage of the product metrics in both paradigms can be calculated using the above statistics. In the FU metrics category, 85.41% have been used in the conventional paradigm and 83.33% in the OO paradigm, as presented in Table 6.7. In the occasionally used category, 60.87% have been used in each of the conventional and OO paradigms. Overall, 69.28% of the metrics have been used in the conventional paradigm and 68.57% in the OO paradigm.
There are two main reasons for a relatively larger percentage of conventional paradigm metrics
in studies related to software quality prediction (SQP):
• the area of SQP has been of prime importance since the embryonic days of software engi-
neering
• data for the software developed conventionally is publicly available (Menzies et al., 2012,
Facility).
A significant percentage of OO metrics in SQP-related studies indicates that the importance of this area has not diminished in the contemporary era, where software is usually developed using the OO paradigm. However, very limited data (Menzies et al., 2012, Facility) is available for OO software. The percentage of OO metrics in SQP studies has improved significantly since the availability of public data (Menzies et al., 2012, Facility) and is expected to grow with the increase in the amount of public OO data.
The majority of the product metrics used for SQP are design and implementation phase metrics, as seen in Table 6.8 and Figure 6.6. The percentage of design metrics among the frequently used metrics is 54.17%, whereas 45.83% of the frequently used metrics are available only after the code has been written. Among the occasionally used product metrics, 52.17% can be derived after the completion of the
design and 29.35% of the metrics need the code of the program. The percentage of metrics from the other phases is very low: 6.52% of the occasionally used metrics are testing metrics, 1.08% are deployment metrics, and 10.86% are maintenance metrics. Overall, 52.85% of the product metrics are design metrics and 35% are implementation phase metrics. The percentages of testing and deployment phase metrics are 4.28% and 0.71% respectively, and the percentage of maintenance phase metrics is 7.14%.
The high density of design and code metrics in the literature as predictors of software quality highlights that early quality prediction is crucial, and studies try to predict quality as early as the design phase. Efforts to predict software quality using requirements metrics have been made, but without promising results (Jiang et al., 2007). A significant percentage of design and code metrics and a very low percentage of metrics from the later phases of the software lifecycle indicate that the prediction information is desired no later than the start of the testing phase. Among the later phase metrics, a relatively strong percentage of maintenance metrics has been noticed. Maintenance metrics are usually helpful in predicting the quality of later iterations of the same software.
The software product metrics which are available in the design phase can be used for early prediction of quality even when historic data is not available. For example, Yang et al. (2007) used DIT (Chidamber and Kemerer, 1994) early in the lifecycle to predict the reliability and efficiency of software. The majority of the implementation phase metrics, including the software science metrics (Halstead, 1977), have usually been used for both paradigms.
The unification and categorization (UnC) of the product metrics will help the development of SQP models based on product metrics, by aiding in identifying and deciding which metrics to prefer. The UnC can further be helpful in future studies on software metrics: such studies can focus on whichever group of product metrics they are interested in (for example, design metrics only, or OO metrics only).
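As a sketch of how such a focused view over the searchable UMD might look, the snippet below filters a toy metric table along the three categorization dimensions. The records, field names, and the query helper are all illustrative assumptions; the actual UMD schema is not shown in this chapter.

```python
# Toy records mimicking the UMD's three categorization dimensions:
# development paradigm, lifecycle phase, and frequency of use.
metrics = [
    {"label": "DIT", "paradigm": {"OO"},         "phase": "design",         "usage": "FU"},
    {"label": "RFC", "paradigm": {"OO"},         "phase": "design",         "usage": "FU"},
    {"label": "LOC", "paradigm": {"OO", "conv"}, "phase": "implementation", "usage": "FU"},
    {"label": "S",   "paradigm": {"conv"},       "phase": "implementation", "usage": "OU"},
]

def query(db, paradigm=None, phase=None, usage=None):
    """Return labels of metrics matching the given dimensions."""
    return [m["label"] for m in db
            if (paradigm is None or paradigm in m["paradigm"])
            and (phase is None or m["phase"] == phase)
            and (usage is None or m["usage"] == usage)]

# A study interested only in OO design metrics:
print(query(metrics, paradigm="OO", phase="design"))  # ['DIT', 'RFC']
```

A study on early prediction could thus restrict itself to design-phase metrics, while a paradigm comparison could slice the same database along the paradigm dimension.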
6.4.2 Use of SSM Deteriorates Performance
As mentioned earlier, the accuracies of the majority of the BC models have improved in the absence of SSM, and the performance measures of the majority of the NC models have improved as well.
This section discusses the extent of improvement in the performance measures of the BC and NC models. First the effectiveness of SSM reported by each model is discussed on the basis of Effi and Erri, and then the overall effectiveness of SSM is presented on the basis of Effavg and Erravg, which combine the results of the BC and NC models.
The effectiveness of SSM reported by each model and the average values of the effectiveness measures are shown in Table 6.9. The first two columns of the table show that no model has reported a significant decrease in its accuracy on dropping SSM, i.e. no Effi is greater than α. Unlike the other three BC models, the Effi of the Bayesian classifier is a positive number, but since it does not exceed α, we cannot take it as an indication of the effectiveness of SSM. Effavg is also less than λ; hence we cannot claim that SSM have been effective in classifying the software modules of the dataset under study as defect prone or not defect prone. In fact, Effavg is negative and smaller than −λ, which prompts us to believe that SSM have not only been ineffective for this dataset but have negatively affected the classification. Moreover, the decrease in the number of Incorrectly Classified Instances (ICI) and the increase in the number of Correctly Classified Instances (CCI) on dropping SSM further indicate that SSM have a negative effect on the classification of modules in kc1. In the case of NC models, the Erri reported by SVR is greater than δ, which means that SVR has reported SSM as effective for this dataset. SVR differs from the rest of the NC models used in this study: all the models minimize the empirical classification error, but SVR at the same time also maximizes the geometric margin between the classes. Dropping SSM has reduced the empirical error for all the models, but SSM have been helpful for SVR in maximizing the margin between the classes. The values reported by the rest of the NC models are less than δ. Erravg is a negative value below −ε, indicating that using SSM to predict the number of defects in this OO software data is not a wise decision.
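The per-model values in Table 6.9 can be combined as described above. The sketch below reproduces the averages from the table; the threshold values are placeholders chosen only to be consistent with the reported conclusions, not the thesis's actual settings.

```python
from statistics import mean

# Per-model effectiveness values as reported in Table 6.9.
eff = {"Bay": 0.007, "DTb": -0.14, "IB": -0.41, "Log": -0.34}  # BC models (Effi)
err = {"AR": -0.21, "DTr": -0.03, "LR": -0.99, "SVR": 0.27}    # NC models (Erri)

# Placeholder thresholds; the thesis's actual alpha and delta may differ.
ALPHA = DELTA = 0.1

effective_bc = [m for m, v in eff.items() if v > ALPHA]  # models reporting SSM effective
effective_nc = [m for m, v in err.items() if v > DELTA]

print(effective_bc)                  # [] -> no BC model reports SSM as effective
print(effective_nc)                  # ['SVR'] -> only SVR does
print(round(mean(eff.values()), 3))  # -0.221, the Effavg of Table 6.9
print(round(mean(err.values()), 3))  # -0.24, the Erravg of Table 6.9
```

Averaging the per-model values reproduces the Effavg and Erravg rows of the table, and the simple threshold test singles out SVR as the only model reporting SSM effective.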
The dataset used to study the behavior of the classification models in the absence of SSM comprises 145 instances. Though this number of instances is enough to conduct an initial investigation, the results presented here cannot be generalized. More software instances are needed to establish that SSM are ineffective defect predictors in the case of OO software.
Tab. 6.9: Effectiveness of SSM reported by all models
BC Model Effi NC Model Erri
Bay 0.007 AR -0.21
DTb -0.14 DTr -0.03
IB -0.41 LR -0.99
Log -0.34 SVR 0.27
Effavg -0.221 Erravg -0.24
6.5 Summary
One of the limitations to developing a generic software quality prediction approach is the inconsistency found in naming software product metrics. In some cases the same metric has been given different labels, whereas in other cases the same label has been used for different metrics. In this thesis we have identified these two anomalies as Type I and Type II inconsistencies respectively. In order to remove these inconsistencies, a Unification and Categorization (UnC) framework has been suggested. A set of criteria and rules has been devised for unification and categorization. For unification, frequency of use and usage history have been used as criteria. For categorization, three dimensions have been considered, namely frequency of use, software development lifecycle phase, and software development paradigm. The proposed framework has been applied to 140 base and derived metrics to develop a searchable Unified Metrics Database (UMD). Out of these metrics, 42.85% were Type I inconsistent and 5.71% were Type II inconsistent. The percentages of frequently used and occasionally used metrics have been recorded as 34.28% and 65.72% respectively. The metrics pertaining to the object oriented paradigm only and the conventional paradigm only were found to be 30.71% and 31.43% respectively. Most of the metrics have been design and implementation phase metrics, with a combined percentage of 87.86%.
Furthermore, the study of the role of software science metrics (SSM) in defect prediction of object oriented (OO) software reveals that SSM are not effective predictors of defects. The models
used here are first applied using all the metrics available in the dataset and then after removing SSM from the input, and the accuracies and error values of all the models are observed. The effectiveness of SSM is measured at the model level by comparing the accuracies and mean absolute errors of models with and without SSM. The overall effectiveness of SSM is measured by taking averages of the reported error values of all models. Out of the four models used for binary classification, no model has reported SSM as effective metrics to classify OO software modules as defect prone. In the case of NC models, support vector regression has reported the effectiveness of SSM in predicting the number of defects, whereas the other three models have reported a negative role of SSM. The averages of the reported errors of all the models show that the use of SSM for classifying OO software modules and predicting the number of defects does not help in this case, and the prediction error can even be improved if SSM are dropped from the input.
7. CONCLUSIONS AND FUTURE DIRECTIONS
This thesis applies intelligent computing techniques like association mining and fuzzy inference
system to improve prediction of defect-prone modules. The thesis also proposes a framework to
resolve issues in nomenclature of software metrics. This thesis also reports the ineffectiveness of
using Software Science Metrics (SSM) to build defect prediction models.
This thesis starts by handling a known limitation of defect prediction models: they do not achieve Recall as high as desired because the available software defect data is dominated by instances of Not-Defect-prone (ND) modules. This thesis proposes preprocessing the data before building the prediction model by dividing the input variables into equi-frequency bins and finding associations between the bins and the Defect-prone (D) modules. The bins highly associated with D modules are given more importance, and their association with ND modules is removed from the data. This preprocessing results in better prediction of D modules, albeit at the cost of True Negative Rate (TNRate).
This approach has been tested using the Naive Bayes classifier. The thesis analyses the performance gain and performance shortfall for Recall and TNRate respectively. Up to 40% improvement in Recall has been observed when the technique is applied on 5 public datasets with up to 20 bins for each variable. On the other hand, the maximum performance shortfall for any of the datasets has been 36%. A lower TNRate implies a higher False Positive Rate (FPRate). The thesis argues, on the basis of industry feedback, that the performance gain in Recall is more important because shipping a defective module is more critical than extra testing of a defect-free module.
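The preprocessing idea can be sketched on a toy dataset. The equifreq_bins and d_association helpers and the 0.5 association cutoff below are illustrative assumptions; the thesis's actual association mining procedure is not reproduced here.

```python
def equifreq_bins(values, k):
    """Assign each value to one of k equal-frequency bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

def d_association(bins, labels):
    """Fraction of Defect-prone (D) modules falling in each bin."""
    stats = {}
    for b, y in zip(bins, labels):
        n, d = stats.get(b, (0, 0))
        stats[b] = (n + 1, d + (y == "D"))
    return {b: d / n for b, (n, d) in stats.items()}

wmc    = [3, 5, 21, 40, 8, 35, 2, 50]                   # a code metric per module
labels = ["ND", "ND", "D", "D", "ND", "D", "ND", "ND"]  # defect proneness

bins  = equifreq_bins(wmc, 2)
assoc = d_association(bins, labels)
# Remove ND associations from bins highly associated with D modules:
kept = [i for i in range(len(wmc))
        if not (assoc[bins[i]] > 0.5 and labels[i] == "ND")]
print(assoc)  # {0: 0.0, 1: 0.75}
print(kept)   # module 7 (ND in the high-WMC bin) is dropped
```

Dropping the lone ND instance from the defect-heavy bin is what pushes Recall up while sacrificing some TNRate, as discussed above.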
Another limitation in the software defect prediction domain is that for accurate predictions, one has to wait for code metrics that are collected very late in the software lifecycle. This thesis suggests the use of imprecise values for the code and design metrics in the early phases of the lifecycle. The resultant fuzzy input based model gives comparable performance in terms of Recall when compared with models developed using precise values of code metrics. The thesis suggests that these results based on linguistic values should be used for early prediction, and better models with precise inputs can be used later in the lifecycle to improve the prediction. The gap in performance between models based on precise and imprecise inputs has been low.
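The early-lifecycle idea can be illustrated with a toy fuzzy input. The triangular membership functions and their breakpoints below are invented for illustration and are far simpler than the thesis's actual fuzzy inference system.

```python
def tri(x, a, b, c):
    """Triangular membership: rises from a, peaks at b, falls to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def size_memberships(loc_estimate):
    """Map an imprecise early LOC estimate to linguistic terms."""
    return {
        "Low":    tri(loc_estimate, -1, 0, 200),
        "Medium": tri(loc_estimate, 100, 250, 400),
        "High":   tri(loc_estimate, 300, 500, 501),
    }

# Early in the lifecycle only a rough size estimate exists:
print(size_memberships(150))  # partly Low, partly Medium
print(size_memberships(450))  # mostly High
```

A fuzzy rule base over such linguistic inputs can then yield a defect-proneness estimate before precise code metrics exist, to be replaced by a precise-input model once the code is written.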
Towards the end, this thesis identifies anomalies in the naming of software metrics. The two types of inconsistencies are Type I (the same metric has been given different labels) and Type II (the same label has been used for different metrics). The thesis also removes these inconsistencies in 140 software metrics through the proposed Unification and Categorization (UnC) framework. This framework is a set of criteria and rules that employs frequency of use and usage history as criteria for unification. The framework also uses three dimensions, namely frequency of use, software development lifecycle phase, and software development paradigm, for categorization. Out of the 140 metrics, 42.85% were Type I inconsistent and 5.71% were Type II inconsistent. The percentages of frequently used and occasionally used metrics have been recorded as 34.28% and 65.72% respectively. The metrics pertaining to the object oriented paradigm only and the conventional paradigm only were found to be 30.71% and 31.43% respectively. Most of the metrics have been design and implementation phase metrics, with a combined percentage of 87.86%. A list and categorization of all these metrics is given in Appendix B.
This thesis also reports that SSM are not effective predictors of defects. The thesis develops models first by using all the metrics available in the software defect datasets and then by removing SSM from the input. The accuracies and error values of all the models are observed. The effectiveness of SSM is measured at the model level by comparing the accuracies and mean absolute errors of models with and without SSM. The overall effectiveness of SSM is measured by taking averages of the reported error values of all models. Out of the four models used for binary classification, no model has reported SSM as effective measures to classify OO software modules as defect prone. In the case of NC models, support vector regression has reported the effectiveness of SSM in predicting the number of defects, whereas the other three models have reported a negative role of SSM. The averages of the reported errors of all the models show that the use of SSM for classifying OO software modules and predicting the number of defects does not help in this case, and the prediction error can even be improved if SSM are dropped from the input.
While improving defect prediction and handling the limitations mentioned in the thesis, a tool has been developed to support the experimentation activity. This Matlab-based tool has supported the rapid collection of results from the experiments. The architecture and working of the tool are given in Appendix A.
Given the numerous publicly available software defect datasets, the defect prediction models have been evaluated using public data. There are several directions for future research related to the present study:
• The association mining based preprocessing approach presented in Chapter 4 predicts defect prone modules. The proposed approach may also be applied to find types of defects. In that case the software characteristics highly associated with major defects can be further studied and used to design specific test cases. For example, if in a software module higher values of Weighted Methods per Class (WMC) are associated with a major defect, then special test cases may be designed to test the control structures of this software module. These test cases will help discover the errors before release, so that shipment of major defects is avoided.
• Agile development has been reported to be effective for small as well as medium scale software (approximately 1,000 function points). This development methodology has also been used for large software (in the range of 10,000 function points) in the last decade, although the success of agile development methodologies for large applications has been scarcely reported. However, a shift towards these development methodologies has been visible in the last few years, which indicates their success. The shift towards agile methodologies being recent, there is limited literature or data available related to maintenance costs, defect-prone modules, quality of the application, quality of the maintenance task, etc. The work presented in this thesis may be extended to work in an agile development environment where quality related information is not thoroughly collected. For example, some metrics may be categorized to see their effectiveness in agile methodology, and then their values may be estimated in linguistic terms using findings of cross project studies. These linguistic labels can then be fed to the proposed fuzzy inference system to get information regarding defect prone modules.
• In order for the proposed approaches to have good utility, they should be applied in the software industry and evaluated based on current software data. For that matter, the software industry data may be compared with the available public data and the similarity between the two calculated; this way cross project prediction of defects will be useful. However, finding similarity between software projects is a non-trivial task, so the data similarity problem should be resolved before adoption of the proposed approaches.
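One plausible way to operationalize this data similarity check is a two-sample Kolmogorov-Smirnov distance per metric; this choice and the toy samples below are assumptions for illustration, not a method from the thesis.

```python
def ks_distance(a, b):
    """Maximum gap between the empirical CDFs of two samples."""
    points = sorted(set(a) | set(b))
    cdf = lambda sample, x: sum(v <= x for v in sample) / len(sample)
    return max(abs(cdf(a, x) - cdf(b, x)) for x in points)

# Toy LOC distributions for a public dataset and an industry dataset:
public_loc   = [10, 25, 40, 60, 120, 300]
industry_loc = [12, 30, 45, 70, 110, 280]
print(round(ks_distance(public_loc, industry_loc), 3))  # 0.167: fairly similar
```

A small distance across the shared metrics would suggest the public data is a reasonable stand-in for cross project prediction; a large distance would warn against it.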
• The performance of defect prediction models is evaluated using the ROC curve. As discussed earlier, the public data is imbalanced, and ROC curves are considered to present less information than Precision-Recall (PR) curves in the case of skewed data. An analysis of the performance of existing models using PR curves can be performed to: a) show that the information presented using PR curves is more useful than the information presented using ROC curves, and b) compare the performance of the existing models using PR curves to see if the results differ from those reported in the literature.
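A small numeric illustration (with assumed toy counts) of this point: on skewed defect data a model can look good on the ROC axes yet have poor precision, which only the PR view exposes.

```python
# Assumed confusion counts for a skewed dataset: 50 defect-prone (D)
# modules vs 950 not-defect-prone (ND) modules.
tp, fn = 40, 10    # D modules caught / missed
fp, tn = 95, 855   # ND modules flagged / cleared

recall    = tp / (tp + fn)   # TPR, the ROC y-axis
fp_rate   = fp / (fp + tn)   # ROC x-axis: looks comfortably small
precision = tp / (tp + fp)   # PR y-axis: exposes the flood of false alarms

print(recall, fp_rate, round(precision, 3))  # 0.8 0.1 0.296
```

The ROC point (FPR 0.1, TPR 0.8) looks strong, yet fewer than a third of the flagged modules are actually defect prone, which is exactly the information a PR curve makes visible.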
BIBLIOGRAPHY
Seiya Abe, Osamu Mizuno, Tohru Kikuno, Nahomi Kikuchi, and Masayuki Hirayama. Estimation of project success using Bayesian classifier. In Proceedings of The 28th International Conference on Software Engineering, ICSE'06, 2006.
H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic
Control, 19(6):716–723, December 1974.
Saleh Alshomrani, Abdullah Bawakid, Seong-O Shim, Alberto Fernández, and Francisco Herrera. A proposal for evolutionary fuzzy systems using feature weighting: Dealing with overlapping in imbalanced datasets. Knowledge-Based Systems, 73(0):1 – 17, 2015.
S. Amasaki, Y. Takagi, O. Mizuno, and T. Kikuno. A Bayesian Belief Network for assessing
the likelihood of fault content. In Software Reliability Engineering, 2003. ISSRE 2003. 14th
International Symposium on, pages 215– 226, Nov. 2003.
Erik Arisholm, Lionel C. Briand, and Eivind B. Johannessen. A systematic and comprehensive
investigation of methods to build and evaluate fault prediction models. Journal of Systems and
Software, 83(1):2 – 17, 2010.
Albert L. Baker, James M. Bieman, Norman Fenton, David A. Gustafson, Austin Melton, and Robin Whitty. A philosophy for software measurement. Journal of Systems and Software, 12 (Issue 3):277–281, 1990.
Ma Baojun, Karel Dejaeger, Jan Vanthienen, and Bart Baesens. Software defect prediction based
on association rule classification. Technical report, Katholieke Universiteit Leuven, Feb 2011.
Victor R. Basili and Barry T. Perricone. Software errors and complexity: an empirical investiga-
tion. Communications of the ACM, 27:42–52, January 1984.
Steffen Becker, Lars Grunske, Raffaela Mirandola, and Sven Overhage. Performance prediction of
component-based systems - a survey from an engineering perspective. In Architecting Systems
with Trustworthy Components, volume 3938 of Lecture Notes in Computer Science. Springer,
2006.
Laszlo A. Belady. Software geometry. In Proceedings of The 1980 International Computer Symposium, 1980.
Robert M. Bell, Thomas J. Ostrand, and Elaine J. Weyuker. The limited impact of individual developer data on software defect prediction. Empirical Software Engineering, pages 2–13, September 2011. ISSN 1573-7616.
J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, USA, 1981.
P.S. Bishnu and V. Bhattacherjee. Software fault prediction using quad tree-based k-means clus-
tering algorithm. Knowledge and Data Engineering, IEEE Transactions on, 24(6):1146 –1150,
june 2012. ISSN 1041-4347.
Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and
Statistics). Springer, August 2006. ISBN 0387310738.
Salah Bouktif, Danielle Azar, Doina Precup, Houari Sahraoui, and Balazs Kegl. Improving rule
set based software quality prediction: A genetic algorithm-based approach. Journal of Object
Technology, 3(4):227–241, April 2004.
Salah Bouktif, Houari Sahraoui, and Giuliano Antoniol. Simulated annealing for improving soft-
ware quality prediction. In Proceedings of The GECCO’06. ACM, 8-12 July 2006.
Leo Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001.
Lionel C. Briand, Victor R. Basili, and William M. Thomas. A pattern recognition approach for
software engineering data analysis. IEEE Transactions on Software Engineering, Vol. 18(No.
11):931–942, November 1992.
Lionel C. Briand, Victor R. Basili, and Christopher J. Hetmanski. Developing interpretable models
with optimized set reduction for identifying high-risk software components. IEEE Transactions
on Software Engineering, Vol. 19(No. 11):1028–1044, November 1993.
Sarah Brocklehurst and Bev Littlewood. New ways to get accurate reliability measures. IEEE
Software, Vol. 9(Issue 4):34–42, July 1992.
Yuriy Brun and Michael D. Ernst. Finding latent code errors via machine learning over program
executions. In Proceedings of The 26th International Conference on Software Engineering,
ICSE’04, 2004.
David N. Card. Understanding causal systems. CrossTalk: The Journal of Defense Software
Engineering, 2004, pages 15–18, October 2004.
David N. Card. Myths and strategies of defect causal analysis. In Proceedings: The 24th Pa-
cific Northwest Software Quality Conference, Portland, Oregon, pages 469–474, October 10-11
2006.
Cagatay Catal and Banu Diri. A systematic review of software fault prediction studies. Expert Systems with Applications, 36(4):7346–7354, 2009.
Venkata U. B. Challagulla, Farokh B. Bastani, and Raymond A. Paul. Empirical assessment of machine learning based software defect prediction techniques. In Proceedings of 10th Workshop on Object-Oriented Real-Time Dependable Systems (WORDS'05), pages 263–270, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2347-1.
Ching-Pao Chang and Chih-Ping Chu. Improvement of causal analysis using multivariate statistical
process control. Software Quality Control, 16:377–409, September 2008. ISSN 0963-9314.
Ching-Pao Chang, Chih-Ping Chu, and Yu-Fang Yeh. Integrating in-process software defect pre-
diction with association mining to discover defect pattern. Information and Software Technology,
51:375–384, February 2009. ISSN 0950-5849.
Shyam R. Chidamber and C. F. Kemerer. A metrics suite for object oriented designs. IEEE
Transactions on Software Engineering, 20(No. 6):476–493, June 1994.
L. J. Chmura, A. F. Norcio, and T. J. Wicinski. Evaluating software design processes by analyzing
change data over time. IEEE Transactions on Software Engineering, 16(7):729–740, July 1990.
K.J. Cios, W. Pedrycz, R.W. Swiniarski, and L.A. Kurgan. Data Mining: A Knowledge Discovery
Approach. Springer, 2007. ISBN 9780387367958.
William W. Cohen. Learning trees and rules with set-valued features. In Proceedings of the
thirteenth national conference on Artificial intelligence (AAAI’96), pages 709–716, 1996.
Philip Crosby. Quality is Free. New York: McGraw-Hill, 1979. ISBN 0-07-014512-1.
James B. Dabney, Gary Barber, and Don Ohi. Predicting software defect function point ratios using a Bayesian belief network. In Proceedings of the PROMISE workshop, 2006.
Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 233–240. ACM Press, 2006.
Janez Demsar. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res.,
7:1–30, December 2006. ISSN 1532-4435.
Scott Dick and Abraham Kandel. Fuzzy clustering of software metrics. In Proceedings of The
12th IEEE International Conference on Fuzzy Systems, May 2003.
J. P. Egan. Signal Detection Theory and ROC Analysis. Series in Cognition and Perception, 1975.
Albert Endres. An analysis of errors and their causes in system programs. In Proceedings of The
International Conference on Reliable Software, pages 327–336. ACM Press, 1975.
NASA/WVU IV & V Facility. Metrics Data Program (MDP). Internet, http://mdp.ivv.nasa.gov/.
Norman Fenton, Paul Krause, Martin Neil, and Crossoak Lane. Software measurement: Uncer-
tainty and causal modelling. IEEE Software Magazine, 19(14):116 – 122, July/Aug. 2002.
Norman Fenton, Martin Neil, and David Marquez. Using Bayesian networks to predict software defects and reliability. 2007a.
Norman Fenton, Martin Neil, William Marsh, Peter Hearty, David Marquez, Paul Krause, and
Rajat Mishra. Predicting Software Defects in Varying Development Lifecycles using Bayesian
Nets. Information and Software Technology, 49:32–43, January 2007b. ISSN 0950-5849.
Norman Fenton, Martin Neil, William Marsh, Peter Hearty, Lukasz Radlinski, and Paul Krause.
Project data incorporating qualitative factors for improved software defect prediction. In
Proceedings of 3rd International Workshop on Predictor Models in Software Engineering
(PROMISE ’07), pages 69–, Washington, DC, USA, 2007c. IEEE Computer Society. ISBN
0-7695-2830-9.
Norman Fenton, Martin Neil, William Marsh, Peter Hearty, Lukasz Radlinski, and Paul Krause. On the effectiveness of early life cycle defect prediction with Bayesian Nets. Empirical Softw. Engg., 13:499–537, October 2008. ISSN 1382-3256.
Norman E. Fenton and Martin Neil. A critique of software defect prediction models. IEEE Trans-
actions on Software Engineering, Vol. 25(No. 5):675–687, September/October 1999.
Norman E. Fenton and Niclas Ohlsson. Quantitative analysis of faults and failures in a complex
software system. IEEE Transactions on Software Engineering, Vol. 26(No. 8), August 2000.
Norman E. Fenton and Shari Lawrence Pfleeger. Software Metrics: A Rigorous and Practical
Approach. PWS Publishing Co., 2nd edition, 1998.
International Organization for Standardization. ISO 9000:2005 quality management systems – fundamentals and vocabulary. Standard, 2005.
K. Ganesan, Taghi M. Khosgoftaar, and Edward B. Allen. Case-based software quality prediction.
International Journal of Software Engineering and Knowledge Engineering, Vol. 10(No. 2):
139–152, February 2000.
Vicente García, Ramón A. Mollineda, and J. Salvador Sánchez. A bias correction function for classification performance assessment in two-class imbalanced problems. Knowledge-Based Systems, 59(0):66 – 74, 2014.
Felix García, Manuel F. Bertoa, Coral Calero, Antonio Vallecillo, Francisco Ruiz, Mario Piattini, and Marcela Genero. Towards a consistent terminology for software measurement. Information and Software Technology, 48(8):631 – 644, 2006. ISSN 0950-5849.
Felix García, Francisco Ruiz, Coral Calero, Manuel F. Bertoa, Antonio Vallecillo, Beatriz Mora, and Mario Piattini. Effective use of ontologies in software measurement. The Knowledge Engineering Review, 24(1):23–40, 2009. ISSN 0269-8889.
Swapna S. Gokhale and Michael R. Lyu. Regression tree modeling for the prediction of software
quality. In Proceedings of The 3rd ISSAT Intl. Conference on Reliability, 1997.
Vincenzo Grassi. Architecture-based dependability prediction for service-oriented computing. In
ICSE, Workshop on Architecting Dependable Systems, WADS’04, Lecture Notes in Computer
Science. Springer, June 2004.
Vincenzo Grassi and Simone Patella. Reliability prediction for service-oriented computing envi-
ronments. IEEE Internet Computing, pages 43–49, May-June 2006.
Andrew R. Gray and Stephen G. MacDonell. A comparison of techniques for developing predictive
models of software metrics. Information and Software Technology, 39(1997):425 – 437, 1997.
D. Gray, D. Bowes, N. Davey, Yi Sun, and B. Christianson. The misuse of the NASA Metrics Data Program data sets for automated software defect prediction. In Evaluation Assessment in Software Engineering (EASE 2011), 15th Annual Conference on, pages 96–103, April 2011.
David Grosser, Houari A. Sahraoui, and Petko Valtchev. Analogy-based software quality predic-
tion. In 7th Workshop On Quantitative Approaches In Object-Oriented Software Engineering,
QAOOSE’03, June 2003.
Ping Guo and Michael R. Lyu. Software quality prediction using mixture models with em algo-
rithm. In Proceedings of The First Asia-Pacific Conference on Quality Software, 2000.
Tibor Gyimothy, Rudolf Ferenc, and Istvan Siket. Empirical validation of object-oriented metrics
on open source software for fault prediction. IEEE Transactions on Software Engineering, 31
(No. 10):897–910, October 2005.
Mark Hall and Eibe Frank. Combining naive bayes and decision tables. In Proceedings of the 21st
Florida Artificial Intelligence Society Conference (FLAIRS), pages 2–3. AAAI press, 2008.
Tracy Hall, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. A systematic re-
view of fault prediction performance in software engineering. IEEE Transactions on Software
Engineering, 99(PrePrints), 2011. ISSN 0098-5589.
Maurice H. Halstead. Elements of Software Science. Elsevier North-Holland, New York, 1977.
Simon Haykin. Neural Networks: A Comprehensive Foundation. Macmillan, New York, 1994.
Peng He, Bing Li, Xiao Liu, Jun Chen, and Yutao Ma. An empirical study on software defect
prediction with a simplified metric set. Information and Software Technology, 59:170–190,
March 2015.
Sallie Henry, Dennis Kafura, and Kathy Harris. On the relationships among three software met-
rics. In Proceedings of the 1981 ACM Workshop/Symposium on Measurement and Evaluation
of Software Quality, pages 81–88. ACM, 1981.
Jens Christian Huehn and Eyke Huellermeier. FURIA: An algorithm for unordered fuzzy rule induction. Data Mining and Knowledge Discovery, 2009.
IEEE. IEEE standard for a software quality metrics methodology. International Standard 1061-1998, 1998.
IEEE. IEEE standard adoption of ISO/IEC 15939:2007 - systems and software engineering - measurement process. International Standard 15939:2008, 2008.
British Standards Institution. ISO 9001:2008 quality management systems – requirements. Standard, 2008.
ISO/IEC. Software engineering – product quality – part 1: Quality model. International Standard
9126-1, June 15 2001.
Howard A. Jensen and K. Vairavan. An experimental study of software metrics for real-time
software. IEEE Transactions on Software Engineering, Vol. SE-11(No. 2):231–234, February
1985.
Yue Jiang, Bojan Cukic, and Tim Menzies. Fault prediction using early lifecycle data. In Pro-
ceedings of The 18th IEEE International Symposium on Software Reliability (ISSRE) 07, pages
237–246. IEEE Computer Society, 2007.
Yue Jiang, Bojan Cukic, and Yan Ma. Techniques for evaluating fault prediction models. Empirical
Softw. Engg., 13:561–595, October 2008a. ISSN 1382-3256.
Yue Jiang, Bojan Cukic, and Tim Menzies. Cost curve evaluation of fault prediction models. In
Proceedings of the 2008 19th International Symposium on Software Reliability Engineering,
ISSRE ’08, pages 197–206, Washington, DC, USA, 2008b. IEEE Computer Society. ISBN
978-0-7695-3405-3.
Yue Jiang, Bojan Cukic, Tim Menzies, and Nick Bartlow. Comparing design and code metrics for
software quality prediction. In Proceedings of PROMISE’08, pages 11–18. ACM, May 2008c.
Han Jiawei and Kamber Micheline. Data Mining - Concepts and Techniques. Morgan Kaufmann,
2002.
Capers Jones. Applied Software Measurement: Global Analysis of Productivity and Quality. Tata
McGraw-Hill, 3 edition, 2008.
Yasutaka Kamei, Akito Monden, Shuji Morisaki, and Ken-ichi Matsumoto. A hybrid faulty module
prediction using association rule mining and logistic regression analysis. In Proceedings of the
Second ACM-IEEE international symposium on Empirical software engineering and measure-
ment, ESEM ’08, pages 279–281, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-971-5.
Stephen H. Kan. Metrics and Models in Software Quality Engineering. Addison-Wesley Longman
Publishing Co., Inc., 2nd edition, 2002.
R. Karthik and N. Manikandan. Defect association and complexity prediction by mining asso-
ciation and clustering rules. In Computer Engineering and Technology (ICCET), 2010 2nd
International Conference on, volume 7, pages V7–569–V7–573, 2010.
Arashdeep Kaur, Parvinder S. Sandhu, and Amanpreet Singh Bra. Early software fault prediction
using real time defect data. In Proceedings of the 2009 Second International Conference on
Machine Vision, ICMV ’09, pages 242–245, Washington, DC, USA, 2009. IEEE Computer
Society. ISBN 978-0-7695-3944-7.
Malik Jahan Khan, Shafay Shamail, Mian Muhammad Awais, and Tauqeer Hussain. Comparative
study of various artificial intelligence techniques to predict software quality. In Proceedings of
The 10th IEEE Multitopic Conference, 2006, INMIC ’06, pages 173–177, December 2006.
Taghi M. Khoshgoftaar and John C. Munson. Predicting software development errors using soft-
ware complexity metrics. IEEE Journal on Selected Areas in Communications, 8(2):253–261,
February 1990.
Taghi M. Khoshgoftaar, David L. Lanning, and Abhijit S. Pandya. A comparative study of pattern
recognition techniques for quality evaluation of telecommunications software. IEEE Journal on
Selected Areas in Communications, 12(2):279–291, February 1994.
Taghi M. Khoshgoftaar and Edward B. Allen. Logistic regression modeling of software quality.
International Journal of Reliability, Quality and Safety Engineering, Vol. 6(Issue 4):303–317,
December 1999a.
Taghi M. Khoshgoftaar and Edward B. Allen. Predicting fault-prone software modules in embed-
ded systems with classification trees. In Proceedings of The 4th IEEE International Symposium
on High-Assurance Systems Engineering. IEEE, Computer Society, 1999b.
Taghi M. Khoshgoftaar and Edward B. Allen. A comparative study of ordering and classification
of fault-prone software modules. Empirical Software Engineering, 4:159–186, 1999c.
Taghi M. Khoshgoftaar and Naeem Seliya. Tree-based software quality estimation models for fault
prediction. In Proceedings of the Eighth IEEE Symposium on Software Metrics (METRICS’02).
IEEE Computer Society, 2002.
Taghi M. Khoshgoftaar and Naeem Seliya. Fault prediction modeling for software quality esti-
mation: Comparing commonly used techniques. Empirical Software Engineering, 8(No. 3):
255–283, September 2003. ISSN 1382-3256.
Taghi M. Khoshgoftaar and Naeem Seliya. Comparative assessment of software quality classifica-
tion techniques: An empirical case study. Empirical Software Engineering, 9:229–257, 2004.
Taghi M. Khoshgoftaar, John C. Munson, Bibhuti B. Bhattacharya, and Gary D. Richardson. Pre-
dictive modeling techniques of software quality from software measures. IEEE Trans. Softw.
Eng., 18:979–987, November 1992. ISSN 0098-5589.
Taghi M. Khoshgoftaar, Edward B. Allen, Kalai S. Kalaichelvan, and Nishith Goel. Early quality
prediction: A case study in telecommunications. IEEE Software, 13(1):65–71, 1996.
Taghi M. Khoshgoftaar, Edward B. Allen, Robert Halstead, Gary P. Trio, and Ronald Flass. Pro-
cess measures for predicting software quality. In Proceedings of The High-Assurance Systems
Engineering Workshop, pages 155 –160, aug 1997a.
Taghi M. Khoshgoftaar, K. Ganesan, Edward B. Allen, Fletcher D. Ross, Rama Munikoti, Nishith
Goel, and Amit Nandi. Predicting fault-prone modules with case-based reasoning. In Proceed-
ings of The Eighth International Symposium On Software Reliability Engineering, pages 27–35.
IEEE, 2-5 Nov 1997b.
Taghi M. Khoshgoftaar, Naeem Seliya, and Nandini Sundaresh. An empirical study of predicting
software faults with case-based reasoning. Software Quality Journal, 14(2):85–111, June 2006.
ISSN 0963-9314.
Sunghun Kim, Thomas Zimmermann, E. James Whitehead Jr., and Andreas Zeller. Predicting
faults from cache history. In Proceedings of The 29th International Conference on Software
Engineering, ICSE’07, 2007.
Michael Klas, Haruka Nakao, Frank Elberzhager, and Jurgen Munch. Support planning and con-
trolling of early quality assurance by combining expert judgment and defect data–a case study.
Empirical Softw. Engg., 15(4):423–454, August 2010.
Aleksander Kolcz, Abdur Chowdhury, and Joshua Alspector. Data duplication: an imbalance
problem? In Workshop on Learning from Imbalanced Datasets II, ICML, Washington DC,
2003.
A. Gunes Koru and Hongfang Liu. An investigation of the effect of module size on defect predic-
tion using static measures. In Proceedings of International Workshop on Predictor Models in
Software Engineering (PROMISE ’05). ACM Press, 2005a.
A. Gunes Koru and Hongfang Liu. Building defect prediction models in practice. IEEE Softw., 22
(6):23–29, nov 2005b. ISSN 0740-7459.
Bart Kosko. Fuzzy engineering. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1997. ISBN
0-13-124991-6.
Miroslav Kubat, Robert C. Holte, and Stan Matwin. Machine learning for the detection of oil spills
in satellite radar images. Machine Learning, 30:195–215, 1998.
Stefan Lessmann, Bart Baesens, Christophe Mues, and Swantje Pietsch. Benchmarking classifi-
cation models for software defect prediction: A proposed framework and novel findings. IEEE
Trans. Softw. Eng., 34(4):485–496, jul 2008.
Paul Luo Li, James Herbsleb, Mary Shaw, and Brian Robinson. Experiences and results from
initiating field defect prediction and product test prioritization efforts at abb inc. In Proceedings
of The 28th International Conference on Software Engineering, ICSE’06, 2006.
Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association rule mining.
In In Proc. of the Fourth International Conference on Knowledge Discovery and Data Mining
(KDD-98), pages 80–86, 1998.
Yan Liu, Allen Fekete, and Ian Gorton. Design-level performance prediction of component based
applications. IEEE Transactions on Software Engineering, 31(11):928–941, November 2005.
Atchara Mahaweerawat, Peraphon Sophatsathit, Chidchanok Lursinsap, and Petr Musilek. Fault
prediction in object oriented software using neural network techniques. In Proceedings of the
InTech Conference, pages 27–34, December 2-4 2004.
Mark W. Maier and Eberhardt Rechtin. The Art of Systems Architecting (2Nd Ed.). CRC Press,
Inc., Boca Raton, FL, USA, 2000. ISBN 0-8493-0440-7.
Henry B. Mann and D. R. Whitney. On a Test of Whether one of Two Random Variables is
Stochastically Larger than the Other. The Annals of Mathematical Statistics, 18(1):50–60, 1947.
Thomas J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2(4):
308–320, December 1976.
Thilo Mende. Replication of defect prediction studies: problems, pitfalls and recommendations. In
Proceedings of the 6th International Conference on Predictive Models in Software Engineering,
PROMISE ’10, pages 5:1–5:10, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0404-7.
Thilo Mende and Rainer Koschke. Revisiting the evaluation of defect prediction models. In
Proceedings of the 5th International Conference on Predictor Models in Software Engineering,
PROMISE ’09, pages 7:1–7:10, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-634-2.
T. Menzies, J. Greenwald, and A. Frank. Data mining static code attributes to learn defect predic-
tors. Software Engineering, IEEE Transactions on, 33(1):2 –13, jan. 2007. ISSN 0098-5589.
Tim Menzies, Justin S. Di Stefano, and Mike Chapman. Learning early lifecycle IV&V quality
indicators. In Proceedings of IEEE Metrics 03. IEEE, 2003.
Tim Menzies, Burak Turhan, Ayse Bener, Gregory Gay, Bojan Cukic, and Yue Jiang. Implications
of ceiling effects in defect predictors. In Proceedings of PROMISE ’08, pages 47–54. ACM,
May 2008. ISBN 978-1-60558-036-4.
Tim Menzies, Zach Milton, Burak Turhan, Bojan Cukic, Yue Jiang, and Ayse Bener. Defect
prediction from static code features: current results, limitations, new approaches. Automated
Software Engg., 17(4):375–407, dec 2010. ISSN 0928-8910.
Tim Menzies, Bora Caglayan, Zhimin He, Ekrem Kocaguneli, Joe Krall, Fayola Peters, and Bu-
rak Turhan. The promise repository of empirical software engineering data, June 2012. URL
http://promisedata.googlecode.com.
C. E. Metz. Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8(4):283–298, 1978.
Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
Audris Mockus, Ping Zhang, and Paul Luo Li. Predictors of customer perceived software quality.
In Proceedings of The 27th International Conference on Software Engineering, ICSE’05, 15-21
May 2005.
Siba N. Mohanty. Models and measurements for quality assessment of software. ACM Computing
Surveys, 11(3):251–275, September 1979.
John C. Munson and Taghi M. Khoshgoftaar. The detection of fault-prone programs. IEEE Trans-
actions on Software Engineering, 18(5):423–434, May 1992.
Nachiappan Nagappan and Thomas Ball. Static analysis tools as early indicators of pre-release
defect density. In Proceedings of The 27th International Conference on Software Engineering,
ICSE’05, 2005a.
Nachiappan Nagappan and Thomas Ball. Use of relative code churn measures to predict system
defect density. In Proceedings of The 27th International Conference on Software Engineering,
ICSE’05, 2005b.
Nachiappan Nagappan, Thomas Ball, and A. Zeller. Mining metrics to predict component failures.
In Proceedings of The 28th International Conference on Software Engineering, ICSE’06, 2006.
Martin Neil and Norman Fenton. Predicting software quality using Bayesian belief networks.
In Proceedings of 21st Annual Software Engineering Workshop NASA/Goddard Space Flight
Centre, pages 217 – 230, 1996.
Niclas Ohlsson and Hans Alberg. Predicting fault-prone software modules in telephone switches.
IEEE Transactions on Software Engineering, Vol. 22(No. 12):34–42, December 1996.
Niclas Ohlsson, Ming Zhaq, and Mary Helander. Application of multivariate analysis for software
fault prediction. Software Quality Journal, 7:51–66, 1998.
Ahmet Okutan and Olcay Taner Yildiz. Software defect prediction using Bayesian networks. Em-
pirical Software Engineering, 19(1):154–181, 2014.
Hector M. Olague, Letha H. Etzkorn, Sampson Gholston, and Stephen Quattlebaum. Empirical
validation of three software metrics suites to predict fault-proneness of object-oriented classes
developed using highly iterative or agile software development processes. IEEE Transactions
on Software Engineering, 33(No. 6):402–419, October 2007.
Linda M. Ottenstein. Quantitative estimates of debugging requirements. IEEE Transactions on
Software Engineering, Vol. SE-5(No. 5):504–514, September 1979.
Linda M. Ottenstein. Predicting numbers of errors using software science. In Proceedings of The
1981 ACM Workshop/Symposium on Measurement and Evaluation of Software Quality, pages
157–167. ACM, 1981.
G.J. Pai and J. Bechta Dugan. Empirical Analysis of Software Fault Content and Fault Proneness
Using Bayesian Methods. Software Engineering, IEEE Transactions on, 33(10):675–686, Oct
2007.
Yi Peng, Gang Kou, Guoxun Wang, Wenshuai Wu, and Yong Shi. Ensemble of software defect
predictors: An AHP-based evaluation method. International Journal of Information Technology
& Decision Making (IJITDM), 10(01):187–206, 2011.
Nicolino J. Pizzi, Arthur R. Summers, and Witold Pedrycz. Software quality prediction using
median-adjusted class labels. In Proceedings of The 2002 International Joint Conference on
Neural Networks, IJCNN’02, 2002.
Roger S. Pressman. Software Engineering: A Practitioner’s Approach. Pearson, 7th edition, 2010.
J. Priyadarshin. Mining of defect using Apriori and defect correction effort prediction. In Proceed-
ings of 2nd National Conference on Challenges and Opportunities in Information Technology
(COIT-2008) RIMT-IET, Mandi Gobindgarh., pages 38–41, 2008.
Sandeep Purao and Vijay Vaishnavi. Product metrics for object-oriented systems. ACM Computing
Surveys, 35(2):191–221, June 2003.
Tong-Seng Quah and Mie Mie Thet Thwin. Application of neural network for predicting software
development faults using object-oriented design metrics. In Proceedings of The 19th Inter-
national Conference on Software Maintenance, ICSM’03. IEEE Computer Society, September
2003.
Zeeshan A. Rana, Sehrish Abdul Malik, Shafay Shamail, and Mian M. Awais. Identifying associ-
ation between longer itemsets and software defects. In Minho Lee, Akira Hirose, Zeng-Guang
Hou, and RheeMan Kil, editors, Neural Information Processing, volume 8228 of Lecture Notes
in Computer Science, pages 133–140. Springer Berlin Heidelberg, 2013. ISBN 978-3-642-
42050-4.
Zeeshan Ali Rana, Shafay Shamail, and Mian Muhammad Awais. Towards a generic model for
software quality prediction. In WoSQ ’08: Proceedings of the 6th International Workshop on
Software Quality, pages 35–40. ACM, May 2008. ISBN 978-1-60558-023-4.
Zeeshan Ali Rana, Mian Muhammad Awais, and Shafay Shamail. An FIS for early detection
of defect prone modules. In De-Shuang Huang, Kang-Hyun Jo, and Hong-Hee Lee, editors,
Proceedings of ICIC’09, Emerging Intelligent Computing Technology and Applications. With
Aspects of Artificial Intelligence, volume 5755/2009 of Lecture Notes in Computer Science,
pages 144–153, Ulsan, South Korea, September 16-19 2009a. Springer Berlin Heidelberg.
Zeeshan Ali Rana, Shafay Shamail, and Mian Muhammad Awais. Ineffectiveness of use of soft-
ware science metrics as predictors of defects in object oriented software. In WCSE ’09: Pro-
ceedings of the 2009 WRI World Congress on Software Engineering, pages 3–7, Washington,
DC, USA, May 19-21 2009b. IEEE Computer Society.
Zeeshan Ali Rana, Mian Muhammad Awais, and Shafay Shamail. Nomenclature unification of
software product measures. IET Software, 5(1):83–102, 2011.
Zeeshan Ali Rana, Mian M. Awais, and Shafay Shamail. Impact of using information gain in
software defect prediction models. In De-Shuang Huang, Vitoantonio Bevilacqua, and Prashan
Premaratne, editors, Intelligent Computing Theory, volume 8588 of Lecture Notes in Computer
Science, pages 637–648. Springer International Publishing, 2014. ISBN 978-3-319-09332-1.
doi: 10.1007/978-3-319-09333-8_69. URL http://dx.doi.org/10.1007/978-3-319-09333-8_69.
Ralf H. Reussner, Heinz W. Schmidt, and Iman H. Poernomo. Reliability prediction for
component-based software architectures. Journal of Systems and Software, 66(Issue 3):241–
252, June 2003.
C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 2
edition, 1979. ISBN 0408709294.
P.S. Sandhu, R. Goel, A.S. Brar, J. Kaur, and S. Anand. A model for early prediction of faults in
software systems. In The 2nd International Conference on Computer and Automation Engineer-
ing (ICCAE), pages 281–285, 2010. ISBN 978-1-4244-5585-0.
Victor Schneider. Some experimental estimators for developmental and delivered errors in software
development projects. ACM SIGMETRICS Performance Evaluation Review, Vol. 10(Issue 1):
169–172, 1981.
Naeem Seliya and Taghi M. Khoshgoftaar. Software quality estimation with limited fault data: A
semi-supervised learning perspective. Software Quality Journal, 15:327–344, August 2007.
Zongyao Sha and Jiangping Chen. Mining association rules from dataset containing predeter-
mined decision itemset and rare transactions. In Natural Computation (ICNC), 2011 Seventh
International Conference on, volume 1, pages 166 –170, july 2011.
Shari L. Pfleeger and Joanne M. Atlee. Software Engineering: Theory and Practice. Pearson, 4th
edition, 2010.
Vincent Y. Shen, Tze-Jie Yu, Stephen M. Thebaut, and Lorri R. Paulsen. Identifying error-prone
software – an empirical study. IEEE Transactions on Software Engineering, SE-11:317–323,
April 1985.
Martin J. Shepperd, Qinbao Song, Zhongbin Sun, and Carolyn Mair. Data quality: Some comments
on the nasa software defect datasets. IEEE Transactions on Software Engineering, 39(9):1208–
1215, September 2013.
IEEE Computer Society. IEEE Standard Glossary of Software Engineering Terminology. IEEE
Standard 610.12-1990, 1990.
Qinbao Song, M. Shepperd, M. Cartwright, and C. Mair. Software defect association mining and
defect correction effort prediction. Software Engineering, IEEE Transactions on, 32(2):69–82,
feb. 2006.
Qinbao Song, Zihan Jia, M. Shepperd, Shi Ying, and Jin Liu. A general software defect-proneness
prediction framework. Software Engineering, IEEE Transactions on, 37(3):356 –370, may-june
2011. ISSN 0098-5589.
Kotsiantis Sotiris and Kanellopoulos Dimitris. Association rules mining: A recent overview.
GESTS International Transactions on Computer Science and Engineering, 32:71–82, 2006.
Zhongbin Sun, Qinbao Song, and Xiaoyan Zhu. Using coding-based ensemble learning to improve
software defect prediction. Systems, Man, and Cybernetics, Part C: Applications and Reviews,
IEEE Transactions on, 42(6):1806 –1817, nov. 2012. ISSN 1094-6977.
M.P. Thapaliyal and Garima Verma. Software defects and object oriented metrics - an empirical
analysis. International Journal of Computer Applications, 9(5):41–44, November 2010. Pub-
lished By Foundation of Computer Science.
Mie Mie Thet Thwin and Tong-Seng Quah. Application of neural network for predicting software
development faults using object-oriented design metrics. In Proceedings of The 9th International
Conference on Neural Information Processing, ICONIP’02, volume 5, 2002.
Burak Turhan and Ayse Bener. Analysis of Naive Bayes’ assumptions on software fault data: An
empirical study. Data Knowl. Eng., 68(2):278–290, February 2009.
Vijay K. Vaishnavi, Sandeep Purao, and Jens Liegle. Object-oriented product metrics: A generic
framework. Information Sciences, 177:587–606, 2007.
Stefan Wagner. Global sensitivity analysis of predictor models in software engineering. In
Proceedings of 3rd International Workshop on Predictor Models in Software Engineering
(PROMISE ’07). IEEE Computer Society Press, 2007.
Stefan Wagner and Florian Deissenboeck. An integrated approach to quality modelling. In Pro-
ceedings of 5th Workshop on Software Quality (5-WoSQ). IEEE Computer Society Press, 2007.
Huanjing Wang, Taghi M. Khoshgoftaar, and Naeem Seliya. How many software metrics should
be selected for defect prediction? In FLAIRS Conference, pages 69–74. AAAI Publications,
2011.
Huanjing Wang, Taghi M. Khoshgoftaar, and Qianhui (Althea) Liang. A study of software metric
selection techniques: Stability analysis and defect prediction model performance. International
Journal on Artificial Intelligence Tools, 22(05):1360010, 2013.
Ke Wang, Senqiang Zhou, Qiang Yang, and Jack Man Shun Yeung. Mining customer value:
From association rules to direct marketing. Data Mining and Knowledge Discovery, 11(1):
57–79, 2005.
Qi Wang, Bo Yu, and Jie Zhu. Extract rules from software quality prediction model based on neural
network. In Proceedings of The 16th IEEE International Conference on Tools with Artificial
Intelligence, ICTAI’04, 2004.
Qi Wang, Jie Zhu, and Bo Yu. Extract rules from software quality prediction model based on neural
network. In Proceedings of The 11th International Conference on Evaluation and Assessment
in Software Engineering, EASE’07, April 2007.
Wen-Li Wang, Ye Wu, and Mei-Hwa Chen. An architecture-based software reliability model. In
Proceedings of The Pacific Rim International Symposium on Dependable Computing, 1998.
Elaine J. Weyuker, Thomas J. Ostrand, and Robert M. Bell. Comparing negative binomial and
recursive partitioning models for fault prediction. In Proceedings of the 4th international work-
shop on Predictor models in software engineering, PROMISE ’08, pages 3–10, New York, NY,
USA, 2008. ACM. ISBN 978-1-60558-036-4.
Leland Wilkinson, Anushka Anand, and Dang Nhon Tuan. Chirp: a new classifier based on com-
posite hypercubes on iterated random projections. In Proceedings of the 17th ACM SIGKDD
international conference on Knowledge discovery and data mining, KDD ’11, pages 6–14, New
York, NY, USA, 2011. ACM. ISBN 978-1-4503-0813-7.
Leland Wilkinson, Anushka Anand, and Tuan Nhon Dang. Substantial improvements in the
set-covering projection classifier chirp (composite hypercubes on iterated random projections).
ACM Transactions on Knowledge Discovery from Data, 6(4):19:1–19:18, December 2012.
Sebastian Winter, Stefan Wagner, and Florian Deissenboeck. A comprehensive model of usability.
In Proceedings of Engineering Interactive Systems 2007. Springer-Verlag, 2007.
Ian H. Witten, Eibe Frank, Len Trigg, Mark Hall, Geoffrey Holmes, and Sally Jo Cun-
ningham. The Waikato Environment for Knowledge Analysis (WEKA), 2008. URL
http://www.cs.waikato.ac.nz/ml/weka.
Fei Xing and Ping Guo. Support vector regression for software reliability growth modeling and
prediction. In Advances in Neural Networks - ISNN 2005, Second International Symposium
on Neural Networks, Part 1, ISNN (1), volume 3496 of Lecture Notes in Computer Science.
Springer, 2005.
Fei Xing, Ping Guo, and Michael R. Lyu. A novel method for early software quality prediction
based on support vector machine. In Proceedings of The 16th IEEE International Symposium
on Software Reliability Engineering. IEEE, 2005.
Bo Yang, Lan Yao, and Hong-Zhong Huang. Early software quality prediction based on a fuzzy
neural network model. In Proceedings of Third International Conference on Natural Computa-
tion (ICNC 2007). IEEE Computer Society, 2007.
Xiaohong Yuan, Taghi M. Khoshgoftaar, Edward B. Allen, and K. Ganesan. An application of
fuzzy clustering to software quality prediction. In Proceedings of The 3rd IEEE Symposium on
Application-Specific Systems and Software Engineering Technology. IEEE, 2000.
Yuming Zhou and Hareton Leung. Empirical analysis of object-oriented design metrics for pre-
dicting high and low severity faults. IEEE Transactions on Software Engineering, 32(No. 10):
771–789, October 2006.
Appendices
Appendix A
PRATO: A PRACTICAL TOOL FOR SDP
A.1 Collecting and Combining Defect Prediction Models
Various techniques have been employed to predict software quality prior to the complete development
of the software, for example regression analysis (Khoshgoftaar and Munson, 1990), neural networks
(Khoshgoftaar et al., 1994), case-based reasoning (Ganesan et al., 2000), genetic algorithms (Bouktif
et al., 2004), and fuzzy neural networks (Yang et al., 2007). Based on these techniques, different
prediction models have been proposed in the literature. Studies have shown that, despite the availability
of such a large number of prediction models, their application in the software industry is very limited
(Bouktif et al., 2004). One reason for this limited adoption is considered to be the specificity of
each prediction model (Rana et al., 2008, Wagner and Deissenboeck, 2007): each model has been
developed for a specific context and a specific programming paradigm. Wagner and Deissenboeck
(2007) and Rana et al. (2008) have independently discussed the dimensions of this specificity.
Moreover, these studies have emphasized the need for a generic approach to predicting software
quality that helps quality managers focus on quality (Rana et al., 2008, Wagner and Deissenboeck,
2007). This Chapter presents PraTo, a tool that implements the generic approach proposed earlier
by Rana et al. (2008) and builds on existing models to predict software defects.
PraTo lets software managers specify a scenario by providing information about their data,
express their objectives, state their preferences based on those objectives, and select the prediction
model most suitable for the given preferences. Using PraTo, one can develop a number of
prediction models with different model parameters before selecting the best of them.
The rest of the Chapter is organized as follows: Section A.2 introduces PraTo and describes its
architecture and components. Section A.4 shows the application of PraTo with the help of an exam-
ple. Section A.5 discusses the benefits of using PraTo and mentions its limitations. Section A.6
concludes the Chapter and identifies future directions.
A.2 Tool Architecture and Working
PraTo provides quality managers with various prediction techniques for software defect
prediction. The prediction can be binary, i.e. Defect Prone (D) or Not Defect Prone (ND), as well
as numeric, i.e. Number of Defects. To analyze the models, the tool collects different metrics for
both types of predictions. In order to develop a prediction model for their software, managers
are required to provide their software data and select a model from a list of available models. In
case they are unable to decide which model to use, they have the provision to search for and use the
best model(s) for datasets similar to their software data. On the other hand, if the managers do
not have enough data to develop a certain model, they can provide domain-level information about
their software and find the nearest dataset. The performance of the available models on that dataset
can then be studied to decide which prediction model to choose. The goal of PraTo is to reduce the
time managers spend studying and comparing different models and to help them select the most
suitable model given their preferences.
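The nearest-dataset lookup described above can be illustrated with a small sketch. It assumes each dataset is characterized simply by the set of metric names it contains, which is a deliberate simplification of PraTo's matching (the tool also uses domain-level information and the UMD); the repository contents and the overlap score are illustrative:

```python
def metric_overlap(query, dataset_metrics):
    """Jaccard similarity between two sets of metric names."""
    q, d = set(query), set(dataset_metrics)
    return len(q & d) / len(q | d) if q | d else 0.0

# Hypothetical repository: dataset name -> metric names it provides.
repository = {
    "kc1": {"loc", "v(g)", "ev(g)", "iv(g)"},
    "kc1-classlevel-oo": {"cbo", "dit", "wmc", "rfc"},
}

# Metrics the manager can compute for their own software.
query = {"cbo", "dit", "noc"}

# Pick the dataset whose metric set overlaps most with the query.
nearest = max(repository, key=lambda name: metric_overlap(query, repository[name]))
print(nearest)  # kc1-classlevel-oo
```

The reported performance of the available models on the nearest dataset can then be studied before committing to a model, as described above.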
PraTo is a multiphase approach. A schematic overview of PraTo in Figure A.1 shows that it
takes a scenario as input from the user and performs the steps mentioned using the components
depicted outside the dotted box. A scenario is a set of constraints, objectives and contextual infor-
mation regarding the software. After specification of a scenario, the input passes through the input
selection phase, where datasets similar to the given scenario are identified with the help of a Uni-
fied Metrics Database (UMD). Before developing a model, the user inputs pass through the model
selection phase, which uses the Analytic Hierarchy Process (AHP) to select a model based on cer-
tain performance measures received from a Rule Based System (RBS). AHP studies the models’
performances in terms of the performance measures and performs the selection based on user pref-
erences. In the model development phase, the user input is preprocessed using Information Gain (IG)
or Principal Component Analysis (PCA) to drop the irrelevant attributes or select the most signifi-
cant dimensions. Later on, the developed model is validated using 10-fold cross validation before
reporting of the results in terms of ROC points (Egan, 1975).

Fig. A.1: The Generic Approach
The rest of the section describes the architecture of PraTo in two steps. First, the components
outside the dotted box are discussed, followed by a discussion of the phases inside the
dotted box of Figure A.1. In the end, the working of PraTo is described.
A.2.1 A Scenario
A scenario is specified in terms of business and data constraints, such as budget, time, resources,
and limited training data; an objective, such as finding the minimum number of modules that need
to be tested without shipping too many defects; and additional contextual information, such as which
software metrics of the conventional paradigm are available and whether the available data is
design- or implementation-phase data. PraTo requires the users to specify the severity level of each
constraint in the interval [0, 1]. The objective is expressed in terms of sensitivity towards shipping
defective modules and the importance of finding a high number of modules that need to be tested.
Additional contextual information can be provided either by adding a certain dataset or by
providing information regarding the software metrics in use.
Objectives and constraints are sent to an RBS to find out what values of model performance mea-
sures should be considered. Additional information is used to identify datasets similar to other
datasets of NASA projects (Menzies et al., 2012).

Tab. A.1: Datasets List

Serial Dataset Parameters Instances Variant ND Modules (%) SVP
ds1 kc1 21 2109 No 84.54% 2
ds2 kc1-classlevel 94 145 No 58.62% 8
ds3 kc1-classlevel-oo 8 145 Yes 58.62% 0
ds4 kc1-originally-classlevel 10 145 Yes 58.62% 1
ds5 kc2 21 522 No 79.5% 0
ds6 kc3 39 458 No 90.61% 0
ds7 pc1 21 1109 No 93.05% 0
ds8 jm1 21 1109 No 93.05% 0
ds9 cm1 21 1109 No 93.05% 0
ds10 kc1-classlevel-num 94 145 No 58.62% 8
ds11 kc1-classlevel-oo-num 8 145 Yes 58.62% 0
ds12 kc1-originally-classlevel-num 10 145 Yes 58.62% 1
SVP = Same Valued Parameters
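A scenario of this kind could be represented as a small record type. The field names, defaults, and validation below are illustrative assumptions rather than PraTo's actual data model; the severity check mirrors the [0, 1] interval required of each constraint:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """Illustrative container for a PraTo-style scenario."""
    constraints: dict = field(default_factory=dict)   # name -> severity in [0, 1]
    sensitivity_to_shipped_defects: float = 0.5       # objective component
    importance_of_finding_modules: float = 0.5        # objective component
    context: dict = field(default_factory=dict)       # e.g. paradigm, lifecycle phase

    def __post_init__(self):
        # Every constraint severity must lie in the interval [0, 1].
        for name, severity in self.constraints.items():
            if not 0.0 <= severity <= 1.0:
                raise ValueError(f"severity of {name!r} must lie in [0, 1]")

s = Scenario(
    constraints={"budget": 0.8, "time": 0.6, "training_data": 0.3},
    sensitivity_to_shipped_defects=0.9,
    importance_of_finding_modules=0.7,
    context={"paradigm": "conventional", "phase": "design"},
)
print(sorted(s.constraints))  # ['budget', 'time', 'training_data']
```

The constraints and objective fields would be passed to the RBS, while the context dict would drive the nearest-dataset lookup.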
A.2.2 Dataset Repository
The datasets used in this study describe defectiveness information regarding various software mod-
ules and have been taken from the PROMISE repository (Menzies et al., 2012). The number of
parameters and instances in each dataset is listed in Table A.1. The parameters are the indepen-
dent variables, and an additional parameter is used as output to indicate either the class of a software
module (i.e. D or ND) or the Number of Defects. Each instance in a dataset represents a software
class (or module), and the parameters are software metrics calculated for that module. Four of the
datasets listed in Table A.1 are variants, whereas the rest have been used without any modifications.
A dataset is termed a variant if its set of parameters has been modified. Here, ds3 and ds4 are
variants of ds2, and ds11 and ds12 are variants of ds10. Furthermore, the percentage of negative
instances in each dataset is also mentioned in the table to show any imbalance in a dataset. The
last column mentions the number of parameters that hold the same value for all instances. Such
parameters are not useful in prediction; rather, they can potentially have a negative effect on the
prediction.
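Screening for such same-valued parameters is straightforward. The sketch below, over column-oriented data with invented column names, flags the constant columns and drops them before model development:

```python
def same_valued_parameters(columns):
    """Return the names of parameters holding one value across all instances."""
    return [name for name, values in columns.items() if len(set(values)) <= 1]

# Toy dataset with one constant column.
data = {
    "loc":        [10, 25, 40],
    "complexity": [1, 3, 2],
    "language":   ["C", "C", "C"],   # same value for every instance
}

constant = same_valued_parameters(data)
print(constant)  # ['language']

# Constant parameters carry no information, so they are dropped.
filtered = {k: v for k, v in data.items() if k not in constant}
```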
The datasets ds2 and ds10 represent the same software modules. The only difference
is that ds2 has a binary classification as its output parameter whereas ds10 gives a numeric classification.
Two variants of ds2 (kc1-classlevel data) have been formed by dividing the parameters into three
groups. Group A has 8 parameters, Group B has 10 parameters, and Group C has 84 parameters,
where Group A ⊂ Group B and Group B ∩ Group C = ∅. The parameters in Group A are the most
commonly used parameters for defect prediction, as identified by Menzies et al. (2003). The ds3
(kc1-classlevel-oo) comprises the Group A parameters only and is marked as a variant in Table A.1.
The values of the parameters in Group B are originally measured at module level, and these parameters
are used to form ds4 (kc1-originally-classlevel). This dataset is also marked as a variant in the table.
The values of the parameters in Group C are originally measured at method level and were later
transformed to module level before the dataset was made available on the PROMISE website (Menzies
et al., 2012). The parameters used in ds2 are the union of the Group B and Group C parameters. The
datasets ds11 and ds12 have been modified using the same procedure to make them variants of
ds10.
A.2.3 Unified Metrics Database
Rana et al. have suggested a Unified Metrics Database (UMD) to remove inconsistencies in the naming of software product metrics, and we use the UMD in the Input Selection activity (Rana et al., 2011). The UMD plays a vital role in the selection of datasets with common metrics. Developing
Tab. A.2: List of Models in Repository

1. Least Square Regression
2. Robust Regression
3. Neural Networks
4. Radial Basis Networks
5. Fuzzy Inference System
6. Classification and Regression Trees
7. Logistic Regression
a consistent UMD requires dedicated attention. Once the UMD has been developed, it will be helpful for software managers who need to decide which datasets are similar to their problem domain or which metrics to use for their projects. The UMD can further be helpful for future studies on software product metrics and for the development of prediction models based on these metrics. It is worthwhile to mention here that the proposed UMD includes base as well as certain derived metrics.
A.2.4 Models Repository
With a large number of prediction techniques reported in the literature, it is necessary to make them available to quality managers. Because every technique has different aspects and capabilities, each has its utility in different situations. Gray et al. have compared a number of techniques in terms of their modeling capabilities (Gray and MacDonell, 1997). The Models Repository in PraTo contains the models compared by Gray et al.; this list of models is presented in Table A.2.
Fig. A.2: Input Selection (the Datasets Repository R and the user dataset D pass through Problem Domain based Similarity, Common Measures based Similarity using the UMD and measures' labels, and Data Values based Similarity, producing Rd, Rc, and Rv)
Fig. A.3: Model Selection
Fig. A.4: Main Screen of PraTo
A.2.5 Input Selection
Input Selection is a three-step activity that outputs datasets similar to the dataset provided by the user. The three steps of this activity are shown in Figure A.2. In the first step, the datasets Rd having problem-domain-level similarity with the user's dataset D are selected using meta information about each dataset. In the second step, the datasets Rc that use common metrics are identified with the help of the UMD. In the third step, these datasets are compared with D, and the datasets Rv having data-value-level similarity are selected. After each of these steps the number of datasets keeps decreasing, such that:

Rv ⊆ Rc ⊆ Rd ⊆ R (A.1)

where R is the set of all datasets in the Repository, Rd contains the datasets with domain-based similarity to D, Rc the datasets with common metrics with D, and Rv the datasets with data-value-level similarity to D.
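The successive filtering of Equation A.1 can be sketched as follows. The dataset records and similarity predicates below are illustrative assumptions for the example, not PraTo's actual implementation:

```python
# Sketch of the three-step input selection (Eq. A.1): each step filters
# the previous result, so Rv <= Rc <= Rd <= R holds by construction.
# The similarity predicates are illustrative placeholders.

def select_similar_datasets(R, D, same_domain, shares_metrics, values_similar):
    Rd = [ds for ds in R if same_domain(ds, D)]        # domain-level similarity
    Rc = [ds for ds in Rd if shares_metrics(ds, D)]    # common metrics via UMD
    Rv = [ds for ds in Rc if values_similar(ds, D)]    # data-value similarity
    return Rd, Rc, Rv

# Toy usage: datasets are dicts with hypothetical metadata fields.
R = [
    {"name": "ds1", "domain": "flight", "metrics": {"LOC", "CBO"}, "mean_loc": 120},
    {"name": "ds2", "domain": "flight", "metrics": {"LOC", "DIT"}, "mean_loc": 50},
    {"name": "ds3", "domain": "ground", "metrics": {"LOC", "CBO"}, "mean_loc": 60},
]
D = {"domain": "flight", "metrics": {"LOC", "CBO"}, "mean_loc": 100}

Rd, Rc, Rv = select_similar_datasets(
    R, D,
    same_domain=lambda ds, d: ds["domain"] == d["domain"],
    shares_metrics=lambda ds, d: ds["metrics"] >= d["metrics"],
    values_similar=lambda ds, d: abs(ds["mean_loc"] - d["mean_loc"]) < 50,
)
```

Because each step filters the previous step's output, the subset chain of Equation A.1 always holds, whatever predicates are plugged in.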
A.2.6 Context Specification and Model Selection
With more than one parameter involved, the selection of a model poses a problem to quality managers and results in subjective selection of a prediction model. In order to reduce this subjectivity, AHP-based model selection is performed. The activity receives the performance measures needed for the selection decision and the user's preferences of one parameter over another, as shown in Figure A.3.
Dataset Similarity
The user's dataset is compared with the public datasets; the closest public dataset defines the user scenario. The model with the best performance, in terms of recall, on that public dataset is selected.
Providing Set of Constraints
The user provides a set of constraints in terms of HR, Budget, and Time. Based on the severity of the constraints, target values of the performance measures are determined. The model with the best values of those performance measures is selected.
AHP Based Ranking
The user selects three model performance measures and provides the relative importance of the measures. An AHP-based technique is then applied to select a model.
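As an illustration, a minimal AHP-style ranking over three measures might look like the following. The pairwise judgments and model scores are invented for the example; PraTo's actual implementation may differ:

```python
# Illustrative AHP-based model ranking over three performance measures.
# Weights come from the pairwise-comparison matrix via row geometric
# means (a standard approximation of the principal eigenvector); each
# model's score is the weighted sum of its measure values.
import math

def ahp_weights(pairwise):
    n = len(pairwise)
    gm = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(gm)
    return [g / total for g in gm]

def rank_models(models, weights):
    # models: {name: (recall, accuracy, 1 - FP rate)}, all "higher is better"
    scores = {m: sum(w * v for w, v in zip(weights, vals)) for m, vals in models.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Say recall is judged 3x as important as accuracy and 5x as important
# as the FP-rate measure (hypothetical user preferences).
pairwise = [
    [1, 3, 5],
    [1 / 3, 1, 2],
    [1 / 5, 1 / 2, 1],
]
w = ahp_weights(pairwise)
ranking = rank_models({"CART": (0.70, 0.85, 0.90), "FIS": (0.80, 0.80, 0.85)}, w)
```

The row-geometric-mean weighting is the usual lightweight stand-in for AHP's principal-eigenvector computation and gives identical weights for consistent judgment matrices.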
A.2.7 Model Development
Prediction models sometimes use all software metrics for model development, but sometimes preprocessing is performed to resolve data compatibility and correlation issues and to reduce dimensionality. For example, Principal Component Analysis (PCA), Factor Analysis (FA), and Information Gain (IG) are techniques that can be used to reduce the size of the input space. PCA transforms the larger input space into a smaller one such that all variables of the larger input space are represented in the reduced space. Retaining the influence of all variables is not always good practice: experiments have shown that it sometimes deteriorates the performance of prediction models (Rana et al., 2009b). Furthermore, there are scenarios where the data is not suitable for direct PCA, and additional preprocessing is needed first. For these scenarios we suggest the use of IG to drop irrelevant attributes. The tool presented in this chapter handles both types of preprocessing.
A.2.8 Output Validation
After a model has been developed on the training data, it is validated using 10-fold cross validation. The complete data is split into 10 equal groups; in each of the 10 iterations, one group is held out as the test set and the remaining nine groups are used for training. After these 10 iterations, the average results of the iterations are reported.
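A minimal sketch of standard 10-fold cross validation, in which every instance is tested exactly once, is given below (illustrative only, not the tool's code):

```python
# Minimal 10-fold cross-validation index generator: each fold serves
# once as the test set while the remaining nine folds train the model.
def k_fold_indices(n, k=10):
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

n = 50
coverage = []
for train, test in k_fold_indices(n, k=10):
    assert set(train).isdisjoint(test)   # no instance trains and tests in one fold
    coverage.extend(test)
# across the 10 folds, every instance appears in exactly one test set
```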
A.3 Salient Features of PraTo
The user can specify a scenario and select model performance measures using the Rule Based System shown in Figure A.3. The scenario specification module is meant to guide the user as to which performance measures should be used for AHP-based model selection. Currently this module works for three performance measures only, i.e., False Positive (FP), True Positive (TP), and Accuracy (Acc). A scenario is specified using the severity levels of three parameters: HR, Budget, and Time. Based on the severity level of each parameter, a rule-based system outputs the desired values of the above three performance measures. Once scenario specification is fully developed, the manual model-selection part may be bypassed: based on the specified scenario, the system will identify the performance measures and perform the AHP-based selection.
The user can add a new dataset to the repository using the Add Dataset button. In the first step, the user is asked to provide domain information about the new dataset (e.g., how many requirement-phase measures are used in the dataset, how many of the measures are product measures, etc.). This information is used to identify the datasets similar to the new one. Project types will also be listed to improve the domain-level similarity. To find similarity on the basis of the measures used, further information about the measures used in the new dataset is also collected, but this part is not
Fig. A.5: A Scenario. Constraints: Business Domain severity (0 to 1): Budget = 0.4, HR = 0.9, Time = 0.7; Data: less training data, skewed data. Objective: find the minimum modules that need to be tested while not shipping too many defects. Additional contextual information: design- and implementation-phase data available; conventional-paradigm measures; more information will be collected after performance-measures mapping.
fully functional at the moment. These measures can be selected using the list of metrics (coming from the Unified Metrics Database) shown in the lower-left part of the Add Dataset screen.
The user can select a certain dataset or a certain model to be run. A checkbox 'Run all models' allows the user to run all the models on the selected dataset, and another checkbox 'Run on all datasets' allows the user to run a selected model on all the datasets and see its performance. The screen also provides options to preprocess data before running a prediction model. Two types of preprocessing are available so far. Principal Component Analysis (PCA): with this preprocessing the prediction model is developed on the basis of the reduced dimensions only. Information Gain (IG): a text box adjacent to the IG checkbox specifies a threshold, and attributes of the dataset whose IG value is lower than this threshold are dropped before developing the model. During the IG-based preprocessing, binning of the attributes needs to be performed; therefore three binning options are provided. The user can select from Equal Distance (EQ Dist) bins, Equal Frequency (EQ Freq) bins, and a combination of both (Hybrid).
Before model development the user has the option to specify the data split; by default the split is 2/3 training data and 1/3 test data. In order to select random training examples from each dataset before model development, the 'Randomize' checkbox should be checked. If it is left unchecked, the first 2/3 of the examples are taken as training data and the rest as test data. This checkbox needs to be checked every time a new dataset is selected. Plot Input and Residual Errors: the user has options to produce various plots of the input parameters, including Capa and PMF plots. The user
Fig. A.6: Specifying a Scenario
can plot all the input parameters separately or in one figure, and also has the option to plot residual errors through the 'Plot residual Errors' checkbox. The 'Software Error Tolerance' text box is used when the underlying dataset predicts the number of defects instead of the defectiveness of each module; in such a case, only those modules are considered defect prone which have more errors than the Software Error Tolerance threshold. The default value of this threshold is 5. The Beta value is used to calculate the F Measure, which assesses a model's performance in case of skewed datasets; Beta = 1 means that equal importance is given to precision and recall when calculating the F value. Finally, k = 10 is used for 10-fold cross validation, and the user can change this value of k.
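The F Measure with the user-specified Beta can be computed with a small helper; this is the standard F-beta formula, shown here purely as an illustration:

```python
# F-measure with the user-specified Beta: Beta = 1 weighs precision and
# recall equally; Beta > 1 favours recall, Beta < 1 favours precision.
def f_measure(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_measure(0.5, 0.5)          # balanced case
f2 = f_measure(0.5, 0.8, beta=2)  # leans toward the higher recall
```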
The lower pane of the main screen allows the user to perform Analytic Hierarchy Process (AHP) based model selection using any three performance parameters listed in the rightmost list item 'Performance Measures'. The user provides the indices of the desired performance measures and the relative importance of each of them.
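The default data split described above (2/3 training and 1/3 test, with optional randomization via the 'Randomize' checkbox) can be sketched as follows; this is an illustrative sketch, not the tool's implementation:

```python
# Sketch of the default data split: 2/3 training, 1/3 test. With
# randomize=True the examples are shuffled first; otherwise the first
# 2/3 are taken in their original order, as in the tool's default.
import random

def split_data(examples, train_fraction=2 / 3, randomize=False, seed=None):
    examples = list(examples)
    if randomize:
        random.Random(seed).shuffle(examples)
    cut = round(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]

train, test = split_data(range(30), randomize=True, seed=1)
```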
A.4 Application of PraTo
PraTo allows users to specify multiple scenarios and get prediction information regarding their software. This section describes one scenario as an example and demonstrates the use of PraTo for that scenario.
A.4.1 Scenario Specification:
Consider a scenario where the user has a very severe human resource problem but at the same time does not want to ship too many defects to the client. Such a scenario is expressed in Figure A.5, and the PraTo user interface (UI) to express it is shown in Figure A.6. HR, schedule, and budget issues are common problems faced by the majority of managers; the top three lists require the user to specify the severity level of these three issues. If any of these issues is not present, the user can assign it zero weight in the respective list. The lower list requires the user to state the tolerance for shipping defects: if shipment of a few defects can be tolerated because of a tight schedule, a higher value of the tolerance is selected. The radio buttons allow the user to select an objective. Once the user specifies the scenario, PraTo uses an RBS to identify which labels and performance measures should be used to perform model selection. In this case the three parameters are TP, FP, and Accuracy, and their suggested values are Low, Low, and Medium respectively.
A.4.2 Input Selection and Preprocessing:
The user might also like to find out which of the existing datasets have characteristics similar to his software. This information can help in pruning the prediction models that have the potential to work well on the user's dataset. In order to let the user gather this information, PraTo provides an Add Dataset feature. The Add Dataset screen requires the user to provide metadata about his software and select the software metrics being collected, as shown in Figure A.7. The unified labels of the software metrics are displayed in the bottom-left list, and the user can select one metric at a time to indicate its usage in the user's data. The figure shows that, for the given metadata, ds12 is the nearest dataset.
Algorithm 2 IG-based algorithm to select relevant metrics for defect prediction models

Require: AllParameters, Target, α
Ensure: RelevantParameters is returned s.t. RelevantParameters ⊆ AllParameters

  Entropy ← calculateEntropy(Target)
  RelevantParameters ← AllParameters
  paramCount ← countParams(AllParameters)
  for i = 1 to paramCount do
    IG[i] ← calculateInformationGain(AllParameters[i], Entropy)
    if IG[i] < α then
      drop this parameter from RelevantParameters
    end if
  end for
  return RelevantParameters
Removing Irrelevant Attributes
We suggest an algorithm to drop irrelevant attributes on the basis of IG and use the relevant attributes to detect defect-prone (D) and not defect-prone (ND) modules. We study the impact of this approach on two defect prediction models, i.e., a Classification Trees (CT) model and a Fuzzy Inferencing System (FIS) based model (Rana et al., 2009a). It is interesting to note that the performance of both models improved after the IG-based dimension reduction. We compare the results of using PCA and IG for both models and observe that using IG as a replacement for PCA produces better prediction results in terms of misclassification rate and recall.

The proposed approach, presented in Algorithm 2 (Rana et al., 2014), drops irrelevant attributes from a dataset. Irrelevant attributes are those which do not individually contribute significantly to prediction.
As seen in Algorithm 2, three input values are required for this approach; the two non-scalar values are underlined. The first of the two non-scalar inputs, AllParameters, is an n × m matrix with n software modules and m software metrics used to detect D and ND modules. The second input, Target, is an n × 1 matrix representing the defectiveness of the n software modules. The scalar α, with 0 ≤ α ≤ 1, represents a threshold to decide whether a certain attribute needs to be dropped or not. The output, RelevantParameters, is an n × k matrix with k ≤ m and RelevantParameters ⊆ AllParameters.
Calculating Entropy and IG: The entropy of a set S is defined as follows (Mitchell, 1997):

Entropy(S) = Σ_{i=1..k} (−p_i log2 p_i)   (A.2)

where k is the total number of classes and p_i is the proportion of examples that belong to class i.

The information gain of an attribute A in a set S is defined as follows (Mitchell, 1997):

IG(S, A) = Entropy(S) − Σ_{i=1..v} (|S_i| / |S|) Entropy(S_i)   (A.3)

where v is the total number of distinct values in attribute A (i.e., the domain of A) and S_i is the set of examples that hold a certain value from the domain of A.
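A runnable sketch of Algorithm 2 using Equations A.2 and A.3 is shown below. The data values are invented for the example, and the attributes are assumed to be already binned:

```python
# Executable sketch of Algorithm 2 with Eqs. A.2-A.3: compute the IG of
# each (binned) attribute against the D/ND target and drop attributes
# whose IG falls below the threshold alpha.
import math
from collections import Counter

def entropy(labels):                                   # Eq. A.2
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(column, target):                         # Eq. A.3
    n = len(target)
    subsets = {}
    for value, label in zip(column, target):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(target) - remainder

def relevant_parameters(data, target, alpha):          # Algorithm 2
    # data: {metric_name: list of binned values}; keep metrics with IG >= alpha
    return {m: col for m, col in data.items() if info_gain(col, target) >= alpha}

data = {
    "useful": ["lo", "lo", "hi", "hi"],   # perfectly predicts the target
    "noise":  ["a", "b", "a", "b"],       # carries no information about it
}
target = ["ND", "ND", "D", "D"]
kept = relevant_parameters(data, target, alpha=0.01)
```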
In order to gauge the performance of each approach, the percentage change (increase/decrease) in Recall, Acc, and MC has been calculated using the following expression:

%age change = ((DR Reading − NoDR Reading) / NoDR Reading) × 100   (A.4)

where NoDR Reading is the value of the evaluation parameter when the model is developed without any dimension reduction and DR Reading is its value when the model is developed after dimension reduction.
Experiments were conducted with a 67%/33% split of training and test data for each dataset. A number of experiments were conducted on randomized data, and the best results of each model in terms of misclassification rate are compared and reported here.

We have used seven datasets available in the PROMISE repository (Menzies et al., 2012) and have kept α = 0.01. For most of the datasets, the performance shortfall incurred with IG has been smaller than the performance shortfall observed with PCA. The average values of the percentage changes in the three evaluation parameters indicate that, on average, IG has been the better choice for these datasets.
We have compared the two dimension reduction approaches dataset-wise and presented the results in Table A.3. IG has emerged as the better approach for the majority of these datasets. For ds1, use of IG has improved the performance of both models, and therefore IG is listed as the overall winner. It should be mentioned here that the largest dataset, ds1, is dominated by 84.5% ND modules; therefore none of the models could perform very well in identifying the D modules. In the case of ds2 (145 instances, 94 attributes), there are many parameters that are not good predictors of D modules (Rana et al., 2009b), but PCA still keeps their contribution in the smaller input space, hence deteriorating the models' performances. IG has dropped the irrelevant attributes, has resulted in better Recall for both models, and is reported as the winner. IG has outperformed PCA in the cases of ds3 and ds4 as well. Both datasets are variants of ds2 and are based on selected, commonly used software metrics. Use of either dimension reduction technique degrades the models' performances most of the time, but IG is reported as the winner because of its lower performance shortfall. In the case of ds5, the third largest dataset with 21 attributes, the IG-based approach failed to drop any irrelevant attributes. On the other hand, PCA has handled the correlation among the
Tab. A.3: Winners in Terms of Recall and Acc

Dataset   Recall (CT)   Recall (FIS)   Accuracy (CT)   Accuracy (FIS)   Overall
ds1       IG            IG             IG              IG               IG
ds2       IG            IG             IG              PCA              IG
ds3       IG            IG             PCA             IG               IG
ds4       IG            IG             IG              IG               IG
ds5       PCA           No             PCA             No               PCA
ds6       No            PCA            IG              IG               IG
ds7       IG            PCA            IG              IG               IG
attributes and is reported as the winner. The datasets ds6 and ds7 are dominated by 90% and 93% ND modules respectively; the performances of PCA and IG are very close for both these datasets.
A.5 Analysis and Discussion
Currently the proposed model handles the structured and object-oriented development paradigms only, but it is extensible to other paradigms. It can also be extended to as many quality factors as there are models available for.

Fig. A.7: Adding New Dataset

Fig. A.8: AHP based Model Selection

The model is usable at any stage of the software life cycle provided the models repository contains models applicable in that phase. Our model caters for the specificity of a component predictor by taking three control inputs (Quality Factor, Software Development Life Cycle Phase, and Software Development Paradigm), which contribute towards the selection of a predictor. We use product-based existing models, unlike Bouktif et al. (Bouktif et al., 2004), who use classification-based models.
A.6 Summary
Defect-prone modules in software are detected using defect prediction models developed from software defect data, and these models sometimes need to reduce the dimensions of the input data. Usually principal component analysis (PCA) is used by defect prediction models for this purpose. PCA reduces the dimensions of the input data while keeping the representation of all the input attributes intact; in some cases the representation of all the input attributes can negatively affect the prediction. Therefore, this chapter suggests an Information Gain (IG) based approach to drop irrelevant attributes. The approach calculates the information gain (IG) of each input attribute and drops the variables whose IG values fall below a threshold α. The proposed technique has been used to develop classification tree (CT) and fuzzy inferencing system (FIS) based models for 7 datasets. The proposed approach resulted in a smaller performance shortfall than PCA in the case of smaller datasets with a large number of attributes, and the IG-based approach also showed better results in the case of large imbalanced datasets.

The results presented here cannot be generalized. We plan to verify them by using more datasets and conducting more experiments with different values of α and fraction of variance. We also plan to study the characteristics of the parameters dropped by the suggested approach to find the exact reason for the better performance of the IG-based approach.
In this chapter we have presented an approach that benefits from existing techniques and combines them to predict different quality factors.
Appendix B
LIST OF UNIFIED AND CATEGORIZED SOFTWARE PRODUCT METRICS
Tab. B.1: Frequently Used Software Measures, Their Use and Applicability

(Each entry gives the preserved label, the short definition, the alternate labels together with the studies that use them, the development paradigm, and the availability.)

B: The Halstead error estimate. Alternate labels: B (Khosgoftaar and Munson, 1990; Ottenstein, 1979; Ottenstein, 1981); HALSTEAD ERROR EST, B (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Imp+.

BW: Bandwidth of the program (average nesting level of the control flow graph of the software). Alternate labels: BW (Gokhale and Lyu, 1997; Guo and Lyu, 2000; Munson and Khosgoftaar, 1992; Xing et al., 2005); B (Jensen and Vairavan, 1985); Band (Jensen and Vairavan, 1985); NDAV (Briand et al., 1993); Bandwidth (Dick and Kandel, 2003). Paradigm: Conventional. Availability: Design+.

C: Count of branches in the program (sum of decisions and subroutine calls). Alternate labels: C (Khosgoftaar and Munson, 1990; Ottenstein, 1979); BRANCH COUNT (Jiang et al., 2008c). Paradigm: Conventional. Availability: Design+.

CBO: No. of other classes to which this class is coupled. Alternate labels: CBO (Gyimothy et al., 2005; Olague et al., 2007; Quah and Thwin, 2003; Rana et al., 2008; Thwin and Quah, 2002; Zhou and Leung, 2006); IOC (Pizzi et al., 2002); COUPLING BETWEEN OBJECTS (Koru and Liu, 2005a); ClassCoupling (Nagappan et al., 2006); CK-CBO (Nagappan et al., 2006). Paradigm: OO. Availability: Design+.

CL: Total code lines (all lines of code excluding comments). Alternate labels: CL (Gokhale and Lyu, 1997; Munson and Khosgoftaar, 1992; Xing et al., 2005); Total LOC (Nagappan and Ball, 2005b); NCNB (Brun and Ernst, 2004); SLOC (Zhou and Leung, 2006). Paradigm: OO, Conventional. Availability: Imp+.

Com: Number of classes/components in a module. Alternate labels: Com (Ohlsson and Alberg, 1996); Components (Ohlsson et al., 1998); Classes (Li et al., 2006; Nagappan et al., 2006). Paradigm: OO. Availability: Design+.

D: Halstead's difficulty. Alternate labels: D (Jiang et al., 2008c; Li et al., 2006; Shen et al., 1985). Paradigm: OO, Conventional. Availability: Imp+.

DChar: Number of code characters. Alternate labels: DChar (Munson and Khosgoftaar, 1992; Xing et al., 2005); Co (Gokhale and Lyu, 1997; Guo and Lyu, 2000); Code characters (Dick and Kandel, 2003). Paradigm: OO, Conventional. Availability: Imp+.

Dec: No. of branching points. Alternate labels: Dec (Khoshgoftaar and Allen, 1999b; Ohlsson and Alberg, 1996); DECISION COUNT (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Design+.

DES CPX: The design complexity of a module. Alternate labels: DES CPX (Khoshgoftaar and Allen, 1999c; Khoshgoftaar et al., 1997b); DESIGN COMPLEXITY, iv(G) (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Design+.

DIT: Depth of inheritance tree, the length of the longest path from the class to the root. Alternate labels: DIT (Gyimothy et al., 2005; Quah and Thwin, 2003; Rana et al., 2008; Thwin and Quah, 2002; Yang et al., 2007; Zhou and Leung, 2006); Dep (Ohlsson and Alberg, 1996); Depth (Ohlsson et al., 1998); DI (Pizzi et al., 2002); DEPTH (Koru and Liu, 2005a); InheritanceDepth (Nagappan et al., 2006); Depth of inheritance tree (Li et al., 2006); CK-DIT (Olague et al., 2007). Paradigm: OO. Availability: Design+.

E: Halstead's Software Science Effort. Alternate labels: E (Jensen and Vairavan, 1985; Jiang et al., 2008c; Khosgoftaar et al., 1994; Khosgoftaar and Munson, 1990; Ottenstein, 1979; Schneider, 1981); HALSTEAD EFFORT (Koru and Liu, 2005a); Halstead's Effort (Li et al., 2006). Paradigm: OO, Conventional. Availability: Imp+.

ESS CPX: The essential complexity of a module. Alternate labels: ESS CPX (Khoshgoftaar and Allen, 1999c; Khoshgoftaar et al., 1997b); ESSENTIAL COMPLEXITY, ev(G) (Jiang et al., 2008c); Essential complexity (Li et al., 2006). Paradigm: OO, Conventional. Availability: Design+.

EX: Total executable statements (all lines of code excluding comments, declarations, and blanks). Alternate labels: EX (Ottenstein, 1979); EXE (Khosgoftaar and Munson, 1990); ELOC (Khosgoftaar et al., 1994; Khoshgoftaar and Allen, 1999c); STMEXE (Khoshgoftaar and Allen, 1999b; Khoshgoftaar and Seliya, 2003); CL (Guo and Lyu, 2000); Executable lines (Dick and Kandel, 2003); Size1 (Quah and Thwin, 2003); Lines (Nagappan et al., 2006); LOC EXECUTABLE (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Imp+.

FANin: No. of objects in the incoming function calls. Alternate labels: FANin (Ohlsson and Alberg, 1996); FAN-in (Ohlsson et al., 1998); InFlow (Li et al., 2006). Paradigm: OO, Conventional. Availability: Design+.

FANout: No. of objects in the calls made by a certain function. Alternate labels: FANout (Ohlsson and Alberg, 1996); FAN-out (Ohlsson et al., 1998); OutFlow (Li et al., 2006). Paradigm: OO, Conventional. Availability: Design+.

IFTH: Number of decisions. Alternate labels: IFTH (Khoshgoftaar and Allen, 1999b; Khoshgoftaar et al., 1996); Con (Ohlsson and Alberg, 1996); Conditions (Ohlsson et al., 1998); DEC (Pizzi et al., 2002); D (Khosgoftaar and Munson, 1990); CONDITION COUNT (Jiang et al., 2008c). Paradigm: Conventional. Availability: Design+.

LCOM: Lack of cohesion in methods. Alternate labels: LCOM (Bouktif et al., 2004; Bouktif et al., 2006; Gyimothy et al., 2005; Olague et al., 2007; Pizzi et al., 2002; Quah and Thwin, 2003; Rana et al., 2008; Zhou and Leung, 2006); LACK OF COHESION OF METHODS (Koru and Liu, 2005a). Paradigm: OO. Availability: Design+.

LOC: Total lines of code (including comments). Alternate labels: LOC (Brun and Ernst, 2004; Dick and Kandel, 2003; Gyimothy et al., 2005; Khosgoftaar et al., 1994; Khosgoftaar and Munson, 1990; Khoshgoftaar and Allen, 1999c; Khoshgoftaar and Allen, 1999b; Khoshgoftaar and Seliya, 2003; Munson and Khosgoftaar, 1992); SLOC (Briand et al., 1993); TC (Gokhale and Lyu, 1997; Guo and Lyu, 2000); Lines of source code (Dick and Kandel, 2003); Lines of code (Li et al., 2006); LOC TOTAL (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Imp+.

Loo: Number of loop constructs. Alternate labels: Loo (Ohlsson and Alberg, 1996); Loops (Ohlsson et al., 1998); Lop (Khoshgoftaar and Allen, 1999b); NL (Khoshgoftaar and Seliya, 2002). Paradigm: OO, Conventional. Availability: Design+.

LP: No. of arcs containing the predicate of a loop construct. Alternate labels: LP (Khoshgoftaar and Allen, 1999b; Khoshgoftaar et al., 1996); N STRUCT (Khoshgoftaar and Allen, 1999c; Khoshgoftaar et al., 1997b). Paradigm: Conventional. Availability: Design+.

MacPat: No. of paths in the subroutine. Alternate labels: MacPat (Khoshgoftaar and Allen, 1999b; Ohlsson and Alberg, 1996); N PATHS (Khoshgoftaar et al., 1997b); PATH (Ohlsson et al., 1998). Paradigm: OO, Conventional. Availability: Design+.

MChar: Number of comment characters. Alternate labels: MChar (Munson and Khosgoftaar, 1992; Xing et al., 2005); CC (Gokhale and Lyu, 1997; Guo and Lyu, 2000); Comment characters (Dick and Kandel, 2003). Paradigm: OO, Conventional. Availability: Imp+.

N: No. of operators and operands. Alternate labels: N (Dick and Kandel, 2003; Gokhale and Lyu, 1997; Jensen and Vairavan, 1985; Munson and Khosgoftaar, 1992; Ottenstein, 1979; Ottenstein, 1981; Schneider, 1981; Xing et al., 2005); N' (Guo and Lyu, 2000); Halstead Length (Li et al., 2006); HALSTEAD LENGTH, N (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Imp+.

N: Halstead's estimated program length. Alternate labels: N (Khosgoftaar et al., 1994; Munson and Khosgoftaar, 1992; Xing et al., 2005); NH (Jensen and Vairavan, 1985); Ne (Gokhale and Lyu, 1997; Guo and Lyu, 2000); Nh (Dick and Kandel, 2003). Paradigm: OO, Conventional. Availability: Imp+.

NF: Jensen's program length estimate. Alternate labels: NF (Dick and Kandel, 2003; Jensen and Vairavan, 1985; Munson and Khosgoftaar, 1992; Xing et al., 2005); JE (Gokhale and Lyu, 1997; Guo and Lyu, 2000); Nf (Dick and Kandel, 2003). Paradigm: OO, Conventional. Availability: Imp+.

N1: Total operator count. Alternate labels: N1 (Khosgoftaar et al., 1994; Khoshgoftaar and Allen, 1999c); TOT OPTR (Khoshgoftaar and Allen, 1999c; Khoshgoftaar et al., 1997b); N1 (Dick and Kandel, 2003); Total operators (Li et al., 2006); NUM OPERATORS, N1 (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Imp+.

N2: Total operand count. Alternate labels: N2 (Khosgoftaar et al., 1994; Khoshgoftaar and Allen, 1999c); N2 (Dick and Kandel, 2003); Total operands (Li et al., 2006); NUM OPERANDS, N2 (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Imp+.

NC: Maximum nesting level. Alternate labels: NC (Khosgoftaar and Munson, 1990); NDMAX (Briand et al., 1993). Paradigm: Conventional. Availability: Imp+.
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
CTRNSTMX (Khoshgoftaar
and Allen,
1999b),
(Khosh-
goftaar
and Seliya,
2003)
MAX LVLS (Khoshgoftaar
and Allen,
1999c),
(Khoshgof-
taar et al.,
1997b)
Max Nesting (Li et al.,
2006)
Number of edges
found in a given
module control
from
N EDGES N EDGES (Khoshgoftaar
and Allen,
1999c),
(Khoshgof-
taar et al.,
1997b)
Conventional Design+
Continued on next page
222
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
one module to an-
other
Arcs (Nagappan
et al.,
2006),
(Ohlsson
et al., 1998)
EDGE COUNT, e (Jiang
et al.,
2008c)
No. of func-
tions/procedures
in a module
N IN N IN (Khoshgoftaar
and Allen,
1999c),
(Khoshgof-
taar et al.,
1997b)
OO, Con-
ventional
Design+
(No. of entry
nodes)
NDSENT (Khoshgoftaar
and Allen,
1999b)
Meth (Pizzi et al.,
2002)
NOM (Bouktif
et al., 2004)
Continued on next page
223
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Function (Li et al.,
2006),
(Nagappan
et al., 2006)
Continued on next page
224
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
No. of children is
the no. of imme-
diate descendants
NOC NOC (Bouktif
et al.,
2004),
(Bouk-
tif et al.,
2006),
(Gyimothy
et al.,
2005),
(Olague
et al.,
2007),
(Quah and
Thwin,
2003),
(Rana
et al.,
2008),
(Thwin
and Quah,
2002),
(Zhou and
Leung,
2006)
OO Design+
Continued on next page
225
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Kid (Pizzi et al.,
2002)
NUM OF CHILDREN(Koru and
Liu, 2005a)
SubClasses (Nagappan
et al., 2006)
CK-NOC (Olague
et al., 2007)
No. of public at-
tributes
NPA NPA (Bouktif
et al.,
2004),
(Bouktif
et al., 2006)
OO, Con-
ventional
Imp+
Global Set (Li et al.,
2006)
No. of possible
paths from
Pat Pat (Ohlsson
and Alberg,
1996)
OO, Design+
input signal to the
output signal
N PATHS (Khoshgoftaar
and Allen,
1999c),
(Khoshgof-
taar et al.,
1997b)
Conventional
Continued on next page
226
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Number of proce-
dures /
PRC PRC (Brun and
Ernst,
2004),
(Khosgof-
taar and
Munson,
1990)
OO, Design+
methods PROC (Khosgoftaar
et al., 1994)
Conventional
Mac (Ohlsson
and Alberg,
1996)
NOM (Bouktif
et al.,
2006),
(Quah and
Thwin,
2003)
QMOOD NOM (Olague
et al., 2007)
Continued on next page
227
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
No. of methods
that can be po-
tentially executed
in response to a
message received
RFC RFC (Gyimothy
et al.,
2005),
(Quah and
Thwin,
2003),
(Rana
et al.,
2008),
(Thwin
and Quah,
2002),
(Zhou and
Leung,
2006)
OO Design+
by an object RFO (Pizzi et al.,
2002)
RESPONSE FOR
CLASS
(Koru and
Liu, 2005a)
CK-RFC (Olague
et al., 2007)
Continued on next page
228
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Number of pa-
rameters in the
routine/module
RSDPN RSDPN (Nagappan
et al.,
2006),
(Wang
et al., 2004)
OO, Con-
ventional
Imp+
PARAMETER COUNT(Jiang
et al.,
2008c)
Total number of
statements
S S (Ottenstein,
1979),
(Schneider,
1981)
Conventional Imp+
Number of State-
ments
(Ottenstein,
1981)
PS (Khosgoftaar
and Mun-
son, 1990)
N STMTS (Khoshgoftaar
and Allen,
1999c),
(Khoshgof-
taar et al.,
1997b)
Continued on next page
229
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Number of calls
of a subroutine
(Number of calls
to
TC TC (Khoshgoftaar
and Allen,
1999b),
(Khoshgof-
taar et al.,
1996)
Conventional Design+
other functions in
a module)
Coh (Ohlsson
and Alberg,
1996)
Cohesion (Ohlsson
et al., 1998)
TCT (Khoshgoftaar
and Seliya,
2002)
FanOut (Li et al.,
2006),
(Nagappan
et al., 2006)
CALL PAIRS (Jiang
et al.,
2008c)
Continued on next page
230
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Total character
count
TChar TChar (Munson
and Khos-
goftaar,
1992),
(Xing et al.,
2005)
OO, Con-
ventional
Imp+
Cr (Gokhale
and Lyu,
1997),
(Guo and
Lyu, 2000)
Total characters (Dick and
Kandel,
2003)
Total comments TComm TComm (Munson
and Khos-
goftaar,
1992),
(Xing et al.,
2005)
OO, Con-
ventional
Imp+
COM (Khosgoftaar
et al., 1994)
Continued on next page
231
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Cm (Gokhale
and Lyu,
1997),
(Guo and
Lyu, 2000)
N COM (Khoshgoftaar
and Allen,
1999c),
(Khoshgof-
taar et al.,
1997b)
CMT (Dick and
Kandel,
2003)
LOC COMMENTS (Jiang
et al.,
2008c)
Unique calls to
other modules
UC UC (Khoshgoftaar
and Allen,
1999b),
(Khoshgof-
taar et al.,
1996)
OO, Con-
ventional
Design+
Continued on next page
232
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
UCT (Khoshgoftaar
and Seliya,
2002)
Program volume V V (Briand
et al.,
1993),
(Jensen and
Vairavan,
1985),
(Khosgof-
taar et al.,
1994),
(Khosgof-
taar and
Munson,
1990),
(Ottenstein,
1979),
(Ottenstein,
1981)
OO, Con-
ventional
Design+
Halstead’s Volume (Li et al.,
2006)
Continued on next page
233
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
HALSTEAD VOLUME,
V
(Jiang
et al.,
2008c)
McCabe’s Cyclo-
matic complexity
V(G) V(G) (Khosgoftaar
and Mun-
son, 1990),
(Khosh-
goftaar
and Allen,
1999c),
(Munson
and Khos-
goftaar,
1992),
(Xing et al.,
2005)
OO, Con-
ventional
Design+
MC (Jensen and
Vairavan,
1985)
Continued on next page
234
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
VG (Briand
et al.,
1993),
(Khoshgof-
taar et al.,
1996),
(Khosh-
goftaar
and Seliya,
2002)
VG1 (Dick and
Kandel,
2003),
(Khosgof-
taar et al.,
1994)
McC1 (Ohlsson
and Alberg,
1996)
M (Gokhale
and Lyu,
1997),
(Guo and
Lyu, 2000)
Continued on next page
235
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
CMT (Dick and
Kandel,
2003)
Strict Cyclomatic
Complexity
(Li et al.,
2006)
Complexity (Nagappan
et al., 2006)
CYCLOMATIC
COMPLEXITY,
v(G)
(Jiang
et al.,
2008c)
Continued on next page
236
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Weighted meth-
ods per class is
sum of complexi-
ties of methods in
a class
WMC WMC (Bouktif
et al.,
2006),
(Gyimothy
et al.,
2005),
(Quah and
Thwin,
2003),
(Rana
et al.,
2008),
(Thwin
and Quah,
2002),
(Zhou and
Leung,
2006)
OO Design+
WEIGHTED METHODS
PER CLASS
(Koru and
Liu, 2005a)
CK-WMC (Olague
et al., 2007)
Halstead vocabu-
lary,
η η (Ottenstein,
1979)
OO, Imp+
Continued on next page
237
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
η = η1 + η2 Halstead’s Vocabu-
lary
(Li et al.,
2006)
Conventional
HALSTEAD CON-
TENT, µ
(Jiang
et al.,
2008c)
Unique operator
count
η1 η1 (Khosgoftaar
et al.,
1994),
(Khosh-
goftaar
and Allen,
1999c),
(Ottenstein,
1979)
OO, Con-
ventional
Imp+
n1 (Jensen and
Vairavan,
1985)
n1 (Dick and
Kandel,
2003)
NUM UNIQUE
OPERATORS, µ1
(Jiang
et al.,
2008c)
Continued on next page
238
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Unique operand
count
η2 η2 (Khosgoftaar
et al.,
1994),
(Khosh-
goftaar
and Allen,
1999c),
(Ottenstein,
1979)
OO, Con-
ventional
Imp+
n2 (Jensen and
Vairavan,
1985)
VARUSDUQ (Khoshgoftaar
and Allen,
1999b),
(Khosh-
goftaar
and Seliya,
2003)
n2 (Dick and
Kandel,
2003)
Unique operands (Li et al.,
2006)
Continued on next page
239
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
NUM UNIQUE
OPERANDS, µ2
(Jiang
et al.,
2008c)
240
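Several of the Halstead-family entries in this table (vocabulary η = η1 + η2, length N = N1 + N2, estimated length, Jensen's length estimate NF, and program volume V) are related by well-known formulas. The sketch below shows those textbook relationships under the usual definitions; the function name and the example token counts are illustrative only and are not taken from any dataset used in this thesis.

```python
import math

def halstead_measures(n1, n2, N1, N2):
    """Relate the Halstead-family metrics catalogued in Table B.1.

    n1, n2 -- unique operator / operand counts (eta1, eta2)
    N1, N2 -- total operator / operand counts
    """
    eta = n1 + n2                 # Halstead vocabulary, eta = eta1 + eta2
    N = N1 + N2                   # program length: operators + operands
    # Halstead's estimated program length
    N_hat = n1 * math.log2(n1) + n2 * math.log2(n2)
    # Jensen's program length estimate (NF), using log of factorials
    NF = math.log2(math.factorial(n1)) + math.log2(math.factorial(n2))
    V = N * math.log2(eta)        # program volume
    return {"eta": eta, "N": N, "N_hat": N_hat, "NF": NF, "V": V}

# Illustrative counts: 4 unique operators, 4 unique operands,
# 10 operator occurrences, 6 operand occurrences.
print(halstead_measures(4, 4, 10, 6))
```

For example, with these counts the vocabulary is 8, the length is 16, and the volume is 16 · log2(8) = 48.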