Improving Software Quality Prediction Using Intelligent Computing Techniques
PhD Thesis
Zeeshan Ali Rana
2004-03-0061
Advisors: Dr. Shafay Shamail
Dr. Mian M. Awais
Department of Computer Science
School of Science and Engineering
Lahore University of Management Sciences
May 20, 2016
Dedicated to my family
Lahore University of Management Sciences
School of Science and Engineering
CERTIFICATE
We hereby recommend that the thesis prepared under our supervision by Zeeshan Ali Rana titled
Improving Software Quality Prediction Using Intelligent Computing Techniques be accepted in
partial fulfillment of the requirements for the degree of PhD.
Dr. Shafay Shamail (Advisor)
Dr. Mian M. Awais (Advisor)
Acknowledgements
A Malayan proverb says: "One can pay back the loan of gold, but one dies forever in debt to
those who are kind." Realizing this fact, I would like to pay special thanks to my advisors Dr.
Shafay Shamail and Dr. Mian M. Awais for providing me with every help possible. Their sincere
cooperation and guidance have helped me throughout the course of this research and made it possible
for me to complete this task. I express my gratitude to Dr. Naveed Arshad, who has been kind
enough to help me whenever I requested.
My gratitude will be meaningless if I am not grateful to Allah Almighty for His kindness and
blessings. His benevolence made me capable to achieve this milestone. All praises are for the
Almighty.
I would like to thank my family and friends, who have always supported me. My father has
supported me in all the hard times I faced during the course of this study. My mother has prayed
for me continuously to help me achieve this goal. My wife has been patient enough when I could
not give her time. I thank my son and daughter for being the latest motivation to complete my work,
and my brothers for their selfless support and prayers. My friends have made it look easy many
times. I would especially like to thank Malik Jahan Khan, Saqib Ilyas, Umar Suleman, Mirza Mubashar
Baig, and Junaid Akhtar for keeping me motivated and providing feedback on my work
whenever I requested it. In addition to the above, the following have made this experience memorable
for me: Khurram Nazir Junejo, Aadil Zia Khan, Fahad Javed, Khawaja Fahd, Malik Tahir Hassan,
Kamran Nishat, Khalid Mahmood Aamir, and Umar Faiz.
I am also thankful to the Higher Education Commission (HEC), Pakistan, and the Lahore University of
Management Sciences (LUMS), Pakistan, for funding this research.
List of Publications
Publications
Journal
1. Zeeshan Ali Rana, Mian M Awais, Shafay Shamail, Improving Recall in Software Defect
Models using Association Mining, Knowledge-Based Systems (KBS), Volume 90, December
2015, Pages 1-13, Elsevier.
2. Zeeshan Ali Rana, Mian Muhammad Awais, Shafay Shamail, Nomenclature Unification of
Software Product Measures, IET Software, Volume 5, Issue 1, pp. 83-102, IET Digital Library,
February 2011. IET.
Conferences and Workshops
1. Zeeshan Ali Rana, Mian M Awais, Shafay Shamail, Impact of Using Information Gain in
Software Defect Prediction Models, Intelligent Computing Theory, Lecture Notes in Computer
Science Volume 8588, 2014, pp. 637-648, 10th International Conference, ICIC 2014,
August 3-6, 2014, Taiyuan, China. Springer-Verlag.
2. Zeeshan Ali Rana, Sehrish Abdul Malik, Shafay Shamail, Mian M Awais, Identifying Asso-
ciation between Longer Itemsets and Software Defects, Lecture Notes in Computer Science
(LNCS), In Proceedings of The 20th International Conference on Neural Information Pro-
cessing (ICONIP’13), November 03-07 2013, Daegu South Korea. Springer-Verlag.
3. Hafsa Zafar, Zeeshan Ali Rana, Shafay Shamail, Mian M Awais, Finding Focused Itemsets
from Software Defect Data, In Proceedings of The 15th International Multi Topic Confer-
ence (INMIC’12), December 13-15, 2012, Islamabad Pakistan. IEEE. (Best Paper Award).
4. Zeeshan Ali Rana, Mian Muhammad Awais, Shafay Shamail, An FIS for Early Detection
of Defect Prone Modules, Lecture Notes in Computer Science (LNCS), Vol. 5755/2009. In
Proceedings of the 5th International Conference on Intelligent Computing 2009 (ICIC’09),
September 16-19, 2009, Ulsan South Korea. Springer Berlin / Heidelberg.
5. Zeeshan Ali Rana, Shafay Shamail, Mian Muhammad Awais, Ineffectiveness of Use of Soft-
ware Science Metrics as Predictors of Defects in Object Oriented Software, In Proceedings
of World Congress on Software Engineering 2009 (WCSE’09), May 19-21, 2009, Xiamen
China. IEEE.
6. Zeeshan Ali Rana, Shafay Shamail, Mian Muhammad Awais, Towards a Generic model
for software quality prediction, In Proceedings of 6th Workshop on Software Quality 2008
(WoSQ’08) in 30th ICSE, May 10-18, 2008, Leipzig Germany. ACM.
Abstract
Software Quality Prediction (SQP) has been an area of interest for the last four decades. The aim
of quality prediction is to identify the defect prone modules in software so that they can be
improved at early stages of software development. SQP is done using models that predict the
defect prone modules, and these predictions are based on software metrics. Software metrics and
defect related information are recorded in the form of datasets. These defect datasets contain
instances of defect prone and not-defect prone modules.
The major motive behind quality prediction is to identify defect prone modules correctly in the
early phases of development. Imbalanced datasets and late predictions are two problems that work
against this motive. In most of the datasets, the instances of not-defect prone modules far outnumber
the instances of defect prone modules, which creates imbalance in the datasets. Due to this
imbalance, defect prone modules are not identified effectively, and predicting them well enough to
achieve high Recall on the public datasets becomes a challenging task. Predictions based on code
metrics are considered late: the majority of the metrics in the datasets are code metrics, which
means that accurate predictions can be made only once the code metrics become available. Another
issue in the domain of software quality and metrics is that the software metrics used so far have
inconsistent nomenclature, which makes it difficult to study certain software metrics.
In this thesis an association mining (AM) based approach is proposed that improves the prediction
of defect prone modules. The proposed approach modifies the data in such a manner that a prediction
model learns defect prone modules better even if there are few instances of them. We use
Recall to measure the performance of the model developed after the proposed preprocessing. The
issue of late predictions has been handled by using a model that can work with imprecise values
of software metrics: this thesis proposes a Fuzzy Inference System (FIS) based model that helps
predict defect prone modules when exact values of code metrics are not available. To handle the
issue of inconsistent nomenclature, this thesis provides a unification and categorization framework
that works on the principle of chronological use of metric names. The framework has been used to
identify the same metrics with different names as well as different metrics with the same name.
The association mining based approach has been tested on public datasets using the Naive Bayes
classifier, which is the simplest classifier and is considered one of the best performers. The proposed
approach has increased the Recall of the Naive Bayes classifier by up to 40%. The performance
of the proposed Fuzzy Inference System (FIS) model, used to handle the issue of late predictions,
has been compared with models such as neural networks, classification trees, and linear regression
based classifiers. The FIS model has performed as well as the other models, and up to 10%
improvement in Recall has been observed for the FIS model. The nomenclature of approximately
140 metrics has been unified using the proposed unification framework. Of these 140 metrics,
approximately 6% are different metrics that have been used with the same name in the literature;
their naming issues have been resolved based on the chronological use of the names.
Achieving better Recall using the proposed approach can help avoid the costs incurred when a
defect prone module is identified late in the software lifecycle, when the cost of fixing defects
becomes higher. The proposed FIS model can be used for rough early estimates; later, more
accurate estimates can be made when code metrics become available.
CONTENTS
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Software Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Software Quality Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Software Metrics for Software Quality Prediction . . . . . . . . . . . . . . 6
1.2.2 Software Quality Prediction Models . . . . . . . . . . . . . . . . . . . . . 8
1.2.3 Prediction Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.1 Data Related . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2 Model Related . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.3 Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.1 Preprocessing Imbalanced Datasets . . . . . . . . . . . . . . . . . . . . . 16
1.4.2 Early Prediction of Defects . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.3 Resolving Software Metrics Nomenclature Issues . . . . . . . . . . . . . . 17
1.5 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.6 Thesis Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2. Background Study and Literature Survey . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1 Software Defect Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Software Defect Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Measuring Performance of SDP Models . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.2 ROC Curves and AUC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4 Software Defect Prediction Studies . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.1 Factor and Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.2 Discriminant Analysis and Principal Component Analysis . . . . . . . . . 35
2.4.3 Bayesian Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.4 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4.5 Classification/Regression Trees . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.6 Case-based Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.4.7 Fuzzy Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.4.8 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4.9 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4.10 Association Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.11 Ensemble Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.4.12 Other Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5 Performance Evaluation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.6 Studies to Remove Inconsistencies in Software Measurement Terminology . . . . . 55
2.7 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3. Software Defect Prediction Models: A Comparison . . . . . . . . . . . . . . . . . . . 61
3.1 Description of Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.1.1 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.1.2 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.1.3 Composite Hypercubes on Iterated Random Projections (CHIRP) . . . . . 63
3.1.4 Decision Table - Naive Bayes (DTNB) . . . . . . . . . . . . . . . . . . . 64
3.1.5 Fuzzy Unordered Rule Induction Algorithm (FURIA) . . . . . . . . . . . 65
3.2 Comparison Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.5 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.5.1 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4. Increasing Recall in Software Defect Prediction . . . . . . . . . . . . . . . . . . . . . 76
4.1 Proposed Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.1.1 Discretization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1.2 Horizontal Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1.3 Generating Frequent Itemsets . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1.4 Selecting Focused and Indifferent Itemsets . . . . . . . . . . . . . . . . . 80
4.1.5 Modifying Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.1.6 Time Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2 Developing Defect Prediction Model . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3.1 Identifying Performance Measure . . . . . . . . . . . . . . . . . . . . . . 85
4.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.4 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5. Early Predictions using Imprecise Data . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.1 Imprecise Inputs and Defect Prediction . . . . . . . . . . . . . . . . . . . . . . . . 106
5.2 Proposed Model based on Imprecise Inputs . . . . . . . . . . . . . . . . . . . . . 108
5.2.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.2.2 FIS Based Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.2.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.4 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6. Resolving Issues in Software Defect Datasets . . . . . . . . . . . . . . . . . . . . . . 121
6.1 Issues related to Software Defect Data . . . . . . . . . . . . . . . . . . . . . . . . 123
6.1.1 Inconsistencies in Software Product Metrics Nomenclature . . . . . . . . 123
6.1.2 Ineffective use of Software Science Metrics . . . . . . . . . . . . . . . . . 126
6.2 Proposed Approaches to Handle the Issues related to Software Defect Data . . . . 128
6.2.1 Metric Unification and Categorization (UnC) Framework . . . . . . . . . . 128
6.2.2 Proposed Approach to Show Ineffective use of SSM . . . . . . . . . . . . 131
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.3.1 Application of the UnC Framework . . . . . . . . . . . . . . . . . . . . . 134
6.3.2 Ineffective use of SSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.4 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.4.1 UnC Framework Resolves Nomenclature Issues . . . . . . . . . . . . . . . 138
6.4.2 Use of SSM Deteriorates Performance . . . . . . . . . . . . . . . . . . . . 142
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7. Conclusions and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
A. PraTo: A Practical Tool for SDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
A.1 Collecting and Combining Defect Prediction Models . . . . . . . . . . . . . . . . 171
A.2 Tool Architecture and Working . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
A.2.1 A Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
A.2.2 Dataset Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
A.2.3 Unified Metrics Database . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
A.2.4 Models Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
A.2.5 Input Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
A.2.6 Context Specification and Model Selection . . . . . . . . . . . . . . . . . 178
A.2.7 Model Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
A.2.8 Output Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
A.3 Salient Features of PraTo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
A.4 Application of PraTo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
A.4.1 Scenario Specification: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
A.4.2 Input Selection and Preprocessing: . . . . . . . . . . . . . . . . . . . . . . 183
A.5 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
A.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
B. List of Unified and Categorized Software Product Metrics . . . . . . . . . . . . . . . . 190
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
LIST OF FIGURES
1.1 A Software Quality Prediction Model . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Typical Phases in Lifecycle of a Software . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Challenges addressed in this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4 Our approach to address challenges . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.5 Our contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1 Software Defect Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Confusion Matrix of a Defect Prediction Classifier . . . . . . . . . . . . . . . . . 32
2.3 ROC curve of three classifiers with best performance of C1 . . . . . . . . . . . . . 34
3.1 CHIRP Working in training and testing . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2 Nemenyi’s Critical Difference Diagram for evaluation using AUC . . . . . . . . . 71
4.1 Major Steps of Our Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2 Preprocessing Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Questionnaire Results Showing Industry’s Response Regarding Recall . . . . . . . 92
4.4 Trend of Recall across five datasets (Continued on next page) . . . . . . . . . . . . 97
4.4 (Continued from previous page) Trend of Recall across five datasets . . . . . . . . 98
4.5 Percentage Change in Recall across 5 datasets with and without the proposed pre-
processing (Continued on next page) . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.5 (Continued from previous page) Percentage Change in Recall across 5 datasets
with and without the proposed preprocessing . . . . . . . . . . . . . . . . . . . . 102
4.6 Percentage Change in FPRate across five datasets with and without the proposed
preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.1 Accuracy of estimation as project progresses (Shari L. PFleeger, 2010). . . . . . . 107
5.2 Frequency distribution of all input metrics for kc1-classlevel data (Continued on
Next Page) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.2 (Continued from Previous Page) Frequency distribution of all input metrics for
kc1-classlevel data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3 Output of phase 1: clusters and membership functions for each input of kc1-
classlevel. (The plot of the membership functions for each input appear in the
same order as the distribution of each input appears in Figure 5.2) . . . . . . . . . 115
5.4 Frequency distribution of all input metrics for jEdit bug data (Continued on Next
Page) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.4 Frequency distribution of all input metrics for jEdit bug data . . . . . . . . . . . . 116
5.5 Output of phase 1: clusters and membership functions for each input of jEdit.
(The plot of the membership functions for each input appear in the same order as
the distribution of each input appears in Figure 5.4) . . . . . . . . . . . . . . . . . 117
5.6 ROC point for each model in training and testing . . . . . . . . . . . . . . . . . . 118
6.1 Role of UMD in the Generic Approach for Software Quality Prediction . . . . . . 122
6.2 SPdM Type I and Type II Inconsistencies . . . . . . . . . . . . . . . . . . . . . . 124
6.3 SPdM Unification and Categorization Framework . . . . . . . . . . . . . . . . . . 128
6.4 SPdM Unification and Categorization Framework: Detailed Design . . . . . . . . 129
6.5 Unification and Categorization of B . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.6 SPdM Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
A.1 The Generic Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
A.2 Input Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
A.3 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
A.4 Main Screen of PraTo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
A.5 A Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
A.6 Specifying a Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
A.7 Adding New Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
A.8 AHP based Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
LIST OF TABLES
1.1 Typical Techniques Used for Software Quality Prediction . . . . . . . . . . . . . . 10
2.1 Selected Datasets and their Characteristics . . . . . . . . . . . . . . . . . . . . 26
2.2 Selected Software metrics reported in the datasets used . . . . . . . . . . . . . . . 27
2.3 Datasets and their attributes used in this study . . . . . . . . . . . . . . . . . . . . 28
2.4 Two Major Views in Software Defect Prediction . . . . . . . . . . . . . . . . . . . 57
3.1 Results of Classifiers over Selected Data Sets in Terms of AUC . . . . . . . . . . . 67
3.2 Mean AUC and Std. Dev. Over the Complete Range of Tuning Parameters . . . . . 68
4.1 Questions asked from the Software Industry . . . . . . . . . . . . . . . . . . . . . 84
4.3 Top 5 1-Itemsets and their Support_i in each partition . . . . . . . . . . . . . . 86
4.2 Minimum Support Thresholds and Itemset Counts for Each Dataset . . . . . . . . 93
4.4 τ_t and τ_f, used in this study, for each dataset . . . . . . . . . . . . . . . . . . 94
4.5 Top 3 2-Itemsets and their Support_i in each partition . . . . . . . . . . . . . . 95
4.6 Performance of Decision Tree Model in terms of Recall . . . . . . . . . . . . . . 98
4.7 Performance of NB classifier on different number of bins with and without pro-
posed approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1 Metrics Used for this Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2 Evaluation Parameters Used for Comparison . . . . . . . . . . . . . . . . . . . . . 111
5.3 Testing Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.1 Examples of metrics with Type I inconsistency . . . . . . . . . . . . . . . . . . . 124
6.2 Examples of metrics with Type II inconsistency . . . . . . . . . . . . . . . . . . . 127
6.3 List of classification models used from WEKA (Witten et al., 2008). . . . . . . . 132
6.4 Results of numeric classification with and without SSM (Halstead, 1977). . . . . . 137
6.5 Results of numeric classification with and without SSM (Halstead, 1977). . . . . . 138
6.6 Percentage of Preserved Labels. . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.7 Distribution of Software Product Metrics in Software Development Paradigm with
Overlap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.8 Categorization with respect to Software Lifecycle Phase. . . . . . . . . . . . . . . 141
6.9 Effectiveness of SSM reported by all models . . . . . . . . . . . . . . . . . . . . . 143
A.1 Datasets List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
A.2 List of Models in Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
A.3 Winners in Terms of Recall and Acc . . . . . . . . . . . . . . . . . . . . . . . . . 185
B.1 Frequently Used Software Measures, Their Use and Applicability . . . . . . . . . 190
1. INTRODUCTION
Quality is generally considered a measure of goodness. Some authors state that the quality of an
entity is the extent to which the entity meets the needs of its users (Kan, 2002). When describing
the quality of an entity, adjectives like excellent, good, and poor are attached to the word quality.
Entities with excellent quality are considered better than entities with good or poor quality, and
entities with good quality are considered better than entities with poor quality. Organizations
working on the development of entities wish to develop entities of good and excellent quality. Generally,
quality is also considered a relative term (i.e. it varies from customer to customer, user to
user, and stakeholder to stakeholder) and is defined in different ways. The International Organization for
Standardization defines the term in ISO 9000 (ISO, 2005) as: “degree to which a
set of inherent characteristics fulfils requirements (i.e. needs or expectations that are stated, or
generally applied)”.
The British Standards Institution (BSI) (BSI, 2008) defines quality as “The totality of features and
characteristics of a product or service that bear on its ability to satisfy stated or implied needs”.
The definitions of quality by Phil Crosby and Joseph Juran also focus on conformance to requirements.
Crosby's definition (Crosby, 1979) is “Quality is conformance to requirements or specifications”,
and Juran's definition is “Quality is necessary measurable element of a product or service and is
achieved when expectations or requirements are met”.
An IEEE standard (IEEE Std. 610.12-1990) (IEEE, 1990) defines quality as
• “The degree to which a system, component or process meets specified requirements.
• The degree to which a system, component or process meets customer or user needs or expec-
tations”.
1.1 Software Quality
Software quality relates to conformance to the requirements and expectations of users or customers
(Kan, 2002). It is commonly recognized as the lack or absence of defects in software (Kan, 2002,
Jones, 2008, Maier and Rechtin, 2000). A defect is considered to be a failure to meet the user's or
customer's requirements. The perception of quality as a lack of defects corresponds to the basic
definition, i.e. conformance to requirements (Kan, 2002). Software quality has been defined in
different ways; some of these definitions are provided below:
• Caper Jones (Jones, 2008) defines software quality as “the absence of defects that would
make software either stop completely or produce unacceptable results.”
• Tom McCabe defines software quality as “high levels of user satisfaction and
low defect levels, often associated with low complexity” (Jones, 2008).
• According to John Musa, software quality is a combination of “low defect levels, adherence
of software functions to user needs, and high reliability” (Jones, 2008).
• Barry Boehm considers software quality to be “achieving high levels of user satisfaction,
portability, maintainability, robustness, and fitness for use” (Jones, 2008).
• Edward Deming thinks of software quality as “striving for excellence in reliability and func-
tions by continuous improvement in the process of development, supported by statistical
analysis of the causes of failure” (Jones, 2008).
• Watts Humphrey says that quality is “achieving excellent levels of fitness for use, confor-
mance to requirements, reliability, and maintainability” (Jones, 2008).
• James Martin says that software quality means “being on time, within budget, and meeting
user needs” (Jones, 2008).
IEEE Standard (IEEE Std. 729-1983) defines software quality as:
• “The totality of features and characteristics of a software product that bear on its ability to
satisfy given needs: for example, conform to specifications.
• The degree to which software possesses a desired combination of attributes.
• The degree to which a customer or user perceives that software meets his or her composite
expectations.
• The composite characteristics of software that determine the degree to which the software in
use will meet the expectations of the customer.”
These definitions show that software quality is an intangible concept with multiple aspects, has
more than one “standard” definition, and is considered in terms of attributes such as reliability,
maintainability, and usability. Jones (Jones, 2008) states that a working definition of software
quality must meet the following criteria:
1. Quality must be measurable when it occurs.
2. Quality should be predictable before it occurs.
The lack or absence of defects in software is one aspect of software quality (Kan, 2002,
Jones, 2008, Maier and Rechtin, 2000). This thesis uses the definition of quality that corresponds
to the lack or absence of defects. The selected definition meets the above mentioned criteria of
measurability and predictability (Jones, 2008).
Measurement of the quality highlights the potential areas of improvement in the software prod-
uct itself as well as in the process adopted to develop the product. Measurement of quality also
lets us compare one software product with another one. Therefore, software quality becomes an
important area of interest for the organizations responsible for development of software products.
The software development organizations not only desire to build a product with good intrinsic
quality but also aspire to have minimum variation in their process of development such that all the
developed software products have similar quality. Quality of software can be measured in different
ways. For example, a software product can be considered of good quality if users face fewer defects,
its number of defects per million lines of source code is low, or its number of failures per operational
hour is low. All these examples relate to the definition of software quality that the
software meets its requirements and the number of defects in its functionality is low.
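The example measures above can be expressed directly; a minimal sketch, in which the function names and the figures used are hypothetical:

```python
# Illustrative defect-based quality measures (names are hypothetical).

def defects_per_kloc(defects: int, loc: int) -> float:
    """Defects per thousand lines of source code."""
    return defects / (loc / 1000)

def failures_per_hour(failures: int, operational_hours: float) -> float:
    """Failure intensity over an operational period."""
    return failures / operational_hours

# Example: 120 defects in 400,000 lines of code -> 0.3 defects per KLOC
print(defects_per_kloc(120, 400_000))   # 0.3
print(failures_per_hour(6, 1000))       # 0.006
```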
Measuring defect potential and defect removal effectiveness is critical for the software industry
(Jones, 2008). The best way to reduce costs and shorten schedules is to reduce the quantity of defects
during development (Jones, 2008). The software community uses numerous methods to cut defect
potentials such as Team Software Process (TSP), Personal Software Process (PSP), Capability
Maturity Model (CMM and CMMI), ISO/IEC 9126, Software Process Improvement and Capabil-
ity dEtermination (SPICE), McCall’s model etc. (Shari L. PFleeger, 2010, Jones, 2008). Presence
of such models and special focus on quality by companies like IBM and HP over past decades
shows the significance of software quality (Shari L. PFleeger, 2010, Kan, 2002, Jones, 2008).
Studies show that using TSP and PSP has benefitted organizations developing applications of
large size (≥ 10,000 function points), which are complex and difficult to build, by cutting their
defect potentials by 50% (Jones, 2008). Jones (Jones, 2008) reports empirical evidence that software
quality correlates with CMM/CMMI maturity level. When compared with similar projects being
developed at CMM/CMMI Level 1, projects developed at Levels 3 to 5 show better quality and
higher productivity. However, the use of CMM/CMMI is recommended for projects with function
points ≥ 10,000. Organizations also use McCall's quality model, which considers correctness,
reliability, efficiency, integrity, usability, maintainability, testability, flexibility, portability,
reusability, and interoperability as attributes of quality that can be considered when
measuring the quality of a software product (Shari L. PFleeger, 2010). These attributes are
also known as quality factors or quality attributes. There are other attributes which are consid-
ered when measuring quality. For example IBM measures quality of its software products in terms
of: capability (functionality), usability, performance, reliability, installability, maintainability, doc-
umentation/information, service, and overall. These quality attributes measured by IBM are also
known as CUPRIMDSO (Kan, 2002). Similarly HP measures quality as FURPS (functionality, us-
ability, reliability, performance, and serviceability) (Kan, 2002). Studies on large software systems
by corporations like IBM, AT&T, and HP have highlighted the significance of software metrics.
Literature suggests the collection of software metrics as a step to measure and control software
quality (Shari L. PFleeger, 2010, Kan, 2002, Jones, 2008). These corporations use metrics to mea-
sure quality of their systems. Studies of these systems reveal that defects are not randomly
distributed in the system; rather, they are concentrated in a small set of modules known as "error-
prone modules". As a rule of thumb, 50% of defect reports in a large system may belong to 5% of
the modules. Identification of such modules helps in correcting the problems (Jones, 2008).
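The rule of thumb above can be sketched as a ranking over per-module defect counts; the counts below are hypothetical:

```python
# Sketch: rank modules by reported defects and find the smallest set of
# modules accounting for a given share of all defect reports.

def error_prone_modules(defects_by_module: dict, share: float = 0.5) -> list:
    total = sum(defects_by_module.values())
    ranked = sorted(defects_by_module, key=defects_by_module.get, reverse=True)
    selected, covered = [], 0
    for module in ranked:
        if covered >= share * total:
            break
        selected.append(module)
        covered += defects_by_module[module]
    return selected

reports = {"m1": 50, "m2": 20, "m3": 10, "m4": 10, "m5": 5, "m6": 5}
print(error_prone_modules(reports))  # ['m1'] -- one module holds 50 of 100 reports
```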
1.2 Software Quality Prediction
As mentioned earlier measuring defect potentials is one of the most effective methods to evaluate
software quality. Two important concepts required to improve software quality are:
1. Defect Prevention
2. Defect Removal
Methodologies used for defect prevention focus on lowering defect potentials and reducing number
of defects (Jones, 2008). Methods used for defect removal concentrate on improving testing effi-
ciency and defect removal efficiency. Formal design and code inspections are considered the most
effective activities to prevent and remove defects (Jones, 2008). Inspections, however, should not be
adopted merely because they are good methods to evaluate quality (Shari L. PFleeger, 2010).
Prediction of defects is another method to evaluate the quality of a software product; the choice of
evaluation technique should depend on its appropriateness to the development environment (Shari L. PFleeger, 2010).
Quality of software can be predicted beforehand (i.e. before the software becomes operational)
and defect potentials can be reduced using the prediction information. This prediction is done using
a prediction model. As shown in Figure 1.1, a quality prediction model takes software metrics as
input and indicates the predicted quality (for example number of defects). A defect prediction
model either classifies a software module as Defect-prone (D) / Not Defect-prone (ND) or predicts
the number of defects in the software. The predictions can be made either on the basis of historical
data collected during implementation of the same or similar projects (Wang et al., 2007), or using
the design metrics collected during the design phase. Both are measurement-based
prediction methods since they involve software metrics in the prediction process. Use of software
metrics in various Software Defect Prediction (SDP) studies has been effective (Ganesan et al.,
2000, Shen et al., 1985, Thwin and Quah, 2002). Mostly two kinds of metrics have been used
Fig. 1.1: A Software Quality Prediction Model
to build SDP models; Software Product Metrics (SPdMs) and Software Process Metrics (SPcM).
These types of metrics have been discussed in Section 1.2.1.
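The interface of such a model (Fig. 1.1) can be sketched as a minimal classifier taking metrics in and returning a defect-proneness label; the metric names and thresholds below are invented for illustration and not calibrated on any dataset:

```python
# Minimal sketch of the model in Fig. 1.1: software metrics in, a
# defect-proneness label out. Thresholds are hypothetical.

def classify_module(metrics: dict) -> str:
    """Return 'D' (defect-prone) or 'ND' based on simple metric thresholds."""
    if metrics.get("loc", 0) > 500 or metrics.get("cyclomatic", 0) > 10:
        return "D"
    return "ND"

print(classify_module({"loc": 820, "cyclomatic": 7}))   # D
print(classify_module({"loc": 120, "cyclomatic": 4}))   # ND
```

A real model would learn such decision boundaries from historical data rather than fix them by hand.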
Software Quality Prediction (SQP) helps software project managers in controlling the software
project and planning the resources and tests (Munson and Khoshgoftaar, 1990). These factors
facilitate development of better-quality software, resulting in higher customer satisfaction and a
good return on investment. SQP is beneficial during all phases of software development. However, the
literature suggests early prediction of defect-prone modules in order to improve software process
control, reduce defect correction effort and cost, and achieve high software reliability (Kaur et al.,
2009, Fenton et al., 2008, Sandhu et al., 2010, Khoshgoftaar et al., 1996, Jiang et al., 2007, Xing
et al., 2005, Quah and Thwin, 2003, Wang et al., 2004, Yang et al., 2007). Prediction of quality (in
terms of defects) earlier in software lifecycle reduces software costs by allowing the identification
and mitigation of risks (Khoshgoftaar et al., 1996, Xing et al., 2005, Briand et al., 1993, Gokhale
and Lyu, 1997, Wang et al., 1998, Yuan et al., 2000). Early prediction further helps in preparation
of better resource allocation (Wang et al., 2004, Yuan et al., 2000) and test plans (Yuan et al.,
2000, Khosgoftaar and Munson, 1990, Ottenstein, 1979, Mohanty, 1979). SQP also helps the
maintenance of software (Yuan et al., 2000, Jensen and Vairavan, 1985).
1.2.1 Software Metrics for Software Quality Prediction
Measuring characteristics of software is helpful in engineering of software (Fenton and Pfleeger,
1998). For example, it helps managers know whether the requirements are consistent and complete,
whether the code is ready to be tested, and whether the complexity of the code is high. Successful
project managers measure attributes of process and product to gauge quality of the product (Fenton
and Pfleeger, 1998). The software defect prediction studies mostly use two kinds of metrics:
1. Software Product Metrics (SPdMs)
2. Software Process Metrics (SPcM).
Software Product metrics are measurements of attributes associated with software itself, for ex-
ample size, complexity, number of defects in the software and relationship between the software
components. Software Process Metrics are measurements of attributes of the processes which
are carried out during the lifecycle of the software such as type of development model, effec-
tiveness of development process followed, number of meetings during a phase, performance of
testing, and budget overrun (Fenton and Pfleeger, 1998). Some metrics are directly collected for
a particular entity (for example a class or method) while others are derived i.e. they are originally
collected for a certain sub-entity (such as methods in a class) but represent the parent entity (i.e. the
class). A few examples of such metrics are sumLOC_TOTAL, avgHALSTEAD_EFFORT, and
sumCYCLOMATIC_COMPLEXITY (Koru and Liu, 2005a), which are aggregations of LOC, E, and V(G)
respectively. These have been originally collected at method level but represent the respective
classes.
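The derived metrics described above are simple aggregations of method-level values; a sketch with hypothetical numbers, using the sum/avg prefix convention:

```python
# Method-level measurements aggregated to represent the parent class
# (values are hypothetical).

methods = [
    {"LOC": 30, "HALSTEAD_EFFORT": 1200.0, "CYCLOMATIC_COMPLEXITY": 4},
    {"LOC": 45, "HALSTEAD_EFFORT": 2800.0, "CYCLOMATIC_COMPLEXITY": 7},
]

class_metrics = {
    "sumLOC_TOTAL": sum(m["LOC"] for m in methods),
    "avgHALSTEAD_EFFORT": sum(m["HALSTEAD_EFFORT"] for m in methods) / len(methods),
    "sumCYCLOMATIC_COMPLEXITY": sum(m["CYCLOMATIC_COMPLEXITY"] for m in methods),
}
print(class_metrics)
# {'sumLOC_TOTAL': 75, 'avgHALSTEAD_EFFORT': 2000.0, 'sumCYCLOMATIC_COMPLEXITY': 11}
```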
The initial work in the domain of software quality prediction was an inspiration from software
science metrics (Halstead, 1977). That work was limited to predicting the number of errors. After-
wards different organizations like Commission of the European Communities’ Strategic Program
for Research in Information Technology (Brocklehurst and Littlewood, 1992), Swedish National
Board for Industrial and Technical Development (Ohlsson and Alberg, 1996, Fenton and Ohlsson,
2000), Northern Telecom Limited, USA (Khoshgoftaar et al., 1996), Nortel, USA (Khoshgof-
taar and Allen, 1999b) started work in the area of classifying fault prone modules (Munson and
Khosgoftaar, 1992). This strategy of classifying fault-prone modules was better than the earlier
approach of predicting the quality of the whole software, for example predicting the number of defects
(Ottenstein, 1979, 1981, Schneider, 1981). It was better in the sense that it pinpointed the modules
Fig. 1.2: Typical Phases in Lifecycle of a Software
which were more fault prone and needed more attention. Software science metrics (SSM) (Hal-
stead, 1977), proposed by Halstead, are based on number of operators, operands and their usage
and have been proposed by keeping procedural paradigm in mind. These metrics are indicators
of software size and complexity (for example program length N and effort E measure size and
complexity respectively). Earlier studies have found a correlation of software size and complexity
with number of defects (Khosgoftaar and Munson, 1990, Ottenstein, 1979) and have used size and
complexity metrics as predictors of defects. Studies have used SSM for defect prediction and clas-
sification of defect prone software modules as well (Xing et al., 2005, Briand et al., 1993, Gokhale
and Lyu, 1997, Ottenstein, 1979, Jensen and Vairavan, 1985, Koru and Liu, 2005a, Munson and
Khosgoftaar, 1992, Khosgoftaar et al., 1994, Khoshgoftaar and Allen, 1999c, Khoshgoftaar and
Seliya, 2003, 2004, Li et al., 2006, Seliya and Khoshgoftaar, 2007). Fenton et al. (Fenton and
Neil, 1999) have criticized the use of SSM and other size and complexity metrics in defect prediction
models because 1) the relationship between complexity and defects is not entirely causal, and
2) defects are not simply a function of size. The majority of prediction models make these two
assumptions (Fenton and Neil, 1999). Despite the critique, various studies have used SSM to study
software developed in procedural paradigm (Xing et al., 2005, Khoshgoftaar and Allen, 1999c,
Khoshgoftaar and Seliya, 2003) as well as object oriented paradigm (Koru and Liu, 2005a, Seliya
and Khoshgoftaar, 2007, Challagulla et al., 2005).
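Halstead's measures mentioned above can be computed directly from operator and operand counts; the following sketch uses the standard formulas, with hypothetical counts:

```python
import math

# Halstead's software science metrics from operator/operand counts:
# n1/n2 = distinct operators/operands, N1/N2 = total occurrences.

def halstead(n1, n2, N1, N2):
    N = N1 + N2                      # program length
    n = n1 + n2                      # vocabulary
    V = N * math.log2(n)             # volume
    D = (n1 / 2) * (N2 / n2)         # difficulty
    E = D * V                        # effort
    return {"length": N, "volume": V, "difficulty": D, "effort": E}

m = halstead(n1=10, n2=20, N1=60, N2=80)
print(round(m["volume"], 1), round(m["effort"], 1))
```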
Since the public availability of software defect data at PROMISE repository (Menzies et al.,
2012), the number of studies on defect prediction has increased. Despite the ceiling effect on
performance of defect prediction models (Menzies et al., 2010), and certain quality issues in the
promise repository data (Shepperd et al., 2013) studies are being done to further investigate the
relationship between the product metrics and software defects (He et al., 2015, Okutan and Yildiz,
2014).
1.2.2 Software Quality Prediction Models
The typical life cycle of a software product has phases like Requirements Gathering and Analysis, Design,
Coding, Testing, and Maintenance as shown in Figure 1.2. Software quality prediction studies use
software product metrics in different stages of the software development lifecycle (SDLC) (Wang
et al., 2007, Ganesan et al., 2000, Quah and Thwin, 2003, Yang et al., 2007, Gokhale and Lyu, 1997,
Khosgoftaar and Munson, 1990, Ottenstein, 1979, Jensen and Vairavan, 1985, Munson and Khos-
goftaar, 1992, Schneider, 1981, Khoshgoftaar and Seliya, 2003, Challagulla et al., 2005, Grosser
et al., 2003, Khoshgoftaar et al., 1992, Weyuker et al., 2008, Bouktif et al., 2006). Ganesan et
al. (Ganesan et al., 2000) employed case-based reasoning to predict design faults. Yang et al.
(Yang et al., 2007) have extracted rules useful in early stages of the software lifecycle to predict defect
proneness and reliability; they suggest the use of a fuzzy self-adaptive
learning control network (FALCON) which can predict quality based on fuzzy inputs. More-
over, defect proneness of software modules has been predicted earlier in the development phase
using discriminant analysis (Khoshgoftaar et al., 1996). In order to collect the defect proneness
information as early as possible, studies have been conducted using requirements metrics (Jiang
et al., 2007). Xing et al. (Xing et al., 2005) have employed Support Vector Machines for early
quality prediction using design and static code metrics and have achieved correct classification
rates of up to 90%. Later, a study comparing design and code metrics suggests that design metrics
combined with static code metrics can help get better defect prediction results (Jiang et al., 2008c).
Grosser et al. (Grosser et al., 2003) suggested a technique which was suitable for object oriented
(OO) software only. Company specific (Li et al., 2006) and domain specific (such as for telecom-
munication systems (Khosgoftaar et al., 1994)) studies have also been presented. Because the work
has been done in various dimensions, the need for a generic model of software quality has been felt
(Bouktif et al., 2004, Wagner, 2007, Wagner and Deissenboeck, 2007, Winter et al., 2007). But
the application- and context-specific nature of existing models makes it difficult to take full ad-
vantage of the existing work. Fenton et al. (Fenton and Neil, 1999) have presented a critique of
existing models and highlighted their weaknesses. Models generic for a
certain quality factor, such as usability (Winter et al., 2007), have also been suggested. Bouktif
Tab. 1.1: Typical Techniques Used for Software Quality Prediction
Technique Used in
Bayesian Belief Network (BBN) (Fenton et al., 2007b, Fenton and Neil, 1999)
Naive Bayes (NB) (Turhan and Bener, 2009)
Linear Discriminant Analysis (LDA) (Ohlsson et al., 1998)
Regression Analysis (RA) (Gokhale and Lyu, 1997)
Case Based Reasoning (CBR) (Khoshgoftaar et al., 2006)
Classification Trees (Khoshgoftaar and Allen, 1999a)
Genetic Algorithms (GA) (Bouktif et al., 2004)
Neural Networks (NN) (Wang et al., 2004, Mahaweerawat et al., 2004)
Support Vector Machine (SVM) (Xing et al., 2005)
et al. (Bouktif et al., 2004) have suggested a technique for selecting an appropriate model from
a set of existing models. Their approach reused existing models but it was restricted to selection
of an appropriate model only. Their major focus was on facilitating a company in adapting object
oriented software quality predictors to a particular context. Unavailability of large data repositories
is considered an obstacle to generalizing, validating, and reusing existing models (Bouktif et al., 2004).
Software datasets are available at PROMISE website (Menzies et al., 2012) to validate existing
models, and we have used kc1 for our initial study. Though Wagner (Wagner, 2007) has suggested an
approach to reduce the effort of applying a prediction model, some other issues still need to be
addressed. It is difficult to avoid the context-specific behavior of a software quality predictor. To address this
issue models have been divided in different dimensions of specificity. Wagner et al. (Wagner and
Deissenboeck, 2007) have argued that software quality models differ along six dimensions namely
purpose, view, attribute, phase, technique and abstraction. Some examples of techniques used for
defect prediction models are given in Table 1.1.
The prediction models have been used to predict defects in future iterations/releases of a prod-
uct as well as in later phases of development based on data collected in earlier iterations/releases
or phases respectively (Wang et al., 2007, Ganesan et al., 2000, Xing et al., 2005, Ottenstein, 1979,
Mohanty, 1979, Ottenstein, 1981, Schneider, 1981, Bouktif et al., 2004, Xing and Guo, 2005,
Khoshgoftaar and Allen, 1999a, Brun and Ernst, 2004). For example, based on defect informa-
tion from a previous iteration, a quality prediction model for iteratively developed software can
improve after the delivery of each iteration. With the help of the empirical model an organization
can at least roughly predict the quality of the next iteration. Based on these predictions, new goals can
be set and better quality may be achieved. Also, if the metrics for a particular phase are calculated
and known, then the quality of the next phase can be determined and better resource management can
be done based on these predictions. The prediction model can use information from earlier phases
to predict defects in later phases, for example a rule-based model which is developed on the basis
of metrics like cyclomatic complexity and number of decision points in the software. If a software
component has a large number of decision points, it is likely to contain more programming errors.
Similarly, a Support Vector Machine (SVM) based technique (Xing et al., 2005) has been employed
to predict software quality in early stages of development. Hence prediction models can be devel-
oped when historical data for previous releases is available, as well as in scenarios where a significant
number of metrics are not available in earlier stages of the SDLC.
1.2.3 Prediction Performance
Performance of the models used to predict defectiveness of software modules is measured using
parameters like accuracy, recall, false positive rate, area under the Receiver Operating
Characteristic (ROC) curve, F-measure, etc. All these parameters have their own significance. For example,
accuracy of a model reflects its ability to correctly classify defect-prone as well as
not-defect-prone (or defect-free) modules. Models with high accuracy are considered good. Recall,
on the other hand, focuses on correct classification of defect-prone modules; it is desired that a
model has high recall. False positive rate indicates the incorrectly classified defect-free modules
and relates to extra testing. Therefore, lower values of false positive rate are desired if extra testing
is not affordable. Area under the ROC curve ranges from 0 to 1, and values closer to 1
are desired. Further details about these performance parameters can be found in Chapter 2.
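As a sketch of how these parameters relate, the following computes them from a binary confusion matrix, treating defect-prone (D) as the positive class; the counts are hypothetical:

```python
# Performance measures from a binary confusion matrix
# (tp/fp/tn/fn counts are hypothetical).

def performance(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)            # defect-prone modules caught
    fp_rate = fp / (fp + tn)           # defect-free modules flagged (extra testing)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, recall, fp_rate, f_measure

acc, rec, fpr, f1 = performance(tp=40, fp=20, tn=130, fn=10)
print(round(acc, 2), round(rec, 2), round(fpr, 2), round(f1, 2))
```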
Many studies have focused on maximizing area under the curve (Lessmann et al., 2008, Jiang
et al., 2008a,c, 2007). It has been reported in the literature that the limits of the standard goal of
optimizing area under the curve have been reached (Menzies et al., 2010). Maximizing area under
the ROC curve sometimes results in scenarios where higher values of recall must be
compromised, because high recall is obtained at the cost of a high false positive rate.
Therefore, when the objective is to predict as many defect-prone modules as possible, recall is
preferred over the area under the curve. For this reason, the present study does not focus on
optimizing the area under the curve. Instead, this thesis proposes a preprocessing
technique that works to increase the recall of defect prediction models.
1.3 Challenges
Most defect prediction models use software product metrics for prediction. One reason
for product-based models is the availability of public domain defect data (Menzies et al., 2012).
Existing studies, those based on public domain data in particular and others in general, have certain
limitations. Major limitations of the existing work have been categorized and discussed in the rest of
the section.
1.3.1 Data Related
Benchmark Datasets
In existing studies, sometimes data used for the study is not publicly available (Khoshgoftaar et al.,
1996, Quah and Thwin, 2003, Wang et al., 2004). Availability of benchmark data has long been
considered important, and benchmarking organizations like Gartner Group (GG), David
Consulting Group (DCG), International Software Benchmarking Group (ISBG), and PRedictOr Mod-
els In Software Engineering (PROMISE) have been developing software data repositories (Jones,
2008, Menzies et al., 2012). The NASA Metrics Data Program (MDP) (Facility) has been another
venture to make software data available to researchers. In 1997 ISBG was created so that the
benchmark data would also become available on CD, unlike the case of older groups (i.e. GG
and DCG). This data can now be purchased commercially from ISBG. In order to make the
data publicly available in soft form, the PROMISE repository was created in early 2004-05.
Researchers and practitioners can now obtain and use data without purchasing it commercially.
Literature suggests that experiments conducted using the public data can be replicated and that
studies reporting such experiments are more useful (Menzies et al., 2010, 2008).
Imbalanced Data
Available public datasets are imbalanced and have a larger number of defect-free modules. Imbal-
anced data is considered a problem for classification of the rare class in other domains as well (Wang
et al., 2005, Alshomrani et al., 2015, García et al., 2014). The imbalance in datasets prevents the
models from predicting defect-prone modules with better performance. The literature suggests the use
of Area Under the ROC Curve (AUC) to assess the performance of a model. AUC does not allow recall
to rise above a certain threshold because maximizing the area includes the cost of the FP rate.
Recall is another performance measure used to assess prediction performance. Recall focuses
on correct classification of defect-prone modules, which is more important for software companies
since they aim to deliver fewer defective modules.
Quality Issues in Public Datasets
There are issues related to quality and information in the public domain datasets. The datasets
sometimes contain single valued attributes and in some cases they have repeated rows (Shepperd
et al., 2013). The datasets include examples with conflicting class labels: for example, there are
software modules which have the same values for software metrics but one is
labeled as defect-prone while the others are labeled as not-defect-prone. Researchers have used such
datasets as they are, even though these issues deteriorate the performance of prediction models (Koru
and Liu, 2005a, Seliya and Khoshgoftaar, 2007, Challagulla et al., 2005).
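The checks described above can be sketched as follows, using toy rows with hypothetical metric values:

```python
# Sketch of the quality checks mentioned above: single-valued attributes,
# repeated rows, and conflicting class labels (toy data).

rows = [
    {"loc": 100, "v(g)": 5, "label": "D"},
    {"loc": 100, "v(g)": 5, "label": "ND"},   # same metrics, conflicting label
    {"loc": 40,  "v(g)": 2, "label": "ND"},
    {"loc": 40,  "v(g)": 2, "label": "ND"},   # exact repeated row
]

features = lambda r: (r["loc"], r["v(g)"])

# single-valued attributes: every row carries the same value
single_valued = [a for a in ("loc", "v(g)") if len({r[a] for r in rows}) == 1]

# repeated rows: identical features and label
seen, repeats = set(), 0
for r in rows:
    key = features(r) + (r["label"],)
    repeats += key in seen
    seen.add(key)

# conflicting labels: same features, different labels
labels_by_features = {}
for r in rows:
    labels_by_features.setdefault(features(r), set()).add(r["label"])
conflicts = sum(1 for v in labels_by_features.values() if len(v) > 1)

print(single_valued, repeats, conflicts)  # [] 1 1
```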
1.3.2 Model Related
Models’ Performance
The metrics in public data have been extensively used to propose new models. However, the
performance achieved has been stagnant for the last few years. Though data mining and
intelligent computing techniques are being used to get better performance, either the standard
goal needs modification or there is a need to collect different data with new information. This
ceiling effect (Menzies et al., 2008) on performance is observed when optimizing Area Under the
ROC Curve. Davis et al. (Davis and Goadrich, 2006) argue that use of ROC curves may not be
useful in case of imbalanced datasets. One reason for the ceiling effect could be the use of ROC
for the models developed using imbalanced datasets. Better performance can be achieved in terms
of other metrics such as Recall.
Late Predictions
Studies use code metrics for more accurate predictions because measurements collected from soft-
ware product code perform better than the design measures (Jiang et al., 2008c). The models based
on code metrics use certain code attributes that are not necessarily applicable in certain develop-
ment paradigms. In other cases the predictions are made at a later stage in the SDLC as they use
code metrics, thus reducing the benefits of early predictions (Xing et al., 2005, Jiang et al., 2007).
Such predictions are considered late. Jiang et al. (Jiang et al., 2007) have tested early prediction of
software quality using requirements metrics without any promising results. Design and code met-
rics have been used to identify the defect prone modules with a high success rate (Khoshgoftaar
et al., 1996, Jiang et al., 2007, Xing et al., 2005, Wang et al., 2004, Jiang et al., 2008c). Jiang et al.
(Jiang et al., 2008c) have compared the prediction models based on design and code metrics and
have asserted that the models based on code metrics usually perform better than the models based
on design metrics. The combination of the design and code metrics gives better prediction results
(Jiang et al., 2008c) but delays the defect prediction until the code metrics are available. Early
prediction models based on design and code metrics are difficult to develop because precise values
of the model inputs are not available. Also, in initial phases of the SDLC, data for prediction of the
same release is not easily available for development of a prediction model. Conventional prediction
techniques require exact inputs, therefore such models cannot always be used for early predictions
when exact data is not available. Innovative prediction methods that use imprecise inputs, however,
can be applied to overcome the requirement of exact inputs.
Appropriate Model Selection
In the presence of many software prediction techniques with variable range of their applications in
SDLC and programming paradigms, selection of a prediction model becomes difficult for an orga-
nization (or project manager) (Bouktif et al., 2004). Selection of a prediction model is generally
based on a number of parameters such as software metrics available, phase of software devel-
opment lifecycle, software development paradigm, domain of the software (Jones, 2008), quality
attribute to be predicted, product based and value based views of the model (Rana et al., 2008,
Wagner and Deissenboeck, 2007) and so on. Selecting a model on the basis of so many parameters
poses a problem and results in subjective selection of a prediction model. In order to reduce this
subjectivity, a generic approach is needed which can help in objectively selecting a model.
1.3.3 Other
Metrics Nomenclature
The software metrics being used in defect prediction studies have issues in their nomenclature
(Khosgoftaar and Munson, 1990, Jensen and Vairavan, 1985, Jiang et al., 2008c, Vaishnavi et al.,
2007, García et al., 2009). They have been used with inconsistent names and interpretations. For
example, the model by Guo et al. (Guo and Lyu, 2000) refers to lines of code as TC (Total Code
lines), whereas many other models (like (Khosgoftaar and Munson, 1990)) refer to it as LOC. On
the other hand, models by Khoshgoftaar et al. (Khoshgoftaar et al., 1996, Khoshgoftaar and Allen,
1999b) refer to TC as 'Total calls to other modules'. Similarly, other than some well-known metrics
like Halstead's program volume (V) and effort (E) (Halstead, 1977), and total Lines of Code, many
metrics have been used with different names by different researchers. Such inconsistencies make
the existing models difficult to compare, and the significance of uniform naming and definitions has
been highlighted in the literature (Vaishnavi et al., 2007, García et al., 2009).
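One way to reconcile such naming inconsistencies is an alias table that maps author-specific names to one canonical name; the table below is a hypothetical sketch drawn from the examples above, and its canonical names are a chosen convention:

```python
# Hypothetical alias table: author-specific metric names -> canonical names.

ALIASES = {
    "TC": "LOC",                 # Guo and Lyu: Total Code lines
    "Total Code lines": "LOC",
    "lines of code": "LOC",
    "program volume": "V",       # Halstead
    "effort": "E",               # Halstead
}

def canonical(name: str) -> str:
    return ALIASES.get(name, name)

print(canonical("TC"), canonical("program volume"), canonical("V"))  # LOC V V
```

Note that a flat table alone cannot resolve genuinely ambiguous names such as TC, which some authors use for 'Total calls to other modules'; resolving those requires knowing the source model, which is part of what a unified metrics database must record.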
Causality of Defects
A major objection to the models based on software product metrics has been that these models do not
focus on the causes of defects, and hence do not find a causal relationship between the software metrics
and defective software modules (Fenton and Neil, 1999). Studies to find the causal relationship
have also been conducted, for example (Fenton et al., 2008, Fenton and Neil, 1999, Fenton et al.,
2007a,c,b). However, use of product metrics for this purpose has not decreased. The studies to find
causal relationships require expert judgements which are difficult to obtain. At the same time,
the data required for causal analysis is not available in the manner the data of product metrics is
available. The available public data has resolved the issue of lack of data but has introduced certain
new challenges such as imbalanced classes and poor data quality.
Generality of Approaches
Though some attempts have been made to develop generic approaches to predict software quality
(Rana et al., 2008, Bouktif et al., 2006), there are limitations due to different interpretations,
nomenclature, and representations of model input parameters. For example, numerous product met-
rics like program Volume (V) and total Lines of Code (LOC), have been used with different names
by different researchers which has generated inconsistency (Dick and Kandel, 2003, Jensen and
Vairavan, 1985, Khosgoftaar and Munson, 1990, Koru and Liu, 2005a, Li et al., 2006, Ottenstein,
1979). The generic approach presented in (Rana et al., 2008) requires such inconsistencies to be
removed.
1.4 Our Approach
From the challenges described in the previous section, this thesis mainly addresses three limitations
shown in Figure 1.3. The first limitation is handled by finding associations between software
metrics and software defects. The second challenge is addressed by using fuzzy logic and the third
issue is resolved by standardizing naming of software metrics. This thesis focuses on software
Fig. 1.3: Challenges addressed in this thesis
product metrics. The thesis does not discuss software process metrics. It also does not discuss
product metrics which are derived.
1.4.1 Preprocessing Imbalanced Datasets
Association between software metrics and software defects can be found and used for better pre-
diction models in the presence of imbalanced datasets. This study uses Association Rule Mining
(ARM) to study the relationship between software metrics and software defects using public do-
main datasets (Menzies et al., 2012) as shown in Figure 1.4. Association Rule Mining (ARM) is an
important data mining technique and is employed for discovering interesting relationships between
variables in large databases (Sha and Chen, 2011). It aims to extract interesting correlations, fre-
quent patterns, associations or causal structures among sets of items in the transaction databases
or other data repositories (Sotiris and Dimitris, 2006). The study identifies the attributes and their
ranges that co-occur with defects. These ranges, termed focused itemsets, can be used for better
planning of resources and testing, and can further help study the relationship between metric
ranges and software defects. The thesis also shortlists certain attributes and ranges that do not nec-
essarily cause defects. Frequency of attributes with critical ranges is calculated across all datasets
under study to find the importance of each attribute. Attributes with critical ranges are more important
than attributes with indifferent ranges. The focused itemsets have been used to preprocess five
public datasets. Performance of Naive Bayes (NB), the best defect prediction model over these five
preprocessed datasets, has increased by up to 40%. The approach has not decreased the performance
of the model at any instance.
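A minimal sketch of this kind of association mining, with hypothetical metric-range transactions and thresholds (not the thesis's actual datasets or parameters):

```python
# Find metric-range items that co-occur with defects above
# support/confidence thresholds (transactions are hypothetical).

from itertools import combinations

transactions = [
    {"loc=high", "v(g)=high", "defective"},
    {"loc=high", "v(g)=high", "defective"},
    {"loc=high", "v(g)=low"},
    {"loc=low", "v(g)=low"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def rules_to_defect(min_support=0.3, min_confidence=0.6):
    """Rules of the form {metric ranges} -> defective."""
    items = {i for t in transactions for i in t} - {"defective"}
    found = []
    for k in (1, 2):
        for antecedent in combinations(sorted(items), k):
            a = set(antecedent)
            s = support(a | {"defective"})
            if s >= min_support:
                conf = s / support(a)
                if conf >= min_confidence:
                    found.append((antecedent, round(conf, 2)))
    return found

print(rules_to_defect())
```

The high-confidence antecedents play the role of focused itemsets: metric ranges in whose presence defects are frequent.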
1.4.2 Early Prediction of Defects
In order to address the issue of late predictions, this thesis presents a model that works with impre-
cise inputs. This model is a fuzzy inference system (FIS) that predicts defect proneness in software
using vague inputs defined as fuzzy linguistic variables and applies the model to real datasets, as
mentioned in Figure 1.4. The model can be used for early rough estimates when exact values of
software measurements are not available. Performance analysis in terms of recall, accuracy, mis-
classification rate, and other measures shows the usefulness of the FIS application. Predictions by
the FIS model at an early stage have been compared with conventional prediction methods (i.e.
classification trees, linear regression and neural networks) that use exact inputs. For the FIS
model, the maximum and the minimum performance shortfalls were noticed for true negative rate
(TNRate) and F-measure respectively. For recall, however, the FIS model performed better
than the other models even with the imprecise inputs. Work by Yang et al. (Yang et al., 2007) is
similar to ours in the sense that they also intend to predict the quality when exact values of software
metrics are not known and the domain knowledge and experience of the project managers can be
used to approximate the metrics values. Furthermore, they also intend to find the rule set which has
the capability to reason under uncertainty, which is a limitation of most of the quality prediction
models (Fenton and Neil, 1999). This study also extracts some rules in an attempt to overcome the
ceiling effect problem (Menzies et al., 2008) in the defect prediction models.
1.4.3 Resolving Software Metrics Nomenclature Issues
The present work removes inconsistencies found in the naming of software product metrics and unifies them by introducing a Unified Metrics Database (UMD). The UMD will be helpful for software managers who need to identify which datasets are similar to their problem domain or decide which metrics to use for their projects. It can further support future studies on software product metrics and the development of prediction models based on these metrics. As part of the development of the UMD, this thesis identifies two types of inconsistencies in the naming of attributes and presents a resolution framework. The suggested framework resolves these inconsistencies on the basis of the definition of each metric and the chronological usage of metric labels. The thesis also identifies metrics inappropriately used for defect prediction and shows that these metrics are ineffective for predicting defects in the object-oriented paradigm (Rana et al., 2009b).

Fig. 1.4: Our approach to address challenges
1.5 Our Contributions
Approach presented in the previous section has addressed the challenges in the domain of study
and has provided encouraging results. Figure 1.5 indicates our contributions when addressing
the aforementioned challenges. The association mining based preprocessing has increased perfor-
mance (Recall) of an SDPM by 40%. Fuzzy inference system developed for early prediction of
defects gives performance comparable to models that need precise inputs. The third challenge has
been addressed by developing a database of metrics with unified names. As mentioned earlier the
software defect datasets are imbalanced. This imbalance affects prediction of defect-prone mod-
ules. This thesis suggests that association between software metrics and software defects can be
used to improve prediction of defect-prone modules. The thesis proposes a preprocessing approach
that has resulted in improved Recall of Naive Bayes based model by 40% as indicated in Figure
19
1.5. These results have been obtained by experimenting with 5 datasets.
The issue of late predictions has been addressed by developing a model that can work when exact values of software metrics are difficult to collect but rough estimates of the metric values can be made. These rough estimates can be used to predict software defects using the proposed fuzzy inference system, which is based on Sugeno's fuzzy inference principle, as shown in Figure 1.5. The model has been developed for 2 datasets.
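A zero-order Sugeno system of the kind described above can be sketched as follows. The membership functions, rule base, and crisp consequents here are illustrative assumptions, not the calibrated model developed in this thesis; the inputs are normalized linguistic variables ("low"/"high" size and complexity) and the output is a weighted average of the rule consequents.

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b over the interval [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def sugeno_defect_score(size, complexity):
    """Zero-order Sugeno inference on two inputs in [0, 1].

    Returns a defect-proneness score in [0, 1]. Membership functions and
    rule consequents are illustrative choices, not fitted parameters.
    """
    low = lambda x: tri(x, -0.5, 0.0, 0.6)
    high = lambda x: tri(x, 0.4, 1.0, 1.5)
    # Each rule: (firing strength via the min t-norm, crisp consequent).
    rules = [
        (min(low(size), low(complexity)), 0.1),    # small & simple -> unlikely defect
        (min(low(size), high(complexity)), 0.6),
        (min(high(size), low(complexity)), 0.5),
        (min(high(size), high(complexity)), 0.9),  # large & complex -> likely defect
    ]
    num = sum(w * z for w, z in rules)
    den = sum(w for w, _ in rules)
    return num / den if den else 0.0
```

A manager's rough estimate such as "size is high, complexity is high" maps to `sugeno_defect_score(0.9, 0.9)`, which returns a high score, while "low, low" returns a low one.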
The issue of differing nomenclature for software metrics has been resolved by suggesting and applying a framework that preserves, for each metric, the name with the earliest chronological association. More than 140 software metrics have been categorized using this framework and a unified database has been developed, as specified in Figure 1.5.
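The chronological part of the resolution framework can be illustrated with a small sketch: among labels that share a metric definition, the earliest label is kept as canonical. The tuple format and the `definition_key` that groups labels with matching definitions are hypothetical simplifications; in practice, deciding that two definitions match requires the manual analysis described later in the thesis.

```python
def unify_metrics(usages):
    """Map each metric alias to a canonical label.

    usages: list of (year, label, definition_key) tuples, where
    definition_key identifies labels that share the same definition
    (an assumption for this sketch). The canonical label for a
    definition is the chronologically earliest label used for it.
    Returns {alias: canonical_label}.
    """
    earliest = {}  # definition_key -> (year, label)
    for year, label, key in usages:
        if key not in earliest or year < earliest[key][0]:
            earliest[key] = (year, label)
    return {label: earliest[key][1] for _, label, key in usages}
```

For example, if `CC` and `cyclo` were later labels for the metric first published as `v(G)`, all three aliases resolve to `v(G)`.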
In addition to the above contributions, a tool for practitioners and researchers has been developed that helps its users predict defects using 11 models. The tool also allows users to aggregate model results so that more than one model can be benefitted from. Users can compare the performance of different models on benchmark datasets, specify their scenario, and select the most suitable model for it. Users can also provide a dataset and find the public domain dataset most similar to theirs.
1.6 Thesis Layout
The rest of the thesis is organized as follows:
Chapter 2 presents the background of the software defect prediction domain, the metrics and datasets used for prediction, and a survey of software prediction and related studies. The survey includes studies with models based on data mining, machine learning, and statistical techniques. The models used for defect prediction include linear regression, pace regression, support vector regression, logistic regression, neural networks, naive Bayes, Bayesian networks, instance-based learning, J48 trees, linear and quadratic discriminant analysis, k-nearest neighbors, support vector machines, classification and regression trees, and random forests. The chapter also describes the prediction studies based on these models and their limitations.
Chapter 3 presents the comparative study performed to select the best model from the existing ones. It presents, in detail, a comparative study of defect prediction models and the criteria to select the best model for improving Recall. This (replicated) study is inspired by the work of Lessmann et al. (Lessmann et al., 2008) and uses their method and statistics for comparison. The study presented in Chapter 3 compares three additional models with the two models they adjudged the best performing in defect prediction.

Fig. 1.5: Our contributions
Chapter 4 presents an association mining based approach to preprocess data and improve the Recall of defect prediction models. Most of the public datasets (Menzies et al., 2012) have a significantly larger number of examples corresponding to Not Defect-prone (ND) modules than examples corresponding to Defect-prone (D) modules, which prevents a model from learning D modules effectively. The preprocessing approach presented in this chapter allows models to learn D modules in imbalanced defect datasets. The chapter presents the results of developing a Naive Bayes (NB) classifier (one of the best classifiers, along with Random Forests, in the field of defect prediction (Menzies et al., 2010, Lessmann et al., 2008)) for five datasets. Up to 40% performance gain has been observed in terms of Recall. The stability of the approach has been tested by running the algorithm with different numbers of bins.
A fuzzy logic based model for early prediction of defect-prone modules is given in Chapter 5. It is useful to predict defect-prone modules before code metrics are available, since defects caught later in the lifecycle have higher costs of fixing. This chapter proposes a model that works with imprecise inputs and can be used for early rough estimates when exact values of software measurements are not available.
Chapter 6 addresses limitations due to differing interpretations, nomenclature, and representation of model input parameters. The chapter identifies two types of inconsistencies in the naming of software product metrics and presents a resolution framework that resolves them on the basis of the definition of each metric and the chronological usage of its name. A Unified Metrics Database (UMD) is also introduced in this chapter. Moreover, the chapter identifies metrics that are not necessarily applicable in certain development paradigms (for example, the object-oriented paradigm) and shows that these metrics degrade the performance of prediction models.
Chapter 7 presents the conclusion and future directions.
Appendix A presents a tool developed during the current study. The prototype tool, named PraTo, has been developed to address the aforementioned limitations and is envisaged to increase the utility of prediction models for the software industry. PraTo has a collection of over 15 defect prediction techniques and a number of public datasets. The public datasets are available in original as well as preprocessed form to improve the performance of the models to be built. PraTo also enables a user to select a suitable model for a given scenario by asking the user to provide a context; based on the context, the most suitable model is used for prediction. A user may specify a scenario in the following ways:
1. Compare a dataset with the public datasets. The closest public dataset defines the user's scenario, and the model with the best performance, in terms of recall, on that public dataset is selected.
2. Provide a set of constraints in terms of Human Resource, Budget and Time. Based on the
severity of the constraints, values of the performance measures are determined. The model
with the best values of the performance measures is selected.
3. Select three model performance measures. Provide relative importance of the measures. An
Analytic Hierarchical Processing (AHP) based technique is applied to select a model.
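The AHP step in the third option can be sketched as follows. The pairwise-comparison matrix, the geometric-mean approximation of the priority vector, and the weighted-sum ranking are standard AHP ingredients but illustrative simplifications here, not PraTo's actual implementation.

```python
def ahp_weights(pairwise):
    """Priority weights from a reciprocal pairwise-comparison matrix,
    using the geometric-mean (logarithmic least squares) approximation."""
    n = len(pairwise)
    geo_means = []
    for row in pairwise:
        prod = 1.0
        for v in row:
            prod *= v
        geo_means.append(prod ** (1.0 / n))
    total = sum(geo_means)
    return [g / total for g in geo_means]

def rank_models(measure_weights, model_scores):
    """Pick the model whose weighted sum of (normalized) measures is best."""
    return max(model_scores,
               key=lambda m: sum(w * s for w, s in zip(measure_weights,
                                                       model_scores[m])))
```

For example, if recall is judged 3 times as important as accuracy and 5 times as important as precision, the matrix `[[1, 3, 5], [1/3, 1, 3], [1/5, 1/3, 1]]` yields weights near (0.64, 0.26, 0.10), and the model scoring best under those weights is selected.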
Appendix A also suggests an approach composed of existing models (called component models). Given a software system described by its collected metrics, the approach selects the most appropriate applicable model to predict the desired quality factor value. Like the integrated approach suggested by Wagner and Deissenboeck (Wagner and Deissenboeck, 2007), our approach also requires prior analysis of the important quality factors. The proposed approach automates the selection of an appropriate prediction model and calculates the desired quality factor on the basis of the selected model. To propose this approach we have studied the techniques applied for software quality prediction in the structural and object-oriented programming paradigms, and we have presented a logical grouping of the dimensions along which a prediction technique should be selected. This multidimensional model helps quality managers in the selection of a prediction method.
Appendix B lists the software metrics as they appear in a Unified Metrics Database (UMD) and
includes base as well as derived metrics.
1.7 Summary
Software quality has multiple facets, and the defectiveness of a software module is one of them. Early information regarding the defectiveness of software modules helps in resource planning and in focusing on potentially problematic areas of the software. Defect prediction models are used to obtain this defectiveness information beforehand: they classify software modules as defect-prone or not-defect-prone based on software metrics. The availability of public datasets in the PROMISE repository (Menzies et al., 2012) has resulted in numerous defect prediction studies.
There are many challenges faced by the defect prediction community such as:
1. Public domain datasets being used to validate defect prediction models are imbalanced.
Models developed using these datasets are unable to achieve very high Recall.
2. Most of the studies use code metrics available in public datasets. Predictions based on code
metrics are late.
3. Available datasets and software defect prediction studies use different nomenclature for cer-
tain metrics.
4. Relationship between software metrics and defects is unknown.
5. Existence of numerous models has made it difficult to select an appropriate prediction model.
6. There are quality issues in the public domain datasets: they sometimes contain single-valued attributes and, in some cases, repeated rows.

7. The datasets include examples with conflicting class labels; for example, more than one software module may have the same values for all software metrics while one is labeled defect-prone and the others not-defect-prone.
The above list of challenges in the field of software quality prediction in general, and software defect prediction in particular, is not exhaustive. Other issues are being addressed by numerous studies that mine software repositories to improve software quality. The present work is one such study; it addresses the first three limitations and presents remedies in the following chapters.
2. BACKGROUND STUDY AND LITERATURE SURVEY
A large number of Software Defect Prediction (SDP) models have been developed and reported in
literature. This chapter briefly describes software defect prediction and discusses datasets used for
development of the SDP models. This chapter also presents a survey of various software quality
prediction studies. Furthermore, studies related to evaluation of defect prediction models and
handling inconsistencies in software measurements terminology are also discussed. The chapter
also characterizes viewpoints found in the literature, and highlights limitations and issues in the
domain of software defect prediction.
2.1 Software Defect Prediction
Software defect prediction is the estimation of the defect proneness of a software system (or its modules). The prediction is made using a model that takes software metrics as input and provides defect proneness information as output, as shown in Figure 2.1. Software metrics are data collected about the software during its lifecycle, for example data about the structural complexity of the software, coupling between software modules, and cohesion of software modules. These data are sometimes made available to support defect prediction studies; many software development organizations choose not to share their software related data publicly, and hence their data is used in a limited number of defect prediction studies only. Prediction models are typically based on artificial intelligence techniques. Software defect prediction models are usually evaluated using a confusion matrix (Jiawei and Micheline, 2002); the area under the Receiver Operating Characteristic (ROC) curve is another measure used to gauge their performance.
The following sections first discuss software defect datasets, then explain the model evaluation parameters, and afterwards present a summary of defect prediction studies.
Fig. 2.1: Software Defect Prediction
2.2 Software Defect Data
Software defect datasets provide software measurements and defectiveness information as attributes. Measurements are represented as software metrics, where each metric becomes an attribute of the dataset, and the presence or absence of defects is provided as an additional attribute. A dataset is a collection of records, each representing a software system or a software module.

The datasets used in defect prediction studies describe the defectiveness of various software modules. Each instance in a dataset represents a software module, and the attributes are the software metrics calculated for that module. The class attribute in each dataset indicates whether the module is defect-prone (D) or not-defect-prone (ND). Given the number of available datasets and metrics, it is not possible to describe all of them in a single study. We have selected 17 datasets based on their use in the literature. The selected datasets have similar attributes, which makes comparison relatively easy. Table 2.1 lists these datasets and their characteristics, such as the number of instances in each dataset, the percentage of (instances of) ND modules, and the development language of the software. The percentages show that the datasets are dominated by ND modules.
Tab. 2.1: Selected Datasets and their Characteristics

Dataset          Language  No. of Instances  ND Modules (Approx. %)  No. of Attributes  Group
CM1              C              498                90                       22             1
JM1              C            10885                81                       22             1
KC1              C++           2109                85                       22             1
KC2              C++            522                80                       22             1
PC1              C             1109                93                       22             1
PC3              C             1563                90                       38             2
PC4              C             1458                88                       38             2
MW1              C++            403                92                       38             2
AR1              C              121                93                       30             3
AR4              C              107                81                       30             3
AR6              C              101                85                       30             3
PC5              C++          17186                97                       39             4
MC1              C++           9466                99                       39             4
KC3              Java           458                91                       40             5
KC1-class-level  C++            145                59                       95             6
jEdit-Bug-Data   Java           274                                          8             6
These datasets differ from each other in at least two ways: they use different attributes (software metrics), and they have different numbers of attributes. The datasets have been divided into six groups based on the number of common attributes, as shown in Table 2.1. Group 6 consists of 2 datasets that have no metrics in common with the other datasets and use metrics from the object-oriented paradigm. This grouping is used again when evaluating the suggested preprocessing approach in Chapter 4. Certain variants of the datasets have also been used, but their use is not part of the main thesis. Another grouping, and the preparation of variants of the datasets, can be found in Appendix A.
Table 2.2 provides a list of the metrics used in these datasets; descriptions of other metrics can be found in Appendix B. Table 2.3 provides information on the metrics used in each dataset. A cross (×) in a cell indicates the presence of a metric in the corresponding dataset. Metrics information for datasets not listed in this table can be found in the respective chapters.
Tab. 2.2: Selected Software metrics reported in the datasets used
Software Metric Name
v(G) McCabe’s Cyclomatic Complexity
ev(G) Essential Complexity
iv(G) Design Complexity
LOC McCabe’s Lines of Code
V Halstead Program Volume
L Halstead Program Level
D Halstead Program Difficulty
I Halstead Intelligence Content
E Halstead Effort
B Halstead Error Estimate
T Halstead Programming Time
N Halstead Program Length in terms of Total Op and Total Opnd
LOCode Halstead’s Lines of Code
LOComment Lines of Comment
LOBlank Blank Lines
LOCodeAndComment Lines of Code and Comment
Uniq Op Unique Operators
Uniq Opnd Unique Operands
Total Op Total Operators
Total Opnd Total Operands
BranchCount Total Branches in Program
Tab. 2.3: Datasets and their attributes used in this study

Attribute  CM1 JM1 KC1 KC2 KC3 PC1 PC3 PC4 PC5 MC1 MW1 AR1 AR4 AR6

loc × × × × × × × × × × × × × ×
v(g) × × × × × × × × × × × × × ×
ev(g) × × × × × × × × × × ×
iv(g) × × × × × × × × × × × × × ×
CALL PAIRS × × × × × × ×
CONDITION COUNT × × × × × × × × ×
CYCLOMATIC DENSITY × × × × × × × × ×
DECISION COUNT × × × × × × × × ×
DECISION DENSITY × × × × × × ×
DESIGN DENSITY × × × × × × × × ×
EDGE COUNT × × × × × ×
ESSENTIAL DENSITY × × × × × ×
LOC EXECUTABLE × × × × × × × × ×
PARAMETER COUNT × × × × × ×
GLOBAL DATA COMPLEXITY × × ×
GLOBAL DATA DENSITY × × ×
n × × × × × × × × × × × × × ×
v × × × × × × × × × × × × × ×
l × × × × × × × × × × × × × ×
d × × × × × × × × × × × × × ×
i × × × × × × × ×
e × × × × × × × × × × × × × ×
b × × × × × × × × × × × × × ×
t × × × × × × × × × × × × × ×
lOCode × × × × × × × ×
lOComment × × × × × × × × × × × × × ×
lOBlank × × × × × × × × × × × × × ×
locCodeAndComment × × × × × × × × × × × × × ×
MAINTENANCE SEVERITY × × × × × ×
MODIFIED CONDITION COUNT × × × × × ×
MULTIPLE CONDITION COUNT × × × × × × × × ×
NODE COUNT × × × × × ×
NORMALIZED CYCLOMATIC COMPLEXITY × × × × × × × × ×
uniq Op × × × × × × × × × × × × × ×
uniq Opnd × × × × × × × × × × × × × ×
total Op × × × × × × × × × × × × × ×
total Opnd × × × × × × × × × × × × × ×
branchCount × × × × × × × × × × × × × ×
PERCENT COMMENTS × × × × × ×
defects × × × × × × × × × × × × × ×
Collecting and extracting data is a complex process, and the extracted data is often of poor quality: there may be features or data points that affect the results of a data mining procedure. The use of public datasets has nevertheless been frequent in defect prediction studies. These datasets have been used despite critiques of their quality (Gray et al., 2011, Shepperd et al., 2013) because they are publicly available in the PROMISE repository (Menzies et al., 2012). One major problem in these data sets is repeated data points, i.e. observations of module metrics and their corresponding fault data occurring more than once. This may lead to over-fitting of classification models: as the proportion of repeated information increases, so does the apparent performance of the classifier. It is worth pointing out that the severity of repeated data points is algorithm specific; Naive Bayes classifiers, for example, have been reported to be fairly resilient to duplicates (Kolcz et al., 2003). A very simple oversampling technique is to duplicate minority class data points during training; however, these duplicated points should not be included in the testing data set. Another problem is imbalanced data sets: the class distribution of faulty and faultless modules is skewed towards faultless modules, although only some classifiers (like C4.5) are strongly affected by the class imbalance problem. Then there is the problem of inconsistent patterns, which occur when the same set of metric values describes both a 'defective' module and a 'non-defective' module. They typically introduce an upper bound on the achievable level of performance but are otherwise harmless. The presence of constant attributes does not seem to pose problems to the learners, and only data set KC4 contains the problem of repeated attributes. Missing values may or may not be problematic depending on the type of classification method used. It is not possible to check for correlated attributes, since the source code for the data sets is not available.
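The remedies mentioned above, dropping exact duplicates and constant attributes and then oversampling the minority class (on training data only), can be sketched as follows; the function and its interface are hypothetical, not part of any cited study.

```python
import random

def clean_and_balance(rows, labels, seed=0):
    """Drop duplicate rows and single-valued attributes, then duplicate
    minority-class rows until classes are balanced (training data only)."""
    # Remove exact duplicates (metric vector + label).
    seen, X, y = set(), [], []
    for row, lab in zip(rows, labels):
        key = (tuple(row), lab)
        if key not in seen:
            seen.add(key)
            X.append(list(row))
            y.append(lab)
    # Drop constant (single-valued) attributes.
    keep = [j for j in range(len(X[0])) if len({r[j] for r in X}) > 1]
    X = [[r[j] for j in keep] for r in X]
    # Random oversampling of the minority class by duplication.
    rng = random.Random(seed)
    minority = min(set(y), key=y.count)
    pool = [(r, lab) for r, lab in zip(X, y) if lab == minority]
    while y.count(minority) < len(y) - y.count(minority):
        r, lab = rng.choice(pool)
        X.append(list(r))
        y.append(lab)
    return X, y
```

Duplicated minority rows must never leak into the test split, for the reason given above: repeated information inflates the apparent performance of the classifier.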
Fig. 2.2: Confusion Matrix of a Defect Prediction Classifier
2.3 Measuring Performance of SDP Models
This section describes the performance indicators used to evaluate defect prediction models. In the rest of the section, the confusion matrix is described, the process of ROC curve analysis is presented, and AUC is explained as an evaluation measure. Furthermore, the definition and importance of Recall as a measure are also provided.
2.3.1 Confusion Matrix
Binary classification is a two-class prediction problem in which the outcomes are labeled as either positive or negative. For defect prediction, binary classification maps every instance of a data set containing defective and non-defective modules to the defective or non-defective class. Four outcomes are possible. If a defective module is classified as defective, it is called a true positive (TP); if a non-defective module is classified as defective, it is called a false positive (FP). Similarly, if a non-defective module is classified as non-defective, it is called a true negative (TN); if a defective module is classified as non-defective, it is called a false negative (FN). Figure 2.2 shows a confusion matrix based on these outcomes.
The probability of detection (PD) is defined as the probability that a defective module is correctly classified as defective. It is also known as sensitivity, recall, true positive rate (TPR), rate of detection, and hit rate.

PD = TP / (TP + FN)    (2.1)
The probability of false alarm (PF) is defined as the proportion of non-defective modules that are incorrectly classified as defective. It is also known as the false positive rate (FPR) and the false alarm rate, and equals one minus the specificity.

PF = FP / (FP + TN)    (2.2)
Both these measures and other similar measures (e.g. precision, accuracy etc.) are derived
from the confusion matrix and tell a one-sided story (Lessmann et al., 2008, Jiang et al., 2008a,
Metz, 1978). Some measures like J-coefficient, F-measure and G-mean, are also derived from the
confusion matrix and are more useful as reported in (Jiang et al., 2008a). However, they still do
not aid in simple classifier comparison and selection.
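As a minimal illustration, the indicators discussed in this section can be derived directly from the four confusion-matrix counts; the dictionary keys below follow the names used in this chapter.

```python
def confusion_measures(tp, fp, tn, fn):
    """Common performance indicators from confusion-matrix counts
    (PD and PF as in Equations 2.1 and 2.2)."""
    pd = tp / (tp + fn)               # probability of detection (recall)
    pf = fp / (fp + tn)               # probability of false alarm
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * pd / (precision + pd)
    return {"PD": pd, "PF": pf, "precision": precision,
            "accuracy": accuracy, "F": f_measure}
```

For example, a classifier that finds 40 of 60 defective modules while raising 10 false alarms on 90 non-defective ones has PD = 0.67 and PF = 0.11, illustrating the one-sided story each single measure tells.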
2.3.2 ROC Curves and AUC
As opposed to the performance parameters mentioned in Section 2.3.1, Receiver Operating Characteristic (ROC) curves enable an objective and simple comparison between algorithms, and the results can be compared with other studies that also use ROC curve analysis for model evaluation. Many classifiers let users alter a threshold parameter so that a family of models can be generated: a higher PD can be obtained, but at the cost of a higher PF, and vice versa. The (PF, PD) pairs obtained by altering the threshold form an ROC curve with PF on the x-axis and PD on the y-axis, as shown in Figure 2.3. Every ROC curve passes through the points (0, 0) and (1, 1): the former represents a classifier that always predicts a module as non-defective, while the latter represents a classifier that always predicts a module as defective. The ultimate goal is the upper-left corner (0, 1), which represents a model that identifies all defective modules without raising any false alarm. The advantage of ROC curves is that they are robust to imbalanced data sets and to changing or asymmetric misclassification costs, features characteristic of software prediction tasks; therefore, ROC curves are particularly well suited to this task.
The Area Under the ROC Curve (AUC) is a common scalar measure of classifier accuracy derived from ROC curves built on the same data. Higher AUC values mean that the curve lies closer to the upper-left portion of the ROC graph. If AUC = 1, the classifier is perfect; in general, a value well above 0.5 indicates that the model is effective and gives valuable advice as to which modules should be focused on during testing and debugging of the respective software system.

Fig. 2.3: ROC curve of three classifiers with best performance of C1
Source: http://gim.unmc.edu/dxtests/ROC3.htm

The AUC measure is especially useful for imbalanced data sets. Its use can greatly improve convergence across defect prediction studies because it measures only the accuracy of prediction and does not take into account operating conditions such as class and cost distributions; it thus represents an objective and general measure for reporting prediction performance. It also has a clear statistical meaning: it is the probability that the classifier ranks a randomly chosen defective module higher than a randomly chosen non-defective one. For more information on ROC curves and AUC in software defect prediction, the reader is referred to (Lessmann et al., 2008, Jiang et al., 2008a, Demsar, 2006); for ROC curves in general, (Metz, 1978) provides a comprehensive analysis.
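The statistical meaning of AUC suggests a direct, if quadratic-time, way to compute it from classifier scores, counting ties as half; this sketch is illustrative, whereas practical implementations integrate the ROC curve or use a rank-sum formulation.

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen defective module
    receives a higher score than a randomly chosen non-defective one
    (ties count as half a win)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

A classifier that scores every defective module above every non-defective one attains AUC = 1.0; one that scores them identically attains 0.5, matching the interpretation given above.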
2.4 Software Defect Prediction Studies
Initial work in the area of quality prediction was limited to the relationship between software science metrics (Halstead, 1977) and the number of errors in the software, for example (Ottenstein, 1979, 1981, Schneider, 1981). In 1977, Halstead (Halstead, 1977) introduced software complexity metrics, also called software science metrics, which laid a base for the study of software metrics and quality.
Early work on quality prediction (Khosgoftaar and Munson, 1990, Ottenstein, 1979, 1981) was of an empirical nature. It was inspired by mathematical models and laid a base for the use of statistics and probability in quality prediction.

In the 1990s, the focus of the studies changed from predicting the number of defects to classification and/or identification of fault-prone modules (Briand et al., 1993, Ohlsson and Alberg, 1996, Munson and Khosgoftaar, 1992, Khosgoftaar et al., 1994, Khoshgoftaar et al., 1997a). Predictions can be made with the help of software metrics based quality models (Wang et al., 2007). These models either predict a certain quality factor, such as the number of errors, or classify modules as fault-prone or not fault-prone (Wang et al., 2007). Models that output a quality factor are considered prediction models (Wang et al., 2007), whereas the latter are termed classification models (Wang et al., 2007, Gokhale and Lyu, 1997). Various artificial intelligence based and mathematically and statistically inspired techniques, for example neural networks, discriminant analysis, and optimized set reduction, have been used for prediction (Munson and Khosgoftaar, 1992, Briand et al., 1993, Khosgoftaar et al., 1994, Khoshgoftaar et al., 1996, Khoshgoftaar and Allen, 1999a, Ganesan et al., 2000, Khoshgoftaar and Allen, 1999b). The rest of the section provides a survey of different defect prediction studies.
2.4.1 Factor and Regression Analysis
Khoshgoftaar and Munson (Khosgoftaar and Munson, 1990) have developed prediction models on the basis of factor and regression analysis. Their model captured the relationship between program error measures and software complexity metrics. They investigated the relationship between program error measures and the metrics shortlisted (on the basis of factor analysis) from the software science metrics (Halstead, 1977). Factor analysis was employed to reduce the set of all collected metrics to a smaller set of highly related metrics, and regression analysis was then performed on that smaller set to estimate the number of errors a software module might contain.
2.4.2 Discriminant Analysis and Principal Component Analysis
Munson and Khoshgoftaar (Munson and Khosgoftaar, 1992) have classified program modules as fault-prone or not fault-prone using discriminant analysis, categorizing programs on the basis of uncorrelated complexity and software metrics. Earlier studies had shown that software complexity metrics and errors during the software life-cycle are related (Endres, 1975, Basili and Perricone, 1984, Chmura et al., 1990), so they developed an assignment model based on complexity information. They used historical development metrics and quality data to build the predictive model and then used it to categorize modules according to their metric profiles. A Principal Component Analysis (PCA) based approach was first employed to extract uncorrelated metrics, and discriminant analysis was then applied on the basis of those uncorrelated metrics to classify the modules.
Khoshgoftaar et al. (Khoshgoftaar et al., 1996) have used PCA and discriminant analysis to
find the set of most significant metrics and to classify modules as fault-prone or not fault-prone.
They kept the process and product metrics as independent variables; the class of a module, either
fault-prone or not fault-prone, was the dependent variable. They have conducted their analysis on
data from a large telecommunication system. They performed PCA on the process and product
metrics to determine which metrics are significant for quality and which are not. The PCA, in their
model, separates the highly correlated data from the uncorrelated data. The uncorrelated data
forms the principal components, which represent the same data in a new coordinate system. These
principal components, which serve as domain metrics, are then input to the classification model,
a non-parametric discriminant analysis based model. They developed two models, for which
misclassification rates of up to 31.1% and 22.6% were observed respectively.
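The two-stage pipeline described above, PCA to decorrelate the metrics followed by a discriminant rule on the principal components, can be sketched as follows. The metric values and labels are synthetic illustrations, and a simple nearest-class-mean rule stands in for the non-parametric discriminant analysis of the original study.

```python
import numpy as np

def pca(X, k):
    """Project metric vectors onto the top-k principal components."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]     # keep the k largest components
    return Xc @ vecs[:, order]

def nearest_mean_classify(Z, y, Znew):
    """Toy discriminant rule: assign each module to the class whose
    mean, in principal-component space, is closest."""
    means = {c: Z[y == c].mean(axis=0) for c in np.unique(y)}
    return np.array([min(means, key=lambda c: np.linalg.norm(z - means[c]))
                     for z in Znew])

# Hypothetical module metrics: [LOC, cyclomatic complexity, fan-out, churn]
X = np.array([[120, 4, 2, 10], [150, 5, 3, 12], [110, 3, 2, 8],
              [900, 30, 15, 90], [950, 35, 14, 95], [880, 28, 16, 85]], float)
y = np.array([0, 0, 0, 1, 1, 1])   # 0 = not fault-prone, 1 = fault-prone

Z = pca(X, 2)                      # uncorrelated "domain metrics"
preds = nearest_mean_classify(Z, y, Z)
print(preds)
```

The two principal components here retain almost all of the variance of the four correlated metrics, which is why the trivial discriminant rule separates the two groups cleanly.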
2.4.3 Bayesian Models
Abe et al. (Abe et al., 2006) have applied a Bayesian classifier to project metrics to predict the
quality of a completed software project. To select the metrics to which the Bayesian classifier is
applied, they have suggested and used the Wilcoxon rank sum test (Mann and Whitney, 1947) on
each metric. This test compares the locations of two populations and, based on that comparison,
indicates whether one population differs from the other. They have applied this test to find
statistical differences between the distributions of successful and unsuccessful projects; such
differences can single out the metrics which affect quality. After applying this test they selected 10
metrics, out of 29, which can contribute towards the quality of the software. They then applied the
classifier to predict whether a given software project is successful (of good quality) or not.
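The metric-selection step can be illustrated with a small sketch. The rank-sum statistic below uses the standard normal approximation, and the per-project metric values are hypothetical; the original study's 29 metrics and project data are not reproduced here.

```python
import math

def rank_sum_z(a, b):
    """Normal-approximation z statistic of the Wilcoxon rank sum test:
    rank the pooled sample and compare the rank sum of group `a`
    against its expectation under the null hypothesis."""
    pooled = sorted((v, g) for g, vals in ((0, a), (1, b)) for v in vals)
    ranks, i = {}, 0
    while i < len(pooled):                 # assign average ranks to ties
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2.0
        i = j
    w = sum(ranks[k] for k, (v, g) in enumerate(pooled) if g == 0)
    n1, n2 = len(a), len(b)
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (w - mean) / sd

# Hypothetical metric values for successful vs. unsuccessful projects.
successful = {"review_coverage": [0.9, 0.8, 0.85, 0.95], "team_size": [5, 6, 5, 7]}
failed     = {"review_coverage": [0.2, 0.3, 0.25, 0.1],  "team_size": [6, 5, 7, 5]}

# Keep only metrics whose distributions differ significantly (|z| > 1.96).
selected = [m for m in successful
            if abs(rank_sum_z(successful[m], failed[m])) > 1.96]
print(selected)
```

Only the metrics that survive this filter would then be fed to the Bayesian classifier.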
Li et al. (Li et al., 2006) have shared empirical results of a quantitative defect prediction
approach. Their major focus has been on the improvement of testing and resource allocation. Their
work has categorized the metrics available before release for field defect prediction and hence helps
in future improvement. They have compared seven prediction models, including a clustering
algorithm as well as regression models, using rank correlation. On the basis of this comparison
they have identified the important predictors for a certain type of software. They have also used the
Bayesian Information Criterion (BIC) to determine important and prioritized areas for product
testing.
Bouktif et al. (Bouktif et al., 2006) have presented an approach for improving software quality
prediction. Their approach was to reuse and adapt already available quality prediction models.
They have proposed running a simulated annealing algorithm on top of a Bayesian classifier
approach. They have generalized the selection of a prediction model by treating existing models as
quality experts, modeling each expert's expertise as a Bayesian classifier, and running their
suggested algorithm over the set of experts. The algorithm outputs the best subset of expertise from
the set of all available expertise; they have thus tackled the problem of finding an optimal subset as
an optimization problem.
Khoshgoftaar et al. (Khoshgoftaar et al., 1997a) have introduced process-based metrics to
predict software quality. They have used reliability indicators to predict the quality of the software,
and have emphasized that the quality of a process reflects the quality of the product itself. According
to them, early prediction of reliability indicators motivates the use of reliability improvement
techniques prior to module integration. Their core work has been on the improvement of integration
and testing processes, which eventually improves the software quality as well. They have suggested
that product metrics, as used by various quality prediction models, are not good tools for this task
in systems which evolve with each iteration (for example, systems developed using the spiral life
cycle). They have classified modules as fault-prone or not fault-prone, and this classification helps
them focus on the modules which are more prone to faults. Their classification model was
probability based and exhibited a misclassification rate of approximately 35.5%.
Khoshgoftaar et al. (Khoshgoftaar and Allen, 1999a) have suggested a logistic regression-based
model for software quality prediction and for classifying modules as fault-prone or not fault-prone.
Unlike other regression modeling techniques, they have incorporated the prior probabilities of
misclassification and the costs of misclassification into the classification rule of the logistic
regression-based model. They have performed the analysis on a subsystem of a military real-time
system.
Mockus et al. (Mockus et al., 2005), using regression analysis, have predicted customer-perceived
quality by measuring service interactions such as defect reports, requests for assistance, field
technician patches and other parameters in a large telecommunication software system. They have
investigated the impact of problem occurrence and of the frequency of problem occurrence, and
discovered that both negatively affect customer-perceived quality. Mockus et al. (Mockus et al.,
2005) have used the data collected through automated project monitoring and control and found
that deployment schedule, hardware configurations and software platforms can affect the probability
of a software failure. Furthermore, their findings have suggested that these factors equally affect
the customer-perceived quality. Their work has been more helpful in the planning of customer
support processes, unlike other prediction models which help the planning of development.
Nagappan et al. (Nagappan and Ball, 2005b) have employed code churn metrics for predicting
system defect density. They have used eight relative metrics, such as the ratio of changed LOC to
total LOC and of deleted LOC to total LOC, and applied statistical regression models to predict
defect density. They have found that absolute measures of code churn are not good predictors of
defect density.
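The distinction between absolute and relative churn can be sketched as follows. The eight relative measures of the original study are not reproduced here; two illustrative ratios are computed from hypothetical per-module version-control counts.

```python
# Hypothetical per-module churn counts collected from version control.
modules = {
    "parser.c": {"churned_loc": 450, "deleted_loc": 120, "total_loc": 1500},
    "lexer.c":  {"churned_loc":  30, "deleted_loc":  10, "total_loc":  900},
    "driver.c": {"churned_loc": 800, "deleted_loc": 300, "total_loc": 1600},
}

def relative_churn(m):
    """Two relative churn measures of the kind discussed above:
    churned LOC and deleted LOC, each normalized by total LOC."""
    return {
        "churned_ratio": m["churned_loc"] / m["total_loc"],
        "deleted_ratio": m["deleted_loc"] / m["total_loc"],
    }

features = {name: relative_churn(m) for name, m in modules.items()}
for name, f in features.items():
    print(name, round(f["churned_ratio"], 3), round(f["deleted_ratio"], 3))
```

Normalizing by module size is what makes the measures comparable across modules; the absolute counts alone would mostly reflect module size rather than change activity.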
Neil et al. (Neil and Fenton, 1996) have highlighted the importance of factors such as contextual
information in improving the prediction of software defects. They have used a combination of
software process and product metrics to build a Bayesian Belief Network (BBN). Their expert-driven
BBN was capable of making better predictions on the basis of contextual information.
Fenton et al. (Fenton and Neil, 1999) have criticized existing models for being unable to take a
holistic view while predicting defects. For example, certain models predict defects using size and
complexity metrics whereas others use testing metrics only, consequently ignoring the potential
predictors like process metrics. Moreover, these models do not take into account the relation-
ship between software metrics and defects. Fenton et al. (Fenton et al., 2002) have presented a
BBN to overcome the aforementioned limitations. Their model has shown encouraging results
when used to predict defects in multiple real software projects. The BBN has been improved to
work independent of software development life cycle (Fenton et al., 2007b) and has also been em-
ployed by different studies (Fenton et al., 2008, Amasaki et al., 2003, Dabney et al., 2006, Pai and
Bechta Dugan, 2007) to predict software quality.
A causal relationship between software metrics and defects is important for understanding and
improving software processes (Card, 2004, 2006, Chang and Chu, 2008). Thapaliyal et al. (Thapaliyal
and Verma, 2010) have shown that in the object-oriented paradigm, ‘Class Size’ is a metric that
has a strong positive relationship with defects, whereas ‘Coupling between Objects’ and ‘Weighted
Methods per Class’ are not significant enough to be used to predict defects. This study employed a
weighted least squares model for the empirical analysis. Similarly, the use of contextual information
has been reported to be ineffective when compared to the use of code metrics alone (Bell et al., 2011).
2.4.4 Neural Networks
Khoshgoftaar et al. (Khosgoftaar et al., 1994) have suggested a neural network based classification
model to identify the modules which have a high risk of error, and have applied it to quality
evaluation and resource planning for telecommunication software. They have compared the
classification results with those of discriminant analysis based classification. Their finding was that
the neural network based approach is better in terms of providing management support in software
engineering. Neural network based techniques were also simpler and easier to use, and produced
more accurate models, as compared to discriminant analysis.
Pizzi et al. (Pizzi et al., 2002) have performed Neural Network (NN) based classification of
software modules and predicted their quality accordingly. They have classified the modules (objects)
on the basis of quality attributes like maintainability, extensibility, efficiency and clarity. Upon
classification, the low quality objects may be reviewed by the project manager for improvement.
Their approach asks an expert software architect to determine the class labels and then uses neural
networks to classify the objects. Since the labeling of classes involves a human, subjectivity cannot
be avoided in this approach. To compensate for such imprecision they have performed preprocessing
using median-adjusted class labeling. They have applied their approach to 19 different metrics, and
based on the values of those metrics the class labels were set by the expert architect.
Thwin et al. (Thwin and Quah, 2002) have conducted an empirical study and proposed a NN-based
approach to estimate software quality. A similar study was conducted afterwards by Quah et
al. (Quah and Thwin, 2003). They have used two different neural network models, the Ward NN
(WNN) model and the General Regression NN (GRNN) model, for estimating the number of defects
(reliability) per class and the number of line changes (maintainability) per class. They have used
object-oriented metrics, such as inheritance-related metrics, cohesion metrics and coupling metrics,
as independent variables, and have also used traditional complexity metrics for prediction purposes.
The WNN model was based on a Gaussian function, while the GRNN estimates continuous dependent
variables. After applying both approaches they have found that GRNN predicts the quality more
accurately. Unlike Quah et al. (Quah and Thwin, 2003), Thwin et al. (Thwin and Quah, 2002) have
taken into account the reliability metric (i.e. number of defects) only.
Wang et al. (Wang et al., 2004) have also used a neural network based approach for quality
prediction, but they have extracted rules from the prediction model and then used those rules to
detect fault-prone software modules. They have used genetic algorithms and a rule-based approach
on top of the NN-based approach. They have discouraged the use of NN as a black box and have
introduced an interpretable NN-based model for software quality prediction.
2.4.5 Classification/Regression Trees
Khoshgoftaar et al. (Khoshgoftaar and Allen, 1999b) have used Classification and Regression
Trees (CART) to predict fault-prone software components in embedded systems. The goal of their
prediction was to improve the reliability of embedded systems. They have used measurement and
fault data from previous releases to develop a prediction model. This model was applied to modules
currently under development in order to predict which software modules carry a high risk of faults
being discovered later on. They have taken software metrics as the independent (predictor) variables
and treated the fault-prone and not fault-prone classes of modules as the dependent variable.
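The elementary operation that CART repeats recursively, finding the metric threshold that best separates the two classes, can be sketched with a single-split "stump". The complexity values and fault labels below are synthetic, and Gini impurity is used as the split criterion, one of the standard choices for classification trees.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(values, labels):
    """Find the threshold on a single metric that minimizes the weighted
    Gini impurity of the two branches -- the step CART applies
    recursively to grow a full tree."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left  = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical cyclomatic complexity per module; 1 = fault-prone.
complexity = [3, 4, 5, 6, 20, 25, 30, 40]
faulty     = [0, 0, 0, 0,  1,  1,  1,  1]
threshold, impurity = best_split(complexity, faulty)
print(threshold, impurity)
```

On this toy data the split at complexity <= 6 separates the classes perfectly, driving the weighted impurity to zero; a regression tree would use a variance criterion instead to predict fault counts.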
Gokhale et al. (Gokhale and Lyu, 1997) have used regression tree modeling for the prediction
of software quality. The independent variables in their prediction model were software complexity
metrics; the total number of faults and the classification of the modules were the dependent
variables. They have compared their approach with the defect density model for prediction and
found that their approach had a lower misclassification rate: 14.86% for their tree modeling against
20.6% for the defect density technique. In addition, their approach was robust to the presence of
outliers and was capable of handling missing values as well. Their approach also catered for highly
uncorrelated data and performed stably.
2.4.6 Case-based Reasoning
Ganesan et al. (Ganesan et al., 2000) have targeted better reliability and given a Case-based
Reasoning (CBR) approach for quality prediction. In contrast to other approaches, which use
qualitative metrics and CBR to predict quality by identifying similar cases, their approach predicts
quantitative metrics of quality. They have narrated a case study in which CBR was applied for
software quality modeling in a family of full-scale industrial software systems. While predicting
faults, the accuracy of the CBR system was found to be better than that of a corresponding multiple
linear regression model. The CBR system gave better results for approximately 67% of the datasets
and equivalent results for the remaining 33%.
Grosser et al. (Grosser et al., 2003) have presented a case-based reasoning based stability
prediction model for object-oriented systems. Their approach was to record the stability values of a
certain set of components; these stability values are already known and correct, and the stored
values act as reference values for new instances of components. A newly arrived component is
compared with the stored components, and the set of components to which the arriving component
is most similar decides the stability of the new component, i.e. the stability value of the nearest set
of components becomes the stability value of the newly arrived component. They have used
successive versions of the Java Development Kit (JDK) as validation data for their model.
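The nearest-case retrieval step of such a CBR model can be sketched as a k-nearest-neighbour vote. The case base, metric vectors and similarity measure below are illustrative assumptions, not the similarity function used in the original study.

```python
import math
from collections import Counter

def knn_stability(case_base, query, k=3):
    """Case-based reasoning sketch: the stability of a new component is
    the majority stability among its k most similar stored components
    (Euclidean distance over the metric vector as similarity)."""
    ranked = sorted(case_base, key=lambda c: math.dist(c["metrics"], query))
    votes = Counter(c["stable"] for c in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical case base: [methods, subclasses, coupling] per class,
# with stability known from past versions.
case_base = [
    {"metrics": [10, 1, 3],  "stable": True},
    {"metrics": [12, 0, 2],  "stable": True},
    {"metrics": [9,  2, 4],  "stable": True},
    {"metrics": [40, 8, 20], "stable": False},
    {"metrics": [55, 9, 25], "stable": False},
]

print(knn_stability(case_base, [11, 1, 3]))   # query resembles the stable cases
```

Retrieval quality here depends entirely on the distance function and on how representative the stored cases are, which is why the original work validated against several successive JDK versions.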
2.4.7 Fuzzy Approaches
Yuan et al. (Yuan et al., 2000) have applied subtractive fuzzy clustering to classify modules
as fault-prone or not fault-prone and to predict the number of faults. They cluster the data by
identifying potential cluster centers and selecting the data point with the highest potential as a
center. This process of selecting centers for further clusters continues until the potential values of
the remaining data points fall below a certain threshold. Rules are then generated for each class
using the membership functions of the data points, with the Sugeno principle used for fuzzy
inference, and these rules are used to predict the number of faults. They perform clustering on
product as well as process metrics. The approach of Yuan et al. (Yuan et al., 2000) differs from that
of Khoshgoftaar et al. (Khoshgoftaar et al., 1997a) in that Yuan et al. have used fuzzy clustering
while Khoshgoftaar et al. have used a probability based technique for classification.
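The center-selection loop described above can be sketched as follows. The potential function and the radius and stopping parameters follow the usual form of subtractive clustering, but the specific parameter values and the two-dimensional metric points are illustrative only.

```python
import math

def subtractive_clustering(points, ra=1.0, eps=0.15):
    """Subtractive clustering sketch: each point receives a 'potential'
    reflecting the density of its neighbours; the highest-potential
    point becomes a cluster center, potentials near it are suppressed,
    and the process repeats until the best remaining potential falls
    below a fraction `eps` of the first center's potential."""
    alpha = 4.0 / ra ** 2
    beta = 4.0 / (1.5 * ra) ** 2          # wider suppression radius
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    pot = [sum(math.exp(-alpha * d2(p, q)) for q in points) for p in points]
    centers, first = [], max(pot)
    while max(pot) >= eps * first:
        c = pot.index(max(pot))
        centers.append(points[c])
        pc = pot[c]
        pot = [p - pc * math.exp(-beta * d2(points[i], points[c]))
               for i, p in enumerate(pot)]
    return centers

# Two well-separated groups of module-metric points (synthetic).
points = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
          (5.0, 5.1), (5.1, 5.0), (5.2, 5.1)]
print(subtractive_clustering(points))
```

In a fuzzy classifier of the kind described above, each recovered center would seed a membership function from which the classification rules are generated.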
Dick et al. (Dick and Kandel, 2003) have employed fuzzy clustering based on software metrics,
together with PCA on the resulting clusters, to identify the modules that are more significant in
terms of the effort and resources needed. Their approach identified the few relatively complex
modules in the whole software system, which helps in mitigating the risk associated with those
modules. They have performed fuzzy cluster analysis of three datasets and obtained impressive
results, using 11 metrics for each data set. Different numbers of clusters were considered optimal
for different datasets. Once the clusters were made, Dick et al. (Dick and Kandel, 2003) identified
that modules with high metric values carry high risks. They have used PCA to examine, for each
dataset, linearity and homogeneous ordering.
Attempts have been made to identify rules useful in the early stages of the software life cycle for
predicting defect proneness and reliability (Yang et al., 2007). Yang et al. (Yang et al., 2007) suggest
the use of a fuzzy adaptive learning control network (FALCON) which can predict the quality based
on fuzzy inputs. They intend to predict quality when the exact values of software metrics are not
known and the domain knowledge and experience of the project managers can be used to
approximate the metric values. Furthermore, they also intend to find a rule set which has the
capability to reason under uncertainty, a limitation of most quality prediction models (Fenton and
Neil, 1999).
Fuzzy Unordered Rule Induction Algorithm (FURIA) (Huehn and Huellermeier, 2009) is a
modified version of the rule-based classifier RIPPER (Cohen, 1996). Apart from rule learning and
simple, comprehensible rule sets like those of RIPPER, FURIA adds several extensions. Whereas
RIPPER learns conventional rules organized in rule lists, FURIA learns more general fuzzy rules
organized in unordered rule sets. Unordered rule sets may leave some examples uncovered; the
FURIA algorithm deals with these through a rule stretching method which generalizes the existing
rules until they cover the uncovered examples.
2.4.8 Support Vector Machine
Xing et al. (Xing et al., 2005) have used Support Vector Machines (SVM) for the early prediction
of software quality. Being SVM-based, their proposed model does not need large amounts of data.
The SVM-based predictor gave correct classifications 89% of the time. They also suggested a new
SVM-based approach named Transductive SVM (TSVM), which achieved a 90.03% correct
classification rate (CCR); this improved CCR is achieved at the expense of more training time than
other methods. In other research, Xing et al. (Xing and Guo, 2005) have presented a software
reliability growth model (SRGM) based on support vector regression (SVR). That technique has
been classified as a statistical and probabilistic technique and is discussed later in this chapter.
Brun et al. (Brun and Ernst, 2004) have employed a machine learning based approach to predict
latent code errors. They worked on identifying the program properties which are good indicators of
errors in the later phases of the SDLC. They have generated machine learning models of those
program properties which are exhibited when errors are present. They have then applied those
models to already developed code to find and rank the properties which are more fault-revealing.
They have used SVM and decision tree learning tools for the classification of such properties.
2.4.9 Genetic Algorithms
Bouktif et al. (Bouktif et al., 2004) have suggested genetic algorithm based improvement of
quality prediction techniques for object-oriented systems by combining a set of existing models.
Their approach helps an interested organization select an appropriate quality model. The quality
factor they have addressed is the stability of software developed in the object-oriented paradigm.
They believe that a stability prediction model should be used to assess and reduce the
implementation cost of new requirements. But this prediction model needs some input from
previous versions, so it cannot be used until a few versions have been developed. For this reason
they collect, for each class, the metrics which contribute towards its stability. These metrics include
Coh, NPPM (number of public and protected methods) and stress. Based on these metrics they
build a decision tree classifier that predicts whether a given class is stable or not. They encode this
classifier as a chromosome to be used in the genetic algorithm.
Wang et al. (Wang et al., 2007) have introduced a genetic algorithm based approach to select
the relevant metrics from the set of all metrics collected during the development of a certain
software system. This approach helps in designing more accurate predictors of software quality,
since not all of the collected metrics contribute towards the quality of the software; only a certain
subset of them does. Therefore all of the collected metrics should not be used as input to the
prediction model; instead only the appropriate software features should contribute to the prediction.
They have selected these relevant features through a genetic algorithm based feature selection
model. They have suggested that the prediction models usually built are not adaptive learners:
when the trained models are given an outlier as input, they fail to classify it correctly. But avoiding
outliers in real-world software is not practically possible, so they have suggested an outlier detection
technique which prunes the small clusters of software modules that might degrade the performance
of the prediction model. They have achieved both of these goals, appropriate feature selection and
software clustering for outlier detection, by combining them in the fitness function. They have
validated their model on data collected over four releases of two large telecommunication systems.
The systems were developed using the high level languages Ruby and Sapphire. They have used 14
file-level metrics and 18 routine-level metrics to test their idea. During this validation activity, they
observed an overall misclassification rate of up to 24.6%.
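The basic shape of genetic-algorithm feature selection, a bit-vector chromosome over the candidate metrics evolved under a fitness function, can be sketched as follows. The per-metric relevance scores and the simple score-minus-cost fitness are toy stand-ins; the actual study's fitness combined classification quality with clustering-based outlier pruning.

```python
import random

random.seed(7)

# Hypothetical relevance score per candidate metric (higher = more useful).
relevance = [0.9, 0.1, 0.8, 0.05, 0.7, 0.1, 0.05, 0.6]
COST = 0.3   # penalty per selected metric, discouraging large subsets

def fitness(bits):
    """Toy fitness: total relevance of selected metrics minus a size cost."""
    return sum(r for r, b in zip(relevance, bits) if b) - COST * sum(bits)

def mutate(bits, rate=0.1):
    return [1 - b if random.random() < rate else b for b in bits]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

pop = [[random.randint(0, 1) for _ in relevance] for _ in range(20)]
history = []
for gen in range(40):
    pop.sort(key=fitness, reverse=True)
    history.append(fitness(pop[0]))
    elite = pop[:4]                       # elitism keeps the best subsets
    children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                for _ in range(len(pop) - len(elite))]
    pop = elite + children

best = max(pop, key=fitness)
print(best, round(fitness(best), 2))
```

Because the four best chromosomes survive each generation unchanged, the best fitness in `history` never decreases, which is the usual motivation for elitism in this setting.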
2.4.10 Association Mining
The use of association mining for defect prediction has also been reported. Association Rule
Mining (ARM) has been useful for predicting software defect correction effort and for determining
associations among software defects (Song et al., 2006, Priyadarshin, 2008, Karthik and Manikandan,
2010). An association rule based classifier, CBA2, has been empirically evaluated for predicting
software defects (Baojun et al., 2011); the accuracy and comprehensibility of CBA2 have been
comparable to C4.5 and RIPPER, two recognized defect classification models. Kamei et al. (Kamei
et al., 2008) have presented a hybrid approach to classify software modules as fault-prone or not
fault-prone. The hybrid of ARM and logistic regression has performed better in terms of lift;
however, its performance has been inferior in terms of support and confidence when compared with
the individual models based on logistic regression, linear discriminant analysis and classification
trees. Association rules have also been employed to identify the action patterns that may cause
defects (Chang et al., 2009). Each rule represents actions as the antecedent and the number of
defects as the consequent. The antecedents can be of numeric or categorical type, whereas the
consequent is discretized as low, medium or high. Actions co-occurring with defects are used to
avoid future defects. The proposed approach has shown promising results when used for a business
project.
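The support and confidence measures mentioned above can be made concrete with a small sketch. The transactions, action names and the single-antecedent rule shape are hypothetical illustrations of "action -> defect" rules, not data from any of the cited studies.

```python
# Hypothetical change-action transactions; "defect" marks changes that
# later triggered a fault report.
transactions = [
    {"edit_core", "no_review", "defect"},
    {"edit_core", "no_review", "defect"},
    {"edit_core", "review"},
    {"edit_ui", "review"},
    {"edit_ui", "no_review", "defect"},
    {"edit_ui", "review"},
]

def support(itemset):
    """Fraction of transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) estimated from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# Mine single-antecedent rules "action -> defect".
actions = {a for t in transactions for a in t} - {"defect"}
rules = {a: (support({a, "defect"}), confidence({a}, {"defect"}))
         for a in actions}
for a, (s, c) in sorted(rules.items()):
    print(f"{a} -> defect  support={s:.2f} confidence={c:.2f}")
```

In this toy data the rule "no_review -> defect" has confidence 1.0, the kind of high-confidence action pattern such approaches flag for process improvement.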
2.4.11 Ensemble Models
Ensemble models work on the principle of creating multiple learning models and combining their
outputs to obtain the desired output. Random Forests (Rnd For) (Breiman, 2001) is a decision tree-based
ensemble classifier consisting of many decision trees. The trees are constructed using the
following strategy: the root node of each tree contains a bootstrap sample of the data that is the
same size as the original, and each tree has a different bootstrap sample. At each node, a subset of
variables is randomly selected from all the input variables to split the node, and the best split is
adopted. This splitting is repeated for each new node, without pruning, until the tree is grown to
the largest extent possible. When all trees are grown, a new instance or set of instances is run down
all the trees, and the mode of the trees' votes is selected as the prediction for the new instance(s)
(Jiang et al., 2008a).
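The bootstrap-sample-plus-random-feature-plus-majority-vote recipe can be sketched with one-level trees (stumps) standing in for fully grown trees. The module metrics and labels are synthetic, and real random forests grow each tree to full depth with a feature subset drawn at every node rather than once per tree.

```python
import random
from collections import Counter

random.seed(1)

def train_stump(X, y, feat):
    """Best single-feature threshold split by misclassification count."""
    best = (0.0, 0, 1, len(y) + 1)            # threshold, left, right, errors
    for t in sorted({row[feat] for row in X}):
        for left, right in ((0, 1), (1, 0)):
            err = sum(yi != (left if row[feat] <= t else right)
                      for row, yi in zip(X, y))
            if err < best[3]:
                best = (t, left, right, err)
    return feat, best[0], best[1], best[2]

def forest_predict(trees, row):
    """Majority vote (mode) over all trees' predictions."""
    votes = Counter(left if row[feat] <= t else right
                    for feat, t, left, right in trees)
    return votes.most_common(1)[0][0]

# Synthetic module metrics [size, coupling]; 1 = fault-prone.
X = [[100, 2], [120, 3], [90, 2], [800, 9], [750, 8], [900, 10]]
y = [0, 0, 0, 1, 1, 1]

trees = []
for _ in range(15):
    # bootstrap sample of the same size, plus a random feature per tree
    idx = [random.randrange(len(X)) for _ in X]
    feat = random.randrange(len(X[0]))
    trees.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feat))

preds = [forest_predict(trees, row) for row in X]
print(preds)
```

Each individual stump is weak and trained on a distorted sample, but the vote across many diverse stumps recovers a much more stable classifier, which is the point of the ensemble.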
Composite Hypercubes on Iterated Random Projections (CHIRP) (Wilkinson et al., 2011) is a
new covering algorithm, not a modified version of an existing algorithm. It is a non-parametric
ensemble model. It deals with the problems of the "curse of dimensionality", computational
complexity and separability that are usually faced by supervised classifiers. CHIRP addresses these
obstacles by using three stages, projection, binning and covering, in an iterative sequence. To use
this model, prior knowledge of the structure of the data is not required, since it does not need to be
customized for each data set.
The classifier DTNB is a combination of a Decision Table and Naïve Bayes (Hall and Frank,
2008). At each point in the search, the algorithm evaluates the merit of splitting the attributes into
two groups: one for Naïve Bayes and the other for the decision table. The resulting probability
estimates from both models are combined to give the result of this hybrid classifier. Initially, all
attributes are modeled by the decision table. A forward selection search is used to select attributes;
at each step, the selected attributes are modeled by Naïve Bayes and the remainder by the decision
table. At each step, the algorithm also considers dropping an attribute entirely from the model.
2.4.12 Other Studies
Ottenstein (Ottenstein, 1979) has suggested a mathematical model for predicting the number of
defects in a system before testing starts. This effort can help in the planning and testing phases
of the SDLC. Ottenstein investigated the relationship between the software science metrics
(Halstead, 1977) and the number of bugs. In short, this model was based on the study of software
science metrics and was tested on the data available in the literature at the time. The proposed
model was useful for estimating the time needed for testing, generating better schedules, estimating
the amount of computer time needed to perform testing, etc. Furthermore, this model could lead to
improved reliability estimates. The model relating Halstead Volume (V) to the number of bugs had
the lowest error rate, 12.3%, and this error rate was consistent across the several datasets used.
Later, Ottenstein improved the work and presented a model (Ottenstein, 1981) predicting the total
number of bugs during the development of a project. The difference was that the previous model
predicted the number of bugs in the system at the validation stage of the software life cycle, while
the new model estimated the total number of bugs during the development of a project. The model
was validated on projects developed by professional programmers and projects developed by
students. The projects developed by students could be further grouped into projects with few errors
and projects with a large number of errors; both groups had approximately equal numbers of
programs. This model did not give promising results for the projects with a large number of errors.
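Ottenstein's exact formulae are not reproduced here, but the flavour of a software-science bug model can be illustrated with Halstead's own classic delivered-bug estimate, B ≈ V/3000, applied to hypothetical operator and operand counts for one module.

```python
import math

def halstead_volume(n1, n2, N1, N2):
    """Halstead software-science volume: V = N * log2(eta), where
    N = N1 + N2 is the total number of operator and operand
    occurrences and eta = n1 + n2 is the vocabulary size
    (distinct operators plus distinct operands)."""
    return (N1 + N2) * math.log2(n1 + n2)

# Hypothetical counts for one module: 15 distinct operators, 25 distinct
# operands, 180 operator occurrences, 230 operand occurrences.
V = halstead_volume(15, 25, 180, 230)
est_bugs = V / 3000      # Halstead's classic delivered-bug estimate
print(round(V, 1), round(est_bugs, 2))
```

Models of this family share the weakness noted above: a single fitted relation between volume and bugs generalizes poorly to programs with unusually high error counts.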
Mohanty (Mohanty, 1979) has discussed the assessment of various aspects which affect software
quality, for example accessibility and testability. Mohanty has also discussed statistical methods
for estimating software reliability and found that the estimates for reliability and mean time to
failure were highly correlated with the real values. Mohanty's work targeted various phases of the
SDLC and multiple means of assessing software quality at different phases, for example
methodologies for design evaluation through an entropy function, estimation of development effort
through software science metrics, and test effectiveness measurement through suitable test plans.
This work contributed significantly in affirming that quantifiable attributes of software can help
control and manage software projects.
Schneider (Schneider, 1981) has presented estimators on the basis of an experimental study.
These estimators were formulae for estimating the number of software problem reports. The
estimators were validated using collected data and were found consistent with it.
Jensen et al. (Jensen and Vairavan, 1985) have conducted an experimental study of software
metrics and their relationships with each other for real-time systems developed in Pascal. Their
model was also an empirical, mathematical model, but they did not validate the model for studying
the relationship between errors and metrics. Their study of the relationship between errors and
software metrics revealed that a new metric (namely NF) is a better estimator of program length:
approximately 91% of the programs they tested suggested that NF is a better approximation than
NH, Halstead's program length metric (Halstead, 1977).
Brocklehurst et al. (Brocklehurst and Littlewood, 1992) have suggested a statistically inspired
model which is a good candidate for an almost generic model of reliability prediction, though with
some limitations. They have observed that the ability to depict the past correctly does not guarantee
the ability to predict the future accurately; so the models already discussed in the literature at the
time were, by their claim, not true predictors. They have presented a new concept of detecting
systematic differences between the predicted and actual values. They have used the u-plot, very
similar to the concept of bias in statistics, to assess predictive accuracy. They validated their model
on three datasets and found promising results.
Ohlsson et al. (Ohlsson and Alberg, 1996) have provided empirical evidence that fault-prone
software modules can be identified before coding starts. This prediction was aided by design
metrics and some complexity metrics. They carried out the study at Ericsson Telecom AB and
developed a tool named ERIMET which helps in calculating certain metrics for only those modules
which are affected by a change. Their study has supported the observation that a small number of
software modules is responsible for a major portion of the total faults encountered in a system.
They have validated the model accuracy using a technique called the Alberg diagram (a slight
variation of the Pareto diagram), introduced by them in the same paper. They intended to find the
relationship between design metrics and function test failures, and for this reason they have
modeled their predictor on the basis of design metrics. Their prediction model also incorporated
some complexity and size metrics, and the finding was that the size metric does not give any better
results than the four design metrics used by them. Their model was developed on
telecommunication data and showed promising results.
Wang et al. (Wang et al., 1998) have not directly addressed software quality, but they present a
model for software reliability which eventually helps in estimating software quality. They use
Markov chain properties for estimation in their model. They estimate the reliability of the
components independently, and the components are then mapped into state diagrams for further
use; the transitions between states are treated as a Markov process. They have used different
reliability models for different architectural styles.
Guo et al. (Guo and Lyu, 2000) have proposed a statistical technique for the early prediction of
software faults. Their approach does not need prior knowledge of the number of faults in the
modules to be classified as fault-prone or not fault-prone, the kind of information that is usually
not available in the early stages of software development. They have suggested that software size
and complexity metrics can be used to develop a model for software quality prediction. They select
the appropriate class for a module based on the values given by the Akaike Information Criterion
(AIC) (Akaike, 1974); the AIC places a software module in the class of modules with similar
characteristics.
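The AIC itself is a simple penalized-likelihood score. The sketch below uses the standard least-squares form of the criterion; the residual sums of squares, sample size and parameter counts for the two competing fault models are hypothetical.

```python
import math

def aic_gaussian(rss, n, k):
    """AIC for a least-squares model with Gaussian errors:
    AIC = n * ln(RSS / n) + 2k, where k is the number of fitted
    parameters. Lower AIC means a better trade-off between
    goodness of fit and model complexity."""
    return n * math.log(rss / n) + 2 * k

# Hypothetical fits of two fault models to the same n = 30 modules:
# model A (2 parameters) vs. model B (6 parameters, only slightly better fit).
n = 30
aic_a = aic_gaussian(rss=12.0, n=n, k=2)
aic_b = aic_gaussian(rss=11.5, n=n, k=6)
chosen = "A" if aic_a < aic_b else "B"
print(round(aic_a, 2), round(aic_b, 2), chosen)
```

The more complex model B fits marginally better but pays a larger 2k penalty, so AIC prefers the simpler model, the same trade-off that underlies AIC-based class assignment.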
Xing et al. (Xing and Guo, 2005) have presented a software reliability growth model (SRGM)
based on support vector regression (SVR). Their proposed SRGM proved to be a better predictor
of reliability than other suggested reliability predictors on the data they used. They have identified
and rectified a problem with conventional SRGMs: the conventional SRGMs made unrealistic
assumptions about the fault distribution, which restrains these models from analyzing actual failure
data. They have compared their model with three different reliability growth models, and the
results have shown that their technique is better than the compared techniques.
Nagappan et al. (Nagappan and Ball, 2005a) have advocated that static analysis tools can be
good predictors of pre-release defect density and, consequently, of the quality of the software. They
have used a statistical regression technique to build the prediction model but do not provide the
regression equations, for the sake of protecting proprietary data. The task of the static analysis
tools was to detect low-level programming errors and errors usually not uncovered by conventional
testing. They have stated that static analysis defect density has a positive correlation with
pre-release defect density; therefore one can use static analysis defect density as an indicator of
pre-release defect density, and as a result decisions on testing, code inspections, redesign, etc. can
be improved.
Liu et al. (Liu et al., 2005) have presented a performance prediction method for component-
based applications. Their approach is applicable at the design level, i.e., before development of a
significant prototype version of the application. Moreover, they have designed the technique
specifically for component-based server-side applications. This design-level technique helps
system architects decide on an appropriate application architecture and improve the design before
significant implementation has been done. They have employed their approach to build a
quantitative performance model for the application once its design is available. The independent
input variables of the model include the design description of the application and the performance
profile of the platform on which the application is to be developed. They have implemented two
different architectures on different implementations of Enterprise Java Beans (EJBs). The
performance predictions from the model were then validated by measuring the performance of
these implementations of the two architectures.
Nagappan et al. (Nagappan et al., 2006) have conducted an empirical study showing that
failure-prone entities correlate with code complexity metrics and that, based on the values of
complexity metrics, component failures can be predicted. However, no single metric or set of
metrics is the best defect predictor. They have used principal component analysis on different
code metrics and have built a regression model to predict post-release defects.
Briand et al. (Briand et al., 1993) have presented an application of Optimized Set Reduction
(OSR) to construct a model that can identify high-risk software components. They have evaluated
the accuracy of their stochastic model on Ada components. The motivation behind the OSR-based
model is that in software engineering data is usually incomplete and may not be comprehensive
enough to fulfill the needs of a model, so they have suggested a robust model that reliably
classifies the high-risk components. OSR is a technique partially based on machine learning
principles and univariate statistics and was developed at the University of Maryland (Briand
et al., 1993, 1992). The output of OSR is logical expressions representing patterns in the
dataset, and the OSR classification model works on the basis of these logical expressions.
Reussner et al. (Reussner et al., 2003) have suggested a parameterized reliability prediction
model for component-based software architectures. They have emphasized that the architectures
should be parameterized by the component usage profile and the reliability of the required
components. Their empirical technique therefore collects usage information and models the
dependencies between service reliabilities as state machines. In this way they have overcome a
problem of earlier models, which neglected component usage information and context-related
information.
Grassi et al. (Grassi and Patella, 2006) have suggested a reliability prediction methodology for
service-oriented computing environments. They associate a flow graph with each running service
in order to obtain its internal failure probability and usage profile. They have presented steps
to build this flow graph; once such a graph is built and associated with a specific service, it
portrays reliability information for that service.
Kim et al. (Kim et al., 2007) have presented an innovative approach to fault prediction. They
argue that a changed entity, a newly added entity, an entity that introduced a fault most recently,
and entities logically coupled to it tend to introduce faults soon. To exploit these localities they
keep track of recent faults and their locations (the change history of a software project) and
predict the most fault-prone entities. A cache keeps the current list of the most fault-prone
entities. If a fault is introduced after a certain revision, a cache hit is counted if the
fault-introducing entity is already in the cache; otherwise it is counted as a cache miss.
Similarly, when a bug in an entity is fixed, the presence of that entity is also checked in the
cache, and it is counted as a hit if the entity is already there. They suggest a cache size of
10% of the total number of entities and use Least Recently Used (LRU) as the replacement
algorithm. The hit rate tells how accurately the prediction model is working: a good hit rate
means that the cache keeps the most fault-prone entities, which makes the approach dynamic.
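The hit/miss bookkeeping described above can be sketched as follows. The entity names and fault history are hypothetical, and the sketch only illustrates the caching idea (10% capacity, LRU replacement, hit-rate accounting), not the authors' tool.

```python
from collections import OrderedDict

class FaultCache:
    """Minimal sketch of the cache-based locality idea: keep the most recently
    fault-involved entities and count hits when new fault events arrive."""

    def __init__(self, total_entities, fraction=0.10):
        self.capacity = max(1, int(total_entities * fraction))  # suggested 10%
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def on_fault(self, entity):
        if entity in self.cache:
            self.hits += 1
            self.cache.move_to_end(entity)       # refresh recency on a hit
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # evict least recently used
            self.cache[entity] = True

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Hypothetical fault history over a project with 20 entities (cache size 2).
cache = FaultCache(total_entities=20)
for entity in ["a.c", "b.c", "a.c", "a.c", "c.c", "b.c"]:
    cache.on_fault(entity)
print(round(cache.hit_rate(), 2))
```

A high hit rate would indicate that recent fault locality is indeed predicting where the next faults appear.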
2.5 Performance Evaluation Studies
Studies that investigate methods to build and evaluate defect prediction models have also been
conducted. These studies compare different prediction models, assess the impact of using different
metrics, and discuss the evaluation parameters that should be used to gauge the performance of
the models. Examples of such studies include (Lessmann et al., 2008, Menzies et al., 2010,
Arisholm et al., 2010). In this chapter we conduct a study similar to (Lessmann et al., 2008) and
include the new models proposed after 2008.
Lessmann et al. (Lessmann et al., 2008) proposed a framework for evaluating different classi-
fiers. The framework is tested on 22 classifiers using 10 data sets from the PROMISE Repository
(Menzies et al., 2012). A split-sample setup is used, that is, the data sets are randomly partitioned
into 1/3 for learning (model building) and 2/3 for performance estimation (testing). For models
having hyperparameters, a set of candidate values is selected for each hyperparameter and all
combinations are checked experimentally using 10-fold cross validation on the training data. The
combination of hyperparameters that produces the best predictive result is chosen and is used to
build a model on the whole training data set. The split-sample setup and 10-fold cross validation
have been used in other classifier evaluation experiments as well (Mende, 2010, Menzies et al.,
2007, Koru and Liu, 2005b). The results are evaluated and compared using ROC curve analysis,
more specifically, using AUC.
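The evaluation setup just described can be sketched with scikit-learn on synthetic data. The PROMISE data sets and the original 22 classifiers are not reproduced here; the random forest, the parameter grid, and the synthetic class imbalance are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for a defect data set (~80% clean modules).
X, y = make_classification(n_samples=600, n_features=10, weights=[0.8],
                           random_state=0)

# Split-sample setup: 1/3 for model building, 2/3 for performance estimation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=1 / 3, random_state=0, stratify=y)

# Candidate hyperparameter values are checked with 10-fold CV on the training
# data; the best combination is then refit on the whole training set.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [10, 50],
                                "max_depth": [3, None]},
                    cv=10, scoring="roc_auc", refit=True)
grid.fit(X_train, y_train)

# The final comparison uses AUC on the held-out 2/3.
auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print(round(auc, 3))
```

The same skeleton (split, inner cross-validation for tuning, AUC on the held-out portion) is what later replications of this framework reuse.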
Based on previous works on defect predictors built from static code features (Lessmann et al.,
2008, Menzies et al., 2008, 2007), Menzies et al. (Menzies et al., 2010) point out that “better
data mining technology is not leading to better defect predictors.” They call this the ceiling effect.
Due to this ceiling effect, they claim that researchers have reached “the limits of the standard
goal” of optimizing AUC (pd, pf), and in their work they explore the effects of changing the
standard goal. For this, they propose and evaluate WHICH, a meta-learner that can be customized
according to varying objectives depending on the domain. They also claim that varying the
standard goal will help break through the ceiling. Another major use, complementing the idea
of customizable goals, is that WHICH can help different businesses achieve different goals:
considering the resources available to them, different businesses have different goals when it
comes to defect prediction (Menzies et al., 2010). Menzies et al. (Menzies et al., 2010) also
include details of the debate on the usefulness of static code metrics and argue in favor of these
metrics.
Fenton et al. (Fenton and Neil, 1999) present a critical review of the existing literature on soft-
ware metrics and the statistical models developed using them. They report that prediction models
are based on size and complexity metrics, testing metrics, process quality data, and multivariate
approaches. In particular, the Pareto principle, “a very small proportion of the defects in a system
will lead to almost all the observed failures in a given period of time,” implies that removing a
large number of defects in a system may not actually improve its reliability. In their critique of
current approaches to defect prediction, they identify problems related to these approaches and
suggest a framework based on Bayesian networks as a solution. They argue that using only
complexity metrics as indicators of defects is neither sufficient nor appropriate. They point out
that statistical methodologies should be used with a major focus on data quality and the method
of evaluation. Finally, they stress the importance of identifying the link(s) between defects and
failures.
One of the aims of any software engineering experiment is that it should be repeatable and
its results generalizable. Mende (Mende, 2010) performs two replications and reports the
experience gained and the problems faced. Such studies highlight the information needed for
replication of a defect prediction experiment. Replication of defect prediction experiments may
not produce exact results, since many of the statistical techniques used are inherently
non-deterministic (Mende, 2010), but the results should at least be consistent. Mende (Mende,
2010) notes that the details given and the evaluation experiment performed by Lessmann et al.
(Lessmann et al., 2008) seem to be up to the mark. For the same reason the techniques used by
Lessmann et al. (Lessmann et al., 2008) have also been adopted by Jiang et al. (Jiang et al.,
2008a) and Mende et al. (Mende and Koschke, 2009).
Jiang et al. (Jiang et al., 2008a) evaluate a variety of performance measures based on different
scenarios on eight data sets from the PROMISE repository (Menzies et al., 2012). A comparison
of several performance measures is also given, including AUC and lift charts. Cost curves are
introduced along with their merits. The experiment is conducted using six classifiers implemented
in WEKA (Witten et al., 2008). The conclusion is that different performance measures are
appropriate for different software systems, because different systems have different requirements
when it comes to defect prediction (Menzies et al., 2010).
Mende et al. (Mende and Koschke, 2009) consider the evaluation of a defect prediction model
with respect to the effort or cost of quality control activities for each module. A trivial model is
compared with five other classifiers, first using only the measure LOC and then an effort-sensitive
performance measure. For the former, the trivial classifier performed well, but for the latter it
performed the worst. The authors claim that the cost of quality assurance of a module depends to
some degree on the size of the module, which is also supported by Koru et al. (Koru and Liu,
2005b). This work also closely follows the experimental setups of Lessmann et al. (Lessmann
et al., 2008) and Jiang et al. (Jiang et al., 2008a) in the selection of data sets, algorithms, and
evaluation methodology. However, for evaluation, it considers two additional measures, both
based on the cost associated with each module.
Menzies et al. (Menzies et al., 2007) emphasize the need for convergence of studies and how
the use of publicly available data sets can enable researchers to compare their techniques. The
paper advocates the use of static code attributes to learn defect predictors and shows that the
subset of attributes used by a particular classifier for a particular data set matters more than any
single best set of attributes, since different classifiers perform differently on different data sets.
It also proposes the use of Naïve Bayes with logNums and, using WEKA, shows how the method
proposed in the paper performs better than three WEKA algorithms: OneR, J48, and Naïve
Bayes. For model building, it uses a split-sample setup and 10-fold cross validation. To evaluate
the classifiers, it uses ROC curve analysis, and to show results it makes use of performance
deltas.
One useful way of surveying the techniques, tools, and models used in a particular field or
problem domain is to consult literature review(s) on that subject, which help identify major
components and trends. Hall et al. (Hall et al., 2011) give a broad overview of the tools and
techniques used in defect prediction and the current trends, and report limitations as well. They
report that the impact on model performance of specific context variables such as maturity,
application area, and programming language has not been studied extensively. They also report
that the performance of models increases when the size of the system increases (there is more
data), and they examine the types of independent variables used for defect prediction. Three
main categories have been identified: process metrics (e.g., previous change and fault data),
product metrics (e.g., static code metrics), and metrics relating to developers. Fault severity is
another aspect that has not been studied in detail; however, severity is difficult to measure, and
Menzies et al. (Menzies et al., 2007) describe it as too vague a concept to investigate reliably.
Hall et al. (Hall et al., 2011) also discuss the quality of defect prediction studies and the NASA
data sets (Menzies et al., 2012) in particular. Catal et al. (Catal and Banu, 2009) have also
performed a systematic study of existing defect prediction models.
Jiang et al. (Jiang et al., 2008b) explore the performance of different classifiers based on mis-
classification costs: “the ratio of costs for false positives to the costs of false negatives.” They
assume that the misclassification costs for every module are the same. Under this assumption,
they confirm that different misclassification costs have an immense effect on the selection of
suitable classifiers. Assigning misclassification costs is used to bias the classification models.
This technique is useful in analyzing domain-specific learning of classifiers, but the results of
such studies are difficult to converge on for comparisons between classifiers (Lessmann et al.,
2008).
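One common way to act on such a cost ratio is to choose, for each module, the label with the lower expected misclassification cost given a classifier's estimated defect probability. The sketch below illustrates this idea only; it is not the authors' exact procedure, and the probabilities and costs are invented.

```python
def cost_sensitive_label(p_defective, cost_fp, cost_fn):
    """Pick the label with the lower expected misclassification cost.

    Predicting 'defective' risks a false positive, weighted by the probability
    the module is actually clean; predicting 'clean' risks a false negative,
    weighted by the probability the module is defective.
    """
    expected_cost_if_flagged = (1 - p_defective) * cost_fp
    expected_cost_if_passed = p_defective * cost_fn
    if expected_cost_if_flagged < expected_cost_if_passed:
        return "defective"
    return "clean"

# With equal costs, a module with a 0.3 defect probability is passed as clean;
# if missing a defect is five times as costly, the same module is flagged.
print(cost_sensitive_label(0.3, cost_fp=1, cost_fn=1))  # clean
print(cost_sensitive_label(0.3, cost_fp=1, cost_fn=5))  # defective
```

This makes concrete why the chosen cost ratio can change which classifier, and which decision threshold, looks best.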
Demsar (Demsar, 2006) presents a set of guidelines for carrying out a statistical analysis that
is as accurate as possible when comparing a set of models over multiple data sets. A group of
non-parametric statistical methods is proposed for comparing classifiers under conditions in
which parametric statistical analysis is not suitable. An analysis of the performance of the
recommended statistics on classification tasks is provided, and the proposed statistics are shown
to be more convenient than parametric techniques. The work examines the new recommendations
and introduces the Nemenyi test for making all pairwise comparisons. After empirical evaluations,
it also proposes the use of the Wilcoxon signed-ranks test and the Friedman test over others.
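The two recommended tests are available in SciPy. The sketch below runs them on hypothetical AUC scores of three classifiers over six data sets (the scores are invented for illustration).

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical AUC of three classifiers on the same six data sets.
auc_a = [0.78, 0.81, 0.74, 0.79, 0.83, 0.77]
auc_b = [0.72, 0.75, 0.70, 0.74, 0.76, 0.71]
auc_c = [0.74, 0.78, 0.71, 0.76, 0.79, 0.73]

# Friedman test: do the classifiers differ across data sets at all?
friedman_stat, friedman_p = friedmanchisquare(auc_a, auc_b, auc_c)
print(friedman_p < 0.05)

# Wilcoxon signed-ranks test for one pairwise comparison (A vs. B).
wilcoxon_stat, wilcoxon_p = wilcoxon(auc_a, auc_b)
print(wilcoxon_p < 0.05)
```

In practice the Friedman test is run first, and only if it rejects the null hypothesis are post-hoc pairwise tests (e.g., Nemenyi) applied.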
Koru et al. (Koru and Liu, 2005b) discuss how module size affects prediction results. The
variation in module size is due to the multiple definitions of a module. Defect prediction experi-
ments on data sets that take functions and methods as modules (e.g., JM1 and KC2) report a low
probability of detecting defective modules, whereas studies that consider larger program units as
modules (e.g., a set of source files) report successful defect prediction results. This is because
for large modules the static attribute values show more variation, which makes it easier for a
statistical algorithm to differentiate between a large defective and a large non-defective module,
whereas it is harder to differentiate between a small defective and a small non-defective module.
By testing class-level and method-level modules of the KC1 data, Koru et al. (Koru and Liu,
2005b) conclude that defect prediction should be done at a higher level of granularity (e.g., class
level) as opposed to a lower one (e.g., method level). Like Menzies et al. (Menzies et al., 2007),
they also advocate choosing different classification techniques for different data sets.
Becker et al. (Becker et al., 2006) have suggested guidelines for selecting an appropriate per-
formance prediction method, for component-based systems, from a larger set of prediction
methods. In order to improve the prediction capabilities of a model, they have presented a set of
characteristics that a prediction model should possess: accuracy, adaptability, cost effectiveness,
compositionality, scalability, analyzability, and universality. They have presented a comparison
framework based on these characteristics and compared the existing performance prediction
techniques, mentioning the inherent weaknesses and strengths of each. After the comparison they
give guidelines for selecting a prediction model for component-based systems.
2.6 Studies to Remove Inconsistencies in Software Measurement Terminology
Various studies have attempted to standardize software metrics (Purao and Vaishnavi, 2003,
Vaishnavi et al., 2007), remove inconsistencies in software measurement terminology (García
et al., 2006, 2009), and establish a uniform vocabulary of software measurement terminology and
principles (IEEE, 1998, 2008, ISO/IEC, 2001). Purao et al. (Purao and Vaishnavi, 2003) have
presented a uniform and formal representation which aggregates and unifies the metrics dealing
with related aspects of object-oriented (OO) software. They have noticed that fragmented work
on OO software metrics has resulted in proposals of similar metrics. Vaishnavi et al. (Vaishnavi
et al., 2007) have studied OO product metrics and have suggested a generic framework for them.
The framework has been formalized based on the structural attributes of the underlying metrics
and the relatedness of the metrics to each other. García et al. (García et al., 2006, 2009) have
recognized the problem of inconsistencies in the software metrics literature and have suggested
an ontology-based approach to establish a consistent terminology for software measurement.
They have identified inconsistencies among existing research, among standards, and even within
a single standard. Their work includes the semantic relationships of software metrics and the
development of a concept glossary. A number of standards have also been developed to establish
a uniform vocabulary of software measurement terminology, principles, and methods (IEEE,
1998, 2008, ISO/IEC, 2001). IEEE Std 1061-1998 (IEEE, 1998) presents a framework for
software quality metrics that allows a metric to be divided into subfactors. This approach can be
used to study existing metrics, find their commonalities, and identify which metrics can be
considered sub-metrics of a certain metric. ISO/IEC 9126-1 (ISO/IEC, 2001) focuses on uniform
terminology for product quality related terms. An adoption of ISO/IEC 15939:2007 (IEEE, 2008)
defines a measurement process for software and systems engineering. The standard describes a
detailed method to carry out the measurement process and can be applied to select appropriate
metrics on the basis of an organization's information needs. Based on the process outlined in the
standard, an organization may collect, store, analyze, and interpret measurement data according
to its information needs.
In certain scenarios, an organization may not be following any standard to carry out the mea-
surement process but may still want to use existing quality prediction models. Such an
organization will certainly have some data with which to build a prediction model. It would be
very helpful for the organization if there existed a dataset similar to its own, so that the defect
prediction information regarding that dataset could be used to analyze the organization's data.
But the metrics used by the existing datasets have inconsistent labels. In order to match two
datasets, at least their metric labels should be consistent. The above standards do not mention
the problem of inconsistent labeling of software metrics. So far, the software product metrics
used in quality prediction have also not been collected in one place.
2.7 Analysis and Discussion
From the preceding sections we can observe that software defect prediction studies present two
major views. These views differ in focus, nature, use of static code metrics, and availability of
data, as summarized in Table 2.4, which also provides examples of studies from each view.
View 1 emphasizes the significance of the causes of defects and understanding the relationship
between software metrics and defects. View 2 includes studies that have used public datasets and
focused on the selection of software metrics that are important for finding defects, on
improvements in the classification of defects, and on the identification of software metrics that
do not help in classifying defect-prone modules. View 1 requires expert opinion to be
incorporated and works with probabilistic approaches, whereas the effort made in support of
View 2 is based on empirical evidence. View 1 suggests that static code metrics are insignificant
unless process metrics or additional information is also used, whereas View 2 emphasizes that
defects in software can be predicted through software code metrics. View 1 highlights the
importance of investigating the causal relationship between software metrics and defects, while
View 2 leads to building classification models. Because of the public availability of software
defect data, a larger number of empirical studies belong to View 2; studies that belong to View 1
are relatively few. Although the significance of View 1 cannot be denied, such studies cannot be
performed without data. Association mining can potentially bridge the gap between the two
views by giving associations between software metrics and defects. These associations do not
directly identify causes of defects but can be investigated further to understand the relationship
between software metrics and defects.
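The kind of metric-to-defect association such mining would surface can be illustrated with a support/confidence computation over module records. The records and the rule below are invented for illustration; a real study would mine rules from an actual defect dataset.

```python
# Each record lists the discretized properties observed for one module.
modules = [
    {"high_complexity", "high_loc", "defective"},
    {"high_complexity", "defective"},
    {"high_complexity"},
    {"high_loc"},
    {"low_complexity"},
]

def confidence(antecedent, consequent, records):
    """Confidence of the rule antecedent -> consequent,
    i.e. P(consequent | antecedent) over the records."""
    with_antecedent = [r for r in records if antecedent <= r]
    if not with_antecedent:
        return 0.0
    return sum(1 for r in with_antecedent if consequent <= r) / len(with_antecedent)

rule_conf = confidence({"high_complexity"}, {"defective"}, modules)
print(rule_conf)
```

A high-confidence rule does not establish causation, but it points to a metric-defect relationship worth investigating further, which is exactly the bridging role suggested above.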
It is not easy to develop an empirical approach in the initial phases of the SDLC, since data is
needed to build a model. For example, statistical techniques like Discriminant Analysis
(Khoshgoftaar et al., 1996, Munson and Khosgoftaar, 1992) and Factor Analysis (Khosgoftaar
and Munson, 1990) need previous data, or data from similar projects, on which to build the
prediction model. Based on the metrics calculated from these data, the quality of the next phase
can be determined. This prediction is very helpful in iterative development approaches, where
the organization can at least roughly predict the quality of the next iteration. The statistical
techniques can be employed either after the release or right after the design, when everything
about the number of classes and the total number of functions is known. Mohanty (Mohanty,
1979) also discussed statistical models for estimating software reliability and error content, but
those models were applicable only once the software had been deployed.

Certain approaches have the potential to predict quality in early stages. Some of them can be
applied even if a very small amount of data is available in the initial phases. For example, during
the design phase we do not have a number of software metrics (like Lines of Code or Number of
Bugs), only some design metrics. In such scenarios SVM and rule-based systems can be
employed for predicting quality. The AI techniques based on machine learning (Thwin and
Quah, 2002, Pizzi et al., 2002) also need previous or similar data to first perform the learning and
then to classify based on the learnt model. So, like the statistical techniques, the NN-based
models (Quah and Thwin, 2003, Wang et al., 2004, Pizzi et al., 2002), Case-Based Reasoning
(Ganesan et al., 2000), and other classification techniques (Khoshgoftaar and Allen, 1999b, Dick
and Kandel, 2003) depend on data from previous releases or from similar projects.

Tab. 2.4: Two Major Views in Software Defect Prediction

                              View 1                                    View 2
  Focus                       Causes of defects; relationship           Important metrics; classification
                              between software metrics and defects      models based on correlation etc.
  Nature                      Expert-opinion based                      Empirical
  Use of static code metrics  Suggest incorporating expert opinion      Use as they are
  Availability of data        Data mostly not publicly available        Data mostly publicly available
  Studies in each group       (Neil and Fenton, 1996, Fenton and        (Menzies et al., 2007, Peng et al.,
                              Neil, 1999, Fenton et al., 2002, 2008,    2011, Wang et al., 2011, Bell et al.,
                              Pai and Bechta Dugan, 2007, Klas          2011, Song et al., 2011, Sun et al.,
                              et al., 2010)                             2012, Bishnu and Bhattacherjee, 2012,
                                                                        Wang et al., 2013, He et al., 2015,
                                                                        Okutan and Yildiz, 2014)
No distinction has been noticed in the use of a prediction approach for a particular development
paradigm; any prediction approach can be used for any paradigm. The choice of approach
depends on the kind of data available to build the prediction model and/or the goal of the
prediction. The use of metrics related to one paradigm for predicting defects in another paradigm
has also been observed.
The targeted software quality factor also plays a significant role in the overall quality prediction
model. Whenever software quality is predicted, a basic objective is set, for example whether we
need to predict the stability or the reliability of the software. Khan et al. (Khan et al., 2006) also
mention that each prediction model has its own objectives in this regard. For example, Becker et
al. (Becker et al., 2006) only predict the performance of the software, whereas Xing et al. (Xing
and Guo, 2005) predict software reliability. These objectives should also be kept in mind while
selecting a technique appropriate for a software system. A few examples of quality factors are
number of errors, performance (Becker et al., 2006, Liu et al., 2005), reliability (Reussner et al.,
2003, Xing and Guo, 2005, Grassi and Patella, 2006), stability (Grosser et al., 2003, Bouktif
et al., 2004), dependability (Grassi, 2004), and customer perception (Mockus et al., 2005).

The literature survey revealed that many metrics have been used with the same definition but
different names, and sometimes different metrics have been given the same name.
2.8 Summary
Software metrics, specifically code metrics, are used to develop software defect prediction
models. The literature suggests that code metrics are often used in SDP models: their collection
is easy and they are reported to give good predictions. Though code metrics predict well, they
cannot be used for early predictions because code becomes available late in the software
lifecycle. The literature suggests that requirements and design metrics are used for early
predictions. All these metrics predict the occurrence of defects and not the causes of defects, the
reason being that data on causal relationships is not available in code metrics.

Defect data for prediction is also publicly available. The public datasets likewise consist of
code metrics. These datasets are dominated by examples of defect-free modules, which makes
the prediction of defect-prone modules difficult. Software defect prediction studies have also
been done using data which is not publicly available (Khoshgoftaar et al., 1996, Quah and
Thwin, 2003, Wang et al., 2004). There are also studies that identify the causes of defects, unlike
the studies that predict the occurrence of defects based on the size and complexity of software.
The literature further suggests that experiments conducted using public data can be replicated,
and studies reporting such experiments are more useful.
A variety of software defect prediction models exist in the literature, using a range of techniques
such as neural networks, evolutionary algorithms, Bayesian belief networks, regression-based
techniques, decision trees, decision tables, and Naive Bayes classifiers. Empirical studies have
also been performed to investigate the relationships between software metrics and defects.
Different studies have investigated the causes of defects, selected the software metrics that are
important for finding defects, and identified the software metrics that do not help in classifying
defect-prone modules. Further, the use of association mining for defect prediction has also been
reported. Bayesian Belief Networks (BBN) have been widely used to discover causes of defects.
Software process metrics have been combined with software product metrics to overcome
serious problems implicit in defect prevention, detection, and complexity. The significance of
causal analysis in software quality prediction has been highlighted, and holistic models for
software defect prediction using BBN have been presented as alternatives to the single-issue
models proposed in the literature. Bayesian networks have also been used for accurate predictions
of software defects in a range of real projects, without commitment to a particular development
life cycle. Association rule mining has also been employed to discover the patterns of actions
that are likely to cause defects in the software. Defect causal analysis has been duly
acknowledged in software process improvement techniques as well.
A variety of measures are available to assess the performance of prediction models: Accuracy,
Precision, J-Coefficient, F-Measure, G-Mean, Recall, and Area Under the ROC Curve (AUC)
are a few of them. Existing studies prefer AUC over other performance measures. In the software
domain, detection of a defective module is more important than detection of a defect-free
module, suggesting that Recall is important in the domain of software defects. In the literature,
Accuracy and Precision are not considered good performance measures compared to AUC and
Recall in the case of imbalanced datasets. Also, the performance achieved using a number of
data mining techniques suffers from a ceiling effect, meaning that the performance of new defect
prediction models is not improving significantly.
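The imbalance argument can be made concrete with a small computation. The confusion-matrix numbers below are invented for illustration: a predictor that misses most defective modules can still report high accuracy.

```python
def classification_scores(tp, fp, fn, tn):
    """Accuracy, Precision, and Recall from a confusion matrix,
    taking 'defective' as the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Imbalanced data: 10 defective vs. 990 clean modules. The model finds only
# 2 of the 10 defects, yet accuracy stays high because the clean majority
# dominates -- which is why Recall (and AUC) matter more in this domain.
accuracy, precision, recall = classification_scores(tp=2, fp=5, fn=8, tn=985)
print(round(accuracy, 3), round(recall, 2))
```

Here accuracy is 0.987 while recall is only 0.2, illustrating why accuracy alone is considered misleading on imbalanced defect data.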
A significant number of studies have been conducted to evaluate the performance of existing
defect prediction models. These studies use public datasets for the performance evaluation so
that the comparisons between the models can be refuted and/or replicated. Lessmann et al.
(Lessmann et al., 2008) have compared the performance of 22 defect prediction models.

In the following chapter we present a comparative study of models reported after 2008, using
Lessmann's comparison framework and steps.
3. SOFTWARE DEFECT PREDICTION MODELS: A COMPARISON
This chapter presents a comparative analysis of existing models, which helps select the prediction
model with the best performance so that the performance of the selected model can then be
improved using the association mining based preprocessing approach. This comparative analysis
follows the steps of Lessmann et al. (Lessmann et al., 2008): selecting classifiers, selecting
datasets, comparing the performance of the classifiers on multiple datasets using statistical tests,
and presenting the statistical difference in performance of the selected classifiers.
To conduct the comparative study of defect prediction models, we select 12 datasets from the
list of datasets provided in Chapter 2. These 12 datasets have been selected based on a similar
study with novel findings on the performance of classifiers (Lessmann et al., 2008). The
difference between their work and the present work is that they used 10 datasets whereas the
present study uses 12 datasets for the comparison of classifiers. Lessmann et al. compared 22
classifiers; the present work compares 5 classifiers, namely Random Forests, Naive Bayes,
CHIRP, DTNB, and FURIA. The first two of these were evaluated by Lessmann et al.
(Lessmann et al., 2008), whereas the remaining three have been reported in the literature after
their study. The scale of the critical difference in performance shown in the present study differs
from the one used by Lessmann et al. (Lessmann et al., 2008) because the numbers of datasets
and models are different.
The study by Lessmann et al. (Lessmann et al., 2008) provides a significant framework for com-
paring the performances of defect prediction models. Defect prediction models reported in the
literature after 2008 should be compared with the previous models to see whether a significant
change in prediction performance has been achieved. Therefore, the present study compares the
aforementioned 5 models using the framework provided by Lessmann et al. (Lessmann et al., 2008).
Random Forests and Naive Bayes have been selected from the existing comparison because Random
Forests was the best performer in their study (Lessmann et al., 2008) and NB has been used
extensively (Lessmann et al., 2008, Menzies et al., 2010, Mende, 2010, Jiang et al.,
2008a, Mende and Koschke, 2009, Menzies et al., 2007, Demsar, 2006) in the defect prediction
literature. For reference, we provide a brief description of the 5 models.
3.1 Description of Models
3.1.1 Naive Bayes
The Naive Bayes (NB) classifier is a simple probabilistic classifier based on Bayes' Theorem (Jiawei
and Micheline, 2002). NB classifiers are called "naive" because they assume independence among
the features. The classifier accepts a data sample in the form of an n-dimensional feature vector X,
where each dimension corresponds to a measurement made on one of the n attributes of the sample.
For a feature vector X = (x1, x2, x3, ..., xn), each xi measures an attribute Ai of the data sample.
Given a data point X without a class label, the NB classifier predicts the class of X.
In case of defect prediction, each vector represents a software module where each attribute is
a metric value corresponding to an attribute of the software module (for example size, cyclomatic
complexity, essential complexity, number of children etc.). The number of classes in our case
is 2 which are Defect-Prone (D) and Not-Defect-Prone (ND). For the 2 class problem the NB
classifier maximizes the posterior probability of a class Cj conditioned on X . The classifier uses
the Bayes Theorem:
P(Cj|X) = P(X|Cj) P(Cj) / P(X),   Cj ∈ {D, ND}   (3.1)
The NB classifier assigns a class label Cj to an unknown sample X if and only if P(Cj|X) >
P(Ck|X) for j ≠ k. This shows that the classifier works on the principle of maximizing P(Cj|X).
Using the Bayes Theorem in Equation 3.1, only P(X|Cj)P(Cj) is maximized because P(X) remains
constant for both classes. Generally, the class probabilities are assumed to be equally likely by this
classifier and the prior probabilities of the classes are estimated as P(Cj) = sj/s, where sj
is the number of training samples belonging to class j and s is the total number of samples. The
NB classifier performs well compared to many classifiers in spite of the "naive" assumption mentioned
earlier in the section. The classifier also follows a smart mechanism to deal with missing values.
During the training phase, the classifier does not consider an instance having attributes with missing
values in the calculation of the class prior probabilities. When performing classification, the attributes
with missing values are dropped from the calculations. Given our proposed preprocessing approach of
introducing missing values into the dataset (presented in Chapter 4), this elegant treatment
of missing values makes NB a good candidate for selection.
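The training and scoring behaviour described above can be sketched as follows. This is a minimal illustrative sketch, not the thesis implementation: the tiny dataset, the `train_nb`/`classify_nb` helpers, and the smoothing are hypothetical; missing metric values are represented by `None` and skipped, as described.

```python
from collections import Counter, defaultdict

def train_nb(samples, labels):
    """Count class frequencies and per-attribute value frequencies,
    skipping missing (None) values."""
    priors = Counter(labels)
    # counts[cls][attr_index][value] = frequency
    counts = defaultdict(lambda: defaultdict(Counter))
    for x, y in zip(samples, labels):
        for i, v in enumerate(x):
            if v is not None:              # drop missing values
                counts[y][i][v] += 1
    return priors, len(labels), counts

def classify_nb(x, priors, total, counts):
    """Pick the class maximizing P(Cj) * prod P(xi|Cj); missing attributes
    are dropped from the product, as described in the text."""
    best_cls, best_score = None, float("-inf")
    for cls, prior in priors.items():
        score = prior / total              # P(Cj) = sj / s
        for i, v in enumerate(x):
            if v is None:                  # missing attribute: dropped
                continue
            attr = counts[cls][i]
            n = sum(attr.values())
            # Laplace-style smoothing to avoid zero probabilities
            score *= (attr[v] + 1) / (n + len(attr) + 1)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

# Hypothetical discretized metric vectors (e.g. binned LOC, complexity)
X = [("hi", "hi"), ("hi", "lo"), ("lo", "lo"), ("lo", None)]
y = ["D", "D", "ND", "ND"]
model = train_nb(X, y)
print(classify_nb(("hi", "hi"), *model))
```

Note how an instance with a `None` attribute still contributes its observed attributes to the counts; the missing value simply never enters a likelihood product.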
3.1.2 Random Forests
Random Forests (Rnd For) (Breiman, 2001) is an aggregation of decision tree based classifiers.
Each tree is specialized for a part of the training set and its vote is counted towards the final
classification. The trees are developed using different initial samples of the data, but the size of the
initial sample remains the same for all trees. At each node, a set of candidate attributes is randomly
selected and the best split among them is adopted for further processing. Training of the trees in a
Random Forest is based on tree bagging. Given a training set X = x1, ..., xn with responses
Y = y1, ..., yn, and B the number of trees in the forest, bagging repeatedly selects a random sample,
with replacement, from the training set and fits a tree to each sample:
For i = 1, ..., B :
1. Sample, with replacement, n training examples from X, Y ; call these Xi, Yi.
2. Train a decision or regression tree fi on Xi, Yi.
Once all trees are trained, a test instance (or a set of test instances) is classified by all the
trees. The vote from each tree is taken with equal weight. A test instance x′ is classified by a
simple majority vote or, for regression, by averaging the predictions of the individual trees on x′:

f = (1/B) Σ_{i=1}^{B} f_i(x′)
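The two-step bagging loop above can be sketched as follows. A one-level decision stump stands in for a full decision tree, and the data and helper names are hypothetical; this is not Breiman's implementation (which additionally randomizes attribute selection at each node).

```python
import random
from collections import Counter

def train_stump(X, y):
    """One-level 'tree': pick the threshold on feature 0 with fewest errors."""
    best_t, best_err, best_labels = None, len(y) + 1, None
    for t in set(x[0] for x in X):
        left  = [c for x, c in zip(X, y) if x[0] <= t]
        right = [c for x, c in zip(X, y) if x[0] >  t]
        l_lab = Counter(left).most_common(1)[0][0] if left else y[0]
        r_lab = Counter(right).most_common(1)[0][0] if right else y[0]
        err = sum(c != l_lab for c in left) + sum(c != r_lab for c in right)
        if err < best_err:
            best_t, best_err, best_labels = t, err, (l_lab, r_lab)
    return lambda x: best_labels[0] if x[0] <= best_t else best_labels[1]

def bagged_forest(X, y, B=25, seed=1):
    rng = random.Random(seed)
    n, trees = len(X), []
    for _ in range(B):
        # Step 1: sample, with replacement, n training examples (Xi, Yi)
        idx = [rng.randrange(n) for _ in range(n)]
        Xi, Yi = [X[i] for i in idx], [y[i] for i in idx]
        # Step 2: fit a tree (here a stump) f_i on Xi, Yi
        trees.append(train_stump(Xi, Yi))
    def predict(x):
        # Equal-weight majority vote over all trees
        return Counter(t(x) for t in trees).most_common(1)[0][0]
    return predict

# Hypothetical one-metric modules: small values ND, large values D
X = [(1,), (2,), (3,), (10,), (11,), (12,)]
y = ["ND", "ND", "ND", "D", "D", "D"]
forest = bagged_forest(X, y)
print(forest((2,)), forest((11,)))
```

Even when an individual bootstrap sample is unrepresentative, the equal-weight vote over all B trees keeps the aggregate prediction stable.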
3.1.3 Composite Hypercubes on Iterated Random Projections (CHIRP)
Composite Hypercubes on Iterated Random Projections (CHIRP) (Wilkinson et al., 2011) is an
ensemble classifier that classifies test instances based on majority vote obtained by m runs of
CHIRP. CHIRP iteratively employs three stages namely projection, binning and covering for each
class as shown in Figure 3.1. The scoring is done based on Composite Hypercube Description
Regions (CHDR) (which can also be considered as sets of rectangles) (Wilkinson et al., 2012). A
Hypercube Description Region (HDR) represents the set of points less than a fixed distance from
the center (another point), whereas a CHDR is the set of points in union of zero or more HDRs.
The CHDRs are used to identify if a set of points forming a region belongs to any class or not
(Wilkinson et al., 2012). When scoring a given test instance, CHIRP transforms and re-scales the
test point, passes it through the list of CHDRs, and projects the test point for each CHDR.
The first rectangle in a CHDR that encloses the projected point determines the class of the test point.
In case no rectangle encloses the test point, the nearest rectangle determines the class.
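The scoring rule just described (first enclosing rectangle wins, otherwise the nearest rectangle) can be illustrated with one-dimensional "rectangles" (intervals). This is only a toy sketch of the CHDR lookup, not the actual CHIRP algorithm; the class labels and intervals are made up.

```python
def classify_chdr(chdrs, point):
    """chdrs: list of (class_label, [(lo, hi), ...]) in priority order."""
    # Rule 1: the first rectangle that encloses the point decides the class.
    for label, rects in chdrs:
        for lo, hi in rects:
            if lo <= point <= hi:
                return label
    # Rule 2: otherwise the nearest rectangle decides.
    def dist(rect):
        lo, hi = rect
        return max(lo - point, point - hi, 0)
    label, _ = min(((lbl, dist(r)) for lbl, rects in chdrs for r in rects),
                   key=lambda t: t[1])
    return label

# Hypothetical CHDRs: class D covers two intervals, class ND covers one
chdrs = [("D", [(0.0, 1.0), (5.0, 6.0)]), ("ND", [(2.0, 3.0)])]
print(classify_chdr(chdrs, 0.5))   # inside the first D rectangle
print(classify_chdr(chdrs, 3.6))   # nearest rectangle belongs to ND
```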
3.1.4 Decision Table - Naive Bayes (DTNB)
The Decision Table - Naive Bayes (DTNB) classifier (Hall and Frank, 2008) works in the same manner
as a decision table does, except that the selection of discriminative attributes is different. Instead of
the standard mechanism of selecting the discriminative attributes by maximizing cross-validation
performance, the attributes are split into two groups: one for Naive Bayes and the other for the
Decision Table. One split is modeled by NB and the other by DT. Overall class probability esti-
mates are generated by combining the probability estimates given by each model.

Fig. 3.1: CHIRP working in training and scoring

The overall
class probability estimates are calculated as follows (Hall and Frank, 2008):
P(y|X) = α × PDT(y|XDT) × PNB(y|XNB) / P(y)   (3.2)
where XDT and XNB are sets of attributes in DT and NB respectively, P (y) is the prior probability
of the class, α is a normalization constant, PDT (y|XDT ) is the class probability estimate obtained
from DT and PNB(y|XNB) is the class probability estimate computed for NB.
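Equation 3.2 can be evaluated directly once the two component estimates are available. Below is a small sketch with hypothetical probability values; α is recovered by normalizing over the two classes.

```python
def combine_dtnb(p_dt, p_nb, prior):
    """Combine DT and NB class probability estimates as in Eq. 3.2.
    p_dt, p_nb, prior: dicts mapping class -> probability."""
    raw = {y: p_dt[y] * p_nb[y] / prior[y] for y in prior}
    alpha = 1.0 / sum(raw.values())          # normalization constant
    return {y: alpha * v for y, v in raw.items()}

# Hypothetical component estimates for classes D and ND
p = combine_dtnb(p_dt={"D": 0.6, "ND": 0.4},
                 p_nb={"D": 0.7, "ND": 0.3},
                 prior={"D": 0.3, "ND": 0.7})
print(p)  # combined estimates sum to 1; class D dominates here
```

Dividing by the prior P(y) is what prevents the class prior from being counted twice, since it already enters both component estimates.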
3.1.5 Fuzzy Unordered Rule Induction Algorithm (FURIA)
Fuzzy Unordered Rule Induction Algorithm (FURIA) (Huehn and Huellermeier, 2009) is an ex-
tension of the rule learner RIPPER (Cohen, 1996). Rule sets generated by FURIA are simple and
comprehensible, as they are in the case of RIPPER. Unlike RIPPER (which learns conventional rules
and ordered rule lists), FURIA learns fuzzy rules and unordered rule sets. Handling fuzzy rule sets
makes the decisions made by FURIA more general and less abrupt compared to conventional
(non-fuzzy) models, which have sharp decision boundaries and abrupt transitions between classes. In
order to deal with the systematic bias (of conventional models) in favor of the class that is to be
predicted, FURIA works on the principle of separating one class from all the others.
The problems introduced by the fuzzy approach are well anticipated by FURIA. FURIA
resolves the conflict that may arise when an instance is equally well covered by rules from
different classes, although this conflict is unlikely to arise. Cases in which an instance is not covered
by any rule are handled by a rule stretching method, which generalizes the existing rules until they
cover the uncovered examples.
3.2 Comparison Framework
To compare the performance of multiple algorithms over multiple data sets, the use of statistical
measures is common. Researchers propose the use of the Friedman test followed by the corre-
sponding post-hoc Nemenyi test (Lessmann et al., 2008, Jiang et al., 2008a, Demsar, 2006). The
Friedman test is a non-parametric counterpart of ANOVA and is applied to ranked data, rather
than the actual values, which makes it less susceptible to outliers. ANOVA
was specifically designed to test mean accuracies (significance of difference between multiple
means) across different data sets. However, the Friedman test is preferred over ANOVA (Less-
mann et al., 2008, Demsar, 2006). The procedure of performing the test is similar to that of other
hypothesis tests. The hypothesis being tested in this setting is:
H0: The performance of each pair of defect prediction models has no significant difference.
vs.
H1: At least two models have significantly different performance.
All models are ranked according to their AUC values for each data set, with the best-performing model receiving rank 1. The
mean rank, ARi, is calculated for each model i over all data sets. The test statistic of the Friedman
test is calculated as:
χ²_F = [12K / (L(L+1))] [ Σ_{i=1}^{L} AR_i² − L(L+1)²/4 ],   (3.3)

AR_i = (1/K) Σ_{j=1}^{K} r_i^j,   (3.4)
where K is the total number of data sets, L is the total number of classifiers, and r_i^j is the rank
of classifier i on data set j.
However, the χ²_F statistic is quite conservative (Demsar, 2006); therefore the following statistic
is preferred:

F_F = (K−1) χ²_F / (K(L−1) − χ²_F).   (3.5)
The test statistic is distributed according to the F-distribution with L−1 and (L−1)(K−1)
degrees of freedom. In our experiments, the measure of interest is the mean AUC estimated over
10-fold cross validation, using a 95% confidence level (α = 0.05) as the threshold to judge
significance. Using these values, the critical value of the test is obtained. If the calculated test
statistic is greater than the critical value, H0 is rejected, which implies that the performance
difference between at least two classifiers is significant.
If H0 is rejected, the post-hoc Nemenyi test is applied to detect which specific pair of classifiers
differ significantly. For every two models, it tests the null hypothesis that their mean ranks are
statistically the same. For this, the critical difference (CD) is calculated. The null hypothesis is
rejected if the difference between the mean ranks of the models exceeds CD, which is calculated
as:

CD = q_{α,∞,L} √( L(L+1) / (6K) ).   (3.6)
The value q_{α,∞,L} is based on the Studentized range statistic (Lessmann et al., 2008).
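Equations (3.3) to (3.6) can be checked numerically for this study's setting (L = 5 classifiers, K = 12 data sets, mean ranks taken from Table 3.1). The sketch below reproduces the values derived in Section 3.5.

```python
import math

K, L = 12, 5
# Mean ranks AR_i from Table 3.1: NB, Rnd For, CHIRP, DTNB, FURIA
AR = [2.21, 1.42, 4.50, 3.00, 3.88]

# Friedman statistic, Eq. (3.3)
chi2_F = (12 * K) / (L * (L + 1)) * (sum(a * a for a in AR)
                                     - L * (L + 1) ** 2 / 4)
# F-distributed variant, Eq. (3.5)
F_F = (K - 1) * chi2_F / (K * (L - 1) - chi2_F)

# Nemenyi critical difference, Eq. (3.6), with q_{0.05, inf, 5} = 2.73
q = 2.73
CD = q * math.sqrt(L * (L + 1) / (6 * K))

print(round(chi2_F, 2), round(F_F, 2), round(CD, 2))   # → 29.78 17.98 1.76
```

The printed values match the hand calculations in the analysis section, confirming that the flattened equations above were reconstructed consistently.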
Tab. 3.1: Results of Classifiers over Selected Data Sets in Terms of AUC
CM1 JM1 KC1 KC3 MC1 MC2 MW1 PC1 PC2 PC3 PC4 PC5 AR
NB 0.57 0.64 0.70 0.56 0.72 0.76 0.79 0.80 0.55 0.71 0.90 0.71 2.21
Rnd For 0.76 0.78 0.74 0.63 0.75 0.65 0.86 0.78 0.77 0.83 0.95 0.78 1.42
CHIRP 0.53 0.55 0.56 0.49 0.50 0.69 0.65 0.52 0.50 0.50 0.63 0.62 4.50
DTNB 0.35 0.64 0.68 0.57 0.66 0.72 0.75 0.76 0.85 0.63 0.78 0.68 3.00
FURIA 0.64 0.60 0.58 0.60 0.50 0.58 0.57 0.64 0.49 0.50 0.83 0.68 3.88
3.3 Experiment
The goal of this experiment is to compare the performance of the five classifiers described
in Section 3.1. The performance of each classifier (in terms of AUC) is obtained on a test set that is a
randomized version of the actual data set. This process is repeated for each data set and is also
called the split-sample setup. As mentioned earlier, the complete procedure in this section has been
adopted from (Lessmann et al., 2008).
All data sets are randomized and partitioned into 2/3 for training sets (model building) and 1/3
for test sets (performance estimation). The merits of this scheme are that it provides an objective
estimate of a model’s generalized performance and it enables easy replication. Also, it is used
extensively in defect prediction studies as reported in (Lessmann et al., 2008).
Tab. 3.2: Mean AUC and Std. Dev. Over the Complete Range of Tuning Parameters
NB Rnd For CHIRP DTNB FURIA
CM1 0.568 ± 0.000 0.747 ± 0.024 0.531 ± 0.000 0.482 ± 0.076 0.583 ± 0.037
JM1 0.640 ± 0.000 0.700 ± 0.037 0.550 ± 0.003 0.638 ± 0.010 0.583 ± 0.013
KC1 0.695 ± 0.000 0.728 ± 0.018 0.562 ± 0.002 0.688 ± 0.005 0.604 ± 0.017
KC3 0.555 ± 0.000 0.633 ± 0.033 0.502 ± 0.017 0.577 ± 0.020 0.582 ± 0.045
MC1 0.717 ± 0.000 0.738 ± 0.025 0.500 ± 0.000 0.648 ± 0.016 0.504 ± 0.017
MC2 0.762 ± 0.000 0.655 ± 0.012 0.635 ± 0.036 0.722 ± 0.002 0.630 ± 0.029
MW1 0.786 ± 0.000 0.822 ± 0.043 0.646 ± 0.003 0.761 ± 0.011 0.751 ± 0.073
PC1 0.799 ± 0.000 0.783 ± 0.032 0.529 ± 0.012 0.715 ± 0.045 0.620 ± 0.040
PC2 0.550 ± 0.000 0.733 ± 0.065 0.500 ± 0.000 0.609 ± 0.142 0.490 ± 0.004
PC3 0.707 ± 0.000 0.821 ± 0.024 0.505 ± 0.002 0.594 ± 0.039 0.536 ± 0.043
PC4 0.897 ± 0.000 0.938 ± 0.017 0.614 ± 0.013 0.689 ± 0.049 0.830 ± 0.023
PC5 0.713 ± 0.000 0.773 ± 0.014 0.621 ± 0.006 0.713 ± 0.021 0.675 ± 0.017
Except for NB, all the other models have tuning parameters, also known as hyper-parameters.
These hyper-parameters enable a model to be adapted to a particular problem. Since each data
set represents a different problem, each of the models with hyper-parameters has to be tuned for
each particular data set to acquire a characteristic assessment of that classifier's performance. To
this end, a grid-search technique is used in the classifier selection step: for each tuning parameter,
a set of candidate values is selected. Each model is empirically evaluated over all possible
combinations of its hyper-parameters using 10-fold cross validation on its training data, i.e. the
data is split into 10 partitions and 10 iterations are performed in which each partition serves as the
test set once while the other 9 partitions are used as the training set. This is a common scheme in
statistical learning. Since AUC is used for classifier comparison, it is also used to guide the selection of the
tuning parameters during the search. Thus, the model that achieves the maximal performance with
a particular combination of hyper-parameters is used on the test data, and the results are recorded
and reported.
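The grid-search procedure above can be sketched as follows. The `evaluate_auc` scorer is a hypothetical stand-in for training a classifier and measuring AUC on one cross-validation fold; the grid mirrors the FURIA candidate values only as an example.

```python
from itertools import product

def grid_search(param_grid, evaluate_auc, folds=10):
    """Evaluate every hyper-parameter combination by mean AUC over
    the cross-validation folds; return the best combination."""
    names = list(param_grid)
    best_params, best_auc = None, -1.0
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        auc = sum(evaluate_auc(params, f) for f in range(folds)) / folds
        if auc > best_auc:
            best_params, best_auc = params, auc
    return best_params, best_auc

# Hypothetical scorer: pretends performance peaks at folds=10, minNo=2
def fake_auc(params, fold):
    return 0.7 - 0.01 * abs(params["folds"] - 10) \
               - 0.005 * (params["minNo"] - 2)

grid = {"folds": [3, 5, 10, 15, 20], "minNo": [2, 10, 15]}
best, auc = grid_search(grid, fake_auc)
print(best)   # the combination with the highest mean AUC
```

In the actual experiment, `evaluate_auc` would train the WEKA classifier with the given parameters on 9 partitions and compute AUC on the held-out partition.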
As mentioned above, NB does not require any parameter setting. The parameter settings for
Rnd For have been taken from (Lessmann et al., 2008), where settings for two parameters were
considered. For the number of trees, the values [10, 50, 100, 250, 500, 1000] were assessed, and for
the number of randomly selected attributes per tree the values [0.5, 1, 2]·√M were assessed, where M
is the number of attributes of the data set. For CHIRP, the default value numVoters = 7 in WEKA is
the same as the one mentioned in (Wilkinson et al., 2011) for the corresponding parameter m. How-
ever, we tuned the model over the values [7, 10, 15] to capture performance improvements. Similarly,
parameter tuning was performed for DTNB and FURIA even though a default parameter setting was
mentioned for both. DTNB has one tuning parameter, crossVal, whose default is leave-one-out
cross-validation (crossVal = 1); we use the values [1, 3, 5, 10, 15]. For
FURIA the parameters and the respective candidate values used are: folds = [3, 5, 10, 15, 20],
minNo = [2, 10, 15] and optimizations = [2, 5, 10, 15]. All algorithms were run using
version 3.7.5 of WEKA.
3.4 Results
In this section we present and analyze the results obtained from the experiment described earlier.
Table 3.1 shows the performances obtained in terms of AUC of each of the classifiers on each of
the data sets. The last column displays the mean rank ARi of each classifier over all data sets. The
mean rank is used in conducting the Friedman test. For each data set, the best AUC value obtained
over all classifiers is highlighted in boldface.
To get a better understanding of the overall performance of the models, the mean and standard
deviation of their performances in terms of AUC are reported, across all values of hyper-parameter
combinations, in Table 3.2. By comparing the two tables, we can see if there are any outliers
in Table 3.1 and we can also observe the amount of performance variations for each model over
the selected hyper-parameter settings. However, for models with no hyper-parameters (like NB),
this check on performance reporting cannot be applied. But in our case, we have a better idea
of the performance of NB on similar data sets from previous studies. Based on those studies,
we can safely assume that the values obtained here are not outliers. Another problem with this
approach is that models with a larger set of candidate values over all hyper-parameters intuitively
give more accurate estimates of the mean and standard deviation, so we cannot get objective results
unless the overall number of candidates is held constant for all models. The latter may not be
desirable, as different models require different numbers of candidate values per hyper-parameter to
capture performance gains. A better way is to perform the experiment more than once using a
different randomized version of the original data sets for each run, take the mean of performances
over all runs, and report the best mean and its standard deviation for each model on each data set.
But this is more time consuming and not ideal under resource constraints.
From Table 3.1 it is evident that the classifiers mostly achieve AUC results above 0.5. Only one
AUC value falls far below 0.5, that of DTNB for data set CM1; on average, however, DTNB has
values above 0.6. Apart from this, only a couple of AUC values of FURIA and CHIRP are 0.5 or below.
PC4 shows the overall best performance of the models for a data set (with an average AUC of 0.82
over all models). If we consider the sizes of the data sets, it becomes evident that size does not
appear to impact the accuracy of the models directly. This can be observed by comparing the
performances on large data sets, such as JM1 (7782 modules), with those on smaller data sets such
as CM1 (327 modules) and MC2 (125 modules). It can be seen that there is no pattern suggesting
that models perform better or worse on large or small data sets. The same is the case with the
number of attributes. In terms of efficiency, NB performed quite
well, since it did not require any parameter adjustment. With higher values of tuning parameters,
Rnd For, CHIRP and DTNB were not as efficient. CHIRP’s parameter tuning did not seem to
produce a significant difference in performance in terms of AUC. This is also evident from Table
3.2 where the standard deviation of AUC for CHIRP over most data sets does not exceed 0.01. In
fact, it is the only model that has standard deviation for some data sets equal to 0. Another point to
be noted for CHIRP is that it was mentioned in (Wilkinson et al., 2011) that increasing the value of
m (numVoters) will increase the accuracy of the model. But that is not the case.

Fig. 3.2: Nemenyi's Critical Difference Diagram for evaluation using AUC

An initial check with m = 7 and m = 20 was done on randomized subsets of data sets CM1 and PC5. For CM1,
the claim holds true, with AUC = 0.52 and AUC = 0.55 for m = 7 and m = 20, respectively.
Also, since it is the same data set and the same model with only one variable (the tuning parameter)
changing, the percentage of correctly identified instances is a reasonable measure for checking accuracy.
With m = 7 and m = 20 this percentage is 84.72 and 85.19, confirming our previous result. But
with PC5 we have, for m = 7 and m = 20, AUC = 0.52 and AUC = 0.55 and percentages 76.88
and 76.79, respectively. This shows that with a greater value of m, performance in terms of correctly
identified instances actually decreased (even if the difference may not be significant).
3.5 Analysis and Discussion
To compare the performances of each classifier and rank them in order of overall performance, it is
important to check whether there is a significant statistical difference among their AUC values. The
test statistics are calculated using Table 3.1. The Friedman test tests whether there is a difference
in the performance among the 5 classification models over the 12 data sets. For this experiment,
there are L−1 = 5−1 = 4 and (L−1)(K−1) = (5−1)(12−1) = 44 degrees of freedom. With α = 0.05
the critical value of F-Distribution is 2.58. For the more conservative Chi Square distribution, the
critical value at α = 0.05 is 9.49. The test statistics are calculated, using (3.3) and (3.5), as follows:
χ²_F = [ (12 × 12) / (5(5+1)) ] × [ 2.21² + 1.42² + 4.50² + 3.00² + 3.88² − 5(5+1)²/4 ] = 29.78,

F_F = (12 − 1) × 29.78 / (12(5 − 1) − 29.78) = 17.98.
Since 17.98 > 2.58, and even for the more conservative test 29.78 > 9.49, the null
hypothesis (H0), which states that the performances of these 5 models over the 12 data sets are
statistically equal, is rejected. Consequently, we proceed to Nemenyi's post-hoc test with α = 0.05,
i.e. we perform pair-wise comparisons for every two models and check whether the difference
in their performances surpasses the critical difference. The critical value q_{α,∞,L} is 2.73 (Demsar,
2006). The critical difference is calculated, using (3.6), as:

CD = 2.73 × √( 5(5+1) / (6 × 12) ) = 1.76.
The results of the pair-wise comparison are depicted in Figure 3.2, utilizing a revised version
of the significance diagrams introduced in (Demsar, 2006). All classifiers are plotted on the y-axis
and their corresponding average ranks on the x-axis, along with a line whose length is
equal to the critical difference CD. The right end of the line shows which models, whose mean
ranks lie further to the right, are significantly outperformed by the particular classifier. Thus,
all classifiers whose lines do not overlap in this plot perform significantly differently. Considering this,
we can see that even though Rnd For performs better than all the other algorithms, its performance is
not significantly different from that of NB and DTNB. CHIRP performs the worst and its performance
is significantly different from all other models except FURIA and DTNB. Based on these results,
we do not recommend CHIRP for defect prediction. FURIA's performance was significantly lower
than that of Rnd For, but it statistically performs as well as NB. So this model may perform better
than other models used in defect prediction studies, like CART, RBF net and K-Star. However,
DTNB gives respectable results and, based on the critical difference, performs as well as NB and
Rnd For. Furthermore, the results observed in this empirical evaluation confirm the conclusions of
previous works regarding the accuracy of Rnd For and NB for defect prediction using static code
features.
Comparing Table 3.1 and Table 3.2, we can see that the lowest value in Table 3.1 (DTNB's
performance for CM1) is an outlier, since this AUC value is more than a couple of standard
deviations below the mean. However, even in Table 3.2, this model does not perform better than
any other model for CM1. In fact, the ranking for Table 3.2 is the same as that for Table 3.1, which
means that similar test statistics would be obtained if they were applied to Table 3.2.
This observation further strengthens confidence in the results and findings of this empirical study.
It should be noted that Nemenyi’s statistics tests the null hypothesis that performance of two
models does not differ (same has been reported earlier (Lessmann et al., 2008)). But if H0 is not
rejected, it does not guarantee that the null hypothesis is true. For example, Nemenyi’s test is
unable to reject H0 for Rnd For and DTNB i.e. the hypothesis that they have the same mean ranks.
So the difference in their performances may only be due to chance. But this difference can also
occur because of Type II error i.e. that there is a significant difference between the two models but
the test is unable to detect it with α = 0.05. This implies that rejecting the null hypothesis only
means that there is a high probability that two models differ significantly (where this probability
= 1− α).
Considering this, an overall conclusion that can be reached from this experiment is that predictive
performance alone is not sufficient to assess the worth of a classifier and has to be supple-
mented by other criteria. This argument is also supported by Mende et al. (Mende and Koschke,
2009), whose tests show that the average performance in terms of AUC of a trivial model turned
out to be the same as that of NB. So other criteria, like computational efficiency and transparency,
should also be considered.
So far we have evaluated the performance of 5 different classifiers based on their average AUC
values over 12 data sets, which were cleaned versions of data sets from the PROMISE repository
(Menzies et al., 2012). The significance of the results is measured using statistical tests. Two of
the classifiers, NB and Rnd For, are used as base models since their performance stands out in a
number of other defect prediction studies. The other three are models proposed in the years
preceding this study: DTNB (2008), FURIA (2009) and CHIRP (2011).
We evaluated whether these new models are useful for defect prediction studies by comparing
their performances with those of the baseline models. It turns out that only one model, DTNB,
gives a reasonable performance. CHIRP, due to its low accuracy, is not recommended for use
in defect prediction studies, and the performance of FURIA is also questionable. We have also
provided comprehensive details of the characteristics of the data sets used.
Defect prediction from static code metrics is a fairly recent and active area of research and quite
a lot of improvements can be made in classifying faulty modules based on static code metrics. One
of these includes the comparison of performance between different classifiers for predicting de-
fects in software systems. This constitutes one of the least studied areas in empirical software
engineering (Mende and Koschke, 2009). Studies like these will help in improving the understand-
ing of existing models and will also help incorporate new and novel classification models for this
problem domain.
3.5.1 Threats to Validity
This section focuses on some threats to the validity of the comparative study presented in this chapter.
Problems related to data set characteristics have already been given in an earlier section. Another
apparent problem regarding the data sets is that this study covers a limited number of data sets from
only one source. There is a debate about how representative these data sets are (Lessmann
et al., 2008, Jiang et al., 2008a, Mende and Koschke, 2009, Menzies et al., 2007). Lessmann et al.
(Lessmann et al., 2008) favor the use of these data sets and refer to several other studies that have
argued in favor of the selected data sets in terms of their generic characteristics and suitability for
software defect prediction. On the other hand, Mende et al. (Mende and Koschke, 2009) take
the opposing stance that these data sets may not be representative, and so the results of their study
cannot necessarily be generalized.
These data sets might require pre-processing (Gray et al., 2011), depending on the models used
and on problems related to, or particular characteristics of, the data sets. These
problems have not been empirically investigated for the new models in this study. As mentioned
in (Lessmann et al., 2008), classification is only one of the steps within a multistage data mining
process. Other steps can increase the performance of certain classifiers. Keeping this in mind, pre-
processing of the data sets for CHIRP might have improved the accuracy of the results for this model
(Wilkinson et al., 2011). This applies to the other models as well, including NB and Rnd For.
Further and more extensive parameter tweaking for each of the new models may produce better
results. However, a set of representative candidate values for each hyper-parameter needs to be
identified first. This may require exhaustive testing of each model on data sets similar to the ones
used for defect prediction. Tuning of parameters affects both the accuracy and the efficiency of these
models.
Another possible problem is related to the sampling procedure, which might bias the results (Less-
mann et al., 2008, Menzies et al., 2007). However, in response to this problem, Lessmann et al.
(Lessmann et al., 2008) state that the split-sample setup is an acknowledged and common ap-
proach for classification experiments. Also, the size of the chosen data sets appears sufficiently
large to substantiate this setting.
3.6 Summary
This chapter described the process followed for the comparative study of existing models and
presented the results of the study. The datasets used are from the PROMISE repository and are
described in Chapter 2. From those datasets, 12 have been used to compare the performance of
defect prediction models, following a similar comparative study (Lessmann et al., 2008), so that the
performance of the new models could be compared with that of the older ones on common datasets.
The five models compared on these datasets are the Naive Bayes classifier (NB), Random
Forests (Rnd For), CHIRP, DTNB, and FURIA. Based on the statistical ranking of the 5 models, NB
and Rnd For emerge as the winners. This study further selects one of them to test the preprocessing
approach proposed in Chapter 4. Since the data being used has quality issues, like duplicate data
points, the tree-based Rnd For is considered prone to over-fitting. Secondly, the proposed
preprocessing introduces missing values into the data, and the mechanism used by NB to handle
missing values is simpler than that used by Rnd For. Therefore, this study selects NB for the
further steps.
4. INCREASING RECALL IN SOFTWARE DEFECT PREDICTION
Publicly available defect prediction data is dominated by Not-Defect-prone (ND) modules. In
such scenarios, identifying the software metric values that associate with the scarce Defect-
prone (D) modules is a challenging task. The non-availability of a sufficient number of D modules (as
training examples) is one of the reasons that models do not achieve high Recall. This chapter
presents an association mining based approach that allows models to learn defect-prone mod-
ules in imbalanced defect datasets. As shown in Figure 4.1, the datasets are preprocessed using
the proposed approach, a defect prediction model is developed using the preprocessed data, and a
performance analysis of the model is carried out with Recall as the performance measure. The
preprocessing step partitions the data and finds important itemsets such that the prediction of D mod-
ules can be improved. Afterwards, the preprocessed datasets are used for model development.
As discussed in the previous chapter, the Naive Bayes (NB) classifier is used as a test case to
evaluate the proposed preprocessing. The performance of the NB classifier is analyzed as a next step.
The significance of Recall for performance analysis is highlighted through a questionnaire distributed
in the software industry. The results show that the algorithm either improved the Recall of the NB
classifier (by up to 40%) or left it unchanged.

Fig. 4.1: Major Steps of Our Methodology (Preprocessing, Model Development, Performance Analysis, Performance Measure)

Fig. 4.2: Preprocessing Step (Equi-Frequency Binning; Horizontal Partitioning, DS = Pt ∪ Pf; Generation of Frequent Itemsets; Itemset Support Calculation; Selection of Focused and Indifferent Itemsets; ranges having strong/weak association with defects; Set as 'Missing' Value in Partition Pf; Un-Partition Data)
4.1 Proposed Preprocessing
In the preprocessing step we modify the dataset before development of the NB based classifier.
Activities in the preprocessing step are shown in Figure 4.2. We have used the datasets discussed
in Chapter 2 and discretized the inputs through Equi-Frequency Binning. After the discretization
step, we have partitioned (Wang et al., 2005) the data into two parts: partition Pt includes the
defect-prone instances whereas partition Pf consists of the not-defect-prone instances. Afterwards,
frequent itemsets are generated in each partition using the Apriori algorithm (Jiawei and Micheline,
2002), and the support of each itemset is calculated in the respective partition to gauge its usefulness
and strength. Support of an itemset is defined as the number of instances containing the itemset
divided by the total number of instances. We generate only those itemsets that show an association
between software metrics and D modules, i.e. the itemsets that can be used to further generate
Class Association Rules (CARs) (Liu et al., 1998). CARs are a special type of association rules
with the class attribute as consequent. In the next step, the important itemsets (called focused
itemsets) are selected from partition Pt and their values are set as missing in partition Pf . Both
partitions of the data are combined before the defect prediction model is developed.
Algorithm 1 Focused Itemsets based Approach to Preprocess Data
Require: Dataset, n, αD, αND, β, τ1,τ2
Ensure: Datasetm
1: Data = discretize(Dataset, n)
2: [dataDdataND] = partitionClassWise(Data)
3: for each class c ∈ {D,ND} do
4: FrequentItemsetsc = generateFIusingApriori(datac, αc)
5: SortedItemsetsc = sort(FrequentItemsetsc, DescSupport)
6: end for
7: for each itemset i ∈ {SortedItemsetsD ∩ SortedItemsetsND} do
8: if Supporti > τ1 in SortedItemsetsD AND Supporti > τ1 in SortedItemsetsND then
9: if |Supporti in SortedItemsetsND - Supporti in SortedItemsetsD| < τ2 then
10: Mark the corresponding attribute as Indifferent
11: else
12: Mark the corresponding itemset as Focused
13: end if
14: end if
15: end for
16: for each itemset i ∈ SortedItemsetsD do
17: if Supporti > β then
18: dataModifiedND = setMissing(dataND, i)
19: end if
20: end for
21: Return Datasetm = [dataD dataModifiedND]
The proposed algorithm identifies associations between software metrics and Defect-prone (D)
modules. The metric values strongly associated with D modules (called Focused Itemsets) are
identified using Algorithm 1. Algorithm 1 takes 7 inputs and returns the preprocessed data. The
7 inputs are: the Dataset to be preprocessed; the number n of bins each attribute will be divided
into; αD and αND as minimum support values to generate frequent itemsets in partitions D and
ND respectively; a threshold value β to decide whether a focused itemset should be set missing;
and thresholds τ1 and τ2 that help in marking attributes as indifferent attributes, i.e. the attributes
that do not significantly help in the classification of D modules. The algorithm returns a modified
version of the input data in which the focused itemsets appearing with ND modules are replaced
with missing values. This is done to prevent the prediction model from learning ND modules
associated with the metric values corresponding to the focused itemsets. The rest of the section
describes the preprocessing step in detail.
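The control flow of Algorithm 1 can be sketched in Python for 1-itemsets. This is our own minimal, hypothetical rendition, not the thesis implementation: the helper names (`support_counts`, `preprocess`) and the data shapes are assumptions, and the sketch marks individual itemsets rather than whole attributes as indifferent, following Eqs. 4.2 and 4.3.

```python
def support_counts(rows):
    """Per-partition support (%) of every 1-itemset, i.e. (attribute, bin) pair."""
    counts = {}
    for row in rows:
        for item in row.items():
            counts[item] = counts.get(item, 0) + 1
    return {item: 100.0 * c / len(rows) for item, c in counts.items()}

def preprocess(data, alpha, beta, tau1, tau2):
    """Sketch of Algorithm 1 restricted to 1-itemsets.

    data: list of (attributes, label) pairs, where attributes maps a metric
    name to a discretized bin label and label is 'D' or 'ND'. alpha maps each
    class to its minimum support; beta, tau1, tau2 are the thresholds.
    """
    part = {c: [attrs for attrs, y in data if y == c] for c in ("D", "ND")}
    # Frequent 1-itemsets per class (minimum support filter).
    sup = {c: {i: s for i, s in support_counts(part[c]).items() if s >= alpha[c]}
           for c in ("D", "ND")}
    # Indifferent itemsets: frequent in both partitions with similar support.
    indifferent = {i for i in set(sup["D"]) & set(sup["ND"])
                   if sup["D"][i] > tau1 and sup["ND"][i] > tau1
                   and abs(sup["ND"][i] - sup["D"][i]) < tau2}
    focused = set(sup["D"]) - indifferent
    # Set focused itemsets with support > beta as 'missing' in the ND partition.
    modified_nd = []
    for row in part["ND"]:
        row = dict(row)
        for attr, val in list(row.items()):
            if (attr, val) in focused and sup["D"][(attr, val)] > beta:
                row[attr] = None
        modified_nd.append(row)
    return [(r, "D") for r in part["D"]] + [(r, "ND") for r in modified_nd]
```

On a toy dataset where the bin "hi" of loc co-occurs mostly with D modules, the function blanks that bin out of the ND partition only, leaving the D partition untouched.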
4.1.1 Discretization
The proposed approach needs to find associations between defect-prone modules and certain ranges
of software metric values. Such associations can be found if the attributes are categorical. However,
most of the metrics in the datasets used are quantitative in nature. Therefore, the data needs to
be discretized. We discretize the data by dividing each quantitative attribute into equi-frequency
bins (Jiawei and Micheline, 2002). Essentially, this step divides each attribute into intervals of
metric values.
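As a minimal illustration (our own sketch, not the tool actually used for the experiments), equi-frequency binning can be implemented by ranking the values and cutting the ranking into n equally sized groups:

```python
def equi_freq_bins(values, n):
    """Assign each value a bin index 0..n-1 such that the bins hold
    (roughly) equal numbers of values, i.e. equal-frequency discretization."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    size = len(values) / n
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / size), n - 1)
    return bins
```

For example, `equi_freq_bins([10, 2, 7, 5, 1, 9], 3)` puts two values in each of the three bins regardless of how the raw metric values are spaced, which is the point of equal-frequency (as opposed to equal-width) binning.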
4.1.2 Horizontal Partitioning
The partitioning step horizontally divides a dataset into two parts, Pt and Pf . Partition Pt has
the instances of D modules and partition Pf contains the instances of ND modules. During
this step, each dataset DS is divided into two subsets such that |DS| = |Pt| + |Pf |, DS = Pt ∪ Pf ,
and Pt ∩ Pf = ∅.
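A direct sketch of this step (the field name `defective` is a hypothetical stand-in for the class attribute), with the stated size invariant checked explicitly:

```python
def partition(ds):
    """Horizontally split dataset DS into Pt (defect-prone) and Pf
    (not defect-prone), so that DS = Pt ∪ Pf and Pt ∩ Pf = ∅."""
    pt = [m for m in ds if m["defective"]]
    pf = [m for m in ds if not m["defective"]]
    assert len(ds) == len(pt) + len(pf)   # |DS| = |Pt| + |Pf|
    return pt, pf
```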
4.1.3 Generating Frequent Itemsets
We have applied the Apriori algorithm (Jiawei and Micheline, 2002) on each partition to find
frequent itemsets. An itemset is frequent if it occurs often enough in a partition to satisfy a
minimum Support threshold. As mentioned earlier, an itemset in our case is an interval, so this
step finds those intervals of all metrics that co-occur frequently either with D modules in partition
Pt or with ND modules in partition Pf .
Our approach focuses on special itemsets that do not include the class attribute. Normally, the
Apriori algorithm does not distinguish between class and non-class attributes and uses the
discretized values of a class attribute to generate frequent itemsets; a class attribute is included in
a frequent itemset if it is associated with another item in the data. Our approach requires the
Apriori algorithm to generate only those itemsets which are individually or collectively associated
with the class attribute (i.e. Defects).
Support count Counti of an itemset iseti is the frequency of its occurrence in a partition. If
m is the total number of independent attributes and n is the number of intervals for each attribute,
Support of iseti in a partition Pj is calculated as follows:

Supporti = (Counti / |Pj|) × 100 (4.1)

where i ≤ m × n and j ∈ {t, f}. By convention, the value of Supporti varies from 0% to 100%
instead of 0 to 1.0 (Jiawei and Micheline, 2002). It is pertinent to mention that in each partition,
the itemsets can have different support values as compared to their support in the whole dataset.
Also the itemsets can have different lengths. An Itemset with one item is called 1-Itemset and an
Itemset with k items is known as k-Itemset. A 1-Itemset is essentially one interval from the range
of values an attribute can have (Jiawei and Micheline, 2002).
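Equation 4.1 translates directly into code. The helper below is our own sketch with assumed data shapes (rows represented as attribute-to-bin dictionaries); it handles itemsets of any length, so it covers both 1-itemsets and k-itemsets:

```python
def support(itemset, partition):
    """Support_i = (Count_i / |P_j|) * 100, per Eq. 4.1.

    itemset: iterable of (attribute, bin) pairs; a 1-itemset has one pair.
    partition: list of attribute -> bin dictionaries (the rows of Pt or Pf).
    """
    # Count_i: rows of the partition that contain every item of the itemset.
    count = sum(all(row.get(a) == b for a, b in itemset) for row in partition)
    return 100.0 * count / len(partition)
```

With a two-row partition in which both rows carry `ev(g)=(-inf-1.2]` but only one carries `loc=(65.5-inf)`, the 1-itemset support is 100% and the 2-itemset support is 50%.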
4.1.4 Selecting Focused and Indifferent Itemsets
In order to find the critical ranges for each attribute, we need to identify itemsets in partition Pt
that satisfy a minimum support threshold αt. At the same time, a similar minimum support
threshold αf is applied on itemsets in partition Pf . The itemsets (or intervals) with Support ≥ αt
and Support ≥ αf are itemsets of interest and are called Itemsett and Itemsetf respectively.
Focused itemsets appear in Itemsett only and do not appear in Itemsetf whereas indifferent
itemsets appear in both:
Focused = Itemsett − Indifferent (4.2)
Indifferent = Itemsett ∩ Itemsetf (4.3)
Indifferent itemsets are the itemsets that appear in both partitions Pt and Pf and satisfy the αt
and αf thresholds in the respective partitions. These itemsets do not affect the classification of D
modules and can be ignored when developing a classification model with high Recall. Attributes
with only indifferent bins do not facilitate classification and can be dropped before developing a
defect classification model. Itemsets that appear in partition Pt, satisfy αt, and are not indifferent
are called focused itemsets. The attributes with focused itemsets are good indicators of defects.
These ranges of values need to be studied further in order to gain a better understanding of the
causes of defects.
Identification of focused and indifferent itemsets needs to be validated. For each dataset, we
compare the attributes having focused itemsets against a list of all attributes ranked by their
Information Gain (IG) (Jiawei and Micheline, 2002). The attributes with focused itemsets should
be ranked higher whereas the attributes with indifferent itemsets should be ranked lower by the IG
based ranking approach.
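Equations 4.2 and 4.3 amount to set algebra over the per-partition frequent itemsets. A hypothetical sketch, with supports expressed in percent and the function name our own:

```python
def focused_and_indifferent(sup_t, sup_f, alpha_t, alpha_f):
    """Itemset_t / Itemset_f are the itemsets meeting the minimum support
    in Pt / Pf; then Indifferent = Itemset_t ∩ Itemset_f (Eq. 4.3) and
    Focused = Itemset_t − Indifferent (Eq. 4.2)."""
    itemset_t = {i for i, s in sup_t.items() if s >= alpha_t}
    itemset_f = {i for i, s in sup_f.items() if s >= alpha_f}
    indifferent = itemset_t & itemset_f
    focused = itemset_t - indifferent
    return focused, indifferent
```

Using support values resembling the cm1 row of Table 4.3, `ev(g)=(-inf-1.2]` is frequent in both partitions and comes out indifferent, while `loc=(65.5-inf)` is frequent only with defects and comes out focused.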
4.1.5 Modifying Dataset
After marking the itemsets as focused and indifferent, the algorithm modifies the data such that
the classification of D modules can improve. The algorithm modifies the instances in partition Pf
and thereby affects the ability of the model to learn ND modules. All the focused itemsets that
co-occur with ND modules and have support greater than the threshold β (provided by the user)
are set to missing values. This prevents the prediction model from learning ND modules from the
instances in Pf . The same itemsets appearing in partition Pt remain unchanged. The modified
partition and the unchanged partition are combined together to form the complete dataset. The
resultant modified dataset is returned by the algorithm.
4.1.6 Time Complexity Analysis
Asymptotic time analysis of Algorithm 1 has been done to see how much time the suggested
approach will take when applied on large datasets. If a dataset has n instances, then each of the
lines 1–2 will execute once, and each method call will have a complexity of O(n). The loop at
line 3 will iterate 2 times for a binary class problem (our case) and as many times as the number
of classes in a multi-class problem. The number of classes is a constant, say nc, therefore the loop
will iterate a constant number of times. The time complexity of line 4 is a function of the number
of instances n, the number of attributes in the dataset na, and the number of items nI . In our case
nI = na × nb, where nb is the number of bins for each attribute. The time complexity of line 4 is
bounded by O(n × nI) for itemsets of length 1 (also known as 1-itemsets). For k-itemsets the
running time becomes O(n × nI^k). The sort at line 5 takes O(nI^2). The loop at line 7 iterates
nI times and each statement in the loop takes constant time, so the time complexity of this loop is
O(nI). The time complexity of the next loop (line 16) is also O(nI). Hence the time complexity
of the whole algorithm is as follows when generating k-itemsets:
Time Complexity = O(n) + O(n × nI^k) + O(nI^2) + O(nI) + O(nI)
               = O(n × nI^k)    (4.4)
When k = 1 (i.e. 1-itemsets) the time complexity is O(n × nI). Presently we focus on 1-itemsets
to improve the performance of the NB classifier. It is pertinent to mention that the defect
prediction activity is not performed frequently; it is usually performed prior to the coding and
testing phases in a project.
4.2 Developing Defect Prediction Model
Once the focused itemsets in partition Pf have been set as missing, the data is used to develop the
prediction model. The model is developed with different levels of preprocessing; for example, by
setting only one itemset as missing, then by setting another itemset as missing, and so on. There
is a range of prediction models reported in the literature, Naive Bayes being among the best
(Lessmann et al., 2008, Menzies et al., 2010). In the second phase of our process we develop
Naive Bayes and Decision Tree based models using the modified data with 10 bins for each
attribute and evaluate the models using 10-fold cross validation.
Naive Bayes (NB) classifier is a simple probabilistic classifier based on Bayes' Theorem (Jiawei
and Micheline, 2002):

P (Cj|X) = P (X|Cj)P (Cj) / P (X), Cj ∈ {D,ND} (4.5)

The NB classifier assigns a class label Cj to an unknown sample X if and only if P (Cj|X) >
P (Ck|X) for j ≠ k. This shows that the classifier works on the principle of maximizing P (Cj|X).
Using Bayes' Theorem in Equation 4.5, only P (X|Cj)P (Cj) is maximized because P (X) remains
constant for both classes. Generally, the class probabilities are assumed to be equally likely by
this classifier, and the prior probabilities of the classes are estimated as P (Cj) = sj/s, where
sj is the number of training samples belonging to class j and s is the total number of samples.
The classifier follows a smart mechanism to deal with missing values. In the training phase, the
classifier does not consider an instance having attribute(s) with missing value(s) for the calculation
of class prior probabilities. When performing classification, the attribute(s) with missing value(s)
is/are dropped from the calculations. With our proposed preprocessing approach of introducing
missing values, this elegant treatment of missing values makes NB a good candidate for selection.
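The missing-value behaviour described above can be made concrete with a small categorical NB sketch. This is our own simplified code, not the Weka implementation used in the experiments; Laplace-style smoothing is added (an assumption on our part) to avoid zero probabilities:

```python
import math
from collections import Counter, defaultdict

def nb_train(rows, labels):
    """Train a categorical Naive Bayes. Attribute values of None are treated
    as missing: fully observed rows only contribute to the class priors, and
    missing values are skipped in the per-attribute counts."""
    priors = Counter(y for row, y in zip(rows, labels)
                     if all(v is not None for v in row.values()))
    cond = defaultdict(Counter)   # (class, attribute) -> Counter of bin values
    for row, y in zip(rows, labels):
        for attr, val in row.items():
            if val is not None:
                cond[(y, attr)][val] += 1
    return priors, cond

def nb_classify(priors, cond, row):
    """Pick argmax_j P(C_j) * prod P(x|C_j); attributes missing in the test
    row are simply dropped from the product."""
    best, best_lp = None, -math.inf
    total = sum(priors.values())
    for y, ny in priors.items():
        lp = math.log(ny / total)
        for attr, val in row.items():
            if val is None:
                continue
            counts = cond[(y, attr)]
            # Laplace-style smoothing to avoid log(0) for unseen values.
            lp += math.log((counts[val] + 1) /
                           (sum(counts.values()) + len(counts) + 1))
        if lp > best_lp:
            best, best_lp = y, lp
    return best
```

In a toy training set where the bin "hi" of loc occurs only with D modules, a test module with loc = "hi" is classified D, and an ND training instance whose loc has been set missing by the preprocessing is excluded from the prior counts.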
4.3 Performance Analysis
The preprocessing should not deteriorate the performance of the model developed without
preprocessing. Performance of the developed models is evaluated in terms of Recall. We expect
that the effort to increase Recall will result in an increased FPRate as well. This requires the
proposed approach to take into account the possibility of a high FPRate and its acceptance in the
software industry (the users of the prediction model). To this end we have performed a 2-fold
evaluation of the proposed approach. First, we have collected responses from the software industry
against the questions listed in Table 4.1. Responses to these questions indicate whether the industry
accepts models with a high FPRate given the benefit of increased Recall. If our approach is to be
useful, the industry should accept high Recall at the cost of a high FPRate. Secondly, we have
empirically tested the stability of the approach using 5 public datasets. We have verified that the
Recall of the model developed with the proposed preprocessing does not fall below the Recall of
the model developed without preprocessing. Models are developed with different numbers of bins
for this purpose.
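For reference, both measures used in this analysis reduce to the four confusion-matrix counts. A small hypothetical helper, treating the defect-prone class as positive:

```python
def recall_fprate(actual, predicted, positive="D"):
    """Recall = TP/(TP+FN) and FPRate = FP/(FP+TN), with the defect-prone
    class 'D' as the positive class."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp / (tp + fn), fp / (fp + tn)
```

A model that flags extra ND modules raises FPRate without touching Recall, which is exactly the trade-off the questionnaire probes.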
Tab. 4.1: Questions asked from the Software Industry

Q1. If you have very serious time and budget constraints for testing but there are enough human
resources, which of these will be your acceptable result from a given Defect Prediction Model
(DPM)?
Options: a. High TPRate b. High FPRate c. High Accuracy d. Low Accuracy e. Low FPRate
f. Low TPRate
Rationale: To know the important performance metrics for industry.

Q2. If you have very serious time and budget constraints for testing but there are enough human
resources, which of these results will you NOT accept from a given Defect Prediction Model
(DPM)?
Options: a. High TPRate b. High FPRate c. High Accuracy d. Low Accuracy e. Low FPRate
f. Low TPRate
Rationale: To see what cannot be tolerated by the software industry.

Q3. If you have very few human resources for testing but there is enough time left in project
delivery, which of these results will you accept from a given Defect Prediction Model (DPM)?
Options: a. High TPRate and High FPRate b. Low TPRate but Low FPRate as well c. High
TPRate and Low Accuracy d. None of these
Rationale: To see if the software industry agrees to test extra modules given that defect-prone
modules are predicted better.

Q4. For you, the cost of misclassifying a Defective module as Defect-free is higher than the cost
of misclassifying a Defect-free module as Defective.
Options: a. Strongly Disagree b. Disagree c. Equal Cost d. Agree e. Strongly Agree
Rationale: To confirm that testing extra modules is acceptable if defective modules are predicted
better. The respondent should rate 4 or 5 (Agree or Strongly Agree) if Recall is more important.

Q5. For you, correct detection of a defective module is more important than the correct detection
of defect-free modules.
Options: a. Strongly Disagree b. Disagree c. Equally Important d. Agree e. Strongly Agree
Rationale: To confirm that high Recall is more important than FPRate. The respondent should
rate 4 or 5 (Agree or Strongly Agree) if Recall is more important.

Q6. What are the scenarios where the cost of misclassifying a defect-free module is more than
the cost of misclassifying a defective module?
Options: a. Never b. When time for testing is too short c. When budget constraint is too severe
d. Other
Rationale: To identify the scenarios where a high FPRate is not acceptable.
4.3.1 Identifying Performance Measure
Since an increase in FPRate is expected while improving Recall, we have consulted the software
industry to know whether a performance shortfall in FPRate is acceptable to them and whether
Recall is important to them even at the cost of an increased FPRate. We have received 30
responses from the software industry; the responses to each question are shown in Figure 4.3. The
results from the questionnaire show that the majority of industry personnel prefer Recall over the
other performance measures of defect prediction models. The results also show that project
managers can tolerate FPRate to a certain extent, as shown in Figure 4.3c.
4.3.2 Results
Results are reported in three steps. In the first step the results of preprocessing are presented, in
the second step the development of prediction models with a fixed number of bins is reported,
and in the third step the results for the Naive Bayes model are reported for different numbers of
bins. The third step presents the results on selected datasets. Results reported in this section have
been obtained using Weka (Witten et al., 2008). We have performed the same set of actions on
each dataset and have recorded the performance. As a first step, we have discretized the datasets
and divided each attribute into 10 equi-frequency bins (Jiawei and Micheline, 2002, Cios et al.,
2007). Each dataset is then partitioned into Pt and Pf as discussed in Section 4.1.2. Within each
partition we have applied the Apriori algorithm with the minimum Support values (αt, αf ) shown
in Table 4.2 to generate frequent itemsets. Table 4.2 also shows the number of itemsets of different
lengths. Afterwards, the frequency of occurrence, Supporti, of each of the 1-Itemsets is calculated.
Table 4.3 shows the top five 1-Itemsets and their Supporti in each partition of each dataset. The
itemsets in boldface are focused itemsets whereas the itemsets in italics are indifferent itemsets.
The focused itemsets present the intervals (ranges) that are highly associated with the occurrence
of defects. It is pertinent to mention that there are more focused itemsets than the ones shown in
Table 4.3. Focused and indifferent itemsets with length greater than 1 can be found in (Rana et al.,
2013).
Fig. 4.3: Questionnaire Results Showing Industry's Response Regarding Recall. Panels (a)–(f)
are charts of the percentage of responses chosen for each option of Questions 1–6. For Question 6
(panel f), 22.70% of the respondents chose 'Never', 59.10% chose 'When time for testing is too
short', 4.50% chose 'When budget constraint is too severe', and 13.60% chose 'Other'.
Tab. 4.2: Minimum Support Thresholds and Itemset Counts for Each Dataset
Dataset | Partition Pt: αt, 1-Itemsets, 2-Itemsets, 3-Itemsets | Partition Pf: αf, 1-Itemsets, 2-Itemsets, 3-Itemsets
cm1 15% 53 179 517 10% 81 240 429
jm1 15% 41 193 724 10% 98 298 902
kc1 15% 35 142 395 20% 15 91 312
kc2 20% 30 95 287 20% 16 101 376
kc3 15% 32 113 208 15% 18 97 317
pc1 20% 20 95 298 10% 82 83 77
pc3 20% 40 53 37 20% 23 125 475
pc4 20% 29 48 30 20% 22 124 443
pc5 40% 21 156 620 85% 16 108 399
mc1 15% 10 27 33 15% 17 88 299
mw1 15% 10 27 33 15% 17 88 299
ar1 10% 24 113 298 10% 18 91 273
ar4 25% 122 556 1369 15% 16 87 299
ar6 10% 122 556 1369 15% 16 87 299
Tab. 4.3: Top 5 1-Itemsets and their Supporti in each partition
Partition Pt Partition Pf
Dataset 1-Itemset Supporti 1-Itemset Supporti
cm1
locCodeAndComment
= (-inf-0.5]
97.95% locCodeAndComment
= (-inf-0.5]
99.77%
ev(g)=(-inf-1.2] 63.26% ev(g)=(-inf-1.2] 76.61%
loc=(65.5-inf) 34.69% iv(g)=(-inf-1.2] 50.77%
lOComment=(34.5-
inf)
28.57% lOCode=(-inf-0.5] 46.77%
n=(400.5-inf) 26.53% lOComment=(-inf-0.5] 34.52%
jm1
locCodeAndComment
= (-inf-0.5]
78.96% locCodeAndComment
= (-inf-0.5]
90.13%
ev(g)=(-inf-1.2] 55.17% ev(g)=(-inf-1.2] 71.71%
lOComment=(-inf-0.5] 53.32% lOComment=(-inf-0.5] 70.15%
loc=(90.5-inf) 25.83% iv(g)=(-inf-1.2] 38.96%
l=(0.005-0.035] 23.97% lOBlank=(-inf-0.5] 32.03%
kc1
locCodeAndComment
= (-inf-0.5]
92.94% locCodeAndComment
= (-inf-0.5]
94.16%
ev(g)=(-inf-1.2] 68.09% ev(g)=(-inf-1.2] 89.90%
lOComment=(-inf-0.5] 56.44% lOComment=(-inf-0.5] 80.42%
loc=(49.5-inf) 31.59% iv(g)=(-inf-1.2] 66.85%
uniq Op=(15.5-inf) 30.67% v(g)=(-inf-1.2] 64.21%
kc2
locCodeAndComment
= (-inf-0.5]
71.02% locCodeAndComment
= (-inf-0.5]
90.60%
v(g)=(-inf-1.2] 51.40% ev(g)=(-inf-1.2] 88.91%
uniq Opnd=(36-inf) 39.25% lOComment=(-inf-0.5] 72.77%
total Op=(151.5-inf) 35.14% iv(g)=’(-inf-1.2] 56.62%
loc=(95.5-inf) 34.58% v(g)=(-inf-1.2] 52.04%
kc3
locCodeAndComment=(-
inf-0.5]
65.11% locCodeAndComment=(-
inf-0.5]
95.9%
ev(g)=(-inf-2] 62.79% ev(g)=(-inf-2] 86.02%
essential density=(-
inf-0.5)
62.79% essential density=(-inf-
0.5]
86.02%
Decision Density =(1-
2.03]
55.81% Decision Count=(-inf-
1]
60.48%
HALSTEAD
LEVEL=(0.065-0.075]
16.27% Decision Density
=(-inf-1)
55.9%
pc1
ev(g)=(-inf-1.2] 62.33% locCodeAndComment
= (-inf-0.5]
76.45%
locCodeAndComment
= (-inf-0.5]
46.75% ev(g)=(-inf-1.2] 71.51%
loc=(54.5-inf) 33.76% lOComment=(-inf-0.5] 55.81%
I=(65.25-inf) 32.46% iv(G)=(-inf-1.2] 47.69%
B=(0.565-inf) 31.16% v(g)=(-inf-1.2] 30.52%
pc3
ev(g)=(-inf-1.2] 66.25% ev(g)=(-inf-1.2] 74.70%
ESSENTIAL DENSITY
= (-inf-0.025]
66.25% ESSENTIAL DENSITY
= (-inf-0.025]
74.70%
DECISION DENSITY
=(1-2.01]
49.38% locCodeAndComment
= (-inf-0.5]
69.14%
PARAMETER COUNT
=(0.5-1.5]
45.00% lOComment=(-inf-0.5] 61.15%
uniq Opnd=(44.5-inf) 31.25% PERCENT COMMENTS
=(-inf-0.25]
52.53%
pc4
ESSENTIAL DENSITY
= (-inf-0.5]
96.07% ESSENTIAL DENSITY
= (-inf-0.5]
92.5%
DECISION DENSITY
= (1-2.5]
82.02% ev(g)=(-inf-1.2] 75.94%
ev(g)=(-inf-2] 82.02% locCodeAndComment
=(-inf-0.5]
69.45%
PARAMETER COUNT
=(-inf-0.5]
63.48% CONDITION COUNT
=(-inf-2]
59.69%
CALL PAIRS=(0.5-
1.5]
34.27% DECISION COUNT=(-
inf-1]
59.69%
pc5
PARAMETER COUNT=(-
inf-0.5]
96.12% ev(g)=(-inf-2] 97.17%
n=(44.5-inf) 77.91% ESSENTIAL DENSITY=(-
inf-0.025]
97.17%
v=(201.765-inf) 77.71% iv(g)=(-inf-1.5] 95.55%
CYCLOMATIC DENSITY
=(-inf-0.245]
72.29% PARAMETER COUNT=(-
inf-0.5]
94.70%
l=(0.005-0.105] 70.93% lOComment=(-inf-0.5] 91.88%
mc1
locCodeAndComment=(-inf-0.5] 93.3% locCodeAndComment=(-
inf-0.5]
94.62%
loccomment =(9.5-13] 16.67% PARAMETER COUNT=(-
inf-0.5]
48.11%
ev(g)=(-inf-2] 70.0% ev(g)=(-inf-2] 83.33%
essential density=(-
inf-0.5)
100% essential density=(-inf-
0.5]
94.62%
halstead level=(-inf-
0.045]
16.67% DECISION COUNT=(-
inf-1]
36.29%
mw1
locCodeAndComment=(-
inf-0.5]
93.3% locCodeAndComment=(-
inf-0.5]
94.62%
loccomment =(9.5-13] 16.67% PARAMETER COUNT=(-
inf-0.5]
48.11%
ev(g)=(-inf-2] 70.0% ev(g)=(-inf-2] 83.33%
essential density=(-
inf-0.5)
100% essential density=(-inf-
0.5]
94.62%
halstead level=(-inf-
0.045]
16.67% DECISION COUNT=(-
inf-1]
36.29%
ar1
locCodeAndComment=(-
inf-0.5]
100% locCodeAndComment=(-
inf-0.5]
92.85%
blank loc=(-inf-1] 88.89% blank loc=(-inf-1] 94.64%
formal parameters=(-
inf-0.5)
66.67% formal parameter=(-
inf-0.5]
71.42%
halstead error=(-inf-
0.045]
20% multiple Condition Count=(-
inf-0.5)
50.89%
unique operand
=(8.5-10]
22.2% design Complexity=(-
inf-0.5]
46.42%
ar4
locCodeAndComment=(-inf-0.5] 93.3% locCodeAndComment=(-
inf-0.5]
94.62%
loccomment =(9.5-13] 16.67% PARAMETER COUNT=(-
inf-0.5]
48.11%
ev(g)=(-inf-2] 70.0% ev(g)=(-inf-2] 83.33%
essential density=(-
inf-0.5)
100% essential density=(-inf-
0.5]
94.62%
halstead level=(-inf-
0.045]
16.67% DECISION COUNT=(-
inf-1]
36.29%
ar6
locCodeAndComment=(-
inf-0.5]
73.3% locCodeAndComment=(-
inf-0.5]
94.18%
Blank loc=(-inf-0.5] 80% Blank loc=(-inf-0.5] 91.86%
Decision
Density=(0.635-1.065)
60% Decision Density=(-
inf-0.5]
50%
unique operands=(31.5-
inf)
20% decision density=(-inf-
0.5]
50.0%
Total loc =(53.5-inf] 20% branch Count=(-inf-1] 38.37%
In order to identify the indifferent and focused 1-Itemsets we have used the thresholds given in
Table 4.4. It can be noticed that in nearly all cases τt ≤ τf in Table 4.4. This is because there is
less data regarding defect-prone modules, hence the itemsets associated with defects have low
support. So, in order to pick a reasonable number of focused itemsets, we had to keep τt very low.
Further, we have also identified longer itemsets (i.e. itemsets with length > 1). The top 3 such
itemsets from selected datasets are shown in Table 4.5.
As a next step we have developed defect prediction models to evaluate the effectiveness of the
focused itemsets. The models developed are a J48 decision tree and a Naive Bayes (NB) model.
Both models have been developed for selected datasets. The datasets used for J48 are cm1, jm1,
pc1, kc1, and kc2. Each dataset is discretized into 10 equi-frequency bins. Further, the proposed
preprocessing has been applied on each dataset. After dropping the indifferent attributes, the
longer itemsets identified as focused are relabelled as missing one by one. The relabelling of up
to 3 itemsets has been done and an increase in Recall has been observed, as shown in Table 4.6.
These results have been partially reported in (Rana et al., 2013).

Tab. 4.4: τt and τf , used in this study, for each dataset
Dataset τt τf
cm1 25% 30%
jm1 20% 30%
kc1 25% 25%
kc2 20% 25%
kc3 38% 37%
pc1 25% 30%
pc3 20% 30%
pc4 20% 49%
pc5 55% 87%
mc1 15% 37%
mw1 17% 37%
ar1 20% 47%
ar4 25% 37%
ar6 34% 39%
Tab. 4.5: Top 3 2-Itemsets and their Supporti in each partition
Partition Pt Partition Pf
Dataset 2-Itemset Supporti 2-Itemset Supporti
CM1
ev(g)= ’(-inf-1.2]’
locCodeAndComment=’(-
inf-0.5]’ {Indifferent}
61.22 % ev(g)= ’(-inf-1.2]’
locCodeAndComment=’(-
inf-0.5]’ {Indifferent}
76.61 %
loc=‘(65.5-inf)’ loc-
CodeAndComment
=(-inf-0.5]
34.69 % iv(g)= ’(-inf-1.2]’
locCodeAndComment=’(-
inf-0.5]’
50.78 %
loc=‘ (65.5-inf)’
lOComment=‘(34.5-
inf) ’
28.57 % ev(g)= ’(-inf-1.2]’
iv(g)= ’(-inf-1.2]’
46.77 %
JM1
lOComment=’(-
inf-0.5]’
locCodeAndComment=’(-
inf-0.5]’ {Indifferent}
49.72 % lOComment=’(-
inf-0.5]’
locCodeAndComment=’(-
inf-0.5]’ {Indifferent}
66.76 %
lOBlank=’(-inf-0.5]’
locCodeAndCom-
ment =’(-inf-0.5]’
21.56 % iv(g)=’(-inf-1.2]’
locCodeAndComment=’(-
inf-0.5]’
37.65 %
e=’(48232.24-inf)’
t=’(2679.57-inf)’
21.27 % ev(g)=’(-inf-1.2]’
iv(g)=’(-inf-1.2]’
35.39 %
KC1
ev(g)=’(-inf-1.2]’
locCodeAndComment=’(-
inf-0.5]’ {Indifferent}
65.34 % ev(g)=’(-inf-1.2]’
locCodeAndComment=’(-
inf-0.5]’ {Indifferent}
86.26 %
e=’(14140.38-inf)’
t=’(775.605-inf)’
28.83 % ev(g)=’(-inf-1.2]’
iv(g)=’(-inf-1.2]’
66.74 %
n=’(147.5-inf)’
v=’(795.61-inf)’
28.53 % iv(g)=’(-inf-1.2]’
locCodeAndComment=’(-
inf-0.5]’
65.68 %
KC2
ev(g)=’(-inf-1.2]’
lOCodeAndComment=’(-
inf-0.5]’ {Indifferent}
44.86 % ev(g)=’(-inf-1.2]’
lOCodeAndComment=’(-
inf-0.5]’ {Indifferent}
83.13 %
v=’(1403.34-inf)’
uniq Opnd=’(36-inf)’
34.58 % ev(g)=’(-inf-1.2]’
lOComment=’(-inf-
0.5]’
70.84 %
uniq Opnd=’(36-inf)’
total Op=’(151.5-inf)’
34.58 % lOComment=’(-
inf-0.5]’
lOCodeAndComment=’(-
inf-0.5]’
69.40 %
In the third step the Naive Bayes classifier has been developed and its performance for different
numbers of bins has been observed. The performance of the NB model for up to 20 bins has been
observed and the trend has been studied. Details of the performance of NB for the 5 datasets, i.e.
1 dataset from each of the groups mentioned in Section 4.1, are presented in Table 4.7.

Tab. 4.6: Performance of Decision Tree Model in terms of Recall
                        KC1    KC2    JM1    CM1
No Pre Processing       0.34   0.505  0.13   0.22
Attributes Dropped      0.507  0.527  0.155  0.154
1-Itemsets Relabelled   0.553  0.587  0.258  0.449
2-Itemsets Relabelled   0.553  0.613  0.339  0.449
3-Itemsets Relabelled   0.554  0.613  0.367  0.449

Table 4.7 presents different performance parameters including Recall. The trend of Recall has
been plotted. The performance of the model does not deteriorate for up to 20 bins, as shown in
Figure 4.4, when compared with no preprocessing. It is expected that the model will show similar
results for the rest of the datasets from each group.
Tab. 4.7: Performance of NB classifier on different number of bins with and without proposed approach

PM¹       PAA²   4     6     8     10    12    14    16    18    20
AR4
Recall    No     0.70  0.70  0.55  0.60  0.65  0.55  0.70  0.55  0.70
          Yes    0.75  0.80  0.70  0.80  0.75  0.75  0.75  0.70  0.75
FP Rate   No     0.22  0.15  0.16  0.17  0.16  0.18  0.16  0.18  0.17
          Yes    0.22  0.15  0.17  0.17  0.17  0.18  0.17  0.18  0.19
CM1
Recall    No     0.65  0.63  0.63  0.63  0.57  0.59  0.59  0.57  0.51
          Yes    0.81  0.75  0.71  0.69  0.73  0.73  0.71  0.69  0.71
FP Rate   No     0.34  0.34  0.33  0.32  0.31  0.31  0.29  0.29  0.28
          Yes    0.34  0.34  0.34  0.31  0.31  0.31  0.29  0.29  0.28
KC3
Recall    No     0.86  0.84  0.84  0.84  0.84  0.81  0.84  0.81  0.79
          Yes    0.88  0.88  0.88  0.86  0.86  0.86  0.86  0.86  0.86
FP Rate   No     0.34  0.32  0.32  0.32  0.31  0.30  0.30  0.29  0.28
          Yes    0.35  0.32  0.32  0.33  0.32  0.31  0.31  0.30  0.30
MC1
Recall    No     0.82  0.82  0.82  0.82  0.82  0.82  0.82  0.82  0.82
          Yes    0.84  0.87  0.88  0.87  0.84  0.87  0.87  0.87  0.95
FP Rate   No     0.18  0.16  0.16  0.15  0.14  0.14  0.15  0.14  0.14
          Yes    0.17  0.15  0.15  0.15  0.14  0.14  0.15  0.14  0.14
PC3
Recall    No     0.73  0.74  0.74  0.73  0.74  0.73  0.73  0.73  0.74
          Yes    0.84  0.84  0.84  0.84  0.89  0.89  0.88  0.89  0.89
FP Rate   No     0.30  0.30  0.30  0.30  0.29  0.29  0.29  0.29  0.28
          Yes    0.30  0.29  0.29  0.29  0.28  0.28  0.28  0.28  0.27

¹PM = Performance Measure. ²PAA = Proposed Approach Applied.
4.4 Analysis and Discussion
For the datasets used in this study, very high ranges of the focused itemsets mainly contribute
towards defects. In most cases the intervals bounded by infinity, inf, are found to be highly
associated with defects. For the dataset jm1, the itemset l = (0.005− 0.035] is an exception, but
this itemset has a very low Supporti value and appears only when we keep αt < 23%.
The intervals with high values are the critical ranges and are very important for software
managers and researchers. If, during a software project, the values of the mentioned software
metrics fall in the critical ranges, this should raise an alarm and the project schedules, resource
plans, and testing plans should be adjusted accordingly. Further, a defect prediction model can
use the critical ranges for improved classification of defect-prone modules.
The focused itemsets do not only identify the critical ranges for each software metric, they also
indicate the software metrics that can improve the detection of defect-prone modules. Some
attributes have at least one focused itemset in all datasets, which shows the strength of those
attributes in the prediction of defects. Across five datasets, the majority vote reveals that loc, n, v,
d, i, e, b, t, lOCode, uniq Op, uniq Opnd, total Op and total Opnd consistently contribute to the
occurrence of
Fig. 4.4: Trend of Recall across five datasets. Panels (a) CM1, (b) PC3, (c) MC1, (d) KC3, and
(e) AR4 plot Recall against the number of bins (1–20) for the Recall without preprocessing and
the minimum and maximum Recall with the proposed preprocessing.
(a) CM1  (b) PC3  (c) MC1  (d) KC3  (e) AR4
[Each panel plots the percentage change in Recall against the number of bins (1 to 20).]
Fig. 4.5: Percentage change in Recall across five datasets with and without the proposed preprocessing
defects. In contrast, lOCodeAndComment, ev(g) and lOComment do not necessarily cause defects.
In addition to identifying the focused itemsets, the experiments have also identified some ranges that are neutral to the presence or absence of defects. lOCodeAndComment = (−inf, 0.5], ev(g) = (−inf, 1.2], and lOComment = (−inf, 0.5] are a few examples. lOCodeAndComment = (−inf, 0.5] appears in all datasets without exception. Supporti for this itemset is very high for both partitions, with a minimum Supporti = 46.75% for partition Dt and Supporti = 76.45% for partition Df of dataset pc1. lOCodeAndComment is a count of all lines of code including comments. The interval represented by this itemset covers very low values of this metric. This indicates that when a software module is small in size, measured in terms of lines of code and comments, this metric does not contribute to the occurrence or absence of defects. The low range of ev(g) (i.e. ev(g) = (−inf, 1.2]) also appears as an indifferent itemset in four datasets.
The trends of Recall across the five datasets are encouraging. Figure 4.5 shows that most of the time the change in Recall has been positive. The bars in the figure indicate the percentage increase or decrease in Recall when the proposed preprocessing has been used with varying bin sizes. This improvement in Recall is achieved at the cost of a higher False Positive Rate (FPRate). Figure 4.6 shows the percentage change in FPRate. The percentage increase in false positive rate has remained below 5% for all datasets except AR4, where it has been around 14%. As already discussed in Section 4.3.1, this increase in false positive rate is acceptable and can be tolerated in exchange for better Recall.
4.5 Summary
Software metrics have been investigated over the years for software defect prediction. We have studied the relationship between software product metrics and software defects. We have selected public datasets and discretized the data to study associations between software metrics and defects. From the discretized data we have generated frequent itemsets and identified the 1-itemsets strongly associated with defects. We call these 1-itemsets the focused itemsets. The 1-itemsets that are strongly associated with both the presence and absence of defects are called indifferent itemsets. We have also identified longer itemsets and their association with defect-prone (D) modules. Based on the
(a) CM1  (b) PC3  (c) MC1  (d) KC3  (e) AR4
[Each panel plots the percentage change in FPRate against the number of bins (1 to 20).]
Fig. 4.6: Percentage change in FPRate across five datasets with and without the proposed preprocessing
indifferent attributes and focused itemsets, we have proposed a preprocessing approach that has further been used to develop J48 and Naive Bayes defect prediction models. Up to 3-itemsets have been preprocessed for the J48 model. Further, the stability of the proposed approach has been studied by developing the NB model with different numbers of bins, and the resulting trend of Recall has been analyzed.
Analysis of the focused itemsets across datasets shows that the very high ranges of loc, n, v, d, i, e, b, t, lOCode, uniq_Op, uniq_Opnd, total_Op and total_Opnd consistently contribute to causing defects, whereas lOCodeAndComment, ev(g), iv(g) and lOComment do not necessarily cause defects. Besides identifying the attributes associated with defects, this study also identifies the critical ranges of these attributes. The performance of the J48 and NB models with 10 bins has either increased or remained unchanged when different numbers of itemsets are set to missing. The trend of Recall suggests that the performance of the NB model has also either improved or remained unchanged when the number of bins is increased from 2 to 20. This relabeling of bins to missing values has increased the prediction of D modules by up to 40%.
The discussion so far has been based on static code metrics. A major issue with code metrics is that they are not available in the early phases of the lifecycle and therefore cannot be effectively used for early prediction.
5. EARLY PREDICTIONS USING IMPRECISE DATA
This chapter proposes the use of a Fuzzy Inference System (FIS) for early detection of defect-prone modules. Predicting defect-prone modules before code metrics become available is desirable and useful, since it can help avoid defects later in the lifecycle. Defects caught later are more expensive to fix than defects caught earlier in the lifecycle. To enable software engineers to get defectiveness information in earlier lifecycle phases, we propose a model that works with imprecise inputs. The model can be used for early rough estimates when exact values of software measurements are not available.
5.1 Imprecise Inputs and Defect Prediction
Using code metrics to develop a prediction model gives good predictions (Menzies et al., 2007). However, code is available late in the software lifecycle. The literature suggests that the earlier the prediction, the more useful it is; for example, potential defects can be avoided and better resource plans can be generated in the early stages of development. The literature also suggests that early estimates are less accurate than estimates made later in the software lifecycle, as shown in Figure 5.1 (Pfleeger, 2010). One reason for the imprecise estimates is the lack of details about the software. In the earlier stages, imprecise/fuzzy information can be used to make rough predictions, which can be improved later in the lifecycle when exact details become available. To this end we propose the use of a Fuzzy Inference System (FIS) for early detection of defect-prone modules. The FIS works with imprecise inputs, which are design and code metrics defined in fuzzy terms. The approach adopted here provides an approximate value of the conventional prediction made at the later stages of the Software Development Life Cycle (SDLC). It has been observed that the fuzzy linguistic model reaches a comparable level of accuracy, precision and recall.
Fig. 5.1: Accuracy of estimation as the project progresses (Pfleeger, 2010).
Expressing software metrics in fuzzy linguistic terms is a workable solution because the management responsible for preparing resource plans is usually experienced enough to provide the values of software metrics in linguistic terms such as very low, low, medium, high.
This chapter presents a Fuzzy Inference System (FIS) that uses inputs represented in fuzzy linguistic terms. The fuzzy linguistic inputs have been used to generate an FIS that predicts the defect proneness of various modules of object-oriented software. We validate the model using the kc1 class-level data and the jEdit bug data (Menzies et al., 2012). First, fuzzy c-means clustering (Bezdek, 1981) is applied to the input data to determine: 1) the membership functions for each input and 2) the number of rules to be generated. A Sugeno-type FIS (Kosko, 1997) is generated afterwards. The FIS-based model proposed in this study has been compared with classification tree (CT) (Mitchell, 1997), linear regression (LR) (Bishop, 2006) and neural network (NN) (Haykin, 1994) based prediction models, so that the approximate prediction can be compared with the exact prediction to see the extent of the performance shortfall. Up to 10% shortfall in Accuracy and 0.93% shortfall in Recall have been observed using the proposed technique. The rest of the chapter describes the model and presents the results obtained for public datasets.
5.2 Proposed Model based on Imprecise Inputs
5.2.1 Dataset
The datasets used for this study are the class-level data for kc1 and the jEdit bug data, available at the PROMISE website (Menzies et al., 2012). The details about the instances and parameters of each dataset are given in Table 5.1. A cross (×) against a metric indicates that the metric from the respective dataset has been used. Each instance in a dataset represents a software module (or file). A classification variable is used as output to indicate whether the software module is defect-prone (D) or not defect-prone (ND). There are 95 attributes in the dataset kc1-class-level-data. Except for the class attribute, the remaining 94 parameters are software metrics calculated for one module and are divided into two groups. Group A has 10 parameters and Group B has 84 parameters. Values of the parameters in Group A are originally measured at module level, whereas the values of the parameters in Group B are originally measured at method level and were later transformed to module level before the dataset was made available at the PROMISE website (Menzies et al., 2012). No parameter from Group B (transformed to module level) has been selected for this study. From Group A, the 8 most commonly used parameters identified in (Menzies et al., 2003) have been selected to predict defect proneness.
5.2.2 FIS Based Model
Our FIS-based model is generated through a two-phase process. The first phase performs fuzzy c-means clustering (Bezdek, 1981) to identify the membership functions for each input, and the second phase then generates a fuzzy inference system that models the behavior of the data. Two types of fuzzy reasoning methods, the Mamdani and Sugeno reasoning methods (Kosko, 1997), can be applied in FIS implementations. Sugeno uses constant or linear output functions, whereas Mamdani uses fuzzy membership functions at the output, resulting in higher computational costs (Kosko, 1997). In the present study the Sugeno method has been applied as the reasoning process because it is computationally efficient, and the performance of this method can be further enhanced by applying other optimization and adaptive techniques.
Tab. 5.1: Metrics Used for this Study

Metric                              Abbreviation   kc1-classlevel   jEdit data
Coupling Between Object classes     CBO            ×                ×
Depth of Inheritance Tree           DIT            ×                ×
Lack of Cohesion in Methods         LCOM           ×                ×
Number Of Children                  NOC            ×                ×
Dependence on a descendant          DOC            ×
Count of calls by higher modules    FAN_IN         ×
Response For a Class                RFC            ×                ×
Weighted Methods per Class          WMC            ×                ×
Number of Public Methods            NPM                             ×
Lines of Code                       LOC                             ×
Number of Instances                                145              274
Phase 1: Performing Fuzzy C-Means Clustering
In order to generate the fuzzy rules, it is required to determine the total number of rules to be developed and the number of antecedent membership functions. Fuzzy c-means clustering (FC) (Bezdek, 1981) has been used to determine these two parameters. The total number of clusters given by FC determines the number of rules. The FC algorithm outputs n clusters C1, ..., Cn and the membership value of each antecedent in each of the n clusters.
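As an illustration of phase 1, the following is a minimal from-scratch NumPy sketch of fuzzy c-means on synthetic one-dimensional data. It is not the Matlab implementation used in this work; the data, the cluster count and the fuzzifier m = 2 are assumptions made for the example.

```python
import numpy as np

def fuzzy_c_means(x, c, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means on 1-D data: returns cluster centers and the
    membership value of every point in each of the c clusters."""
    rng = np.random.default_rng(seed)
    u = rng.random((c, x.size))
    u /= u.sum(axis=0)                         # memberships sum to 1 per point
    for _ in range(iters):
        um = u ** m                            # fuzzified memberships
        centers = (um @ x) / um.sum(axis=1)    # membership-weighted means
        d = np.abs(x[None, :] - centers[:, None]) + 1e-9  # point-center distances
        u = d ** (-2.0 / (m - 1.0))            # standard FCM membership update
        u /= u.sum(axis=0)
    return centers, u

# Two well-separated synthetic groups should yield centers near 0.1 and 10.0.
x = np.array([0.0, 0.1, 0.2, 9.8, 10.0, 10.2])
centers, u = fuzzy_c_means(x, c=2)
```

The returned memberships are exactly what phase 2 needs: their spread per cluster suggests the membership functions, and the number of clusters fixes the number of rules.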
Phase 2: Generating FIS
A Sugeno-type FIS has been generated using the Matlab fuzzy toolbox, in which the consequent y of a rule is a crisp number computed as follows:

y = Σ_{i=1}^{k} (αi xi + βi)    (5.1)

where k is the total number of antecedents (parameters predicting defect proneness), and αi and βi are coefficients which can be different for each parameter xi. The generated FIS has n rules and the jth rule takes the form:
Rulej : IF input1 in Ci AND · · · AND inputk in Ci THEN output in Ci

where k is the total number of input parameters and i = 1, ..., n. For a binary class problem, two rule sets, RuleSetD and RuleSetND, are generated for classification of the D and ND modules respectively. When a certain rule, say Rulep, is fired, the following RuleD and RuleND classify the given module into class D and ND respectively:

RuleD : IF Rulep ∈ RuleSetD THEN D
RuleND : IF Rulep ∈ RuleSetND THEN ND

where
RuleSetD = {Ruleq | Ruleq classifies the given module as D},
RuleSetND = {Ruler | Ruler classifies the given module as ND}
and
RuleSetD ∩ RuleSetND = ∅
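The inference described above can be sketched as follows. The Gaussian antecedent parameters, consequent coefficients and inputs are invented for illustration, and rule outputs are combined by the common weighted-average Sugeno aggregation, which is an assumption about the configuration rather than a statement of the exact Matlab setup.

```python
import math

def gauss(x, mean, sigma):
    """Gaussian membership degree of x in a fuzzy set."""
    return math.exp(-((x - mean) ** 2) / (2.0 * sigma ** 2))

def sugeno_predict(inputs, rules):
    """Weighted-average Sugeno inference.

    rules: list of (antecedents, consequent) pairs, where antecedents is a
    list of (mean, sigma) Gaussian parameters, one per input, and the
    consequent is (alphas, betas) giving y = sum_i alpha_i*x_i + beta_i
    as in Eq. 5.1."""
    num = den = 0.0
    for antecedents, (alphas, betas) in rules:
        # Firing strength: the product models the AND over all antecedents.
        w = 1.0
        for x, (mu, sig) in zip(inputs, antecedents):
            w *= gauss(x, mu, sig)
        # Crisp rule output per Eq. 5.1.
        y_rule = sum(a * x + b for x, a, b in zip(inputs, alphas, betas))
        num += w * y_rule
        den += w
    return num / den

# Hypothetical one-input system with two rules.
rules = [
    ([(0.0, 1.0)], ([1.0], [0.0])),  # IF x near 0 THEN y = 1*x
    ([(5.0, 1.0)], ([2.0], [0.0])),  # IF x near 5 THEN y = 2*x
]
y = sugeno_predict((1.0,), rules)    # first rule dominates, so y is near 1.0
```

A binary decision in the style of RuleD/RuleND above would then label the module D if the rule with the largest firing strength belongs to RuleSetD, and ND otherwise.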
Tab. 5.2: Evaluation Parameters Used for Comparison
Evaluation Parameter Abbreviation
True Negative Rate TNRate
True Positive Rate (Recall) TPRate (Recall)
False Positive Rate FPRate
False Negative Rate FNRate
Accuracy Acc
Precision Prec
Misclassification Rate MCRate
F-Measure F
5.2.3 Evaluation
The suggested FIS model has been compared with prediction models such as the classification tree (CT), linear regression (LR) and neural network (NN) based models. The objective is not only to identify the best model for each evaluation parameter but also to highlight the extent of the performance shortfall for each evaluation parameter if the fuzzy prediction model is used instead of the exact prediction models.

The comparison mentioned above has been done in terms of the evaluation parameters listed in Table 5.2. TNRate, TPRate (Recall), FPRate and FNRate are obtained while computing a confusion matrix (Jiawei and Micheline, 2002). Acc, Prec and MCRate are important and widely used model performance measures when a confusion matrix is available. Acc is not considered a good performance measure for unbalanced datasets (Kubat et al., 1998); therefore, to deal with the unbalanced data, the F-measure (Rijsbergen, 1979) is used. It is worthwhile to mention that performance in terms of Recall (i.e. true positive rate) is significant, as it helps in focusing on the problematic areas of the software, and thus better resource planning can be done. Since our goal is to detect the defect-prone modules, we consider Recall as our primary measure of comparison. In addition to these parameters we have determined the position of each model on the relative
operating characteristic (ROC) graph (Egan, 1975) by plotting a ROC point for each model. This visualization is helpful in identifying a better model in terms of Recall. To see the effect of using the fuzzy model, a percentage performance shortfall is calculated for each evaluation parameter. An overall performance shortfall (the maximum shortfall over all the parameters) is also recorded. This percentage is helpful in weighing the benefits of the fuzzy-based prediction against its drawbacks compared with the exact predictions.
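These evaluation parameters and the shortfall percentage follow directly from confusion-matrix counts. In the sketch below the counts are an assumption chosen so that the derived values match the kc1 FIS row of Table 5.3 under a presumed test split of 36 ND and 12 D modules; the thesis does not state the raw counts.

```python
def measures(tp, fn, tn, fp):
    """Evaluation parameters of Table 5.2 from confusion-matrix counts."""
    recall = tp / (tp + fn)                      # TPRate (Recall)
    tnrate = tn / (tn + fp)
    fprate = fp / (fp + tn)
    fnrate = fn / (fn + tp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    prec = tp / (tp + fp)
    mcrate = 1 - acc                             # misclassification rate
    f = 2 * prec * recall / (prec + recall)      # F-measure
    return {"Recall": recall, "TNRate": tnrate, "FPRate": fprate,
            "FNRate": fnrate, "Acc": acc, "Prec": prec,
            "MCRate": mcrate, "F": f}

def shortfall_pct(exact, fuzzy):
    """Percentage shortfall of the fuzzy model against an exact model."""
    return 100.0 * (exact - fuzzy) / exact

# Hypothetical counts: of 12 D and 36 ND test modules, 11 D and 24 ND correct.
fis = measures(tp=11, fn=1, tn=24, fp=12)
```

With these assumed counts the sketch reproduces Recall 0.917, Acc 0.729, Prec 0.478 and F 0.629, matching the kc1 FIS row of Table 5.3.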
5.3 Results
To develop the FIS, the membership functions for each input were assigned based on the frequency distribution of the data. The relationship between the distribution of the input metrics of the kc1-class-level data and the membership functions (or clusters) for each input software metric is shown in Figure 5.2 and Figure 5.3. There are more membership functions in Figure 5.3 for the denser areas of the corresponding input metric in Figure 5.2. A similar relationship between the metrics of the jEdit bug data and their membership functions can be seen in Figures 5.4 and 5.5. For both datasets, we have kept the shape of all the membership functions of the antecedents Gaussian and have used the information obtained from phase 1 to determine the rules that model the data behavior. Linear least squares estimation has been used to find the consequent of each rule. In total, 21 and 7 conjunctive rules were generated for the kc1-class-level and jEdit datasets respectively. The decision whether a module is D or ND is made using RuleD and RuleND respectively. For kc1-class-level, 21 clusters were obtained, with clusters 1 to 11 belonging to the ND class and clusters 12 to 21 belonging to the D class. For jEdit, 7 clusters were obtained, where instances lying in clusters 1 to 4 belong to the ND class and instances in clusters 5, 6 and 7 belong to the D class.
Three models, LR, NN and CT, have been developed for both datasets. The standard R-squared LR model was obtained using the training data. The NN model is a single-hidden-layer feed-forward back-propagation network having one neuron in the hidden layer. The NN model selected for comparison was obtained after exhaustive experimental runs. It was seen during the experiments that more complicated NN models did not perform better than the single-layer
116
−5 0 5 10 15 20 250
5
10
15
20
25
30
35
input 1
frequency
(a) CBO
0 1 2 3 4 5 6 7 80
10
20
30
40
50
60
70
input 2
frequency
(b) DIT
−0.2 0 0.2 0.4 0.6 0.8 1 1.20
50
100
150
input 5
frequency
(c) DOC
−0.5 0 0.5 1 1.5 2 2.5 3 3.50
10
20
30
40
50
60
70
input 6
frequency
(d) FAN IN
Fig. 5.2: Frequency distribution of all input metrics for kc1-classlevel data (Continued on Next Page)
(e) LCOM  (f) NOC  (g) RFC  (h) WMC
[Histograms of each input metric: frequency against metric value.]
Fig. 5.2: (Continued from Previous Page) Frequency distribution of all input metrics for kc1-classlevel data
Fig. 5.3: Output of phase 1: clusters and membership functions for each input of kc1-classlevel. (The plots of the membership functions appear in the same order as the distributions of the inputs in Figure 5.2.)
(a) CBO (b) DIT
(c) NPM (d) LOC
(e) LCOM (f) NOC
Fig. 5.4: Frequency distribution of all input metrics for jEdit bug data (Continued on Next Page)
(g) RFC (h) WMC
Fig. 5.4: (Continued from Previous Page) Frequency distribution of all input metrics for jEdit bug data
single-neuron network for the datasets under study. To train the NN on kc1-class-level, 100 epochs were used to obtain reasonable results. A larger number of epochs did not produce better results with the initial bias of 1 and the initial unit weights at each hidden layer. The NN for the jEdit data took 78 epochs for training with the same initial bias and weights. CTs with 14 and 16 intermediate nodes were developed for the kc1-class-level and jEdit data respectively.
5.4 Analysis and Discussion
Despite being the simplest of the learners, CT has performed better in the training phase. It has the best values for all the evaluation parameters (in the case of kc1), including Recall. The FIS-based model has the third best Recall during training, as shown in Figure 5.6a, whereas it has the best Recall but a higher FPRate in the testing phase, as shown in Figure 5.6b. In the case of the jEdit data, the FIS-based model performs better than the others, as shown in Figures 5.6c and 5.6d. The testing-phase performance reported in Table 5.3 indicates that the LR-based model dominates the test phase (of kc1) in terms of Acc, Prec and MCRate. The NN-based model turns out to be the best classifier for the ND modules. The FIS-based model has the second best F value. It is worthwhile to mention that the results reported in Table 5.3 are for unbalanced test data, which is dominated by ND classes. Hence the best Recall and the second best F value of the FIS-based model are reasonable for a model developed using fuzzy inputs. These values can be improved if the membership functions are further fine-tuned.

The effect of using fuzzy inputs has been measured in terms of performance shortfall. For the kc1 data, a maximum performance shortfall of 24.99% has been observed in the case of TNRate. For the
(a) CBO (b) DIT
(c) NPM (d) LOC
(e) LCOM (f) NOC
(g) RFC (h) WMC
Fig. 5.5: Output of phase 1: clusters and membership functions for each input of jEdit. (The plots of the membership functions appear in the same order as the distributions of the inputs in Figure 5.4.)
(a) kc1-class-level Training  (b) kc1-class-level Testing  (c) jEdit data Training  (d) jEdit data Testing
[Each panel plots a ROC point, FP Rate against TP Rate (Recall), for the FIS, CT, LR and NN models.]
Fig. 5.6: ROC point for each model in training and testing
Tab. 5.3: Testing Performance Measures
Dataset Model TNRate TPRate FPRate FNRate Acc. Prec. MCRate F
(Recall)
kc1- FIS 0.667 0.917 0.333 0.083 0.729 0.478 0.271 0.629
classlevel CT 0.694 0.833 0.306 0.167 0.729 0.476 0.271 0.606
LR 0.861 0.667 0.139 0.333 0.813 0.615 0.188 0.640
NN 0.889 0.500 0.111 0.500 0.792 0.600 0.208 0.546
jEdit FIS 1 0.973 0 0.027 0.979 1 0.021 0.984
Bug CT 0.7 0.877 0.3 0.123 0.839 0.914 0.161 0.895
Data LR 0.983 0.973 0.017 0.027 0.975 0.994 0.025 0.982
NN 0.95 0.982 0.05 0.018 0.975 0.986 0.025 0.984
rest of the evaluation parameters, the performance shortfalls have been 10.25%, 22.27%, 44.42% and 1.81% for Acc, Prec, MCRate and F respectively. At the same time, the FIS model provides a 10% gain in Recall when compared with the second best Recall. For the jEdit data the FIS model performed better in terms of all evaluation parameters except Recall and FNRate. The percentage shortfall in Recall is as low as 0.93%. FNRate seems to have increased by 50%, but its low value of 0.0273 indicates that the increase is not alarming. For the rest of the parameters, the FIS model has increased performance. The observed performance shortfalls are reasonably low given the approximate nature of the inputs used for the FIS prediction model. Therefore the FIS-based prediction can be used to identify possible defect-prone modules at an early stage of the SDLC. To get the exact prediction, the conventional prediction models can be used later in the SDLC when precise values are available.
The FIS model has LOC as an input metric for the jEdit data. LOC is not a design metric and depends on the programming language (Pressman, 2010). Function Points (FP) are an alternative to LOC as an estimate of software size; FP overcomes LOC's limitation of being language dependent. Although the datasets do not provide an FP value for each software module, studies have suggested a relationship between the two (Pressman, 2010). This relationship can be used to translate the LOC-based values in the datasets into FP-based values if required.
A critique of the proposed approach could be that it requires magic to get the correct linguistic labels for the software metrics early in the lifecycle. This critique is valid. However, we expect that the managers making estimates have the appropriate experience to provide good linguistic labels for the input parameters. FP-based estimations face a similar critique, that calculating FP requires sleight of hand (Pressman, 2010). However, this does not stop practitioners from making FP-based estimates.
5.5 Summary
The conventional quality prediction models require exact values of their input parameters. This chapter introduces the novel concept of obtaining an approximate prediction of defect proneness when exact input values are not available, especially during early SDLC phases. These early predictions are needed for better resource planning, cost reduction and test planning. The chapter suggests the use of fuzzy inputs to overcome the need for exact and precise input metric values. The prediction model introduced in this chapter is based on a fuzzy inference system and requires only imprecise estimates of the input software metrics. This imprecision is introduced by defining inputs as fuzzy linguistic variables. The linguistic labels can be obtained from a domain expert. The chapter shows that predictions made through vague inputs are reasonably close to predictions obtained through the application of conventional models based on exact and precise inputs. In light of the analysis conducted in this chapter, one can predict defect proneness at an early stage without requiring exact measurements of the metrics.

This study needs to be extended with validation on more datasets. Another study can be conducted to provide a systematic mechanism to help software managers suggest linguistic labels for input metrics. We also plan to investigate whether rule extraction can be improved to establish a causal relationship between software metrics and the defect proneness of software modules.
6. RESOLVING ISSUES IN SOFTWARE DEFECT DATASETS
Measurements of different characteristics of software, such as size, complexity and the relationships between its components, are collected in the form of software metrics (Baker et al., 1990). Software metrics have been effectively used in various Software Quality Prediction (SQP) studies (Ganesan et al., 2000, Shen et al., 1985, Thwin and Quah, 2002). Mostly two kinds of metrics have been used to build SQP models: Software Product Metrics (SPdMs) and Software Process Metrics (SPcMs). Product metrics correspond to measurements of attributes of the software itself, for example the number of errors in the software and the number of lines of code. Process metrics correspond to measurements of attributes of the processes carried out during the lifecycle of the software, such as the effectiveness of development and the performance of testing. This chapter unifies the names of software product metrics. Further, the chapter also demonstrates that software metrics have been misused in software defect prediction studies.
A number of SQP models based on product metrics have been reported in the literature for different stages of the software development lifecycle (SDLC) and different software development paradigms (SDPs) (Bouktif et al., 2006, Gokhale and Lyu, 1997, Khosgoftaar and Munson, 1990, Munson and Khosgoftaar, 1992, Quah and Thwin, 2003, Wang et al., 2007). Selection of a prediction model is generally based on a number of parameters, such as the software metrics available, the phase of the software development lifecycle, the software development paradigm, the domain of the software (Jones, 2008), the quality attribute to be predicted, product-based and value-based views of the model (Rana et al., 2008, Wagner and Deissenboeck, 2007), and so on. Selecting a model on the basis of so many parameters poses a problem for organizations and results in subjective selection of a prediction model. To reduce this subjectivity, a generic approach is needed which can help in objectively selecting a model. Though some attempts have been made to develop such approaches to predict software quality (Bouktif et al., 2006, Rana et al., 2008), there are limitations due to different interpretations, nomenclature and representations of model input parameters. For example, numerous product metrics, like program Volume (V) and total Lines of Code (LOC), have been used with different names by different researchers, which has generated inconsistency (Dick and Kandel, 2003, Jensen and Vairavan, 1985, Khosgoftaar and Munson, 1990, Koru and Liu, 2005a, Li et al., 2006, Ottenstein, 1979). The generic approach presented in (Rana et al., 2008) requires such inconsistencies to be removed. Figure 6.1 shows the different components of the generic approach: Input Selection, Model Selection, Model Development and Reporting. The Input Selection activity identifies the relevant dataset from a given repository. The Model Selection activity uses the relevant dataset to compare the performance of the models and select an appropriate defect prediction model for the relevant dataset. The Model Development activity selects important metrics in the dataset and develops a prediction model using the selected metrics. The Reporting activity presents the prediction results to the user.

This chapter identifies two types of inconsistencies in the naming of software product metrics and presents a resolution framework. The suggested framework resolves these inconsistencies on the basis of the definition of the metric and the chronological usage of the metric name. A Unified Metrics Database (UMD) is also introduced as part of the Input Selection activity of Figure 6.1. Further details on the role of the UMD can be found in (Rana et al., 2011).
This chapter also studies the role of Software Science Metrics (SSM) (Halstead, 1977) in defect prediction. The chapter shows with the help of experiments that the use of SSM does not significantly
[Diagram: the Generic Approach for Quality Prediction, with components Input Selection (Similar Dataset Selection over a Datasets Repository (R) and the UMD), Model Selection (using models' performance data), Model Development and Reporting, connected by flows such as software measures used, quality objectives, measures' labels, relevant dataset, appropriate model, prediction results and report.]
Fig. 6.1: Role of UMD in the Generic Approach for Software Quality Prediction
contribute to:
1. classifying OO software modules as defect-prone and not defect-prone (binary classification), or
2. predicting the number of defects in OO software (numeric classification).
6.1 Issues related to Software Defect Data
6.1.1 Inconsistencies in Software Product Metrics Nomenclature
Over the years various researchers have worked in software metrics and proposed various suites of
product metrics that directly or indirectly measure the difficulty or complexity in comprehending
and developing a software (Belady, 1980, Chidamber and Kemerer, 1994, Halstead, 1977, Henry
et al., 1981, McCabe, 1976). Later on different researchers have used these suites and sometimes
proposed similar metrics with different nomenclature (Bouktif et al., 2004, Gyimothy et al., 2005,
Khoshgoftaar et al., 1997b, Khoshgoftaar and Seliya, 2002, Olague et al., 2007, Schneider, 1981,
Shen et al., 1985, Zhou and Leung, 2006). The lack of coordination among the researchers and
slower dissemination of information in the past has resulted in many inconsistencies in taxonomy
of the Software Product Metrics (SPdM). By studying the literature, we have identified and tried
to resolve two types of inconsistencies in the product metrics nomenclature. These are:
• Type I Inconsistency: This type of inconsistency arises in the situations where same metric
has been used with more than one names (or labels) as shown in Figure 6.2a. An example
of Type I inconsistency is Jensen’s estimate of program length which has been referred as
NF (Jensen and Vairavan, 1985) and JE (Guo and Lyu, 2000). This can also be termed as
Different Labels for Same Metric (DLSM) inconsistency.
• Type II Inconsistency: This type of inconsistency arises when the same label has been used for
more than one metric as shown in Figure 6.2b. For example, B has been used as the Halstead
error estimate (Jiang et al., 2008c, Khosgoftaar and Munson, 1990, Ottenstein, 1979, 1981)
and B has also been used as the average level of nesting of the control graph of the program (Jensen
and Vairavan, 1985). This can also be called the Same Label for Different Metrics (SLDM)
inconsistency.

Fig. 6.2: SPdM Type I and Type II Inconsistencies. (a) Type I (DLSM): alternate labels for the same SPdM mapped to one preserved label; (b) Type II (SLDM): the same label used for different SPdMs.
These inconsistencies have been removed using the proposed framework and the resultant UMD
has been put in place. This UMD is part of the generic approach suggested in (Rana et al., 2008).
Before presenting the framework to resolve these inconsistencies, we present some examples of
Type I and Type II inconsistencies in the most frequently used metrics.
Tab. 6.1: Examples of metrics with Type I inconsistency

Metric Definition | Labels Used | Used by
Total lines of code (including comments) | LOC | (Brun and Ernst, 2004), (Dick and Kandel, 2003), (Gyimothy et al., 2005), (Khosgoftaar et al., 1994), (Khosgoftaar and Munson, 1990), (Khoshgoftaar and Allen, 1999c), (Khoshgoftaar and Allen, 1999b), (Khoshgoftaar and Seliya, 2003), (Munson and Khosgoftaar, 1992), (Pizzi et al., 2002), (Xing et al., 2005)
| SLOC | (Briand et al., 1993)
| TC | (Gokhale and Lyu, 1997), (Guo and Lyu, 2000)
| Lines of source code | (Dick and Kandel, 2003)
| Lines of code | (Li et al., 2006)
| LOC TOTAL | (Jiang et al., 2008c)
McCabe’s Cyclomatic complexity | V(G) | (Khosgoftaar and Munson, 1990), (Khoshgoftaar and Allen, 1999c), (Munson and Khosgoftaar, 1992), (Xing et al., 2005)
| MC | (Jensen and Vairavan, 1985)
| VG | (Briand et al., 1993), (Khoshgoftaar et al., 1996), (Khoshgoftaar and Seliya, 2002)
| VG1 | (Dick and Kandel, 2003), (Khosgoftaar et al., 1994)
| McC1 | (Ohlsson and Alberg, 1996)
| M | (Gokhale and Lyu, 1997), (Guo and Lyu, 2000)
| CMT | (Dick and Kandel, 2003)
| Strict Cyclomatic Complexity | (Li et al., 2006)
| Complexity | (Nagappan et al., 2006)
| CYCLOMATIC COMPLEXITY, v(G) | (Jiang et al., 2008c)
Total executable statements (all lines of code excluding comments, declarations and blanks) | EX | (Ottenstein, 1979)
| EXE | (Khosgoftaar and Munson, 1990)
| ELOC | (Khosgoftaar et al., 1994), (Khoshgoftaar and Allen, 1999c)
| STMEXE | (Khoshgoftaar and Allen, 1999b), (Khoshgoftaar and Seliya, 2003)
| CL | (Guo and Lyu, 2000)
| Executable lines | (Dick and Kandel, 2003)
| Size1 | (Quah and Thwin, 2003)
| Lines | (Nagappan et al., 2006)
| LOC EXECUTABLE | (Jiang et al., 2008c)
Metrics with Type I Inconsistency
Type I inconsistency appears frequently in the literature. From the variety of software metrics collected
in the UMD, we show inconsistencies in the three most frequently used metrics in Table 6.1.

Lines of Code (LOC) is primarily used to measure product size, but it has also been shown to have
a strong relationship with the number of errors (Khosgoftaar and Munson, 1990). That is
why it has been widely used in studies related to error prediction, as shown in Table 6.1. Because
the definition of cyclomatic complexity (McCabe, 1976) is based on the premise that the complexity of
software depends on the decision structure of a program, it is considered to be highly correlated with
problems in the program, which has led researchers to use this metric for error prediction
(Khosgoftaar and Munson, 1990, Munson and Khosgoftaar, 1992). The number of executable lines
of code (EX) has been used as a predictor of the number of errors because of its strong correlation
with cyclomatic complexity, Halstead’s program volume and the number of errors (Khosgoftaar and
Munson, 1990, Ottenstein, 1979).

There are a number of other metrics that have Type I inconsistency. In order to match two
datasets, at least their metric labels should be consistent. If a single label is not assigned to each of
the metrics, finding the similarity between datasets becomes difficult, which in turn makes the
development of the generic approach difficult.
Metrics with Type II Inconsistency
Type II inconsistency appears when two different studies assign the same label to different product
metrics, as depicted in Figure 6.2b. A few examples of Type II inconsistency are given in Table
6.2. B has been used with three different meanings: bandwidth of a program (Jensen and Vairavan,
1985), number of branches in the program (Khosgoftaar and Munson, 1990) and Halstead’s error
estimate (Ottenstein, 1979). Similarly, CL has been used to represent total code lines by some
studies and total executable statements by others, but CL was associated with the definition
total code lines (Munson and Khosgoftaar, 1992) prior to its association with the definition total
executable statements by Guo et al. (Guo and Lyu, 2000). Another example of Type II
inconsistency is the label D, which represents Halstead’s difficulty (Shen et al., 1985) as well as the number
of decisions (Khosgoftaar and Munson, 1990). FANOut also has more than one definition:
the number of calls of a subroutine (Li et al., 2006), and the number of objects in the calls made by a
function (Ohlsson and Alberg, 1996).

Removal of this type of inconsistency is pivotal, because it can cause misinterpretation of
different metrics and may lead to invalid results while finding the data similarity mentioned above.
The framework presented in this chapter suggests rules to remove this inconsistency as well.
6.1.2 Ineffective use of Software Science Metrics
Software Science Metrics (SSM), proposed by Halstead (Halstead, 1977), are based on the number
of operators and operands and their usage, and were designed for the procedural paradigm. These metrics are indicators of
software size and complexity (for example, program length N and effort E measure size and complexity
respectively). Earlier studies have found a correlation of software size and complexity
with the number of defects (Ottenstein, 1979, Khosgoftaar and Munson, 1990) and have used size and
complexity metrics as predictors of defects. Studies have used SSM for defect prediction and for classification
of defect prone software modules as well (Ottenstein, 1979, Jensen and Vairavan, 1985,
Munson and Khosgoftaar, 1992, Briand et al., 1993, Khosgoftaar et al., 1994, Gokhale and Lyu,
1997, Khoshgoftaar and Allen, 1999c, Khoshgoftaar and Seliya, 2003, 2004, Xing et al., 2005,
Koru and Liu, 2005a, Li et al., 2006, Seliya and Khoshgoftaar, 2007). Fenton et al. (Fenton and
Neil, 1999) have criticized the use of SSM and other size and complexity metrics in defect prediction
models because 1) the relationship between complexity and defects is not entirely causal and 2)
defects are not a function of size, yet the majority of prediction models rest on these two assumptions
(Fenton and Neil, 1999). Despite the critique, various studies have used SSM to study software developed
in the procedural paradigm (Khoshgoftaar and Allen, 1999c, Khoshgoftaar and Seliya, 2003,
Xing et al., 2005) as well as the object oriented paradigm (Koru and Liu, 2005a, Challagulla et al.,
2005, Seliya and Khoshgoftaar, 2007).

Tab. 6.2: Examples of metrics with Type II inconsistency

Label | Definition Used | Used by
B | Bandwidth of a program | (Jensen and Vairavan, 1985)
B | Count of branches | (Khosgoftaar and Munson, 1990)
B | The Halstead error estimate | (Ottenstein, 1979, 1981)
CL | Total code lines | (Gokhale and Lyu, 1997, Munson and Khosgoftaar, 1992)
CL | Total executable statements | (Guo and Lyu, 2000)
D | Halstead’s difficulty | (Jiang et al., 2008c, Li et al., 2006, Shen et al., 1985)
D | Number of decisions | (Khosgoftaar and Munson, 1990)
FANOut | Number of calls of a subroutine | (Li et al., 2006, Nagappan et al., 2006)
FANOut | Number of objects in the calls made by a certain function | (Ohlsson and Alberg, 1996)
Fig. 6.3: SPdM Unification and Categorization Framework. (a) High Level Design: software product measures with Type I and Type II inconsistencies pass through unification rules to become unified measures, which categorization rules then group using the parameters SDP, SDLC phase and frequency. (b) Dimensions of Categorization: SDP (conventional, object oriented), SDLC phase (design, implementation, testing, deployment, maintenance) and frequency (frequently used, occasionally used).

With the shift of paradigm from procedural to object oriented (OO), metrics such as unique
operands η2, total operand occurrences N2, program vocabulary n and program volume V do not
remain effective indicators of the complexity of the software. This is because of the nature of the OO
paradigm, where software consists of many classes and each class has its own operands (attributes).
Having 10 to 15 classes in the software, each with 5 to 10 attributes, might not make the software as
complex as indicated by these operator and operand based metrics. The complexity of OO
software depends instead on the interaction between the objects of the classes and the complexity of the methods
of the classes. So using SSM as predictors of defects in OO software might not be a wise decision.

The results presented here include the application of different classification models on the dataset
kc1 with class level data (Menzies et al., 2012). An analysis of the impact of removing SSM from
the set of independent variables of the classification models is also presented. The experimental
results show that removing SSM from the set of independent variables does not significantly affect
the binary and numeric classification of OO software modules. As compared to the case when
all the collected metrics are used for both classifications, the number of incorrectly classified
instances and the mean absolute error improved in the absence of SSM for binary and numeric
classification respectively.
6.2 Proposed Approaches to Handle the Issues related to Software Defect Data
6.2.1 Metric Unification and Categorization (UnC) Framework
In the previous section it has been highlighted that Type I and Type II inconsistencies exist in
software product metrics nomenclature. In this section we present a Unification and Categorization
(UnC) framework to remove the said inconsistencies as shown in Figure 6.3a. After the unification,
the product metrics are categorized based on the following three dimensions as shown in Figure
6.3b:
• Usage of a software metric in software development paradigms (SDP)
• Availability of a software metric in different phases of software lifecycle (SDLC)
• Frequency of usage of a software metric in studies related to SQP
We first remove the inconsistencies in metrics nomenclature and then categorize the metrics
with respect to software development paradigm and software lifecycle phase. Afterwards, on
the basis of frequency of usage, the metrics are categorized into two groups, frequently used and
occasionally used. The unified metric is then stored in the Unified Metrics Database (UMD). In
the rest of the section, the unification rules are presented, followed by the categorization method for the
unified metrics.
Unification
Once all the labels used for a product metric are collected from the Inconsistent Metrics Database
(IMD), shown in Figure 6.4a, Rule 1 and Rule 2 are applied serially to the SPdM.
Rule 1: Rule 1, shown in Figure 6.4b, resolves the Type I inconsistency by associating the most
frequently used label with the metric under consideration. The frequency of each label is computed,
and there is no clear winner label that can be associated with the software product metric if:

|f_l1 − f_l2| < δ    (6.1)

where f_l1 and f_l2 are the frequencies of the two most frequently used labels and δ, a positive integer, is
a threshold value provided by the user. In such a case the label used earliest for the product metric
is considered the preserved label.

Fig. 6.4: SPdM Unification and Categorization Framework: Detailed Design. (a) An SPdM through the framework: all labels used for the SPdM are picked from the IMD, Type I and Type II inconsistencies are resolved by Rule 1 and Rule 2 respectively, the frequency of usage f is calculated, and the metric is categorized with respect to paradigm (conventional, OO), SDLC phase and frequency (f > α) before being added to the UMD. (b) Rule 1: find the most frequently used (MFU) label; if a clear winner is found using δ, associate it with the SPdM, otherwise associate the earliest used label. (c) Rule 2: for each conflicting measure find the earliest use of the conflicting label; if the label was used earliest for this SPdM, remove the SPdM from the IMD and remove the label from the alternates of the conflicting SPdMs; otherwise associate the MFU label.
Rule 2: Rule 2, shown in Figure 6.4c, is intended to resolve Type II inconsistency in metrics nomenclature.
If the earliest use of the label preserved by Rule 1 was for the software metric under consideration,
then Rule 2 does not modify the decision taken by Rule 1. In addition, Rule 2 ensures
that the conflicting label is removed from the alternate labels of all the conflicting metrics in the IMD,
so that this label cannot be preserved for any other metric in future. Otherwise, if the earliest use
of the label is for another metric, the decision by Rule 1 is altered and the most frequently used
label is preserved. The metric is then removed from the IMD so that it does not conflict again
with the rest of the product metrics when their unification takes place.
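The two rules can be sketched as follows; this is a minimal illustration under the assumption that each study's use of a metric is recorded as a (label, year) pair, and the function names are ours, not from the thesis implementation:

```python
from collections import Counter

def rule1(usages, delta=2):
    """Rule 1: resolve Type I (DLSM) inconsistency for one metric.
    `usages` is a list of (label, year) pairs, one per study that used the
    metric. Returns the most frequently used label, falling back to the
    earliest used label when |f_l1 - f_l2| < delta (Eq. 6.1)."""
    ranked = Counter(label for label, _ in usages).most_common()
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= delta:
        return ranked[0][0]                      # clear winner
    return min(usages, key=lambda u: u[1])[0]    # earliest used label

def rule2(metric, first_use):
    """Rule 2: resolve a Type II (SLDM) conflict on a shared label.
    `first_use` maps each conflicting metric to the year of its earliest
    use of the label; the metric with the earliest use keeps the label."""
    return min(first_use, key=first_use.get) == metric

# Jensen's estimate of program length: NF (1985) vs JE (2000) - no clear winner
print(rule1([("NF", 1985), ("JE", 2000)]))  # NF (earliest used)
# B: Halstead error estimate (1979) predates bandwidth of the program (1985)
print(rule2("halstead_error_estimate",
            {"halstead_error_estimate": 1979, "bandwidth": 1985}))  # True
```

The second call mirrors the B example worked through in Section 6.3.1: the Halstead error estimate keeps the label B, and B is removed from the alternates of the bandwidth of the program.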
Categorization
The categorization of Software Product Metrics (SPdMs) has been done along the three dimensions shown
in Figure 6.3b. While categorizing the SPdMs with respect to each dimension, we only
consider the use of the metric in that dimension rather than the effective use of the metric
for that dimension. Effective use, or statistical importance, of a metric may differ across
scenarios and datasets, and based on the existing literature it is very difficult to state the
statistical importance of software metrics objectively. The different stages of the unification and categorization
process are shown in Figure 6.4a. Each diamond in the figure represents the start of a
categorization stage. These stages are discussed below.
Software Development Paradigm (SDP)
Some metrics are available in the conventional as well as the object oriented paradigm. Use of a software
metric in more than one paradigm shows the capability of the metric to capture characteristics
of different kinds of software. This capability does not guarantee that the use of the metric will
always be effective. Purao et al. (Purao and Vaishnavi, 2003) have termed the use of a conventional
metric in object oriented systems a misuse of the conventional metric and a researcher’s
bias. For example, η has been used to study the characteristics of software developed using conventional
software engineering (Ottenstein, 1979) as well as the characteristics of software developed
in the OO paradigm (Koru and Liu, 2005a), but using it as one of the indicators of software quality
in the OO paradigm does not give improved results (Koru and Liu, 2005a). Metrics specific to the OO
paradigm, such as relationships between objects (modules) and coupling between object classes, are
better indicators of software characteristics in this case. Similarly there are certain metrics, like
B (Belady, 1980), which have been used only in studies related to software developed in the conventional
paradigm.
Software Development Lifecycle Phase
Availability of a software metric in different phases of the software lifecycle indicates the measurement
of a software characteristic in that phase. For example, Weighted Methods per Class (WMC)
(Chidamber and Kemerer, 1994) can be available in the design phase and is an indicator of the design
of the class. It can further indicate the increments made to the software, i.e. whether new and complex
functionality has been added to the software in the next iteration or not. Availability of a metric in a
particular phase further helps software quality prediction by making it clear that predictions based
on this metric can be made in that particular phase of the lifecycle. Therefore, we have categorized
the product metrics based on the possibility of their availability in software lifecycle phases. The
possible categories for the product metrics are Design+, Impl+, Test+, Dep+ and Mntc, which
represent the design, implementation, testing, deployment and maintenance phases respectively. A
‘+’ sign indicates that the metric is available in the corresponding and the following phases.
Frequency of Usage
The metrics cited in this chapter have been divided into two broader categories based on their
frequency of usage f: frequently used and occasionally used. We define frequently used (FU) metrics as
the metrics which have been used by more than α studies on SQP, i.e. f > α, where α is a positive
integer threshold provided by the user. Otherwise the metric is called an occasionally used (OU)
metric.
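Putting the three dimensions together, the category record of a unified metric could be sketched like this; the helper and field names are illustrative assumptions, not the thesis tooling:

```python
def categorize(paradigms, earliest_phase, f, alpha=2):
    """Categorize a unified SPdM along the three dimensions of Figure 6.3b:
    the paradigms it has been used in (SDP), the earliest SDLC phase where
    it becomes available ('+' meaning it remains available in later phases),
    and its frequency of usage f against the user-provided threshold alpha."""
    phase = earliest_phase if earliest_phase == "Mntc" else earliest_phase + "+"
    return {
        "SDP": sorted(paradigms),
        "SDLC": phase,
        "Frequency": "FU" if f > alpha else "OU",
    }

# The Halstead error estimate B: both paradigms, available from implementation, f = 4
print(categorize({"OO", "Conventional"}, "Impl", 4))
```

With α = 2 this reproduces the categorization of B worked through in Section 6.3.1: an OO and conventional paradigm, Impl+, frequently used metric.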
6.2.2 Proposed Approach to Show Ineffective use of SSM
The role of SSM in defect prediction for OO software has been studied using the dataset kc1 (Menzies
et al., 2012), which consists of class level data from a NASA project (Facility). The dataset has 145
instances and each instance has 94 attributes, which are metrics collected for that software instance.
These attributes include object oriented metrics (Chidamber and Kemerer, 1994), metrics derived
from cyclomatic complexity such as sumCYCLOMATIC_COMPLEXITY, and metrics derived from SSM such as
minNUM_OPERANDS and avgNUM_OPERANDS. A few other size metrics like LOC are also part of the 94 attributes. In total, 48
metrics were derived from SSM; we applied the models listed in Table 6.3 first using all of the
94 attributes as input to the models and then applied the same models to the 46 metrics which are
not derived from SSM.

The data is available in two structurally different formats. One format allows binary classification
and the other allows numeric classification. We performed binary classification (BC) of
modules, i.e. defect prone or not defect prone, as well as numeric classification (NC), i.e. prediction of the number
of defects in the modules, using the various classification models available in WEKA (Witten et al.,
2008) and listed in Table 6.3.

Tab. 6.3: List of classification models used from WEKA (Witten et al., 2008).

BC | | NC |
Model Name | Abbr. | Model Name | Abbr.
Bayesian | Bay | Additive Regression | AR
Decision Table | DTb | Decision Tree | DTr
Instance Based | IB | Linear Regression | LR
Logistic | Log | Support Vector Reg. | SVR

The classification is done using:
1. all the metrics present in the dataset.
2. all the metrics except the SSM based metrics.
Because of the structural nature of the data, we applied different models for BC and NC and
recorded different performance measures. Similarly, the impact of removing SSM from the set
of inputs is studied using different effectiveness measures for the two kinds of classification. We
first discuss the measures related to BC and then the measures related to NC. Accuracy is used as
the model performance measure for BC. Accuracy (Acc) is based on the number of Correctly Classified
Instances (CCI) and the number of Incorrectly Classified Instances (ICI), and is defined as follows:

Acc = CCI / (CCI + ICI)    (6.2)
Effectiveness Eff_i is defined to study the impact of removing SSM from the set of inputs to the
ith binary classification model. Eff_i is given by the following equation:

Eff_i = Acc_{i,All} − Acc_{i,NotSSM}    (6.3)

where Acc_{i,All} is the accuracy of model i using all metrics and Acc_{i,NotSSM} is the accuracy of
model i using all metrics except SSM. Use of SSM is considered effective by model i if Eff_i is
above a threshold α = 0.01, i.e. if the accuracy of model i decreases by more than 0.01 when the SSM are removed from the
set of inputs to model i. If Eff_i is negative, the accuracy of model i has
improved on removing SSM from the set of inputs. In order to measure the overall effectiveness of
SSM, Eff_avg is used, which is the average of all the Eff_i values. Use of SSM will be considered effective
only if Eff_avg is a positive number greater than λ = 0.005. On the other hand, SSM cannot
be considered ineffective if Eff_avg does not fall below −λ.
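Under the assumption that the per-model accuracies are collected in parallel lists, Eq. 6.3 and the Eff_avg verdict can be sketched as follows (the function and variable names are ours, not from the thesis implementation; the example accuracies are those reported for the four BC models in Table 6.4):

```python
def ssm_effectiveness(acc_all, acc_not_ssm, lam=0.005):
    """Per-model Eff_i = Acc_{i,All} - Acc_{i,NotSSM} (Eq. 6.3) and the
    overall verdict based on Eff_avg: 'effective' if Eff_avg > lam,
    'ineffective' if Eff_avg < -lam, otherwise 'inconclusive'."""
    effs = [a - b for a, b in zip(acc_all, acc_not_ssm)]
    eff_avg = sum(effs) / len(effs)
    if eff_avg > lam:
        verdict = "effective"
    elif eff_avg < -lam:
        verdict = "ineffective"
    else:
        verdict = "inconclusive"
    return effs, verdict

# Accuracies of the four BC models (Bay, Log, DTb, IB) with and without SSM
effs, verdict = ssm_effectiveness([0.751, 0.683, 0.703, 0.703],
                                  [0.744, 0.717, 0.717, 0.744])
print(verdict)  # ineffective
```

With these values Eff_avg is about −0.02, well below −λ, which matches the chapter's conclusion that SSM are not effective predictors for this dataset.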
The performance measures recorded for the NC models are: Mean Absolute Error (MAE), Root Mean
Square Error (RMSE), Relative Absolute Error (RAE) and Root Relative Square Error (RRSE),
defined by Equations 6.4, 6.5, 6.6 and 6.7 respectively:

MAE = (1/n) Σ_{i=1}^{n} |P_i − A_i|    (6.4)

where n is the total number of instances, P_i is the predicted number of errors in the ith instance, and A_i is the
observed number of errors in the ith instance.

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (P_i − µ)^2 )    (6.5)

where µ is the mean of the actual numbers of errors.

RAE = ( Σ_{i=1}^{n} |P_i − A_i| ) / ( Σ_{i=1}^{n} |A_i − µ| )    (6.6)

RRSE = sqrt( ( Σ_{i=1}^{n} (P_i − A_i)^2 ) / ( Σ_{i=1}^{n} (A_i − µ)^2 ) )    (6.7)
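A direct transcription of Eqs. 6.4 to 6.7; note that, as defined in this chapter, RMSE measures deviations of the predictions from µ rather than from A_i. The function name is ours:

```python
import math

def nc_error_measures(pred, actual):
    """Compute MAE, RMSE, RAE and RRSE as defined in Eqs. 6.4-6.7,
    where mu is the mean of the actual values."""
    n = len(pred)
    mu = sum(actual) / n
    mae = sum(abs(p - a) for p, a in zip(pred, actual)) / n
    rmse = math.sqrt(sum((p - mu) ** 2 for p in pred) / n)
    rae = (sum(abs(p - a) for p, a in zip(pred, actual))
           / sum(abs(a - mu) for a in actual))
    rrse = math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual))
                     / sum((a - mu) ** 2 for a in actual))
    return mae, rmse, rae, rrse

# Tiny illustration: predicted vs observed defect counts for three modules
mae, rmse, rae, rrse = nc_error_measures([2, 0, 5], [1, 1, 4])
print(round(mae, 2))  # 1.0
```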
To study the impact of removing SSM from the set of inputs to the numeric classification model i,
we have defined a measure Err_i based on the MAE of model i as follows:

Err_i = MAE_{i,NotSSM} − MAE_{i,All}    (6.8)

where MAE_{i,NotSSM} is the MAE of model i using all metrics except SSM and MAE_{i,All} is the
MAE of model i using all metrics. Err_i should be greater than δ = 0.1 in order to consider SSM as
an effective predictor of the number of defects using model i. In order to check the overall effectiveness
of SSM in case of numeric classification, the average error Err_avg is defined. SSM are considered
effective if the average of all Err_i is a positive quantity greater than ε = 0.05. SSM cannot be
considered ineffective if Err_avg does not fall below −ε.

Fig. 6.5: Unification and Categorization of B. (a) B passing through the framework: the alternate labels (B, HALSTEAD_ERROR_EST, B) are picked from the IMD, Rule 1 and Rule 2 are applied, the frequency of usage f = 4 is calculated, and B is categorized as an OO and conventional paradigm, Impl+, frequently used metric before being added to the UMD with its definition. (b) Rule 1: with δ = 2 no clear winner is found among the alternate labels, so the earliest used label B is associated with the SPdM. (c) Rule 2: the earliest use of the conflicting label is found for each conflicting measure; the Halstead error estimate (1979) predates the bandwidth of the program (1985), so B is removed from the IMD and from the alternates of the bandwidth of the program.
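Analogously to binary classification, Eq. 6.8 and the Err_avg verdict can be sketched as follows (the names are ours; the example MAE values are those reported for the four NC models in Table 6.5):

```python
def ssm_error_impact(mae_not_ssm, mae_all, eps=0.05):
    """Per-model Err_i = MAE_{i,NotSSM} - MAE_{i,All} (Eq. 6.8) and the
    overall verdict based on Err_avg: 'effective' if Err_avg > eps,
    'ineffective' if Err_avg < -eps, otherwise 'inconclusive'."""
    errs = [n - a for n, a in zip(mae_not_ssm, mae_all)]
    err_avg = sum(errs) / len(errs)
    if err_avg > eps:
        verdict = "effective"
    elif err_avg < -eps:
        verdict = "ineffective"
    else:
        verdict = "inconclusive"
    return errs, verdict

# MAE of the four NC models (AR, DTr, LR, SVR), without and with SSM
errs, verdict = ssm_error_impact([4.37, 4.89, 5.60, 4.66],
                                 [4.58, 4.92, 6.59, 4.39])
print(verdict)  # ineffective
```

With these values Err_avg is −0.24, below −ε, so SSM also fail the effectiveness criterion for numeric classification.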
6.3 Results
6.3.1 Application of the UnC Framework
The suggested framework has been applied to 140 software product metrics. This section shows
the unification and categorization process carried out for two conflicting product metrics. The same
process is used for the unification and categorization of the rest of the product metrics, and a database
of unified metrics has been developed.
The Halstead error estimate B (Halstead, 1977) and the bandwidth of the program BW (Belady, 1980)
are two conflicting product metrics. The suggested framework is applied to B as shown in Figure
6.5a. Multiple labels can be noticed for this metric, indicating a Type I inconsistency. Hence Rule 1,
as shown in Figure 6.5b, is applied here. In order to apply Rule 1, the value of δ is set equal to 2, and
the frequency f_l of each label is calculated and compared. No clear winner
is identified using expression 6.1, therefore the earliest used label B is preserved. Preserving
the label B for the Halstead error estimate generates a Type II inconsistency with another product
metric, i.e. the bandwidth of the program, since both metrics are labeled B. Therefore Rule 2 is
now applied to remove the Type II inconsistency related to B. Figure 6.5c shows the application
of Rule 2, where it can be noticed that B was used for the error estimate in 1979 by
Ottenstein (Ottenstein, 1979) prior to its use as the bandwidth of the program in 1985 by Jensen et al.
(Jensen and Vairavan, 1985). Hence the label B is preserved for the Halstead error estimate. As a
consequence, the label B is removed from the candidate alternate labels of the bandwidth of the
program. The unification process for the Halstead error estimate B is now complete.

The next step after unification is to categorize the metric B. A metric is categorized with
respect to three dimensions: software development paradigm, software lifecycle phase and frequency
of use. In the context of development paradigm, this metric has been used in the conventional
paradigm (Ottenstein, 1979) as well as in the OO paradigm (Jiang et al., 2008c), and is therefore categorized
as a metric for both the OO and conventional paradigms. As far as the software lifecycle phase
is concerned, B is not available in the design phase as it is based on code metrics. Therefore, B is
categorized as an Impl+ metric as per its usage. Further, B is a frequently used metric because
its frequency f is greater than α, which is set equal to 2. It is, therefore, categorized as a
frequently used product metric. The unification and categorization process for the Halstead error
estimate ends when the metric is saved in the UMD with B as its preserved label.

The bandwidth of the program BW is now left with other candidate labels such as Band
and NDAV. Since BW is the most frequently used label for this metric, BW is now
Type I and Type II consistent. The categorization process groups this metric as a conventional
paradigm, Impl+ and frequently used metric.
Tab. 6.4: Results of binary classification with and without SSM (Halstead, 1977).

Model | Input Metrics | CCI | ICI | Acc
Bay | All | 109 | 36 | 0.751
Bay | Without SSM | 108 | 37 | 0.744
Log | All | 99 | 46 | 0.683
Log | Without SSM | 104 | 41 | 0.717
DTb | All | 102 | 43 | 0.703
DTb | Without SSM | 104 | 41 | 0.717
IB | All | 102 | 43 | 0.703
IB | Without SSM | 108 | 37 | 0.744
Each metric passes through this UnC process when it is added to the UMD using the interface
discussed in (Rana et al., 2011).
6.3.2 Ineffective use of SSM
Table 6.4 shows the results of binary classification of software modules. Using SSM along with the other
available metrics to classify defect prone modules does not help in the case of any model except the
Bayesian classifier. Rather, dropping SSM as predictors improves the Correctly Classified Instances
(CCI) and model accuracy for the dataset under study. Correspondingly, the Incorrectly Classified
Instances (ICI) decreased for all these models on dropping SSM from the input set, which is
a better performance as compared to the case when classification was done using all metrics including
SSM. When all SSM were removed from the input of the Bayesian classifier, which has the
highest accuracy among all four models, the number of ICI increased by 1 and the accuracy of the model
decreased by 0.007. Instance based learning with 1 nearest neighbor (IB) showed the
highest gain in accuracy, about 0.04, when SSM were not part of the input to the classifier.
Results of the numeric classification of modules are presented in Table 6.5, where the MAE of all
the models decreased in the absence of SSM from the classifier inputs, except in the case of
support vector regression. SVR had the lowest MAE among all the NC models in the presence of
SSM, and the increase in MAE for SVR is an interesting observation. Linear regression (LR) showed
a significant decrease of 0.99 in MAE in the absence of SSM from the set of input metrics. The other
three performance measures for NC showed the same pattern as MAE, i.e. for all the models
except SVR, the values of RMSE, RAE and RRSE decreased in the absence of SSM.

Tab. 6.5: Results of numeric classification with and without SSM (Halstead, 1977).

Model | Input | MAE | RMSE | RAE | RRSE
AR | All | 4.58 | 10.32 | 75.89% | 94.59%
AR | Without SSM | 4.37 | 9.78 | 72.44% | 89.61%
DTr | All | 4.92 | 11.23 | 81.51% | 102.84%
DTr | Without SSM | 4.89 | 10.77 | 81.05% | 98.71%
LR | All | 6.59 | 11.17 | 109.08% | 102.35%
LR | Without SSM | 5.60 | 9.42 | 92.77% | 86.25%
SVR | All | 4.39 | 7.42 | 72.64% | 67.96%
SVR | Without SSM | 4.66 | 8.96 | 77.16% | 82.13%
6.4 Analysis and Discussion
This section presents an analysis of the application of the Unification and Categorization (UnC)
framework to 140 Software Product Metrics (SPdMs). An analysis of the performance of the prediction
models in the absence of SSM is also presented here.
6.4.1 UnC Framework Resolves Nomenclature Issues
In order to present the results, we first report the categorization of SPdMs in two classes: frequently
used (FU) and occasionally used (OU). The percentages of FU and OU software product metrics
are 34.29% and 65.71% respectively. Frequency of usage of a metric does not necessarily indicate
its importance; it only shows how many times a certain metric has been used.

Fig. 6.6: SPdM Categorization (Type I and Type II inconsistencies, frequency of usage, and the SDP and SDLC-phase distributions of the SPdMs, with percentages for the FU, OU and overall categories)
The results of the application of the UnC framework are presented under the Frequently Used and
Occasionally Used categorization. The combined results for Frequently and Occasionally Used
metrics are reported as overall metrics. The unification process revealed that 100% of the frequently
used metrics have been Type I inconsistent whereas 16.67% have been Type II inconsistent,
as shown in Figure 6.6. Among the occasionally used metrics, 13.04% have been Type I inconsistent, and there
was no Type II inconsistency. Overall, 42.85% of the product
metrics under study have been Type I inconsistent and 5.71% have been Type II inconsistent. To
remove these inconsistencies, 75% of the frequently used metrics have been assigned their earliest
used labels whereas 25% of the metrics preserve their frequently used label. In the occasionally
used category, the earliest used label has been preserved for 100% of the product metrics. Overall,
91.43% of the product metrics have been assigned the earliest used label and 8.57% have been given the
frequently used label. Table 6.6 presents these statistics.

Tab. 6.6: Percentage of Preserved Labels.

 | Earliest Used Label | Frequently Used Label
FU Metrics | 75% | 25%
OU Metrics | 100% | 0%
Overall | 91.42% | 8.57%

Tab. 6.7: Distribution of Software Product Metrics in Software Development Paradigm with Overlap.

 | OO with Overlap | Conv. with Overlap
FU Metrics | 83.33% | 85.41%
OU Metrics | 60.87% | 60.87%
Overall | 68.57% | 69.28%
During the categorization of the product metrics with respect to development paradigm, some metrics have been found to be used in both the conventional and OO paradigms. This overlap has been 68.75% and 21.74% for the frequently used and occasionally used metrics respectively, as shown in Figure 6.6. The overlap in the FU metrics has been considerably high, which indicates that most of the FU metrics are considered important for quality assessment in both paradigms. Figure 6.6 reports the overlap for both categories of the product metrics combined to be 37.86%.
In addition to the use of some product metrics in both paradigms, other product metrics have been used exclusively for the conventional (e.g., BW, S) or the OO paradigm (e.g., DIT, RFC). The distribution of these metrics in both paradigms is presented in Figure 6.6. Among the frequently used metrics, 16.67% have been used exclusively for the conventional paradigm and 14.58% exclusively for the OO paradigm, whereas the percentage of exclusively conventional and exclusively OO metrics among the occasionally used metrics has been 39.13% each. Figure 6.6 also shows that overall 31.43% and 30.71% of the metrics have been categorized as exclusively conventional and exclusively OO paradigm metrics respectively.
Tab. 6.8: Categorization with respect to Software Lifecycle Phase.
Design Implementation Test Deployment Maintenance
FU Metrics 54.17% 45.83% 0% 0% 0%
OU Metrics 52.17% 29.35% 6.52% 1.08% 10.86%
The overall percentage of the product metrics in both paradigms can be calculated using the above statistics. In the FU metrics category, 85.41% have been used in the conventional paradigm and 83.33% in the OO paradigm, as presented in Table 6.7. In the occasionally used category, 60.87% have been used in each of the conventional and OO paradigms. Overall, 69.28% of the metrics have been used in the conventional paradigm and 68.57% in the OO paradigm.
There are two main reasons for a relatively larger percentage of conventional paradigm metrics
in studies related to software quality prediction (SQP):
• the area of SQP has been of prime importance since the embryonic days of software engi-
neering
• data for the software developed conventionally is publicly available (Menzies et al., 2012,
Facility).
A significant percentage of OO metrics in SQP-related studies indicates that the importance of this area has not diminished in the contemporary era, where software is usually developed using the OO paradigm. However, very limited data (Menzies et al., 2012, Facility) is available for OO software. The percentage of OO metrics in SQP studies has improved significantly since the availability of public data (Menzies et al., 2012, Facility) and is expected to grow with the increase in the amount of public OO data.
The majority of the product metrics used for SQP are design and implementation phase metrics, as seen in Table 6.8 and Figure 6.6. The percentage of design metrics among the frequently used metrics is 54.17%, whereas 45.83% of the frequently used metrics are available only after the code has been written. Among the occasionally used product metrics, 52.17% can be derived after the completion of the
design and 29.35% of the metrics need the code of the program. The percentage of metrics from the other phases is very low: 6.52% of the occasionally used metrics are testing metrics, 1.08% are deployment metrics, and 10.86% are maintenance metrics. Overall, 52.85% of the product metrics are design metrics and 35% are implementation phase metrics. The percentages of testing and deployment phase metrics are 4.28% and 0.71% respectively, and the percentage of maintenance phase metrics is 7.14%.
The high density of design and code metrics in the literature as predictors of software quality highlights that early quality prediction is crucial, and studies try to predict quality as early as the design phase. Efforts to predict software quality using requirements metrics have been made, but without promising results (Jiang et al., 2007). A significant percentage of design and code metrics and a very low percentage of metrics from the later phases of the software lifecycle indicate that the prediction information is desired no later than the start of the testing phase. Among the later phase metrics, a relatively strong percentage of maintenance metrics has been noticed. Maintenance metrics are usually helpful in predicting the quality of later iterations of the same software.
The software product metrics which are available in the design phase can be used for early prediction of quality even when historic data is not available. For example, Yang et al. (2007) used DIT (Chidamber and Kemerer, 1994) early in the lifecycle to predict the reliability and efficiency of software. The majority of the implementation phase metrics, including the software science metrics (Halstead, 1977), have usually been used for both paradigms.
The unification and categorization (UnC) of the product metrics will help the development of SQP models based on product metrics, by aiding in identifying and deciding which metrics to prefer. The UnC can further be helpful in future studies on software metrics: such studies can focus on whichever group of product metrics they are interested in (for example, design metrics only, or OO metrics only).
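As a sketch of how such a focused view over the searchable UMD might look, the snippet below filters a toy metric table along the three categorization dimensions. The records, field names, and the query helper are all illustrative assumptions; the actual UMD schema is not shown in this chapter.

```python
# Toy records mimicking the UMD's three categorization dimensions:
# development paradigm, lifecycle phase, and frequency of use.
metrics = [
    {"label": "DIT", "paradigm": {"OO"},         "phase": "design",         "usage": "FU"},
    {"label": "RFC", "paradigm": {"OO"},         "phase": "design",         "usage": "FU"},
    {"label": "LOC", "paradigm": {"OO", "conv"}, "phase": "implementation", "usage": "FU"},
    {"label": "S",   "paradigm": {"conv"},       "phase": "implementation", "usage": "OU"},
]

def query(db, paradigm=None, phase=None, usage=None):
    """Return labels of metrics matching the given dimensions."""
    return [m["label"] for m in db
            if (paradigm is None or paradigm in m["paradigm"])
            and (phase is None or m["phase"] == phase)
            and (usage is None or m["usage"] == usage)]

# A study interested only in OO design metrics:
print(query(metrics, paradigm="OO", phase="design"))  # ['DIT', 'RFC']
```

A study on early prediction could thus restrict itself to design-phase metrics, while a paradigm comparison could slice the same database along the paradigm dimension.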
6.4.2 Use of SSM Deteriorates Performance
As mentioned earlier, the accuracies of the majority of the BC models have improved in the absence of SSM, and the performance measures of the majority of the NC models have improved as well.
This section discusses the extent of improvement in the performance measures of the BC and NC models. First the effectiveness of SSM reported by each model is discussed on the basis of Effi and Erri, and then the overall effectiveness of SSM is presented on the basis of Effavg and Erravg, which combine the results of the BC and NC models.
The effectiveness of SSM reported by each model and the average values of the effectiveness measures are shown in Table 6.9. The first two columns of the table show that no model has reported a significant decrease in its accuracy on dropping SSM, i.e. no Effi is greater than α. Unlike the other three BC models, the Effi of the Bayesian classifier is a positive number, but since it does not exceed α, we cannot take it as an indication of the effectiveness of SSM. Effavg is also less than λ; hence we cannot claim that SSM have been effective in classifying the software modules of the dataset under study as defect prone or not defect prone. In fact, Effavg is negative and smaller than −λ, which prompts us to believe that SSM have not only been ineffective for this dataset but have negatively affected the classification. Moreover, the decrease in the number of Incorrectly Classified Instances (ICI) and the increase in the number of Correctly Classified Instances (CCI) on dropping SSM further indicate that SSM have a negative effect on the classification of modules in kc1. In the case of NC models, the Erri reported by SVR is greater than δ, which means that SVR has reported SSM as effective for this dataset. SVR differs from the rest of the NC models used in this study: all the models minimize the empirical classification error, but SVR at the same time also maximizes the geometric margin between the classes. Dropping SSM has reduced the empirical error for all the models, but SSM have been helpful for SVR in maximizing the margin between the classes. The values reported by the rest of the NC models are less than δ. Erravg is a negative value below −ε, indicating that using SSM to predict the number of defects in this OO software data is not a wise decision.
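The per-model values in Table 6.9 can be combined as described above. The sketch below reproduces the averages from the table; the threshold values are placeholders chosen only to be consistent with the reported conclusions, not the thesis's actual settings.

```python
from statistics import mean

# Per-model effectiveness values as reported in Table 6.9.
eff = {"Bay": 0.007, "DTb": -0.14, "IB": -0.41, "Log": -0.34}  # BC models (Effi)
err = {"AR": -0.21, "DTr": -0.03, "LR": -0.99, "SVR": 0.27}    # NC models (Erri)

# Placeholder thresholds; the thesis's actual alpha and delta may differ.
ALPHA = DELTA = 0.1

effective_bc = [m for m, v in eff.items() if v > ALPHA]  # models reporting SSM effective
effective_nc = [m for m, v in err.items() if v > DELTA]

print(effective_bc)                  # [] -> no BC model reports SSM as effective
print(effective_nc)                  # ['SVR'] -> only SVR does
print(round(mean(eff.values()), 3))  # -0.221, the Effavg of Table 6.9
print(round(mean(err.values()), 3))  # -0.24, the Erravg of Table 6.9
```

Averaging the per-model values reproduces the Effavg and Erravg rows of the table, and the simple threshold test singles out SVR as the only model reporting SSM effective.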
The dataset used to study the behavior of the classification models in the absence of SSM comprises 145 instances. Though this number of instances is enough to conduct an initial investigation, the results presented here cannot be generalized. More software instances are needed to establish that SSM are ineffective defect predictors in the case of OO software.
Tab. 6.9: Effectiveness of SSM reported by all models
BC Model Effi NC Model Erri
Bay 0.007 AR -0.21
DTb -0.14 DTr -0.03
IB -0.41 LR -0.99
Log -0.34 SVR 0.27
Effavg -0.221 Erravg -0.24
6.5 Summary
One of the limitations to developing a generic software quality prediction approach is the inconsistency found in naming software product metrics. In some cases the same metric has been given different labels, whereas in other cases the same label has been used for different metrics. In this thesis we have identified these two anomalies as Type I and Type II inconsistencies respectively. In order to remove these inconsistencies, a Unification and Categorization (UnC) framework has been suggested. A set of criteria and rules has been devised for unification and categorization. For unification, frequency of use and usage history have been used as criteria. For categorization, three dimensions have been considered, namely frequency of use, software development lifecycle phase, and software development paradigm. The proposed framework has been applied to 140 base and derived metrics to develop a searchable Unified Metrics Database (UMD). Out of these metrics, 42.85% were Type I inconsistent and 5.71% were Type II inconsistent. The percentages of frequently used and occasionally used metrics have been recorded as 34.28% and 65.72% respectively. The metrics pertaining to the object oriented paradigm only and the conventional paradigm only were found to be 30.71% and 31.43% respectively. Most of the metrics have been design and implementation phase metrics, with a combined percentage of 87.86%.
Furthermore, the study of the role of software science metrics (SSM) in defect prediction of object oriented (OO) software reveals that SSM are not effective predictors of defects. The models
used here are first applied using all the metrics available in the dataset and then after removing SSM from the input, and the accuracies and error values of all the models are observed. The effectiveness of SSM is measured at the model level by comparing the accuracies and mean absolute errors of models with and without SSM. The overall effectiveness of SSM is measured by taking averages of the reported error values of all models. Out of the four models used for binary classification, no model has reported SSM as effective metrics to classify OO software modules as defect prone. In the case of NC models, support vector regression has reported the effectiveness of SSM in predicting the number of defects, whereas the other three models have reported a negative role of SSM. The averages of the reported errors of all the models show that the use of SSM for classifying OO software modules and predicting the number of defects does not help in this case, and the prediction error can even be improved if SSM are dropped from the input.
7. CONCLUSIONS AND FUTURE DIRECTIONS
This thesis applies intelligent computing techniques like association mining and fuzzy inference
system to improve prediction of defect-prone modules. The thesis also proposes a framework to
resolve issues in nomenclature of software metrics. This thesis also reports the ineffectiveness of
using Software Science Metrics (SSM) to build defect prediction models.
This thesis starts by handling a known limitation of defect prediction models: they do not achieve Recall as high as desired because the available software defect data is dominated by instances of Not-Defect-prone (ND) modules. This thesis proposes preprocessing the data before building the prediction model by dividing the input variables into equi-frequency bins and finding associations between the bins and the Defect-prone (D) modules. The bins highly associated with D modules are given more importance, and their association with ND modules is removed from the data. This preprocessing results in better prediction of D modules, albeit at the cost of True Negative Rate (TNRate).
This approach has been tested using the Naive Bayes classifier. The thesis analyses the performance gain and performance shortfall for Recall and TNRate respectively. Up to 40% improvement in Recall has been observed when the technique is applied on 5 public datasets with up to 20 bins for each variable. On the other hand, the maximum performance shortfall for any of the datasets has been 36%. A lower TNRate implies a higher False Positive Rate (FPRate). The thesis argues, on the basis of industry feedback, that the performance gain in Recall is more important because shipping a defective module is more critical than extra testing of a defect-free module.
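The preprocessing idea can be sketched on a toy dataset. The equifreq_bins and d_association helpers and the 0.5 association cutoff below are illustrative assumptions; the thesis's actual association mining procedure is not reproduced here.

```python
def equifreq_bins(values, k):
    """Assign each value to one of k equal-frequency bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

def d_association(bins, labels):
    """Fraction of Defect-prone (D) modules falling in each bin."""
    stats = {}
    for b, y in zip(bins, labels):
        n, d = stats.get(b, (0, 0))
        stats[b] = (n + 1, d + (y == "D"))
    return {b: d / n for b, (n, d) in stats.items()}

wmc    = [3, 5, 21, 40, 8, 35, 2, 50]                   # a code metric per module
labels = ["ND", "ND", "D", "D", "ND", "D", "ND", "ND"]  # defect proneness

bins  = equifreq_bins(wmc, 2)
assoc = d_association(bins, labels)
# Remove ND associations from bins highly associated with D modules:
kept = [i for i in range(len(wmc))
        if not (assoc[bins[i]] > 0.5 and labels[i] == "ND")]
print(assoc)  # {0: 0.0, 1: 0.75}
print(kept)   # module 7 (ND in the high-WMC bin) is dropped
```

Dropping the lone ND instance from the defect-heavy bin is what pushes Recall up while sacrificing some TNRate, as discussed above.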
Another limitation in the software defect prediction domain is that for accurate predictions, one has to wait for code metrics that are collected very late in the software lifecycle. This thesis suggests the use of imprecise values for the code and design metrics in the early phases of the lifecycle. The resultant fuzzy input based model gives comparable performance in terms of Recall when compared with models developed using precise values of code metrics. The thesis suggests that these results based on linguistic values should be used for early prediction, and better models with precise inputs can be used later in the lifecycle to improve the prediction. The gap in performance between models based on precise and imprecise inputs has been low.
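The early-lifecycle idea can be illustrated with a toy fuzzy input. The triangular membership functions and their breakpoints below are invented for illustration and are far simpler than the thesis's actual fuzzy inference system.

```python
def tri(x, a, b, c):
    """Triangular membership: rises from a, peaks at b, falls to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def size_memberships(loc_estimate):
    """Map an imprecise early LOC estimate to linguistic terms."""
    return {
        "Low":    tri(loc_estimate, -1, 0, 200),
        "Medium": tri(loc_estimate, 100, 250, 400),
        "High":   tri(loc_estimate, 300, 500, 501),
    }

# Early in the lifecycle only a rough size estimate exists:
print(size_memberships(150))  # partly Low, partly Medium
print(size_memberships(450))  # mostly High
```

A fuzzy rule base over such linguistic inputs can then yield a defect-proneness estimate before precise code metrics exist, to be replaced by a precise-input model once the code is written.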
Towards the end, this thesis identifies anomalies in the naming of software metrics. The two types of inconsistencies are Type I (the same metric has been given different labels) and Type II (the same label has been used for different metrics). The thesis also removes these inconsistencies in 140 software metrics through the proposed Unification and Categorization (UnC) framework. This framework is a set of criteria and rules that employs frequency of use and usage history as criteria for unification. The framework also uses three dimensions, namely frequency of use, software development lifecycle phase, and software development paradigm, for categorization. Out of the 140 metrics, 42.85% were Type I inconsistent and 5.71% were Type II inconsistent. The percentages of frequently used and occasionally used metrics have been recorded as 34.28% and 65.72% respectively. The metrics pertaining to the object oriented paradigm only and the conventional paradigm only were found to be 30.71% and 31.43% respectively. Most of the metrics have been design and implementation phase metrics, with a combined percentage of 87.86%. A list and categorization of all these metrics is given in Appendix B.
This thesis also reports that SSM are not effective predictors of defects. The thesis develops models first by using all the metrics available in the software defect datasets and then by removing SSM from the input. The accuracies and error values of all the models are observed. The effectiveness of SSM is measured at the model level by comparing the accuracies and mean absolute errors of models with and without SSM. The overall effectiveness of SSM is measured by taking averages of the reported error values of all models. Out of the four models used for binary classification, no model has reported SSM as effective measures to classify OO software modules as defect prone. In the case of NC models, support vector regression has reported the effectiveness of SSM in predicting the number of defects, whereas the other three models have reported a negative role of SSM. The averages of the reported errors of all the models show that the use of SSM for classifying OO software modules and predicting the number of defects does not help in this case, and the prediction error can even be improved if SSM are dropped from the input.
While improving defect prediction and handling the limitations mentioned in the thesis, a tool has been developed to support the experimentation activity. This Matlab-based tool has supported the rapid collection of results from the experiments. The architecture and working of the tool are given in Appendix A.
Given the numerous publicly available software defect datasets, the defect prediction models have been evaluated using public data. There are several directions for future research related to the present study:
• The association mining based preprocessing approach presented in Chapter 4 predicts defect prone modules. The proposed approach may also be applied to find types of defects. In that case the software characteristics highly associated with major defects can be further studied and used to design specific test cases. For example, if in a software module higher values of Weighted Methods per Class (WMC) are associated with a major defect, then special test cases may be designed to test the control structures of this software module. These test cases will help discover the errors before release, so that shipment of major defects is avoided.
• Agile development has been reported to be effective for small as well as medium scale software (approximately 1,000 function points). This development methodology has also been used for large software (in the range of 10,000 function points) in the last decade, although the success of agile development methodologies for large applications has been scarcely reported. However, a shift towards these development methodologies has been visible in the last few years, which indicates their success. The shift towards agile methodologies being recent, there is limited literature or data available related to maintenance costs, defect-prone modules, quality of the application, quality of the maintenance task, etc. The work presented in this thesis may be extended to work in an agile development environment where quality related information is not thoroughly collected. For example, some metrics may be categorized to see their effectiveness in agile methodology, and then their values may be estimated in linguistic terms using findings of cross project studies. These linguistic labels can then be fed to the proposed fuzzy inference system to get information regarding defect prone modules.
• In order for the proposed approaches to have good utility, they should be applied in the software industry and evaluated based on current software data. For that matter, the software industry data may be compared with the available public data and the similarity between the two calculated; this way cross project prediction of defects will be useful. However, finding similarity between software projects is a non-trivial task, so the data similarity problem should be resolved before adoption of the proposed approaches.
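One plausible way to operationalize this data similarity check is a two-sample Kolmogorov-Smirnov distance per metric; this choice and the toy samples below are assumptions for illustration, not a method from the thesis.

```python
def ks_distance(a, b):
    """Maximum gap between the empirical CDFs of two samples."""
    points = sorted(set(a) | set(b))
    cdf = lambda sample, x: sum(v <= x for v in sample) / len(sample)
    return max(abs(cdf(a, x) - cdf(b, x)) for x in points)

# Toy LOC distributions for a public dataset and an industry dataset:
public_loc   = [10, 25, 40, 60, 120, 300]
industry_loc = [12, 30, 45, 70, 110, 280]
print(round(ks_distance(public_loc, industry_loc), 3))  # 0.167: fairly similar
```

A small distance across the shared metrics would suggest the public data is a reasonable stand-in for cross project prediction; a large distance would warn against it.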
• The performance of defect prediction models is evaluated using the ROC curve. As discussed earlier, the public data is imbalanced, and ROC curves are considered to present less information than Precision-Recall (PR) curves in the case of skewed data. An analysis of the performance of existing models using PR curves can be performed to: a) show that the information presented using PR curves is more useful than the information presented using ROC curves, and b) compare the performance of the existing models using PR curves to see if the results differ from those reported in the literature.
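A small numeric illustration (with assumed toy counts) of this point: on skewed defect data a model can look good on the ROC axes yet have poor precision, which only the PR view exposes.

```python
# Assumed confusion counts for a skewed dataset: 50 defect-prone (D)
# modules vs 950 not-defect-prone (ND) modules.
tp, fn = 40, 10    # D modules caught / missed
fp, tn = 95, 855   # ND modules flagged / cleared

recall    = tp / (tp + fn)   # TPR, the ROC y-axis
fp_rate   = fp / (fp + tn)   # ROC x-axis: looks comfortably small
precision = tp / (tp + fp)   # PR y-axis: exposes the flood of false alarms

print(recall, fp_rate, round(precision, 3))  # 0.8 0.1 0.296
```

The ROC point (FPR 0.1, TPR 0.8) looks strong, yet fewer than a third of the flagged modules are actually defect prone, which is exactly the information a PR curve makes visible.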
BIBLIOGRAPHY
Seiya Abe, Osamu Mizuno, Tohru Kikuno, Nahomi Kikuchi, and Masayuki Hirayama. Estimation of project success using Bayesian classifier. In Proceedings of The 28th International Conference on Software Engineering, ICSE'06, 2006.
H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic
Control, 19(6):716–723, December 1974.
Saleh Alshomrani, Abdullah Bawakid, Seong-O Shim, Alberto Fernández, and Francisco Herrera. A proposal for evolutionary fuzzy systems using feature weighting: Dealing with overlapping in imbalanced datasets. Knowledge-Based Systems, 73(0):1 – 17, 2015.
S. Amasaki, Y. Takagi, O. Mizuno, and T. Kikuno. A Bayesian Belief Network for assessing
the likelihood of fault content. In Software Reliability Engineering, 2003. ISSRE 2003. 14th
International Symposium on, pages 215– 226, Nov. 2003.
Erik Arisholm, Lionel C. Briand, and Eivind B. Johannessen. A systematic and comprehensive
investigation of methods to build and evaluate fault prediction models. Journal of Systems and
Software, 83(1):2 – 17, 2010.
Albert L. Baker, James M. Bieman, Norman Fenton, David A. Gustafson, Austin Melton, and Robin Whitty. A philosophy for software measurement. Journal of Systems and Software, 12 (Issue 3):277–281, 1990.
Ma Baojun, Karel Dejaeger, Jan Vanthienen, and Bart Baesens. Software defect prediction based
on association rule classification. Technical report, Katholieke Universiteit Leuven, Feb 2011.
Victor R. Basili and Barry T. Perricone. Software errors and complexity: an empirical investiga-
tion. Communications of the ACM, 27:42–52, January 1984.
Steffen Becker, Lars Grunske, Raffaela Mirandola, and Sven Overhage. Performance prediction of
component-based systems - a survey from an engineering perspective. In Architecting Systems
with Trustworthy Components, volume 3938 of Lecture Notes in Computer Science. Springer,
2006.
Laszlo A. Belady. Software geometry. In Proceedings of The 1980 International Computer Symposium, 1980.
Robert M. Bell, Thomas J. Ostrand, and Elaine J. Weyuker. The limited impact of individual developer data on software defect prediction. Empirical Software Engineering, pages 2–13, September 2011. ISSN 1573-7616.
J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, USA, 1981.
P.S. Bishnu and V. Bhattacherjee. Software fault prediction using quad tree-based k-means clus-
tering algorithm. Knowledge and Data Engineering, IEEE Transactions on, 24(6):1146 –1150,
june 2012. ISSN 1041-4347.
Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and
Statistics). Springer, August 2006. ISBN 0387310738.
Salah Bouktif, Danielle Azar, Doina Precup, Houari Sahraoui, and Balazs Kegl. Improving rule
set based software quality prediction: A genetic algorithm-based approach. Journal of Object
Technology, 3(4):227–241, April 2004.
Salah Bouktif, Houari Sahraoui, and Giuliano Antoniol. Simulated annealing for improving soft-
ware quality prediction. In Proceedings of The GECCO’06. ACM, 8-12 July 2006.
Leo Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001.
Lionel C. Briand, Victor R. Basili, and William M. Thomas. A pattern recognition approach for
software engineering data analysis. IEEE Transactions on Software Engineering, Vol. 18(No.
11):931–942, November 1992.
Lionel C. Briand, Victor R. Basili, and Christopher J. Hetmanski. Developing interpretable models
with optimized set reduction for identifying high-risk software components. IEEE Transactions
on Software Engineering, Vol. 19(No. 11):1028–1044, November 1993.
Sarah Brocklehurst and Bev Littlewood. New ways to get accurate reliability measures. IEEE
Software, Vol. 9(Issue 4):34–42, July 1992.
Yuriy Brun and Michael D. Ernst. Finding latent code errors via machine learning over program
executions. In Proceedings of The 26th International Conference on Software Engineering,
ICSE’04, 2004.
David N. Card. Understanding causal systems. CrossTalk: The Journal of Defense Software
Engineering, 2004, pages 15–18, October 2004.
David N. Card. Myths and strategies of defect causal analysis. In Proceedings: The 24th Pa-
cific Northwest Software Quality Conference, Portland, Oregon, pages 469–474, October 10-11
2006.
Cagatay Catal and Banu Diri. A systematic review of software fault prediction studies. Expert Systems with Applications, 36(4):7346–7354, 2009.
Venkata U. B. Challagulla, Farokh B. Bastani, and Raymond A. Paul. Empirical assessment of machine learning based software defect prediction techniques. In Proceedings of 10th Workshop on Object-Oriented Real-Time Dependable Systems (WORDS'05), pages 263–270, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2347-1.
Ching-Pao Chang and Chih-Ping Chu. Improvement of causal analysis using multivariate statistical
process control. Software Quality Control, 16:377–409, September 2008. ISSN 0963-9314.
Ching-Pao Chang, Chih-Ping Chu, and Yu-Fang Yeh. Integrating in-process software defect pre-
diction with association mining to discover defect pattern. Information and Software Technology,
51:375–384, February 2009. ISSN 0950-5849.
Shyam R. Chidamber and C. F. Kemerer. A metrics suite for object oriented designs. IEEE
Transactions on Software Engineering, 20(No. 6):476–493, June 1994.
L. J. Chmura, A. F. Norcio, and T. J. Wicinski. Evaluating software design processes by analyzing
change data over time. IEEE Transactions on Software Engineering, 16(7):729–740, July 1990.
K.J. Cios, W. Pedrycz, R.W. Swiniarski, and L.A. Kurgan. Data Mining: A Knowledge Discovery
Approach. Springer, 2007. ISBN 9780387367958.
William W. Cohen. Learning trees and rules with set-valued features. In Proceedings of the
thirteenth national conference on Artificial intelligence (AAAI’96), pages 709–716, 1996.
Philip Crosby. Quality is Free. New York: McGraw-Hill, 1979. ISBN 0-07-014512-1.
James B. Dabney, Gary Barber, and Don Ohi. Predicting software defect function point ratios using a Bayesian belief network. In Proceedings of the PROMISE workshop, 2006.
Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 233–240. ACM Press, 2006.
Janez Demsar. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res.,
7:1–30, December 2006. ISSN 1532-4435.
Scott Dick and Abraham Kandel. Fuzzy clustering of software metrics. In Proceedings of The
12th IEEE International Conference on Fuzzy Systems, May 2003.
J. P. Egan. Signal Detection Theory and ROC Analysis. Series in Cognition and Perception, 1975.
Albert Endres. An analysis of errors and their causes in system programs. In Proceedings of The
International Conference on Reliable Software, pages 327–336. ACM Press, 1975.
NASA/WVU IV & V Facility. Metrics Data Program (MDP). Internet, http://mdp.ivv.nasa.gov/.
Norman Fenton, Paul Krause, Martin Neil, and Crossoak Lane. Software measurement: Uncer-
tainty and causal modelling. IEEE Software Magazine, 19(14):116 – 122, July/Aug. 2002.
Norman Fenton, Martin Neil, and David Marquez. Using Bayesian networks to predict software defects and reliability. 2007a.
Norman Fenton, Martin Neil, William Marsh, Peter Hearty, David Marquez, Paul Krause, and
Rajat Mishra. Predicting Software Defects in Varying Development Lifecycles using Bayesian
Nets. Information and Software Technology, 49:32–43, January 2007b. ISSN 0950-5849.
Norman Fenton, Martin Neil, William Marsh, Peter Hearty, Lukasz Radlinski, and Paul Krause.
Project data incorporating qualitative factors for improved software defect prediction. In
Proceedings of 3rd International Workshop on Predictor Models in Software Engineering
(PROMISE ’07), pages 69–, Washington, DC, USA, 2007c. IEEE Computer Society. ISBN
0-7695-2830-9.
Norman Fenton, Martin Neil, William Marsh, Peter Hearty, Lukasz Radlinski, and Paul Krause. On the effectiveness of early life cycle defect prediction with Bayesian Nets. Empirical Softw. Engg., 13:499–537, October 2008. ISSN 1382-3256.
Norman E. Fenton and Martin Neil. A critique of software defect prediction models. IEEE Trans-
actions on Software Engineering, Vol. 25(No. 5):675–687, September/October 1999.
Norman E. Fenton and Niclas Ohlsson. Quantitative analysis of faults and failures in a complex
software system. IEEE Transactions on Software Engineering, Vol. 26(No. 8), August 2000.
Norman E. Fenton and Shari Lawrence Pfleeger. Software Metrics: A Rigorous and Practical
Approach. PWS Publishing Co., 2nd edition, 1998.
International Organization for Standardization. ISO 9000:2005 quality management systems – fundamentals and vocabulary. Standard, 2005.
K. Ganesan, Taghi M. Khosgoftaar, and Edward B. Allen. Case-based software quality prediction.
International Journal of Software Engineering and Knowledge Engineering, Vol. 10(No. 2):
139–152, February 2000.
Vicente García, Ramón A. Mollineda, and J. Salvador Sánchez. A bias correction function for classification performance assessment in two-class imbalanced problems. Knowledge-Based Systems, 59(0):66 – 74, 2014.
Felix García, Manuel F. Bertoa, Coral Calero, Antonio Vallecillo, Francisco Ruiz, Mario Piattini, and Marcela Genero. Towards a consistent terminology for software measurement. Information and Software Technology, 48(8):631 – 644, 2006. ISSN 0950-5849.
Felix García, Francisco Ruiz, Coral Calero, Manuel F. Bertoa, Antonio Vallecillo, Beatriz Mora, and Mario Piattini. Effective use of ontologies in software measurement. The Knowledge Engineering Review, 24(1):23–40, 2009. ISSN 0269-8889.
Swapna S. Gokhale and Michael R. Lyu. Regression tree modeling for the prediction of software
quality. In Proceedings of The 3rd ISSAT Intl. Conference on Reliability, 1997.
Vincenzo Grassi. Architecture-based dependability prediction for service-oriented computing. In
ICSE, Workshop on Architecting Dependable Systems, WADS’04, Lecture Notes in Computer
Science. Springer, June 2004.
Vincenzo Grassi and Simone Patella. Reliability prediction for service-oriented computing envi-
ronments. IEEE Internet Computing, pages 43–49, May-June 2006.
Andrew R. Gray and Stephen G. MacDonell. A comparison of techniques for developing predictive
models of software metrics. Information and Software Technology, 39(1997):425 – 437, 1997.
D. Gray, D. Bowes, N. Davey, Yi Sun, and B. Christianson. The misuse of the NASA Metrics Data Program data sets for automated software defect prediction. In Evaluation Assessment in Software Engineering (EASE 2011), 15th Annual Conference on, pages 96–103, April 2011.
David Grosser, Houari A. Sahraoui, and Petko Valtchev. Analogy-based software quality predic-
tion. In 7th Workshop On Quantitative Approaches In Object-Oriented Software Engineering,
QAOOSE’03, June 2003.
Ping Guo and Michael R. Lyu. Software quality prediction using mixture models with em algo-
rithm. In Proceedings of The First Asia-Pacific Conference on Quality Software, 2000.
Tibor Gyimothy, Rudolf Ferenc, and Istvan Siket. Empirical validation of object-oriented metrics
on open source software for fault prediction. IEEE Transactions on Software Engineering, 31
(No. 10):897–910, October 2005.
Mark Hall and Eibe Frank. Combining naive bayes and decision tables. In Proceedings of the 21st
Florida Artificial Intelligence Society Conference (FLAIRS), pages 2–3. AAAI press, 2008.
Tracy Hall, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. A systematic re-
view of fault prediction performance in software engineering. IEEE Transactions on Software
Engineering, 99(PrePrints), 2011. ISSN 0098-5589.
Maurice H. Halstead. Elements of Software Science. Elsevier North-Holland, New York, 1977.
Simon Haykin. Neural Networks: A Comprehensive Foundation. Macmillan, New York, 1994.
Peng He, Bing Li, Xiao Liu, Jun Chen, and Yutao Ma. An empirical study on software defect
prediction with a simplified metric set. Information and Software Technology, 59:170–190,
March 2015.
Sallie Henry, Dennis Kafura, and Kathy Harris. On the relationships among three software met-
rics. In Proceedings of the 1981 ACM Workshop/Symposium on Measurement and Evaluation
of Software Quality, pages 81–88. ACM, 1981.
Jens Christian Huehn and Eyke Huellermeier. FURIA: An algorithm for unordered fuzzy rule induction. Data Mining and Knowledge Discovery, 2009.
IEEE. IEEE standard for a software quality metrics methodology. International Standard 1061-1998, 1998.
IEEE. IEEE standard adoption of ISO/IEC 15939:2007 - systems and software engineering - measurement process. International Standard 15939:2008, 2008.
British Standards Institution. ISO 9001:2008 quality management systems – requirements. Standard, 2008.
ISO/IEC. Software engineering – product quality – part 1: Quality model. International Standard
9126-1, June 15 2001.
Howard A. Jensen and K. Vairavan. An experimental study of software metrics for real-time
software. IEEE Transactions on Software Engineering, Vol. SE-11(No. 2):231–234, February
1985.
Yue Jiang, Bojan Cukic, and Tim Menzies. Fault prediction using early lifecycle data. In Pro-
ceedings of The 18th IEEE International Symposium on Software Reliability (ISSRE) 07, pages
237–246. IEEE Computer Society, 2007.
Yue Jiang, Bojan Cukic, and Yan Ma. Techniques for evaluating fault prediction models. Empirical
Softw. Engg., 13:561–595, October 2008a. ISSN 1382-3256.
Yue Jiang, Bojan Cukic, and Tim Menzies. Cost curve evaluation of fault prediction models. In
Proceedings of the 2008 19th International Symposium on Software Reliability Engineering,
ISSRE ’08, pages 197–206, Washington, DC, USA, 2008b. IEEE Computer Society. ISBN
978-0-7695-3405-3.
Yue Jiang, Bojan Cukic, Tim Menzies, and Nick Bartlow. Comparing design and code metrics for
software quality prediction. In Proceedings of PROMISE’08, pages 11–18. ACM, May 2008c.
Han Jiawei and Kamber Micheline. Data Mining - Concepts and Techniques. Morgan Kaufmann,
2002.
Capers Jones. Applied Software Measurement: Global Analysis of Productivity and Quality. Tata
McGraw-Hill, 3 edition, 2008.
Yasutaka Kamei, Akito Monden, Shuji Morisaki, and Ken-ichi Matsumoto. A hybrid faulty module
prediction using association rule mining and logistic regression analysis. In Proceedings of the
Second ACM-IEEE international symposium on Empirical software engineering and measure-
ment, ESEM ’08, pages 279–281, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-971-5.
Stephen H. Kan. Metrics and Models in Software Quality Engineering. Addison-Wesley Longman
Publishing Co., Inc., 2nd edition, 2002.
R. Karthik and N. Manikandan. Defect association and complexity prediction by mining asso-
ciation and clustering rules. In Computer Engineering and Technology (ICCET), 2010 2nd
International Conference on, volume 7, pages V7–569–V7–573, 2010.
Arashdeep Kaur, Parvinder S. Sandhu, and Amanpreet Singh Bra. Early software fault prediction
using real time defect data. In Proceedings of the 2009 Second International Conference on
Machine Vision, ICMV ’09, pages 242–245, Washington, DC, USA, 2009. IEEE Computer
Society. ISBN 978-0-7695-3944-7.
Malik Jahan Khan, Shafay Shamail, Mian Muhammad Awais, and Tauqeer Hussain. Comparative
study of various artificial intelligence techniques to predict software quality. In Proceedings of
The 10th IEEE Multitopic Conference, 2006, INMIC ’06, pages 173–177, December 2006.
Taghi M. Khoshgoftaar and John C. Munson. Predicting software development errors using soft-
ware complexity metrics. IEEE Journal on Selected Areas in Communications, 8(2):253–261,
February 1990.
Taghi M. Khoshgoftaar, David L. Lanning, and Abhijit S. Pandya. A comparative study of pattern
recognition techniques for quality evaluation of telecommunications software. IEEE Journal on
Selected Areas in Communications, 12(2):279–291, February 1994.
Taghi M. Khoshgoftaar and Edward B. Allen. Logistic regression modeling of software quality.
International Journal of Reliability, Quality and Safety Engineering, Vol. 6(Issue 4):303–317,
December 1999a.
Taghi M. Khoshgoftaar and Edward B. Allen. Predicting fault-prone software modules in embed-
ded systems with classification trees. In Proceedings of The 4th IEEE International Symposium
on High-Assurance Systems Engineering. IEEE, Computer Society, 1999b.
Taghi M. Khoshgoftaar and Edward B. Allen. A comparative study of ordering and classification
of fault-prone software modules. Empirical Software Engineering, 4:159–186, 1999c.
Taghi M. Khoshgoftaar and Naeem Seliya. Tree-based software quality estimation models for fault
prediction. In Proceedings of the Eighth IEEE Symposium on Software Metrics (METRICS’02).
IEEE Computer Society, 2002.
Taghi M. Khoshgoftaar and Naeem Seliya. Fault prediction modeling for software quality esti-
mation: Comparing commonly used techniques. Empirical Software Engineering, 8(No. 3):
255–283, September 2003. ISSN 1382-3256.
Taghi M. Khoshgoftaar and Naeem Seliya. Comparative assessment of software quality classifica-
tion techniques: An empirical case study. Empirical Software Engineering, 9:229–257, 2004.
Taghi M. Khoshgoftaar, John C. Munson, Bibhuti B. Bhattacharya, and Gary D. Richardson. Pre-
dictive modeling techniques of software quality from software measures. IEEE Trans. Softw.
Eng., 18:979–987, November 1992. ISSN 0098-5589.
Taghi M. Khoshgoftaar, Edward B. Allen, Kalai S. Kalaichelvan, and Nishith Goel. Early quality
prediction: A case study in telecommunications. IEEE Software, 13(1):65–71, 1996.
Taghi M. Khoshgoftaar, Edward B. Allen, Robert Halstead, Gary P. Trio, and Ronald Flass. Pro-
cess measures for predicting software quality. In Proceedings of The High-Assurance Systems
Engineering Workshop, pages 155 –160, aug 1997a.
Taghi M. Khoshgoftaar, K. Ganesan, Edward B. Allen, Fletcher D. Ross, Rama Munikoti, Nishith
Goel, and Amit Nandi. Predicting fault-prone modules with case-based reasoning. In Proceed-
ings of The Eighth International Symposium On Software Reliability Engineering, pages 27–35.
IEEE, 2-5 Nov 1997b.
Taghi M. Khoshgoftaar, Naeem Seliya, and Nandini Sundaresh. An empirical study of predicting
software faults with case-based reasoning. Software Quality Journal, 14(2):85–111, June 2006.
ISSN 0963-9314.
Sunghun Kim, Thomas Zimmermann, E. James Whitehead Jr., and Andreas Zeller. Predicting
faults from cache history. In Proceedings of The 29th International Conference on Software
Engineering, ICSE’07, 2007.
Michael Klas, Haruka Nakao, Frank Elberzhager, and Jurgen Munch. Support planning and con-
trolling of early quality assurance by combining expert judgment and defect data–a case study.
Empirical Softw. Engg., 15(4):423–454, August 2010.
Aleksander Kolcz, Abdur Chowdhury, and Joshua Alspector. Data duplication: an imbalance
problem? In Workshop on Learning from Imbalanced Datasets II, ICML, Washington DC,
2003.
A. Gunes Koru and Hongfang Liu. An investigation of the effect of module size on defect predic-
tion using static measures. In Proceedings of International Workshop on Predictor Models in
Software Engineering (PROMISE ’05). ACM Press, 2005a.
A. Gunes Koru and Hongfang Liu. Building defect prediction models in practice. IEEE Softw., 22
(6):23–29, nov 2005b. ISSN 0740-7459.
Bart Kosko. Fuzzy engineering. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1997. ISBN
0-13-124991-6.
Miroslav Kubat, Robert C. Holte, and Stan Matwin. Machine learning for the detection of oil spills
in satellite radar images. Machine Learning, 30:195–215, 1998.
Stefan Lessmann, Bart Baesens, Christophe Mues, and Swantje Pietsch. Benchmarking classifi-
cation models for software defect prediction: A proposed framework and novel findings. IEEE
Trans. Softw. Eng., 34(4):485–496, jul 2008.
Paul Luo Li, James Herbsleb, Mary Shaw, and Brian Robinson. Experiences and results from
initiating field defect prediction and product test prioritization efforts at abb inc. In Proceedings
of The 28th International Conference on Software Engineering, ICSE’06, 2006.
Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association rule mining.
In In Proc. of the Fourth International Conference on Knowledge Discovery and Data Mining
(KDD-98), pages 80–86, 1998.
Yan Liu, Allen Fekete, and Ian Gorton. Design-level performance prediction of component based
applications. IEEE Transactions on Software Engineering, 31(11):928–941, November 2005.
Atchara Mahaweerawat, Peraphon Sophatsathit, Chidchanok Lursinsap, and Petr Musilek. Fault
prediction in object oriented software using neural network techniques. In Proceedings of the
InTech Conference, pages 27–34, December 2-4 2004.
Mark W. Maier and Eberhardt Rechtin. The Art of Systems Architecting (2Nd Ed.). CRC Press,
Inc., Boca Raton, FL, USA, 2000. ISBN 0-8493-0440-7.
Henry B. Mann and D. R. Whitney. On a Test of Whether one of Two Random Variables is
Stochastically Larger than the Other. The Annals of Mathematical Statistics, 18(1):50–60, 1947.
Thomas J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2(4):
308–320, December 1976.
Thilo Mende. Replication of defect prediction studies: problems, pitfalls and recommendations. In
Proceedings of the 6th International Conference on Predictive Models in Software Engineering,
PROMISE ’10, pages 5:1–5:10, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0404-7.
Thilo Mende and Rainer Koschke. Revisiting the evaluation of defect prediction models. In
Proceedings of the 5th International Conference on Predictor Models in Software Engineering,
PROMISE ’09, pages 7:1–7:10, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-634-2.
T. Menzies, J. Greenwald, and A. Frank. Data mining static code attributes to learn defect predic-
tors. Software Engineering, IEEE Transactions on, 33(1):2 –13, jan. 2007. ISSN 0098-5589.
Tim Menzies, Justin S. Di Stefano, and Mike Chapman. Learning early lifecycle IV&V quality
indicators. In Proceedings of IEEE Metrics 03. IEEE, 2003.
Tim Menzies, Burak Turhan, Ayse Bener, Gregory Gay, Bojan Cukic, and Yue Jiang. Implications
of ceiling effects in defect predictors. In Proceedings of PROMISE ’08, pages 47–54. ACM,
May 2008. ISBN 978-1-60558-036-4.
Tim Menzies, Zach Milton, Burak Turhan, Bojan Cukic, Yue Jiang, and Ayse Bener. Defect
prediction from static code features: current results, limitations, new approaches. Automated
Software Engg., 17(4):375–407, dec 2010. ISSN 0928-8910.
Tim Menzies, Bora Caglayan, Zhimin He, Ekrem Kocaguneli, Joe Krall, Fayola Peters, and Bu-
rak Turhan. The promise repository of empirical software engineering data, June 2012. URL
http://promisedata.googlecode.com.
C. E. Metz. Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8(4):283–298, 1978.
Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
Audris Mockus, Ping Zhang, and Paul Luo Li. Predictors of customer perceived software quality.
In Proceedings of The 27th International Conference on Software Engineering, ICSE’05, 15-21
May 2005.
Siba N. Mohanty. Models and measurements for quality assessment of software. ACM Computing
Surveys, 11(3):251–275, September 1979.
John C. Munson and Taghi M. Khoshgoftaar. The detection of fault-prone programs. IEEE Trans-
actions on Software Engineering, 18(5):423–434, May 1992.
Nachiappan Nagappan and Thomas Ball. Static analysis tools as early indicators of pre-release
defect density. In Proceedings of The 27th International Conference on Software Engineering,
ICSE’05, 2005a.
Nachiappan Nagappan and Thomas Ball. Use of relative code churn measures to predict system
defect density. In Proceedings of The 27th International Conference on Software Engineering,
ICSE’05, 2005b.
Nachiappan Nagappan, Thomas Ball, and A. Zeller. Mining metrics to predict component failures.
In Proceedings of The 28th International Conference on Software Engineering, ICSE’06, 2006.
Martin Neil and Norman Fenton. Predicting software quality using Bayesian belief networks.
In Proceedings of 21st Annual Software Engineering Workshop NASA/Goddard Space Flight
Centre, pages 217 – 230, 1996.
Niclas Ohlsson and Hans Alberg. Predicting fault-prone software modules in telephone switches.
IEEE Transactions on Software Engineering, Vol. 22(No. 12):34–42, December 1996.
Niclas Ohlsson, Ming Zhaq, and Mary Helander. Application of multivariate analysis for software
fault prediction. Software Quality Journal, 7:51–66, 1998.
Ahmet Okutan and Olcay Taner Yildiz. Software defect prediction using Bayesian networks. Em-
pirical Software Engineering, 19(1):154–181, 2014.
Hector M. Olague, Letha H. Etzkorn, Sampson Gholston, and Stephen Quattlebaum. Empirical
validation of three software metrics suites to predict fault-proneness of object-oriented classes
developed using highly iterative or agile software development processes. IEEE Transactions
on Software Engineering, 33(No. 6):402–419, October 2007.
Linda M. Ottenstein. Quantitative estimates of debugging requirements. IEEE Transactions on
Software Engineering, Vol. SE-5(No. 5):504–514, September 1979.
Linda M. Ottenstein. Predicting numbers of errors using software science. In Proceedings of The
1981 ACM Workshop/Symposium on Measurement and Evaluation of Software Quality, pages
157–167. ACM, 1981.
G.J. Pai and J. Bechta Dugan. Empirical Analysis of Software Fault Content and Fault Proneness
Using Bayesian Methods. Software Engineering, IEEE Transactions on, 33(10):675–686, Oct
2007.
Yi Peng, Gang Kou, Guoxun Wang, Wenshuai Wu, and Yong Shi. Ensemble of software defect
predictors: An AHP-based evaluation method. International Journal of Information Technology
& Decision Making (IJITDM), 10(01):187–206, 2011.
Nicolino J. Pizzi, Arthur R. Summers, and Witold Pedrycz. Software quality prediction using
median-adjusted class labels. In Proceedings of The 2002 International Joint Conference on
Neural Networks, IJCNN’02, 2002.
Roger S. Pressman. Software Engineering: A Practitioner’s Approach. Pearson, 7th edition, 2010.
J. Priyadarshin. Mining of defect using Apriori and defect correction effort prediction. In Proceed-
ings of 2nd National Conference on Challenges and Opportunities in Information Technology
(COIT-2008) RIMT-IET, Mandi Gobindgarh., pages 38–41, 2008.
Sandeep Purao and Vijay Vaishnavi. Product metrics for object-oriented systems. ACM Computing
Surveys, 35(2):191–221, June 2003.
Tong-Seng Quah and Mie Mie Thet Thwin. Application of neural network for predicting software
development faults using object-oriented design metrics. In Proceedings of The 19th Inter-
national Conference on Software Maintenance, ICSM’03. IEEE Computer Society, September
2003.
Zeeshan A. Rana, Sehrish Abdul Malik, Shafay Shamail, and Mian M. Awais. Identifying associ-
ation between longer itemsets and software defects. In Minho Lee, Akira Hirose, Zeng-Guang
Hou, and RheeMan Kil, editors, Neural Information Processing, volume 8228 of Lecture Notes
in Computer Science, pages 133–140. Springer Berlin Heidelberg, 2013. ISBN 978-3-642-
42050-4.
Zeeshan Ali Rana, Shafay Shamail, and Mian Muhammad Awais. Towards a generic model for
software quality prediction. In WoSQ ’08: Proceedings of the 6th International Workshop on
Software Quality, pages 35–40. ACM, May 2008. ISBN 978-1-60558-023-4.
Zeeshan Ali Rana, Mian Muhammad Awais, and Shafay Shamail. An FIS for early detection
of defect prone modules. In De-Shuang Huang, Kang-Hyun Jo, and Hong-Hee Lee, editors,
Proceedings of ICIC’09, Emerging Intelligent Computing Technology and Applications. With
Aspects of Artificial Intelligence, volume 5755/2009 of Lecture Notes in Computer Science,
pages 144–153, Ulsan, South Korea, September 16-19 2009a. Springer Berlin Heidelberg.
Zeeshan Ali Rana, Shafay Shamail, and Mian Muhammad Awais. Ineffectiveness of use of soft-
ware science metrics as predictors of defects in object oriented software. In WCSE ’09: Pro-
ceedings of the 2009 WRI World Congress on Software Engineering, pages 3–7, Washington,
DC, USA, May 19-21 2009b. IEEE Computer Society.
Zeeshan Ali Rana, Mian Muhammad Awais, and Shafay Shamail. Nomenclature unification of
software product measures. IET Software, 5(1):83–102, 2011.
Zeeshan Ali Rana, Mian M. Awais, and Shafay Shamail. Impact of using information gain in
software defect prediction models. In De-Shuang Huang, Vitoantonio Bevilacqua, and Prashan
Premaratne, editors, Intelligent Computing Theory, volume 8588 of Lecture Notes in Computer
Science, pages 637–648. Springer International Publishing, 2014. ISBN 978-3-319-09332-1.
doi: 10.1007/978-3-319-09333-8_69. URL http://dx.doi.org/10.1007/978-3-319-09333-8_69.
Ralf H. Reussner, Heinz W. Schmidt, and Iman H. Poernomo. Reliability prediction for
component-based software architectures. Journal of Systems and Software, 66(Issue 3):241–
252, June 2003.
C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 2
edition, 1979. ISBN 0408709294.
P.S. Sandhu, R. Goel, A.S. Brar, J. Kaur, and S. Anand. A model for early prediction of faults in
software systems. In The 2nd International Conference on Computer and Automation Engineer-
ing (ICCAE), pages 281–285, 2010. ISBN 978-1-4244-5585-0.
Victor Schneider. Some experimental estimators for developmental and delivered errors in software
development projects. ACM SIGMETRICS Performance Evaluation Review, Vol. 10(Issue 1):
169–172, 1981.
Naeem Seliya and Taghi M. Khoshgoftaar. Software quality estimation with limited fault data: A
semi-supervised learning perspective. Software Quality Journal, 15:327–344, August 2007.
Zongyao Sha and Jiangping Chen. Mining association rules from dataset containing predeter-
mined decision itemset and rare transactions. In Natural Computation (ICNC), 2011 Seventh
International Conference on, volume 1, pages 166 –170, july 2011.
Shari L. Pfleeger and Joanne M. Atlee. Software Engineering: Theory and Practice. Pearson, 4th
edition, 2010.
Vincent Y. Shen, Tze-Jie Yu, Stephen M. Thebaut, and Lorri R. Paulsen. Identifying error-prone
software – an empirical study. IEEE Transactions on Software Engineering, SE-11:317–323,
April 1985.
Martin J. Shepperd, Qinbao Song, Zhongbin Sun, and Carolyn Mair. Data quality: Some comments
on the nasa software defect datasets. IEEE Transactions on Software Engineering, 39(9):1208–
1215, September 2013.
IEEE Computer Society. IEEE Standard Glossary of Software Engineering Terminology. IEEE
Standard 610.12-1990, 1990.
Qinbao Song, M. Shepperd, M. Cartwright, and C. Mair. Software defect association mining and
defect correction effort prediction. Software Engineering, IEEE Transactions on, 32(2):69–82,
feb. 2006.
Qinbao Song, Zihan Jia, M. Shepperd, Shi Ying, and Jin Liu. A general software defect-proneness
prediction framework. Software Engineering, IEEE Transactions on, 37(3):356 –370, may-june
2011. ISSN 0098-5589.
Kotsiantis Sotiris and Kanellopoulos Dimitris. Association rules mining: A recent overview.
GESTS International Transactions on Computer Science and Engineering, 32:71–82, 2006.
Zhongbin Sun, Qinbao Song, and Xiaoyan Zhu. Using coding-based ensemble learning to improve
software defect prediction. Systems, Man, and Cybernetics, Part C: Applications and Reviews,
IEEE Transactions on, 42(6):1806 –1817, nov. 2012. ISSN 1094-6977.
M.P. Thapaliyal and Garima Verma. Software defects and object oriented metrics - an empirical
analysis. International Journal of Computer Applications, 9(5):41–44, November 2010. Pub-
lished By Foundation of Computer Science.
Mie Mie Thet Thwin and Tong-Seng Quah. Application of neural network for predicting software
development faults using object-oriented design metrics. In Proceedings of The 9th International
Conference on Neural Information Processing, ICONIP’02, volume 5, 2002.
Burak Turhan and Ayse Bener. Analysis of Naive Bayes’ assumptions on software fault data: An
empirical study. Data Knowl. Eng., 68(2):278–290, February 2009.
Vijay K. Vaishnavi, Sandeep Purao, and Jens Liegle. Object-oriented product metrics: A generic
framework. Information Sciences, 177:587–606, 2007.
Stefan Wagner. Global sensitivity analysis of predictor models in software engineering. In
Proceedings of 3rd International Workshop on Predictor Models in Software Engineering
(PROMISE ’07). IEEE Computer Society Press, 2007.
Stefan Wagner and Florian Deissenboeck. An integrated approach to quality modelling. In Pro-
ceedings of 5th Workshop on Software Quality (5-WoSQ). IEEE Computer Society Press, 2007.
Huanjing Wang, Taghi M. Khoshgoftaar, and Naeem Seliya. How many software metrics should
be selected for defect prediction? In FLAIRS Conference, pages 69–74. AAAI Publications,
2011.
Huanjing Wang, Taghi M. Khoshgoftaar, and Qianhui (Althea) Liang. A study of software metric
selection techniques: Stability analysis and defect prediction model performance. International
Journal on Artificial Intelligence Tools, 22(05):1360010, 2013.
Ke Wang, Senqiang Zhou, Qiang Yang, and Jack Man Shun Yeung. Mining customer value:
From association rules to direct marketing. Data Mining and Knowledge Discovery, 11(1):
57–79, 2005.
Qi Wang, Bo Yu, and Jie Zhu. Extract rules from software quality prediction model based on neural
network. In Proceedings of The 16th IEEE International Conference on Tools with Artificial
Intelligence, ICTAI’04, 2004.
Qi Wang, Jie Zhu, and Bo Yu. Extract rules from software quality prediction model based on neural
network. In Proceedings of The 11th International Conference on Evaluation and Assessment
in Software Engineering, EASE’07, April 2007.
Wen-Li Wang, Ye Wu, and Mei-Hwa Chen. An architecture-based software reliability model. In
Proceedings of The Pacific Rim International Symposium on Dependable Computing, 1998.
Elaine J. Weyuker, Thomas J. Ostrand, and Robert M. Bell. Comparing negative binomial and
recursive partitioning models for fault prediction. In Proceedings of the 4th international work-
shop on Predictor models in software engineering, PROMISE ’08, pages 3–10, New York, NY,
USA, 2008. ACM. ISBN 978-1-60558-036-4.
Leland Wilkinson, Anushka Anand, and Dang Nhon Tuan. Chirp: a new classifier based on com-
posite hypercubes on iterated random projections. In Proceedings of the 17th ACM SIGKDD
international conference on Knowledge discovery and data mining, KDD ’11, pages 6–14, New
York, NY, USA, 2011. ACM. ISBN 978-1-4503-0813-7.
Leland Wilkinson, Anushka Anand, and Tuan Nhon Dang. Substantial improvements in the
set-covering projection classifier chirp (composite hypercubes on iterated random projections).
ACM Transactions on Knowledge Discovery from Data, 6(4):19:1–19:18, December 2012.
Sebastian Winter, Stefan Wagner, and Florian Deissenboeck. A comprehensive model of usability.
In Proceedings of Engineering Interactive Systems 2007. Springer-Verlag, 2007.
Ian H. Witten, Eibe Frank, Len Trigg, Mark Hall, Geoffrey Holmes, and Sally Jo Cun-
ningham. The Waikato Environment for Knowledge Analysis (WEKA), 2008. URL
http://www.cs.waikato.ac.nz/ml/weka.
Fei Xing and Ping Guo. Support vector regression for software reliability growth modeling and
prediction. In Advances in Neural Networks - ISNN 2005, Second International Symposium
on Neural Networks, Part 1, ISNN (1), volume 3496 of Lecture Notes in Computer Science.
Springer, 2005.
Fei Xing, Ping Guo, and Michael R. Lyu. A novel method for early software quality prediction
based on support vector machine. In Proceedings of The 16th IEEE International Symposium
on Software Reliability Engineering. IEEE, 2005.
Bo Yang, Lan Yao, and Hong-Zhong Huang. Early software quality prediction based on a fuzzy
neural network model. In Proceedings of Third International Conference on Natural Computa-
tion (ICNC 2007). IEEE Computer Society, 2007.
Xiaohong Yuan, Taghi M. Khoshgoftaar, Edward B. Allen, and K. Ganesan. An application of
fuzzy clustering to software quality prediction. In Proceedings of The 3rd IEEE Symposium on
Application-Specific Systems and Software Engineering Technology. IEEE, 2000.
Yuming Zhou and Hareton Leung. Empirical analysis of object-oriented design metrics for pre-
dicting high and low severity faults. IEEE Transactions on Software Engineering, 32(No. 10):
771–789, October 2006.
Appendices
Appendix A
PRATO: A PRACTICAL TOOL FOR SDP
A.1 Collecting and Combining Defect Prediction Models
Various techniques have been employed to predict software quality prior to the complete development
of the software, for example regression analysis (Khoshgoftaar and Munson, 1990), neural networks
(Khoshgoftaar et al., 1994), case-based reasoning (Ganesan et al., 2000), genetic algorithms (Bouktif
et al., 2004), and fuzzy neural networks (Yang et al., 2007). Based on these techniques, different
prediction models have been proposed in the literature. Studies have shown that, despite the availability
of such a large number of prediction models, their application in the software industry is very limited
(Bouktif et al., 2004). One reason for this limited adoption is considered to be the specificity of
each prediction model (Rana et al., 2008, Wagner and Deissenboeck, 2007): each model has been
developed for a specific context and a specific programming paradigm. Wagner and Deissenboeck
(2007) and Rana et al. (2008) have independently discussed the dimensions of this specificity.
Moreover, these studies have emphasized the need for a generic approach to predicting software
quality that helps quality managers focus on quality (Rana et al., 2008, Wagner and Deissenboeck,
2007). This Chapter presents PraTo, a tool that implements the generic approach proposed earlier
by Rana et al. (2008) and builds on existing models to predict software defects.
PraTo lets software managers specify a scenario by providing information about their data,
express their objectives, state their preferences based on those objectives, and select the prediction
model most suitable for the given preferences. Using PraTo, one can develop a number of
prediction models with different model parameters before selecting the best of them.
The rest of the Chapter is organized as follows: Section A.2 introduces PraTo and describes its
architecture and components. Section A.4 shows the application of PraTo with the help of an exam-
ple. Section A.5 discusses the benefits of using PraTo and mentions its limitations. Section A.6
concludes the Chapter and identifies future directions.
A.2 Tool Architecture and Working
PraTo provides quality managers with various prediction techniques for software defect
prediction. The prediction can be binary, i.e. Defect Prone (D) or Not Defect Prone (ND), as well
as numeric, i.e. Number of Defects. To analyze the models, the tool collects different metrics for
both types of predictions. In order to develop a prediction model for their software, managers
are required to provide their software data and select a model from a list of available models. In
case they are unable to decide which model to use, they have the provision to search for and use the
best model(s) for datasets similar to their software data. On the other hand, if the managers do
not have enough data to develop a certain model, they can provide domain-level information about
their software and find the nearest dataset. The performance of the available models on that dataset
can then be studied to decide which prediction model to choose. The goal of PraTo is to reduce the
time managers spend studying and comparing different models and to help them select the most
suitable model given their preferences.
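The nearest-dataset lookup described above can be illustrated with a small sketch. It assumes each dataset is characterized simply by the set of metric names it contains, which is a deliberate simplification of PraTo's matching (the tool also uses domain-level information and the UMD); the repository contents and the overlap score are illustrative:

```python
def metric_overlap(query, dataset_metrics):
    """Jaccard similarity between two sets of metric names."""
    q, d = set(query), set(dataset_metrics)
    return len(q & d) / len(q | d) if q | d else 0.0

# Hypothetical repository: dataset name -> metric names it provides.
repository = {
    "kc1": {"loc", "v(g)", "ev(g)", "iv(g)"},
    "kc1-classlevel-oo": {"cbo", "dit", "wmc", "rfc"},
}

# Metrics the manager can compute for their own software.
query = {"cbo", "dit", "noc"}

# Pick the dataset whose metric set overlaps most with the query.
nearest = max(repository, key=lambda name: metric_overlap(query, repository[name]))
print(nearest)  # kc1-classlevel-oo
```

The reported performance of the available models on the nearest dataset can then be studied before committing to a model, as described above.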
PraTo is a multiphase approach. A schematic overview of PraTo in Figure A.1 shows that it
takes a scenario as input from the user and performs the steps mentioned using the components
depicted outside the dotted box. A scenario is a set of constraints, objectives and contextual infor-
mation regarding the software. After specification of a scenario, the input passes through the input
selection phase, where datasets similar to the given scenario are identified with the help of a Uni-
fied Metrics Database (UMD). Before developing a model, the user inputs pass through the model
selection phase, which uses the Analytic Hierarchy Process (AHP) to select a model based on cer-
tain performance measures received from a Rule Based System (RBS). AHP studies the models’
performances in terms of the performance measures and performs the selection based on user pref-
erences. In the model development phase, the user input is preprocessed using Information Gain (IG)
or Principal Component Analysis (PCA) to drop the irrelevant attributes or select the most signifi-
cant dimensions. Later on, the developed model is validated using 10-fold cross validation before
reporting of the results in terms of ROC points (Egan, 1975).

Fig. A.1: The Generic Approach
The rest of the section describes the architecture of PraTo in two steps. First, the components
outside the dotted box are discussed, followed by a discussion of the phases inside the
dotted box of Figure A.1. In the end, the working of PraTo is described.
A.2.1 A Scenario
A scenario is specified in terms of business and data constraints, such as budget, time, resources,
and limited training data; an objective, such as finding the minimum number of modules that need
to be tested without shipping too many defects; and additional contextual information, such as which
software metrics of the conventional paradigm are available and whether the available data is
design- or implementation-phase data. PraTo requires the users to specify the severity level of each
constraint in the interval [0, 1]. The objective is expressed in terms of sensitivity towards shipping
defective modules and the importance of finding a high number of modules that need to be tested.
Additional contextual information can be provided either by adding a certain dataset or by
providing information regarding the software metrics in use.
Objectives and constraints are sent to an RBS to find out what values of model performance mea-
sures should be considered. Additional information is used to identify datasets similar to other
datasets of NASA projects (Menzies et al., 2012).

Tab. A.1: Datasets List

Serial Dataset Parameters Instances Variant ND Modules (%) SVP
ds1 kc1 21 2109 No 84.54% 2
ds2 kc1-classlevel 94 145 No 58.62% 8
ds3 kc1-classlevel-oo 8 145 Yes 58.62% 0
ds4 kc1-originally-classlevel 10 145 Yes 58.62% 1
ds5 kc2 21 522 No 79.5% 0
ds6 kc3 39 458 No 90.61% 0
ds7 pc1 21 1109 No 93.05% 0
ds8 jm1 21 1109 No 93.05% 0
ds9 cm1 21 1109 No 93.05% 0
ds10 kc1-classlevel-num 94 145 No 58.62% 8
ds11 kc1-classlevel-oo-num 8 145 Yes 58.62% 0
ds12 kc1-originally-classlevel-num 10 145 Yes 58.62% 1
SVP = Same Valued Parameters
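A scenario of this kind could be represented as a small record type. The field names, defaults, and validation below are illustrative assumptions rather than PraTo's actual data model; the severity check mirrors the [0, 1] interval required of each constraint:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """Illustrative container for a PraTo-style scenario."""
    constraints: dict = field(default_factory=dict)   # name -> severity in [0, 1]
    sensitivity_to_shipped_defects: float = 0.5       # objective component
    importance_of_finding_modules: float = 0.5        # objective component
    context: dict = field(default_factory=dict)       # e.g. paradigm, lifecycle phase

    def __post_init__(self):
        # Every constraint severity must lie in the interval [0, 1].
        for name, severity in self.constraints.items():
            if not 0.0 <= severity <= 1.0:
                raise ValueError(f"severity of {name!r} must lie in [0, 1]")

s = Scenario(
    constraints={"budget": 0.8, "time": 0.6, "training_data": 0.3},
    sensitivity_to_shipped_defects=0.9,
    importance_of_finding_modules=0.7,
    context={"paradigm": "conventional", "phase": "design"},
)
print(sorted(s.constraints))  # ['budget', 'time', 'training_data']
```

The constraints and objective fields would be passed to the RBS, while the context dict would drive the nearest-dataset lookup.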
A.2.2 Dataset Repository
The datasets used in this study describe defectiveness information regarding various software mod-
ules and have been taken from the PROMISE repository (Menzies et al., 2012). The number of
parameters and instances in each dataset is listed in Table A.1. The parameters are the indepen-
dent variables, and an additional parameter is used as output to indicate either the class of a software
module (i.e. D or ND) or the Number of Defects. Each instance in a dataset represents a software
class (or module), and the parameters are software metrics calculated for that module. Four of the
datasets listed in Table A.1 are variants, whereas the rest have been used without any modifications.
A dataset is termed a variant if its set of parameters has been modified. Here, ds3 and ds4 are
variants of ds2, and ds11 and ds12 are variants of ds10. Furthermore, the percentage of negative
instances in each dataset is also mentioned in the table to show any imbalance in a dataset. The
last column mentions the number of parameters that hold the same value for all instances. Such
parameters are not useful in prediction; rather, they can potentially have a negative effect on the
prediction.
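Screening for such same-valued parameters is straightforward. The sketch below, over column-oriented data with invented column names, flags the constant columns and drops them before model development:

```python
def same_valued_parameters(columns):
    """Return the names of parameters holding one value across all instances."""
    return [name for name, values in columns.items() if len(set(values)) <= 1]

# Toy dataset with one constant column.
data = {
    "loc":        [10, 25, 40],
    "complexity": [1, 3, 2],
    "language":   ["C", "C", "C"],   # same value for every instance
}

constant = same_valued_parameters(data)
print(constant)  # ['language']

# Constant parameters carry no information, so they are dropped.
filtered = {k: v for k, v in data.items() if k not in constant}
```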
The datasets ds2 and ds10 represent the same software modules. The only difference
is that ds2 has a binary classification as its output parameter whereas ds10 gives a numeric classification.
Two variants of ds2 (kc1-classlevel data) have been formed by dividing the parameters into three
groups. Group A has 8 parameters, Group B has 10 parameters, and Group C has 84 parameters,
where Group A ⊂ Group B and Group B ∩ Group C = ∅. The parameters in Group A are the most
commonly used parameters for defect prediction, as identified by Menzies et al. (2003). The ds3
(kc1-classlevel-oo) comprises the Group A parameters only and is marked as a variant in Table A.1.
The values of the parameters in Group B are originally measured at module level, and these parameters
are used to form ds4 (kc1-originally-classlevel). This dataset is also marked as a variant in the table.
The values of the parameters in Group C are originally measured at method level and were later
transformed to module level before the dataset was made available on the PROMISE website (Menzies
et al., 2012). The parameters used in ds2 are the union of the Group B and Group C parameters. The
datasets ds11 and ds12 have been modified using the same procedure to make them variants of
ds10.
A.2.3 Unified Metrics Database
Rana et al. have suggested a Unified Metrics Database (UMD) to remove inconsistencies in the naming of software product metrics, and we use the UMD in the Input Selection activity (Rana et al., 2011). The UMD plays a vital role in the selection of datasets with common metrics. Developing
Tab. A.2: List of Models in Repository

1. Least Square Regression
2. Robust Regression
3. Neural Networks
4. Radial Basis Networks
5. Fuzzy Inference System
6. Classification and Regression Trees
7. Logistic Regression
a consistent UMD requires dedicated attention. Once the UMD has been developed, it will be helpful for software managers who need to decide which datasets are similar to their problem domain or which metrics to use for their projects. The UMD can further be helpful for future studies on software product metrics and for the development of prediction models based on these metrics. It is worthwhile to mention here that the proposed UMD includes base as well as certain derived metrics.
A.2.4 Models Repository
With a large number of prediction techniques reported in the literature, it is necessary to make them available to quality managers. Because every technique has different aspects and capabilities, each has its utility in different situations. Gray et al. have compared a number of techniques in terms of their modeling capabilities (Gray and MacDonell, 1997). The Models Repository in PraTo contains the models compared by Gray et al.; this list of models is presented in Table A.2.
Fig. A.2: Input Selection (the Datasets Repository R and the user dataset D pass through Problem Domain based Similarity, Common Measures based Similarity using the UMD and measures' labels, and Data Values based Similarity, producing Rd, Rc, and Rv)
Fig. A.3: Model Selection
Fig. A.4: Main Screen of PraTo
A.2.5 Input Selection
Input Selection is a three-step activity that outputs datasets similar to the dataset provided by the user. The three steps of this activity are shown in Figure A.2. In the first step, the datasets Rd having problem-domain-level similarity with the user's dataset D are selected using meta information about each dataset. In the second step, the datasets Rc that use common metrics are identified with the help of the UMD. In the third step, these datasets are compared with D, and the datasets Rv having data-value-level similarity are selected. After each of these steps the number of datasets keeps decreasing, such that:

Rv ⊆ Rc ⊆ Rd ⊆ R (A.1)

where R is the set of all datasets in the Repository, Rd contains the datasets with domain-based similarity to D, Rc the datasets with common metrics with D, and Rv the datasets with data-value-level similarity to D.
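The successive filtering of Equation A.1 can be sketched as follows. The dataset records and similarity predicates below are illustrative assumptions for the example, not PraTo's actual implementation:

```python
# Sketch of the three-step input selection (Eq. A.1): each step filters
# the previous result, so Rv <= Rc <= Rd <= R holds by construction.
# The similarity predicates are illustrative placeholders.

def select_similar_datasets(R, D, same_domain, shares_metrics, values_similar):
    Rd = [ds for ds in R if same_domain(ds, D)]        # domain-level similarity
    Rc = [ds for ds in Rd if shares_metrics(ds, D)]    # common metrics via UMD
    Rv = [ds for ds in Rc if values_similar(ds, D)]    # data-value similarity
    return Rd, Rc, Rv

# Toy usage: datasets are dicts with hypothetical metadata fields.
R = [
    {"name": "ds1", "domain": "flight", "metrics": {"LOC", "CBO"}, "mean_loc": 120},
    {"name": "ds2", "domain": "flight", "metrics": {"LOC", "DIT"}, "mean_loc": 50},
    {"name": "ds3", "domain": "ground", "metrics": {"LOC", "CBO"}, "mean_loc": 60},
]
D = {"domain": "flight", "metrics": {"LOC", "CBO"}, "mean_loc": 100}

Rd, Rc, Rv = select_similar_datasets(
    R, D,
    same_domain=lambda ds, d: ds["domain"] == d["domain"],
    shares_metrics=lambda ds, d: ds["metrics"] >= d["metrics"],
    values_similar=lambda ds, d: abs(ds["mean_loc"] - d["mean_loc"]) < 50,
)
```

Because each step filters the previous step's output, the subset chain of Equation A.1 always holds, whatever predicates are plugged in.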
A.2.6 Context Specification and Model Selection
With more than one parameter involved, the selection of a model poses a problem to quality managers and results in subjective selection of a prediction model. In order to reduce this subjectivity, AHP-based model selection is performed. The activity receives the performance measures needed for the selection decision and the user's preferences of one parameter over another, as shown in Figure A.3.
Dataset Similarity
The user's dataset is compared with the public datasets; the closest public dataset defines the user scenario. The model with the best performance, in terms of recall, on that public dataset is selected.
Providing Set of Constraints
The user provides a set of constraints in terms of HR, Budget, and Time. Based on the severity of the constraints, target values of the performance measures are determined. The model with the best values of those performance measures is selected.
AHP Based Ranking
The user selects three model performance measures and provides the relative importance of the measures. An AHP-based technique is then applied to select a model.
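As an illustration, a minimal AHP-style ranking over three measures might look like the following. The pairwise judgments and model scores are invented for the example; PraTo's actual implementation may differ:

```python
# Illustrative AHP-based model ranking over three performance measures.
# Weights come from the pairwise-comparison matrix via row geometric
# means (a standard approximation of the principal eigenvector); each
# model's score is the weighted sum of its measure values.
import math

def ahp_weights(pairwise):
    n = len(pairwise)
    gm = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(gm)
    return [g / total for g in gm]

def rank_models(models, weights):
    # models: {name: (recall, accuracy, 1 - FP rate)}, all "higher is better"
    scores = {m: sum(w * v for w, v in zip(weights, vals)) for m, vals in models.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Say recall is judged 3x as important as accuracy and 5x as important
# as the FP-rate measure (hypothetical user preferences).
pairwise = [
    [1, 3, 5],
    [1 / 3, 1, 2],
    [1 / 5, 1 / 2, 1],
]
w = ahp_weights(pairwise)
ranking = rank_models({"CART": (0.70, 0.85, 0.90), "FIS": (0.80, 0.80, 0.85)}, w)
```

The row-geometric-mean weighting is the usual lightweight stand-in for AHP's principal-eigenvector computation and gives identical weights for consistent judgment matrices.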
A.2.7 Model Development
Prediction models sometimes use all software metrics for model development, but sometimes preprocessing is performed to resolve data compatibility and correlation issues and to reduce dimensionality. For example, Principal Component Analysis (PCA), Factor Analysis (FA), and Information Gain (IG) are techniques that can be used to reduce the size of the input space. PCA transforms the larger input space into a smaller one such that all variables of the larger input space are represented in the reduced space. Retaining the influence of all variables is not always good practice: experiments have shown that it sometimes deteriorates the performance of prediction models (Rana et al., 2009b). Furthermore, there are scenarios where the data is not suitable for direct PCA, and additional preprocessing is needed first. For these scenarios we suggest the use of IG to drop irrelevant attributes. The tool presented in this chapter handles both types of preprocessing.
A.2.8 Output Validation
After a model has been developed on the training data, it is validated using 10-fold cross validation. The complete data is split into 10 equal groups; in each of the 10 iterations, one group is held out as the test set and the remaining nine groups are used for training. After these 10 iterations, the average results of the iterations are reported.
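A minimal sketch of standard 10-fold cross validation, in which every instance is tested exactly once, is given below (illustrative only, not the tool's code):

```python
# Minimal 10-fold cross-validation index generator: each fold serves
# once as the test set while the remaining nine folds train the model.
def k_fold_indices(n, k=10):
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

n = 50
coverage = []
for train, test in k_fold_indices(n, k=10):
    assert set(train).isdisjoint(test)   # no instance trains and tests in one fold
    coverage.extend(test)
# across the 10 folds, every instance appears in exactly one test set
```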
A.3 Salient Features of PraTo
The user can specify a scenario and select model performance measures using the Rule Based System shown in Figure A.3. The scenario specification module is meant to guide the user as to which performance measures should be used for AHP-based model selection. Currently this module works for three performance measures only, i.e., False Positive (FP), True Positive (TP), and Accuracy (Acc). A scenario is specified using the severity levels of three parameters: HR, Budget, and Time. Based on the severity level of each parameter, a rule-based system outputs the desired values of the above three performance measures. Once scenario specification is fully developed, the manual model-selection part may be bypassed: based on the specified scenario, the system will identify the performance measures and perform the AHP-based selection.
The user can add a new dataset to the repository using the Add Dataset button. In the first step, the user is asked to provide domain information about the new dataset (e.g., how many requirement-phase measures are used in the dataset, how many of the measures are product measures, etc.). This information is used to identify the datasets similar to the new one. Project types will also be listed to improve the domain-level similarity. To find similarity on the basis of the measures used, further information about the measures used in the new dataset is also collected, but this part is not
Fig. A.5: A Scenario. Constraints: Business Domain severity (0 to 1): Budget = 0.4, HR = 0.9, Time = 0.7; Data: less training data, skewed data. Objective: find the minimum modules that need to be tested while not shipping too many defects. Additional contextual information: design- and implementation-phase data available; conventional-paradigm measures; more information will be collected after performance-measures mapping.
fully functional at the moment. These measures can be selected using the list of metrics (coming from the Unified Metrics Database) shown in the lower-left part of the Add Dataset screen.
The user can select a certain dataset or a certain model to be run. A checkbox 'Run all models' allows the user to run all the models on the selected dataset, and another checkbox 'Run on all datasets' allows the user to run a selected model on all the datasets and see its performance. The screen also provides options to preprocess data before running a prediction model. Two types of preprocessing are available so far. Principal Component Analysis (PCA): with this preprocessing the prediction model is developed on the basis of the reduced dimensions only. Information Gain (IG): a text box adjacent to the IG checkbox specifies a threshold, and attributes of the dataset whose IG value is lower than this threshold are dropped before developing the model. During the IG-based preprocessing, binning of the attributes needs to be performed; therefore three binning options are provided. The user can select from Equal Distance (EQ Dist) bins, Equal Frequency (EQ Freq) bins, and a combination of both (Hybrid).
Before model development the user has the option to specify the data split; by default the split is 2/3 training data and 1/3 test data. In order to select random training examples from each dataset before model development, the 'Randomize' checkbox should be checked. If it is left unchecked, the first 2/3 of the examples are taken as training data and the rest as test data. This checkbox needs to be checked every time a new dataset is selected. Plot Input and Residual Errors: the user has options to produce various plots of the input parameters, including Capa and PMF plots. The user
Fig. A.6: Specifying a Scenario
can plot all the input parameters separately or in one figure, and also has the option to plot residual errors through the 'Plot residual Errors' checkbox. The 'Software Error Tolerance' text box is used when the underlying dataset predicts the number of defects instead of the defectiveness of each module; in such a case, only those modules are considered defect prone which have more errors than the Software Error Tolerance threshold. The default value of this threshold is 5. The Beta value is used to calculate the F Measure, which assesses a model's performance in case of skewed datasets; Beta = 1 means that equal importance is given to precision and recall when calculating the F value. Finally, k = 10 is used for 10-fold cross validation, and the user can change this value of k.
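The F Measure with the user-specified Beta can be computed with a small helper; this is the standard F-beta formula, shown here purely as an illustration:

```python
# F-measure with the user-specified Beta: Beta = 1 weighs precision and
# recall equally; Beta > 1 favours recall, Beta < 1 favours precision.
def f_measure(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_measure(0.5, 0.5)          # balanced case
f2 = f_measure(0.5, 0.8, beta=2)  # leans toward the higher recall
```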
The lower pane of the main screen allows the user to perform Analytic Hierarchy Process (AHP) based model selection using any three performance parameters listed in the rightmost list item 'Performance Measures'. The user provides the indices of the desired performance measures and the relative importance of each of them.
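The default data split described above (2/3 training and 1/3 test, with optional randomization via the 'Randomize' checkbox) can be sketched as follows; this is an illustrative sketch, not the tool's implementation:

```python
# Sketch of the default data split: 2/3 training, 1/3 test. With
# randomize=True the examples are shuffled first; otherwise the first
# 2/3 are taken in their original order, as in the tool's default.
import random

def split_data(examples, train_fraction=2 / 3, randomize=False, seed=None):
    examples = list(examples)
    if randomize:
        random.Random(seed).shuffle(examples)
    cut = round(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]

train, test = split_data(range(30), randomize=True, seed=1)
```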
A.4 Application of PraTo
PraTo allows users to specify multiple scenarios and get prediction information regarding their software. This section describes one scenario as an example and demonstrates the use of PraTo for that scenario.
A.4.1 Scenario Specification:
Consider a scenario where the user has a very severe human resource problem but at the same time does not want to ship too many defects to the client. Such a scenario is expressed in Figure A.5, and the PraTo user interface (UI) to express it is shown in Figure A.6. HR, schedule, and budget issues are common problems faced by the majority of managers; the top three lists require the user to specify the severity level of these three issues. If any of these issues is not present, the user can assign it zero weight in the respective list. The lower list requires the user to state the tolerance for shipping defects: if shipment of a few defects can be tolerated because of a tight schedule, a higher value of the tolerance is selected. The radio buttons allow the user to select an objective. Once the user specifies the scenario, PraTo uses an RBS to identify which labels and performance measures should be used to perform model selection. In this case the three parameters are TP, FP, and Accuracy, and their suggested values are Low, Low, and Medium respectively.
A.4.2 Input Selection and Preprocessing:
The user might also like to find out which of the existing datasets have characteristics similar to his software. This information can help in pruning the prediction models that have the potential to work well on the user's dataset. In order to let the user gather this information, PraTo provides an Add Dataset feature. The Add Dataset screen requires the user to provide metadata about his software and select the software metrics being collected, as shown in Figure A.7. The unified labels of the software metrics are displayed in the bottom-left list, and the user can select one metric at a time to indicate its usage in the user's data. The figure shows that, for the given metadata, ds12 is the nearest dataset.
Algorithm 2 IG-based algorithm to select relevant metrics for defect prediction models

Require: AllParameters, Target, α
Ensure: RelevantParameters is returned s.t. RelevantParameters ⊆ AllParameters

  Entropy ← calculateEntropy(Target)
  RelevantParameters ← AllParameters
  paramCount ← countParams(AllParameters)
  for i = 1 to paramCount do
    IG[i] ← calculateInformationGain(AllParameters[i], Entropy)
    if IG[i] < α then
      drop this parameter from RelevantParameters
    end if
  end for
  return RelevantParameters
Removing Irrelevant Attributes
We suggest an algorithm to drop irrelevant attributes on the basis of IG and use the relevant attributes to detect defect-prone (D) and not defect-prone (ND) modules. We study the impact of this approach on two defect prediction models, i.e., a Classification Trees (CT) model and a Fuzzy Inferencing System (FIS) based model (Rana et al., 2009a). It is interesting to note that the performance of both models improved after the IG-based dimension reduction. We compare the results of using PCA and IG for both models and observe that using IG as a replacement for PCA produces better prediction results in terms of misclassification rate and recall.

The proposed approach, presented in Algorithm 2 (Rana et al., 2014), drops irrelevant attributes from a dataset. Irrelevant attributes are those which do not individually contribute significantly to prediction.
As seen in Algorithm 2, three input values are required for this approach; the two non-scalar values are underlined. The first of the two non-scalar inputs, AllParameters, is an n × m matrix with n software modules and m software metrics used to detect D and ND modules. The second input, Target, is an n × 1 matrix representing the defectiveness of the n software modules. The scalar α, with 0 ≤ α ≤ 1, represents a threshold to decide whether a certain attribute needs to be dropped or not. The output, RelevantParameters, is an n × k matrix with k ≤ m and RelevantParameters ⊆ AllParameters.
Calculating Entropy and IG: The entropy of a set S is defined as follows (Mitchell, 1997):

Entropy(S) = Σ_{i=1..k} (−p_i log2 p_i)   (A.2)

where k is the total number of classes and p_i is the proportion of examples that belong to class i.

The information gain of an attribute A in a set S is defined as follows (Mitchell, 1997):

IG(S, A) = Entropy(S) − Σ_{i=1..v} (|S_i| / |S|) Entropy(S_i)   (A.3)

where v is the total number of distinct values in attribute A (i.e., the domain of A) and S_i is the set of examples that hold a certain value from the domain of A.
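A runnable sketch of Algorithm 2 using Equations A.2 and A.3 is shown below. The data values are invented for the example, and the attributes are assumed to be already binned:

```python
# Executable sketch of Algorithm 2 with Eqs. A.2-A.3: compute the IG of
# each (binned) attribute against the D/ND target and drop attributes
# whose IG falls below the threshold alpha.
import math
from collections import Counter

def entropy(labels):                                   # Eq. A.2
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(column, target):                         # Eq. A.3
    n = len(target)
    subsets = {}
    for value, label in zip(column, target):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(target) - remainder

def relevant_parameters(data, target, alpha):          # Algorithm 2
    # data: {metric_name: list of binned values}; keep metrics with IG >= alpha
    return {m: col for m, col in data.items() if info_gain(col, target) >= alpha}

data = {
    "useful": ["lo", "lo", "hi", "hi"],   # perfectly predicts the target
    "noise":  ["a", "b", "a", "b"],       # carries no information about it
}
target = ["ND", "ND", "D", "D"]
kept = relevant_parameters(data, target, alpha=0.01)
```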
In order to gauge the performance of each approach, the percentage change (increase/decrease) in Recall, Acc, and MC has been calculated using the following expression:

%age change = ((DR Reading − NoDR Reading) / NoDR Reading) × 100   (A.4)

where NoDR Reading is the value of the evaluation parameter when the model is developed without any dimension reduction and DR Reading is its value when the model is developed after dimension reduction.
Experiments were conducted with a 67%/33% split of training and test data for each dataset. A number of experiments were conducted on randomized data, and the best results of each model in terms of misclassification rate are compared and reported here.

We have used seven datasets available in the PROMISE repository (Menzies et al., 2012) and have kept α = 0.01. For most of the datasets, the performance shortfall incurred with IG has been smaller than the performance shortfall observed with PCA. The average values of the percentage changes in the three evaluation parameters indicate that, on average, IG has been the better choice for these datasets.
We have compared the two dimension reduction approaches dataset-wise and presented the results in Table A.3. IG has emerged as the better approach for the majority of these datasets. For ds1, use of IG has improved the performance of both models, and therefore IG is listed as the overall winner. It should be mentioned here that the largest dataset, ds1, is dominated by 84.5% ND modules; therefore none of the models could perform very well in identifying the D modules. In the case of ds2 (145 instances, 94 attributes), there are many parameters that are not good predictors of D modules (Rana et al., 2009b), but PCA still keeps their contribution in the smaller input space, hence deteriorating the models' performances. IG has dropped the irrelevant attributes, has resulted in better Recall for both models, and is reported as the winner. IG has outperformed PCA in the cases of ds3 and ds4 as well. Both datasets are variants of ds2 and are based on selected, commonly used software metrics. Use of either dimension reduction technique degrades the models' performances most of the time, but IG is reported as the winner because of its lower performance shortfall. In the case of ds5, the third largest dataset with 21 attributes, the IG-based approach failed to drop any irrelevant attributes. On the other hand, PCA has handled the correlation among the
Tab. A.3: Winners in Terms of Recall and Acc

Dataset   Recall (CT)   Recall (FIS)   Accuracy (CT)   Accuracy (FIS)   Overall
ds1       IG            IG             IG              IG               IG
ds2       IG            IG             IG              PCA              IG
ds3       IG            IG             PCA             IG               IG
ds4       IG            IG             IG              IG               IG
ds5       PCA           No             PCA             No               PCA
ds6       No            PCA            IG              IG               IG
ds7       IG            PCA            IG              IG               IG
attributes and is reported as the winner. The datasets ds6 and ds7 are dominated by 90% and 93% ND modules respectively; the performances of PCA and IG are very close for both these datasets.
A.5 Analysis and Discussion
Currently the proposed model handles the structured and object-oriented development paradigms only, but it is extensible to other paradigms. It can also be extended to as many quality factors as there are models available for.

Fig. A.7: Adding New Dataset

Fig. A.8: AHP based Model Selection

The model is usable at any stage of the software life cycle provided the models repository contains models applicable in that phase. Our model caters for the specificity of a component predictor by taking three control inputs (Quality Factor, Software Development Life Cycle Phase, and Software Development Paradigm), which contribute towards the selection of a predictor. We use product-based existing models, unlike Bouktif et al. (Bouktif et al., 2004), who use classification-based models.
A.6 Summary
Defect-prone modules in software are detected using defect prediction models developed from software defect data, and these models sometimes need to reduce the dimensions of the input data. Usually principal component analysis (PCA) is used by defect prediction models for this purpose. PCA reduces the dimensions of the input data while keeping the representation of all the input attributes intact; in some cases the representation of all the input attributes can negatively affect the prediction. Therefore, this chapter suggests an Information Gain (IG) based approach to drop irrelevant attributes. The approach calculates the information gain (IG) of each input attribute and drops the variables whose IG values fall below a threshold α. The proposed technique has been used to develop classification tree (CT) and fuzzy inferencing system (FIS) based models for 7 datasets. The proposed approach resulted in a smaller performance shortfall than PCA in the case of smaller datasets with a large number of attributes, and the IG-based approach also showed better results in the case of large imbalanced datasets.

The results presented here cannot be generalized. We plan to verify them by using more datasets and conducting more experiments with different values of α and fraction of variance. We also plan to study the characteristics of the parameters dropped by the suggested approach to find the exact reason for the better performance of the IG-based approach.
In this chapter we have presented an approach that benefits from existing techniques and combines them to predict different quality factors.
Appendix B
LIST OF UNIFIED AND CATEGORIZED SOFTWARE PRODUCT METRICS
Tab. B.1: Frequently Used Software Measures, Their Use and Applicability

(Each entry gives the preserved label, the short definition, the alternate labels together with the studies that use them, the development paradigm, and the availability.)

B: The Halstead error estimate. Alternate labels: B (Khosgoftaar and Munson, 1990; Ottenstein, 1979; Ottenstein, 1981); HALSTEAD ERROR EST, B (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Imp+.

BW: Bandwidth of the program (average nesting level of the control flow graph of the software). Alternate labels: BW (Gokhale and Lyu, 1997; Guo and Lyu, 2000; Munson and Khosgoftaar, 1992; Xing et al., 2005); B (Jensen and Vairavan, 1985); Band (Jensen and Vairavan, 1985); NDAV (Briand et al., 1993); Bandwidth (Dick and Kandel, 2003). Paradigm: Conventional. Availability: Design+.

C: Count of branches in the program (sum of decisions and subroutine calls). Alternate labels: C (Khosgoftaar and Munson, 1990; Ottenstein, 1979); BRANCH COUNT (Jiang et al., 2008c). Paradigm: Conventional. Availability: Design+.

CBO: No. of other classes to which this class is coupled. Alternate labels: CBO (Gyimothy et al., 2005; Olague et al., 2007; Quah and Thwin, 2003; Rana et al., 2008; Thwin and Quah, 2002; Zhou and Leung, 2006); IOC (Pizzi et al., 2002); COUPLING BETWEEN OBJECTS (Koru and Liu, 2005a); ClassCoupling (Nagappan et al., 2006); CK-CBO (Nagappan et al., 2006). Paradigm: OO. Availability: Design+.

CL: Total code lines (all lines of code excluding comments). Alternate labels: CL (Gokhale and Lyu, 1997; Munson and Khosgoftaar, 1992; Xing et al., 2005); Total LOC (Nagappan and Ball, 2005b); NCNB (Brun and Ernst, 2004); SLOC (Zhou and Leung, 2006). Paradigm: OO, Conventional. Availability: Imp+.

Com: Number of classes/components in a module. Alternate labels: Com (Ohlsson and Alberg, 1996); Components (Ohlsson et al., 1998); Classes (Li et al., 2006; Nagappan et al., 2006). Paradigm: OO. Availability: Design+.

D: Halstead's difficulty. Alternate labels: D (Jiang et al., 2008c; Li et al., 2006; Shen et al., 1985). Paradigm: OO, Conventional. Availability: Imp+.

DChar: Number of code characters. Alternate labels: DChar (Munson and Khosgoftaar, 1992; Xing et al., 2005); Co (Gokhale and Lyu, 1997; Guo and Lyu, 2000); Code characters (Dick and Kandel, 2003). Paradigm: OO, Conventional. Availability: Imp+.

Dec: No. of branching points. Alternate labels: Dec (Khoshgoftaar and Allen, 1999b; Ohlsson and Alberg, 1996); DECISION COUNT (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Design+.

DES CPX: The design complexity of a module. Alternate labels: DES CPX (Khoshgoftaar and Allen, 1999c; Khoshgoftaar et al., 1997b); DESIGN COMPLEXITY, iv(G) (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Design+.

DIT: Depth of inheritance tree, the length of the longest path from the class to the root. Alternate labels: DIT (Gyimothy et al., 2005; Quah and Thwin, 2003; Rana et al., 2008; Thwin and Quah, 2002; Yang et al., 2007; Zhou and Leung, 2006); Dep (Ohlsson and Alberg, 1996); Depth (Ohlsson et al., 1998); DI (Pizzi et al., 2002); DEPTH (Koru and Liu, 2005a); InheritanceDepth (Nagappan et al., 2006); Depth of inheritance tree (Li et al., 2006); CK-DIT (Olague et al., 2007). Paradigm: OO. Availability: Design+.

E: Halstead's Software Science Effort. Alternate labels: E (Jensen and Vairavan, 1985; Jiang et al., 2008c; Khosgoftaar et al., 1994; Khosgoftaar and Munson, 1990; Ottenstein, 1979; Schneider, 1981); HALSTEAD EFFORT (Koru and Liu, 2005a); Halstead's Effort (Li et al., 2006). Paradigm: OO, Conventional. Availability: Imp+.

ESS CPX: The essential complexity of a module. Alternate labels: ESS CPX (Khoshgoftaar and Allen, 1999c; Khoshgoftaar et al., 1997b); ESSENTIAL COMPLEXITY, ev(G) (Jiang et al., 2008c); Essential complexity (Li et al., 2006). Paradigm: OO, Conventional. Availability: Design+.

EX: Total executable statements (all lines of code excluding comments, declarations, and blanks). Alternate labels: EX (Ottenstein, 1979); EXE (Khosgoftaar and Munson, 1990); ELOC (Khosgoftaar et al., 1994; Khoshgoftaar and Allen, 1999c); STMEXE (Khoshgoftaar and Allen, 1999b; Khoshgoftaar and Seliya, 2003); CL (Guo and Lyu, 2000); Executable lines (Dick and Kandel, 2003); Size1 (Quah and Thwin, 2003); Lines (Nagappan et al., 2006); LOC EXECUTABLE (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Imp+.

FANin: No. of objects in the incoming function calls. Alternate labels: FANin (Ohlsson and Alberg, 1996); FAN-in (Ohlsson et al., 1998); InFlow (Li et al., 2006). Paradigm: OO, Conventional. Availability: Design+.

FANout: No. of objects in the calls made by a certain function. Alternate labels: FANout (Ohlsson and Alberg, 1996); FAN-out (Ohlsson et al., 1998); OutFlow (Li et al., 2006). Paradigm: OO, Conventional. Availability: Design+.

IFTH: Number of decisions. Alternate labels: IFTH (Khoshgoftaar and Allen, 1999b; Khoshgoftaar et al., 1996); Con (Ohlsson and Alberg, 1996); Conditions (Ohlsson et al., 1998); DEC (Pizzi et al., 2002); D (Khosgoftaar and Munson, 1990); CONDITION COUNT (Jiang et al., 2008c). Paradigm: Conventional. Availability: Design+.

LCOM: Lack of cohesion in methods. Alternate labels: LCOM (Bouktif et al., 2004; Bouktif et al., 2006; Gyimothy et al., 2005; Olague et al., 2007; Pizzi et al., 2002; Quah and Thwin, 2003; Rana et al., 2008; Zhou and Leung, 2006); LACK OF COHESION OF METHODS (Koru and Liu, 2005a). Paradigm: OO. Availability: Design+.

LOC: Total lines of code (including comments). Alternate labels: LOC (Brun and Ernst, 2004; Dick and Kandel, 2003; Gyimothy et al., 2005; Khosgoftaar et al., 1994; Khosgoftaar and Munson, 1990; Khoshgoftaar and Allen, 1999c; Khoshgoftaar and Allen, 1999b; Khoshgoftaar and Seliya, 2003; Munson and Khosgoftaar, 1992); SLOC (Briand et al., 1993); TC (Gokhale and Lyu, 1997; Guo and Lyu, 2000); Lines of source code (Dick and Kandel, 2003); Lines of code (Li et al., 2006); LOC TOTAL (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Imp+.

Loo: Number of loop constructs. Alternate labels: Loo (Ohlsson and Alberg, 1996); Loops (Ohlsson et al., 1998); Lop (Khoshgoftaar and Allen, 1999b); NL (Khoshgoftaar and Seliya, 2002). Paradigm: OO, Conventional. Availability: Design+.

LP: No. of arcs containing the predicate of a loop construct. Alternate labels: LP (Khoshgoftaar and Allen, 1999b; Khoshgoftaar et al., 1996); N STRUCT (Khoshgoftaar and Allen, 1999c; Khoshgoftaar et al., 1997b). Paradigm: Conventional. Availability: Design+.

MacPat: No. of paths in the subroutine. Alternate labels: MacPat (Khoshgoftaar and Allen, 1999b; Ohlsson and Alberg, 1996); N PATHS (Khoshgoftaar et al., 1997b); PATH (Ohlsson et al., 1998). Paradigm: OO, Conventional. Availability: Design+.

MChar: Number of comment characters. Alternate labels: MChar (Munson and Khosgoftaar, 1992; Xing et al., 2005); CC (Gokhale and Lyu, 1997; Guo and Lyu, 2000); Comment characters (Dick and Kandel, 2003). Paradigm: OO, Conventional. Availability: Imp+.

N: No. of operators and operands. Alternate labels: N (Dick and Kandel, 2003; Gokhale and Lyu, 1997; Jensen and Vairavan, 1985; Munson and Khosgoftaar, 1992; Ottenstein, 1979; Ottenstein, 1981; Schneider, 1981; Xing et al., 2005); N' (Guo and Lyu, 2000); Halstead Length (Li et al., 2006); HALSTEAD LENGTH, N (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Imp+.

N: Halstead's estimated program length. Alternate labels: N (Khosgoftaar et al., 1994; Munson and Khosgoftaar, 1992; Xing et al., 2005); NH (Jensen and Vairavan, 1985); Ne (Gokhale and Lyu, 1997; Guo and Lyu, 2000); Nh (Dick and Kandel, 2003). Paradigm: OO, Conventional. Availability: Imp+.

NF: Jensen's program length estimate. Alternate labels: NF (Dick and Kandel, 2003; Jensen and Vairavan, 1985; Munson and Khosgoftaar, 1992; Xing et al., 2005); JE (Gokhale and Lyu, 1997; Guo and Lyu, 2000); Nf (Dick and Kandel, 2003). Paradigm: OO, Conventional. Availability: Imp+.

N1: Total operator count. Alternate labels: N1 (Khosgoftaar et al., 1994; Khoshgoftaar and Allen, 1999c); TOT OPTR (Khoshgoftaar and Allen, 1999c; Khoshgoftaar et al., 1997b); N1 (Dick and Kandel, 2003); Total operators (Li et al., 2006); NUM OPERATORS, N1 (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Imp+.

N2: Total operand count. Alternate labels: N2 (Khosgoftaar et al., 1994; Khoshgoftaar and Allen, 1999c); N2 (Dick and Kandel, 2003); Total operands (Li et al., 2006); NUM OPERANDS, N2 (Jiang et al., 2008c). Paradigm: OO, Conventional. Availability: Imp+.

NC: Maximum nesting level. Alternate labels: NC (Khosgoftaar and Munson, 1990); NDMAX (Briand et al., 1993). Paradigm: Conventional. Availability: Imp+.
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
CTRNSTMX (Khoshgoftaar
and Allen,
1999b),
(Khosh-
goftaar
and Seliya,
2003)
MAX LVLS (Khoshgoftaar
and Allen,
1999c),
(Khoshgof-
taar et al.,
1997b)
Max Nesting (Li et al.,
2006)
Number of edges
found in a given
module control
from
N EDGES N EDGES (Khoshgoftaar
and Allen,
1999c),
(Khoshgof-
taar et al.,
1997b)
Conventional Design+
Continued on next page
222
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
one module to an-
other
Arcs (Nagappan
et al.,
2006),
(Ohlsson
et al., 1998)
EDGE COUNT, e (Jiang
et al.,
2008c)
No. of func-
tions/procedures
in a module
N IN N IN (Khoshgoftaar
and Allen,
1999c),
(Khoshgof-
taar et al.,
1997b)
OO, Con-
ventional
Design+
(No. of entry
nodes)
NDSENT (Khoshgoftaar
and Allen,
1999b)
Meth (Pizzi et al.,
2002)
NOM (Bouktif
et al., 2004)
Continued on next page
223
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Function (Li et al.,
2006),
(Nagappan
et al., 2006)
Continued on next page
224
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
No. of children is
the no. of imme-
diate descendants
NOC NOC (Bouktif
et al.,
2004),
(Bouk-
tif et al.,
2006),
(Gyimothy
et al.,
2005),
(Olague
et al.,
2007),
(Quah and
Thwin,
2003),
(Rana
et al.,
2008),
(Thwin
and Quah,
2002),
(Zhou and
Leung,
2006)
OO Design+
Continued on next page
225
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Kid (Pizzi et al.,
2002)
NUM OF CHILDREN(Koru and
Liu, 2005a)
SubClasses (Nagappan
et al., 2006)
CK-NOC (Olague
et al., 2007)
No. of public at-
tributes
NPA NPA (Bouktif
et al.,
2004),
(Bouktif
et al., 2006)
OO, Con-
ventional
Imp+
Global Set (Li et al.,
2006)
No. of possible
paths from
Pat Pat (Ohlsson
and Alberg,
1996)
OO, Design+
input signal to the
output signal
N PATHS (Khoshgoftaar
and Allen,
1999c),
(Khoshgof-
taar et al.,
1997b)
Conventional
Continued on next page
226
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Number of proce-
dures /
PRC PRC (Brun and
Ernst,
2004),
(Khosgof-
taar and
Munson,
1990)
OO, Design+
methods PROC (Khosgoftaar
et al., 1994)
Conventional
Mac (Ohlsson
and Alberg,
1996)
NOM (Bouktif
et al.,
2006),
(Quah and
Thwin,
2003)
QMOOD NOM (Olague
et al., 2007)
Continued on next page
227
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
No. of methods
that can be po-
tentially executed
in response to a
message received
RFC RFC (Gyimothy
et al.,
2005),
(Quah and
Thwin,
2003),
(Rana
et al.,
2008),
(Thwin
and Quah,
2002),
(Zhou and
Leung,
2006)
OO Design+
by an object RFO (Pizzi et al.,
2002)
RESPONSE FOR
CLASS
(Koru and
Liu, 2005a)
CK-RFC (Olague
et al., 2007)
Continued on next page
228
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Number of pa-
rameters in the
routine/module
RSDPN RSDPN (Nagappan
et al.,
2006),
(Wang
et al., 2004)
OO, Con-
ventional
Imp+
PARAMETER COUNT(Jiang
et al.,
2008c)
Total number of
statements
S S (Ottenstein,
1979),
(Schneider,
1981)
Conventional Imp+
Number of State-
ments
(Ottenstein,
1981)
PS (Khosgoftaar
and Mun-
son, 1990)
N STMTS (Khoshgoftaar
and Allen,
1999c),
(Khoshgof-
taar et al.,
1997b)
Continued on next page
229
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Number of calls
of a subroutine
(Number of calls
to
TC TC (Khoshgoftaar
and Allen,
1999b),
(Khoshgof-
taar et al.,
1996)
Conventional Design+
other functions in
a module)
Coh (Ohlsson
and Alberg,
1996)
Cohesion (Ohlsson
et al., 1998)
TCT (Khoshgoftaar
and Seliya,
2002)
FanOut (Li et al.,
2006),
(Nagappan
et al., 2006)
CALL PAIRS (Jiang
et al.,
2008c)
Continued on next page
230
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Total character
count
TChar TChar (Munson
and Khos-
goftaar,
1992),
(Xing et al.,
2005)
OO, Con-
ventional
Imp+
Cr (Gokhale
and Lyu,
1997),
(Guo and
Lyu, 2000)
Total characters (Dick and
Kandel,
2003)
Total comments TComm TComm (Munson
and Khos-
goftaar,
1992),
(Xing et al.,
2005)
OO, Con-
ventional
Imp+
COM (Khosgoftaar
et al., 1994)
Continued on next page
231
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Cm (Gokhale
and Lyu,
1997),
(Guo and
Lyu, 2000)
N COM (Khoshgoftaar
and Allen,
1999c),
(Khoshgof-
taar et al.,
1997b)
CMT (Dick and
Kandel,
2003)
LOC COMMENTS (Jiang
et al.,
2008c)
Unique calls to
other modules
UC UC (Khoshgoftaar
and Allen,
1999b),
(Khoshgof-
taar et al.,
1996)
OO, Con-
ventional
Design+
Continued on next page
232
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
UCT (Khoshgoftaar
and Seliya,
2002)
Program volume V V (Briand
et al.,
1993),
(Jensen and
Vairavan,
1985),
(Khosgof-
taar et al.,
1994),
(Khosgof-
taar and
Munson,
1990),
(Ottenstein,
1979),
(Ottenstein,
1981)
OO, Con-
ventional
Design+
Halstead’s Volume (Li et al.,
2006)
Continued on next page
233
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
HALSTEAD VOLUME,
V
(Jiang
et al.,
2008c)
McCabe’s Cyclo-
matic complexity
V(G) V(G) (Khosgoftaar
and Mun-
son, 1990),
(Khosh-
goftaar
and Allen,
1999c),
(Munson
and Khos-
goftaar,
1992),
(Xing et al.,
2005)
OO, Con-
ventional
Design+
MC (Jensen and
Vairavan,
1985)
Continued on next page
234
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
VG (Briand
et al.,
1993),
(Khoshgof-
taar et al.,
1996),
(Khosh-
goftaar
and Seliya,
2002)
VG1 (Dick and
Kandel,
2003),
(Khosgof-
taar et al.,
1994)
McC1 (Ohlsson
and Alberg,
1996)
M (Gokhale
and Lyu,
1997),
(Guo and
Lyu, 2000)
Continued on next page
235
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
CMT (Dick and
Kandel,
2003)
Strict Cyclomatic
Complexity
(Li et al.,
2006)
Complexity (Nagappan
et al., 2006)
CYCLOMATIC
COMPLEXITY,
v(G)
(Jiang
et al.,
2008c)
Continued on next page
236
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Weighted meth-
ods per class is
sum of complexi-
ties of methods in
a class
WMC WMC (Bouktif
et al.,
2006),
(Gyimothy
et al.,
2005),
(Quah and
Thwin,
2003),
(Rana
et al.,
2008),
(Thwin
and Quah,
2002),
(Zhou and
Leung,
2006)
OO Design+
WEIGHTED METHODS
PER CLASS
(Koru and
Liu, 2005a)
CK-WMC (Olague
et al., 2007)
Halstead vocabu-
lary,
η η (Ottenstein,
1979)
OO, Imp+
Continued on next page
237
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
η = η1 + η2 Halstead’s Vocabu-
lary
(Li et al.,
2006)
Conventional
HALSTEAD CON-
TENT, µ
(Jiang
et al.,
2008c)
Unique operator
count
η1 η1 (Khosgoftaar
et al.,
1994),
(Khosh-
goftaar
and Allen,
1999c),
(Ottenstein,
1979)
OO, Con-
ventional
Imp+
n1 (Jensen and
Vairavan,
1985)
n1 (Dick and
Kandel,
2003)
NUM UNIQUE
OPERATORS, µ1
(Jiang
et al.,
2008c)
Continued on next page
238
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
Unique operand
count
η2 η2 (Khosgoftaar
et al.,
1994),
(Khosh-
goftaar
and Allen,
1999c),
(Ottenstein,
1979)
OO, Con-
ventional
Imp+
n2 (Jensen and
Vairavan,
1985)
VARUSDUQ (Khoshgoftaar
and Allen,
1999b),
(Khosh-
goftaar
and Seliya,
2003)
n2 (Dick and
Kandel,
2003)
Unique operands (Li et al.,
2006)
Continued on next page
239
Table B.1 – continued from previous page
Short Definition Preserved
Label
Alternate Label Used by Used in
Paradigm
Availability
NUM UNIQUE
OPERANDS, µ2
(Jiang
et al.,
2008c)
240
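Several of the Halstead-family entries in this table (vocabulary η = η1 + η2, length N = N1 + N2, estimated length, Jensen's length estimate NF, and program volume V) are related by well-known formulas. The sketch below shows those textbook relationships under the usual definitions; the function name and the example token counts are illustrative only and are not taken from any dataset used in this thesis.

```python
import math

def halstead_measures(n1, n2, N1, N2):
    """Relate the Halstead-family metrics catalogued in Table B.1.

    n1, n2 -- unique operator / operand counts (eta1, eta2)
    N1, N2 -- total operator / operand counts
    """
    eta = n1 + n2                 # Halstead vocabulary, eta = eta1 + eta2
    N = N1 + N2                   # program length: operators + operands
    # Halstead's estimated program length
    N_hat = n1 * math.log2(n1) + n2 * math.log2(n2)
    # Jensen's program length estimate (NF), using log of factorials
    NF = math.log2(math.factorial(n1)) + math.log2(math.factorial(n2))
    V = N * math.log2(eta)        # program volume
    return {"eta": eta, "N": N, "N_hat": N_hat, "NF": NF, "V": V}

# Illustrative counts: 4 unique operators, 4 unique operands,
# 10 operator occurrences, 6 operand occurrences.
print(halstead_measures(4, 4, 10, 6))
```

For example, with these counts the vocabulary is 8, the length is 16, and the volume is 16 · log2(8) = 48.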