Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects
Erika Camargo and Ochimizu Koichiro
Japan Advanced Institute of Science and Technology
ESEM 2009
Contents
1. Abstract
2. Background
3. Problem Analysis
4. Case Study
5. Results
6. Conclusion and Future Work
Abstract
Challenge: to make logistic regression (LR) models that use design-complexity metrics able to predict fault-prone object-oriented (OO) classes across software projects.
First attempt at a solution: simple log data transformations.
[Figure: P(y=1), the probability of a fault-prone class, plotted against X, a design-complexity metric]
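A univariate LR model of this kind relates P(y=1) to X through the standard logistic function. The coefficient names \beta_0 and \beta_1 are notation assumed here for illustration; the slide shows only the two axes.

```latex
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
```

Here y = 1 means that the class is fault-prone and x is the value of one design-complexity metric for that class.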
Background
• Some design-complexity metrics have been shown to be good predictors of fault-prone classes in LR models
• Among these metrics are the Chidamber & Kemerer (CK) metrics
– The 80th and 20th percentiles of their distributions can be used to determine high and low values
– Their thresholds cannot be determined before use and should be derived and used locally
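The percentile rule above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function name, the sample WMC values, and the choice of the "inclusive" interpolation method are all assumptions.

```python
import statistics

def local_thresholds(values):
    """Return (low, high) cut-offs at the 20th and 80th percentiles of one project."""
    # quantiles(n=10) returns the 9 decile cut points:
    # index 1 is the 20th percentile, index 7 is the 80th.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return deciles[1], deciles[7]

# Hypothetical WMC values for the classes of a single project.
wmc = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
low, high = local_thresholds(wmc)
print(low, high)
```

Because the cut-offs are computed from the project's own distribution, each project gets its own thresholds, which is what "derived and used locally" asks for.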
Problem Analysis
Can an LR model built with this kind of metric work effectively across different software projects?
[Figure: P(y=1) plotted against X = Number of Methods for a small and a large software project, marking the least faulty and most faulty classes of each]
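The concern raised by this slide can be made concrete with a small sketch. The coefficients and method counts below are hypothetical values chosen only to illustrate the effect; they are not taken from the study.

```python
import math

def fault_probability(x, b0, b1):
    """Univariate logistic model: P(y=1 | x)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical coefficients of a model built on a large project,
# chosen so that P = 0.5 at 15 methods.
B0, B1 = -6.0, 0.4

# Hypothetical number of methods of a "most faulty" class in each project.
large_project_most_faulty = 20
small_project_most_faulty = 10

p_large = fault_probability(large_project_most_faulty, B0, B1)
p_small = fault_probability(small_project_most_faulty, B0, B1)
print(round(p_large, 3), round(p_small, 3))
```

Under these assumptions the model rates the most faulty class of the small project as unlikely to be fault-prone, simply because "many methods" means something different in the two projects.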
Case Study
1. Data analysis of 7 different projects and application of simple log data transformations.
2. Construction of 3 univariate LR models using a large open-source project (first release of the Mylyn system, with 638 Java classes).
– Independent variables: CK-CBO, CK-RFC, CK-WMC (one per model)
– Dependent variable: defects (from Bugzilla and CVS)
3. Testing of these models on 2 other, smaller projects (with 11 and 13 Java classes).
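Steps 2 and 3 can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' implementation: the slides do not say how the models were fitted or which cut-off was used, so gradient ascent and a 0.5 cut-off are assumed, and all data values are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_univariate_lr(xs, ys, rate=0.1, epochs=5000):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by gradient ascent on the log-likelihood."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n) or 1.0
    zs = [(x - mean) / sd for x in xs]  # standardise for a stable step size
    a0 = a1 = 0.0
    for _ in range(epochs):
        g0 = g1 = 0.0
        for z, y in zip(zs, ys):
            err = y - sigmoid(a0 + a1 * z)
            g0 += err
            g1 += err * z
        a0 += rate * g0 / n
        a1 += rate * g1 / n
    # Convert the coefficients back to the original scale of x.
    return a0 - a1 * mean / sd, a1 / sd

def accuracy(xs, ys, b0, b1, cutoff=0.5):
    """Share of classes whose predicted label matches the observed one."""
    hits = sum(1 for x, y in zip(xs, ys)
               if (sigmoid(b0 + b1 * x) >= cutoff) == bool(y))
    return hits / len(xs)

# Hypothetical CBO values and fault labels (1 = most faulty); not from the slides.
train_cbo = [2, 3, 4, 5, 6, 8, 9, 11, 12, 14, 15, 18]
train_faulty = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1]
test_cbo = [1, 3, 16, 20]
test_faulty = [0, 0, 1, 1]

b0, b1 = fit_univariate_lr(train_cbo, train_faulty)
test_accuracy = accuracy(test_cbo, test_faulty, b0, b1)
print(b0, b1, test_accuracy)
```

The model is fitted on one project only and then applied unchanged to another, which mirrors the train-on-Mylyn, test-on-smaller-projects design of the case study.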
Challenge
Outliers produce biased regression estimates and reduce the predictive power of regression models.
Projects analysed:
• BNS: Banking system (2006) *
• CRS: Cruise control system (2005) *
• ECS: E-commerce system (2006) *
• ELCS: Elevator control system (2003) *
• FACS: Factory automation system (2005) *
• GMF: Graphic Modeling Framework **
• MYL: Mylyn system **
(*) Systems developed by students of JAIST, described in: Hassan Gomaa, Designing Concurrent, Distributed, and Real-Time Applications with UML, Addison-Wesley Object Technology Series, July 2000.
(**) Eclipse project
The RFC data of BNS is more spread out than that of MYL.
Case Study
Solution: a simple data transformation using log10.
Example:
LCBO = log10(CBO + 1)
LTCBO = log10(CBO + 1) + dm
where dm is the difference between the CBO medians of the Mylyn system and of the system whose data is being transformed.
Effects:
• The number of outliers is lower
• The data spread is more uniform
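The two transformations above can be sketched as follows. The slide does not say on which scale dm is measured or in which direction the difference is taken, so this sketch assumes the median of the log-transformed reference values minus the median of the log-transformed target values. The function names and all data values are hypothetical.

```python
import math
import statistics

def log_transform(value):
    """LCBO = log10(CBO + 1); the +1 keeps a metric value of 0 defined."""
    return math.log10(value + 1)

def shifted_log_transform(values, reference_values):
    """LTCBO = log10(CBO + 1) + dm for every value of the target project."""
    logged = [log_transform(v) for v in values]
    reference_logged = [log_transform(v) for v in reference_values]
    # Assumption: dm is the reference median minus the target median,
    # both taken after the log transformation.
    dm = statistics.median(reference_logged) - statistics.median(logged)
    return [x + dm for x in logged]

# Hypothetical CBO values; not taken from the slides.
reference_cbo = [0, 2, 4, 9, 19, 40, 99]  # stands in for the Mylyn system
target_cbo = [0, 1, 1, 3, 4]              # stands in for a smaller project

ltcbo = shifted_log_transform(target_cbo, reference_cbo)
print(ltcbo)
```

Under this reading, the shift moves the target project's data so that its median coincides with that of the project the model was built on, while the log keeps the relative order of the classes.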
Results
Effects of the log data transformations:
• Elimination of a great number of outliers
• Overall goodness of fit of the 3 models is better
• Discrimination (most faulty / least faulty)
– All models discriminate well between the most faulty and least faulty classes of the Mylyn system
– What about using different projects?
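The first effect listed above can be checked by counting outliers before and after the transformation. The slides do not state how outliers were identified, so this sketch assumes the usual box-plot rule (points beyond 1.5 times the interquartile range from the quartiles). The RFC values are hypothetical.

```python
import math
import statistics

def count_outliers(values):
    """Count points outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for v in values if v < low or v > high)

# Hypothetical, right-skewed RFC values; not taken from the slides.
rfc = [1, 2, 2, 3, 3, 4, 5, 6, 8, 15, 20]

raw_outliers = count_outliers(rfc)
log_outliers = count_outliers([math.log10(v + 1) for v in rfc])
print(raw_outliers, log_outliers)
```

For right-skewed data such as this, the log compresses the long upper tail, so values that sit beyond the upper fence on the raw scale can fall back inside it.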
Results
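BANKING SYSTEM: correct classifications per group and model

Group               Model   Raw data   Log Tx data   Effect
MF (6 classes)      CBO     2          5             ↑
MF (6 classes)      RFC     5          5             =
MF (6 classes)      WMC     6          6             =
LF (5 classes)      CBO     5          5             =
LF (5 classes)      RFC     3          3             =
LF (5 classes)      WMC     4          4             =
BOTH (11 classes)   CBO     7          10            ↑
BOTH (11 classes)   RFC     8          8             =
BOTH (11 classes)   WMC     10         10            =

MF: Most Faulty; LF: Least Faulty; Log Tx: log-transformed.
Effect: ↑ more correct classifications, = no change, ↓ fewer.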
Results
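E-COMMERCE SYSTEM: correct classifications per group and model

Group               Model   Raw data   Log Tx data   Effect
MF (9 classes)      CBO     3          7             ↑
MF (9 classes)      RFC     9          8             ↓
MF (9 classes)      WMC     7          6             ↓
LF (4 classes)      CBO     4          4             =
LF (4 classes)      RFC     0          3             ↑
LF (4 classes)      WMC     0          4             ↑
BOTH (13 classes)   CBO     7          11            ↑
BOTH (13 classes)   RFC     9          11            ↑
BOTH (13 classes)   WMC     7          10            ↑

MF: Most Faulty; LF: Least Faulty; Log Tx: log-transformed.
Effect: ↑ more correct classifications, = no change, ↓ fewer.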
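The counts for the BOTH groups in the two results tables can be turned into classification rates with a few lines of arithmetic. The counts are copied from the tables; the function and variable names are chosen here for illustration.

```python
# Correct classifications over BOTH groups: (raw data, log-transformed data).
BANKING = {"CBO": (7, 10), "RFC": (8, 8), "WMC": (10, 10)}      # 11 classes
ECOMMERCE = {"CBO": (7, 11), "RFC": (9, 11), "WMC": (7, 10)}    # 13 classes

def accuracies(counts, n_classes):
    """Return {model: (raw accuracy, log accuracy)} as fractions of all classes."""
    return {model: (raw / n_classes, log / n_classes)
            for model, (raw, log) in counts.items()}

banking_acc = accuracies(BANKING, 11)
ecommerce_acc = accuracies(ECOMMERCE, 13)

for name, acc in (("Banking", banking_acc), ("E-commerce", ecommerce_acc)):
    for model, (raw, log) in acc.items():
        print(f"{name} {model}: {raw:.0%} -> {log:.0%}")
```

Expressed as rates, the CBO model rises from 7/11 to 10/11 on the banking system, and all three models improve on the e-commerce system, where the gains in the LF group outweigh the small losses in the MF group.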
Conclusions and Future work
• CK-CBO, CK-RFC and CK-WMC can have different distributions in different projects
• Simple log transformations seem to improve the predictive ability of LR models, especially when the project measures are not as spread out as those used to build the model
• Future work: further data exploration and study of data transformations
Thank you!
Questions, comments …
Contact: erika.camargo@jaist.ac.jp