
Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects

Erika Camargo and Ochimizu Koichiro

Japan Advanced Institute of Science and Technology

ESEM 2009

Contents

1. Abstract
2. Background
3. Problem Analysis
4. Case Study
5. Results
6. Conclusion and Future Work

Abstract

Challenge: To make logistic regression (LR) models that use design-complexity metrics able to predict fault-prone object-oriented classes across software projects.

First attempt at a solution: simple log data transformations

[Figure: logistic curve of P(y=1), the probability that a class is fault-prone, against X, a design-complexity metric]
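For reference, the curve on this slide has the standard univariate logistic form. The coefficient symbols \beta_0 and \beta_1 do not appear on the slides and are introduced here only for illustration; y = 1 denotes a fault-prone class and X a design-complexity metric.

```latex
P(y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}
```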

Background

• Some design-complexity metrics have been shown to be good predictors of fault-prone classes in LR models
• Among these metrics are the Chidamber & Kemerer (CK) metrics
– The 80th and 20th percentiles of their distributions can be used to determine high and low values
– Their thresholds cannot be determined before their use and should be derived and used locally
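Deriving thresholds locally from the 20th and 80th percentiles can be sketched as follows. This is only an illustration: the function name `local_thresholds`, the sample WMC values, and the use of the standard library's `statistics.quantiles` with its "inclusive" method are assumptions, not something stated on the slides.

```python
import statistics

def local_thresholds(values):
    """Return (low, high) cut-offs: the 20th and 80th percentiles of one project's metric values."""
    # quantiles(n=5) returns the four cut points at 20%, 40%, 60% and 80%
    cuts = statistics.quantiles(values, n=5, method="inclusive")
    return cuts[0], cuts[3]

# Hypothetical WMC values for the classes of a single project
wmc = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
low, high = local_thresholds(wmc)
```

Because the cut-offs are computed from the project's own data, running the same function on another project would generally give different thresholds.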

Problem Analysis

Can an LR model built with these kinds of metrics work effectively with different software projects?

[Figure: P(y=1) against X = Number of Methods, with the least faulty and most faulty classes marked for a small-size and a large-size software project]
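The concern raised here can be illustrated with a small sketch. The coefficients `B0` and `B1` and the method counts below are hypothetical and chosen only to show the effect: a model whose 0.5 cut-off sits at 40 methods flags the most faulty classes of a large project but none of those of a small project, where even the most faulty classes have few methods.

```python
import math

def p_fault_prone(x, b0, b1):
    """Univariate logistic model: probability that a class with metric value x is fault-prone."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def flag_classes(xs, b0, b1, cutoff=0.5):
    """Mark each class as predicted fault-prone (True) or not (False)."""
    return [p_fault_prone(x, b0, b1) > cutoff for x in xs]

# Hypothetical coefficients, as if fitted on a large project: P = 0.5 at 40 methods
B0, B1 = -4.0, 0.1

# Hypothetical "number of methods" of the most faulty classes in two projects
large_project_mf = [45, 60, 80]
small_project_mf = [8, 12, 15]

large_flags = flag_classes(large_project_mf, B0, B1)
small_flags = flag_classes(small_project_mf, B0, B1)
```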

Case Study

1. Data analysis of 7 different projects and application of simple log data transformations.
2. Construction of 3 univariate LR models using a large open-source project (1st release of the Mylyn system, with 638 Java classes).
– Independent variables: CK-CBO, CK-RFC, CK-WMC
– Dependent variable: Defects (from Bugzilla & CVS)
3. Testing of these models with 2 other, smaller projects (with 11 and 13 Java classes).
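Steps 2 and 3 can be sketched as fitting a univariate LR model on one project and counting correct classifications on another. The slides do not say which tool or fitting procedure was used, so the plain gradient-ascent fit, the function names, the 0.5 cut-off and all data values below are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_univariate_lr(xs, ys, lr=0.05, steps=20000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += lr * sum(errors) / n
        b1 += lr * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1

def count_correct(b0, b1, xs, ys, cutoff=0.5):
    """Number of classes whose predicted label matches the observed one."""
    return sum((sigmoid(b0 + b1 * x) > cutoff) == bool(y) for x, y in zip(xs, ys))

# Hypothetical training project: CBO per class and whether the class was faulty (1) or not (0)
train_cbo = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
train_faulty = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_univariate_lr(train_cbo, train_faulty)

# Hypothetical smaller project used only for testing
test_cbo = [2, 3, 8, 9]
test_faulty = [0, 0, 1, 1]
correct = count_correct(b0, b1, test_cbo, test_faulty)
```

One model per metric (CBO, RFC, WMC) would be fitted the same way, each with a single independent variable.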

Challenge

… produce biased regression estimates and reduce the predictive power of regression models

BNS: Banking system (2006) *
CRS: Cruise control system (2005) *
ECS: E-commerce system (2006) *
ELCS: Elevator control system (2003) *
FACS: Factory automation system (2005) *
GMF: Graphic Modeling Framework **
MYL: Mylyn system **

(*) Systems developed by students of JAIST, described in: Hassan Gomaa, Designing Concurrent, Distributed, and Real-Time Applications with UML, Addison-Wesley Object Technology Series, July 2000.
(**) Eclipse project

RFC data of BNS is more spread than the data of MYL

Case Study

Solution: simple data transformation using “Log10”

Example:

Number of outliers is lower
Data spread is more uniform

LCBO = Log10(CBO + 1)
LTCBO = Log10(CBO + 1) + dm
where dm is the difference between the CBO medians of the Mylyn system and the system whose data is being transformed
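The two transformations can be sketched in Python as below. Several details are assumptions rather than statements from the slides: dm is read as a difference of medians, it is taken on the log scale, and its sign is chosen as Mylyn minus the other system, so that the transformed data ends up with the same median as the log-transformed Mylyn data. The function names and CBO values are hypothetical.

```python
import math
import statistics

def log_transform(values):
    """L = log10(x + 1), applied to each metric value."""
    return [math.log10(v + 1) for v in values]

def shifted_log_transform(values, reference_values):
    """LT = log10(x + 1) + dm, where dm aligns the medians on the log scale (assumption)."""
    logged = log_transform(values)
    dm = statistics.median(log_transform(reference_values)) - statistics.median(logged)
    return [v + dm for v in logged]

# Hypothetical CBO values: reference (model-building) project and project to be transformed
mylyn_cbo = [0, 9, 99]
other_cbo = [0, 1, 3]

lcbo = log_transform(other_cbo)
ltcbo = shifted_log_transform(other_cbo, mylyn_cbo)
```

Adding 1 before taking the logarithm keeps classes with a metric value of 0 defined, since Log10(0) does not exist.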

Results

Effects of the log data transformations:
• Elimination of a great number of outliers
• Overall goodness of fit of the 3 models is better
• Discrimination (Most Faulty/Least Faulty)
– All models discriminate well between the most faulty and least faulty classes of the Mylyn system
– What about using different projects?

Results

BANKING SYSTEM: number of correctly classified classes with raw data and with log-transformed (Log Tx) data

Group              Model  Raw data  Log Tx data  Effect
MF (6 classes)     CBO    2         5            ↑
                   RFC    5         5            =
                   WMC    6         6            =
LF (5 classes)     CBO    5         5            =
                   RFC    3         3            =
                   WMC    4         4            =
BOTH (11 classes)  CBO    7         10           ↑
                   RFC    8         8            =
                   WMC    10        10           =

MF: Most Faulty
LF: Least Faulty

Results

E-COMMERCE SYSTEM: number of correctly classified classes with raw data and with log-transformed (Log Tx) data

Group              Model  Raw data  Log Tx data  Effect
MF (9 classes)     CBO    3         7            ↑
                   RFC    9         8            ↓
                   WMC    7         6            ↓
LF (4 classes)     CBO    4         4            =
                   RFC    0         3            ↑
                   WMC    0         4            ↑
BOTH (13 classes)  CBO    7         11           ↑
                   RFC    9         11           ↑
                   WMC    7         10           ↑

MF: Most Faulty
LF: Least Faulty
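To compare the two systems on the same scale, the counts in the BOTH rows can be turned into percentages of correctly classified classes. The counts below are copied from the two result tables; the dictionary layout and the helper name `rates` are only illustrative.

```python
# (correct with raw data, correct with log-transformed data, number of classes), BOTH rows
results = {
    "BNS": {"CBO": (7, 10, 11), "RFC": (8, 8, 11), "WMC": (10, 10, 11)},
    "ECS": {"CBO": (7, 11, 13), "RFC": (9, 11, 13), "WMC": (7, 10, 13)},
}

def rates(raw, log_tx, total):
    """Percentage of correctly classified classes before and after the log transformation."""
    return round(100 * raw / total, 1), round(100 * log_tx / total, 1)

summary = {
    project: {model: rates(*counts) for model, counts in models.items()}
    for project, models in results.items()
}
```

For example, the CBO model goes from 63.6% to 90.9% on the banking system and from 53.8% to 84.6% on the e-commerce system.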

Conclusions and Future Work

• CK-CBO, CK-RFC and CK-WMC can have different distributions in different projects
• Simple log transformations seem to improve the prediction ability of LR models, especially when the project measures are not as spread as those used in the construction of the model
• Future work: further data exploration and study of data transformations

Thank you! Questions, comments …

contact: erika.camargo@jaist.ac.jp
