
The Impact of Mislabelling on the Performance and Interpretation of Defect Prediction Models

Chakkrit (Kla) Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Akinori Ihara, Kenichi Matsumoto

@klainfo [email protected]

Software defects are costly

Monetary: NIST estimates that software defects cost the US economy $59.5 billion per year.

Reputation: The Obama administration will always be connected to healthcare.gov.

SQA teams try to find defects before they escape to the field


SQA teams have limited QA resources

Software continues to grow in size and complexity

Defect prediction models help SQA teams to:

Predict which modules are risky

Understand what makes software fail

Modules that are fixed during post-release development are set as defective

[Figure: a snapshot of Module 1 to Module 4 is taken at the release date. During post-release development, Bug Report #1 is fixed by changes to Module 1 and Module 2, so those modules are labelled as defective; the remaining modules are labelled as clean. The result is the defect dataset.]
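As a minimal sketch, the labelling heuristic above could be expressed as follows; the module names and the report-to-module mapping are illustrative, not the authors' tooling:

```python
# A minimal sketch of the post-release labelling heuristic described above.
# The module names and the report-to-module mapping are illustrative; real
# studies mine them from the version control system and the issue tracker.

modules = ["Module1", "Module2", "Module3", "Module4"]

# Post-release issue reports classified as bugs, mapped to the modules
# that their fixes changed.
post_release_bug_fixes = {
    "Bug Report #1": ["Module1", "Module2"],
}

# A module is labelled defective if any post-release bug fix touched it;
# all other modules in the release snapshot are labelled clean.
fixed = {m for touched in post_release_bug_fixes.values() for m in touched}
labels = {m: ("defective" if m in fixed else "clean") for m in modules}

print(labels)
# {'Module1': 'defective', 'Module2': 'defective',
#  'Module3': 'clean', 'Module4': 'clean'}
```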

Defect models are trained using machine learning

[Figure: the defect dataset (Module 1 to Module 4 with their labels) is fed to a machine learning or statistical learning algorithm, producing a defect model.]
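A minimal training sketch follows; the metric values are made up, and the paper does not prescribe this particular learner:

```python
# A minimal sketch of training a defect model on a labelled module dataset.
# Metric values are placeholders for illustration only.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

data = pd.DataFrame({
    "loc":        [1200, 300, 80, 450],  # size of the module
    "complexity": [35,   10,  2,  14],   # e.g., cyclomatic complexity
    "churn":      [400,  90,  5,  60],   # lines changed before the release
    "defective":  [1,    1,   0,  0],    # labels from the defect dataset
})

X, y = data.drop(columns="defective"), data["defective"]
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Predict the risk of an unseen module.
new_module = pd.DataFrame({"loc": [900], "complexity": [25], "churn": [200]})
print(model.predict_proba(new_module)[:, 1])
```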

Defect data are noisy

The reliability of the models depends on the quality of the training data: a defect model trained on a noisy defect dataset is unreliable.

Issue reports are mislabelled

Fields in issue tracking systems are often missing or incorrect. [Aranda et al., ICSE 2009]

43% of issue reports are mislabelled. [Herzig et al., ICSE 2013] [Antoniol et al., CASCON 2008]

Defect mislabelling: a new feature could be incorrectly labelled as a bug.
Non-defect mislabelling: a bug could be mislabelled as a new feature.

Then, modules are mislabelled

[Figure: Bug Report #1 correctly marks M1 and M2 as defective. Issue report #2 is actually a new feature that was mislabelled as a bug, so M3 is incorrectly labelled as defective when it should be a clean module. The result is a noisy dataset.]

Mislabelling may impact the performance

Prior work assumed that mislabelling is random [Kim et al., ICSE 2011; Seiffert et al., Information Sciences 2014], and found that random mislabelling has a negative impact on the performance.

Mislabelling is likely non-random

We suspect that novice developers are more likely to mislabel than experienced developers: novice developers are known to overlook bookkeeping issues. [Bachmann et al., FSE 2010]

The impact of realistic mislabelling on the performance and interpretation of defect models

(RQ1) The Nature of Mislabelling
(RQ2) Its Impact on the Performance
(RQ3) Its Impact on the Interpretation

Using prediction models to classify whether issue reports are mislabelled

If the prediction model performs well, mislabelling is predictable (i.e., non-random); if it performs poorly, mislabelling is random.
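A minimal sketch of this check follows; the features are hypothetical stand-ins for real report characteristics (who reported and fixed the issue, how it was discussed, how it was resolved):

```python
# A minimal sketch of the RQ1 check: train a classifier to predict whether
# an issue report is mislabelled. The features are hypothetical stand-ins
# for real report characteristics.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

reports = pd.DataFrame({
    "reporter_experience": [120, 3, 45, 1, 80, 2],  # prior reports filed
    "num_comments":        [10,  1,  6, 0,  8, 2],
    "files_in_fix":        [2,   9,  1, 7,  3, 8],
    "mislabelled":         [0,   1,  0, 1,  0, 1],  # oracle labels
})

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, reports.drop(columns="mislabelled"),
                         reports["mislabelled"], cv=2, scoring="f1")

# If this F-measure clearly beats random guessing, mislabelling is
# predictable rather than random.
print(scores.mean())
```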

Selecting our studied systems

We select systems with manually-curated issue reports [Herzig et al., ICSE 2013]: Jackrabbit and Lucene.

Mislabelling is non-random

[Figure: precision, recall, and F-measure of our mislabelling prediction models vs. random guessing, for Jackrabbit and Lucene]

             Jackrabbit              Lucene
             Our Model  Random       Our Model  Random
Precision    0.78       0.12         0.75       0.12
Recall       0.64       0.50         0.71       0.50
F-Measure    0.70       0.19         0.73       0.19

Our models achieve a mean F-measure of up to 0.73, which is 4-34 times better than random guessing.

The impact of realistic mislabelling on the performance and interpretation of defect models

(RQ1) The Nature of Mislabelling: mislabelling is non-random
(RQ2) The Impact on the Performance

Compare the performance of clean models against that of noisy models

Clean Performance vs. Realistic Noisy Performance vs. Random Noisy Performance

Generating three samples

Clean Sample: the dataset with oracle labels (the oracle tells us that issue report #2 is mislabelled), giving modules M1-M4 their correct labels.

Realistic Noisy Sample: starting from the clean sample, flip the labels of the modules that are addressed by the mislabelled issue reports.

Random Noisy Sample: starting from the clean sample, randomly flip module labels.
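A minimal sketch of the sample generation, using the M1-M4 example from the slides; flipping the same number of labels in the random sample is an illustrative choice, made here so the two noisy samples are comparable:

```python
# A minimal sketch of generating the clean, realistic noisy, and random
# noisy samples from the slides' M1-M4 example.
import random

clean = {"M1": 1, "M2": 1, "M3": 0, "M4": 0}  # oracle labels, 1 = defective

# Realistic noise: the oracle says issue report #2 is mislabelled, and its
# fix touched M3, so we flip M3's label.
modules_of_mislabelled_reports = ["M3"]
realistic_noisy = dict(clean)
for m in modules_of_mislabelled_reports:
    realistic_noisy[m] = 1 - realistic_noisy[m]

# Random noise: flip the labels of randomly chosen modules instead.
random.seed(0)
flipped = random.sample(sorted(clean), k=len(modules_of_mislabelled_reports))
random_noisy = dict(clean)
for m in flipped:
    random_noisy[m] = 1 - random_noisy[m]

print(clean, realistic_noisy, random_noisy, sep="\n")
```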

Generate the performance of clean models and noisy models

Train a defect model on each sample and compare: Clean Performance vs. Realistic Noisy Performance vs. Random Noisy Performance.

Performance Ratio = Performance of Realistic Noisy Model / Performance of Clean Model
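A minimal sketch of the ratio computation, with hypothetical predictions on a held-out test set:

```python
# A minimal sketch of the performance ratio defined above.
from sklearn.metrics import precision_score, recall_score

def ratio(y_true, pred_noisy, pred_clean, metric):
    """Performance of the noisy model relative to the clean model."""
    return metric(y_true, pred_noisy) / metric(y_true, pred_clean)

# Hypothetical test-set labels and predictions.
y_true     = [1, 1, 1, 0, 0, 1, 0, 1]
pred_clean = [1, 1, 0, 0, 0, 1, 0, 1]
pred_noisy = [1, 0, 0, 0, 0, 1, 0, 0]  # the noisy model misses more defects

print(ratio(y_true, pred_noisy, pred_clean, precision_score))  # 1.0: no impact
print(ratio(y_true, pred_noisy, pred_clean, recall_score))     # 0.5: recall hurt
```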

While the recall is often impacted, the precision is rarely impacted

[Figure: boxplots of the ratio (Realistic Noisy / Clean) for precision and recall, on a 0.0-2.0 scale. Interpretation: a ratio of 1 means there is no impact.]

Precision is rarely impacted by realistic mislabelling, while models trained on noisy data achieve only 56% of the recall of models trained on clean data.

The impact of realistic mislabelling on the performance and interpretation of defect models

(RQ1) The Nature of Mislabelling: mislabelling is non-random
(RQ2) The Impact on the Performance: while the recall is often impacted, the precision is rarely impacted
(RQ3) The Impact on the Interpretation

Generate the rank of metrics of clean models and noisy models

For each of the clean model, the realistic noisy model, and the random noisy model: compute variable importance scores, then rank the metrics by those scores.
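A minimal sketch of the ranking step, using a random forest's impurity-based importances; the study ranks metrics with a statistical grouping of importance scores, so simply sorting them, as done here, is an illustrative simplification:

```python
# A minimal sketch: rank metrics by variable importance.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

sample = pd.DataFrame({
    "loc":        [1200, 300, 80, 450, 60, 700],
    "complexity": [35,   10,  2,  14,  1,  20],
    "churn":      [400,  90,  5,  60,  2,  150],
    "defective":  [1,    1,   0,  1,   0,  0],
})

X, y = sample.drop(columns="defective"), sample["defective"]
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))

# Repeat for the clean, realistic noisy, and random noisy samples, then
# compare the rank of each metric across the three models.
```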

Does a metric of the clean model appear at the same rank in the noisy models?

[Figure: the ranked metrics of the clean model (ranks 1, 2, 3) are compared, rank by rank, against the ranked metrics of the noisy model.]

Only the metrics in the 1st rank are robust to the mislabelling: 85% of the metrics in the 1st rank of the clean model also appear in the 1st rank of the noisy model.

Conversely, the metrics in the 2nd and 3rd ranks are less stable: as little as 18% of the metrics in the 2nd and 3rd ranks of the clean models appear in the same rank in the noisy models.
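A minimal sketch of the rank-agreement check, with hypothetical metric ranks:

```python
# A minimal sketch: fraction of clean-model metrics that keep their rank
# in the noisy model. The metric names and ranks are hypothetical.
clean_ranks = {"churn": 1, "loc": 1, "complexity": 2, "age": 2, "ndev": 3}
noisy_ranks = {"churn": 1, "loc": 1, "complexity": 3, "age": 2, "ndev": 2}

for rank in (1, 2, 3):
    metrics = [m for m, r in clean_ranks.items() if r == rank]
    kept = sum(noisy_ranks[m] == rank for m in metrics)
    print(f"rank {rank}: {kept}/{len(metrics)} metrics keep their rank")
```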

The impact of realistic mislabelling on the performance and interpretation of defect models

(RQ1) The Nature of Mislabelling: mislabelling is non-random
(RQ2) The Impact on the Performance: while the recall is often impacted, the precision is rarely impacted
(RQ3) The Impact on the Interpretation: only top-rank metrics are robust to the mislabelling

Suggestions

Researchers can use our noise models to clean mislabelled issue reports.

Cleaning the data will improve the ability to identify defective modules.

Quality improvement plans should be based on the top-rank metrics.

Summary

Issue reports are mislabelled: fields in issue tracking systems are often missing or incorrect [Aranda et al., ICSE 2009], and 43% of issue reports are mislabelled [Herzig et al., ICSE 2013; Antoniol et al., CASCON 2008]. Prior work assumed that mislabelling is random [Kim et al., ICSE 2011; Seiffert et al., Information Sciences 2014].

Findings: mislabelling is non-random; while the recall is often impacted, the precision is rarely impacted; only top-rank metrics are robust to the mislabelling.

Suggestions: researchers can use our noise models to clean mislabelled issue reports; cleaning the data will improve the ability to identify defective modules; quality improvement plans should be based on the top-rank metrics.

@klainfo [email protected]