
The Impact of Mislabelling on the Performance and Interpretation of Defect Prediction Models

Chakkrit (Kla) Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Akinori Ihara, Kenichi Matsumoto

@klainfo [email protected]

Software defects are costly

Monetary: NIST estimates that software defects cost the US economy $59.5 billion per year.

Reputation: The Obama administration will always be connected to healthcare.gov.

SQA teams try to find defects before they escape to the field


SQA teams have limited QA resources

Software continues to grow in size and complexity

Defect prediction models help SQA teams to:

Predict which modules are risky

Understand what makes software fail

Modules that are fixed during post-release development are set as defective

[Figure: a snapshot of Module 1 to Module 4 is taken at the release date. During post-release development, Bug Report #1 is fixed by changes to Module 1 and Module 2, so those modules are labelled as defective; the remaining modules are labelled as clean. The result is the defect dataset.]
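As a minimal sketch, the labelling heuristic above could be expressed as follows; the module names and the report-to-module mapping are illustrative, not the authors' tooling:

```python
# A minimal sketch of the post-release labelling heuristic described above.
# The module names and the report-to-module mapping are illustrative; real
# studies mine them from the version control system and the issue tracker.

modules = ["Module1", "Module2", "Module3", "Module4"]

# Post-release issue reports classified as bugs, mapped to the modules
# that their fixes changed.
post_release_bug_fixes = {
    "Bug Report #1": ["Module1", "Module2"],
}

# A module is labelled defective if any post-release bug fix touched it;
# all other modules in the release snapshot are labelled clean.
fixed = {m for touched in post_release_bug_fixes.values() for m in touched}
labels = {m: ("defective" if m in fixed else "clean") for m in modules}

print(labels)
# {'Module1': 'defective', 'Module2': 'defective',
#  'Module3': 'clean', 'Module4': 'clean'}
```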

Defect models are trained using machine learning

[Figure: the defect dataset (Module 1 to Module 4 with their labels) is fed to a machine learning or statistical learning algorithm, producing a defect model.]
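A minimal training sketch follows; the metric values are made up, and the paper does not prescribe this particular learner:

```python
# A minimal sketch of training a defect model on a labelled module dataset.
# Metric values are placeholders for illustration only.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

data = pd.DataFrame({
    "loc":        [1200, 300, 80, 450],  # size of the module
    "complexity": [35,   10,  2,  14],   # e.g., cyclomatic complexity
    "churn":      [400,  90,  5,  60],   # lines changed before the release
    "defective":  [1,    1,   0,  0],    # labels from the defect dataset
})

X, y = data.drop(columns="defective"), data["defective"]
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Predict the risk of an unseen module.
new_module = pd.DataFrame({"loc": [900], "complexity": [25], "churn": [200]})
print(model.predict_proba(new_module)[:, 1])
```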

Defect data are noisy

The reliability of the models depends on the quality of the training data: a defect model trained on a noisy defect dataset is unreliable.

Issue reports are mislabelled

Fields in issue tracking systems are often missing or incorrect. [Aranda et al., ICSE 2009]

43% of issue reports are mislabelled. [Herzig et al., ICSE 2013] [Antoniol et al., CASCON 2008]

Defect mislabelling: a new feature could be incorrectly labelled as a bug.
Non-defect mislabelling: a bug could be mislabelled as a new feature.

Then, modules are mislabelled

[Figure: Bug Report #1 correctly marks M1 and M2 as defective. Issue report #2 is actually a new feature that was mislabelled as a bug, so M3 is incorrectly labelled as defective when it should be a clean module. The result is a noisy dataset.]

Mislabelling may impact the performance

Prior work assumed that mislabelling is random [Kim et al., ICSE 2011; Seiffert et al., Information Sciences 2014], and found that random mislabelling has a negative impact on the performance.

Mislabelling is likely non-random

We suspect that novice developers are more likely to mislabel than experienced developers: novice developers are known to overlook bookkeeping issues. [Bachmann et al., FSE 2010]

The impact of realistic mislabelling on the performance and interpretation of defect models

(RQ1) The Nature of Mislabelling
(RQ2) Its Impact on the Performance
(RQ3) Its Impact on the Interpretation

Using prediction models to classify whether issue reports are mislabelled

If the prediction model performs well, mislabelling is predictable (i.e., non-random); if it performs poorly, mislabelling is random.
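A minimal sketch of this check follows; the features are hypothetical stand-ins for real report characteristics (who reported and fixed the issue, how it was discussed, how it was resolved):

```python
# A minimal sketch of the RQ1 check: train a classifier to predict whether
# an issue report is mislabelled. The features are hypothetical stand-ins
# for real report characteristics.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

reports = pd.DataFrame({
    "reporter_experience": [120, 3, 45, 1, 80, 2],  # prior reports filed
    "num_comments":        [10,  1,  6, 0,  8, 2],
    "files_in_fix":        [2,   9,  1, 7,  3, 8],
    "mislabelled":         [0,   1,  0, 1,  0, 1],  # oracle labels
})

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, reports.drop(columns="mislabelled"),
                         reports["mislabelled"], cv=2, scoring="f1")

# If this F-measure clearly beats random guessing, mislabelling is
# predictable rather than random.
print(scores.mean())
```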

Selecting our studied systems

We select systems with manually-curated issue reports [Herzig et al., ICSE 2013]: Jackrabbit and Lucene.

Mislabelling is non-random

[Figure: precision, recall, and F-measure of our mislabelling prediction models vs. random guessing, for Jackrabbit and Lucene]

             Jackrabbit              Lucene
             Our Model  Random       Our Model  Random
Precision    0.78       0.12         0.75       0.12
Recall       0.64       0.50         0.71       0.50
F-Measure    0.70       0.19         0.73       0.19

Our models achieve a mean F-measure of up to 0.73, which is 4-34 times better than random guessing.

The impact of realistic mislabelling on the performance and interpretation of defect models

(RQ1) The Nature of Mislabelling: mislabelling is non-random
(RQ2) The Impact on the Performance

Compare the performance of clean models against that of noisy models

Clean Performance vs. Realistic Noisy Performance vs. Random Noisy Performance

Generating three samples

Clean Sample: the dataset with oracle labels (the oracle tells us that issue report #2 is mislabelled), giving modules M1-M4 their correct labels.

Realistic Noisy Sample: starting from the clean sample, flip the labels of the modules that are addressed by the mislabelled issue reports.

Random Noisy Sample: starting from the clean sample, randomly flip module labels.
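A minimal sketch of the sample generation, using the M1-M4 example from the slides; flipping the same number of labels in the random sample is an illustrative choice, made here so the two noisy samples are comparable:

```python
# A minimal sketch of generating the clean, realistic noisy, and random
# noisy samples from the slides' M1-M4 example.
import random

clean = {"M1": 1, "M2": 1, "M3": 0, "M4": 0}  # oracle labels, 1 = defective

# Realistic noise: the oracle says issue report #2 is mislabelled, and its
# fix touched M3, so we flip M3's label.
modules_of_mislabelled_reports = ["M3"]
realistic_noisy = dict(clean)
for m in modules_of_mislabelled_reports:
    realistic_noisy[m] = 1 - realistic_noisy[m]

# Random noise: flip the labels of randomly chosen modules instead.
random.seed(0)
flipped = random.sample(sorted(clean), k=len(modules_of_mislabelled_reports))
random_noisy = dict(clean)
for m in flipped:
    random_noisy[m] = 1 - random_noisy[m]

print(clean, realistic_noisy, random_noisy, sep="\n")
```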

Generate the performance of clean models and noisy models

Train a defect model on each sample and compare: Clean Performance vs. Realistic Noisy Performance vs. Random Noisy Performance.

Performance Ratio = Performance of Realistic Noisy Model / Performance of Clean Model
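A minimal sketch of the ratio computation, with hypothetical predictions on a held-out test set:

```python
# A minimal sketch of the performance ratio defined above.
from sklearn.metrics import precision_score, recall_score

def ratio(y_true, pred_noisy, pred_clean, metric):
    """Performance of the noisy model relative to the clean model."""
    return metric(y_true, pred_noisy) / metric(y_true, pred_clean)

# Hypothetical test-set labels and predictions.
y_true     = [1, 1, 1, 0, 0, 1, 0, 1]
pred_clean = [1, 1, 0, 0, 0, 1, 0, 1]
pred_noisy = [1, 0, 0, 0, 0, 1, 0, 0]  # the noisy model misses more defects

print(ratio(y_true, pred_noisy, pred_clean, precision_score))  # 1.0: no impact
print(ratio(y_true, pred_noisy, pred_clean, recall_score))     # 0.5: recall hurt
```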

While the recall is often impacted, the precision is rarely impacted

[Figure: boxplots of the ratio (Realistic Noisy / Clean) for precision and recall, on a 0.0-2.0 scale. Interpretation: a ratio of 1 means there is no impact.]

Precision is rarely impacted by realistic mislabelling, while models trained on noisy data achieve only 56% of the recall of models trained on clean data.

The impact of realistic mislabelling on the performance and interpretation of defect models

(RQ1) The Nature of Mislabelling: mislabelling is non-random
(RQ2) The Impact on the Performance: while the recall is often impacted, the precision is rarely impacted
(RQ3) The Impact on the Interpretation

Generate the rank of metrics of clean models and noisy models

For each of the clean model, the realistic noisy model, and the random noisy model: compute variable importance scores, then rank the metrics by those scores.
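A minimal sketch of the ranking step, using a random forest's impurity-based importances; the study ranks metrics with a statistical grouping of importance scores, so simply sorting them, as done here, is an illustrative simplification:

```python
# A minimal sketch: rank metrics by variable importance.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

sample = pd.DataFrame({
    "loc":        [1200, 300, 80, 450, 60, 700],
    "complexity": [35,   10,  2,  14,  1,  20],
    "churn":      [400,  90,  5,  60,  2,  150],
    "defective":  [1,    1,   0,  1,   0,  0],
})

X, y = sample.drop(columns="defective"), sample["defective"]
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))

# Repeat for the clean, realistic noisy, and random noisy samples, then
# compare the rank of each metric across the three models.
```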

Does a metric of the clean model appear at the same rank in the noisy models?

[Figure: the ranked metrics of the clean model (ranks 1, 2, 3) are compared, rank by rank, against the ranked metrics of the noisy model.]

Only the metrics in the 1st rank are robust to the mislabelling: 85% of the metrics in the 1st rank of the clean model also appear in the 1st rank of the noisy model.

Conversely, the metrics in the 2nd and 3rd ranks are less stable: as little as 18% of the metrics in the 2nd and 3rd ranks of the clean models appear in the same rank in the noisy models.
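A minimal sketch of the rank-agreement check, with hypothetical metric ranks:

```python
# A minimal sketch: fraction of clean-model metrics that keep their rank
# in the noisy model. The metric names and ranks are hypothetical.
clean_ranks = {"churn": 1, "loc": 1, "complexity": 2, "age": 2, "ndev": 3}
noisy_ranks = {"churn": 1, "loc": 1, "complexity": 3, "age": 2, "ndev": 2}

for rank in (1, 2, 3):
    metrics = [m for m, r in clean_ranks.items() if r == rank]
    kept = sum(noisy_ranks[m] == rank for m in metrics)
    print(f"rank {rank}: {kept}/{len(metrics)} metrics keep their rank")
```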

The impact of realistic mislabelling on the performance and interpretation of defect models

(RQ1) The Nature of Mislabelling: mislabelling is non-random
(RQ2) The Impact on the Performance: while the recall is often impacted, the precision is rarely impacted
(RQ3) The Impact on the Interpretation: only top-rank metrics are robust to the mislabelling

Suggestions

Researchers can use our noise models to clean mislabelled issue reports.

Cleaning the data will improve the ability to identify defective modules.

Quality improvement plans should be based on the top-rank metrics.

Summary

Issue reports are mislabelled: fields in issue tracking systems are often missing or incorrect [Aranda et al., ICSE 2009], and 43% of issue reports are mislabelled [Herzig et al., ICSE 2013; Antoniol et al., CASCON 2008]. Prior work assumed that mislabelling is random [Kim et al., ICSE 2011; Seiffert et al., Information Sciences 2014].

Findings: mislabelling is non-random; while the recall is often impacted, the precision is rarely impacted; only top-rank metrics are robust to the mislabelling.

Suggestions: researchers can use our noise models to clean mislabelled issue reports; cleaning the data will improve the ability to identify defective modules; quality improvement plans should be based on the top-rank metrics.

@klainfo [email protected]