The Impact of Mislabelling on the Performance and Interpretation
of Defect Prediction Models
Chakkrit (Kla) Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Akinori Ihara, Kenichi Matsumoto
@klainfo [email protected]
Software defects are costly
Monetary: NIST estimates that software defects cost the US economy $59.5 billion per year!
Reputation: The Obama administration will always be connected to healthcare.gov
SQA teams have limited resources
QA resources are limited, while software continues to grow in size and complexity
Defect prediction models help SQA teams to:
Predict which modules are risky
Understand what makes software fail
Modules that are fixed during post-release development are set as defective
Take a snapshot of the modules at the release date (Module 1, Module 2, Module 3, Module 4).
Collect the issue reports filed post-release. Bug Report #1 fixed Module 1 and Module 2.
Label Module 1 and Module 2 as defective; label Module 3 and Module 4 as clean.
The result is the defect dataset.
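The labelling step above can be sketched in a few lines (a minimal illustration; the module names and the report structure are hypothetical, not the paper's tooling):

```python
# Label modules as defective when a post-release bug report fixed them.
modules = ["Module1", "Module2", "Module3", "Module4"]

# Hypothetical post-release bug reports, each listing the modules it fixed.
bug_reports = [
    {"id": 1, "fixed_modules": ["Module1", "Module2"]},
]

fixed = {m for report in bug_reports for m in report["fixed_modules"]}
labels = {m: ("defective" if m in fixed else "clean") for m in modules}
print(labels["Module1"], labels["Module3"])  # defective clean
```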
Defect models are trained using machine learning
The defect dataset (Module 1, Module 2, Module 3, Module 4) is fed to a machine learning or statistical learning technique to train a defect model.
Defect data are noisy
The reliability of the models depends on the quality of the training data: a noisy defect dataset yields an unreliable defect model.
Issue reports are mislabelled
Fields in issue tracking systems are often missing or incorrect [Aranda et al., ICSE 2009].
43% of issue reports are mislabelled [Herzig et al., ICSE 2013] [Antoniol et al., CASCON 2008].
Defect mislabelling: a new feature could be incorrectly labelled as a bug.
Non-defect mislabelling: a bug could be mislabelled as a new feature.
Then, modules are mislabelled
Issue report #2 is mislabelled: it is actually a new feature that was incorrectly labelled as a bug.
Because #2 addressed M3, the noisy data labels M3 as defective, when M3 should be a clean module.
Mislabelling may impact the performance
Prior work assumed that mislabelling is random [Kim et al., ICSE 2011] [Seiffert et al., Information Sciences 2014] and found that random mislabelling has a negative impact on the performance.
Mislabelling is likely non-random
We suspect that novice developers are more likely to mislabel than experienced developers, since novice developers are known to overlook bookkeeping issues [Bachmann et al., FSE 2010].
The impact of realistic mislabelling on the performance and interpretation of defect models
(RQ1) The Nature of Mislabelling
(RQ2) Its Impact on the Performance
(RQ3) Its Impact on the Interpretation
Using prediction models to classify whether issue reports are mislabelled
If the prediction model performs well, mislabelling is predictable (non-random).
If the prediction model performs poorly, mislabelling is random.
[Figure: precision, recall, and F-measure of our model vs. random guessing on Jackrabbit and Lucene]
Jackrabbit: Our Model: Precision 0.78, Recall 0.64, F-Measure 0.70; Random Guessing: Precision 0.12, Recall 0.50, F-Measure 0.19
Lucene: Our Model: Precision 0.75, Recall 0.71, F-Measure 0.73; Random Guessing: Precision 0.12, Recall 0.50, F-Measure 0.19
Mislabelling is non-random
Our models achieve a mean F-measure of up to 0.73, which is 4-34 times better than random guessing.
The impact of realistic mislabelling on the performance and interpretation of defect models
(RQ1) The Nature of Mislabelling: Mislabelling is non-random
(RQ2) The Impact on the Performance
Compare the performance between clean models and noisy models
Clean Performance vs. Realistic Noisy Performance vs. Random Noisy Performance
Generating three samples
Clean Sample: the modules with oracle labels (the oracle knows that issue #2 is mislabelled).
Realistic Noisy Sample: realistically flip the labels of the modules that are addressed by the mislabelled issue reports.
Random Noisy Sample: randomly flip the modules' labels.
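The two noise-injection strategies can be sketched as follows (the module labels and the 25% noise rate are illustrative assumptions, not the study's values):

```python
import random

# Clean (oracle) labels: issue #2 was mislabelled, so M3 is actually clean.
clean_labels = {"M1": "defective", "M2": "defective", "M3": "clean", "M4": "clean"}

def flip(label):
    return "clean" if label == "defective" else "defective"

# Realistic noise: flip only the modules addressed by mislabelled issue reports.
mislabelled_modules = {"M3"}  # issue #2 addressed M3
realistic_noisy = {
    m: flip(lbl) if m in mislabelled_modules else lbl
    for m, lbl in clean_labels.items()
}

# Random noise: flip each module's label with a fixed probability.
rng = random.Random(42)
random_noisy = {
    m: flip(lbl) if rng.random() < 0.25 else lbl
    for m, lbl in clean_labels.items()
}

print(realistic_noisy["M3"])  # defective
```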
Generate the performance of clean models and noisy models
Train a defect model on each sample, then compare Clean Performance vs. Realistic Noisy Performance vs. Random Noisy Performance.
Performance Ratio = Performance of Realistic Noisy Model / Performance of Clean Model
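The ratio above can be computed directly; the numbers below are placeholders for illustration, not results from the paper:

```python
def performance_ratio(noisy_performance, clean_performance):
    """Ratio of a noisy model's performance to the clean model's (1.0 = no impact)."""
    return noisy_performance / clean_performance

# Hypothetical example: the noisy model recovers only half of the clean recall.
ratio = performance_ratio(noisy_performance=0.35, clean_performance=0.70)
print(ratio)  # 0.5
```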
While the recall is often impacted, the precision is rarely impacted
[Figure: boxplots of the ratio (Realistic Noisy / Clean) for precision and recall; a ratio of 1 means there is no impact]
Precision is rarely impacted by realistic mislabelling.
Models trained on noisy data achieve 56% of the recall of models trained on clean data.
The impact of realistic mislabelling on the performance and interpretation of defect models
(RQ1) The Nature of Mislabelling: Mislabelling is non-random
(RQ2) The Impact on the Performance: While the recall is often impacted, the precision is rarely impacted
(RQ3) The Impact on the Interpretation
Generate the rank of metrics of clean models and noisy models
For each model (clean, realistic noisy, random noisy), compute variable importance scores, then rank the metrics by their scores.
Does a metric of the clean model appear at the same rank in the noisy models?
Compare the rank of each metric in the clean model against its rank in the noisy models.
Only the metrics in the 1st rank are robust to the mislabelling
85% of the metrics in the 1st rank of the clean model also appear in the 1st rank of the noisy model.
Conversely, the metrics in the 2nd and 3rd ranks are less stable
As little as 18% of the metrics in the 2nd and 3rd ranks of the clean models appear in the same rank in the noisy models.
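The rank-stability check can be sketched as follows (a simplified version where every metric gets a distinct rank; the metric names and importance scores are made-up for illustration):

```python
# Hypothetical variable-importance scores (higher = more important).
clean_scores = {"size": 0.9, "churn": 0.7, "complexity": 0.4}
noisy_scores = {"size": 0.8, "complexity": 0.6, "churn": 0.3}

def rank(scores):
    """Map each metric to its rank (1 = most important)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {metric: i + 1 for i, metric in enumerate(ordered)}

clean_rank, noisy_rank = rank(clean_scores), rank(noisy_scores)

# Fraction of the clean model's rank-1 metrics that stay at rank 1 in the noisy model.
top_clean = [m for m, r in clean_rank.items() if r == 1]
stable = sum(noisy_rank[m] == 1 for m in top_clean) / len(top_clean)
print(stable)  # 1.0 -> "size" stays in the 1st rank
```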
The impact of realistic mislabelling on the performance and interpretation of defect models
(RQ1) The Nature of Mislabelling: Mislabelling is non-random
(RQ2) The Impact on the Performance: While the recall is often impacted, the precision is rarely impacted
(RQ3) The Impact on the Interpretation: Only the top-rank metrics are robust to the mislabelling
Suggestions
Researchers can use our noise models to clean mislabelled issue reports.
Cleaning data will improve the ability to identify defective modules.
Quality improvement plans should be based on the top-rank metrics.