
Defect, defect, defect: PROMISE 2012 Keynote


Defect prediction leveraging software repositories has received a tremendous amount of attention within the software engineering community, including PROMISE. In this talk, I will first present the great achievements of defect prediction research, including new defect prediction features, promising algorithms, and interesting analysis results. However, there are still many challenges in defect prediction. I will talk about them and discuss potential solutions that leverage defect prediction 2.0.


Page 1: Defect, defect, defect: PROMISE 2012 Keynote

Sung Kim
The Hong Kong University of Science and Technology

Defect, Defect, Defect

Keynote

Page 2: Defect, defect, defect: PROMISE 2012 Keynote
Page 3: Defect, defect, defect: PROMISE 2012 Keynote
Page 4: Defect, defect, defect: PROMISE 2012 Keynote
Page 5: Defect, defect, defect: PROMISE 2012 Keynote
Page 6: Defect, defect, defect: PROMISE 2012 Keynote

Program Analysis and Mining (PAM) Group

Page 7: Defect, defect, defect: PROMISE 2012 Keynote

Program Analysis and Mining (PAM) Group

Page 8: Defect, defect, defect: PROMISE 2012 Keynote

The First Bug (September 9, 1947)

Page 9: Defect, defect, defect: PROMISE 2012 Keynote

More Bugs

Page 10: Defect, defect, defect: PROMISE 2012 Keynote

Finding Bugs

Verification

Testing

Prediction

Page 11: Defect, defect, defect: PROMISE 2012 Keynote

Defect Prediction

Program → Tool → predicted future defects (e.g., 42, 24, 14)

Page 12: Defect, defect, defect: PROMISE 2012 Keynote

Why Prediction?

Page 13: Defect, defect, defect: PROMISE 2012 Keynote

Defect Prediction Model

D = 4.86 + 0.018 L

F. Akiyama, “An Example of Software System Debugging,” Information Processing, vol. 71, 1971
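As a rough illustration (mine, not from the slides), Akiyama's linear model is usually read as estimating the number of defects D from the module size L in lines of code; a minimal sketch:

def akiyama_defects(loc: float) -> float:
    # Akiyama (1971): estimated defects D = 4.86 + 0.018 * L, with L in lines of code.
    return 4.86 + 0.018 * loc

# Example: a 1,000-line module is estimated to contain about 23 defects.
for loc in (100, 1000, 5000):
    print(f"{loc:>5} LOC -> {akiyama_defects(loc):.1f} predicted defects")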

Page 14: Defect, defect, defect: PROMISE 2012 Keynote

Defect Prediction

Identifying New Metrics

Developing New Algorithms

Various Granularities

Page 15: Defect, defect, defect: PROMISE 2012 Keynote

Defect Prediction

Identifying New Metrics

Developing New Algorithms

Various Granularities

Page 16: Defect, defect, defect: PROMISE 2012 Keynote

Complex Files

Ostrand and Weyuker, Basili et al., TSE 1996, Ohlsson and Alberg, TSE 1996, Menzies et al., TSE 2007

a complex file

a simple file

Page 17: Defect, defect, defect: PROMISE 2012 Keynote

Complex Files

Ostrand and Weyuker, Basili et al., TSE 1996, Ohlsson and Alberg, TSE 1996, Menzies et al., TSE 2007

a complex file

a simple file

Page 18: Defect, defect, defect: PROMISE 2012 Keynote

Changes

Bell et al. PROMISE 2011, Moser et al., ICSE 2008, Nagappan et al., ICSE 2006, Hassan et al., ICSM 2005

Page 19: Defect, defect, defect: PROMISE 2012 Keynote

Changes

Bell et al. PROMISE 2011, Moser et al., ICSE 2008, Nagappan et al., ICSE 2006, Hassan et al., ICSM 2005

Page 20: Defect, defect, defect: PROMISE 2012 Keynote

Lee et al., FSE 2011

View/Edit Patterns

Page 21: Defect, defect, defect: PROMISE 2012 Keynote

Slide by Mik Kersten. “Mylyn – The task-focused interface” (December 2007, http://live.eclipse.org)

Page 22: Defect, defect, defect: PROMISE 2012 Keynote

With Mylyn

Tasks are integrated

See only what you are working on

Slide by Mik Kersten. “Mylyn – The task-focused interface” (December 2007, http://live.eclipse.org)

Page 23: Defect, defect, defect: PROMISE 2012 Keynote

* Eclipse plug-in storing and recovering task contexts

Page 24: Defect, defect, defect: PROMISE 2012 Keynote

* Eclipse plug-in storing and recovering task contexts

Page 25: Defect, defect, defect: PROMISE 2012 Keynote

* Eclipse plug-in storing and recovering task contexts

<InteractionEvent … Kind="" … StartDate="" EndDate="" … StructureHandle="" … Interest="" … >

Page 26: Defect, defect, defect: PROMISE 2012 Keynote

Burst Edits/Views

Lee et al., FSE 2011

Page 27: Defect, defect, defect: PROMISE 2012 Keynote

Burst Edits/Views

Lee et al., FSE 2011

Page 28: Defect, defect, defect: PROMISE 2012 Keynote

Change Entropy

Hassan, “Predicting Faults Using the Complexity of Code Changes,” ICSE 2009

Change counts per file in a period (e.g., a week): low entropy when changes are concentrated (e.g., F1 changed 11 times, F2-F5 once each); high entropy when changes are spread evenly (F1-F5 changed 3 times each).
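A minimal sketch (my own, not from the talk) of the change-entropy idea: take the distribution of changes over files in a period and compute its Shannon entropy; concentrated changes give low entropy, evenly spread changes give high entropy.

import math

def change_entropy(changes_per_file):
    # Shannon entropy (base 2) of the change distribution over files in one period.
    total = sum(changes_per_file)
    probs = [c / total for c in changes_per_file if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(change_entropy([11, 1, 1, 1, 1]))  # ~1.37 bits: changes concentrated in one file
print(change_entropy([3, 3, 3, 3, 3]))   # ~2.32 bits: changes spread evenly (log2(5))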

Page 29: Defect, defect, defect: PROMISE 2012 Keynote

Change Entropy

Hassan, “Predicting Faults Using the Complexity of Code Changes,” ICSE 2009

Change counts per file in a period (e.g., a week): low entropy when changes are concentrated (e.g., F1 changed 11 times, F2-F5 once each); high entropy when changes are spread evenly (F1-F5 changed 3 times each).

Page 30: Defect, defect, defect: PROMISE 2012 Keynote

Previous Fixes

Hassan et al., ICSM 2005, Kim et al., ICSE 2007

Page 31: Defect, defect, defect: PROMISE 2012 Keynote

Previous Fixes

Hassan et al., ICSM 2005, Kim et al., ICSE 2007

Page 32: Defect, defect, defect: PROMISE 2012 Keynote

Previous Fixes

Hassan et al., ICSM 2005, Kim et al., ICSE 2007

Page 33: Defect, defect, defect: PROMISE 2012 Keynote

Network

Zimmermann and Nagappan, “Predicting Defects using Network Analysis on Dependency Graphs,” ICSE 2008

Page 34: Defect, defect, defect: PROMISE 2012 Keynote

Network

Zimmermann and Nagappan, “Predicting Defects using Network Analysis on Dependency Graphs,” ICSE 2008

Page 35: Defect, defect, defect: PROMISE 2012 Keynote

More Metrics

Bar chart: # of publications (last 7 years) per metric family: Complexity (Size), CK, McCabe, OO, Process metrics, Halstead, Developer count metrics, Change metrics, Entropy of changes (change complexity), Churn (source code metrics), # of changes to the file, Previous defects, Network measures, Calling structure attributes, Entropy (source code metrics).

Page 36: Defect, defect, defect: PROMISE 2012 Keynote

Defect Prediction

Identifying New Metrics

Developing New Algorithms

Various Granularities

Page 37: Defect, defect, defect: PROMISE 2012 Keynote

Classification

A learner is trained on instances described by metrics (complexity metrics, historical metrics, ...) together with buggy/clean labels; the resulting model then classifies a new instance.
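A minimal sketch of this classification setup using scikit-learn (my example, not the talk's tooling; the feature names and values are made up):

from sklearn.ensemble import RandomForestClassifier

# Training instances: one row per module with hypothetical metric values
# [lines_of_code, cyclomatic_complexity, past_fixes]; label 1 = buggy, 0 = clean.
X_train = [[120, 4, 2], [3500, 40, 9], [220, 7, 0], [1800, 25, 5]]
y_train = [0, 1, 0, 1]

learner = RandomForestClassifier(n_estimators=100, random_state=0)
learner.fit(X_train, y_train)

new_instance = [[900, 15, 3]]
print("buggy" if learner.predict(new_instance)[0] == 1 else "clean")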

Page 38: Defect, defect, defect: PROMISE 2012 Keynote

Regression

A learner is trained on instances described by metrics together with defect counts (values); the resulting model then predicts a value (e.g., the number of defects) for a new instance.
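The corresponding regression variant, again only a hypothetical sketch, predicts a defect count instead of a label:

from sklearn.linear_model import LinearRegression

# Same hypothetical metric features, but the target is the number of post-release defects.
X_train = [[120, 4, 2], [3500, 40, 9], [220, 7, 0], [1800, 25, 5]]
y_train = [0, 14, 1, 6]

model = LinearRegression().fit(X_train, y_train)
print(round(model.predict([[900, 15, 3]])[0], 1), "predicted defects")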

Page 39: Defect, defect, defect: PROMISE 2012 Keynote

Active Learning

Lo et al., “Active Refinement of Clone Anomaly Reports,” ICSE 2012, Lu et al., PROMISE 2012

clones should be inherently similar to each other, and inconsistent changes to the clones themselves or their surrounding code (which are called contexts) may indicate unintentional changes, bad programming styles, and bugs.

The technique in [12] is summarized as follows:

1) It uses a code clone detection tool, DECKARD [5], to detect code clones in programs. The output of this step is a set of clone groups, where each clone group is a set of code pieces that are syntactically similar to each other (a.k.a. clones);

2) Then, it locates the locations of every clone in the source code and generates parse (sub)trees for them;

3) Next, it detects inconsistencies among the parse trees of the clones and their contexts, e.g., whether the clones contain different numbers of unique identifiers, and how the language constructs of the contexts are different. The inconsistencies are then ranked heuristically based on their potential relationship with bugs. Inconsistent clones unlikely to be buggy are also filtered out.

4) Finally, it outputs a list of anomaly reports, each of which indicates the location of a potential bug in the source code, for developers to inspect.

It has been reported that this technique has high false positive rates, even though it can find true bugs of diverse characteristics that are difficult to detect by other techniques. For example, among more than 800 reported bugs for the Linux kernel, only 41 are true bugs and another 17 are bad programming styles; among more than 400 reported bugs for Eclipse, only 21 are true bugs and 17 are issues with bad programming styles [12].

IV. OVERALL REFINEMENT FRAMEWORK

A typical clone-based anomaly detection system performs a single batch analysis where a static set of anomaly or bug reports (ordered or unordered) is produced. It requires no or little user intervention (e.g., setting some parameters), but may produce many false positives. To alleviate this problem, we propose an active learning approach that can dynamically and continually refine anomaly reports based on incremental user feedbacks; each feedback is immediately incorporated by our approach into the ordering of anomaly reports to move possible true positive reports up in the list while moving likely false positives towards the end of the list.

Our proposed active refinement process supporting user feedbacks is shown in Figure 4. It is composed of five parts corresponding to the boxes in the figure.² Let us refer to them as Block 1 to 5 (counter-clockwise from left to right). Block 1 represents a typical batch-mode clone-based anomaly detection system. Given a program, the system identifies parts of the program that are different from the norm, where the norm corresponds to the common characteristics in a clone group. Then, the set of anomalies or bugs (i.e., Block 2) is presented for manual user inspection.

² A square, a trapeze, and a parallelogram represent a process, a manual operation, and data, respectively.

Figure 4. Active Refinement Process: (1) Anomaly Detection System → (2) Sorted Bug Reports → (3) First Few Bug Reports → (4) User Feedback → (5) Refinement Engine, connected in a refinement loop back to (2).

We extend such typical clone-based anomaly detection systems by incorporating incremental user feedbacks through the feedback and refinement loop starting at Block 2, followed by Blocks 3, 4, and 5, and back to Block 2. At Blocks 3 and 4, a user is presented with a few bug reports and is asked to provide feedbacks on whether the reports he or she sees are false or true positives. These feedbacks are then fed into our refinement engine (i.e., Block 5) to update the original or intermediate lists of bug reports.

With user feedbacks, the refinement engine analyzes the characteristics of both false positives and true positives labeled by users so far and hypothesizes about other false positives and true positives in the list based on various classification and machine learning techniques. This hypothesis is then used to rearrange the remaining bug reports. It is possible that a true positive that is originally ranked low is moved up the list; a false positive that is originally ranked high is “downgraded” or pushed down the list.

The active refinement process repeats and users are asked for more feedbacks. With more iterations, more feedbacks are received, and a better hypothesis can be made for the remaining unlabeled reports.

The ultimate goal of our refinement process is to produce a better ordering of bug reports so that true positive reports are listed first, ahead of false positives, which we refer to as the bug report ordering problem. With better ordering, true positives can be identified earlier without the need to investigate the entire report list. With fewer false positives earlier in the list, a debugger can be encouraged to continue investigating the rest of the reports and find more bugs in a fixed period of time. If all (or most) of the true positives appear early, a debugger may stop analyzing the anomaly reports once he or she finds many false positives.

V. REFINEMENT ENGINE

This section elaborates our refinement engine further. Our refinement engine takes in a list of anomaly reports and refines it by reordering the reports. Each anomaly report is a set of code clones (i.e., a clone group) which contain inconsistencies among the clones. Given a list of anomaly reports, ordered either arbitrarily or with some ad-hoc criteria, and user-provided labels (i.e., true positives or false positives) ...
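The refinement loop described above could be sketched roughly as follows (my simplification, not the authors' implementation): after each round of user labels, fit a classifier on the labeled reports and reorder the remaining reports by their predicted probability of being true positives.

from sklearn.linear_model import LogisticRegression

def refine_order(labeled, unlabeled):
    # labeled: list of (feature_vector, is_true_positive); unlabeled: list of (report_id, feature_vector).
    X = [features for features, _ in labeled]
    y = [int(is_tp) for _, is_tp in labeled]
    clf = LogisticRegression().fit(X, y)
    scored = [(clf.predict_proba([features])[0][1], report_id)
              for report_id, features in unlabeled]
    # Likely true positives first; likely false positives pushed towards the end.
    return [report_id for _, report_id in sorted(scored, reverse=True)]

labeled = [([0.9, 3], True), ([0.2, 1], False), ([0.8, 4], True), ([0.1, 0], False)]
unlabeled = [("r5", [0.7, 2]), ("r6", [0.15, 1]), ("r7", [0.85, 5])]
print(refine_order(labeled, unlabeled))  # e.g. ['r7', 'r5', 'r6']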

Page 41: Defect, defect, defect: PROMISE 2012 Keynote

Algorithms

Bar chart: # of publications (recent 7 years) by algorithm type (Classification, Regression, Both, Etc.); values shown: 21, 18, 4, 4.

Page 42: Defect, defect, defect: PROMISE 2012 Keynote

Defect Prediction

Identifying New Metrics

Developing New Algorithms

Various Granularities

Page 43: Defect, defect, defect: PROMISE 2012 Keynote

Module/Binary/Package Level

Page 44: Defect, defect, defect: PROMISE 2012 Keynote

Module/Binary/Package Level

Page 45: Defect, defect, defect: PROMISE 2012 Keynote

File Level

Page 46: Defect, defect, defect: PROMISE 2012 Keynote

File Level

Page 47: Defect, defect, defect: PROMISE 2012 Keynote

Method Level

void foo () {...

}

Hata et al., “Bug Prediction Based on Fine-Grained Module Histories,” ICSE 2012

Page 48: Defect, defect, defect: PROMISE 2012 Keynote

Method Level

void foo () {...

}

Hata et al., “Bug Prediction Based on Fine-Grained Module Histories,” ICSE 2012

Page 49: Defect, defect, defect: PROMISE 2012 Keynote

Change Level

Did I just introduce a bug?

Development history of a file: Rev 1 → change → Rev 2 → change → Rev 3 → change → Rev 4

Kim et al., "Classifying Software Changes: Clean or Buggy?" TSE 2009

Page 50: Defect, defect, defect: PROMISE 2012 Keynote

Change Level

Did I just introduce a bug?

Development history of a file: Rev 1 → change → Rev 2 → change → Rev 3 → change → Rev 4

Kim et al., "Classifying Software Changes: Clean or Buggy?" TSE 2009

Page 51: Defect, defect, defect: PROMISE 2012 Keynote

More Granularities

Bar chart: # of publications (recent 7 years) by granularity (Project/Release/Subsystem, Component/Module, Package, File, Class, Function/Method, Change/Hunk level); values shown: 19, 8, 8, 3, 3, 2, 1.

Page 52: Defect, defect, defect: PROMISE 2012 Keynote

Defect Prediction Summary

Identifying New Metrics

Developing New Algorithms

Various Granularities

Page 53: Defect, defect, defect: PROMISE 2012 Keynote

Performance

Figure 2. Data used in models: Apache, ArgoUML, Eclipse, an embedded system, a healthcare system, Microsoft, and Mozilla. (Figure 3. The size of the data sets used for Eclipse.)

Hall et al., "A Systematic Review of Fault Prediction Performance in Software Engineering," TSE 2011 (Figure 2)

Page 54: Defect, defect, defect: PROMISE 2012 Keynote

Performance

Figure 6. The granularity of the results (*for example plug-ins, binaries): Class, File, Module, Binary/plug-in.

… in the on-line appendix shows how independent variables as expressed by individual studies have been categorised in relation to the labels used in Figure 7. It shows that there is variation in performance between models using different independent variables. Models using a wide combination of metrics seem to be performing well. For example, models using a combination of static code metrics (scm), process metrics and source code text seem to be performing best overall (e.g. Shivaji et al. [164]). Similarly Bird et al.'s study [18], which uses a wide combination of socio-technical metrics (code dependency data together with change data and developer data), also performs well (though the results from Bird et al.'s study [18] are reported at a high level of granularity). Process metrics (i.e. metrics based on changes logged in repositories) have not performed as well as expected. OO metrics seem to have been used in studies which perform better than studies based only on other static code metrics (e.g. complexity based metrics). Models using only LOC data seem to have performed competitively compared to models using other independent variables. Indeed, of these models using only metrics based on static features of the code (OO or SCM), LOC seems as good as any other metric to use. The use of source code text seems related to good performance. Mizuno et al.'s studies [116], [117] have used only source code text within a novel spam filtering approach to relatively good effect.

5.4 Performance in relation to modelling technique

Figure 8 shows model performance in relation to the modelling techniques used. Models based on Naïve Bayes seem to be performing well overall. Naïve Bayes is a well understood technique that is in common use. Similarly, models using Logistic Regression also seem to be performing well. Models using Linear Regression perform not so well, though this technique assumes that there is a linear relationship between the variables. Studies using Random Forests have not performed as well as might be expected (many studies using NASA data use Random Forests and report good performances [97]). Figure 8 also shows that SVM (Support Vector Machine) techniques do not seem to be related to models performing well. Furthermore, there is a wide range of low performances using SVMs. This may be because SVMs are difficult to tune and the default Weka settings are not optimal. The performance of models using the C4.5 technique is fairly average. However, Arisholm et al.'s models [8], [9] used the C4.5 technique (as previously explained these are not shown as their relatively poor results skew the data presented). C4.5 is thought to struggle with imbalanced data [16] and [17] and this may explain the performance of Arisholm et al.'s models.

Hall et al., "A Systematic Review of Fault Prediction Performance in Software Engineering," TSE 2011 (Figure 6)

Page 55: Defect, defect, defect: PROMISE 2012 Keynote

Performance

Figure 6. The granularity of the results: Class, File, Module, Binary/plug-in.

Hall et al., "A Systematic Review of Fault Prediction Performance in Software Engineering," TSE 2011 (Figure 6)

Page 56: Defect, defect, defect: PROMISE 2012 Keynote

Defect prediction totally works!

Page 57: Defect, defect, defect: PROMISE 2012 Keynote

Defect prediction totally works!

Page 58: Defect, defect, defect: PROMISE 2012 Keynote

Done? Why are we not using it?

Page 59: Defect, defect, defect: PROMISE 2012 Keynote

Detailed To-Fix List vs. Buggy Modules

Page 60: Defect, defect, defect: PROMISE 2012 Keynote

Detailed To-Fix List vs. Buggy Modules

Page 61: Defect, defect, defect: PROMISE 2012 Keynote

This is what developers want!

Page 62: Defect, defect, defect: PROMISE 2012 Keynote

Defect Prediction 2.0

Finer Granularity

New Customers

Noise Handling

Page 63: Defect, defect, defect: PROMISE 2012 Keynote

Defect Prediction 2.0

Finer Granularity

New Customers

Noise Handling

Page 64: Defect, defect, defect: PROMISE 2012 Keynote

FindBugs

http://findbugs.sourceforge.net/

Page 65: Defect, defect, defect: PROMISE 2012 Keynote

Performance of Bug Detection Tools

Bar chart: precision (%) of the warnings reported at the tools' priority 1, for FindBugs, jLint, and PMD (axis 0-20%).

Kim and Ernst, “Which Warnings Should I Fix First?” FSE 2007

Page 66: Defect, defect, defect: PROMISE 2012 Keynote

RQ1: How Many False Negatives

• Defects missed, partially, or fully captured
• Warnings from a tool should also correctly explain in detail why a flagged line may be faulty
• How many one-line defects are captured and explained reasonably well (so called, “strictly captured”)?

Very high miss rates!

Thung et al., “To What Extent Could We Detect Field Defects?” ASE 2012

Page 67: Defect, defect, defect: PROMISE 2012 Keynote

RQ1: How Many False Negatives

• Defects missed, partially, or fully captured
• Warnings from a tool should also correctly explain in detail why a flagged line may be faulty
• How many one-line defects are captured and explained reasonably well (so called, “strictly captured”)?

Very high miss rates!

Thung et al., “To What Extent Could We Detect Field Defects?” ASE 2012

Page 68: Defect, defect, defect: PROMISE 2012 Keynote

Line Level Defect Prediction

Page 69: Defect, defect, defect: PROMISE 2012 Keynote

Line Level Defect Prediction

We have seen this bug in revision 100

Page 70: Defect, defect, defect: PROMISE 2012 Keynote

Bug Fix Memories

Extract patterns from the bug-fix change history (bug-fix changes in revisions 1 .. n-1) into a Memory.

Kim et al., "Memories of Bug Fixes," FSE 2006

Page 71: Defect, defect, defect: PROMISE 2012 Keynote

Bug Fix Memories

Extract patterns from the bug-fix change history (bug-fix changes in revisions 1 .. n-1) into a Memory, then search the Memory for patterns in the code to examine.

Kim et al., "Memories of Bug Fixes," FSE 2006

Page 72: Defect, defect, defect: PROMISE 2012 Keynote

Fix Wizard

Nguyen et al., “Recurring Bug Fixes in Object-Oriented Programs,” ICSE 2010

public void setColspan(int colspan) throws WrongValueException {
  if (colspan <= 0) throw new WrongValueException(...);
  if (_colspan != colspan) {
    _colspan = colspan;
    final Execution exec = Executions.getCurrent();
    if (exec != null && exec.isExplorer()) invalidate();
    smartUpdate("colspan", Integer.toString(_colspan));
    ...

public void setRowspan(int rowspan) throws WrongValueException {
  if (rowspan <= 0) throw new WrongValueException(...);
  if (_rowspan != rowspan) {
    _rowspan = rowspan;
    final Execution exec = Executions.getCurrent();
    if (exec != null && exec.isExplorer()) invalidate();
    smartUpdate("rowspan", Integer.toString(_rowspan));
    ...

Figure 1: Bug Fixes at v5088-v5089 in ZK

Figure 2: Graph-based Object Usages for Figure 1. The usage graphs of setColspan and setRowspan are identical (WrongValueException.<init>, IF, Executions.getCurrent, Execution.isExplorer, IF, Auxheader.invalidate, Auxheader.smartUpdate); the changed code corresponds to the Executions.getCurrent / Execution.isExplorer / Auxheader.invalidate usage.

Developers tend to copy-and-paste the implementation code, thus creating similar code fragments. Therefore, we hypothesize that code peers, i.e. classes/methods having similar functions/interactions, tend to have similar implementation code, similar naming schemes, inherit from the same class, or implement the same interface (H2).

2.2 Manual Analysis of Recurring Fixes

We conducted a manual analysis of recurring bug fixes in a two-phase experiment. First, a group of experienced programmers examined all fixing changes of the subject systems and manually identified the similar ones. Then, we analyzed their reports to characterize such recurring fixes and their enclosing code units in order to verify the main hypothesis H1: similar fixes tend to occur on code units having similar roles, i.e. providing similar functions and/or participating in similar interactions, in terms of object usages.

We represented object usages in such code units by a graph-based object usage model, a technique from our previous work GrouMiner [21]. In general, each usage scenario is modeled as a labeled, directed, acyclic graph, called a groum, in which nodes represent method invocations/field accesses of objects, as well as control structures (e.g. if, while), and edges represent the usage orders and data dependencies among them.

Table 1 shows the subject systems used in our study. Two of them were also used by Kim et al. [13] in previous research on bug fixes. The fixes are considered at the method level, i.e. all fixing changes to a method at a revision of a system are considered as an atomic fix. Seven Ph.D. students in Software Engineering at Iowa State University with an average of 5 years of experience in Java manually examined all those fixes and identified the groups of recurring bug fixes

public class UMLOperationsListModel extends UMLModelElementCachedListModel {
  public void add(int index) {
    Object target = getTarget();
    if (target instanceof MClassifier) {
      MClassifier classifier = (MClassifier) target;
      Collection oldFeatures = classifier.getFeatures();
      MOperation newOp = MMUtil.SINGLETON.buildOperation(classifier);
      classifier.setFeatures(addElement(oldFeatures, index, newOp,
        operations.isEmpty() ? null : operations.get(index)));

public class UMLAttributesListModel extends UMLModelElementCachedListModel {
  public void add(int index) {
    Object target = getTarget();
    if (target instanceof MClassifier) {
      MClassifier classifier = (MClassifier) target;
      Collection oldFeatures = classifier.getFeatures();
      MAttribute newAt = MMUtil.SINGLETON.buildAttribute(classifier);
      classifier.setFeatures(addElement(oldFeatures, index, newAt,
        attributes.isEmpty() ? null : attributes.get(index)));

Figure 3: Bug Fixes at v0459-v0460 in ArgoUML

Figure 4: Graph-based Object Usages for Figure 3. The usage graphs in UMLOperationsListModel.addElement and UMLAttributesListModel.addElement are similar: IF, MClassifier.getFeatures, MMUtil.buildOperation (respectively MMUtil.buildAttribute), MClassifier.setFeatures, UMLOperationsListModel.addElement (respectively UMLAttributesListModel.addElement), List.isEmpty, List.get.

(RBFs). Conflicting identifications were resolved by majority vote among them. There were only 2 disputed groups.

Table 2 shows the collective reports. Columns RBF and Percentage show the total numbers and the percentage of recurring bug fixes among all fixing ones. We can see that RBFs are between 17-45% of all fixing changes. This is consistent with the previous report [13]. While many RBFs (85%-97%) occur at the same revisions on different code units (column In Space), fewer RBFs occur in different revisions (column In Time). Analyzing such recurring fixes, we found that most of them (95%-98%) involve object usages (e.g. method calls and field accesses). This is understandable because the study is focused on object-oriented programs.

2.3 Representative Examples

Example 1. Figure 1 shows two recurring fixes taken from the ZK system, with added code shown in boxes. Two methods, setColspan and setRowspan, are very similar in structure and function and thus are considered as cloned code. When their functions need to be changed, they are changed in the same way. Figure 2 shows the object usage models of those two methods with the changed parts shown in the boxes. The nodes such as Executions.getCurrent and Auxheader.smartUpdate represent the invocations of the corresponding methods. An edge such as the one from Executions.getCurrent to Execution.isExplorer shows the usage order, i.e. the former is called before the latter. As we can see, both methods are implemented with the same object usage. Then, they are also modified in the same way, as shown in the boxes.

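A small illustration (mine) of a graph-based object usage model (groum) using networkx; the real GrouMiner extracts these from ASTs, and the nodes here just mirror the example above.

import networkx as nx

groum = nx.DiGraph()
edges = [
    ("Executions.getCurrent", "Execution.isExplorer"),  # usage order: getCurrent before isExplorer
    ("Execution.isExplorer", "IF"),                      # the calls feed an if control structure
    ("IF", "Auxheader.invalidate"),
]
groum.add_edges_from(edges)

# Two methods whose groums are identical (as for setColspan/setRowspan) are code peers
# and candidates for recurring fixes.
print(list(nx.topological_sort(groum)))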

Page 73: Defect, defect, defect: PROMISE 2012 Keynote

Fix Wizard

Nguyen et al., "Recurring Bug Fixes in Object-Oriented Programs," ICSE 2010

Page 74: Defect, defect, defect: PROMISE 2012 Keynote

Word Level Defect Prediction

Page 75: Defect, defect, defect: PROMISE 2012 Keynote

Word Level Defect Prediction

Fix suggestion...

Page 76: Defect, defect, defect: PROMISE 2012 Keynote

Defect Prediction 2.0

Finer Granularity

New Customers

Noise Handling

Page 77: Defect, defect, defect: PROMISE 2012 Keynote

Bug Database: all bugs B; fixed bugs Bf. Source Repository: all commits C.

Bird et al., "Fair and Balanced? Bias in Bug-Fix Datasets," FSE 2009

Page 78: Defect, defect, defect: PROMISE 2012 Keynote

Bug Database: all bugs B; fixed bugs Bf. Source Repository: all commits C. Fixed bugs are linked to commits via log messages.

Bird et al., "Fair and Balanced? Bias in Bug-Fix Datasets," FSE 2009

Page 79: Defect, defect, defect: PROMISE 2012 Keynote

Bug Database: all bugs B; fixed bugs Bf; linked fixed bugs Bfl. Source Repository: all commits C. Linked via log messages.

Bird et al., "Fair and Balanced? Bias in Bug-Fix Datasets," FSE 2009

Page 80: Defect, defect, defect: PROMISE 2012 Keynote

Bug Database: all bugs B; fixed bugs Bf; linked fixed bugs Bfl. Source Repository: all commits C; linked fixes Cfl. Linked via log messages.

Bird et al., "Fair and Balanced? Bias in Bug-Fix Datasets," FSE 2009

Page 81: Defect, defect, defect: PROMISE 2012 Keynote

Bug Database: all bugs B; fixed bugs Bf; linked fixed bugs Bfl. Source Repository: all commits C; linked fixes Cfl. Linked via log messages; some bugs and commits are related, but not linked.

Bird et al., "Fair and Balanced? Bias in Bug-Fix Datasets," FSE 2009

Page 82: Defect, defect, defect: PROMISE 2012 Keynote

Bug Database: all bugs B; fixed bugs Bf; linked fixed bugs Bfl. Source Repository: all commits C; bug fixes Cf; linked fixes Cfl. Linked via log messages; some bugs and fixes are related, but not linked.

Bird et al., "Fair and Balanced? Bias in Bug-Fix Datasets," FSE 2009

Page 83: Defect, defect, defect: PROMISE 2012 Keynote

Bug Database: all bugs B; fixed bugs Bf; linked fixed bugs Bfl. Source Repository: all commits C; bug fixes Cf; linked fixes Cfl. Linked via log messages; bug fixes that are related but not linked are noise!

Bird et al., "Fair and Balanced? Bias in Bug-Fix Datasets," FSE 2009

Page 84: Defect, defect, defect: PROMISE 2012 Keynote

How resistant is a defect prediction model to noise?

Plot: buggy F-measure (0-1) vs. training set false negative (FN) & false positive (FP) rate (0-0.6), for SWT, Debug, Columba, Eclipse, and Scarab.

Kim et al., “Dealing with Noise in Defect Prediction,” ICSE 2011

Page 85: Defect, defect, defect: PROMISE 2012 Keynote

How resistant is a defect prediction model to noise?

Plot: buggy F-measure (0-1) vs. training set false negative (FN) & false positive (FP) rate (0-0.6), for SWT, Debug, Columba, Eclipse, and Scarab.

Kim et al., “Dealing with Noise in Defect Prediction,” ICSE 2011

Page 86: Defect, defect, defect: PROMISE 2012 Keynote

How resistant is a defect prediction model to noise?

Plot: buggy F-measure (0-1) vs. training set false negative (FN) & false positive (FP) rate (0-0.6), for SWT, Debug, Columba, Eclipse, and Scarab. A 20% noise level is highlighted.

Kim et al., “Dealing with Noise in Defect Prediction,” ICSE 2011

Page 87: Defect, defect, defect: PROMISE 2012 Keynote

Closest List Noise Identification

Aj will be returned as the identified noise set. An empirical study found that when the two parameters are set to 3 and 0.99, this algorithm performs the best.

CLNI Algorithm: for each iteration j
  for each instance Inst_i
    for each instance Inst_k
      if Inst_k is in A_j, continue;
      else add EuclideanDistance(Inst_i, Inst_k) to List_i;
    calculate the percentage θ of the top 10 instances in List_i whose label differs from Inst_i's label;
    if θ >= ε, then A_j = A_j ∪ {Inst_i};
  if N(A_j ∩ A_j-1) / N(A_j) >= δ and N(A_j ∩ A_j-1) / N(A_j-1) >= δ
    break;
return A_j

Figure 9. The pseudo-code of the CLNI algorithm

Figure 10. An illustration of the CLNI algorithm. The high-level idea of CLNI can be illustrated as in Figure 10. The blue points represent clean instances and the white points represent buggy instances. When checking if an instance A is noisy, CLNI first lists all instances that are close to A (the points included in the circle). CLNI then calculates the ratio of instances in the list that have a class label different from that of A (the number of orange points over the total number of points in the circle). If the ratio is over a specific threshold ε, we consider instance A to have a high probability of being a noisy instance.
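A compact sketch of this idea in Python (not the authors' implementation; parameter names are mine): for each instance, look at its nearest neighbours and flag it as likely noise when almost all of them carry the opposite label.

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def clni_noise(instances, labels, top_n=10, epsilon=0.99):
    # Returns indices of instances whose nearest neighbours almost all disagree with their label.
    noisy = []
    for i, (xi, yi) in enumerate(zip(instances, labels)):
        ranked = sorted((euclidean(xi, xj), labels[j])
                        for j, xj in enumerate(instances) if j != i)
        nearest = ranked[:top_n]
        different = sum(1 for _, yj in nearest if yj != yi)
        if different / len(nearest) >= epsilon:
            noisy.append(i)
    return noisy

X = [[1.0, 1.0], [1.1, 1.0], [0.9, 1.2], [5.0, 5.0], [5.1, 4.9], [1.0, 1.1]]
y = [0, 0, 0, 1, 1, 1]              # the last instance looks mislabeled
print(clni_noise(X, y, top_n=3))    # -> [5]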

6.2 Evaluation

We evaluate our noise detection method using data from the Eclipse 3.4 SWT and Debug projects as described in Section 5.2. These two datasets are considered as the "golden sets" as most of their bugs are linked bugs. Following the method described in Section 4.2, we create the noisy datasets for these two projects by selecting a random n% of instances and artificially changing their labels (from buggy to clean and from clean to buggy). We then apply the CLNI algorithm to detect the noisy instances that we have just injected. We use Precision, Recall and F-measures to evaluate the performance in identifying the noisy instances.

Table 3 shows the results when the noise rate is 20%. The Precisions are above 60%, Recalls are above 83% and F-measures are above 0.71. These promising results confirm that the proposed CLNI algorithm is capable of identifying noisy instances.

Table 3. The performance of CLNI in identifying noisy instances

Precision Recall F-measure

Debug 0.681 0.871 0.764

SWT 0.624 0.830 0.712

Figure 11 also shows the performance of CLNI under different noise levels for the SWT component. When the noise rate is below 25%, F-measures increase with the increase of the noise rates. When the noise rate is above 35%, CLNI will have bias toward incorrect instances, causing F-measures to decrease.

Figure 11. Performance of CLNI with different noise rates. After identifying the noises in the noisy Eclipse 3.4 SWT and Debug datasets using CLNI, we eliminate these noises by flipping their class labels. We then evaluate if the noise-removed training set improves prediction accuracy.

The results for the SWT component before and after removing FN and FP noises are shown in Table 4. In general, after removing the noises, the prediction performance improves for all learners, especially for those that do not have strong noise resistance ability. For example, for the SVM learner, when 30% FN&FP noises were injected into the SWT dataset the F-measure was 0.339. After identifying and removing the noises, the F-measure jumped to 0.706. These results confirm that the proposed CLNI algorithm can improve defect prediction performance for noisy datasets.

Table 4. The defect prediction performance after identifying and removing noisy instances (SWT)

Remove Noises? | Noise Rate | Bayes Net | Naïve Bayes | SVM | Bagging
No  | 15% | 0.781 | 0.305 | 0.594 | 0.841
No  | 30% | 0.777 | 0.308 | 0.339 | 0.781
No  | 45% | 0.249 | 0.374 | 0.353 | 0.350
Yes | 15% | 0.793 | 0.429 | 0.797 | 0.838
Yes | 30% | 0.802 | 0.364 | 0.706 | 0.803
Yes | 45% | 0.762 | 0.418 | 0.235 | 0.505

Kim et al., “Dealing with Noise in Defect Prediction,” ICSE 2011

Page 88: Defect, defect, defect: PROMISE 2012 Keynote

Noise detection performance

Precision Recall F-measure

Debug 0.681 0.871 0.764

SWT 0.624 0.830 0.712

(noise level = 20%)

Kim et al., "Dealing with Noise in Defect Prediction," ICSE 2011

Page 89: Defect, defect, defect: PROMISE 2012 Keynote

Bug prediction using cleaned data

Bar chart (SWT): buggy F-measure (0-100) at noise levels 0%, 15%, 30%, 45%, using the noisy data.

Page 90: Defect, defect, defect: PROMISE 2012 Keynote

Bug prediction using cleaned data

Bar chart (SWT): buggy F-measure (0-100) at noise levels 0%, 15%, 30%, 45%, comparing noisy vs. cleaned data.

Page 91: Defect, defect, defect: PROMISE 2012 Keynote

Bug prediction using cleaned data

76% F-measure with 45% noise

Bar chart (SWT): buggy F-measure (0-100) at noise levels 0%, 15%, 30%, 45%, comparing noisy vs. cleaned data.

Page 92: Defect, defect, defect: PROMISE 2012 Keynote

ReLink

A traditional heuristic link miner recovers explicit links between the bug database and the source code repository; ReLink additionally recovers unknown links using features extracted from bugs and changes, and combines them with the explicit links.

Wu et al., "ReLink: Recovering Links between Bugs and Changes," FSE 2011
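A rough sketch of feature-based link recovery (my own simplification of the idea, not ReLink's actual features): score candidate (bug, commit) pairs by time proximity and textual similarity, and accept pairs above thresholds as recovered links.

def text_similarity(a, b):
    # Jaccard similarity over word sets of the bug report text and the commit message.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def recover_links(bugs, commits, max_days=7, min_sim=0.3):
    # bugs: dict bug_id -> (close_time, report_text); commits: dict commit_id -> (commit_time, message).
    links = []
    for bug_id, (bug_time, bug_text) in bugs.items():
        for commit_id, (commit_time, commit_msg) in commits.items():
            close_in_time = abs(commit_time - bug_time) <= max_days
            similar = text_similarity(bug_text, commit_msg) >= min_sim
            if close_in_time and similar:
                links.append((bug_id, commit_id))
    return links

bugs = {28434: (100, "NPE when inserting tab in editor")}
commits = {"c1": (101, "fix NPE when inserting tab"), "c2": (300, "update build script")}
print(recover_links(bugs, commits))  # -> [(28434, 'c1')]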

Page 93: Defect, defect, defect: PROMISE 2012 Keynote

ReLink


Wu et al., “ReLink: Recovering Links between Bugs and Changes,” FSE 2011

Page 94: Defect, defect, defect: PROMISE 2012 Keynote

ReLink Performance

Wu et al., “ReLink: Recovering Links between Bugs and Changes,” FSE 2011

Bar chart: F-measure (0-100) of the traditional heuristics vs. ReLink on the ZXing, OpenIntents, and Apache projects.

Page 95: Defect, defect, defect: PROMISE 2012 Keynote

Label Historical Changes

Development history of a file: Rev 1 → change → ... → Rev 100 → change → Rev 101 (with BUG) → change → Rev 102 (no BUG).

The change from Rev 101 to Rev 102 carries the message "fix for bug 28434", so it is identified as a fix and Rev 101 is labeled as buggy.

Fischer et al., "Populating a Release History Database from Version Control and Bug Tracking Systems," ICSM 2003
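A hedged sketch of this labeling step (in the spirit of Fischer et al. and later SZZ-style work, not their exact heuristics): scan commit messages for bug identifiers and treat matching commits as fixes.

import re

# Matches messages such as "fix for bug 28434" or "Fixed bug #123".
BUG_REF = re.compile(r"\b(?:fix(?:ed|es)?\s+for\s+)?bugs?\s*#?\s*(\d+)", re.IGNORECASE)

def linked_bug_ids(commit_message):
    return [int(m) for m in BUG_REF.findall(commit_message)]

def label_fix_commits(commits):
    # commits: dict revision -> commit message. Returns {revision: [bug ids]} for fix commits.
    fixes = {}
    for rev, message in commits.items():
        ids = linked_bug_ids(message)
        if ids:
            fixes[rev] = ids
    return fixes

commits = {101: "refactor tab handling", 102: "fix for bug 28434"}
print(label_fix_commits(commits))  # -> {102: [28434]}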

Page 96: Defect, defect, defect: PROMISE 2012 Keynote

Atomic Change

Rev 101 (with BUG): ... setText("\t") ...
Rev 102 (no BUG): ... insertTab() ...
The fix is a single atomic change, with the message "fix for bug 28434".

Fischer et al., "Populating a Release History Database from Version Control and Bug Tracking Systems," ICSM 2003

Page 97: Defect, defect, defect: PROMISE 2012 Keynote

Composite Change

does meet such expectation. In particular, it potentially offers the following benefits:

• Decomposition helps developers distinguish different concerns within a composite change.

Figure 5 shows the diff for JFreeChart revision 1083, which touches one source code file and modifies six lines of code. According to its commit log, this revision fixes bug 1864222 by modifying the method createCopy to handle an empty range. Our approach decomposed this revision into three change-slices. One change-slice contains only line 944, which is the exact fix intended for this change. The second change-slice, line 677, is a minor fix irrelevant to the intended fix at line 944. The third change-slice contains lines 973, 974, 978 and 979, which are trivial formatting improvements. This accurate decomposition could help developers effectively distinguish between sub-changes with different concerns.

• Decomposition draws the developer's attention to inconspicuous change, which might actually be an important one.

When reviewing a change with tens of changed lines or more, a developer might quickly skim through the change to get a general idea instead of staying focused on every single changed line. Now, if one or two lines in this change in fact address different issues, the developer is likely to overlook them and thus miss these additional intentions. We find that when decomposition is applied, such inconspicuous change can be easily isolated as one independent change-slice. For example, Xerces revisions 730796, 819653 and 944964 were decomposed into 2, 10 and 2 change-slices, respectively. Figure 6 shows, for each of them, one change-slice that contains only one line of code change. Although these one-line changes are likely minor fixes or just for perfective purpose, the developer might still want to be aware of them during her change review.

Sometimes an inconspicuous change is not necessarily a minor one. Figure 7 shows one of the three change-slices for JFreeChart revision 1366, where a field is set to true instead of false. Figure 7 also shows one of the two change-slices for JFreeChart revision 1801. The one-line change here fixes a subtle bug in an incorrect if-condition where a parenthesis for the "||" operation is missing. Although both fixes are small, they could have significant impact on the program behavior. Like in the second case, a missing parenthesis messes up the operation precedence and flips the if-condition value, which could cause the program to end up with wrong results or even crash. However, due to its small size, such a fix might slip through the developer's eyes while she is busy reviewing other noticeable changed blocks. Our approach isolates such a small but critical fix from the composite change, and presents it to the developer as an individual change-slice. Hence, the developer can be fully aware of such important changes.

• Decomposition reveals questionable changes.

Our approach decomposed Commons Math revision 943068 into three change-slices. While the first two change-slices faithfully did what the commit log mentioned to do, the third change-slice caught our attention (Figure 8). In this change-slice, the parameter in the method setForce(boolean) is renamed from force to forceOverwrite. Then this method is directly called by a newly added method setOverwrite(boolean). The intention here is probably adding a setter for the field forceOverwrite, but the change seems to fail on this purpose: the renamed parameter does not affect the assignment at (*). We suspected that force at the right hand side of the assignment should also be changed to forceOverwrite.

We then searched the subsequent revisions for more evidence of our speculation. As expected, the author of this change did fix this bug later in revision 943070, along with a commit log saying "wrong assignment after I renamed the parameter. Unfortunately there doesn't seem to be a testcase that

Figure 5. JFreeChart revision 1083 (diff touching addOrUpdate, createCopy, and equals in TimeSeries).

Figure 6. Examples of inconspicuous and minor changes (one change-slice each from Xerces revisions 730796, 819653, and 944964).

Figure 7. Examples of inconspicuous yet important changes (change-slices from JFreeChart revisions 1366 and 1801).

Figure 8. The third change-slice from Commons Math revision 943068 reveals questionable change (setForce/setOverwrite).

JFree revision 1083

Tao et al, “"How Do Software Engineers Understand Code Changes?” FSE 2012

hunk 1

hunk 2

hunk 3

hunk 4

Page 98: Defect, defect, defect: PROMISE 2012 Keynote

Defect Prediction 2.0

Finer Granularity

New Customers

Noise Handling

Page 99: Defect, defect, defect: PROMISE 2012 Keynote

Warning Developers

“Safe” Files (predicted as not buggy)

“Risky” Files (predicted as buggy)

Page 100: Defect, defect, defect: PROMISE 2012 Keynote

Change Classification

Development history of a file: Rev 1 → change → Rev 2 → change → Rev 3 → change → Rev 4

Kim et al., "Classifying Software Changes: Clean or Buggy?" TSE 2009

Page 101: Defect, defect, defect: PROMISE 2012 Keynote

Change Classification

Two file histories (Rev 1 → change → Rev 2 → change → Rev 3 → change → Rev 4): one predicted as “Safe” Files, the other as “Risky” Files.

Page 102: Defect, defect, defect: PROMISE 2012 Keynote

Change Classification

Two file histories (Rev 1 → change → Rev 2 → change → Rev 3 → change → Rev 4): one predicted as “Safe” Files, the other as “Risky” Files.

Page 103: Defect, defect, defect: PROMISE 2012 Keynote

Defect prediction-based Change Classification

Bar chart: F-measure (0-0.80) of Change Classification (CC) vs. Cached CC on the Debug UI, JDT, JEdit, PDE, POI, and Team UI projects.

Page 104: Defect, defect, defect: PROMISE 2012 Keynote

Warning Developers

“Safe” Location (predicted as not buggy)

“Risky” Location (predicted as buggy)

Page 105: Defect, defect, defect: PROMISE 2012 Keynote

Test-case Selection

Page 106: Defect, defect, defect: PROMISE 2012 Keynote

Test-case Selection

Executing test cases

Page 107: Defect, defect, defect: PROMISE 2012 Keynote

Test-case Selection

Runeson and Ljung, “Improving Regression Testing Transparency and Efficiency with History-Based Prioritization,” ICST 2011

Chart: APFD (0-1.00) across releases R1.0-R1.5 for Baseline, History1, and History2.

Page 108: Defect, defect, defect: PROMISE 2012 Keynote

Warning Prioritization

Page 109: Defect, defect, defect: PROMISE 2012 Keynote

Warning Prioritization

Page 110: Defect, defect, defect: PROMISE 2012 Keynote

Warning Prioritization

0"

2"

4"

6"

8"

10"

12"

14"

16"

18"

0" 20" 40" 60" 80" 100"

Pre

cision

)(%))

Warning)Instances)by)Priority)

History"

Tool"

Kim and Ernst, “Which Warnings Should I Fix First?” FSE 2007
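A small sketch of history-based warning prioritization in the spirit of this result (my own simplification): estimate, per warning category, how often such warnings were removed by past bug fixes, and sort new warnings by that rate.

from collections import Counter

def category_fix_rates(historical_warnings):
    # historical_warnings: list of (category, was_removed_by_bug_fix: bool).
    seen, fixed = Counter(), Counter()
    for category, was_fixed in historical_warnings:
        seen[category] += 1
        fixed[category] += int(was_fixed)
    return {c: fixed[c] / seen[c] for c in seen}

def prioritize(new_warnings, rates):
    # new_warnings: list of (category, location); highest historical fix rate first.
    return sorted(new_warnings, key=lambda w: rates.get(w[0], 0.0), reverse=True)

history = [("NULL_DEREF", True), ("NULL_DEREF", True), ("STYLE", False),
           ("STYLE", False), ("RESOURCE_LEAK", True), ("RESOURCE_LEAK", False)]
rates = category_fix_rates(history)
print(prioritize([("STYLE", "Foo.java:10"), ("NULL_DEREF", "Bar.java:42")], rates))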

Page 111: Defect, defect, defect: PROMISE 2012 Keynote

Other Topics

• Explanation: why has it been predicted as defect-prone?

• Cross-project prediction

• Cost effectiveness measures

• Active Learning/Refinement

Page 112: Defect, defect, defect: PROMISE 2012 Keynote

New metrics

Algorithms

Coarse granularity 1.0

Defect Prediction 2.0

Page 113: Defect, defect, defect: PROMISE 2012 Keynote

Defect Prediction 2.0

1.0: new metrics, algorithms, coarse granularity → 2.0: new customers, noise handling, finer granularity

Page 114: Defect, defect, defect: PROMISE 2012 Keynote

Defect Prediction 2.0

1.0: new metrics, algorithms, coarse granularity → 2.0: new customers, noise handling, finer granularity

Page 115: Defect, defect, defect: PROMISE 2012 Keynote

2013

Page 116: Defect, defect, defect: PROMISE 2012 Keynote

MSR 2013: Back to roots

Massimiliano Di Penta and Sung Kim, Program co-chairs

Tom Zimmermann, General chair

Alberto Bacchelli, Mining Challenge Chair

Page 117: Defect, defect, defect: PROMISE 2012 Keynote

MSR 2013: Back to roots

Massimiliano Di Penta and Sung Kim, Program co-chairs

Tom Zimmermann, General chair

Alberto Bacchelli, Mining Challenge Chair

15 February

Page 118: Defect, defect, defect: PROMISE 2012 Keynote

Some slides/data are borrowed with thanks from

• Tom Zimmermann, Chris Bird

• Andreas Zeller

• Ahmed Hassan

• David Lo

• Jaechang Nam, Yida Tao

• Tien Nguyen

• Steve Counsell, David Bowes, Tracy Hall and David Gray

• Wen Zhang