
Page 1: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Detecting Bug Duplicate Reports Through Locality of Reference

Tomi Prifti, Sean Banerjee, Bojan Cukic

Lane Department of CSEEWest Virginia UniversityMorgantown, WV, USA

September 2011

Page 2: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Presentation Outline

• Introduction

• Goals

• Related Work

• Understanding the Firefox Repository

• Experimental Setup

• Results

• Summary

Page 3: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Introduction

• Bug tracking systems are essential for software maintenance and testing

• Developers and ordinary users can report failure occurrences

• Advantages:
– Users are involved in error reporting
– Direct impact on software quality

• Disadvantages:
– Large number of reports arrive on a daily basis
– Significant effort to triage
– Users may submit many duplicate reports

Page 4: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

A typical bug report

Page 5: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Goals

• Comprehensive empirical analysis of a large bug report dataset.

• Creation of a search tool
– Encourage users to search the repository
– Avoid duplicate report submissions whenever possible
– Assist with report triage

• Build a list of reports possibly describing the same problem

• Let a triager examine the suggested list

Page 6: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Related Work

• Providing Triagers with a Suggested List
– Provide a suggested list of similar bugs to triagers for examination
• Wang et al. exploit NLP techniques and execution information
• Duplicate detection rates as high as 67%–93%

• Semi-automated Filtering
– Determine the type of the report (duplicate or primary); if the new report is classified as a duplicate, filter it out
• Jalbert et al. use text semantics and a graph clustering technique to predict duplicate status
• Filtered out only 8% of duplicate reports

Page 7: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Related Work

• Semi-automated Assignment
– Apply text categorization techniques to predict the developer who should work on the bug
• Cubranic et al. apply supervised Bayesian learning; correctly classify 30% of the reports
• Anvik et al. use a supervised machine learning algorithm; precision rates of 57% and 64% for Firefox and Eclipse

• Improving Report Quality
– Duplicate reports are not considered harmful

• Bettenburg et al. developed a tool, called CUEZILLA, that measures the quality of bug reports in real time

• “Steps to reproduce” and “Stack traces” are the most useful information in bug reports

Page 8: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Related Work

• Bugzilla Search Tool
– Bugzilla 4.0, released around February 2011, provides duplicate detection
– The tool performs a Boolean full-text search on the title over the entire repository
– Generates a dozen or so reports that may match at least one of the search terms
– In some instances, testing with the exact title of an existing report did not return the report itself
– Unknown accuracy of reported matches

Page 9: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Firefox Repository

• Firefox releases: 1.0.5, 1.5, 2.0, 3.0, 3.5 and the current version 3.6 (as of June 2010).

• 65% of reports reside in groups of one.
• 90% of duplicates are distributed in small groups of 2–16 reports.

Page 10: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Time Interval Between Reports

• Many bugs receive the first duplicate within the first few months of the original report.

Page 11: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Experimental Setup

• Tokenization - “Bag-of-Words”

• Stemming - reducing words to their root

• Stop Words Removal

– Lucene API used for pre-processing

• Term Frequency/Inverse Document Frequency (TF/IDF) used for weighting words

• Cosine Similarity used for similarity measures

Example of tokenizing, stemming and stop word removal:
“Sending email is not functional.” → “send email function”
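
The slides mention the Lucene API for pre-processing; the sketch below is a hypothetical Python equivalent (NLTK's Porter stemmer and a made-up stop word list stand in for the authors' configuration) that reproduces the example above.

```python
# Illustrative pre-processing sketch; the stop word list and stemmer are
# stand-ins, not the authors' Lucene configuration.
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"is", "not", "the", "a", "an", "to", "of", "in", "on"}  # tiny illustrative list
stemmer = PorterStemmer()

def preprocess(summary):
    """Tokenize into a bag of words, drop stop words, and stem each token."""
    tokens = re.findall(r"[a-z0-9]+", summary.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Sending email is not functional."))  # -> ['send', 'email', 'function']
```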

Page 12: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Experimental Procedure

• Start with the initial 50% of reports as historical information
• The group containing the most recent primary or duplicate is at the top of the initial list
• Build the suggested list using IR techniques
• As the experiment progresses, the historical repository grows
• Continue until all reports are classified as duplicate or primary

• If a bug is a primary, it is forwarded to the repository

• This may not be realistic, as triagers may misjudge reports
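
A minimal sketch of that evaluation loop, assuming a hypothetical `suggest` function that returns the IR-based suggested list and report objects with `is_duplicate` and `group_id` fields (none of these names come from the paper):

```python
# Hypothetical sketch of the evaluation procedure; `suggest`, `is_duplicate`
# and `group_id` are assumed names, not the authors' implementation.
def run_experiment(reports, suggest, list_size=20):
    split = len(reports) // 2
    history = list(reports[:split])            # initial 50% used as historical information
    hits = total_duplicates = 0
    for report in reports[split:]:             # process the rest in chronological order
        candidates = suggest(report, history, list_size)
        if report.is_duplicate:
            total_duplicates += 1
            if report.group_id in {c.group_id for c in candidates}:
                hits += 1                      # the correct group appeared in the suggested list
        history.append(report)                 # repository grows as the experiment progresses
    return hits / total_duplicates             # recall rate (see next slide)
```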

Page 13: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Measuring Performance

• Performance of the bug search tool is measured by the recall rate

– Nrecalled refers to the number of duplicate reports correctly classified

– Ntotal refers to the total number of duplicate reports
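
Written out from the two definitions above, the recall rate is:

```latex
\text{recall rate} = \frac{N_{\text{recalled}}}{N_{\text{total}}}
```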

Page 14: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Approach methodology

• Reporters query the repository.

• Use “title” (summary) to compare reports.

• Four experiments:
– TF/IDF
– “Sliding Window” - TF/IDF
– “Sliding Window” - Group Centroids - TF/IDF
– “Sliding Window” - Group Centroids

• The centroid contains every unique term from all reports in the group; each term's weight is the sum of its frequencies across the reports, divided by the number of reports in the group.

Page 15: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Sliding Window Defined

• “Sliding-Window” approach: keep a window of fixed size n
– Sort all groups by the time elapsed between their last report and the new incoming report
– Select the top n groups (n = 2000 is optimal; analysis shows 95% of duplicates fall within these groups)
– Apply IR techniques only to the top n groups
– Build a short list of the top m most similar reports to present to the triager/reporter
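
A minimal sketch of the window selection, assuming each group records the timestamp of its most recent report (names are illustrative, not the authors' code):

```python
# Select the n groups whose most recent report is closest in time to the new report;
# IR techniques are then applied only to reports in these groups.
def window_groups(groups, new_report_time, n=2000):
    ranked = sorted(groups,
                    key=lambda g: new_report_time - g.last_report_time)  # smallest gap first
    return ranked[:n]
```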

Page 16: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Experimental Results

• Our results demonstrate that Time-Window/Group Centroid and report summaries predict duplicate problem reports with a recall rate of up to 53%.

Page 17: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Performance and Runtime

• Large variance in recall rate initially. Time window approach stabilizes, while TF/IDF degrades.

• Classification run time is faster for the Time Window approach; each additional report increases computation time for plain TF/IDF

Page 18: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Result Comparisons

Group | Approach | Results
Hiew et al. | Text analysis | Recall rate ~50%
Cubranic et al. | Bayesian learning, text categorization | Correctly predicted ~30% of duplicates
Jalbert et al. | Text similarity, clustering | Recall rate ~51% (list size 20)
Wang et al. | NLP, execution information | 67–93% detection rate (43–72% with NLP alone)
Wang et al. | Enhanced version of prior algorithm | 17–31% improvement over state of the art
Our approach | Time Window / Centroids | ~53% recall rate

Page 19: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Threats to Validity

• Assumption that the ground truth on duplicates is correct
– The life cycle of a bug is ever-changing

– Some reports change state multiple times

Page 20: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Summary and Future Work

SUMMARY
• Comprehensive study of long-term duplicate trends in a large, open source project.
• Improved duplicate detection search features by providing a suggested list.
• The time interval between reports can be used to narrow the search space.

FUTURE WORK
• Compare with other projects (e.g., Eclipse) to generalize the approach.
• Study the effects on duplicate propagation when a user incorrectly selects a report from the suggested list.

Page 21: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

TF/IDF

• Compare the vector representing a new report to every vector currently in the database.

• Vectors in the database are weighted using TF/IDF to emphasize rare words.

• The reports are ranked based on their cosine-similarity scores.

• Report ranking is used to build the suggested list presented to the user.

• Run time impacted as repository size grows.
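
An illustrative scikit-learn version of this step (the authors used their own Lucene-based pipeline, so this is only a sketch of the ranking idea):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def suggested_list(new_summary, repository_summaries, top_m=20):
    """Rank every repository report by cosine similarity to the new report's summary."""
    vectorizer = TfidfVectorizer()                        # TF/IDF emphasizes rare words
    matrix = vectorizer.fit_transform(repository_summaries)
    query = vectorizer.transform([new_summary])
    scores = cosine_similarity(query, matrix)[0]
    ranking = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return ranking[:top_m]                                # indices of the most similar reports
```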

Page 22: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Sliding Window - TF/IDF

• Apply time window to limit groups under consideration for search.

• Only the reports within 2,000 groups are considered.

• Reports are weighted using TF/IDF.

• Scoring and building of the suggested list are the same as in plain TF/IDF

Page 23: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Sliding Window – Centroid

• Same time window.
• Reports from the 2,000 groups are not individually searched and weighted using TF/IDF.
• A centroid vector representing each group is used instead.
• Example:

– Summary 1: unable send email
– Summary 2: send email function
– Summary 3: send email after enter recipient
– The resulting centroid of the group is: 1.0 send, 0.33 unable, 1.0 email, 0.33 function, 0.33 after, 0.33 enter, 0.33 recipient.
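
A short sketch that reproduces the centroid above from the three (already pre-processed) summaries; the function name is illustrative:

```python
from collections import Counter

def centroid(group_summaries):
    """Sum each term's frequency over all reports in the group, then divide by group size."""
    totals = Counter()
    for terms in group_summaries:
        totals.update(terms)
    return {term: count / len(group_summaries) for term, count in totals.items()}

group = [["unable", "send", "email"],
         ["send", "email", "function"],
         ["send", "email", "after", "enter", "recipient"]]
print(centroid(group))
# 'send' and 'email' appear in every report (1.0); the remaining terms appear once (0.33)
```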

Page 24: PROMISE 2011: "Detecting Bug Duplicate Reports through Locality of Reference"

Sliding Window – Centroid – TF/IDF

• Uses centroid technique described before.

• Weight each term in the centroid using the TF/IDF weighting scheme.