Assisting Code Search with Automatic Query Reformulation for Bug Localization
Bunyamin Sisman & Avinash Kak
[email protected] | [email protected]
Purdue University
MSR 2013, San Francisco, CA

Page 1

Assisting Code Search with Automatic Query Reformulation for Bug Localization

Bunyamin Sisman & Avinash Kak
[email protected] | [email protected]

Purdue University

MSR’13

Page 2

Outline

I. Motivation

II. Past Work on Automatic Query Reformulation

III. Proposed Approach: Spatial Code Proximity (SCP)

IV. Data Preparation

V. Results

VI. Conclusions

Page 3

Main Motivation

It is extremely common for software developers to use arbitrary abbreviations & concatenations in software. These are generally difficult to predict when searching the code base of a project.

The question is “Is there some way to automatically reformulate a user’s query so that all such relevant terms are also used in retrieval?”

Page 4

Summary of Our Contribution

We show how a query can be automatically reformulated for superior retrieval accuracy

We propose a new framework for Query Reformulation, which leverages the spatial proximity of the terms in files

The approach leads to significant improvements over the baseline and the competing Query Reformulation approaches

Page 5

Summary of Our Contribution

Our approach preserves or improves the retrieval accuracy for 76% of the 4,393 bugs we analyzed for the Eclipse and Chrome projects

Our approach improves the retrieval accuracy for 42% of the 4,393 bugs

Improvements are 66% for Eclipse and 90% for Chrome in terms of MAP (Mean Average Precision)

We also describe the conditions under which Query Reformulation may perform poorly.

Page 6

Query Reformulation with Relevance Feedback

1. Perform an initial retrieval with the original query

2. Analyze the set of top retrieved documents vis-à-vis the query

3. Reformulate the query
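The three steps above can be sketched as a toy pseudo-relevance-feedback loop. The scoring function and term-selection heuristic below are illustrative placeholders, not the paper's actual retrieval model:

```python
from collections import Counter

def retrieve(query_terms, index, k=10):
    """Toy ranking: score each file by the number of query-term
    occurrences it contains (a stand-in for a real retrieval model)."""
    scores = {doc: sum(terms.count(t) for t in query_terms)
              for doc, terms in index.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def pseudo_relevance_feedback(query_terms, index, top_n=2, expand_by=2):
    top_docs = retrieve(query_terms, index, k=top_n)           # 1. initial retrieval
    counts = Counter(t for d in top_docs for t in index[d]
                     if t not in query_terms)                  # 2. analyze top docs
    expansion = [t for t, _ in counts.most_common(expand_by)]  # 3. reformulate
    return query_terms + expansion

# Hypothetical mini-corpus of tokenized source files
index = {
    "tab_strip.cc": ["browser", "tab", "strip", "animation", "pinned"],
    "window.cc":    ["browser", "window", "frame"],
    "net.cc":       ["socket", "proxy"],
}
print(pseudo_relevance_feedback(["browser", "animation"], index))
```

The expanded query keeps the original terms and appends the most frequent new terms from the top-ranked files.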

Page 7

Acquiring Relevance Feedback

Implicitly: infer feedback from user interactions

Explicitly: user provides feedback [Gay et al. 2009]

Pseudo Relevance Feedback (PRF): Automatic QR

This is our work!

Page 8

Data Flow in the Proposed Retrieval Framework

Page 9

Automatic Query Reformulation

No user involvement!

It takes less than a second to reformulate a query on ordinary desktop hardware!

It is cheap!

It is effective!

Page 10

Previous Work on Automatic QR (for Text Retrieval)

Rocchio’s Formula (ROCC)

Relevance Model (RM)
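Rocchio's formula re-weights the query toward the centroid of the feedback documents, q' = αq + (β/|Dr|) Σ d over d in Dr (the negative-feedback term is omitted here). A minimal sketch with illustrative α and β values and hypothetical term vectors:

```python
def rocchio(query_vec, relevant_vecs, alpha=1.0, beta=0.75):
    """Move the query vector toward the centroid of the
    (pseudo-)relevant document vectors."""
    terms = set(query_vec)
    for v in relevant_vecs:
        terms |= set(v)
    centroid = {t: sum(v.get(t, 0.0) for v in relevant_vecs) / len(relevant_vecs)
                for t in terms}
    return {t: alpha * query_vec.get(t, 0.0) + beta * centroid[t]
            for t in terms}

q = {"browser": 1.0, "animation": 1.0}
docs = [{"browser": 0.4, "tab": 0.6}, {"animation": 0.5, "strip": 0.5}]
new_q = rocchio(q, docs)
```

Original query terms keep the highest weights, while terms frequent in the feedback documents enter the query with smaller ones.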

Page 11

The Proposed Approach to QR: Spatial Code Proximity (SCP)

Spatial Code Proximity is an elegant approach to giving greater weights to terms in source code that occur in the vicinity of the terms in a user's query

Proximities may be created through commonly used concatenations:

Punctuation characters

Underscores: tab_strip_gtk

Camel casing: kPinnedTabAnimationDurationMs
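These splitting rules can be sketched with a small tokenizer, shown on the slide's own examples; the regular expressions below are illustrative, not the paper's:

```python
import re

def split_identifier(ident):
    """Split an identifier on underscores/punctuation, then on
    camel-case boundaries; return lowercase tokens."""
    parts = re.split(r"[_\W]+", ident)
    camel = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")
    return [tok.lower() for p in parts for tok in camel.findall(p)]

print(split_identifier("tab_strip_gtk"))
print(split_identifier("kPinnedTabAnimationDurationMs"))
```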

Page 12

Spatial Code Proximity (SCP) (Cont’d)

Tokenize source files and index the positions of the terms in each source file:

Use the distance between terms to find relevant terms vis-à-vis a query!
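A positional index of this kind might be built as follows; the file contents are hypothetical and the structure is a minimal sketch:

```python
from collections import defaultdict

def build_positional_index(files):
    """Map term -> file -> list of token positions in that file."""
    index = defaultdict(lambda: defaultdict(list))
    for name, tokens in files.items():
        for pos, term in enumerate(tokens):
            index[term][name].append(pos)
    return index

files = {"tab_strip_gtk.cc": ["tab", "strip", "gtk", "pinned", "tab", "animation"]}
idx = build_positional_index(files)
print(idx["tab"]["tab_strip_gtk.cc"])   # positions of "tab" in the file
```

With positions recorded, the distance between any two term occurrences is a simple subtraction.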

Page 13

SCP: Bringing the Query into the Picture


Example Query: “Browser Animation”

First perform an initial retrieval with the original query

Then increase the weights of the terms that occur near the query terms in the top-retrieved files!
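This re-weighting step might look like the following sketch; the window size and counting scheme are illustrative, not the paper's exact SCP formula:

```python
def proximity_weights(tokens, query_terms, window=2):
    """Weight each non-query term by how often it appears within
    `window` positions of a query-term occurrence."""
    q_positions = [i for i, t in enumerate(tokens) if t in query_terms]
    weights = {}
    for i, t in enumerate(tokens):
        if t in query_terms:
            continue
        near = sum(1 for q in q_positions if abs(q - i) <= window)
        if near:
            weights[t] = weights.get(t, 0) + near
    return weights

# Hypothetical token stream from a retrieved file
tokens = ["browser", "tab", "strip", "animation", "duration", "ms"]
print(proximity_weights(tokens, {"browser", "animation"}))
```

Terms sandwiched between the two query terms ("tab", "strip") receive higher weights than terms near only one of them.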

Page 14

Research Questions

Question 1: Does the proposed QR approach improve the accuracy of source code retrieval? If so, to what extent?

Question 2: How do the QR techniques that are currently in the literature perform for source code retrieval?

Question 3: How does the initial retrieval performance affect the performance of QR?

Question 4: What are the conditions under which QR may perform poorly?

Page 15

Data Preparation

For evaluation, we need a set of queries and the relevant files

We use the titles of the bug reports as queries

We have to link the repository commits to the bug tracking database! We used regular expressions to detect bug-fix commits based on their commit messages
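A pattern along these lines could flag bug-fix commits and extract the linked bug ID; the expression below is a hypothetical example, not the one used in the paper:

```python
import re

# Illustrative bug-fix detector: matches "fix"/"fixes"/"fixed"/"bug"
# followed by a numeric bug ID somewhere in the commit message.
BUG_FIX_RE = re.compile(r"\b(?:fix(?:es|ed)?|bug)\b.*?#?(\d+)", re.IGNORECASE)

def linked_bug_id(commit_message):
    m = BUG_FIX_RE.search(commit_message)
    return int(m.group(1)) if m else None

print(linked_bug_id("Fix for bug 123456: NPE in TabStrip"))
print(linked_bug_id("Refactor build scripts"))
```

Commits whose messages match yield a bug ID that can be joined against the bug tracker; non-matching commits are skipped.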

Page 16

Data Preparation (Cont’d)

Resulting dataset: BUGLinks [1]

                        Eclipse v3.1   Chrome v4.0
#Bugs                   4,035          358
Avg. #Relevant Files    2.76           3.82
Avg. #Commits           1.36           1.23

[1] https://engineering.purdue.edu/RVL/Database/BUGLinks/

Page 17

Evaluation Framework

We use Precision- and Recall-based metrics to evaluate the retrieval accuracy.

Determine the query sets for which the proposed QR approaches lead to

1. improvements in the retrieval accuracy

2. degradation in the retrieval accuracy

3. no change in the retrieval accuracy

Analyze these sets to understand the characteristics of the queries each set contains

Page 18

Evaluation Framework (Cont’d)

For comparison of these sets, we used the following Query Performance Prediction (QPP) metrics [Haiduc et al. 2012, He et al. 2004]:

Average Inverse Document Frequency (avgIDF)

Average Inverse Collection Term Frequency (avgICTF)

Query Scope (QS)

Simplified Clarity Score (SCS)

Additionally, we analyzed

Query Lengths

Number of Relevant files per bug
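Two of these pre-retrieval predictors can be sketched as follows, using common simplified forms of their definitions; `index` here is a hypothetical term-to-documents map, not the paper's data:

```python
import math

def avg_idf(query_terms, index, n_docs):
    """avgIDF: mean inverse document frequency of the query terms.
    `index` maps term -> set of documents containing it."""
    idfs = [math.log(n_docs / max(len(index.get(t, ())), 1))
            for t in query_terms]
    return sum(idfs) / len(idfs)

def query_scope(query_terms, index, n_docs):
    """QS: fraction of the collection containing at least one query term."""
    docs = set().union(*(index.get(t, set()) for t in query_terms))
    return len(docs) / n_docs

index = {"browser": {"a.cc", "b.cc"}, "animation": {"a.cc"}}
print(avg_idf(["browser", "animation"], index, n_docs=4))
print(query_scope(["browser", "animation"], index, n_docs=4))
```

Rarer query terms push avgIDF up (a sign of a more discriminative query), while a small Query Scope means the query touches only a narrow slice of the collection.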

Page 19

QR with Bug Report Titles

[Bar chart: number of bugs improved vs. worsened (y-axis: #Bugs, 0 to 2000) for ROCC, RM, and SCP (Proposed)]

Page 20

Improvements in Retrieval Accuracy (% Increase in MAP)

[Bar chart: % increase in MAP for Eclipse and Chrome (y-axis: 0% to 10000%) for ROCC, RM, and SCP (Proposed)]

Page 21

Conclusions & Future Work

Our framework can use a weak initial query as a jumping-off point for a better query.

No user input is necessary

We obtained significant improvements over the baseline and the well-known Automatic QR methods.

Future work includes evaluating different term-proximity metrics in source code for QR

Page 22

References

[1] B. Sisman and A. Kak, “Incorporating version histories in information retrieval based bug localization,” in Proceedings of the 9th Working Conference on Mining Software Repositories (MSR’12). IEEE, 2012, pp. 50–59.

[2] G. Gay, S. Haiduc, A. Marcus, and T. Menzies, “On the use of relevance feedback in IR-based concept location,” in International Conference on Software Maintenance (ICSM’09), Sept. 2009, pp. 351–360.

[3] A. Marcus, A. Sergeyev, V. Rajlich, and J. I. Maletic, “An information retrieval approach to concept location in source code,” in Proceedings of the 11th Working Conference on Reverse Engineering (WCRE’04). IEEE Computer Society, 2004, pp. 214–223.

Page 23

References

[4] S. Haiduc, G. Bavota, R. Oliveto, A. De Lucia, and A. Marcus, “Automatic query performance assessment during the retrieval of software artifacts,” in Proceedings of the 27th International Conference on Automated Software Engineering (ASE’12). ACM, 2012, pp. 90–99.

[5] B. He and I. Ounis, “Inferring query performance using pre-retrieval predictors,” in Proc. Symposium on String Processing and Information Retrieval. Springer-Verlag, 2004, pp. 43–54.