Recommendation Systems for Code Reuse

Tao Xie

Department of Computer Science

North Carolina State University

Raleigh, USA

Motivation

• Programmers commonly reuse APIs of existing frameworks or libraries

– Advantages: low cost and high efficiency of development

– Challenges: complexity and lack of documentation

E.g., searching for information takes nearly ¼ of developer time [metallect.com]

Example Task from Eclipse Programming

Task: How to parse code in a dirty editor of Eclipse?

Query: “IEditorPart -> ICompilationUnit”

[Figure: method-invocation sequences (MIS 1 … MIS k) are extracted from open source projects 1 … N, mined into frequent method-invocation sequences (FMIS 1 … FMIS n), and recommended to the programmer.]

*MIS: Method-invocation sequence, FMIS: Frequent MIS

PARSEWeb [Thummalapenta&Xie ASE 07]

Scenario 1

• While reusing APIs of existing open source frameworks or libraries, programmers often

– know what type of object they need

– but do not know how to write code for getting that object

Query: “Source -> Destination”

How to use these APIs?

Prospector [Mandelin et al. PLDI 05 ], XSnippet [Sahavechaphan&Claypool OOPSLA 06 ], PARSEWeb [Thummalapenta&Xie ASE 07]

Example Task from Eclipse Programming

• Task: How to parse code in a dirty editor?

• Query: IEditorPart -> ICompilationUnit

• Example solution from Prospector/PARSEWeb:

    IEditorPart iep = ...
    IEditorInput editorInp = iep.getEditorInput();
    IWorkingCopyManager wcm = JavaUI.getWorkingCopyManager();
    ICompilationUnit icu = wcm.getWorkingCopy(editorInp);

• Difficulties:

a. Needs an instance of IWorkingCopyManager

b. Needs to invoke a static method of JavaUI for getting the preceding instance

Prospector [Mandelin et al. PLDI 05 ], XSnippet [Sahavechaphan&Claypool OOPSLA 06 ], PARSEWeb [Thummalapenta&Xie ASE 07]

Scenario 2

• While reusing APIs of existing open source frameworks or libraries, programmers often

– know what method call they need

– but do not know how to write code before and after this method call

Query: “Method name”

How to use these APIs?

MAPO [Xie&Pei MSR 06]

Example Task from BCEL Programming

• Task: How to instrument the bytecode of a Java class by adding an extra method to the class?

• Query: org.apache.bcel.generic.ClassGen public void addMethod(Method m)

• Example solution from MAPO:

    public void generateStubMethod(ClassGen c) {
        InstructionList il = new InstructionList();
        MethodGen m = genFromISList(il);
        m.setMaxLocals();
        m.setMaxStack();
        c.addMethod(m.getMethod());
        System.out.println("…");
        …
    }

MAPO [Xie&Pei MSR 06]

Scenario 3

• While reusing APIs of existing open source frameworks or libraries, programmers often

– know structural context such as a class's type, its parents, and fields' types, a method's signature, and method or constructor callees

– but do not know how to write code in this context

Query: Structural context

How to use these APIs?

Strathcona [Holmes et al. 05], XSnippet [Sahavechaphan&Claypool OOPSLA 06 ]

9

Example Task from HttpClient Programming

• Task: How to evolve a system to use a third party library, HttpClient, for handling http connections?

• Query: HttpClient, PostMethod classes

• Example solution from Strathcona:

Strathcona [Holmes et al. 05], XSnippet [Sahavechaphan&Claypool OOPSLA 06 ]

10

Steps in Recommenders

• Data collection/extraction

• Data preprocessing

• Data analysis/mining

• Result postprocessing

• Result representation

Data Collection/Extraction

• From one or multiple local code repositories

– Often followed by offline analysis or mining

– Challenges: lack of relevant code examples

– Ex.: Strathcona, Prospector, XSnippet

• From the whole open source world with a code search engine!

– Often followed by on-the-fly analysis and mining

– Challenges: only partial code files

– Ex.: MAPO, PARSEWeb

Exploiting A Code Search Engine

• Accepts queries including keywords of class and/or method names

• Interacts with a code search engine such as Google Code Search to gather related code samples

• Stores gathered code samples (source files) in a local code repository (to be analyzed and mined later)

• Challenges: gathered code samples are partial and not compilable, because code search engines retrieve individual source files instead of entire projects

PARSEWeb [Thummalapenta&Xie ASE 07]

Available Code Search Engines

• Google Code Search: http://www.google.com/codesearch

• Krugle: http://www.krugle.com/

• Koders: http://www.koders.com/

• Codase: http://www.codase.com/

• JExamples: http://www.jexamples.com/

etc.

Why not use just code search engines?

What Are Developers Searching For?

• 15 million queries of Windows Live Search from May 2006

• 339 sessions related to Java programming

• 117 API sessions (34.2%); 70 trouble-shooting sessions (20.6%)

Assieme [Hoffmann et al. UIST 07]

API-related Search Sessions

• 64.1% of sessions contained queries that were merely descriptive and did not contain actual names of APIs, packages, types, or members.

• The remaining sessions contained

– API or package names (12.8%)

– Type names (17.9%)

– Method names (5.1%)

• Among all these API-related sessions, 17.9% contained terms like “example”, “using”, or “sample code”

Assieme [Hoffmann et al. UIST 07]

An Example 4-Query Session

• java JSP current date

• java JSP currentdate

• java SimpleDateFormat

• using currentdate in jsp

Assieme [Hoffmann et al. UIST 07]

Why Not Use Web Search Engines?

Query: parse xml java

Problems annotated on the result pages:

• Only compatible with new Java versions

• Requires installation of an external library, but provides no link

• Code on the pages is essentially the same

• Contains no code examples

©Raphael Hoffmann, Assieme [Hoffmann et al. UIST 07]

Code Search Engines

    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    public class JAXPSample {
      public static void main(String[] args) {
        String filename = "sample.xml";
        try {
          DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
          DocumentBuilder parser = factory.newDocumentBuilder();
          Document d = parser.parse(filename);
        } catch (Exception e) {
          System.err.println("Exception: " + e.getMessage());
        }
      }
    }

• Index source code of open-source projects (from compressed archive files and CVS repositories)

• Code is parsed, and terms in type names, variable names, etc. are weighted differently.

©Raphael Hoffmann, Assieme [Hoffmann et al. UIST 07]

Why not use code search engines only?

Query: parse xml java

Problems annotated on the results:

• Irrelevant (an Emacs Lisp file!?!)

• Code is complicated, contains no comments related to the query, and is more than 300(!) lines long

• Requires installation of an external library, but provides no link

• Code on the pages is essentially the same

©Raphael Hoffmann, Assieme [Hoffmann et al. UIST 07]

Why not use code search engines only?

MAPO [Xie&Pei MSR 06]

21

Steps in Recommenders

• Data collection/extraction

• Data preprocessing

• Data analysis/mining

• Result postprocessing

• Result representation

Fact Extraction

• Whole-program analysis: applicable when the whole code bases are available and compilable

• Partial-program analysis: applicable when only partial code samples are available and not compilable

– When a code search engine is used

Analysis of Partial Code Samples

• Not all code samples contain a main method or driver code that can serve as an entry point

– Consider all public methods as entry points

• Deal with local method calls by inlining methods

• Deal with conditionals/loops by traversing control flow graphs

• Deal with unknown types with heuristics

PARSEWeb [Thummalapenta&Xie ASE 07]

Type Heuristics I

• Inferring fully qualified class names

    import javax.jms.QueueSession;
    import java.util.*;

    public class test {
        public QueueSession qsObj;
        public Integer intObj;
        public Iterator iter;
    }

– Fully qualified name of QueueSession is “javax.jms.QueueSession”, inferred through lookup of the import statement

– Fully qualified name of Integer is “java.lang.Integer”, inferred through loading of a class by prepending “java.lang” to the class name

– Cannot infer the fully qualified name of “Iterator” (incorporating domain knowledge of java.util helps)

PARSEWeb [Thummalapenta&Xie ASE 07]
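The lookup order described on this slide could be sketched roughly as follows. This is only an illustration, not PARSEWeb's actual code: the class name FqnResolver and the method resolve are hypothetical, and the imports are passed in as plain strings rather than parsed from a source file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class FqnResolver {
    /** Tries single-type imports first, then the implicitly imported java.lang package. */
    public static Optional<String> resolve(String simpleName, List<String> imports) {
        for (String imp : imports) {
            if (imp.endsWith("." + simpleName)) {
                return Optional.of(imp); // an explicit import names this type
            }
        }
        try {
            // Heuristic from the slide: prepend "java.lang." and try to load the class.
            return Optional.of(Class.forName("java.lang." + simpleName).getName());
        } catch (ClassNotFoundException e) {
            // Not resolvable, e.g. the type is only reachable through a wildcard import.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        List<String> imports = Arrays.asList("javax.jms.QueueSession", "java.util.*");
        System.out.println(resolve("QueueSession", imports));
        System.out.println(resolve("Integer", imports));
        System.out.println(resolve("Iterator", imports));
    }
}
```

With the imports from the slide, QueueSession resolves through its import statement, Integer resolves through java.lang, and Iterator stays unresolved because the wildcard import does not name it.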

Type Heuristics II

• Infer the receiver type in expression “X.Y”

– Look up the declaration of X in local variables or member variables. If no declaration is found, “X” is a class name and Y is a static member

• Infer the receiver type in expression “M1().Y”

– Check the return type of the M1() method declaration; if it is not available locally, the receiver type cannot be inferred

PARSEWeb [Thummalapenta&Xie ASE 07]

Type Heuristics III

• Infer the return type of a method invocation in an assignment statement such as “Queue qObj = createQueueSession()”

– Look up the type of the variable on the left-hand side. The return type is the same as, or a subclass of, Queue

• Infer the return type of a method invocation in a return statement such as

    public QueueSession test() {
        ...
        return connect.createQueueSession(false,int);
    }

– Look up the return type of the enclosing method declaration

PARSEWeb [Thummalapenta&Xie ASE 07]

27

Type Heuristics IV

• Infer types with multiple method invocations

Queue qObj = connect.m1();

Stack sObj = connect.m1().m2();

The receiver type of m2() can be inferred from the lookup of the return type of m1()

PARSEWeb [Thummalapenta&Xie ASE 07]

Sequence Filtering

• Remove common Java library calls

• Remove sequences that contain no query words: ClassGen and addMethod

Example code:

    public void generateStubMethod(ClassGen c) {
        InstructionList il = new InstructionList();
        MethodGen m = genFromISList(il);
        m.setMaxLocals();
        m.setMaxStack();
        c.addMethod(m.getMethod());
        System.out.println("…");
        …
    }

Extracted method-invocation sequence:

    InstructionList.<init>()
    genFromISList(InstructionList)
    MethodGen.setMaxStack()
    MethodGen.setMaxLocals()
    MethodGen.getMethod()
    ClassGen.addMethod(Method)
    PrintStream.println(String)
    …

MAPO [Xie&Pei MSR 06]
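A minimal sketch of the two filtering rules, under stated assumptions: calls are represented as strings of the form "Receiver.method(args)", the list of "common library" receiver types is a made-up placeholder, and a sequence is kept when it mentions at least one query word. None of the names below come from MAPO itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SequenceFilter {
    // Hypothetical list: receiver types treated as common Java library types.
    static final List<String> COMMON_LIBRARY_TYPES =
            Arrays.asList("PrintStream", "String", "Object");

    /** Rule 1: drop calls whose receiver is a common library type. */
    public static List<String> dropLibraryCalls(List<String> sequence) {
        List<String> kept = new ArrayList<>();
        for (String call : sequence) {
            // Text before the first '.' is taken as the receiver; empty for local calls.
            String receiver = call.substring(0, Math.max(0, call.indexOf('.')));
            if (!COMMON_LIBRARY_TYPES.contains(receiver)) {
                kept.add(call);
            }
        }
        return kept;
    }

    /** Rule 2: a sequence survives only if it mentions at least one query word. */
    public static boolean mentionsQuery(List<String> sequence, List<String> queryWords) {
        String joined = String.join(" ", sequence);
        for (String word : queryWords) {
            if (joined.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> sequence = Arrays.asList(
                "InstructionList.<init>()", "genFromISList(InstructionList)",
                "MethodGen.setMaxStack()", "MethodGen.setMaxLocals()",
                "MethodGen.getMethod()", "ClassGen.addMethod(Method)",
                "PrintStream.println(String)");
        List<String> filtered = dropLibraryCalls(sequence);
        System.out.println(filtered);
        System.out.println(mentionsQuery(filtered, Arrays.asList("ClassGen", "addMethod")));
    }
}
```

On the slide's sequence, only the PrintStream.println(String) call is dropped, and the remaining sequence is kept because it contains ClassGen.addMethod.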

Type Signature Graph

Any path from h to w is a (h,w)-jungloid

[Figure: type signature graph. Nodes are types (IFile, ICompilationUnit, CompilationUnit, ASTNode, IClassFile, IJavaElement, IResource, IContainer); edges are labeled with methods (JavaCore.createCompilationUnitFrom(), JavaCore.createClassFileFrom(), AST.parseCompilationUnit(), getResource(), getParent()) or with “supertype”.]

Prospector [Mandelin et al. PLDI 05 ]
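Finding a jungloid then amounts to finding a path in such a graph. The sketch below uses a plain breadth-first search to return one shortest path; it is an illustration rather than Prospector's implementation, the class JungloidSearch is hypothetical, and the three edges added in main are an assumed fragment of the graph, since the figure itself is not fully recoverable here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JungloidSearch {
    // Adjacency list: source type -> list of {edge label, target type}.
    private final Map<String, List<String[]>> edges = new HashMap<>();

    public void addEdge(String from, String label, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(new String[] {label, to});
    }

    /** Breadth-first search; returns the edge labels of a shortest path, or null if none exists. */
    public List<String> shortestPath(String from, String to) {
        Map<String, List<String>> pathTo = new HashMap<>();
        pathTo.put(from, new ArrayList<String>());
        Deque<String> queue = new ArrayDeque<>();
        queue.add(from);
        while (!queue.isEmpty()) {
            String type = queue.remove();
            if (type.equals(to)) {
                return pathTo.get(type);
            }
            List<String[]> outgoing = edges.getOrDefault(type, Collections.<String[]>emptyList());
            for (String[] edge : outgoing) {
                if (!pathTo.containsKey(edge[1])) { // first visit is via a shortest path
                    List<String> path = new ArrayList<>(pathTo.get(type));
                    path.add(edge[0]);
                    pathTo.put(edge[1], path);
                    queue.add(edge[1]);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        JungloidSearch graph = new JungloidSearch();
        // Assumed fragment of the graph, for illustration only.
        graph.addEdge("IFile", "JavaCore.createCompilationUnitFrom()", "ICompilationUnit");
        graph.addEdge("ICompilationUnit", "AST.parseCompilationUnit()", "CompilationUnit");
        graph.addEdge("CompilationUnit", "supertype", "ASTNode");
        System.out.println(graph.shortestPath("IFile", "ASTNode"));
    }
}
```

Returning the shortest path first is consistent with the "shorter length, higher rank" heuristic listed later in the deck.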

Jungloids with Downcasts

    IDebugView debugger = ...
    Viewer viewer = debugger.getViewer();
    IStructuredSelection sel = (IStructuredSelection) viewer.getSelection();
    JavaInspectExpression expr = (JavaInspectExpression) sel.getFirstElement();

[Figure: the corresponding path in the type graph: IDebugView, getViewer(), Viewer, getSelection(), ISelection, downcast, IStructuredSelection, getFirstElement(), Object, downcast, JavaInspectExpression. The figure also shows a getInput() edge.]

Prospector [Mandelin et al. PLDI 05 ]

31

Steps in Recommenders

• Data collection/extraction

• Data preprocessing

• Data analysis/mining

• Result postprocessing

• Result representation

Data Analysis/Mining

• Some recommenders don’t use specific mining techniques to “abstract” or “generalize” common patterns but return relevant raw code samples

– Prospector, Strathcona, XSnippet, PARSEWeb

• Data mining can be used to uncover hidden patterns

– Association rules: CodeWeb [Michail ICSE 00]

– Frequent subsequences: MAPO [Xie&Pei MSR 06]

– Frequent partial orders: Apiator [Acharya et al. FSE 07]

Association Rules

KApplication reuse patterns

CodeWeb [Michail ICSE 00]

Frequent SubSeq/Partial Order

Consider APIs a, b, c, d, e, and f

(a) Example code

    #include <abcdef.h>
    void p ( ) { b ( ); c ( ); }
    void q ( ) { c ( ); b ( ); }
    void r ( ) { e ( ); f ( ); }
    void s ( ) { f ( ); e ( ); }

    int main ( ) {
        int i, j, k;
        a ( );
        if ( i == 1) {
            f ( ); e ( ); c ( ); exit ( );
        } else {
            if ( j == 1 ) p ( ); else q ( );
            d ( );
            if ( k == 1 ) r ( ); else s ( );
        }
    }

(b) Static program traces

    1  a f e c
    2  a b c d e f
    3  a c b d e f
    4  a b c d f e
    5  a c b d f e

(c) Frequent sequential patterns (support 4/5)

    a b d e
    a b d f
    a c d e
    a c d f

(d) Frequent partial order R

[Figure: partial order R over the nodes a, b, c, d, e, f]

Apiator [Acharya et al. FSE 07]
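The support figure in (c) can be checked by counting how many traces from (b) contain a pattern as a subsequence. The sketch below does only that counting; it is not a frequent-pattern miner, and the class PatternSupport is a hypothetical name. Each API call is encoded as one character, as on the slide.

```java
class PatternSupport {
    /** True if pattern occurs in trace as a (not necessarily contiguous) subsequence. */
    public static boolean isSubsequence(String pattern, String trace) {
        int matched = 0;
        for (int j = 0; j < trace.length() && matched < pattern.length(); j++) {
            if (trace.charAt(j) == pattern.charAt(matched)) {
                matched++;
            }
        }
        return matched == pattern.length();
    }

    /** Number of traces that contain the pattern. */
    public static int support(String pattern, String[] traces) {
        int count = 0;
        for (String trace : traces) {
            if (isSubsequence(pattern, trace)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] traces = {"afec", "abcdef", "acbdef", "abcdfe", "acbdfe"};
        for (String pattern : new String[] {"abde", "abdf", "acde", "acdf", "abcd"}) {
            System.out.println(pattern + ": " + support(pattern, traces) + "/" + traces.length);
        }
    }
}
```

Each of the four patterns in (c) is contained in traces 2 to 5 but not in trace 1, which gives the support of 4/5. By contrast, "abcd" is supported by only two traces, because b and c occur in either order; a partial order can express that, while a single sequence cannot.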

Frequent SubSeq/Partial Order

• Frequent sequential patterns: MAPO [Xie&Pei MSR 06]

• Frequent partial orders: Apiator [Acharya et al. FSE 07]

Steps in Recommenders

• Data collection/extraction

• Data preprocessing

• Data analysis/mining

• Result postprocessing

• Result representation

39

Result Postprocessing

• When a third-party miner or learner isn’t used, this step may be considered part of the data analysis/mining step.

Examples

• Result clustering

• Result ranking

• Result filtering

Clustering and Ranking

• The data analysis/mining step may produce too many candidate method sequences for the query “Source -> Destination”

Solutions:

• Cluster similar sequences

– Clustering heuristics are developed

• Rank sequences

– Ranking heuristics are developed

PARSEWeb [Thummalapenta&Xie ASE 07]

Clustering Heuristics

• Method-invocation sequences with the same set of statements can be considered similar even if the statements are in a different order, e.g., “2 3 4 5” and “2 4 3 5”

• Method-invocation sequences with minor differences, measured by an attribute called the cluster precision value, can be considered similar, e.g., “8 9 6 7” and “8 6 10 7” can be considered similar under a cluster precision value of one

PARSEWeb [Thummalapenta&Xie ASE 07]
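One way to read both heuristics as a single check is sketched below. The exact definition of the cluster precision value is not given on the slide, so the measure used here (the larger count of statements present in one sequence but missing from the other) is an assumption chosen to reproduce the two examples; SequenceClustering is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SequenceClustering {
    /**
     * Two sequences are treated as similar if, ignoring order, each one has at most
     * clusterPrecision statements that the other lacks (assumed interpretation).
     */
    public static boolean similar(List<Integer> a, List<Integer> b, int clusterPrecision) {
        Set<Integer> onlyInA = new HashSet<>(a);
        onlyInA.removeAll(b);
        Set<Integer> onlyInB = new HashSet<>(b);
        onlyInB.removeAll(a);
        return Math.max(onlyInA.size(), onlyInB.size()) <= clusterPrecision;
    }

    public static void main(String[] args) {
        // Same statements, different order: similar even with precision 0.
        System.out.println(similar(Arrays.asList(2, 3, 4, 5), Arrays.asList(2, 4, 3, 5), 0));
        // One differing statement on each side: similar under precision 1 only.
        System.out.println(similar(Arrays.asList(8, 9, 6, 7), Arrays.asList(8, 6, 10, 7), 1));
        System.out.println(similar(Arrays.asList(8, 9, 6, 7), Arrays.asList(8, 6, 10, 7), 0));
    }
}
```

Under this reading, the first heuristic is the special case with a cluster precision value of zero.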

Ranking Heuristics

• Heuristic 1: Higher frequency -> Higher rank

• Heuristic 2: Shorter length -> Higher rank

• Heuristic 3: Fewer package boundaries -> Higher rank

PARSEWeb [Thummalapenta&Xie ASE 07]

Prospector [Mandelin et al. PLDI 05 ]
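The first two heuristics can be expressed as a comparator. This is a sketch only: the slide does not say how the heuristics are combined, so using length as a tie-breaker after frequency is an assumption, heuristic 3 is left out, and the Candidate type is made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SequenceRanking {
    /** Hypothetical candidate: a named sequence with its frequency and length. */
    static final class Candidate {
        public final String name;
        public final int frequency;
        public final int length;

        Candidate(String name, int frequency, int length) {
            this.name = name;
            this.frequency = frequency;
            this.length = length;
        }
    }

    /** Higher frequency first; among equal frequencies, shorter sequences first. */
    public static List<Candidate> rank(List<Candidate> candidates) {
        List<Candidate> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingInt((Candidate c) -> c.frequency)
                .reversed()
                .thenComparingInt(c -> c.length));
        return ranked;
    }

    public static void main(String[] args) {
        List<Candidate> ranked = rank(Arrays.asList(
                new Candidate("A", 1, 2),
                new Candidate("B", 5, 4),
                new Candidate("C", 5, 3)));
        for (Candidate c : ranked) {
            System.out.println(c.name + " frequency=" + c.frequency + " length=" + c.length);
        }
    }
}
```

In this example C and B share the highest frequency, so the shorter C ranks first, followed by B and then A.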

Query Splitting

• Problem: the results of code search engines may lack code samples that yield candidate method-invocation sequences

– The required method-invocation sequences are split among different source files

• Solution:

– Split the user query into multiple queries

– Compose the results for each split query

PARSEWeb [Thummalapenta&Xie ASE 07]

Query Splitting Example

1. User query: “org.eclipse.jface.viewers.IStructuredSelection -> java.io.ObjectInputStream”

   Results: None

2. Query: “java.io.ObjectInputStream”

   Results: 3

   Most used immediate sources are: java.io.InputStream, java.io.ByteArrayInputStream, java.io.FileInputStream

3. Three queries to be fired:

   “org.eclipse.jface.viewers.IStructuredSelection -> java.io.InputStream” Results: 1

   “org.eclipse.jface.viewers.IStructuredSelection -> java.io.ByteArrayInputStream” Results: 5

   “org.eclipse.jface.viewers.IStructuredSelection -> java.io.FileInputStream” Results: None

PARSEWeb [Thummalapenta&Xie ASE 07]

Result Filtering

• Remove sequences that contain no query words: ClassGen and addMethod

• Compress consecutive calls of the same method into one, e.g., abbba → aba

• Remove duplicate frequent sequences after the compression, e.g., aba, aba → aba

• Reduce a sequence if it is a subsequence of another, e.g., aba, abab → abab

MAPO [Xie&Pei MSR 06]
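The last three rules can be sketched as a small pipeline. As on the slide, each method call is encoded as one character; the class ResultFilter and its method names are hypothetical, and this is an illustration rather than MAPO's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ResultFilter {
    /** Collapses runs of the same call, e.g. "abbba" becomes "aba". */
    public static String compress(String sequence) {
        StringBuilder out = new StringBuilder();
        for (char call : sequence.toCharArray()) {
            if (out.length() == 0 || out.charAt(out.length() - 1) != call) {
                out.append(call);
            }
        }
        return out.toString();
    }

    /** True if shorter occurs in longer as a (not necessarily contiguous) subsequence. */
    public static boolean isSubsequence(String shorter, String longer) {
        int matched = 0;
        for (int j = 0; j < longer.length() && matched < shorter.length(); j++) {
            if (longer.charAt(j) == shorter.charAt(matched)) {
                matched++;
            }
        }
        return matched == shorter.length();
    }

    /** Compress, remove duplicates, then drop sequences subsumed by another sequence. */
    public static List<String> filter(List<String> sequences) {
        Set<String> unique = new LinkedHashSet<>(); // keeps first-seen order
        for (String s : sequences) {
            unique.add(compress(s));
        }
        List<String> kept = new ArrayList<>();
        for (String s : unique) {
            boolean subsumed = false;
            for (String t : unique) {
                if (!s.equals(t) && isSubsequence(s, t)) {
                    subsumed = true;
                    break;
                }
            }
            if (!subsumed) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter(Arrays.asList("abbba", "aba", "abab")));
    }
}
```

For the input abbba, aba, abab, compression and duplicate removal leave aba and abab, and aba is then dropped as a subsequence of abab.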

Steps in Recommenders

• Data collection/extraction

• Data preprocessing

• Data analysis/mining

• Result postprocessing

• Result representation

Result Representation

• Display results in the tool user interface

– Strathcona

– XSnippet

– PARSEWeb

– MAPO

– CodeBroker

– Assieme

48 48

Strathcona

Strathcona [Holmes et al. 05]

49 49

XSnippet

XSnippet [Sahavechaphan&Claypool OOPSLA 06 ]

50 50

PARSEWeb

PARSEWeb [Thummalapenta&Xie ASE 07]

51 51

PARSEWeb

http://news.google.com/

52 52

MAPO (new)

MAPO [Xie&Pei MSR 06]

53 53

MAPO (new)

MAPO [Xie&Pei MSR 06]

CodeBroker

[Screenshot labels: Comments, signature]

Information delivery that autonomously locates and presents software developers with task-relevant and personalized components. Active repository!!!

CodeBroker [Ye&Fischer ICSE 01]

Assieme

• A hybrid search engine

• Index code snippets found on web pages

• Link them to required libraries and documentation

Assieme [Hoffmann et al. UIST 07]

Assieme

links to pages with snippets

group pages with similar snippets

links to required libraries

Assieme [Hoffmann et al. UIST 07]

Example Evaluations of Recommenders

• Prospector

• Strathcona

• PARSEWeb

Prospector Experiment 1 (ranking test)

• hypothesis:

– to find the desired code, the user needs to examine only the top 5 candidate jungloids.

• result:

– desired code in “top 5” 17 out of 20 times (10 out of 20 in “top 1”)

– remaining three fixable

• methodology:

– used 20 real-world coding tasks

– collected from FAQs, newsgroups, our practice, emails to us

Prospector Experiment 2 (user study)

• hypothesis:

– Prospector-equipped programmers are better at solving API programming problems than other programmers

• methodology:

– 6 problems, each user did 3 with Prospector and 3 without

– problems formulated not to reveal the query

– sample problem: “The new Java channel IO system represents files as channels. How do I get a channel that represents a String filename?”

– somewhat sparse data (10 users)

Prospector Experiment 2 (user study): Results

• Prospector shortens development time

– some problems solved only by Prospector users

– when both groups succeeded, Prospector users were 30% faster

• Prospector may help enable reuse

– non-Prospector users sometimes reimplemented

• Prospector may help avoid making mistakes

– mistakes in applying code found on the internet to one's own code

• The authors expect even stronger results on a more robust infrastructure.

Strathcona: User Study

• 2 developers were assigned 4 tasks on building a plug-in for Eclipse. Neither developer knew how to implement any of the tasks at hand.

• The results showed that the tool can deliver relevant and useful examples to developers. They also showed that a developer can determine when the examples returned are not relevant.

Table 2: Results from Evaluation

    Task    Subject    Useful Example  Source Viewed  Succeeded at Task
    Task 1  Subject 1  1               1              yes
    Task 1  Subject 2  1               1              yes
    Task 2  Subject 1  1               2              yes
    Task 2  Subject 2  1               6              yes
    Task 3  Subject 1  0               2              yes
    Task 3  Subject 2  0               6              yes
    Task 4  Subject 1  1               2              yes
    Task 4  Subject 2  0               7              partially

Strathcona [Holmes et al. 05]

Strathcona: Performance and Scalability

• As a test case for scalability, the repository was populated with the Eclipse 3.0 source. The resulting amount of information in the repository is shown in Table 1.

Table 1: Number of Structural Relations

    Classes                  17,456
    Methods                  124,359
    Fields                   48,441
    Inheritance Relations    15,187
    Object Instantiations    43,923
    Calls Relations          1,066,838
    Total                    1,316,204

• With the server on a Pentium 3 (800 MHz, 1024 MB RAM) and the repository, backed by a PostgreSQL database, on a Pentium 3 (1000 MHz, 256 MB RAM), the performance numbers are:

– Less than 500 ms for building a structural context.

– Less than 300 ms for displaying the example.

– 4 – 12 seconds server response time.

Strathcona [Holmes et al. 05]

PARSEWeb Evaluations

• Real Programming Problems: To address problems posted in developer forums

• Real Projects: To show that solutions recommended by PARSEWeb are

– available in real projects

– on average better than solutions recommended by the related tools Prospector, Strathcona, and Google Code Search

64

Real Programming Problems

Jakarta BCEL user forum, 2001

Problem: “How to disassemble java byte code”

Query: “Code -> Instruction”

Solution Sequence:

FileName:2_RepMIStubGenerator.java MethodName: isWriteMethod Rank:1 NumberOfOccurrences:1

Code,getCode() ReturnType:#UNKNOWN#

CONSTRUCTOR,InstructionList(#UNKNOWN#) ReturnType:InstructionList

InstructionList,getInstructions() ReturnType:Instruction

Solution Sample Code: Code code;

InstructionList il = new InstructionList(code.getCode());

Instruction[] ins = il.getInstructions();

Real Programming Problems

Dev 2 Dev Newsgroups, 2006

Problem: “how to connect db by sessionBean”

Query: javax.naming.InitialContext -> java.sql.Connection

Solution Sequence:

FileName:3 AddressBean.java MethodName:getNextUniqueKey Rank:1 NumberOfOccurrences:34

javax.naming.InitialContext,lookup(java.lang.String) ReturnType:javax.sql.DataSource

javax.sql.DataSource,getConnection() ReturnType:java.sql.Connection

Real Project: Logic

• Source File: LogicEditor.java

SUMMARY-> PARSEWeb: 8/10, Prospector: 6/10, Strathcona: 5/10

Comparison with Prospector

• 12 specific programming tasks taken from the XSnippet approach.

SUMMARY-> PARSEWeb: 11/12, Prospector: 7/12

Comparison with Other Tools

Percentage of tasks successfully completed by PARSEWeb, Prospector, and XSnippet

Significance of Internal Techniques

*Legend:

Method inline: Method inlining

Post Process: Sequence Post Processor

Query Split: Query Splitter

T. Xie, Mining Program Source Code

Questions?

Bibliography on Mining Software Engineering Data: http://ase.csc.ncsu.edu/dmse/

• What software engineering tasks can be helped by data mining?

• What kinds of software engineering data can be mined?

• How are data mining techniques used in software engineering?

• Resources

Available Data Mining Tools: http://ase.csc.ncsu.edu/dmse/resources.html

Mining Partial Orders

Consider APIs a, b, c, d, e, and f

• The extracted scenarios are fed to a partial order miner

• The partial order miner mines frequent closed partial orders

[Figure panels: Partial Order; Partial Order with Transitive Reduction; Closed Partial Order]

Apiator [Acharya et al. FSE 07]

Example Partial Order

A usage scenario around the XOpenDisplay API as a partial order. Specifications are shown with dotted lines.

[Figure: partial order over the calls XOpenDisplay, XCloseDisplay, XCreateWindow, XGetWindowAttributes, XCreateGC, XSetForeground, XGetBackground, XMapWindow (shown twice), XChangeWindowAttributes, XSelectInput, XGetAtomName, XFreeGC, XNextEvent]

Apiator [Acharya et al. FSE 07]
