ANSWERING REACHABILITY QUESTIONS

Thesis Proposal

Thomas D. LaToza

12/8/2009

Institute for Software Research
School of Computer Science
Carnegie Mellon University

Pittsburgh, PA

COMMITTEE

Brad A. Myers, Human Computer Interaction Institute, Carnegie Mellon (Co-chair)

Jonathan Aldrich, Institute for Software Research, Carnegie Mellon (Co-chair)

Aniket Kittur, Human Computer Interaction Institute, Carnegie Mellon

Thomas Ball, Microsoft Research

ABSTRACT

What are the most frequent, time-consuming, hard-to-answer, and error-prone questions professional software developers ask about programs? Reachability questions. A reachability question is a search upstream or downstream across paths from a statement for target statements. For example, a developer debugging a deadlock searched downstream for calls acquiring resources.

My studies indicate that reachability questions are pervasive throughout coding tasks. In one study, half of the bugs developers inserted were associated with reachability questions developers asked or should have asked. Developers report asking these questions more than 9 times a day, and 82% agree at least one is hard to answer. Neither increased professional experience nor even increased familiarity with a codebase makes reachability-related questions easier or less frequent. In another study, 9 of the 10 longest investigation and debugging activities involved answering a single reachability question.

Using existing tools, developers traverse paths across method calls in search of target statements. Reachability questions are hard to answer because developers must guess both which paths lead to targets and which paths are feasible and may execute. To help developers more effectively answer reachability questions, I propose a new kind of reverse engineering technique in which developers search across paths for target statements. Starting at a statement in a program, developers enter search strings that are matched against identifiers or comments along paths. Specific situations can be considered by posing “What if?” questions such as “What happens when this data table is uninitialized?”

A static analysis for answering reachability questions determines the feasible paths through conditionals. Existing approaches either do not eliminate infeasible paths or are too slow to be used in an interactive tool. However, examples of reachability questions suggest that many common infeasible paths are caused by conditionals evaluating variables that may only contain constants (e.g., dynamic dispatch, flags). I propose to design a fast feasible path analysis which eliminates infeasible paths caused by these constant-controlled conditionals. A preliminary implementation is able to eliminate many common infeasible paths through a 50 KLOC Java program in just 13 seconds of analysis time.

1. INTRODUCTION

A central goal of software engineering is to help developers be more productive and create higher quality software by accomplishing tasks faster and introducing fewer defects. Throughout these tasks, developers must understand task-relevant code. Modern codebases range in size from hundreds of thousands to more than millions of lines of code. When interacting with code written by other teams or by other companies, code is often connected by complex interaction mechanisms and indirection using events and callbacks. While these constructs help make software more extensible and reusable, they also make it more challenging to understand. An analysis of code in Adobe’s desktop applications found that one third of the code is devoted to event handling logic, which caused half of the reported bugs [AN]. Successfully coordinating dependencies between effects in loosely connected modules can be very challenging [AO]. Developers often address this challenge by working exclusively on portions of the codebase that they “own” [A]. However, this boundary is imperfect, and developers often debug paths through others’ code, reuse functionality written by others, or are “load balanced” to work on other portions of the codebase. And when developers switch teams, they must learn a codebase anew.

To discover the nature and context of what makes work in large, complex codebases challenging, I conducted a series of studies examining the social context, activities, process, expertise effects, questions, and strategies of developers at work in coding tasks. Surprisingly, I discovered that much of developers’ work involves exploring code to answer reachability questions. Developers start at a statement stmt in a program and search upstream across paths reaching stmt or downstream across paths originating at stmt. Developers ask reachability questions when debugging to locate the statements which cause a fault to occur. When proposing changes, developers often first investigate code to understand the implications of their change and ask questions about the relationship of their change to upstream or downstream behavior.

Consider an example. An experienced developer participating in my lab study proposed a change but could not determine if it would work:

What I'd like to do is identify those core, hopefully EditBus events, and say just repaint the caret on that event. And the easiest thing to do is hook up the StatusBar to that event, get that event, and get the relevant events, and if so, update the caret. … [But] I'm concerned that I won't get all of the events that cause this guy to get updated. And I'm not sure, with the existing tools in Eclipse, how to find out all the places that can cause this thing to be called.

While the developer was aware that a provided call graph navigation tool could traverse chains of method calls, this did not directly help. Upstream from the update method was a bus onto which dozens of methods posted events, but only a few of these events triggered the update. Existing call graph tools are unable to identify only those upstream methods sending the events triggering the update. Unable to answer the question in any practical way, he instead optimistically hoped his guess would work, spent time determining how to reuse functionality to implement the change, edited the code, and tested his changes before learning the change would never work and all his past 23 minutes of work had been wasted.

Fantastically named EditBus, and it actually doesn't have any events related to edit. [laughing] It just has events related to buffer changes, which is not an edit. OHHH, I just wonder where edits might be going.

My studies indicate that reachability questions are pervasive throughout debugging and investigation activities in large, complex codebases. Yet existing tools often make it challenging for developers to answer these questions. Modern development environments include code exploration tools such as call graph navigation tools and reference searches that make it easy for developers to traverse many types of relationships between elements in a program. However, developers using these tools to answer reachability questions must explore the search space where target statements might be located by repeatedly traversing relationships. Traversing is challenging when the size of the search space is large, developers cannot predict which relationships to follow to find targets, or paths are infeasible and can never execute. In my studies, developers often spent tens of minutes answering a single reachability question. In observations of developers in the field, 9 of the 10 longest debugging and investigation activities each involved answering a single reachability question.

To help developers more effectively answer reachability questions, I propose a fundamentally new reverse engineering technique in which developers directly express their reachability questions and inspect matching target statements. For example, to answer the question, “What are the implications of deferring the initialization of this data table?”, a developer searches for statements downstream from an origin which differ when the table is or is not initialized. From examples of challenging reachability questions from my studies and other studies of developer questions, I designed a formalism for describing reachability questions (section 2). I propose to design interaction techniques allowing developers to directly express these reachability questions (section 6). Asking a reachability question generates a list of statements. In order to help developers make sense of these results, refine their questions, and ask follow-up questions, I propose to design interactive visualizations of feasible paths through a program (section 6).

The key technical challenge for reverse engineering answers to reachability questions from code is determining which paths are feasible. In general, infeasible paths exist in any language with conditional statements and control flow. Existing techniques such as model checkers can determine path feasibility in many cases but require hours or days to do so. My approach relies on determining feasible paths in response to a reachability question a developer has asked. As computing answers to all questions is impractical, my interactions require an analysis approach that is able to determine feasible paths on demand in a short time.

Examples of common infeasible path idioms suggest that many of the most common sources of infeasible paths can be eliminated by solving a simpler problem. Infeasible paths occur when the direction taken in a conditional statement evaluating an expression e is correlated with the path by which e is reached. Interestingly, in many examples conditionals are constant controlled with properties that make the infeasible path problem considerably simpler. Intuitively, knowing only where the path begins (e.g., a method handling a user input event) and tracking constant values through assignments to variables along the path is sufficient to determine whether e is true or false. Common idioms such as flags, case selects on enums, dynamic dispatch, and messages sent over buses often have these properties. Section 5.4 describes these properties in more detail. I propose to examine corpora of real programs to better understand how frequently these properties hold.

A reachability question is a search over feasible paths through code. I propose a new representation of these paths - a static trace. A static trace implicitly represents the set of feasible concrete traces between an origin statement o and a destination statement d. I propose to design a fast feasible path analysis capable of reverse engineering static traces from programs (section 8). Fast feasible path analysis first modularly analyzes each method in a program to determine paths and saves these results as method summaries. In response to a reachability question, the analysis executes the relevant summaries to construct a static trace.

Besides evaluating properties of infeasible paths in real programs, I propose several additional evaluations testing the usefulness of my approach. I propose to implement my interaction techniques and analysis in a new tool for Java called REACHER. To understand the effects of my approach on developer productivity, I will conduct a lab study comparing task time and correctness both when answering reachability questions and when accomplishing entire tasks. A between-subjects design comparing developers using REACHER against a control Eclipse condition will help understand how developers work differently and help measure the strength of any effects on task time and the quality of changes developers implement. This evaluation also tests the performance of the analysis, but only indirectly: if the analysis is too imprecise or too slow, the tool will not be usable. And it is only practical to conduct a lab study on a relatively small number of examples.

To more directly and completely assess the performance of the analysis, I will run the analysis on several large, open source Java programs. Speed can be evaluated simply by measuring the time to construct a static trace. Evaluating the quality of the results is more difficult. An ideal evaluation might be to compare the results to an oracle that produces only the true feasible paths and measure precision and recall. However, it is undecidable in general to decide if a path is feasible, so such an oracle cannot exist. Instead, I propose to first show either informally or by proof that the analysis always produces all feasible paths (soundness). Analysis precision can be measured by counting the number of infeasible paths ruled out compared to a naive analysis performing no infeasible path elimination. This allows precision to be calculated as a percentage improvement over a naive analysis.

My thesis is:

Developers searching, filtering, and comparing static traces computed in interactive time by a feasible paths analysis are able to more quickly and accurately answer reachability questions.

The proposed contributions of this thesis span findings and models derived from empirical studies of developers describing how developers understand code, new interactions and visualizations for asking and answering reachability questions, a new representation of sets of feasible concrete traces, and a static analysis for quickly eliminating infeasible paths caused by constant-controlled conditionals.

In the rest of this proposal, I first define reachability questions and survey work related to my proposed contributions. In order to understand the context in which developers ask reachability questions, I describe three studies I conducted of program understanding tasks. Next, I describe results from 3 additional studies of how developers answer reachability questions using existing tools and why it can be hard. I then describe interactions and visualizations I propose to help developers more effectively answer reachability questions. Finally, I describe the nature of the core technical problem – feasible path analysis – and my approach to the problem.

2. REACHABILITY QUESTIONS

What does it mean for a developer to ask a reachability question about a program? From my studies and other studies of questions developers ask, it seemed clear that developers ask a class of questions that had not previously been explicitly characterized. But exactly which questions do these include? While there were many examples, often in developers’ own words, it became clear that a formal interpretation of these questions would help unambiguously describe how these questions might be interpreted as reachability questions and make relationships between similar questions clearer. So from these examples, I designed a formal language for reachability questions described in this section.

It is first helpful to define several related terms. A concrete trace describes a program’s dynamic behavior in a single execution as a list of executed statements. An executed statement s is a tuple of a statement stmt and a context ctx which assigns values to variables after stmt’s execution.

A program’s control flow graph (CFG) is a graph where nodes are statements and a directed edge from a predecessor statement a to a successor statement b exists iff there exists a concrete trace where b is the next statement executed after a. Thus, traversing a CFG generates a path of statements. A control flow branch occurs when a statement has multiple successors. Similarly, a control flow merge occurs when a statement has multiple predecessors. Since a concrete trace only follows a single path through the program, control flow branches always occur at a conditional statement that evaluates an expression e to determine which path to take. Common conditionals include if statements, switch statements, short circuit evaluation (e.g., && and ||), branching to function pointers, and dynamic dispatch in object-oriented languages. Note that even evaluating an expression that might throw an exception is a form of conditional as it determines whether to branch to the next statement or to an exception handler. However, exceptional control flow will not be considered any further as it is outside the scope of this proposal.
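The CFG terminology can be illustrated with a minimal sketch. The class below is hypothetical and is not part of REACHER: the names ControlFlowGraph, addEdge, isBranch, isMerge, and paths are assumptions made for illustration, statements are represented only by integer labels, and the edges are supplied by hand rather than derived from concrete traces.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative CFG: nodes are statement labels, edges mean "may execute next". */
final class ControlFlowGraph {
    private final Map<Integer, Set<Integer>> successors = new LinkedHashMap<>();
    private final Map<Integer, Set<Integer>> predecessors = new LinkedHashMap<>();

    /** Records that statement {@code to} may execute immediately after {@code from}. */
    void addEdge(int from, int to) {
        successors.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
        predecessors.computeIfAbsent(to, k -> new LinkedHashSet<>()).add(from);
    }

    Set<Integer> successorsOf(int stmt) {
        return successors.getOrDefault(stmt, new LinkedHashSet<>());
    }

    /** A control flow branch: the statement has multiple successors. */
    boolean isBranch(int stmt) {
        return successorsOf(stmt).size() > 1;
    }

    /** A control flow merge: the statement has multiple predecessors. */
    boolean isMerge(int stmt) {
        return predecessors.getOrDefault(stmt, new LinkedHashSet<>()).size() > 1;
    }

    /** Enumerates every cycle-free path from {@code from} to {@code to}. */
    List<List<Integer>> paths(int from, int to) {
        List<List<Integer>> out = new ArrayList<>();
        collect(from, to, new ArrayList<>(), out);
        return out;
    }

    private void collect(int current, int to, List<Integer> prefix, List<List<Integer>> out) {
        if (prefix.contains(current)) {
            return; // skip cycles so that the enumeration terminates
        }
        prefix.add(current);
        if (current == to) {
            out.add(new ArrayList<>(prefix));
        } else {
            for (int next : successorsOf(current)) {
                collect(next, to, prefix, out);
            }
        }
        prefix.remove(prefix.size() - 1); // backtrack (removes by index)
    }

    public static void main(String[] args) {
        // An if/else diamond: 1 is the conditional, 2 and 3 are its arms, 4 follows both.
        ControlFlowGraph cfg = new ControlFlowGraph();
        cfg.addEdge(1, 2);
        cfg.addEdge(1, 3);
        cfg.addEdge(2, 4);
        cfg.addEdge(3, 4);
        System.out.println("branch at 1: " + cfg.isBranch(1));
        System.out.println("merge at 4: " + cfg.isMerge(4));
        System.out.println("paths 1 to 4: " + cfg.paths(1, 4));
    }
}
```

Traversing this graph yields paths of statements, but nothing in the graph itself says whether each path corresponds to a concrete trace.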

The expression e evaluated by a conditional is often quite complex, containing nested expressions with arithmetic comparisons and effect-generating method invocations. Due to short-circuit evaluation, expressions themselves have control flow graphs determining which portions are evaluated. However, a well-known transformation used by compilers can eliminate this complexity by translating programs into a form called three address code (TAC). In TAC programs, every statement in the program is an instruction with up to three variables (addresses) and an operation. A conditional simply evaluates a boolean variable v to determine which statement to execute next. In turn, complex expressions that a conditional evaluates are translated into statements that compute v. Without loss of generality, all analysis of programs in this proposal is formulated as analysis of TAC programs. REACHER itself analyzes Java programs translated to TAC. And the terms statement and expression are used interchangeably as expressions in Java correspond to TAC statements.

Consider the following code:

1 if (g)
2     h = false;
3 if (h)
4     foo();

Note that the path 1, 2, 3, 4 can never execute. This path is said to be infeasible. The conditional if (h) is said to be correlated with the previous conditional if (g) – the direction taken in the first branch influences the direction taken in the second branch. Correlated conditionals result in paths through a control flow graph that do not correspond to any concrete trace. Intuitively, statement b is said to be reachable from statement a iff there exists a concrete trace in which a and b both occur and b occurs after a. Note that b may be reachable from a even if a does not lead to a call to b: c may first call a and then b. In general, paths from a to b could be arbitrarily long. But in practice, developers are only able to investigate code for which they have source. This disconnects the call graph when methods call into external libraries or are only called by libraries. Methods with no callers or callees are referred to as cut-points, similar to cut-points in existing program analysis systems [AU].
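To make the infeasibility of path 1, 2, 3, 4 concrete, the following sketch enumerates every control flow path through the four-statement example and rejects those on which a conditional contradicts a constant assigned earlier on the same path. It is only an illustration under stated assumptions and is not the analysis proposed here: the names InfeasiblePathSketch, Stmt, and EXIT are hypothetical, only boolean constants are tracked, and the program is assumed to be acyclic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class InfeasiblePathSketch {
    static final int EXIT = 5; // hypothetical label for the point after statement 4

    /** A TAC-like statement: constant assignment, conditional on a variable, or other. */
    static final class Stmt {
        final String assigns; // variable assigned a constant, or null
        final boolean value;  // the constant, meaningful only when assigns != null
        final String tests;   // variable tested by a conditional, or null
        final int onTrue;     // successor when the test is true (or the only successor)
        final int onFalse;    // successor when the test is false

        private Stmt(String assigns, boolean value, String tests, int onTrue, int onFalse) {
            this.assigns = assigns;
            this.value = value;
            this.tests = tests;
            this.onTrue = onTrue;
            this.onFalse = onFalse;
        }

        static Stmt assign(String var, boolean value, int next) {
            return new Stmt(var, value, null, next, next);
        }

        static Stmt branch(String var, int onTrue, int onFalse) {
            return new Stmt(null, false, var, onTrue, onFalse);
        }

        static Stmt other(int next) {
            return new Stmt(null, false, null, next, next);
        }
    }

    static Map<Integer, Stmt> exampleProgram() {
        Map<Integer, Stmt> p = new HashMap<>();
        p.put(1, Stmt.branch("g", 2, 3));     // 1 if (g)
        p.put(2, Stmt.assign("h", false, 3)); // 2     h = false;
        p.put(3, Stmt.branch("h", 4, EXIT));  // 3 if (h)
        p.put(4, Stmt.other(EXIT));           // 4     foo();
        return p;
    }

    /** Enumerates all CFG paths from start to EXIT, ignoring feasibility. */
    static List<List<Integer>> allPaths(Map<Integer, Stmt> program, int start) {
        List<List<Integer>> out = new ArrayList<>();
        walk(program, start, new ArrayList<>(), out);
        return out;
    }

    private static void walk(Map<Integer, Stmt> program, int label,
                             List<Integer> prefix, List<List<Integer>> out) {
        if (label == EXIT) {
            out.add(new ArrayList<>(prefix));
            return;
        }
        prefix.add(label);
        Stmt s = program.get(label);
        walk(program, s.onTrue, prefix, out);
        if (s.tests != null) {
            walk(program, s.onFalse, prefix, out);
        }
        prefix.remove(prefix.size() - 1);
    }

    /** A path is rejected when a branch direction contradicts a known constant. */
    static boolean isFeasible(Map<Integer, Stmt> program, List<Integer> path) {
        Map<String, Boolean> known = new HashMap<>();
        for (int i = 0; i < path.size(); i++) {
            Stmt s = program.get(path.get(i));
            if (s.assigns != null) {
                known.put(s.assigns, s.value);
            }
            if (s.tests != null) {
                int next = (i + 1 < path.size()) ? path.get(i + 1) : EXIT;
                boolean taken = (next == s.onTrue);
                Boolean constant = known.get(s.tests);
                if (constant != null && constant != taken) {
                    return false;
                }
                known.put(s.tests, taken); // the direction taken fixes the value from here on
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<Integer, Stmt> program = exampleProgram();
        for (List<Integer> path : allPaths(program, 1)) {
            System.out.println(path + (isFeasible(program, path) ? " feasible" : " infeasible"));
        }
    }
}
```

Of the four control flow paths, only 1, 2, 3, 4 is rejected: statement 2 fixes h to false, so the true direction at statement 3 cannot be taken on that path.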

Throughout this section, lower case names refer to individual items while upper case names refer to collections of items. The set of traces TR is the set of all feasible concrete traces through a program p beginning at an origin statement o and ending at destination statement d. A behavior B is a list of executed statements that have been selected from the list of executed statements in a trace set. A trace set TR is also itself a behavior containing all s in TR. Conceptually, a behavior B corresponds to portions of TR matching search criteria.

A reachability query takes a program p and produces a trace set. Reachability queries come in two forms – downstream(p, o, d) : TR and upstream(p, d) : TR. upstream(p, d) : TR finds all concrete traces beginning at any cut-point in p that reach d. downstream(p, o, d) finds all concrete traces through p from o to a destination statement d. Note that reachability queries are asymmetric - upstream requires only a destination while downstream requires both an origin and a destination. This asymmetry arises because of a difference in what my studies suggest developers find task-relevant (see Figure 0). When asking upstream questions, it may be interesting to know that method m3 containing d is called by method m1 that previously called method m2 that contained relevant statements. In contrast, my studies do not suggest that developers would ask questions about what happens after a statement o contained in m2, called by m1, which subsequently calls m3. Part of the asymmetry occurs because when inspecting downstream developers are more interested in what a block of code does rather than everything that might happen afterwards. For example, when inspecting a callsite, the question is usually what happens downstream from the callsite until control returns to the callsite. In this sense, upstream finds what happens before, up until cut-points are reached, while downstream finds what a block of code does.

[Figure 0: two call diagrams. (a) upstream(p, d), labeled "Typical": m1 calls m2, which contains t_up, and then calls m3, which contains d. (b) downstream(p, o, ?), labeled "Not typical": m1 calls m2, which contains o, and then calls m3, which contains t_down.]

Figure 0. When posing upstream reachability queries, developers may ask about statements t_up that execute before a destination statement d even if only related by a common caller m1 (a). But developers do not typically ask downstream reachability queries about statements t_down invoked from a common caller m1 of an origin statement o (b).

While an explicit origin and destination statement is very expressive in specifying regions of code, many questions are actually much simpler and ask what a method or callsite does. downstream(p, m) is syntactic sugar for downstream(p, m_entry, m_exit) where m_entry is the first statement in m and m_exit is a final statement in m along all paths through m. Similarly, downstream(p, call) is syntactic sugar for downstream(p, before_call, call) where call is a method invocation statement and before_call refers to the context immediately before executing call. This specifies a static trace beginning at call and ending at call which includes statements in methods which might transitively be invoked by call. Note that downstream(p, m) specifies a single method while downstream(p, call) includes all methods that are possible dynamic dispatch targets at call.

Reachability questions combine one or more reachability queries producing static traces with zero or more criteria functions describing sets of target statements (behavior) within static traces. filter(TRi , s, v) : TRo is a criteria function selecting concrete traces in TRi where s assigns a value v to its result variable. This is equivalent to asserting that an expression equals v in the context ctx in which s executes. feasibleCallers(TR, m) : B finds all invocation statements of method m in TR. Similarly, feasibleCallees(TR, m) : B finds method declaration statements of callees of m. cutPoints(TR) : B finds (1) method declaration statements in TR with no callers and (2) method invocations in TR with no corresponding method declaration in TR. The primary use of cutPoints is to identify communication with a framework or library, but it also finds dead code. valueFlow(TR, expr) : B finds all statements connected by a dataflow path in TR to expr. A directed data flow edge from an executed statement s1 to an executed statement s2 exists iff s1 may be the last assignment in TR to a variable x read by s2. Finally, subtrace(TRi, o’, d’) : TRo selects the portion of TRi from o’ to d’.
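As one illustration of a criteria function, the two cases of cutPoints can be sketched over a simplified view of a trace set. The sketch assumes, hypothetically, that a trace set has already been reduced to a set of declared method names and a list of caller/callee pairs; the class CutPointsSketch, its methods, and the method names in the example are all invented for illustration, and the second case returns callee names rather than the invocation statements themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class CutPointsSketch {
    /** A call observed in the trace set: caller invokes callee. */
    static final class Call {
        final String caller;
        final String callee;

        Call(String caller, String callee) {
            this.caller = caller;
            this.callee = callee;
        }
    }

    /** Case (1): declared methods that no call in the trace set invokes. */
    static Set<String> uncalledDeclarations(Set<String> declared, List<Call> calls) {
        Set<String> result = new TreeSet<>(declared);
        for (Call c : calls) {
            result.remove(c.callee);
        }
        return result;
    }

    /** Case (2): invoked methods with no declaration in the trace set. */
    static Set<String> undeclaredCallees(Set<String> declared, List<Call> calls) {
        Set<String> result = new TreeSet<>();
        for (Call c : calls) {
            if (!declared.contains(c.callee)) {
                result.add(c.callee);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical method names, chosen only for this example.
        Set<String> declared = new TreeSet<>();
        declared.add("onKeyPressed");
        declared.add("updateCaret");
        declared.add("unusedHelper");

        List<Call> calls = new ArrayList<>();
        calls.add(new Call("onKeyPressed", "updateCaret"));
        calls.add(new Call("updateCaret", "Library.repaint"));

        System.out.println("no callers: " + uncalledDeclarations(declared, calls));
        System.out.println("no declaration: " + undeclaredCallees(declared, calls));
    }
}
```

In this example onKeyPressed is reported because only a framework would call it, unusedHelper because it is dead code, and Library.repaint because its source is unavailable, mirroring the uses of cutPoints described above.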

search(Bi , str) : Bo finds executed statements s in Bi for which toString(s) contains the substring str. When an executed statement s is added, the method declaration statement of the method containing s is also added. toString(s) : str returns some string representation of s including at least the names of identifiers, values assigned to variables by the context, and the text of associated comments. Searches can also be performed for statements: search(Bi , stmt) : Bo , sets of statements: search(Bi , STMT) : Bo , or sets of string targets: search(Bi , STR) : Bo.
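A minimal sketch of the string form of search follows. It assumes a hypothetical ExecutedStatement type whose describe method plays the role of toString(s) by concatenating the statement text, the context values, and the associated comment; the statements in the example are invented, and results are returned as plain strings rather than as a behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SearchSketch {
    /** Hypothetical executed statement: statement text plus context and comment. */
    static final class ExecutedStatement {
        final String method;  // name of the containing method
        final String text;    // statement text, including identifiers
        final String context; // values assigned to variables by the context
        final String comment; // text of any associated comment

        ExecutedStatement(String method, String text, String context, String comment) {
            this.method = method;
            this.text = text;
            this.context = context;
            this.comment = comment;
        }

        /** Stands in for toString(s): identifiers, context values, and comment text. */
        String describe() {
            return text + " " + context + " " + comment;
        }
    }

    /**
     * Finds statements whose description contains str. For every match, the
     * declaration of the containing method is added as well, once per method.
     */
    static List<String> search(List<ExecutedStatement> behavior, String str) {
        Set<String> out = new LinkedHashSet<>();
        for (ExecutedStatement s : behavior) {
            if (s.describe().contains(str)) {
                out.add("declaration of " + s.method);
                out.add(s.text);
            }
        }
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        List<ExecutedStatement> behavior = new ArrayList<>();
        behavior.add(new ExecutedStatement(
                "saveBuffer()", "lock.acquire()", "lock=bufferLock", "acquire the buffer lock"));
        behavior.add(new ExecutedStatement(
                "saveBuffer()", "writeToDisk(path)", "path=notes.txt", ""));
        System.out.println(search(behavior, "acquire"));
        System.out.println(search(behavior, "notes.txt"));
    }
}
```

Because the match runs over the whole description, a search string can hit an identifier, a value in the context, or a comment.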

Element functions generate sets of statements associated with a source element. writes(F) : STMT, reads(F) : STMT, accesses(F) : STMT generate the set of statements reading, writing, or accessing a set of fields F. Sets of fields include the singleton field f, fields of type t: fields(t), and the set FIELDS containing all fields in the program. methods(T) : STMT finds all method declaration statements in a class of type T. We generalize upstream and downstream to sets of statements. For example, upstream(p, STMT) : TR runs upstream(p, stmt) : TR on each stmt in STMT.

Reachability questions qa and qb can be composed into nested reachability questions when qa produces an output Ba and qb has an output Bb. One form of nesting is the behavior operators conjunction, disjunction, and subtraction. Conceptually, behavior operators attempt to match an executed statement s1 in a behavior Ba to a corresponding executed statement s2 in a second behavior Bb . If no match is possible, the set operation is performed over the entire executed statement. If a match is found, the corresponding set operator is performed for context entries referenced by s1 and s2. An algorithmic description of how such a match can be performed is proposed work. ∧(Ba, Bb) : Bo computes the set intersection of executed statements in both Ba and Bb. Similarly, ∨(Ba, Bb) : Bo computes the set union of executed statements in either Ba or Bb. -(Ba, Bb) : Bo subtracts executed statements found in Bb from those in Ba. These behavior operators can be used to define a compare criteria function. compare(Ba , Bb) : Bcommon , B1 , B2 computes Bcommon = Ba ∧ Bb, B1 = Ba - Bb, and B2 = Bb - Ba. Some examples of behavior operators:

foo(true) ∧ foo(false) = foo({})
foo(true) ∨ foo(false) = foo({true, false})
foo(true) - foo(false) = foo({true})
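The three examples can be reproduced with a small sketch. Since the matching algorithm is proposed work, the sketch makes a strong simplifying assumption: a behavior is modeled as a map from a statement name to the set of argument values observed for it, and two executed statements match exactly when their names are equal. The class BehaviorOperatorsSketch and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class BehaviorOperatorsSketch {
    /** Builds a behavior with one statement and the context values observed for it. */
    static Map<String, Set<String>> behavior(String stmt, String... values) {
        Map<String, Set<String>> b = new TreeMap<>();
        b.put(stmt, new TreeSet<>(Arrays.asList(values)));
        return b;
    }

    /** Conjunction: keep matched statements, intersecting their context values. */
    static Map<String, Set<String>> and(Map<String, Set<String>> a, Map<String, Set<String>> b) {
        Map<String, Set<String>> out = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : a.entrySet()) {
            Set<String> other = b.get(e.getKey());
            if (other == null) {
                continue; // no match: the statement is not in both behaviors
            }
            Set<String> values = new TreeSet<>(e.getValue());
            values.retainAll(other);
            out.put(e.getKey(), values);
        }
        return out;
    }

    /** Disjunction: keep statements from either behavior, uniting context values. */
    static Map<String, Set<String>> or(Map<String, Set<String>> a, Map<String, Set<String>> b) {
        Map<String, Set<String>> out = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : a.entrySet()) {
            out.put(e.getKey(), new TreeSet<>(e.getValue()));
        }
        for (Map.Entry<String, Set<String>> e : b.entrySet()) {
            out.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
        }
        return out;
    }

    /** Subtraction: keep statements of a, removing context values also found in b. */
    static Map<String, Set<String>> minus(Map<String, Set<String>> a, Map<String, Set<String>> b) {
        Map<String, Set<String>> out = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : a.entrySet()) {
            Set<String> values = new TreeSet<>(e.getValue());
            Set<String> other = b.get(e.getKey());
            if (other != null) {
                values.removeAll(other);
            }
            out.put(e.getKey(), values);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> fooTrue = behavior("foo", "true");
        Map<String, Set<String>> fooFalse = behavior("foo", "false");
        System.out.println("and:   " + and(fooTrue, fooFalse));   // and:   {foo=[]}
        System.out.println("or:    " + or(fooTrue, fooFalse));    // or:    {foo=[false, true]}
        System.out.println("minus: " + minus(fooTrue, fooFalse)); // minus: {foo=[true]}
    }
}
```

Under the same assumptions, compare(Ba, Bb) would simply return and(Ba, Bb), minus(Ba, Bb), and minus(Bb, Ba).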

Reachability questions do not include questions about module, type, or object structure. Such questions often involve relationships that are true at all or many program points rather than at a single program point. A reachability question is a search across only feasible paths, and thus does not include questions searching across control flow graphs as they contain both feasible and infeasible paths. For example, asking for all call sites of a method is not a reachability question, but asking for call sites that might feasibly invoke a method is a reachability question.

Reachability questions do not include static slicing questions. The backward static slice of a statement d is the subset of the program that could possibly influence the execution of d [20]. Similarly, the forward static slice is the subset of the program that could possibly be influenced by the execution of a statement o. There are several important differences between slices and reachability questions. First, static slices have no notion of feasibility – dependencies in a static slice may occur through paths that may never execute (i.e., existing static slicing algorithms are not path-sensitive). Second, a slice is a subset of the program whereas an answer to a reachability question is a set of traces through a program where statements can occur multiple times in different contexts (i.e., static slices are not context-sensitive). Finally, static slices describe dependency as statements that influence other statements and thus include both control flow and dataflow paths. For example, consider asking “What are the situations in which stmt executes?”, where stmt is guarded by a conditional cond. An upstream reachability question will find all paths from a cut-point to stmt. In contrast, a static slice will find any statements that influence the value of cond. Static slices will not include statements along paths that reach cond but that do not influence cond.

In my studies, I sometimes found questions that were not explicitly phrased as reachability questions but which developers sometimes answered by refining them into reachability questions. For example, developers asking “Why is this call necessary?”, a rationale question, often attempted to answer this question by asking questions about what the call did and the situations in which it was called. In contrast to behavioral reachability questions (or just reachability questions) which are about behavior exhibited in a program (sets of executed statements), intent reachability questions are about design decisions (e.g., explanations of past design decisions, constraints on future design decisions). An intent reachability question is a question for which there exists a strategy to answer it by asking a reachability question, finding target statements, and using knowledge to interpret the statements into facts answering the original question. While reachability questions can be answered simply by inspecting code, intent reachability questions inherently require knowledge for developers to interpret code into design decisions (see section 4.2). To answer “Which of these methods should I pick?”, a developer could ask compare(downstream(p, m1), …, downstream(p, mn)), inspect candidate functions, traverse into callees to see what they do, and compare behavior between functions. Of course, the developer might also instead simply read the documentation, ask a teammate, or even compare differences in how they are used (compare(upstream(p, m1), …, upstream(p, mn))). An intent reachability question only requires that there exists a strategy for a developer to refine the question into a reachability question rather than that the strategy is always chosen. Intent and behavioral reachability questions are together referred to as reachability-related questions.

3. RELATED WORK

Related work falls into several categories relevant to the proposed contributions of this thesis proposal. A number of previous studies have examined developer activity during coding tasks to build models describing developer behavior. While these studies help inform the design of REACHER, they do not identify or explicitly distinguish reachability questions. A number of tools help developers to reason about traces through programs but are poorly suited for answering reachability questions. Finally, many existing analyses can compute feasible paths, but are either too slow or too imprecise to be useful for answering reachability questions in an interactive tool.

3.1 Studies of control flow questions and information needs

It has long been known that control and data flow are central to how developers mentally represent programs. Studies of program comprehension have found that developers begin understanding small programs by constructing a mental model of control flow [E]. More recently, a number of studies have applied the idea of information foraging to describe how developers explore and navigate code [G]. Developers select a focus point method, use cues such as method names to pick structural relationships (e.g., calls) to follow, and collect information as they remember what was found [F]. Particularly crucial is the choice of which structural relationship to traverse. For example, one study investigated the sufficiency of bug report text in choosing which structural relationship to traverse [G]. Information foraging studies illustrate the central importance of effectively navigating call graphs to programming tasks. But they also reveal two key difficulties of existing tools. While developers seek nodes several levels away, they must choose which call to traverse based only on identifier cues from each method that is called. And developers must either remember or write down found information, leading to information loss, poor representation choices, and difficulties returning to locations where information was found [D].

There is a long tradition of studies investigating how developers understand programs. Traditionally, studies of program comprehension have applied constructs from cognitive psychology to investigate how programmers mentally represent programs or studied the effects of expertise on developer behavior [H]. But more recently, attention has shifted to studies designed to elicit design recommendations for better tools or practices by identifying information needs and questions associated with different software engineering activities.

One study found 21 questions about interactions with code, other artifacts, and teammates [J]. When writing code, developers seek functionality to reuse and information about how to reuse it. Developers submitting a change ask if it is correct, whether it follows team conventions, and what changes it should include. When triaging a bug, developers determine if it is a legitimate problem worth fixing. When receiving a new bug, developers reproduce it to determine what it looks like and when it occurs before asking about its cause. Developers also ask design questions about code’s rationale and the implications of a change. One third of the questions were reachability-related questions.

In another study, observations of developers yielded 44 questions. These were primarily lower level questions about code rather than questions about team interactions or design decisions [B]. Developers begin coding tasks by finding focus points corresponding to domain concepts or application functionality and work outward following relationships between methods and classes. Developers ask higher-level questions about relationships between multiple methods and classes, including questions about control and data flow. After making a change, developers ask about its implications. 52% of the questions identified as poorly supported by existing development environments are reachability questions (see 5.1).

Several studies have observed developers using existing tools and diagrams to derive design requirements for improved tools. One study observed developers using an existing UML tool while editing code [K]. Among many other usability problems with the tool, a key recommendation was supporting selecting a small number of elements of interest in the reverse engineered view to prevent wasted time understanding task-irrelevant parts of the system. They also saw the need for much more automated support for reverse engineering sequence diagrams, similar to what REACHER provides. Another study hung large posters diagramming developers’ codebase outside their offices [L]. However, these diagrams were rarely used and developers instead continued to use their whiteboards to draw diagrams by hand. Designed to be useful for all tasks, they instead had both too much and too little information – tasks required many details but only those relevant to the situation at hand. Thus, diagrams providing concise and targeted answers to questions are more likely to be useful than general-purpose diagrams.

Another study observed several students using a UML sequence diagram tool in the lab [W]. Qualitative analysis suggested element labels, animation between layouts, and diagram to source linkage are all important for such tools. Participants specifically requested the ability to do exploratory browsing by rapidly selecting and focusing on a section of interest and then moving back using a back button. Finally, the authors suggest that the frequent navigation between corresponding portions of the diagram and source could be reduced by adding additional information to the diagram such as information about conditionals and loops.

3.2 Tools for reasoning about paths between statements

A huge number of tools exist for helping developers reason about and explore relationships between statements in a program. These tools employ a program analysis to construct a representation of a program as a directed graph of statements with predecessor and successor edges between statements. Some construct abstractions of these graphs (e.g., call graphs merge statements in a method into a single node). These tools can be characterized along two dimensions: the relationship between statements considered and the interactions and visualizations with which developers explore these relationships. In slicing, a predecessor statement is a statement which is control or data dependent. In data flow, edges are the reaching definitions describing where read variables were last written. Control flow includes predecessors that previously executed on any execution, while concrete trace tools include the single predecessor on a specific execution. Feasible paths include multiple possible traces through a program but exclude paths through the CFG that can never execute. Table 1 lists representative tools for points in this space.

| Interaction / Relationship | Slice | Data flow | Control flow | Concrete trace | Feasible path |
|---|---|---|---|---|---|
| Contract verification | | | ESC/JAVA [AC] | JML COMPILER [AD] | |
| Property verification | | | | | STATIC DRIVER VERIFIER [AE] |
| Unit testing | | | | JUnit [AF] | Parameterized unit testing [AG] |
| Artifact recommendation | | | SUADE [AY] | | |
| Multi-version program analysis | Semantic diff [AQ] | | | | Differential symbolic execution [AR], REACHER |
| Up front diagrams | | Flow charts [AS] | Flow charts [AS] | UML tools | |
| Flow traversal | CODESURFER [I], WHYLINE [C] | CODESURFER [I], thin slicing [AA] | Eclipse CALL HIERARCHY, SHRIMP [T], RELO [AK], JQUERY [P] | Pattern murals [R], OSE [W] | REACHER |
| Flow search | | | DORA [AT] | OSE [W], AspectJ [AH] | REACHER |

Table 1. Examples of tools for reasoning about and exploring statement graphs. Tool names are listed in small caps while classes of tools are listed in lowercase.

In contract verification [AB], a developer expresses a contract as constraints (e.g., pre- and post-conditions, invariants) on a program’s state that must be satisfied at distinguished points in the CFG (e.g., method entry and exit). A program analysis traverses a statement graph, tracking state, and determines if the constraints are satisfied. If not, errors are reported to the user (e.g., compile errors). A vast number of verification and bug finding tools use this paradigm. Many tools statically traverse the CFG to check constraint satisfaction. For example, ESC/JAVA [AC] checks pre- and post-conditions written as JML annotations using a theorem prover. Other tools encode constraints as runtime assertions, traverse concrete traces by running the program, and check for assertion violations (e.g., JML COMPILER [AD]). Contract verification allows developers to explicitly describe design intent – a contract – in an unambiguous notation for both other developers understanding the code and tools checking that the contract is satisfied. Instead of explicitly traversing control flow, developers implicitly state constraints about what must or must not happen on paths between constraint checks. Contracts can be particularly helpful for specifying interfaces between code produced by different teams or companies (e.g., frameworks) as they support understanding foreign code without the code itself. For behavior specified in the interface, developers can answer downstream questions simply by reading the contract. Finally, when code depends only on behavior specified in the contract, rather than all behavior, developers may change unspecified behavior without asking upstream reachability questions.
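As a rough illustration of the runtime-assertion style of contract checking, the following Python sketch checks a precondition at method entry and a postcondition at method exit. The `contract` decorator, the `withdraw` function, and its conditions are hypothetical and are not taken from JML or any of the tools cited here.

```python
import functools

def contract(pre=None, post=None):
    """Hypothetical decorator: check `pre` at method entry and `post` at
    method exit, the two distinguished points at which a contract applies."""
    def wrap(fn):
        @functools.wraps(fn)
        def checked(*args, **kwargs):
            if pre is not None and not pre(*args, **kwargs):
                raise AssertionError(f"precondition of {fn.__name__} violated")
            result = fn(*args, **kwargs)
            if post is not None and not post(result, *args, **kwargs):
                raise AssertionError(f"postcondition of {fn.__name__} violated")
            return result
        return checked
    return wrap

# Illustrative contract: the amount withdrawn must not exceed the balance.
@contract(pre=lambda balance, amount: 0 <= amount <= balance,
          post=lambda result, balance, amount: result == balance - amount)
def withdraw(balance, amount):
    return balance - amount
```

A caller can learn from the contract alone that `withdraw` never returns a negative balance. Any behavior the contract does not mention, such as effects further downstream, remains invisible without reading the code.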

But despite its benefits, contract verification is ill-suited for answering most reachability questions. Contracts rely on the original developer specifying in the contract behaviors about which it is permissible to ask a question. However, if a developer wished to specify everything, contracts quickly become unscalable – at every method entry, a constraint on every effect that might possibly occur downstream would be necessary. Instead, contract-using developers are forced into a position of obliviousness to unspecified behavior. In practice, developers ask reachability questions about a wide range of behaviors. Moreover, even when developers ask a question specified in a contract, determining why a contract verification tool has reported an unsatisfied constraint requires determining the path taken by the tool [AD].

In property verification, developers specify a property using a specification language (e.g., a temporal logic) over program state that must be satisfied by all concrete traces (e.g., a always occurs before b). Property verification tools traverse feasible abstractions of concrete traces. Property verification generalizes contract verification by removing the modularity restriction that constraints may only be checked at distinguished points in the CFG. For example, the static driver verifier [AE] is able to check that drivers do not incorrectly use resources by checking that specified temporal orderings between method calls are respected. While property verification systems are specifically designed to check the reachability of error conditions, the specification languages are designed for the original developers to state complex correctness properties, not for investigating simpler reachability relationships in unfamiliar code. While property verification systems can be used to search for statements along feasible paths, they require the statements to be specified unambiguously, find only a single path reaching the specified statements (the error trace), and have primitive text displays listing any found paths. Thus, property specification is a poor interaction for investigating unfamiliar code where developers do not know the full names of what they seek, wish to see all paths matching search criteria, quickly iterate search criteria, wish to filter or compare paths, or need to make sense of control flow relationships.
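To make the "a always occurs before b" style of property concrete, here is a minimal Python sketch that checks the ordering over a handful of explicitly listed traces and returns a single violating trace. Enumerating traces by hand is an assumption made for illustration, since real property verifiers explore abstractions of traces. The function name and the event names are hypothetical.

```python
def first_violation(traces, a, b):
    """Return the first trace in which event `b` occurs without an earlier
    `a`, or None if every trace satisfies 'a always occurs before b'."""
    for trace in traces:
        seen_a = False
        for event in trace:
            if event == a:
                seen_a = True
            elif event == b and not seen_a:
                return trace  # a single error trace, as a verifier reports
    return None

# Hypothetical traces of calls made by a driver.
TRACES = [
    ["acquire", "write", "release"],
    ["write", "acquire", "release"],  # uses the resource before acquiring it
]
```

The developer must name both events exactly and gets back at most one trace, which mirrors the limitations discussed above for investigating unfamiliar code.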

In unit testing, developers write a short program to generate a concrete trace and constraints expressed as assertions over execution state. Unit tests differ from contract testing in that constraint checks need not occur only at distinguished program points. In contrast to property verification, unit tests state constraints over paths downstream from the test rather than globally over all paths, and they constrain the functionality code provides rather than ensure the preservation of global invariants. Unit tests are widely used in practice, often through tools such as JUNIT [AF] that automate running tests and viewing results. More recently, symbolic unit tests have been proposed which allow developers to add parameters to tests by performing constraint verification over feasible paths rather than a concrete trace [AG]. For answering reachability questions in unfamiliar code, unit tests suffer from all the same limitations as property verification.

Applying the idea of automated recommendations (e.g., Amazon customers who bought a also bought b [AI]) to code investigation, recommender systems implicitly or explicitly determine artifacts in which a developer is interested and recommend other similar artifacts to investigate. For example, SUADE [AY] uses call graph structure to recommend methods to investigate based on code elements a developer indicates are relevant. Recommender systems assume that there exist delocalized concern elements in code – methods or types implementing a feature provided by the application – and that the goal of developers’ investigative activity is to navigate from discovered elements to the remaining hidden elements. In general, reachability questions need not be about relationships between parts of a concern but may be about how loosely related portions of code interact. Moreover, by only considering as input the elements a developer has already found, there is no information available to determine what question has been asked. But while recommender systems are ill suited for answering reachability questions, algorithms for inferring relevance from call graph structure could be applied to ranking target statements matching reachability questions.

In multi-version program analysis, developers inspect differences before and after a code edit to understand the implications of a change. Semantic diff [AQ] compares static slicing dependency relationships between variables in a single method before and after a code edit. Differential symbolic execution compares trees of feasible paths through a method before and after a change to find differences in path conditions or symbolic values returned. Differential symbolic execution [AR] is more precise than REACHER in computing differences not only in behavior but of symbolic values. However, the analysis used to do so becomes unscalable for larger programs. Both tools only work for individual methods, not larger programs, making them ill suited for answering reachability questions.

When the original developer recognizes code as important and complex, a developer may choose to document interesting example concrete traces of specific situations. Structured design [AS] includes diagrams for depicting control and data flow relationships. UML sequence diagrams [N] depict concrete traces. Documenting concrete traces has the advantage of recording the original developers’ intent by their choice of elements to depict or text accompanying the diagram. But many developers do not invest the time to write documentation or do not reliably update the documentation, making it suspect [A]. Views of code that are not reverse engineered or checked for conformance against the code always run the risk of being inaccurate. This may be especially true of views attempting to answer reachability questions by depicting structural relationships, as the relationships are likely to change with even minor code edits. But most importantly, the huge number of structural relationships in a program that a reachability question might ask about makes documenting all of them impractical.

In contrast to interactions where developers are only implicitly aware of paths, in flow traversal developers explicitly navigate from an element in code or a diagram by selecting a structural relationship to traverse. Tools exist to traverse slice relationships such as static slices tracking control dependencies through CFG paths (c.f., CODESURFER [I]), dynamic slices tracking control dependencies over a concrete trace (c.f., [Q], WHYLINE [C]) recorded from a developer’s execution of a program, and thin slices tracking data flow between variables through CFG paths (c.f., CODESURFER [I], thin slicing [AA]). A number of tools exist to traverse the CFG at method granularity using different views – a tree view (e.g., Eclipse’s CALL HIERARCHY), calls overlaid on a diagram combining UML class diagrams with method text snippets (RELO [AK]), or calls overlaid on class structure (e.g., SHRIMP [T]). JQUERY [P] adds traversal of structural relationships amongst types and methods (e.g., method membership, subtyping, containment, references, constructors) to a unified tree view of a method granularity CFG. Many tools record concrete traces, often visualized as a UML sequence diagram depicting method invocations grouped by object instance lifelines, and support traversing method invocations (e.g., pattern murals [R], OASIS SEQUENCE EXPLORER (OSE) [W]). Other than OSE (discussed below), none of these tools support searching across statement graphs, leaving developers to use their knowledge to guess where target statements are located.

In flow search, developers specify an origin statement and a search string and search across paths in a statement graph for target statements. DORA [AT] searches statement graphs at a call graph granularity. A developer specifies an origin method and a search string, and DORA scores methods connected by a call graph path by their relevancy to the search string. Relevancy scores incorporate both information retrieval techniques and weights based on the importance of several method features. Logging aspects allow developers to construct a search string as a point cut descriptor, run the program, and browse matching target statements written to a log file [AH]. But developers must rerun the program whenever they change their search string and there is no support for exploring the log. OSE [W] depicts a concrete trace as a UML sequence diagram and allows developers to use regular expressions to search the names of methods. While employing the closest interaction technique to REACHER, existing tools are still ill-suited for answering reachability questions. Neither tool supports what-if filtering or comparison. Searching a single concrete trace cannot answer reachability questions as reachability questions ask about all traces. DORA is unable to reason about correlated conditionals and finds methods connected only by infeasible paths. The focus of DORA is the information-retrieval-based scoring heuristics for ranking matches. These could also be applied for ranking target statements in REACHER.
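The basic flow search interaction at call graph granularity can be sketched as a breadth-first search from an origin method that reports every reachable method whose name matches a search string. This is only an illustrative sketch, not DORA's or REACHER's algorithm. The call graph, method names, and function name below are hypothetical, and plain substring matching stands in for any relevancy scoring.

```python
from collections import deque

# Hypothetical call graph: caller -> callees.
CALLS = {
    "handleRequest": ["parseInput", "updateModel"],
    "parseInput": ["readBuffer"],
    "updateModel": ["acquireLock", "notifyListeners"],
    "notifyListeners": ["acquireLock"],
}

def search_downstream(calls, origin, query):
    """Breadth-first search from `origin`; return one shortest call path to
    each reachable method whose name contains `query` (case-insensitive)."""
    paths = {origin: [origin]}
    queue = deque([origin])
    while queue:
        method = queue.popleft()
        for callee in calls.get(method, []):
            if callee not in paths:
                paths[callee] = paths[method] + [callee]
                queue.append(callee)
    return {m: p for m, p in paths.items()
            if m != origin and query.lower() in m.lower()}
```

Because this sketch follows every call graph edge, it shares the weakness noted above: a reported target may be connected to the origin only by infeasible paths.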

3.3 Static analysis for feasible paths

REACHER draws heavily from path-sensitive static analyses for bug detection and verification. Broadly, these tools traverse feasible paths through a program, update abstract state by inspecting statements, and output an error whenever an error state is encountered. Different tools strike different performance / precision tradeoffs for the bugs they seek to discover.

A symbolic execution of a program propagates input variables to a program symbolically by name across paths through a program [AX]. Instead of executing a program on concrete values (e.g., 5), the program is executed with names of input variables (e.g., x) and constraints on these variables (e.g., x > 5). At conditional statements, a symbolic execution first attempts to determine which branch is feasible. If this is not possible, multiple paths are forked off, constraints are added to variables, and each path is explored in a separate context. No merging of contexts is performed at control flow merges. In this way, a symbolic execution constructs an execution tree of potential paths a program might follow. However, the precision of a symbolic execution is limited by the ability to determine which paths through a conditional are feasible. Moreover, the number of paths through real programs is intractably large, so this approach in practice amounts to sampling paths and testing.
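The forking behavior described above can be sketched on a toy program with one symbolic input x. In this Python sketch each conditional forks into two contexts, each carrying its accumulated path constraints, and a branch is pruned when its constraints cannot be satisfied. The program encoding and function names are hypothetical, and the brute-force feasibility check over a small integer range is an assumption standing in for a real constraint solver.

```python
DOMAIN = range(-10, 11)  # assumption: brute force stands in for a solver

def feasible(constraints):
    """True if some value of x in DOMAIN satisfies every constraint."""
    return any(all(test(x) for _, test in constraints) for x in DOMAIN)

def explore(stmts, constraints=(), trace=()):
    """Yield (constraints, trace) for each feasible path through `stmts`.
    Contexts are never merged, so the result is an execution tree."""
    if not stmts:
        yield constraints, trace
        return
    head, rest = stmts[0], stmts[1:]
    if head[0] == "call":
        yield from explore(rest, constraints, trace + (head[1],))
    else:  # ("if", label, predicate, then_block, else_block)
        _, label, pred, then_block, else_block = head
        branches = (
            (label, pred, then_block),
            (f"not({label})", lambda x, p=pred: not p(x), else_block),
        )
        for name, test, block in branches:
            forked = constraints + ((name, test),)
            if feasible(forked):
                yield from explore(block + rest, forked, trace)

# Two correlated conditionals on the same input.
PROGRAM = [
    ("if", "x > 5", lambda x: x > 5, [("call", "open")], []),
    ("if", "x > 3", lambda x: x > 3, [("call", "write")], [("call", "skip")]),
]
```

Of the four syntactic paths through `PROGRAM`, only three are explored to completion. The path taking the first branch and then the else branch of the second is pruned because x > 5 and not(x > 3) cannot both hold.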

The most precise approach to determining feasible paths is CEGAR (counter-example guided abstraction refinement) model checking (c.f., SLAM [V]). In contrast to symbolic execution, CEGAR model checkers lazily add precision by only adding constraints to variables when necessary. These constraints are used to guide which paths are taken at conditionals. After finding any feasible or infeasible path to an error statement, a theorem prover or SAT solver is used to determine if the path is feasible. If the path is infeasible, constraints are added to variables to prevent this path from being traversed, and the model checker again begins searching for a path to an error statement. While CEGAR model checkers have been used in practice, both the use of the theorem prover and iteratively searching for paths results in runtimes of hours or days for even small programs.

Dataflow analyses are less precise than model checkers but take much less time to execute. Most dataflow analyses do not attempt to eliminate infeasible paths. Instead, dataflow analyses simply iteratively traverse paths through a program to populate a context mapping variables in scope at each statement in the program to abstract values - constraints designed by the analysis author. The key feature of a dataflow analysis is that cycles in paths due to loops or recursion are iteratively traversed until a fixed point is reached and none of the abstract values have changed on the final iteration. Interprocedural dataflow analyses are fast and can run on even large programs in short amounts of time. A path-sensitive dataflow analysis is sometimes able to determine which paths through a conditional are feasible by using information in the context. When this is not possible, these analyses traverse both paths using separate contexts. Such a fully path-sensitive analysis is impractical, as the number of contexts grows exponentially in the number of conditionals that cannot be resolved. Instead, practical tools (c.f., ESP [Y]) join contexts with identical abstract state at control flow merges.
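The fixed-point iteration at the heart of a dataflow analysis can be sketched with a small worklist algorithm. The Python example below is a path-insensitive constant propagation over a hypothetical four-node CFG containing a loop. The node names, transfer functions, and the `TOP` marker for "not a constant" are all assumptions chosen for illustration, not part of any cited tool.

```python
TOP = "not-constant"  # abstract value: variable may hold different values

def join(a, b):
    """Merge two contexts at a control flow merge."""
    out = dict(a)
    for var, val in b.items():
        out[var] = val if out.get(var, val) == val else TOP
    return out

def analyze(succ, transfer, entry):
    """Traverse the CFG until no node's input context changes."""
    env_in = {entry: {}}
    worklist = [entry]
    while worklist:
        node = worklist.pop()
        out = transfer[node](env_in[node])
        for nxt in succ.get(node, []):
            merged = join(env_in[nxt], out) if nxt in env_in else out
            if env_in.get(nxt) != merged:
                env_in[nxt] = merged
                worklist.append(nxt)  # revisit: its input changed
    return env_in

def increment_x(env):
    x = env.get("x", TOP)
    return {**env, "x": TOP if x == TOP else x + 1}

# Hypothetical CFG: init -> head -> body -> head (loop), head -> exit.
SUCC = {"init": ["head"], "head": ["body", "exit"], "body": ["head"]}
TRANSFER = {
    "init": lambda env: {**env, "x": 0, "y": 1},
    "head": lambda env: env,
    "body": increment_x,
    "exit": lambda env: env,
}
```

Running `analyze(SUCC, TRANSFER, "init")` revisits the loop until the context at `head` stops changing. At `exit`, y is still known to be the constant 1 while x has been widened to `TOP`, because the loop body changes it on each iteration.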

Call graph construction algorithms eliminate infeasible paths created by dynamic dispatch or first class functions [X]. By propagating information about the possible runtime types of objects, these algorithms eliminate infeasible paths by determining the possible runtime types that might reach each receiver object or function pointer. These algorithms are fast and are often used as an input to other dataflow analyses, but only eliminate infeasible paths arising from dynamic dispatch. Yet even for this use they are still imprecise as values are propagated over both feasible and infeasible paths. As a result, the runtime types at a dynamic dispatch site include those that can never reach the site.
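A minimal sketch of how runtime type information prunes call graph edges at a dynamic dispatch site follows. It resolves a call against a hypothetical class hierarchy, first using every subtype of the declared receiver type and then using only the types assumed to reach the receiver. The hierarchy, class names, and function names are illustrative assumptions, and the sketch does not reproduce any particular algorithm from [X].

```python
# Hypothetical hierarchy: class -> (superclass, methods it declares).
CLASSES = {
    "Shape":  (None,    {"draw"}),
    "Circle": ("Shape", {"draw"}),
    "Square": ("Shape", set()),   # inherits Shape.draw
    "Sprite": ("Shape", {"draw"}),
}

def is_subtype(cls, ancestor):
    """True if `cls` is `ancestor` or inherits from it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = CLASSES[cls][0]
    return False

def subtypes(declared):
    """All classes that could be assigned to a variable of type `declared`."""
    return {c for c in CLASSES if is_subtype(c, declared)}

def lookup(cls, method):
    """Walk up the hierarchy to the implementation that would execute."""
    while cls is not None:
        parent, declared_methods = CLASSES[cls]
        if method in declared_methods:
            return f"{cls}.{method}"
        cls = parent
    raise LookupError(method)

def targets(receiver_types, method):
    """Possible callees of `receiver.method()` given the receiver's types."""
    return {lookup(t, method) for t in receiver_types}
```

With only the declared type `Shape`, `targets(subtypes("Shape"), "draw")` yields three possible callees. If propagation determines that only `Circle` and `Square` objects reach the receiver, `targets({"Circle", "Square"}, "draw")` drops the edge to `Sprite.draw`. As noted above, the propagated set may itself contain types that arrive only over infeasible paths.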

In summary, most existing techniques for eliminating infeasible paths are either slow (model checking, symbolic execution, fully path-sensitive dataflow analysis), do not eliminate infeasible paths (path-insensitive dataflow analysis), or eliminate a much more restricted set of infeasible paths than REACHER (call graph creation algorithms). The closest approach is partially path-sensitive dataflow analysis which both eliminates many of the same infeasible paths and is also fast. However, the benefits of this approach rely on merging contexts with identical abstract state. When only a simple property is being checked, there will be few potential distinct contexts possible. If this approach were to be applied to propagating constants to determine path feasibility, the number of contexts would be exponential in the number of variables that might have a constant value. Thus, this approach is also too slow.

4. EXPLORATORY STUDIES OF UNDERSTANDING CODE

To identify specific problems developers face understanding code and the context and nature of these problems, I conducted a series of studies of professional software developers. First, using surveys and interviews of developers at Microsoft, I examined the tools and practices developers use and the problems they perceive [A]. Second, in observations from a lab study, I discovered that developers reason about facts and described the process and effects of expertise on this reasoning [BZ]. Third, in a survey of developers at Microsoft, I uncovered 95 hard-to-answer questions about code. This section overviews the results of these studies and describes general findings about how developers investigate unfamiliar code. In the next section (section 5), these findings are extended by considering how developers ask and answer reachability questions in particular.

These results can be used in several ways. Most directly, they describe the tasks developers do and the questions and information needs in these tasks in ways that can inform the design of new and improved tools that better support the tasks developers do. But they also describe factors influencing what developers do. For example, investing in higher code quality can make other tasks easier. Information learned in one task may make subsequent tasks easier. Questions may be answered by answering other types of questions. Thus, a model of the process and information in these tasks also helps to begin to understand situational factors that influence how frequently questions are asked and how difficult they are to answer.

4.1 Tools and practices

In two surveys and 11 semi-structured interviews, I investigated the nature of coding tasks – developers’ use of time, the challenges they experience, and the practices surrounding these challenges [A]. Some of the results include:

-No single activity dominates coding activity time. Developers reported spending nearly equal time understanding code, communicating with teammates about code, writing new code, editing existing code, and other work-related activities. There are also large differences between developers – individual developers spent from 0 to more than 60% of their time in a given week on each of these activities. This suggests both that many activities are important, and differences between tasks, teams, projects, and lifecycle phase lead to very different activities.

-Developers spend most of their time communicating about code through informal communication - unplanned meetings or email discussions - not through documents.

-Developers go to great lengths to create and maintain rich mental models that are rarely permanently recorded.

-Developers reported that understanding the rationale behind code was their biggest problem. When understanding code, developers tried to understand the code themselves before asking their teammates.

-92% of developers agreed “there is a clear distinction between the code my feature team owns and the code owned by other teams”. This code ownership moat makes possible huge investments in understanding a small piece of the entire codebase.

-While newcomers often read design documents when present, they were quickly assigned bugs to work on to provide them specific, filtered situations in which to understand the behavior of code.

4.2 Fact finding, proposing changes, and the benefits of expertise

In observations of 13 developers in the lab during 3 hours of coding task sessions, I investigated the process by which developers make complex changes and identified several mechanisms by which expertise facilitates this process [BA]. Several of these findings also have implications for code exploration tools.

-Developers’ mental model consists not primarily of literal code snippets but facts abstracted from the code describing information chunks.

-Developers make choices about what to do next such as continuing to seek facts in their current location, seeking in a different location, or implementing a proposed change. This suggests that both productivity and the quality of changes could be increased by improving cues for estimating the cost and benefit of pursuing different strategies. But given a set of cues, experts’ knowledge allows them to more effectively interpret the cues, leading to better path choice decisions.

-Developers interpret information they read in code using their knowledge to learn facts. Experts’ greater knowledge allows them to learn more useful facts more quickly. Thus, tools may need to display different amounts of information to help users with varying levels of knowledge answer the same question.

-Not all actions in coding tasks come about from strictly hierarchic task decomposition. In some cases, developers act spontaneously in reaction to information discovered that triggers goals they did not set out to accomplish. For example, developers stumbling upon questionable code critiqued it and adapted their task goals. This suggests that code exploration tools must balance filtering irrelevant information against providing task relevant information that developers may not have known to look for.

-Individual facts are connected in a fact graph by explanation relationships. To understand the implications of their changes, developers seek to explain facts. Explanations about why a fact must be true trace rationale from low level facts to higher level requirements. Explanations establish constraints on what changes are possible without changing other facts. Experienced developers used their knowledge to explain facts others could not, allowing them to address the underlying cause of a design problem rather than its symptoms.

4.3 Hard-to-answer questions about code

One way of modeling an understanding task is to view it as a tree of questions and actions performed to answer these questions. A better understanding of the nature of these questions suggests information needs that development environments must satisfy. Revealing questions that are challenging to answer suggests opportunities for tools to make them easier. In a survey of 179 developers at Microsoft, I investigated hard-to-answer questions about code. Respondents reported 371 questions, which we clustered into 95 unique questions spanning 21 categories. Each of these questions captures some problem a developer thought was hard enough to write down and report.

-The most frequently reported question categories were rationale (42 questions), intent and implementation (32), debugging (26), refactoring (25), and history (23).

-Many questions were not about code’s behavior in all situations but in specific situations or about confirming or rejecting hypotheses. For example, “What happens when an exception is thrown or an operation times out?”. Rationale questions asked why a specific decision was made or even why a specific alternative was not chosen. This suggests code exploration tools must provide support both for specifying specific contexts and using information in hypotheses.

-Different questions often do not capture distinct information needs but instead different strategies for answering questions. At the highest level, code change tasks involve only two questions – “how do I do this?”, and “is this correct?” For example, consider a change removing a call to a method. Developers could ask the rationale question “why is this here?”, the history question “what changed when this was added?”, the dependency question “what depends on this?”, the architecture question “how is this code interacting with libraries?”, the implementation question “what does this do?”, the control flow question “is this path dead?”, the data flow question “what parts of this data structure are modified by this code?”, the implication question “what are the implications of removing this call?”, the testing question “is this correct?” after the change, and the debugging question “what caused this state to occur?” to understand what broke. The situation dictates how likely the answer to a particular question is to be informative – how good is the test infrastructure, how much work would it be to debug, how long does it take to implement this change and test it, was the method inserted when the whole method was written or as part of a focused change, is this call present only because of an unnecessary effect, or is this call unique to this situation or common in these situations? Questions are worth asking when they have informative answers and are the lowest cost route to finishing the task.

5. THE PROBLEM: ANSWERING REACHABILITY QUESTIONS

Developers understanding code during coding tasks ask questions about code and its behavior. To answer these questions, developers constantly choose amongst many strategies – e.g., using facts they already know to guess, running the program and playing with its behavior, using a debugger, or asking a teammate. One of the most frequently used strategies is to use an editor and source browsing tools to traverse control or data flow relationships to locate target statements matching search criteria. To do so, developers must guess which relationships lead to targets. After traversing, developers skim the method text in search of targets. After finding a target, developers must either remember it or write it down. This leads to forgotten information, tediously writing down information, difficulties making sense of information to answer higher level questions, and wasted time using their editor to return to the location of targets in code [D].

In this section, I present results from 3 studies I conducted of the questions developers ask during coding tasks. Surprisingly, reachability questions are central to many of these tasks.

5.1 Developers ask reachability questions

In study 1, I reanalyzed observations from 13 developers who took part in 3 hours of coding tasks in the lab (section 4.2). Despite spending almost the entire task asking questions and investigating code, developers frequently incorrectly understood facts about the code. Acting on these false facts, developers implemented buggy changes which they sometimes later realized were mistaken and abandoned. Half of all changes developers implemented contained a bug. In half of these defective changes (8 changes), I was able to relate the bug to a reachability question either in a false assumption that developers made (75%) or a question they explicitly asked (25%). Table 2 lists the false assumptions or questions that were related to reachability questions and the corresponding reachability question. Developers often made incorrect assumptions about upstream or downstream behaviors as they reasoned about the implications of removing calls currently present in the code. These assumptions took different forms depending on the change they considered. Upstream questions often occurred when developers asked or assumed that behavior was redundant and unnecessary because it would always be called somewhere else. In these cases, the call graph distance from the origin statement they investigated to target behavior was often small (mean = 1.75). These questions were challenging to reason about because it was difficult to determine which calls were feasible. In contrast, downstream questions often occurred when developers made false assumptions about how a method mutated data or invoked library calls. Here, the relevant effect was further away (mean = 3.5 calls), and developers had no reason to believe that traversing the path to the target would challenge their assumption.

In addition to the bugs that arose from assumptions developers made when they should have asked reachability questions, there were many cases where the developers did ask reachability questions and formulated a strategy to answer them. Developers spent much of the task investigating code by traversing calls in an attempt to understand what methods did and the situations in which they were invoked. Most participants rapidly switched between a call graph view (static) and the debugger call stack (dynamic). Static investigation allowed developers to navigate to any caller or callee at will. But as developers traversed longer paths of calls, developers were likely to hit infeasible paths. Several guessed incorrectly about which paths were feasible. Dynamic investigation was much more time-consuming to begin – developers set breakpoints, invoked application behavior, and skipped through breakpoint hits until the correct one was reached. At task start, most investigation was relatively unfocused – developers attempted to make sense of what the methods did and the situations in which they were called. As the tasks progressed and developers began to propose changes, questions grew increasingly focused and developers sought to navigate to specific points in code.


Developers differed greatly in the effectiveness and sophistication of the strategies they employed. Upstream navigation was particularly challenging for many participants. Two participants did not realize they could search the call stack to find an upstream method and instead spent time (16 mins, 10 mins) using string searches and browsing files. Three participants spent ten or more minutes (17, 13, and 10 mins) using a particularly tedious strategy to navigate upstream from a method m across only feasible paths: adding a breakpoint to each of m’s callers, running the application, executing functionality, noting which callers executed, and recursing on these callers. Many participants used Eclipse’s call graph exploration tool to traverse calls, but had problems due to infeasible paths and determining which calls led to search targets. The three most experienced participants instead invoked functionality and copied the entire call stack into a text editor. But even these experienced participants had problems reasoning about reachability relationships: three of the defects associated with reachability questions were inserted by these participants.
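The breakpoint strategy described above amounts to a recursive walk over callers, pruned by what was observed to execute. A minimal Java sketch follows, assuming a hypothetical caller map and a hypothetical set of methods whose breakpoints were hit; none of these names come from the study. It simplifies by treating "the caller's breakpoint was hit" as evidence that the caller lies on a feasible path to m, which is also the approximation participants were making.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class UpstreamByBreakpoints {
    // Hypothetical static call graph, stored as callee -> callers.
    static final Map<String, List<String>> CALLERS = Map.of(
        "updateCaret", List.of("insert", "remove", "undo"),
        "insert", List.of("keyTyped", "paste"),
        "remove", List.of("keyTyped"),
        "undo", List.of(),
        "keyTyped", List.of(),
        "paste", List.of());

    // Returns the callers reached by recursing only through methods that were
    // observed to execute (i.e., whose breakpoints were hit during the run).
    static Set<String> feasibleUpstream(String m, Set<String> executed) {
        Set<String> reached = new LinkedHashSet<>();
        collect(m, executed, reached);
        return reached;
    }

    private static void collect(String m, Set<String> executed, Set<String> reached) {
        for (String caller : CALLERS.getOrDefault(m, List.of())) {
            // Skip callers that never executed; recurse once on those that did.
            if (executed.contains(caller) && reached.add(caller)) {
                collect(caller, executed, reached);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> hit = Set.of("insert", "keyTyped");
        System.out.println(feasibleUpstream("updateCaret", hit));
    }
}
```

Each level of recursion corresponds to one more round of setting breakpoints and rerunning the application, which is why the manual version of this strategy took participants ten or more minutes.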

In study 2, I asked 460 developers at Microsoft to rate the frequency and difficulty of answering 12 reachability and reachability-related questions. The questions were first piloted to ensure that survey participants could successfully interpret the questions, resulting in several being reworded. Ten of these questions were adapted from reachability-related questions observed by Sillito [B] (see Figure 1) while the remaining two were taken from study 1:

Could this method call potentially be slow in some situation I need to consider? search(downstream(p, o, d), “slow” slowMethods externalCalls)

Is this method call now redundant or unnecessary in this situation? downstream(p, o, d)
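To make the shape of a query such as search(downstream(p, o, d), ...) concrete, the following Java sketch walks a toy call graph downstream from an origin method and collects the methods whose names contain a search string. It is a minimal sketch under stated assumptions: the call graph, the method names, and the DownstreamSearch class are hypothetical, matching is done on method names only, and path feasibility is ignored entirely.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DownstreamSearch {
    // Hypothetical call graph, stored as caller -> callees.
    static final Map<String, List<String>> CALLS = Map.of(
        "saveDocument", List.of("validate", "writeFile"),
        "validate", List.of("checkSpelling"),
        "writeFile", List.of("slowNetworkFlush", "updateStatus"),
        "checkSpelling", List.of(),
        "slowNetworkFlush", List.of(),
        "updateStatus", List.of());

    // Breadth-first traversal downstream from origin; returns the methods
    // (other than origin) whose names contain needle, ignoring case.
    static List<String> search(String origin, String needle) {
        List<String> matches = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.add(origin);
        String lowered = needle.toLowerCase();
        while (!work.isEmpty()) {
            String method = work.poll();
            if (!visited.add(method)) {
                continue; // already explored via another path
            }
            if (!method.equals(origin) && method.toLowerCase().contains(lowered)) {
                matches.add(method);
            }
            work.addAll(CALLS.getOrDefault(method, List.of()));
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(search("saveDocument", "slow"));
    }
}
```

The target here sits two calls away from the origin, so a developer answering the same question by hand would need to pick the right callee at each step.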

On average, developers reported asking more than 9 reachability and reachability-related questions a day. Reachability-related questions were often hard to answer. Of the 12 reachability-related questions developers rated, developers rated an average of 4.1 questions at least somewhat hard to answer and 1.9 as hard or very hard to answer. Few developers thought all reachability-related questions were easy to answer. 82% of respondents rated at least 1 question at least somewhat hard to answer, and 29% rated at least 1 question as very hard to answer. Developers do not ask reachability-related questions less frequently as they become more experienced or after spending more time in a codebase. Nor does the quality of the codebase affect the frequency of reachability-related questions. Interestingly, reachability-related questions do not even get easier to answer as developers gain development experience or spend more time in a codebase. While it is harder to answer reachability-related questions on lower quality code (R = .36, p < .0001), my results cannot determine whether this is true only of reachability-related questions or whether all questions about poorly maintained code simply become harder to answer.


Figure 1 plots question frequency against difficulty. Interestingly, difficulty is positively related to frequency (R = .35, p < .0001). Both the most frequent and hardest to answer reachability question is “What are the implications of this change?” Some reachability questions are much more frequent and difficult than others. Over 60% of developers thought answering “What are the implications of this change?” was usually at least somewhat hard, while this was true of only 16% of respondents for “How are instances of these classes or data structures created and assembled?” In general, the more difficult questions were more high level, requiring consideration of design decisions or potentially leading to many lower level questions.

Survey respondents were also asked to report other hard-to-answer questions about code they had asked (see section 4.3). From 371 reported questions, there were 95 distinct questions. 27 of the distinct questions (28%) were behavioral (21%) or intent reachability questions (7%). These percentages could be lower than in the Sillito corpus [B] because developers do not think of reachability questions when they remember hard-to-answer questions they have asked about code. But more likely, the 12 reachability-related questions they had just rated (Figure 1) covered some of the questions they would have otherwise reported. Table 3 lists the original natural language question from each study with its interpretation as a reachability question.

One way to understand both the frequency of reachability questions and their characteristics is to examine corpora of questions developers ask about code. Sillito [B] observed developers at work in coding tasks in the lab and in the field to find 44 distinct questions that developers ask. Of these 44, they estimated that 15 are currently fully supported by existing research or industrial tools and 29 are partially supported; none were identified as not at all supported. Half of the partially supported questions (52%) are behavioral reachability questions (38%) or intent reachability questions (14%). Only one of the fully supported questions is a reachability question. Most of the fully supported questions can be answered by tools implementing simple syntactic program analyses (e.g., “Who implements this interface or these abstract methods?”). The reachability questions from Sillito’s study are also listed in Table 3.

Table 3 demonstrates the need for both downstream and upstream queries as well as for search, filter, and compare. Questions are nearly evenly split between downstream (24) and upstream (19). Over a third of the questions (35%) involve a search for specific behavior upstream or downstream from an origin. Several compare behavior in different situations (19%) or filter (14%). Several questions (19%) ask about how code interacts with fields in a type. Only 1 question involves tracking valueFlow.

Behavioral reachability questions make use of far fewer criteria functions than intent reachability questions. One intent question uses search, one compare, and none use filter. All but two of the nine questions without a criteria function are intent questions. Intent questions involve interpretation of behaviors to learn facts, making it much harder to specify criteria that directly select a relevant behavior. This suggests that effectively supporting intent questions in code exploration tools requires helping developers explore and interpret static traces.

Figure 1. Question frequency against difficulty for 12 reachability-related questions.

In study 3, I observed 17 developers at Microsoft at work on their coding tasks in the field. Each session lasted approximately 2 hours. When selecting tasks, participants were encouraged to choose a task involving unfamiliar code, minimally defined as code they had not written themselves. While only 35% of the tasks that developers chose were tasks they planned to do at the time of our session, 95% (all but one) of the tasks developers chose were on their lists of tasks to do.

While debugging and investigating code, developers frequently asked reachability questions. In order to examine the relationship of these activities to reachability questions, I looked for reachability questions in the 5 longest debugging and 5 longest investigation activities. Each of these activities had a central, primary question developers tried to answer throughout the activity. Surprisingly, the primary question in 9 out of 10 of these activities was a reachability-related question. At the beginning of these activities, developers rapidly formulated a specific question describing behavior in the program they wished to locate. For example, to debug a deadlock, a developer began at a statement and began traversing callees in search of statements acquiring resources. 51 minutes later, this finally revealed the behavior that had caused the deadlock. Table 4 lists reachability questions associated with these long activities.

When answering reachability questions, developers explored the code either dynamically using the debugger and logging tools or statically using source browsing tools. Interestingly, developers did not primarily use the debugger to debug and code browsing tools to investigate implications. Instead, like the lab study participants, developers often made use of both tools as they sought to answer multiple lower-level questions or tried alternative strategies for answering their primary question. Developers constantly dealt with uncertainty during their tasks: from generating and testing hypotheses about code, determining whether their investigation strategies were likely to succeed or fail, and determining if the results produced by their tools contained false positives or false negatives.

An example from the longest debugging activity helps illustrate several of these points. Observing an error message in a running application, one developer spent 66 minutes locating the cause of the error message in the code. Using knowledge of the codebase, he rapidly located the code implementing the command he had invoked in the application. But it was not obvious where it triggered the error. Hoping to “get lucky”, he did a string search for the error message but found no matches. Unsure why there were no matches, he next statically traversed calls from the command method in search of the error. But he was unsure which path would be followed when the command was invoked. Switching to the debugger, he stepped through the code until learning his project was misconfigured and creating spurious results both in his debugger and code searches. After resetting his project configuration, he again did a string search for the error string and found a match. However, many callers called the method, any one of which might be causing his error. So he returned to stepping in the debugger. After locating code that seemed relevant, he quickly browsed through the code statically. Finally, he returned to the debugger to inspect the values of some variables.

5.2 Strategies for answering reachability questions

Using existing strategies to answer reachability questions is currently time consuming, difficult, and error prone. Developers in my lab study became overwhelmed investigating unfamiliar code and gave up, answered questions incorrectly, and inserted bugs because of false assumptions about reachability questions. In some cases, developers posed questions that, if they could have answered, would likely have prevented bugs. Developers elected not to answer them because they were either too time consuming or too difficult to answer. In contrast, none of the developers at work on their own codebase gave up answering a reachability question. But while they did not give up, they recounted questions they recalled as painful to answer or spent tens of minutes during my observation sessions answering them.

In order to answer a reachability question, developers select a strategy amongst the strategies with which they are familiar. All strategies either rely on the developer’s own knowledge, communicating with teammates, or the developer investigating code. For code that developers know well, developers may already know the answer as part of their understanding of how it works [AM]. Of course, this understanding is difficult to achieve in large codebases both due to the number of statements and paths and because they constantly change as developers edit the code. A field study participant spent several minutes investigating code he had written himself a little over a year earlier because he did not remember all the details and others had edited it. Conversely, even developers new to an application generate hypotheses and make assumptions about reachability questions. For example, developers in my lab studies assumed that an EditBus was connected to edit events. But if developers wish to test these hypotheses, they must employ other strategies.

Developers also answer reachability questions by communicating with their teammates. Developers sometimes document concrete traces describing important behavior in complex code with diagrams such as UML sequence diagrams. However, as there are many possible reachability questions, it is unlikely that an up to date diagram will exist for most reachability questions. Developers also communicate with their teammates by sending emails or instant messages, holding meetings, or interrupting their teammates while they are at work. In some cases, social convention may preclude teammate interruption from being the first strategy used. As it is expensive for the interrupted teammate, developers are often expected to have done some due diligence to get a general understanding before asking a more focused question of their teammates [A]. Of course, teammates also eventually leave the team, may be otherwise unavailable, might have forgotten the answer, or never knew the answer at all.

Finally, developers answer reachability questions by investigating the code. There are two classes of investigation strategies. In dynamic investigation, developers run the program and observe its output either directly or through tools such as a breakpoint debugger, logging statements, or investigative trace-based tools such as the WhyLine [C]. Like concrete trace documentation, dynamic investigation can reveal important paths in specific circumstances. In some cases, developers may even be able to answer reachability questions by exhaustively executing the code in all possible situations relevant to the question. But, generally, dynamic investigation only indicates what code does in the situation in which it was executed. Answering reachability questions which ask about behavior in all possible situations requires further investigation using other strategies. Moreover, starting the program or providing the input necessary to reach the situation of interest is sometimes time-consuming or impossible. A field study participant working with a web application was forced to wait days for it to restart and reach the execution state of interest. In other cases, special hardware or resources necessary for executing an application may not be available. When failures are reported in the field, they include only call stacks or logging information, sometimes making it impossible to recreate the concrete trace of interest.

The only strategy capable of answering reachability questions accurately from the code itself is static investigation. In static investigation, developers inspect the program’s source using an editor and source browsing tools. Call graph tools, such as the Eclipse call hierarchy, allow developers to follow chains of calls through source. However, these chains often include infeasible paths. Developers must use either knowledge to guess or manually simulate execution to remember the values of variables and determine which paths will be taken. Moreover, developers must traverse call relationships to locate code, often visiting many irrelevant statements in search of their target.

5.3 Collecting and making sense of answers

Sometimes developers need to remember the history of the task they are performing. When traversing call relationships, developers frequently pick the wrong relationships to traverse. To proceed, developers must then backtrack, remember which relationships they have not already visited, and pick another to traverse.

In many tasks, remembering both task context information and answers to reachability questions can become overwhelming. Several developers investigating code in my lab studies gave up, reporting being too overloaded by information to continue. Developers also switch tasks because of interruptions or becoming blocked as frequently as once every 5 minutes [J] and must then put aside their current task information and remember any previous information for their new task. For complex tasks, recovering from interruptions can be so challenging that developers plan their day to do more complex tasks when they are least likely to be interrupted [A].

Common development environments such as Eclipse leave developers to cope with the memory burdens of their tasks themselves. To recall information collected from code or collect further information, developers frequently navigate back to code they have previously visited [D][B]. In other cases, developers jot down notes on paper such as names of methods on a trace, line numbers where interesting statements are located, facts that are interesting and worth remembering, or followup questions that should be pursued. However, the time and effort to externalize this information discourages externalization in all but the most challenging situations and minimizes the amount of information developers choose to record. Moreover, when developers wish to revisit the location of information they have collected to find additional information, they must still manually navigate back to the location. While tools have been designed to bookmark task-relevant methods or blocks of code [AV][AW], no tools exist to track paths developers investigate.

6. THE PROBLEM: TRAVERSING FEASIBLE PATHS

Developers traversing paths using static investigation must determine both which paths lead to targets and which paths are feasible. Information foraging models predict that several factors influence the difficulty of traversal. Recall that these models predict that developers use their knowledge to interpret method names at call sites and traverse those they believe are most likely to lead to their goals [G]. Thus, the amount of knowledge the developers have about code and the amount and accuracy of “scent” in the identifiers determines the time required for successful traversal. Of course, the distance from the start to target statements and the number of potential relationships to traverse (the branching factor) shape the size of the search space.

In my studies, methods sometimes had high branching factors simply because they were long and contained many method invocation sites (downstream questions) or were widely used with many callers (upstream questions). But high branching factors were also sometimes caused by conditionals. For example, dynamic dispatch to a common interface (e.g., IRunnable in Java) often has many possible targets. Over the course of my lab study observations, field observations, and developers recounting stories of challenging tasks, I identified several infeasible path idioms. Surprisingly, all of these are examples of constant-controlled conditionals (CCC). A constant-controlled conditional consists of three parts: a creation statement where a variable is assigned a constant, a propagation path where the constant flows from its initial variable to other variables through assignment statements, and a conditional that evaluates a variable containing the constant to select which branch is taken. Along paths from alternative origin statements are different creation statements controlling which branch through the conditional is taken. Conceptually, a creation statement selects a branch to be taken and this decision is propagated through assignment statements.

The simplest example of a constant-controlled conditional is a flag. A constant (e.g., true or false) is assigned to a variable, the constant is propagated to a conditional statement, and the conditional determines whether or not a block of code is executed.
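A minimal Java sketch of the three parts of a flag-based constant-controlled conditional follows. The class and method names are hypothetical illustrations and are not drawn from the studied codebases.

```java
public class FlagConditional {
    // Conditional: the branch taken depends on which constant reached 'repaint'.
    static String applyEdit(boolean repaint) {
        if (repaint) {
            return "edit+repaint";
        }
        return "edit";
    }

    // Propagation: the constant flows through a parameter unchanged.
    static String insertText(boolean repaint) {
        return applyEdit(repaint);
    }

    // Creation statements: each origin supplies a different constant.
    static String onKeyTyped() {
        return insertText(true);
    }

    static String onBatchLoad() {
        return insertText(false);
    }

    public static void main(String[] args) {
        System.out.println(onKeyTyped());  // the repaint branch is feasible from this origin
        System.out.println(onBatchLoad()); // the repaint branch is infeasible from this origin
    }
}
```

A call graph tool shows the same edges from both origins into applyEdit, so a developer reading the call graph alone cannot tell that the repaint branch is never taken along the path from onBatchLoad.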

A more complex example is dynamic dispatch. For example, in implementations of the subject-observer design pattern [AL], an observer object registers for notifications sent by a subject object by passing itself to the subject. The constant being passed is the type of the observer. The observer gets an instance of itself (creation) and registers for notifications (propagation), adding itself to a collection of observers in the subject. When the subject sends a notification, each observer is read from the collection and a dynamic dispatch (conditional) evaluates the runtime type of the observer (the constant), selecting one of potentially many dispatch targets.
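The subject-observer case can be sketched in Java as follows. This is an illustrative sketch only: the Subject, StatusBar, and AutoSaver types are hypothetical, and the run method exists solely to show that the set of dispatch targets reached depends on which observers were created and registered upstream.

```java
import java.util.ArrayList;
import java.util.List;

public class ObserverDispatch {
    interface Observer {
        String notifyChanged();
    }

    static class StatusBar implements Observer {
        public String notifyChanged() {
            return "StatusBar.notifyChanged";
        }
    }

    static class AutoSaver implements Observer {
        public String notifyChanged() {
            return "AutoSaver.notifyChanged";
        }
    }

    static class Subject {
        private final List<Observer> observers = new ArrayList<>();

        // Propagation: the observer (and so its runtime type) flows into the collection.
        void register(Observer observer) {
            observers.add(observer);
        }

        // Conditional: dynamic dispatch selects a target from each observer's runtime type.
        List<String> fire() {
            List<String> reached = new ArrayList<>();
            for (Observer observer : observers) {
                reached.add(observer.notifyChanged());
            }
            return reached;
        }
    }

    // Creation: which observers are instantiated decides which targets are feasible.
    static List<String> run(boolean withAutoSaver) {
        Subject subject = new Subject();
        subject.register(new StatusBar());
        if (withAutoSaver) {
            subject.register(new AutoSaver());
        }
        return subject.fire();
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```

Statically, the call in fire() has every Observer implementation as a possible target; only the creation and registration statements determine which of them can actually execute in a given situation.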

In COM, code instantiates an implementation of a COM interface (creation). This implementation is propagated to many call sites, likely through a field, that perform dynamic dispatch (conditional). A field study participant found it difficult to determine which COM component would be called at dynamic dispatch sites:

I’ll tell you a problem that I encounter some times that is hard with a capital h. And it feels like it ought to be trivial. But you’ve got some widely implemented COM interface. And you care about a particular implementation of it. And you want to know who are all the people that call my implementation. But if you just search for the interface name and the method name, you’re going to get a gazillion people all over the place that have some implementation of that interface that they are calling. But there are only a few callers that are actually relevant because, you know, you can figure out who co-creates my class id. Great! Now what do they do with that pointer. Who query interfaces that pointer, who passes that to who. You can trace it through, sometimes. But, that’s something that I’ve definitely had to do that is frustrating.

When using a framework, developers write classes extending framework classes. Developers instantiate their classes (creation), register them with the framework (propagation), which then performs dynamic dispatch (conditional). This flow of control back and forth between the framework and user code can make it difficult to understand what is happening. A graduate student told a story of spending much of a day trying to debug analysis code he had written. While he had the entire framework source in his project, it was still difficult to follow paths back and forth between the framework code and the code he had written.

Work items are often used to queue up and sequentially perform actions. Rather than call a method implementing the action, the action method is enclosed in a work item type. The client code creates the work item type which is then propagated to some generic code that loops over work items and performs dynamic dispatch on each work item (conditional) transferring control back to the action method.

In publish / subscribe, message producers create messages (creation) which are sent on a bus which then notifies subscribers of a new message (propagation). Either the subscribers themselves or the bus inspect the message to determine if it is a message type relevant to the subscriber (conditional).
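The publish / subscribe idiom differs from the dispatch-based idioms in that the conditional is an explicit test on the message rather than a dynamic dispatch. A minimal Java sketch follows, assuming a hypothetical MessageBus in which the subscriber itself inspects the message type; the message type names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class MessageBus {
    static class Message {
        final String type;
        final String body;

        Message(String type, String body) {
            this.type = type;
            this.body = body;
        }
    }

    interface Subscriber {
        void receive(Message message);
    }

    static class BufferSubscriber implements Subscriber {
        final List<String> handled = new ArrayList<>();

        public void receive(Message message) {
            // Conditional: only messages carrying the relevant type constant are handled.
            if ("BUFFER_CHANGED".equals(message.type)) {
                handled.add(message.body);
            }
        }
    }

    private final List<Subscriber> subscribers = new ArrayList<>();

    void subscribe(Subscriber subscriber) {
        subscribers.add(subscriber);
    }

    // Propagation: the bus forwards every message to every subscriber.
    void send(Message message) {
        for (Subscriber subscriber : subscribers) {
            subscriber.receive(message);
        }
    }

    static List<String> demo() {
        MessageBus bus = new MessageBus();
        BufferSubscriber subscriber = new BufferSubscriber();
        bus.subscribe(subscriber);
        // Creation: producers construct messages with a constant type.
        bus.send(new Message("BUFFER_CHANGED", "a.txt"));
        bus.send(new Message("CARET_MOVED", "line 3"));
        return subscriber.handled;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```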

7. THE SOLUTION: INTERACTIONS AND VISUALIZATIONS

In order to help developers more effectively ask and answer reachability questions, I propose a fundamentally new technique for reverse engineering where developers ask a reachability question, a static analysis finds matching target statements, and visualizations help make sense of the answers. Developers first select a statement in their development environment and start REACHER by invoking an upstream or downstream reachability query. REACHER executes an analysis (section 8) to compute static traces. Developers then describe a reachability question using search, compare, filter, and other criteria functions describing target statements in static traces.

Developers ask reachability questions in order to answer intent reachability questions about design decisions. Answering these questions often requires determining the original developers’ intent (e.g., “Why is this call necessary?”) and reverse engineering facts generalizing behavior. To do so, developers need to understand the context of target statements. REACHER provides three views of statement context. The search results pane depicts targets grouped and sorted, making it easy to get an overview of the targets that exist and refine questions from string searches to statement searches. TRACEMAPS depict control and data flow paths and help developers reason about ordering and dependency between targets spread across many methods. TRACESOURCE helps developers inspect individual targets in the context of a method’s source to determine paths through a method, values in executed statements, and the intraprocedural paths taken.

In this section, I describe proposed work for asking and answering reachability questions including interactions by which developers describe reachability questions, a novel diagram for understanding behavior that might occur, and a novel source view for understanding executed statements. While the following sections sketch designs for each of these, completing the designs is proposed work. Thus the discussion sometimes includes decisions to be made and alternatives that might be chosen as the design space is explored and experience is gained from using REACHER in practice.

In order to reduce risk, verify that this approach is capable of helping developers answer reachability questions more effectively, and iterate the design to make it more effective, I propose to conduct a paper prototype study of REACHER. Developers will work on a task taken from study 1 to allow direct comparison to the questions, strategies, and productivity of the earlier participants. Developers will use both Eclipse and a paper mockup of REACHER. After asking a reachability question invoking REACHER, developers will see a paper copy of what REACHER would depict. Early pilot participants will be used to determine the questions developers are most likely to ask and thus which diagrams to mock up. Each participant will work on a design enhanced to deal with problems the previous participant experienced. A successful conclusion of the study would then provide evidence that an implementation of REACHER could help developers answer reachability questions more effectively.

Note that many of the diagrams in this section are in color.

7.1 Asking reachability questions

Developers begin a session with REACHER by invoking a reachability query in Eclipse. Using the insertion point or selection, developers denote a statement and invoke an upstream or downstream command. This first executes an analysis to generate a set of static traces and then creates a REACHER window. Next, developers may refine their reachability question by adding, editing, and deleting criteria functions in REACHER. To make sense of the results, developers view and navigate between statements and target statements in the static trace. Developers may also ask followup reachability questions relative to executed statements selected in a REACHER view by right clicking and selecting a reachability question from the context menu (e.g., “When does this execute?”). Asking a followup question executes an analysis to compute static traces and creates a new reachability question. Developers may also ask new reachability questions by selecting statements in Eclipse and invoking reachability questions. At the bottom of the REACHER environment is a numbered list of reachability questions the developer has asked.

To create a reachability query, REACHER requires 2 or 3 pieces of information: is the query upstream or downstream, a destination statement (upstream queries and optionally downstream queries), and an origin statement (downstream queries). I propose to explore the space of interactions for specifying these arguments to devise a design that both requires minimal developer effort for the most common cases and is expressive enough to handle less frequent cases. Rather than simply map the insertion point to a line of code, the insertion point could be mapped to the first enclosing expression in statements with nested expressions, corresponding to a TAC statement. This allows developers to select TAC statements, rather than Java statements, providing greater expressiveness.
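The 2 or 3 pieces of information a query needs can be pictured as a small value type. The Java sketch below is an assumption about how such arguments might be represented and is not REACHER's design: the ReachabilityQuery class, its factory methods, and the use of "File.java:line" strings to denote statements are all hypothetical.

```java
import java.util.Objects;
import java.util.Optional;

public class ReachabilityQuery {
    enum Direction { UPSTREAM, DOWNSTREAM }

    final Direction direction;
    final Optional<String> origin;      // required for downstream queries
    final Optional<String> destination; // required for upstream, optional for downstream

    private ReachabilityQuery(Direction direction, Optional<String> origin,
                              Optional<String> destination) {
        this.direction = direction;
        this.origin = origin;
        this.destination = destination;
    }

    // Upstream queries need only a destination statement.
    static ReachabilityQuery upstream(String destination) {
        Objects.requireNonNull(destination, "upstream queries require a destination");
        return new ReachabilityQuery(Direction.UPSTREAM, Optional.empty(),
                                     Optional.of(destination));
    }

    // Downstream queries need an origin statement.
    static ReachabilityQuery downstream(String origin) {
        Objects.requireNonNull(origin, "downstream queries require an origin");
        return new ReachabilityQuery(Direction.DOWNSTREAM, Optional.of(origin),
                                     Optional.empty());
    }

    // Downstream queries may additionally be bounded by a destination statement.
    static ReachabilityQuery downstream(String origin, String destination) {
        Objects.requireNonNull(origin, "downstream queries require an origin");
        Objects.requireNonNull(destination, "destination must not be null when given");
        return new ReachabilityQuery(Direction.DOWNSTREAM, Optional.of(origin),
                                     Optional.of(destination));
    }

    // Counts the direction plus whichever statements were supplied.
    int piecesOfInformation() {
        return 1 + (origin.isPresent() ? 1 : 0) + (destination.isPresent() ? 1 : 0);
    }

    public static void main(String[] args) {
        System.out.println(upstream("Buffer.java:120").piecesOfInformation());
        System.out.println(downstream("EditBus.java:88", "Buffer.java:120").piecesOfInformation());
    }
}
```

Restricting construction to these three factory methods rules out the invalid combinations (for example, an upstream query with no destination) without any runtime checks on direction.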

Examples of downstream reachability questions suggest that two common forms are queries downstream from a call site (downstream(p, call)) and querying an entire method (downstream(p, m)). These could be invoked by default from insertion points at call sites or method declarations. Selection regions could be used to denote an origin and destination within a method.

Origin and destination statements serve two distinct purposes. First, they describe the start and end of a static trace in which target statements might be located. Second, they describe the statements in which an analysis may discover creation and propagation statements in order to determine path feasibility. In some situations, it may be desirable to separate these goals by generating a larger static trace more likely to include assignment statements while restricting the area in which target statements can be matched. For example, when the insertion point is on a call site in m, beginning the static trace at the start of m may find additional constants (e.g., the call site is in a select statement which requires a variable to have a constant value), while subtrace can scope searches to only statements downstream from the call site.

After invoking a reachability query, the developer is brought into the REACHER environment. From here, the developer may navigate through the static trace at either a method granularity using a diagram (see section 7.2) or at a statement granularity in a listing of executed statements (see section 7.3). This navigation allows developers to ask feasibleCallers and feasibleCallee questions (but relative to an executed statement, not a statement in all contexts). But usually developers will use criteria functions to describe target statements.

The most common criteria function is search. To search, developers enter strings in a search textbox and toggle tabs above the textbox selecting criteria functions (Figure 3). Entering multiple search strings separated by spaces or selecting multiple tabs creates disjunctive searches (conjunctive searches can be performed indirectly using the ∧ behavior operator - see below). Table 5 maps several common reachability questions with searches to interactions. Selecting multiple tabs repeats the search using each criteria function. Leaving the textbox empty matches all strings and corresponds to not including a search function. As the developer enters each character of a string, a list of search results is immediately updated.
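The matching rule for the search textbox (space-separated strings are disjunctive, and an empty textbox matches everything) can be sketched as follows. This is a minimal Java sketch under stated assumptions: the DisjunctiveSearch class is hypothetical, statements are represented as plain strings, matching is case-insensitive substring containment, and the tab-selected criteria functions are left out.

```java
import java.util.ArrayList;
import java.util.List;

public class DisjunctiveSearch {
    // Returns the statements matching ANY whitespace-separated term in query.
    // An empty or blank query matches every statement.
    static List<String> search(List<String> statements, String query) {
        String trimmed = query.trim();
        String[] terms = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        List<String> results = new ArrayList<>();
        for (String statement : statements) {
            boolean matches = terms.length == 0;
            String lowered = statement.toLowerCase();
            for (String term : terms) {
                if (lowered.contains(term.toLowerCase())) {
                    matches = true;
                    break;
                }
            }
            if (matches) {
                results.add(statement);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> trace = List.of(
            "mutex.acquire()",
            "buffer.append(text)",
            "fileLock.release()");
        System.out.println(search(trace, "acquire lock"));
        System.out.println(search(trace, ""));
    }
}
```

Because the function is cheap to rerun, calling it on every keystroke would give the incremental result updates described above; whether that remains fast enough on large static traces is an open design question.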

Figure 3. The search textbox and tabs. Tab labels shown in the figure: WRITES, FIELD, TYPES, EXTERNAL, CALLS, READS, COMMENTS, ALL.

Developers may also ask nested reachability questions containing multiple view functions, multiple reachability queries, or behavior operators. Some interactions (see below) create nested reachability questions but are not expressive enough to create all such questions. REACHER could support nested reachability questions by allowing developers to write reachability questions using the formalism in section 2. While this might be useful for advanced users, it would likely be cumbersome for most common questions. Moreover, developers might write syntactically incorrect reachability questions, making it necessary to debug the reachability questions themselves.

Instead, I propose to explore interaction techniques for developers to ask nested reachability questions. Figure 4 depicts a possible design for developers to view and edit a structured representation of a reachability question while enforcing syntactic correctness and prohibiting invalid compositions. Each reachability question is assigned a number q with which other reachability questions can reference it. The first line describes the behavior to be searched and is either a reachability query, a filter applied to a reachability query, two reachability queries combined with compare, or two reachability questions combined with a behavior operator (∧, ∨, -, compare). Developers may perform zero or more searches over the behavior to create search lines. Each search line consists of either cutPoints, valueFlow, search, search(cutPoints), or search(valueFlow). Each form of search line is displayed using a natural language syntax. However, valueFlow(TR, expr) : B is only available for behaviors which are traces (reachability queries or filter composed with a reachability query). Selecting a search line updates the search textbox and tabs and allows the search line to be edited. Each search line is given a unique color (for the reachability question) used to distinguish targets matching the search line. To create a new reachability question with a behavior operator, developers right click the reachability question, select a behavior operator, and select a reachability question for the second argument. Asking refinement or followup questions through other interactions also updates the list of reachability questions.

| Reachability question | Tab selection | Notes |
| --- | --- | --- |
| search(Bi, str) | all | Matches str against any text in either statements, executed statements where variables are bound to values, fully qualified method names, or comments in Bi |
| search(Bi, reads(str)) | field reads | Matches str against all fully qualified field names (i.e., including matches against packages or type names containing fields) with a read statement in Bi |
| search(Bi, accesses(str)) | field writes, field reads | Matches str against all fully qualified field names (i.e., including package and type names) with a read or write statement in Bi |
| search(cutPoints(Bi), str) | external calls | Matches str against any method call sites to methods for which there is no source in the project. |
| search(Bi, methods(str)) | types | Matches str against the containing type of method declarations in Bi |

Table 5. Common search reachability questions are expressed using tab selections and entering a search string str.
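The disjunctive behavior of search strings and tabs can be illustrated with a minimal Python sketch. The Statement record, the tab names used as kinds, and the case-insensitive substring matching are assumptions made for illustration; they are not REACHER's implementation, and the sample data is not taken from an actual jEdit trace.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Statement:
    kind: str  # assumed tab names, e.g. "field read", "field write", "external call"
    text: str  # text the search string is matched against

def search(behavior, query="", tabs=()):
    """Return statements matching ANY search string under ANY selected tab."""
    terms = query.lower().split()
    results = []
    for stmt in behavior:
        if tabs and stmt.kind not in tabs:
            continue  # with no tab selected, every kind is searched
        if not terms or any(term in stmt.text.lower() for term in terms):
            results.append(stmt)
    return results

# Illustrative data only.
behavior = [
    Statement("field read", "JEditBuffer.bufferListeners"),
    Statement("field write", "JEditBuffer.foldLevel"),
    Statement("external call", "fm.getHeight()"),
]

both = search(behavior, "listeners height")               # either string may match
reads = search(behavior, "buffer", tabs=("field read",))  # restricted to one tab
```

Here an empty query matches every statement, mirroring the rule that an empty textbox corresponds to not including a search function.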


[Figure: reachability question 1, “downstream from JEditBuffer.getFoldLevel”, with two search lines: “search for external calls” and “search for values coming from or going to JEditBuffer.fireFoldLevelChanged: 2065: getListener()”.]

Figure 4. A structured representation of reachability questions.

Developers ask filter reachability questions when they ask about the behavior of code in a specific situation. Developers ask these questions by selecting an expression in REACHER, invoking “What if this is” from a right click menu, and selecting a constant from a list of constants the expression might be. This refines the current reachability question by adding a filter view function. If developers select multiple constant values, a filter for each is created, and each is compared.

Developers also ask reachability questions comparing static traces before and after a change. When developers edit code in their development environment and then switch back to REACHER, REACHER will prompt them to either keep the current results, refresh the static traces by rerunning the analysis, or refresh and compare differences caused by the changes.

Often developers may not initially know exactly what to search for. In these cases, developers ask an initial reachability question and iteratively refine it as they learn more. Developers may explore and use what they learn to either formulate a more specific string search or refine their search using target selection. In target selection, developers select statements in one of the three views, refining a search for statements matching a string to a search for a set of statements. In some cases, developers ask conjunctive reachability questions about targets that match multiple reachability questions. For example, a developer in study 2 tried to determine the correct method to call to set an application setting. To do so, he looked for methods downstream from several dialogs he knew changed it and upstream from the statement setting the application setting. Developers ask conjunctive questions by selecting a reachability question and selecting a behavior operator.

In some cases, REACHER will be unable to determine the branch taken at a conditional (see 8.3 for a discussion of when this occurs). In these cases, developers may either use filter to explicitly select a branch, filter to select and compare multiple branches, or valueFlow to find the situations controlling which path is followed. To ask a valueFlow question, developers select an expression in REACHER and invoke “Where might this come from?” from a context menu. This adds a new valueFlow search line to the active reachability question. valueFlow questions can also be used in a variety of other situations such as finding calls into an object or determining which, if any, objects are invoked when iterating over objects in a loop.

Table 6 summarizes the followup questions supported by REACHER.

After a search, REACHER provides 3 views of matching targets that provide increasing amounts of context. A search results pane lists target statements. TRACEMAPS (see 7.2) depict, for each target statement, the containing method and paths between these methods and the origin statement. TRACESOURCE (see 7.3) depicts every statement in the static trace with targets highlighted.

I propose to investigate possible designs for listing target statements in the search results pane. One key design decision is how targets are sorted and grouped. Targets could be hierarchically grouped by location in the source (e.g., containing method, type, package) or similarity (e.g., calls to the same method, calls with the same arguments, assignments to fields). Targets could be sorted by strength of match (e.g., method matches scored by the number of target statements they contain), distinctiveness [AT][AY] (i.e., infrequent statements are ranked higher than frequent statements such as System.out.println), or alphabetically. Of course, multiple groupings and sort orders could be user selectable.

A second key decision is how much context is displayed for each matching target statement. Designs trade off the amount of detail against the number of targets visible. Possible designs range from the less verbose (e.g., only the identifier that matched, the line of code that matched) to the more verbose (e.g., the line before and after the match, lines of code and the fully qualified containing method name).
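One way the distinctiveness ordering mentioned above might work is sketched below in Python. Measuring distinctiveness by how often a call appears in some corpus of call sites is an assumption made for illustration, as are the function name and the sample data.

```python
from collections import Counter

def rank_by_distinctiveness(targets, corpus_calls):
    """Sort targets so that calls appearing rarely in corpus_calls come first.

    Ties are broken alphabetically so the order is deterministic.
    """
    frequency = Counter(corpus_calls)
    return sorted(targets, key=lambda target: (frequency[target], target))

# Illustrative corpus: println is ubiquitous, invalidateLineRange is rare.
corpus = (["System.out.println"] * 5
          + ["Log.log"] * 2
          + ["JEditTextArea.invalidateLineRange"])
targets = ["System.out.println", "JEditTextArea.invalidateLineRange", "Log.log"]
ranked = rank_by_distinctiveness(targets, corpus)
```

Under this scoring the ubiquitous System.out.println call sinks to the bottom of the results list.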


| Selection | Command | Followup question |
| --- | --- | --- |
| reachability question q1 | “Compare this to”, select question q2 | compare(q1, q2) |
| reachability question q1 | “And also”, select question q2 | q1 ∧ q2 |
| reachability question q1 | “Or”, select question q2 | q1 ∨ q2 |
| reachability question q1 | “But not”, select question q2 | q1 - q2 |
| method m | “When might this execute?” | upstream(p, m) |
| expression e | “When might this execute?” | upstream(p, e) |
| expression e | “Where might this come from or go to?” | valueFlow(upstream(p, e), e) |
| expression e | “What if this is”, select value v | filter(q, e, v) |
| expression e | “What if this is”, select values v1 and v2 | compare(filter(TR, e, v1), filter(TR, e, v2)) |
| field f | “Where might this come from?” | valueFlow(upstream(p, writes(f)), writes(f)) |
| field f | “Where might this go to?” | valueFlow(downstream(p, reads(f)), reads(f)) |

Table 6. Followup questions are invoked by selecting an item in the search results pane, reachability question list, TRACEMAP, or TRACESOURCE, right clicking to invoke a context menu, and selecting a command.
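The behavior operators in the first four rows can be read as set operations over the targets that answer each question. The Python sketch below takes that reading as an assumption (the formalism itself is defined in section 2), and the method names in the example are hypothetical stand-ins for the study 2 scenario of finding the method that sets an application setting.

```python
def and_also(q1, q2):
    """q1 ∧ q2: targets that answer both questions."""
    return q1 & q2

def or_else(q1, q2):
    """q1 ∨ q2: targets that answer either question."""
    return q1 | q2

def but_not(q1, q2):
    """q1 - q2: targets that answer q1 but not q2."""
    return q1 - q2

def compare(q1, q2):
    """Assumed reading of compare: report what is unique to each question."""
    return {"only_q1": q1 - q2, "only_q2": q2 - q1}

# Hypothetical targets: methods downstream from a dialog, and methods
# upstream from the statement that writes the application setting.
downstream_of_dialog = {"Settings.apply", "Settings.setProperty", "Log.log"}
upstream_of_write = {"Settings.setProperty", "Startup.loadDefaults"}

candidates = and_also(downstream_of_dialog, upstream_of_write)
```

The conjunction narrows the two answers to the single method that lies on both kinds of path.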

7.2 TRACEMAPS provide an overview of targets in the context of paths

TRACEMAPS provide an overview visualization of target statements at a method granularity in the context of static traces. TRACEMAPS help developers browse targets, refine questions, and make sense of how targets are related.

Figure 5 depicts a TRACEMAP from an example taken from study 1. TRACEMAPS primarily depict methods - each white box is a method name. Methods are enclosed by boxes corresponding to the containing type. TRACEMAPS are laid out as a tree from left to right (recursive calls do not influence layout). An in-order traversal of a TRACEMAP corresponds to execution order. Within the tree, elements are laid out on a two-dimensional grid. In contrast to UML sequence diagrams where each message is usually depicted on a separate row, this helps maximize the utilization of screen real estate. A number of visual attributes of methods and edges between methods are used to encode a variety of information (see Table 7). By default, methods are visible when they contain a target statement. When there exists a path between visible methods through hidden methods, it is shown with a thick line. Clicking on the line expands the path. Methods may also be made visible explicitly by invoking a command on a method selection in a TRACESOURCE.
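The rule for collapsing hidden methods into thick lines can be sketched as a walk over the call tree. This Python sketch assumes the tree is given as a dictionary from a method to its callees, that the origin is always visible, and that recursion has already been removed; the function name and the simplified call chain are illustrative.

```python
def visible_edges(tree, root, visible):
    """Return (caller, callee, style) edges between visible methods.

    style is "thin" for a direct call and "thick" when hidden methods
    lie on the path between the two visible methods.
    """
    edges = []

    def walk(node, last_visible, hidden_between):
        for child in tree.get(node, []):
            if child in visible:
                style = "thick" if hidden_between else "thin"
                edges.append((last_visible, child, style))
                walk(child, child, False)
            else:
                walk(child, last_visible, True)

    walk(root, root, False)
    return edges

# Simplified, illustrative call chain.
tree = {
    "getFoldLevel": ["fireFoldLevelChanged"],
    "fireFoldLevelChanged": ["foldLevelChanged"],
    "foldLevelChanged": ["invalidateLineRange"],
}
collapsed = visible_edges(tree, "getFoldLevel",
                          {"getFoldLevel", "invalidateLineRange"})
expanded = visible_edges(tree, "getFoldLevel",
                         {"getFoldLevel", "fireFoldLevelChanged", "invalidateLineRange"})
```

Making a hidden method visible, as clicking a thick line would, splits the collapsed path into a direct call followed by a shorter collapsed path.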

Figure 5. A TRACEMAP depicting cutPoints(downstream(jEdit, getFoldLevel)) from study 1.

[Figure: a TRACEMAP with methods from JEditBuffer, ExplicitFoldHandler, IndentFoldHandler, BufferHandler, JEditTextArea, ChunkCache, and TokenMarker (including getFoldLevel, fireFoldLevelChanged, foldLevelChanged, invalidateLineRange, invalidateScreenLineRange, getLineInfo, lineToChunkList, updateChunksUpTo, and markTokens), the external call targets isShowing, fm.getHeight(), painter.getWidth(), and gutter.getWidth(), and a depth-limited path labeled “framework calls”.]


| Visual attribute | Meaning |
| --- | --- |
| shaded gray box with text label | type with type name |
| white background / colored background | method name / target statement. Only the expression matching the search is shown, not the entire Java statement. |
| + / - / # | public / private / protected. Used in method names |
| method name crossed out | method marked with @deprecated annotation |
| method name ending in #N (e.g., #3) | method is the Nth overloaded method with that method name |
| circle on line between methods | method call site in loop |
| solid / dashed | call must execute / call may execute. Calls must execute when all paths through the static trace from method entry contain the call. Note that may / must is determined with respect to method entry, not the origin of the static trace. |
| diverging curved lines | mutually exclusive calls |
| 90 degree lines with an arrow | recursive call |
| thin line / thick line | single call / path of calls. Methods along a path between visible methods may be hidden, creating a path. |
| method name in bold font | developer has viewed method |
| whisker on left edge of method name | method is called by additional methods which are not currently visible. Methods with no callers visible and no whiskers are cut-points. |
| gray line with open arrowhead | data flow from target statement a to target statement b |

Table 7. TRACEMAP visual attributes and their meaning.

Developers often hypothesize methods are relevant, view them, and discover that they are not relevant. They then continue by viewing other methods they hypothesize to be relevant. A key challenge in this process is remembering the methods they have already viewed. Therefore, TRACEMAPS depict the names of viewed methods in bold.

Consider again the TRACEMAP in figure 5. In study 1, participants attempted to understand what getFoldLevel did by traversing through downstream methods. Despite spending considerable time doing this, only one was able to discover that the call sometimes led to the screen being redrawn, an important reason why the call was present. In contrast, searching for calls to external methods using REACHER would rapidly reveal a path along which the screen is invalidated, leading to repaint calls to the framework. Note, however, that manually reverse engineering this diagram using Eclipse required over an hour of investigation as hundreds or thousands of methods are downstream from getFoldLevel.

This example illustrates two techniques for scoping visible targets and methods. First, at least tens of calls were excluded when they were to methods contained in Object, Integer, Number, String, Character, HashMap, Vector, Segment, Math, or Thread. Additionally, calls into the utility user type Log were excluded. When using the cutPoints view function, an exclusion list is helpful for removing the many ubiquitous calls to collection and utility types. Second, the target search is depth-limited - only targets up to a depth of 8 are shown. Other targets are shown by a path leading to the text “framework calls”. Depth limiting informs developers that a path contains targets without showing targets beyond the depth limit. Clicking on a path overrides the depth limiting and shows all methods along the path.
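The two scoping techniques can be combined in a single breadth-first search. The Python sketch below is a simplification under stated assumptions: it works on a plain call graph rather than a static trace, names are of the form Type.method, and the call edges and the Component type are illustrative rather than jEdit's actual code.

```python
from collections import deque

def find_external_calls(calls, origin, has_source, excluded_types, depth_limit):
    """Depth-limited downstream search for calls to methods without source.

    Returns (targets, truncated): targets maps each external callee to the
    depth of its shallowest call, and truncated lists methods at the depth
    limit whose callees were left unexplored.
    """
    targets, truncated = {}, []
    seen = {origin}
    queue = deque([(origin, 0)])
    while queue:
        method, depth = queue.popleft()
        callees = [c for c in calls.get(method, [])
                   if c.rsplit(".", 1)[0] not in excluded_types]
        if depth == depth_limit:
            if callees:
                truncated.append(method)  # more may lie beyond the limit
            continue
        for callee in callees:
            if callee not in has_source:
                targets.setdefault(callee, depth + 1)
            elif callee not in seen:
                seen.add(callee)
                queue.append((callee, depth + 1))
    return targets, truncated

# Illustrative call edges only.
calls = {
    "JEditBuffer.getFoldLevel": ["JEditBuffer.fireFoldLevelChanged", "Log.log"],
    "JEditBuffer.fireFoldLevelChanged": ["BufferHandler.foldLevelChanged", "Vector.size"],
    "BufferHandler.foldLevelChanged": ["JEditTextArea.invalidateLineRange"],
    "JEditTextArea.invalidateLineRange": ["Component.isShowing"],
}
has_source = set(calls) | {"Log.log"}
excluded = {"Log", "Vector"}

deep = find_external_calls(calls, "JEditBuffer.getFoldLevel", has_source, excluded, 8)
shallow = find_external_calls(calls, "JEditBuffer.getFoldLevel", has_source, excluded, 2)
```

With a limit of 2 the search reports no targets but flags the method at which exploration stopped, which is the information a depth-limited path in a TRACEMAP would convey.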


Both compare and upstream reachability questions either directly or indirectly compare static traces. The compare criteria function compares two static traces created either by filtering and selecting multiple values, comparing two reachability questions, or comparing multiple versions of a method. Upstream reachability questions generate a static trace from each cut-point that reaches the destination statement. Depicting differences between static traces is challenging. A simple solution is to list each static trace in a separate TRACEMAP. However, when an upstream query is performed on a method called from many origin statements, this may generate tens or even hundreds of static traces. Moreover, many of these static traces will often be highly similar and share many common subtraces. In compare questions, the static traces may be nearly identical and difficult to differentiate.

A second solution is to depict the start of each static trace separately but visually share similar common subtraces. But when are subtraces similar? A strong criterion might require them to be identical and call the same method in an identical context. However, this will only occur when both the parameter values are identical and all fields written along either path have identical values. This is unlikely to occur in practice. However, method calls in different contexts may appear equivalent if differences in the contexts never determine the feasibility of paths to visible methods. But as the developer refines their question and edits search lines, the visible methods will change. Thus, merging common paths that appear equivalent would change how paths are depicted as the developer iterates or refines their search.

A third possible solution is to share common paths that sometimes appear equivalent and provide a visual attribute to highlight differences when they are not. For example, the first method along each static trace could be associated with a unique color. When merged static traces later differ in which path is followed, each path could again be associated with colors describing which static trace follows each path. Figure 6 gives an example of such a TRACEMAP. However, this design still scales poorly in situations comparing many static traces.
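A minimal Python sketch of this third design follows. It simplifies each static trace to a single path of method names and records, for every call edge, which traces follow it; edges followed by only some traces are the ones that would need a trace color. The function names and the two sample traces are assumptions made for illustration.

```python
from collections import defaultdict

def merge_traces(traces):
    """Map each call edge (caller, callee) to the set of traces that follow it."""
    edge_owners = defaultdict(set)
    for name, path in traces.items():
        for caller, callee in zip(path, path[1:]):
            edge_owners[(caller, callee)].add(name)
    return dict(edge_owners)

def edges_needing_color(edge_owners, trace_count):
    """Edges followed by only some traces; these would be drawn in trace colors."""
    return {edge: owners for edge, owners in edge_owners.items()
            if len(owners) < trace_count}

# Two illustrative traces, named by their assumed colors, that share b -> c.
traces = {
    "red": ["a", "b", "c", "d"],
    "blue": ["e", "b", "c", "f"],
}
owners = merge_traces(traces)
colored = edges_needing_color(owners, len(traces))
```

The shared edge from b to c is drawn once, while the edges into b and out of c keep the color of the trace that follows them.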

[Figure: a TRACEMAP over types Type1 through Type8 with methods a, b, c, d, and methodWithTarget, leading to three target statements.]

Figure 6. A TRACEMAP comparing three static traces.

As developers interact with REACHER to iterate and refine their questions, REACHER updates the TRACEMAPS to reflect these new questions. Figure 7 continues the example from figure 5. Here, a developer asks which, if any, objects may be notified when JEditBuffer.fireFoldLevelChanged invokes BufferHandler.foldLevelChanged. Viewing the TRACESOURCE of the method, he selects the receiver object for the call to foldLevelChanged and uses the context menu to create a new valueFlow search line. The statements through which this expression has flowed are now visible (green) in the TRACESOURCE.

Figure 7. An example of using valueFlow to understand what may happen.

[Figure: the TRACEMAP from Figure 5 with an added valueFlow search line; it additionally shows JEditBuffer.getListener, the target getListener(...), and the bufferListeners field.]


The developer learns that objects are flowing through a bufferListeners collection. But what objects might flow into this collection? The developer now selects bufferListeners and asks “Where might this come from?” As this question asks about values flowing into the collection in any execution (not simply downstream from getFoldLevel), this creates a new followup reachability question and TRACEMAP (figure 8). Three different methods write to bufferListeners by calling addBufferListener, either passing along a parameter (1), creating a new BufferChangeListener (2), or creating a new BufferHandler (3). But the lack of a whisker in (1) indicates that it is never called from user code and (2) occurs only in a deprecated method.

[Figure: a TRACEMAP showing addBufferListener, addBufferChangeListener, addBufferListener#2, and DisplayManager writing to bufferListeners, with targets listener, bufferHandler, new BufferHandler(...), and new BufferChangeListen...]

Figure 8. Asking a followup question (“Where might this come from?”) creates a new TRACEMAP.

7.3 TRACESOURCE helps make sense of statements along paths through a method

A static trace describes possible executions of the program as a list of executed statements that include both the statement and a context binding every variable to a value. In many situations, developers may wish to browse static traces at a statement granularity. When discovering interesting target statements, developers may wish to see the role they play in the method in which they are contained. When inspecting calls that may or may not execute in a TRACEMAP, developers may wish to see the conditionals and surrounding statements determining if they will execute. When asking filter questions about the behavior of code in a specific situation, developers need to select an expression and value for the expression. When asking valueFlow questions, developers may wish to see statements along the path or where the constants originally came from.

In contrast to source which depicts statements, TRACESOURCE depicts executed statements in static traces. When a static trace maps an expression to a constant, TRACESOURCE indicates that the expression has a constant value. TRACESOURCE also indicates when a static trace has resolved a conditional by determining which branch will be executed. An important design choice is how this information is displayed. In Figure 7, REACHER has determined that an EBMessage has a runtime type of EditPaneUpdate, and this allows branches through several conditionals to be resolved. Expressions known to be a constant are shown with a gray background followed by the constant in a color associated with the static trace. Expressions or statements known never to execute are replaced with “...” in a colored background. Clicking on the ellipses expands the hidden text. Target statements are depicted in the color of the corresponding search line.

    public void handleMessage(EBMessage EditPaneUpdate msg)
    {
        if(msg instanceof PropertiesChanged false)
            ...
        else if(msg instanceof SearchSettingsChanged false)
            ...
        else if(msg instanceof BufferUpdate false)
            ...
        else if(msg instanceof EditPaneUpdate true)
            handleEditPaneUpdate((EditPaneUpdate)msg);
    }

Figure 7. TRACESOURCE for a method from jEdit found in study 1. The developer searched for “handle”.


Like TRACEMAPS, TRACESOURCE also compares static traces when a reachability question generates more than one. For static traces in which an expression is known to be a constant, TRACESOURCE lists the constant value in a color associated with the static trace. For upstream reachability questions, this is the color of the origin method in the TRACEMAP.

In some cases, developers may wish to traverse calls or other relationships. Scrolling a TRACESOURCE up or down navigates to the previous or next method. When exploring a single static trace, this is the previous or next method in a depth-first search traversal. When multiple static traces are visible, this is the next or previous method in one of the static traces. Developers may also select call sites and invoke a command to navigate to the callee. Note that call sites on infeasible paths in a method cannot be navigated to as there is no callee in the static trace. Another command could be used to jump between target statements.

7.4 View management and integration

A key goal of REACHER is to enable developers to rapidly refine and iterate questions and make sense of the answers. While doing so, developers may switch between each of REACHER’S three views. At all times, all three views show the same static traces and target statements answering a reachability question. Interactions which change the reachability question are immediately reflected in all 3 views. When developers express interest in a statement they have discovered in a view, developers may wish to see it in other views. To do so, developers select a statement. This navigates and highlights the statement in all views. Developers may also decide that some navigation action or question they asked led to uninteresting code. Therefore, REACHER will maintain a history stack of navigation events and reachability question edits and support back and forward commands through the stack. Developers can recall where they have previously visited, as visited method names are persistently bold.

Figure 7. The main window of REACHER.

[Figure: a mockup of the REACHER main window with Back, New, and Forward buttons; a depth limit selector (1, 2, 5, 10, 20) and an Exclusions... command; a TRACEMAP pane showing the TRACEMAPS for the getFoldLevel question and the bufferListeners followup question; the search textbox and tabs; a question list containing “downstream from EditPane.setBuffer” (search for "updateCaretSt"), “upstream from writes to BufferHandler.bufferListeners”, and “downstream from JEditBuffer.getFoldLevel” (search for external calls; search for values coming from or going to JEditBuffer.fireFoldLevelChanged: 2065: getListener(); search for values going to BufferHandler.bufferListeners); and a TRACESOURCE pane showing JEditBuffer.fireFoldLevelChanged.]


Developers often ask followup questions, ask and answer multiple questions in parallel, work on multiple tasks, or try to recall what they have learned to guide future questions. To support these activities, the REACHER environment allows developers to interact with multiple reachability questions. Figure 7 depicts a mockup of the view management features of REACHER. At the lower right of the REACHER window is a question list of reachability questions. Clicking on a reachability question makes it the active reachability question (black box outline) and updates the static trace views. Within the active question, the active search line is shown in the search textbox and search results pane in the corresponding color. Each reachability question is either visible (white background) or hidden (gray background). Clicking the x icon deletes the reachability question and the minus makes it hidden. The TRACEMAP pane shows the TRACEMAP of each visible reachability question. To help developers understand how followup questions are related to an initial question, REACHER depicts data or control flow relationships between the expression on which the followup question was asked (if any) and the origin or destination statements of the followup question. Followup questions are also laid out in the same scrollable region as the initial question. When followup questions refer to an expression or method in the initial question, the origin (downstream) or destination (upstream) statements are laid out in the same column as the expression or method in the initial question. Colored outline boxes are used to denote selections and associate linked selected elements. For example, methods currently visible in the TRACESOURCE are shown with a yellow outline in the TRACEMAP.

The TRACESOURCE pane contains a list of TRACESOURCES for methods. Several designs are possible for how developers navigate between TRACESOURCES for different methods. The TRACESOURCE pane could contain a single scrollable list containing the TRACESOURCE for every method in the reachability question in execution order. This view makes it easy to skim through a static trace. Alternatively, the TRACESOURCE pane could contain separate scrollable subwindows listing the TRACESOURCE of methods containing targets. This would help developers see the context surrounding multiple targets at a time.

REACHER runs in a separate top-level window from the developer’s IDE. In the increasingly prevalent situations where a developer either has a large monitor or multiple monitors, this allows both REACHER and the IDE to be visible at the same time. As developers work on coding tasks in their IDE, they may ask reachability questions either from within the IDE or REACHER. Navigating in REACHER also navigates the IDE.

7.5 Evaluation

To test my thesis statement and evaluate the usefulness and usability of REACHER’S reverse engineering approach, I propose to conduct a between-subjects lab study. Developers using Eclipse and REACHER will be compared against a control condition using only Eclipse. Additional competing tools such as UML tools might be added as additional conditions. To maximize external validity of the tasks, tasks will take place in open source codebases and may use actual past changes or bugs. Dependent variables to measure will include task time and completion and perhaps also the design quality of the changes, the facts learned, the number of task irrelevant methods inspected, or the number of renavigations to previously visited methods. To understand the usefulness of this technique for developers with varying levels of codebase knowledge, it would be interesting to study developers with knowledge of a codebase. But as running a study long enough for a developer to be familiar with a codebase is likely to be infeasible, evaluating this would likely require conducting a field study of developers at work in their own codebases. While it would be ideal to conduct a field study of REACHER, this will only be possible if the tool is sufficiently robust to work on the codebases field participants use.

8. THE SOLUTION: FAST FEASIBLE PATH ANALYSIS

In order to answer reachability questions, I propose to design a fast feasible path analysis (FFPA) capable of reverse engineering static traces from code quickly enough to use in an interactive tool. To do so, it will eliminate many of the infeasible paths created by constant-controlled conditionals. The key technical problem is implementing downstream. Given a program p, an origin statement o, and a destination statement d, downstream finds all of the feasible concrete traces starting at o. upstream can be implemented by executing a downstream query from each cut-point connected by any feasible or infeasible path to d. Most criteria functions simply select executed statements from static traces and do not require additional program analysis. filter can be implemented by extending a downstream implementation to read assertions on program variables imposed by filter.
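The reduction of upstream to downstream can be sketched in Python as two passes. The sketch first selects cut-points connected to the destination by any path, then reruns the search from each candidate using only feasible edges. Marking each call edge with a fixed feasibility flag is a strong simplifying assumption (in FFPA feasibility would depend on the context established from the origin), and the method names are hypothetical.

```python
def reachable(calls, origin, feasible_only):
    """Methods reachable from origin; optionally follow only feasible edges."""
    reached, stack = set(), [origin]
    while stack:
        method = stack.pop()
        if method in reached:
            continue
        reached.add(method)
        for callee, feasible in calls.get(method, []):
            if feasible or not feasible_only:
                stack.append(callee)
    return reached

def upstream(calls, cut_points, destination):
    """Cut-points with a feasible path to destination."""
    # Pass 1: candidates connected by any feasible or infeasible path.
    candidates = [cp for cp in cut_points
                  if destination in reachable(calls, cp, feasible_only=False)]
    # Pass 2: a downstream query from each candidate over feasible edges.
    return [cp for cp in candidates
            if destination in reachable(calls, cp, feasible_only=True)]

# Hypothetical call graph: each edge is (callee, feasible).
calls = {
    "onClick": [("save", True)],
    "onKey": [("save", False)],   # guarded by a constant that rules it out
    "onTimer": [("log", True)],
    "save": [("write", True)],
}
origins = upstream(calls, ["onClick", "onKey", "onTimer"], "write")
```

The first pass keeps the expensive second pass from running on cut-points, such as onTimer here, that cannot reach the destination at all.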


Conceptually, a static trace is an abstraction of a set of feasible concrete traces. A static trace has similarities with both control flow graphs and concrete traces. Like a concrete trace, a static trace is a list of executed statements containing variables that are bound to values. Values may be either concrete values or abstract values representing sets of concrete values. Like a CFG, a static trace contains control flow branches at unresolved conditionals and control flow merges and represents multiple paths through a program. But unlike CFGs, a static trace only contains successor statements which may be feasible along the path followed.
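To make the contrast with a CFG concrete, the Python sketch below builds a toy static trace. The program encoding (tuples for calls and equality-tested conditionals), the env dictionary of known constants, and the callee names other than handleEditPaneUpdate are assumptions made for illustration; the sketch also ignores what an untaken branch implies about a variable's value.

```python
def static_trace(block, env):
    """Return executed statements for a block of statements.

    A statement is ("call", name) or ("if", var, value, then_block, else_block),
    the latter meaning `if var == value`. Conditionals on variables bound in env
    are resolved; unresolved conditionals keep both successor branches.
    """
    trace = []
    for stmt in block:
        if stmt[0] == "if":
            _, var, value, then_block, else_block = stmt
            if var in env:  # constant-controlled: only one branch is feasible
                taken = then_block if env[var] == value else else_block
                trace.extend(static_trace(taken, env))
            else:           # unresolved: record a branch with both successors
                trace.append(("branch", var, value,
                              static_trace(then_block, env),
                              static_trace(else_block, env)))
        else:
            trace.append(stmt)
    return trace

# A toy version of a message dispatch method.
program = [
    ("if", "msgType", "PropertiesChanged",
        [("call", "propertiesChanged")],
        [("if", "msgType", "EditPaneUpdate",
            [("call", "handleEditPaneUpdate")],
            [])]),
]

resolved = static_trace(program, {"msgType": "EditPaneUpdate"})
unresolved = static_trace(program, {})
```

Binding a variable in env plays the role that an assertion imposed by filter would: it turns an unresolved branch into a single feasible path.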

When a developer asks a reachability question, REACHER will use FFPA to generate a static trace, execute any criteria functions, and display this output. This must happen quickly enough that users do not have to wait long for the tool to respond. The results should be precise and correspond mostly to feasible concrete traces. And results should be sound and report all concrete traces that might occur. Thus, the three main goals of FFPA are performance, precision, and soundness. The next several sections describe the assumptions and approach of fast feasible path analysis and the structure of the static traces it produces. Then, an approach implemented in a prototype FFPA is described. Finally, several extensions are proposed to allow FFPA to achieve its performance, precision, and soundness goals.

8.1 Assumptions

FFPA makes several assumptions about the characteristics of real world programs and the reachability questions developers ask. First, it assumes a closed world – only the code in its current configuration is analyzed. This assumption is false when developers wish to reason about all the code that might possibly exist (e.g., subclasses) but for which source is not available or for which source has not yet been written. The closed world assumption leads to a description of code’s behavior in its current configuration as the developer sees it, not every possible configuration that might exist. Without the closed world assumption, subtypes might exist that introduce arbitrary paths. Thus, a conservative analysis would have to assume that there exist call edges from every method that could be overridden to every other public method. Not making the closed world assumption would make searching static traces nearly meaningless as most methods would be highly interconnected. Note, however, that developers might have more specific expectations about paths that exist in code outside the closed world. Specifications could be used to state these assumptions in a form usable by an analysis. Thus, the closed world assumption and open world assumption are really only extremes on a continuum, with specifications allowing arbitrary points between to be explored. But using specifications to describe paths outside the closed world is outside the scope of this thesis proposal.

FFPA analyzes fragments of a program separated by cut-points. FFPA creates cut-points for statements in a control flow graph that either have no incoming edges (dead code or called only by framework code for which there is no source) or are method invocation statements to methods that have no source (e.g., framework methods). When applications are structured so that user input received by a framework invokes event handler cut-point methods in user code, this separates the call graph into portions that are reachable from different cut-points. Analyzing static traces between cut-points, rather than including paths through the framework, greatly reduces the length of static traces when cut-points dice the program into small pieces. But this decision also limits developers - they may only directly ask reachability questions about paths for which they have source. Of course, developers may still ask questions indirectly by asking about the reachability of cut-point methods they believe have paths to other methods. For example, developers might know that the framework call invalidateScreen eventually leads to the screen being repainted and may search for paths to this method rather than the painting method itself. The idea that developers can express their reachability questions relative to cut-points is the cut-point assumption.
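At a method granularity, the two kinds of cut-points can be found directly from a call graph. The Python sketch below assumes the graph is a dictionary from caller to callees and that the set of methods with source is known; the type names Handler, Model, and Framework are hypothetical.

```python
def find_cut_points(calls, has_source):
    """Return (entry_points, external_calls) for a call graph.

    entry_points: methods with source that no analyzed method calls
    (dead code, or called only by framework code without source).
    external_calls: callees for which there is no source.
    """
    called = {callee for callees in calls.values() for callee in callees}
    entry_points = {method for method in has_source if method not in called}
    external_calls = {callee for callee in called if callee not in has_source}
    return entry_points, external_calls

# Hypothetical user code: an event handler that ends in a framework call.
calls = {
    "Handler.onClick": ["Model.update"],
    "Model.update": ["Framework.invalidateScreen"],
}
has_source = {"Handler.onClick", "Model.update"}
entry_points, external_calls = find_cut_points(calls, has_source)
```

Static traces would then run from an entry point such as the event handler to an external call such as invalidateScreen, without following paths through the framework.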

Without the cut-point assumption, FFPA would be much less scalable. Not only would static traces be much longer, but methods would need to be explored in many more possible contexts. Evidence is mixed for how often this assumption holds in practice. In my observations, developers asked some questions that explicitly stopped at event handling methods called by the framework. But in a few other cases, they tried to track control flow from user code to framework code and back to user code. My proposed analysis approach cannot be used to answer these questions.

The third assumption is that analyzed programs are single threaded. As is common in other static analyses, my proposed analysis falsely models the program as being single threaded. This causes the approach to be both unsound and imprecise for programs that are not in fact single threaded, as other threads may mutate state that the analysis assumes has not changed. However, by analyzing only the portion of a program bounded by cut-points, this assumption is somewhat more likely to hold. Moreover, for flags passed as parameters, other threads cannot (for values) or are unlikely to (for objects) mutate these flags. Removing this assumption to make the analysis work for multi-threaded code is outside the scope of this thesis proposal.

8.2 The static trace

A static trace is a representation of a set of feasible concrete traces implicitly explored by partially path-sensitive dataflow analyses that is explicitly constructed by FFPA. Conceptually, a static trace is an abstraction of the union of all feasible concrete traces from an origin statement to a destination statement. More formally, a static trace is a graph where nodes are executed statements and directed edges denote state transitions between executed statements. Static traces can be characterized and distinguished from other possible representations by their behavior at branches and merges in a CFG. A static trace abstracts over individual concrete traces by concisely describing a superset of the concrete traces’ behavior. In this sense, a static trace is similar to how any static analysis abstractly models execution - a single trace models many concrete traces by not representing the details that distinguish these traces. What distinguishes static traces from the representation used by other analyses are abstraction choices about the information from individual concrete traces which is maintained.

First, for statements that follow a CFG merge, a static trace contains only a single executed statement s abstracting feasible concrete traces that may have flowed through different predecessors with contexts ctx1, ..., ctxn. These contexts are joined to produce a single context ctx used in subsequent statements. Information distinguishing contexts in predecessor statements is lost. When information distinguishing ctx1, ..., ctxn could later be used to pick which branch to follow in a subsequent conditional, this creates infeasible paths which FFPA will not eliminate.

Second, a static trace sometimes determines which branches through a conditional are feasible. In a fully imprecise analysis, all concrete traces flowing into a branch statement would be falsely modeled as flowing to all successor statements, creating infeasible traces. In a fully precise analysis, all concrete traces would flow only to the correct successor statement. Static traces are partially precise. The context of the predecessor statement is used and sometimes contains information to pick a single successor statement or rule out successor statements from feasibility.

Loops and recursion cause several complications. In a loop, a conditional cond determines whether the path into the loop or the path exiting the loop is taken. Loops create cycles in the CFG, so statements inside a loop may execute many times. As in dataflow analysis, each statement in a loop is assigned a single context describing its behavior across all iterations. But while dataflow analysis repeatedly analyzes a statement and joins the results to the context created on the previous iteration, FFPA first builds a conservative approximation of the context by iterating the loop and then analyzes the statement. Similarly, a method invocation is analyzed once in a single context that conservatively approximates all contexts in which the method may be invoked.

In a recursive call, a path of one or more method calls results in a method already on the stack being reentered and executed a second time. Concrete traces might traverse this cycle 0 or more times. Static traces use two different devices to describe recursion. In some cases, the context in which the method is reentered may differ from the contexts already observed in the recursion. In this case, the static trace contains an additional copy of the method with executed statements in contexts from this new iteration. But the number of contexts in which a method can be observed is finite - there are a finite number of variables and a finite number of constant values which each may take. Eventually, the method will be reentered in the same context in which it was entered on an earlier iteration. In this case, a back edge is created from the call site to an earlier portion of the static trace.

8.3 Analysis approach

The goal of fast feasible path analysis is to generate a static trace from a program p between an origin statement o and a destination d abstractly describing all concrete traces while eliminating infeasible concrete traces created by constant-controlled conditionals. Conceptually, FFPA performs a symbolic execution of p beginning at o maintaining a context ctx mapping variables to constants. At conditionals evaluating a variable v, ctx is inspected to determine if v is a constant. If so, the appropriate branch is taken. Otherwise, all branches are followed using separate copies of ctx. At the corresponding control flow merge, each context is joined by comparing the values for each variable. If identical, this value is used. Otherwise, a special value ⊤ (“top”) is used signifying that nothing is known about the variable. FFPA analyzes executed statements in different contexts (context sensitive), follows CFG paths (flow sensitive), and propagates distinct contexts to each successor at conditionals (branch sensitive). FFPA is partially path-sensitive as it sometimes eliminates infeasible paths but loses information about paths at control flow merges. However, a unique feature of FFPA is that developers themselves can choose to add arbitrary path-sensitivity. When a user invokes filter on an expression and selects all possible constants it may be, a separate static trace will be generated for each constant.
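The join at a control flow merge and the test at a conditional can be sketched as follows. This is a minimal illustration, assuming contexts are plain maps from variable names to boolean constants; the class ContextJoin, the TOP sentinel, and the method names are hypothetical and not taken from the prototype.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of joining two contexts at a control flow merge. */
final class ContextJoin {
    /** Sentinel standing in for the special value top: nothing is known. */
    static final Object TOP = new Object() {
        @Override public String toString() { return "TOP"; }
    };

    /** Keeps a variable's value only when both incoming contexts agree on it. */
    static Map<String, Object> join(Map<String, Object> a, Map<String, Object> b) {
        Map<String, Object> joined = new HashMap<>();
        for (Map.Entry<String, Object> e : a.entrySet()) {
            Object other = b.containsKey(e.getKey()) ? b.get(e.getKey()) : TOP;
            joined.put(e.getKey(), e.getValue().equals(other) ? e.getValue() : TOP);
        }
        for (String v : b.keySet()) {
            joined.putIfAbsent(v, TOP); // bound on one side only: unknown after the merge
        }
        return joined;
    }

    /** Returns the decided branch, or null when both branches must be followed. */
    static Boolean resolve(Map<String, Object> ctx, String conditionVar) {
        Object v = ctx.get(conditionVar);
        return (v instanceof Boolean) ? (Boolean) v : null;
    }
}
```

A variable that was true on both incoming paths stays true after the join and still resolves a later conditional, while a variable that differed becomes TOP and forces both branches to be followed.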

The prototype FFPA implementation tracks boolean and type constants. Type constants are created at object allocation (i.e., new Type()) and at runtime type tests (i.e., x instanceof Type). But FFPA could be trivially extended to track other constants such as string constants, null, enumerated types, or numeric constants.

FFPA is able to eliminate infeasible paths caused by constant-controlled conditionals when it is able to determine what constant value a variable holds at a conditional statement. More formally, FFPA is able to resolve a conditional cond evaluating x by determining which branch is followed when the incoming context ctx maps x to a constant c. Let rd1 ... rdn be the reaching definitions of x for which there exists a feasible or infeasible path from o to cond containing rdk where rdk was the last assignment statement to x. FFPA determines x is a constant c at cond iff all reaching definitions rdk of x are of one of three cases:

• FFPA resolved a second conditional condb, which caused rdk to become infeasible

• rdk assigns c to x

• rdk assigns xc to x and the context before rdk maps xc to c.

These conditions inductively describe propagation paths by which x acquires a value c. In the simplest case, a single assignment statement assigns c to x, which is then read by cond. But, more generally, c may be propagated from multiple creation statements through assignment statements to reach cond. One important requirement is that the creation statements must all occur after o. This is not the case for propagation paths originating from a field not set along the path or from a library call or callback.
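The three cases can be read as a check over the reaching definitions of x. The sketch below is one hedged way to express that check, assuming each reaching definition has already been reduced to one of the three cases; ConstantResolution, ReachingDef, and their factory methods are hypothetical names introduced for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the three-case test for a constant at a conditional. */
final class ConstantResolution {
    /** One reaching definition of x, reduced to what the three cases need. */
    static final class ReachingDef {
        final boolean cutOff; // case 1: made infeasible by an already resolved conditional
        final Object value;   // cases 2 and 3: constant written to x, or null if unknown

        private ReachingDef(boolean cutOff, Object value) {
            this.cutOff = cutOff;
            this.value = value;
        }

        static ReachingDef infeasible() { return new ReachingDef(true, null); }

        static ReachingDef constant(Object c) { return new ReachingDef(false, c); }

        /** x = xc, where the context before the assignment may map xc to a constant. */
        static ReachingDef copy(String xc, Map<String, Object> contextBefore) {
            return new ReachingDef(false, contextBefore.get(xc));
        }
    }

    /** Returns c when every feasible reaching definition yields the same constant c. */
    static Optional<Object> constantAt(List<ReachingDef> defs) {
        Object c = null;
        for (ReachingDef rd : defs) {
            if (rd.cutOff) {
                continue; // case 1 contributes no value
            }
            if (rd.value == null) {
                return Optional.empty(); // value unknown along this definition
            }
            if (c != null && !c.equals(rd.value)) {
                return Optional.empty(); // two definitions disagree
            }
            c = rd.value;
        }
        return Optional.ofNullable(c);
    }
}
```

If every definition is cut off, the sketch reports no constant; in that situation the conditional itself is unreachable along the explored paths.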

Besides precision and soundness, FFPA seeks to produce static traces quickly in response to developer queries. One way to achieve the best query performance would be to simply precompute static traces from every statement in the program. However, developers may edit the source, making it necessary to recompute many of these traces with every edit, and the user would then be forced to wait after each edit for the static trace generation process to complete. Thus, an analysis should maximize the amount of information that is precomputed and minimize the information recomputed after a code change.

FFPA employs two types of caching strategies. First, FFPA constructs method summaries modularly, using only information local to the method. Thus, these summaries only must be recomputed when the method has been edited. Second, FFPA caches a subtrace whenever a method is entered. When the same method is entered again in a similar context, the cached subtrace is reused.

FFPA consists of three analysis phases. In the dataflow analysis and summary construction phases, a modular dataflow analysis is used to construct summaries of every method in the program. These summaries are stored and recomputed following method edits. In the static trace construction phase, method summaries are used to construct a static trace. This analysis must be recomputed whenever code has been edited.

8.4 FFPA: Dataflow analysis phase

The goal of the dataflow analysis and summary construction phases is to modularly construct a summary describing paths through a method m using only intraprocedural information from m. The summary describes a tree of possible paths through a method. As the summary only uses intraprocedural information, it is parameterized by the interprocedural sources through which values flow into m: formal parameters of m, return values of call sites in m, and field reads. Sources and constants are propagated path-sensitively through assignment statements to sinks: actual parameters at call sites, return statements in m, and field writes.

More specifically, the analysis computes a map from edges in the control flow graph (points before and after statements) to a set of contexts. Each context maps local variables to a constant, source, constrained source with a constant constraint, or ⊤. At each conditional cond, each context is inspected to determine the branches to which it should be propagated. When cond evaluates a boolean source s, s is forked, creating a source with a true constraint s : t and a source with a false constraint s : f and new contexts for each. At control flow merges, incoming contexts are joined using set union, maintaining path-sensitivity for the joined contexts.
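Forking can be sketched as a function from one context and a condition variable to the contexts sent along each branch. The encoding below is an assumption made for brevity: values are strings, where "true" and "false" are constants, "TOP" is ⊤, a bare name such as "p" is a source, and "p:t" or "p:f" is a constrained source. SourceForking and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of propagating a context through a conditional. */
final class SourceForking {
    static final String TOP = "TOP";

    /** Contexts propagated to the true and false successors of a conditional. */
    static final class Successors {
        final List<Map<String, String>> onTrue = new ArrayList<>();
        final List<Map<String, String>> onFalse = new ArrayList<>();
    }

    static Successors propagate(Map<String, String> ctx, String condVar) {
        Successors out = new Successors();
        String v = ctx.get(condVar);
        if (v.equals("true") || v.endsWith(":t")) {
            out.onTrue.add(ctx);               // already decided: true branch only
        } else if (v.equals("false") || v.endsWith(":f")) {
            out.onFalse.add(ctx);              // already decided: false branch only
        } else if (v.equals(TOP)) {
            out.onTrue.add(ctx);               // nothing known: follow both branches
            out.onFalse.add(ctx);
        } else {
            out.onTrue.add(constrain(ctx, v, ":t"));   // fork the unconstrained source
            out.onFalse.add(constrain(ctx, v, ":f"));
        }
        return out;
    }

    /** Copies ctx, constraining every variable that currently holds the given source. */
    private static Map<String, String> constrain(Map<String, String> ctx, String source, String tag) {
        Map<String, String> copy = new HashMap<>(ctx);
        for (Map.Entry<String, String> e : copy.entrySet()) {
            if (e.getValue().equals(source)) {
                e.setValue(source + tag);
            }
        }
        return copy;
    }
}
```

Because the fork constrains every variable holding the source, a later conditional on a copy of the same source follows only the branch consistent with the earlier fork. At a merge, the per-edge set of contexts would simply be the union of the incoming sets.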

Some statements which are not conditionals also evaluate an expression to assign a boolean variable. At statements of the form !x or x instanceof Type, FFPA attempts to use the context to determine what value will be produced. If x is a source, x is forked, creating a context in which x has a false constraint and a context in which x has a true constraint.

Consider programs of the following form where x1 ... xn are uncorrelated:

if (x1) ...
...
if (xn) ...

The number of paths through uncorrelated conditional statements is exponential in the number of conditionals, creating an exponential number of contexts after each statement. To prevent this problem from occurring, FFPA joins contexts when the path-sensitivity provided by keeping them distinct no longer provides any benefit in precision. This occurs for variables that are dead and will never be read again in m. Contexts that differ only in dead variables always follow the same paths through m. FFPA employs a path-insensitive, flow-sensitive intraprocedural live variable analysis to compute a set of dead variables before every statement. Entries for dead variables are removed from all contexts, and identical contexts are joined. In the best case - a list of uncorrelated conditionals where the variables written by a conditional are dead before the next conditional - FFPA maintains at most two contexts at any statement. In practice, this usually results in few contexts except in loops where the live variable analysis is unable to prove many variables dead.
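The effect of removing dead variables can be sketched as a projection followed by deduplication. This is an illustration only, assuming contexts are string-valued maps and that the set of dead variables has already been computed by a separate live variable analysis; DeadVariablePruning and prune are hypothetical names.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of collapsing contexts that differ only in dead variables. */
final class DeadVariablePruning {
    /** Drops dead variables from every context; contexts that become equal collapse into one. */
    static Set<Map<String, String>> prune(Collection<Map<String, String>> contexts, Set<String> dead) {
        Set<Map<String, String>> out = new LinkedHashSet<>();
        for (Map<String, String> ctx : contexts) {
            Map<String, String> live = new HashMap<>(ctx);
            live.keySet().removeAll(dead);
            out.add(live); // map equality is by content, so duplicates are merged
        }
        return out;
    }
}
```

For two uncorrelated conditionals that each write a temporary variable, the four contexts after the second conditional collapse to one once both temporaries are dead.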

While dead variables will not influence which path a context will subsequently follow, dead variables mapped to constrained sources describe the path the context already followed. This information is needed in the next phase (see below) to determine the path to which the context should be matched. To maintain this information while preventing an exponential blowup in the number of contexts, each context includes a path constraint γ containing constrained sources in disjunctive normal form. Whenever a fork occurs, the constrained source is conjoined with each term in the path constraint. When identical contexts are joined at control flow merges, the new context's path constraint is the disjunction of tuple lattices, simplified where possible.
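One possible representation of a path constraint in disjunctive normal form is a set of terms, each term being a set of constrained sources. The sketch below is an interpretation rather than the prototype's data structure: literals are strings such as "a:t", the only simplification applied is merging two terms that differ in a single complementary literal, and PathConstraint with its methods is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a path constraint kept in disjunctive normal form. */
final class PathConstraint {
    /** Fork: conjoin a constrained source with every term. */
    static Set<Set<String>> conjoin(Set<Set<String>> dnf, String literal) {
        Set<Set<String>> out = new HashSet<>();
        for (Set<String> term : dnf) {
            Set<String> t = new HashSet<>(term);
            t.add(literal);
            out.add(t);
        }
        return out;
    }

    /** Merge: take the disjunction of both constraints, then simplify. */
    static Set<Set<String>> disjoin(Set<Set<String>> a, Set<Set<String>> b) {
        Set<Set<String>> out = new HashSet<>(a);
        out.addAll(b);
        return simplify(out);
    }

    /** Repeatedly replaces the pair T+{s:t}, T+{s:f} by T. */
    static Set<Set<String>> simplify(Set<Set<String>> dnf) {
        Set<Set<String>> cur = new HashSet<>(dnf);
        boolean changed = true;
        while (changed) {
            changed = false;
            outer:
            for (Set<String> term : cur) {
                for (String lit : term) {
                    if (!lit.endsWith(":t")) {
                        continue;
                    }
                    String negated = lit.substring(0, lit.length() - 2) + ":f";
                    Set<String> rest = new HashSet<>(term);
                    rest.remove(lit);
                    Set<String> other = new HashSet<>(rest);
                    other.add(negated);
                    if (cur.contains(other)) {
                        cur.remove(term);
                        cur.remove(other);
                        cur.add(rest);
                        changed = true;
                        break outer; // restart: cur was modified during iteration
                    }
                }
            }
        }
        return cur;
    }
}
```

Starting from the constraint with a single empty term (true), forking on a and then joining the two resulting contexts yields the empty term again, so the joined context records no residual constraint on a.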

As in standard dataflow analysis, loops are analyzed iteratively until a fixed-point is reached and the results do not change. Forking loop guards poses a problem:

x = false;
while (foo())
    x = bar();

Forking foo() only in the loop’s first iteration would create two contexts: one with foo() : t and a second with foo() : f. However, the context with foo() : t would never escape the loop. Alternatively, reforking foo() at every loop iteration solves this problem, but results in contexts that traveled through the loop carrying path constraints that include foo() : t. This prevents them from matching the correct paths in the next phase. To solve this problem, loop guards are never forked.

8.5 FFPA: Summary construction phase

Conceptually, a summary is a tree of feasible paths through a method. Each path is a list of statements with local variables bound to either a constant, ⊤, a source, or a constrained source. Branches in the summary consist of either source test nodes or nondeterministic test nodes. Source test nodes evaluate a source to determine which path to follow. Nondeterministic test nodes occur when there is not enough information to determine which path will be taken.

Representing summaries literally as a tree would be inefficient as paths may contain identical portions. Thus, summaries are represented as a directed graph of summary nodes. Each summary node contains a list of executed statements. Each interior (non-leaf) node is either a test node or a shared node with two parents.

The tree starts with a single empty leaf node. Summary construction iterates through statements in a method m and each context computed for the statement in phase 1 and uses this information to update the summary.

In step 1, each context is matched to compatible summary leaf nodes. A context is compatible with a node if no fork in the context conflicts with a fork in the node. A pair of forks conflicts when they are mutually exclusive (e.g., x : f and x : t). Note that either may contain additional forks and still be compatible. Each node may match zero or more contexts and is active when it matches at least one.

In step 2, we scan active leaf nodes for opportunities to create shared summary nodes. A shared summary node is created when a node and its sibling are active, match the same contexts, and the source variable which was forked to create the nodes is dead. When a shared node is created, step 1 is performed again to rematch.

In step 3, children for summary nodes may be created. If a context matches a node and contains a fork the node does not, the fork is observed. Observing a fork creates two new true and false summary child nodes. After observing a fork, we return to step 1 to rematch.

In step 4, we add contexts to the summary nodes they have matched. An executed statement containing a single context joining all of the contexts that matched is appended to all leaf nodes that matched at least one context.
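The compatibility test used for matching in step 1 can be sketched as follows, assuming the forks of a context and of a summary node are each represented as a map from a source name to its true or false constraint. ForkCompatibility and compatible are hypothetical names.

```java
import java.util.Map;

/** Hypothetical sketch of the step 1 compatibility test between a context and a summary node. */
final class ForkCompatibility {
    /** Compatible unless some source is constrained to true on one side and false on the other. */
    static boolean compatible(Map<String, Boolean> contextForks, Map<String, Boolean> nodeForks) {
        for (Map.Entry<String, Boolean> e : contextForks.entrySet()) {
            Boolean other = nodeForks.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) {
                return false; // mutually exclusive forks, e.g. x : f and x : t
            }
        }
        return true; // forks present on only one side do not conflict
    }
}
```

A context with forks on x and y matches a node that has observed only the same fork on x; the unobserved fork on y is what step 3 would then observe to create child nodes.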

8.6 FFPA: Static trace construction phase

The static trace construction phase interprets paths through summaries and executes statements to construct a static trace. As in an interpreter, a stack of method frames with local stores is maintained. Executing a statement with effects performs the effects. Executing a field write updates an interprocedural store mapping field names to values. Executing a method invocation creates a new stack frame, binds source arguments to values using the local store, and jumps to the new stack frame. Returning from a method updates the local store to map the method source to its return value.

To find statements to execute, paths through a method summary are traversed. When a test node is encountered, the local store is examined. If the test can be resolved, the appropriate path is followed. Otherwise, both paths are followed. When multiple paths are followed and a statement occurs in more than one path, the statement is executed once using a single context joining contexts along all paths.

In order to halt when analyzing recursive calls, stack frames record the field store and actual parameters with which they were created. Whenever a method is invoked in the same context as a frame for that method already on the stack, the call is not executed. Instead, a reference to the frame is appended and the return value is mapped to ⊥ (“bottom”) in the local store. ⊥ is a special value denoting no information. While joining ⊤ with any value is ⊤, joining ⊥ with a value is the value.
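The re-entry check and the behavior of ⊥ under join can be sketched as below. The frame layout (method name, field store, actual parameters), the sentinel objects, and the names RecursionGuard, Frame, and findReentry are assumptions made for illustration.

```java
import java.util.Deque;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of halting on recursive re-entry and of joining with bottom. */
final class RecursionGuard {
    static final Object TOP = named("TOP");
    static final Object BOTTOM = named("BOTTOM");

    private static Object named(final String name) {
        return new Object() {
            @Override public String toString() { return name; }
        };
    }

    /** BOTTOM is the identity of join; TOP absorbs everything else. */
    static Object join(Object a, Object b) {
        if (a == BOTTOM) return b;
        if (b == BOTTOM) return a;
        if (a == TOP || b == TOP) return TOP;
        return a.equals(b) ? a : TOP;
    }

    /** What a frame records when it is created. */
    static final class Frame {
        final String method;
        final Map<String, Object> fieldStore;
        final List<Object> actuals;

        Frame(String method, Map<String, Object> fieldStore, List<Object> actuals) {
            this.method = method;
            this.fieldStore = fieldStore;
            this.actuals = actuals;
        }

        boolean sameContext(Frame other) {
            return method.equals(other.method)
                && fieldStore.equals(other.fieldStore)
                && actuals.equals(other.actuals);
        }
    }

    /** Returns the frame already on the stack in the same context, or null if the call should execute. */
    static Frame findReentry(Deque<Frame> stack, Frame call) {
        for (Frame f : stack) {
            if (f.sameContext(call)) {
                return f;
            }
        }
        return null;
    }
}
```

When findReentry returns a frame, the caller would append a back reference to it and bind the return value to BOTTOM, so that a later join with the value returned along a non-recursive path keeps that value.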

8.7 Proposed work

I have constructed an initial prototype of FFPA implementing the features described above. This prototype demonstrates that the approach can be fast. I ran FFPA on an example from one of my lab studies. On an Intel Core 2 Duo 2.0 GHz computer with 2 GB of memory, phase 3 of the analysis completed in 13 seconds. Producing 36,287 method contexts, it reduced the total number of paths to a method from ~1.28 × 10^17 to 25, of which 7 are actually feasible.

I propose to evaluate the initial prototype on a number of additional examples. The results produced can be compared against a manual investigation to attempt to determine idioms causing imprecision. And profiling can be used to understand its performance. This evaluation can then be used to determine characteristics of code idioms which cause poor performance, poor precision, or create unsoundness, and I propose to extend the analysis to address some of the limitations this reveals.

Several limitations of the prototype analysis are already known, and I propose to address them:

1. Instance-sensitive points-to-analysis

The prototype FFPA falsely assumes that objects are never aliased (have more than one reference) and that all classes have only a single instance. This assumption can be removed by using an instance-sensitive points-to-analysis. An instance-sensitive points-to-analysis computes, for every reference variable in the program in every context in which it might be accessed, a set of objects to which it might point. Objects are described as a class name, allocation site (e.g., new C()), and context in which the allocation site executed. Many algorithms for instance-sensitive points-to-analysis exist and several have open source Java implementations. The best algorithms can analyze codebases with over a million lines of code in minutes to tens of minutes. For example, DOOP can perform an instance-sensitive points-to-analysis of Eclipse (5000 methods) in 25 minutes of analysis time [AZ]. Unfortunately, the analysis needs to be re-executed whenever a developer edits a method. Fortunately, the analysis is of a form shown to be incremental, where analysis results from a previous version of a program can be updated for a newer version. DOOP’s developers hope to implement this incremental analysis to make it suitable for use in an IDE tool [AZ]. Thus, an incremental analysis will hopefully require only seconds of analysis following a code edit. Otherwise, a less precise analysis could be used.

FFPA uses the points-to-analysis results in the trace construction phase. Instead of reading or writing values from reference variables, the points-to-analysis results are used to read or write data from zero or more objects. Unfortunately, whenever a field is written through a reference variable that may point to multiple objects, the value being written into each field must now be joined with the current value. This is because the write may or may not occur to this object and a conservative analysis must account for both cases. This is known as a weak update. However, being able to perform strong updates is a well known problem in the points-to-analysis literature, and I propose to investigate approaches for being able to more frequently perform strong updates.
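The difference between strong and weak updates can be sketched over a toy heap. This is a simplification under stated assumptions: abstract objects are named by strings, a singleton points-to set is treated as sufficient for a strong update (a real analysis would also need the abstract object to stand for a single concrete object), and a field with no recorded value is treated as ⊤. FieldUpdates and write are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of strong and weak field updates over points-to results. */
final class FieldUpdates {
    static final Object TOP = new Object() {
        @Override public String toString() { return "TOP"; }
    };

    static Object join(Object a, Object b) {
        return a.equals(b) ? a : TOP;
    }

    /**
     * Writes value to field on every object the reference may point to.
     *
     * @param heap     abstract object -> field name -> value
     * @param pointsTo abstract objects the reference variable may point to
     */
    static void write(Map<String, Map<String, Object>> heap, Set<String> pointsTo,
                      String field, Object value) {
        boolean strong = pointsTo.size() == 1; // assumption: singleton set allows overwrite
        for (String obj : pointsTo) {
            Map<String, Object> fields = heap.computeIfAbsent(obj, k -> new HashMap<>());
            Object current = fields.containsKey(field) ? fields.get(field) : TOP;
            // Weak update: the write may or may not hit this object, so keep both possibilities.
            fields.put(field, strong ? value : join(current, value));
        }
    }
}
```

Writing false through a reference that may point to two objects whose field is true leaves both fields at TOP, which is exactly the loss of precision that more frequent strong updates would avoid.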

2. Encoding conditionals in the static trace

The prototype FFPA implementation does not currently distinguish statements that may execute from statements that must execute. Adding this feature will require including conditionals that can never be resolved in the static trace and tracking whether a path to a return statement has been passed.

3. filter

The simplest implementation of filter would allow developers to filter any expression e that REACHER could resolve given alternative interprocedural information but was unable to resolve. This occurs when the intraprocedural portion of propagation paths for variables flowing into e meets the conditions in section 8.3 but FFPA never found a constant creation statement, when the propagation path outside the method does not meet the conditions in section 8.3, or when there are multiple paths assigning different values to e for which REACHER does not have sufficient interprocedural information to resolve. To implement filter in these cases, it is sufficient to create an annotation mapping the expression to a value and reexecute phase 3 of FFPA using this information.

A more ambitious implementation of filter would support filtering over any boolean expression e. This would sometimes require recomputing the method summary of the method containing e. I propose to investigate how often this form of filter would be beneficial and then consider implementing it.

As described, filter influences executed statements downstream from e. However, developers might also wish to use filter to select static traces reaching e for which the filter holds. I propose to first investigate whether this is a useful question to support and then consider extending the analysis to support it.

4. valueFlow

The prototype FFPA analysis does not currently produce static traces containing value flow information. To do so, the context used in all 3 phases of the analysis must record, whenever a variable is assigned, the statement assigning it. Note, this can be determined for all variables whether or not they ever contain constants. For each variable accessed in an executed statement in the static trace, this information can then be used to build data flow edges between executed statements. Data flow edges correspond to the edges valueFlow follows.

5. Default field values

FFPA currently assumes that fields not assigned so far in a static trace are ⊤. While this assumption is sound, it is also imprecise. However, fast flow-insensitive, path-insensitive interprocedural constant propagation analyses exist. Such an analysis could be used to populate fields with more precise defaults. If the analysis is bytecode based, it could be run even on framework code and be used to imprecisely track value flow through framework calls.

6. Parameterized static traces

When FFPA encounters a method that has already been analyzed, it is currently only able to reuse the static trace previously produced when the method is invoked in exactly the same context as before. But most differences in context will not influence which paths are taken through conditionals. Thus, static traces could be more efficiently reused by requiring only that variables which later influence the path taken have the same value. Of course, additional variables may be used to write fields or accessed in statements. A parameterized trace would thus contain a list of preconditions describing variables that must have specific values, and a list of parameters that are read by the static trace but may take any value.

7. The soundness of static traces

FFPA is sound if it always produces static traces that approximate all feasible concrete traces. That is, the contexts for all executed statements in a static trace should conservatively approximate all concrete traces which have followed the same path. To date, I have been unable to produce a counter-example demonstrating that the approach is unsound. However, both the treatment of loops and the construction of summaries differ from standard dataflow analyses and thus cannot directly reuse existing dataflow analysis soundness proofs. I propose to develop at least an informal prose argument for why FFPA is sound that considers these features and investigate the feasibility of a more formal proof.

8.8 Evaluation

To evaluate the performance and precision of FFPA, I will gather a corpus of Java programs, manually identify several methods responding to user input, and run FFPA on each. Performance can be measured simply by recording clock time. To measure precision, results produced by FFPA can be compared against a naive analysis in which no infeasible paths are eliminated or other FFPA features are selectively disabled.

9. RESEARCH PLAN

The completed work includes all of the exploratory studies and a prototype implementation of the analysis. The proposed work includes designing and implementing the visualizations and interactions, designing and implementing analysis extensions, and evaluating both the tool and the analysis. In order to best prioritize the proposed work, the prototype FFPA will first be run on additional examples. This will provide data on which analysis limitations have the greatest impact on performance and precision. Concurrently, a paper prototype study of the visualization and interaction features will be conducted to ensure that they can indeed help developers more effectively answer reachability questions and to reveal limitations. Next, an improved analysis will be designed and implemented to address the currently known limitations and any additional limitations that are discovered. Concurrently, features of the visualization and interaction will be iterated based on the results of the paper prototype study. Finally, a more thorough evaluation of FFPA using more examples will be conducted, and the visualizations and interactions will be evaluated in a lab study.

This research involves several risks: (1) the interactions and visualizations may not help developers answer reachability questions more effectively; (2) the analysis may be too imprecise to be useful; (3) the analysis may be too slow to be usable; and (4) the cut-point or other assumptions may too greatly restrict the reachability questions developers can ask. My approach for addressing these risks is to fail fast. The evaluation of the FFPA prototype and the paper prototype studies should help reveal whether these risks are actually a problem. If so, the interactions and visualizations and FFPA will be redesigned specifically to address these problems.

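[Figure: timeline of the research plan, with time markers November 2009, January 2010, April 2010, July, October, January 2011, and April 2011. Activities shown: prototype FFPA evaluation, paper prototype, improved analysis, iterate and implement visualization and interactions, FFPA evaluation, lab study, and write dissertation.]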

10. REFERENCES

[A] T. D. LaToza, G. Venolia, and R. DeLine. (2006). Maintaining Mental Models: A Study of Developer Work Habits. In Proc. Int’l Conf. Software Eng (ICSE), 492-501.

[B] J. Sillito, G. C. Murphy, and K. De Volder (2008). Asking and answering questions during a programming change task. In Transactions on Software Engineering (TSE), 34(4).

[C] Ko, A.J. and Myers, B.A. (2008) Debugging Reinvented: Asking and Answering Why and Why Not Questions about Program Behavior. In Proc. Int’l Conf. Software Eng (ICSE), 301-310.

[D] Ko, A. J., Aung, H., and Myers, B. A. (2005). Eliciting Design Requirements for Maintenance-Oriented IDEs: A Detailed Study of Corrective and Perfective Maintenance Tasks. In Proc. Int’l Conf. Software Eng (ICSE), 126-135.

[E] N. Pennington. (1987). Stimulus Structures and Mental Representations in Expert Comprehension of Computer Programs. In Cognitive Psychology, Vol. 19, 295-341.

[F] Ko, A., Myers, B., Coblenz, M., and Aung, H. (2006). An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. In IEEE Trans. Soft. Eng. (TSE), 32 (12), 971 - 987.

[G] J. Lawrance, R. Bellamy, M. Burnett, and K. Rector. (2008). Using Information Scent to Model the Dynamic Foraging Behavior of Programmers in Maintenance Tasks. In Proc. Conference on Human Factors in Computing Systems (CHI), 1323-1332.

[H] F. Détienne. Software Design---Cognitive Aspects. Springer-Verlag New York, Inc, 2002.

[I] P. Anderson and T. Teitelbaum. (2001). Software Inspection using CodeSurfer. Workshop on Inspection in Software Engineering at CAV.

[J] A. J. Ko, R. DeLine, and G. Venolia. (2007). Information Needs in Collocated Software Development Teams. In Proc. Int’l Conf. Software Eng (ICSE).

[K] W. J. Dzidek, E. Arisholm, and L.C. Briand. (2008). A Realistic Empirical Evaluation of the Costs and Benefits of UML in Software Maintenance. In Transactions on Software Engineering (TSE), 34 (3).

[L] M. Cherubini, G. Venolia, and R. DeLine. (2007). Building an Ecologically valid, Large-scale Diagram to Help Developers Stay Oriented in Their Code. In Symposium on Visual Languages and Human-Centric Computing (VLHCC).

[M] R. J. Buhr and R. S. Casselman. (1996). Use Case Maps for Object-Oriented Systems. Prentice Hall.

[N] J. Rumbaugh, I. Jacobson, and G. Booch. (1998). The Unified Modeling Language Reference Manual. Addison-Wesley.

[O] M. Weiser. (1984). Program Slicing. In Transactions on Software Engineering (TSE), 10 (4).

[P] D. Janzen and K. De Volder. (2003). Navigating and querying code without getting lost. In Proc. Aspect-Oriented Software Development (AOSD).

[Q] H. Agrawal and J.R. Horgan. (1990). Dynamic program slicing. In Proc. Programming Language Design and Implementation (PLDI).

[R] D. F. Jerding, J. T. Stasko, and T. Ball. (1997). Visualizing interactions in program executions. In Proc. Int’l Conf. Software Eng (ICSE).

[T] M.A. Storey and H.A. Müller. (1995). Manipulating and Documenting Software Structures using Shrimp Views. In International Conference on Software Maintenance.

[V] T. Ball and S. K. Rajamani. (2001). Automatically validating temporal safety properties of interfaces. In SPIN '01: Proceedings of the 8th international SPIN workshop on model checking of software.

[W] Bennett, C., Myers, D., Storey, M., German, D. M., Ouellet, D., Salois, M., and Charland, P. (2008). A survey and evaluation of tool features for understanding reverse-engineered sequence diagrams. J. Softw. Maint. Evol., 20 (4), 291-315.

[X] D. Grove and C. Chambers. (2001). A framework for call graph construction algorithms. ACM Transactions on Programming Languages and Systems, 23 (6), 685-746.

[Y] M. Das, S. Lerner, and M. Seigle. (2002). ESP: path-sensitive program verification in polynomial time. Programming Language Design and Implementation, 57-68.

[Z] W. R. Bush, J. D. Pincus, D. J. Sielaff. (2000). A static analyzer for finding dynamic programming errors. Software – Practice and Experience, 30 (7).

[AA] M. Sridharan, S. J. Fink, and R. Bodik. (2007). Thin slicing. In Programming Language Design and Implementation.

[AB] C. A. R. Hoare. (1969). An axiomatic basis for computer programming. In Communications of the ACM, 12(10), 576–583.

[AC] C. Flanagan, K.R.M. Leino, M. Lillibridge, G. Nelson, J.B. Saxe, and R. Stata. (2002). Extended static checking for Java. In Programming Language Design and Implementation.

[AD] K.Y. Phang, J.S. Foster, M. Hicks, and V. Sazawal. (2008). Path projection for user-centered static analysis tools. In PASTE.

[AE] T. Ball, E. Bounimova, B. Cook, V. Levin, J. Lichtenberg, C. McGarvey, B. Ondrusek, S. K. Rajamani, and A. Ustuner. (2006). Thorough static analysis of device drivers. In EuroSys.

[AF] E. Gamma and K. Beck. (2001). JUnit: a regression testing framework. http://www.junit.org.

[AG] P. Godefroid. (2007). Compositional dynamic test generation. In Symposium on principles of programming languages.

[AH] G. Kiczales, E. Hillsdale, J. Hugunin, M. Kersten, J. Palm, and W. Griswold. (2001). An overview of AspectJ. In ECOOP.

[AI] G. Linden, B. Smith, and J. York. (2003). Amazon.com recommendations: item-to-item collaborative filtering. In IEEE Internet Computing.

[AK] V. Sinha, D. Karger, and R. Miller. (2006). Relo: helping users manage context during interactive exploratory visualization of large codebases. In Visual Languages and Human-Centric Computing.

[AL] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. (1995). Design patterns: elements of reusable object-oriented software. Addison-Wesley.

[AM] T. Fritz, G. C. Murphy, and E. Hill. (2007). Does a programmer’s activity indicate knowledge of code? In ESEC/FSE.

[AN] S. Parent. (2006). A possible future for software development. Keynote talk at the Workshop of Library-Centric Software Design, OOPSLA.

[AO] J. Edwards. (2009). Coherent reaction. To appear in Onward 2009.

[AP] T. Ball and S. K. Rajamani. (2000). Bebop: a symbolic model checker for boolean programs. In SPIN.

[AQ] D. Jackson and D. A. Ladd. (1994). Semantic diff: a tool for summarizing the effects of modifications. In ICSM.

[AR] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Pasareanu. (2008). Differential symbolic execution. In FSE.

[AS] E. Yourdon and L. L. Constantine. (1979). Structured design: fundamentals of a discipline of computer program and systems design. Prentice Hall.

[AT] E. Hill, L. Pollock, and K. Vijay-Shanker. (2007). Exploring the neighborhood with Dora to expedite software maintenance. In ASE.

[AU] D. F. Sutherland, A. Greenhouse, and W. L. Scherlis. (2002). The code of many colors: relating threads to code and shared state. In Proc. of Workshop on Program Analysis for Software Tools and Engineering (PASTE).

[AV] M. Kersten and G. C. Murphy. (2005). Mylar: a degree-of-interest model for IDEs. In Proceedings of Aspect-Oriented Software Development.

[AW] M. J. Coblenz, A. J. Ko, and B. A. Myers. (2006). JASPER: an eclipse plug-in to facilitate software maintenance tasks. In Proceedings of the 2006 OOPSLA workshop on eclipse technology eXchange.

[AX] J. C. King. (1976). Symbolic execution and program testing. In Communications of the ACM, 19 (7), 385-394.

[AY] M. Robillard. (2008). Topology analysis of software dependencies. In ACM Transactions on Software Engineering and Methodology, 17(4).

[AZ] M. Bravenboer and Y. Smaragdakis. (2009). Strictly declarative specification of sophisticated points-to-analyses. In OOPSLA.

[BA] T. LaToza, D. Garlan, J.D. Herbsleb, and B.A. Myers. (2007). Program comprehension as fact finding. In ESEC/FSE.
