    ANSWERING REACHABILITY QUESTIONS

    Thesis Proposal

    Thomas D. LaToza

    12/8/2009

    Institute for Software Research, School of Computer Science, Carnegie Mellon University

    Pittsburgh, PA 15213

    COMMITTEE

    Brad A. Myers, Human Computer Interaction Institute, Carnegie Mellon (Co-chair)

    Jonathan Aldrich, Institute for Software Research, Carnegie Mellon (Co-chair)

    Aniket Kittur, Human Computer Interaction Institute, Carnegie Mellon

    Thomas Ball, Microsoft Research

    ABSTRACT

    What are the most frequent, time-consuming, hard-to-answer, and error-prone questions professional software developers ask about programs? Reachability questions. A reachability question is a search upstream or downstream across paths from a statement for target statements. For example, a developer debugging a deadlock searched downstream for calls acquiring resources.

    My studies indicate that reachability questions are pervasive throughout coding tasks. In one study, half of the bugs developers inserted were associated with reachability questions developers asked or should have asked. Developers report asking these questions more than 9 times a day, and 82% agree at least one is hard to answer. Neither increased professional experience nor even increased familiarity with a codebase makes reachability-related questions easier or less frequent. In another study, 9 of the 10 longest investigation and debugging activities involved answering a single reachability question.

    Using existing tools, developers traverse paths across method calls in search of target statements. Reachability questions are hard to answer because developers must guess both which paths lead to targets and which paths are feasible and may execute. To help developers more effectively answer reachability questions, I propose a new kind of reverse engineering technique in which developers search across paths for target statements. Starting at a statement in a program, developers enter search strings that are matched against identifiers or comments along paths. Specific situations can be considered by posing “What if?” questions such as “What happens when this data table is uninitialized?”

    A static analysis for answering reachability questions determines the feasible paths through conditionals. Existing approaches either do not eliminate infeasible paths or are too slow to be used in an interactive tool. However, examples of reachability questions suggest that many common infeasible paths are caused by conditionals evaluating variables that may only contain constants (e.g., dynamic dispatch, flags). I propose to design a fast feasible path analysis which eliminates infeasible paths caused by these constant-controlled conditionals. A preliminary implementation is able to eliminate many common infeasible paths through a 50 KLOC Java program in just 13 seconds of analysis time.



    1. INTRODUCTION

    A central goal of software engineering is to help developers be more productive and create higher quality software by accomplishing tasks faster and introducing fewer defects. Throughout these tasks, developers must understand task-relevant code. Modern codebases range in size from hundreds of thousands to millions of lines of code. When developers interact with code written by other teams or by other companies, that code is often connected by complex interaction mechanisms and indirection using events and callbacks. While these constructs help make software more extensible and reusable, they also make it more challenging to understand. An analysis of code in Adobe’s desktop applications found that one third of the code is devoted to event handling logic, which caused half of the reported bugs [AN]. Successfully coordinating dependencies between effects in loosely connected modules can be very challenging [AO]. Developers often address this challenge by working exclusively on portions of the codebase that they “own” [A]. However, this boundary is imperfect, and developers often debug paths through others’ code, reuse functionality written by others, or are “load balanced” to work on other portions of the codebase. And when developers switch teams, they must learn a codebase anew.

    To discover the nature and context of what makes work in large, complex codebases challenging, I conducted a series of studies examining the social context, activities, process, expertise effects, questions, and strategies of developers at work in coding tasks. Surprisingly, I discovered that much of developers’ work involves exploring code to answer reachability questions. Developers start at a statement stmt in a program and search upstream across paths reaching stmt or downstream across paths originating at stmt. Developers ask reachability questions when debugging to locate the statements which cause a fault to occur. When proposing changes, developers often first investigate code to understand the implications of their change and ask questions about the relationship of their change to upstream or downstream behavior.
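The upstream and downstream searches described above can be pictured as graph traversals. The following is a minimal sketch, assuming a hypothetical toy program represented as a map from each statement to its possible successors. The statement names and the helper functions `invert` and `reachable_targets` are invented for illustration and are not part of the proposed tool. The sketch treats every edge as executable, so it does not account for path feasibility.

```python
from collections import deque

# Hypothetical toy program: each statement maps to the statements
# that may execute immediately after it (successor edges).
SUCCESSORS = {
    "handleClick": ["validate", "save"],
    "validate": ["logError"],
    "save": ["acquireLock", "writeFile"],
    "acquireLock": [],
    "writeFile": ["releaseLock"],
    "logError": [],
    "releaseLock": [],
}


def invert(graph):
    """Build predecessor edges so the same search can run upstream."""
    inverted = {node: [] for node in graph}
    for source, targets in graph.items():
        for target in targets:
            inverted[target].append(source)
    return inverted


def reachable_targets(graph, origin, is_target):
    """Breadth-first search from origin; return matching statements in visit order."""
    seen = {origin}
    queue = deque([origin])
    matches = []
    while queue:
        statement = queue.popleft()
        if statement != origin and is_target(statement):
            matches.append(statement)
        for neighbor in graph[statement]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return matches


# Downstream: which lock-related statements can follow handleClick?
downstream = reachable_targets(SUCCESSORS, "handleClick", lambda s: "lock" in s.lower())

# Upstream: which statements can lead to releaseLock?
upstream = reachable_targets(invert(SUCCESSORS), "releaseLock", lambda s: True)
```

Running the same traversal over inverted edges is one simple way to express the upstream direction; a real analysis would also need to decide which edges can actually execute.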

    Consider an example. An experienced developer participating in my lab study proposed a change but could not determine if it would work:

    What I'd like to do is identify those core, hopefully EditBus events, and say just repaint the caret on that event. And the easiest thing to do is hook up the StatusBar to that event, get that event, and get the relevant events, and if so, update the caret. … [But] I'm concerned that I won't get all of the events that cause this guy to get updated. And I'm not sure, with the existing tools in Eclipse, how to find out all the places that can cause this thing to be called.

    While the developer was aware that a provided call graph navigation tool could traverse chains of method calls, this did not directly help. Upstream from the update method was a bus onto which dozens of methods posted events, but only a few of these events triggered the update. Existing call graph tools cannot isolate the upstream methods that send the events triggering the update. Unable to answer the question in any practical way, he instead optimistically hoped his guess would work. He spent time determining how to reuse functionality to implement the change, edited the code, and tested his changes, only to learn that the change would never work and that the previous 23 minutes of work had been wasted.

    Fantastically named EditBus, and it actually doesn't have any events related to edit. [laughing] It just has events related to buffer changes, which is not an edit. OHHH, I just wonder where edits might be going.
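The gap the developer ran into can be illustrated with a small sketch. It assumes a hypothetical bus on which each posting method sends one event type and a handler that acts on only some types. All method names, event types, and both functions are invented for illustration and are not taken from the studied codebase.

```python
# Hypothetical model of an event bus: every name and event type below is
# invented for illustration and is not taken from the studied codebase.
POSTS = {
    "openBuffer": "BufferOpened",
    "closeBuffer": "BufferClosed",
    "changeFont": "PropertiesChanged",
    "resizeWindow": "ViewResized",
}

# The handler receives every event from the bus but only updates on some types.
HANDLED_TYPES = {"BufferOpened", "BufferClosed"}


def call_graph_upstream(posts):
    """What a call graph traversal sees: every method that posts to the bus."""
    return sorted(posts)


def triggering_upstream(posts, handled_types):
    """Only the posting methods whose event type the handler acts on."""
    return sorted(
        method for method, event_type in posts.items() if event_type in handled_types
    )


all_posters = call_graph_upstream(POSTS)
relevant_posters = triggering_upstream(POSTS, HANDLED_TYPES)
```

In this toy model the call graph view reports four upstream methods while only two can trigger the update. With dozens of posting methods, as in the study, that difference is what the developer would have had to resolve by hand.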

    My studies indicate that reachability questions are pervasive throughout debugging and investigation activities in large, complex codebases. Yet existing tools often make it challenging for developers to answer these questions. Modern development environments include code exploration tools such as call graph navigation tools and reference searches that make it easy for developers to traverse many types of relationships between elements in a program. However, developers using these tools to answer reachability questions must explore the search space where target statements might be located by repeatedly traversing relationships. Traversing is challenging when the size of the search space is large, developers cannot predict which relationships to follow to find targets, or paths are infeasible and can never execute. In my studies, developers often spent tens of minutes answering a single reachability question. In observations of developers in the field, 9 of the 10 longest debugging and investigation activities each involved answering a single reachability question.

    To help developers more effectively answer reachability questions, I propose a fundamentally new reverse engineering technique in which developers directly express their reachability questions and inspect matching target statements. For example, to answer the question, “What are the implications of deferring the initialization of this data table?”, a developer searches for statements downstream from an origin which differ when the table is or is not initialized. From examples of challenging reachability questions from my studies and other studies of developer questions, I designed a formalism for describing reachability questions (section 2). I propose to design interaction techniques allowing developers to directly express these reachability questions (section 6). Asking a reachability question generates a list of statements. In order to help developers make sense of these results, refine their questions, and ask follow-up questions, I propose to design interactive visualizations of feasible paths through a program (section 6).

    The key technical challenge for reverse engineering answers to reachability questions from code is determining which paths are feasible. In general, infeasible paths exist in any language with conditional statements and control flow. Existing techniques such as model checkers can determine path feasibility in many cases but require hours or days to do so. My approach relies on determining feasible paths in response to a reachability question a developer has asked. As computing answers to all questions is impractical, my interactions require an analysis approach that is able to determine feasible paths on demand in a short time.
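To make the notion of an infeasible path concrete, consider a minimal sketch of a hypothetical method with two conditionals that test the same boolean flag. It assumes the flag is not reassigned between the two tests. The branch labels and both functions are invented for illustration. This is not the proposed analysis, only a picture of the problem it addresses.

```python
from itertools import product

# Hypothetical method with two conditionals that test the same flag:
#     if verbose: A else: B
#     if verbose: C else: D
# Assumption: the flag is not reassigned between the two conditionals.
BRANCHES = [
    {"then": "A", "else": "B"},
    {"then": "C", "else": "D"},
]


def syntactic_paths(branches):
    """Every combination of branch directions, ignoring what the conditions test."""
    return [
        [branch[direction] for branch, direction in zip(branches, directions)]
        for directions in product(["then", "else"], repeat=len(branches))
    ]


def feasible_paths(branches, flag_values=(True, False)):
    """Paths that some value of the shared flag can actually drive execution down."""
    paths = []
    for flag in flag_values:
        direction = "then" if flag else "else"
        paths.append([branch[direction] for branch in branches])
    return paths


all_paths = syntactic_paths(BRANCHES)
real_paths = feasible_paths(BRANCHES)
```

Of the four syntactic paths, only the two that take the same direction at both conditionals can execute. A traversal that ignores this would report statements along the paths A, D and B, C as reachable together even though no execution visits them in that combination.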

    Examples of common infeasible path idioms suggest that many of the most common sources of infeasible paths can be eliminated by solving a simpler problem. Infeasible paths occur when the direction taken in a conditional statement evalua