
[IEEE 2008 IEEE International Symposium on Software Reliability Engineering (ISSRE) - Seattle, WA, USA (2008.11.10-2008.11.14)]



Trace Normalization

Madeline Diep, Sebastian Elbaum, Matthew Dwyer
Department of Computer Science and Engineering

University of Nebraska - Lincoln; Lincoln, Nebraska, USA
{mhardojo, elbaum, dwyer}@cse.unl.edu

Abstract

Identifying truly distinct traces is crucial for the performance of many dynamic analysis activities. For example, given a set of traces associated with a program failure, identifying a subset of unique traces can reduce the debugging effort by producing a smaller set of candidate fault locations. The process of identifying unique traces, however, is subject to the presence of irrelevant variations in the sequence of trace events, which can make a trace appear unique when it is not. In this paper we present an approach to reduce inconsequential and potentially detrimental trace variations. The approach decomposes traces into segments on which irrelevant variations caused by event ordering or repetition can be identified, and then used to normalize the traces in the pool. The approach is investigated on two well-known client dynamic analyses by replicating the conditions under which they were originally assessed, revealing that the clients can deliver more precise results with the normalized traces.

1. Introduction

The strength of many dynamic analysis techniques depends heavily on the variability of the pool of traces they operate on. A richer trace pool can result, for example, in improved sets of inferred invariants [10], more precise fault isolation [3], and smaller change impact sets [15], which in turn raises the techniques' effectiveness in validating software correctness. On the contrary, trace variability may be detrimental when it introduces noise, i.e., variation in traces that is not related to the property under analysis.

Researchers have addressed this problem by devising mechanisms to cluster traces and then to select or use a subset of traces from each cluster. The basic underlying assumption of these clustering approaches is that traces that are further apart are more likely to contain distinct information valuable to a client analysis. For example, Podgurski et al. compute Euclidean distances between coverage vectors to form clusters of traces to improve test case filtering and ordering [5, 16], Bowring et al. use branch traces projected into Markov models which are then clustered based on their Hamming distance to more efficiently predict software behavior [2], Haran et al. build a classification tree utilizing various profile vectors to predict the outcome of a software execution [12], and Liu and Han compute the proximity between the lists of predicate rankings of failing and passing executions from a statistical debugging tool to cluster traces that may be caused by the same fault [17].

The basic assumption of clustering approaches, however, may not always hold. Traces may end up in different clusters due to irrelevant variations in the sequences of events that comprise them, variations that make the traces appear different, and therefore valuable, even if the observed variation is unrelated to the program property under analysis. This can lead to the retention of a trace that provides no added value relative to a given trace pool. Furthermore, as we will show, the addition of such traces can diminish the precision of client analyses.

There are at least two sources of irrelevant variations. The first source is events whose occurrences can be re-ordered without affecting the program state. This is common for events affecting distinct portions of the state. For example, in a web browser, the events for changing a password and setting one's homepage affect different variables or fields. Event independence means that their relative order of execution should not be considered a valuable trace discriminator.

The second source of irrelevant variations is redundant events whose occurrence does not lead to a distinct program state. A trivial instance of a redundant event is one that does not affect the state at all. For example, if the events considered in the trace are method invocations, inspector methods should not be utilized as

19th International Symposium on Software Reliability Engineering

1071-9458/08 $25.00 © 2008 IEEE

DOI 10.1109/ISSRE.2008.37



trace discriminators. A more complex instance of this source of noise is events that repeatedly set the same region of the program state. For example, a method that is invoked repeatedly to set a variable's value, without other intermediate events reading that variable, can be collapsed to the last method invocation.

Our work aims to reduce the irrelevant variations in traces through a transformation that attempts to preserve the distinct structure of traces while eliminating differences due to commuted or collapsed sub-traces. We note that while Mazurkiewicz's theory of traces develops a formal treatment of notions of equivalence among program executions that exploits the independence, and thus the commutativity, of program operations [19], our approach is based on heuristics that may sacrifice precision to gain performance, and its cost-effectiveness tradeoffs must be assessed for each client analysis.

Furthermore, our work is orthogonal to existing efforts on trace discrimination [2, 5, 9, 11, 12, 16] in that we are not trying to cluster a group of traces, but rather to identify portions of traces where reordering commutative events into a canonical order and collapsing sequences of repeated events to a minimal length would reduce irrelevant variations and enable a more precise analysis. Our approach could be applied prior to a clustering approach to potentially reduce its cost and enhance its power.

In preliminary work, we identified the opportunity to reduce irrelevant variations in traces and described some of the necessary mechanisms to support it [6]. This work builds on that, providing the following additional contributions: (i) a more precise and extensive definition and analysis of the approach; (ii) an implemented infrastructure for exploring various instantiations of the approach; and (iii) an assessment of the approach's potential to improve the precision of two dynamic client analyses.

2. Approach Overview

We start by illustrating how a fault isolation client analysis works, and how it would benefit from our trace normalization approach.

Many fault isolation analyses compare traces of passing and failing program runs, and use their differences to pinpoint a likely fault's location. The fault isolation analysis we study here is one instance of that type of analysis; it was introduced by Dallmeier et al. [4] to perform lightweight bug localization by assigning a score to each class based on its probability of containing the fault. In our example, a class score is calculated by counting the number of method call sequences, initiated in that class, that differentiate passing runs from a failing run.

public class XMLElement { ...
  private Vector children;
  private String name;
  private String content;
}
public class Builder { ...
  private XMLElement root;
}
public class Parser { ...
  protected void processEle() { // Processes a regular element.
    ...
    // if statement is faulty - the operator || should be &&
    if ((ch == '<') || (!fromEntity[0]))
      ...
      this.builder.endElement(...);
    else (...)
      ...
      this.builder.addPCData(true);
    ...
  }
}
public class Valid {
  protected Hashtable attValues;
  protected Hashtable entities;
  ...
  public void parseDTD() { ... // Parses the DTD.
    for (...) { ...
      if (...)
        this.processEnt(true);
      else if (...)
        this.processAtt(true);
      ...
    }
  }
}

Figure 1. A snippet of NanoXML Program

Consider the two classes StdXMLParser and NonValidator (later referred to as Parser and Valid for brevity) shown in Figure 1 of NanoXML, a software artifact we study later in the paper. Note that the Parser class has a seeded fault in a predicate within the processEle method. Now imagine that two executions of NanoXML lead to the generation of the two traces in Figure 2a, comprised of method invocations and returns, r, where tf corresponds to a failing run and tp to a passing run. Assuming we only consider call sequences using a sliding window of size 2, tp contains the call sequences Valid.parseDTD-Valid.processEnt, Valid.processEnt-Valid.processAtt, and Parser.processEle-Builder.addPCDATA. Meanwhile, tf contains the call sequences Valid.parseDTD-Valid.processAtt, Valid.processAtt-Valid.processEnt, and Parser.processEle-Builder.endElement.

Given this setting, Dallmeier et al.'s fault isolation client analysis will assign a score of four to the Valid class because the sequences Valid.parseDTD-Valid.processEnt and Valid.processEnt-Valid.processAtt appear in tp but not in tf, and Valid.parseDTD-Valid.processAtt and Valid.processAtt-Valid.processEnt appear in tf but not in tp. The same process leads to a score of two for the Parser class, which means that the Valid class is more likely to contain the fault than the Parser class, even though the fault is in Parser.

Figure 2. Approach steps applied to the example. (Figure not reproduced; its panels show: a. Traces, b. Program States, c. Segments, d. Commutative Transformation, and e. Normalized Traces.)

Although not immediately obvious, the above traces contain variations that are irrelevant to the fault isolation analysis and that cause this client to return an imprecise score value, skew the class rankings, and potentially provide a larger set of fault locations for a developer to explore. We will now illustrate how our approach can identify those irrelevant variations and normalize the traces to reduce them, improving the fault isolation client's effectiveness.
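As a rough illustration, the window-based scoring described above can be sketched as follows. The class and method names come from the example, but the helper names (sequences, score) and the choice to key sequences by the initiating class are our own simplification of Dallmeier et al.'s technique, not their implementation.

```java
import java.util.*;

class CallSeqScore {
    // Slide a window of the given size over a run's method-call events and
    // collect the call sequences, keyed by the class initiating each sequence.
    static Map<String, Set<String>> sequences(List<String> calls, int window) {
        Map<String, Set<String>> byClass = new HashMap<>();
        for (int i = 0; i + window <= calls.size(); i++) {
            List<String> win = calls.subList(i, i + window);
            String initiator = win.get(0).split("\\.")[0]; // class of first call
            byClass.computeIfAbsent(initiator, k -> new TreeSet<>())
                   .add(String.join("-", win));
        }
        return byClass;
    }

    // A class's score counts the sequences that appear in the passing run but
    // not the failing run, or vice versa (the symmetric difference).
    static int score(Set<String> passing, Set<String> failing) {
        Set<String> union = new HashSet<>(passing);
        union.addAll(failing);
        Set<String> common = new HashSet<>(passing);
        common.retainAll(failing);
        union.removeAll(common);
        return union.size();
    }
}
```

On the example, Valid's four sequences all differ across tp and tf, yielding a score of four, while Parser's score is two.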

Our approach requires the capture of one additional piece of information: program state. We discuss the different forms of program state that may be captured shortly, but for this example, assume that the relevant program state we capture for this program and analysis consists of the root variable of the Builder class representing the structure of the parsed XML (this key data structure is implemented by the XMLElement class in Figure 1), and two hashtables containing the XML DTD attributes (attValues) and entities (entities) information. This capture is performed at each method return, as illustrated in Figure 2b. In our example, at the last event occurrence in the trace tp, the captured program state shows one entry in the entities hashtable, one entry in the attValues hashtable, and a root XML element named db which has one child element. The child element has the value name as its name and the value foo as its content, with no other children.

Given traces and the captured program states, our approach consists of three steps. First, we identify segment sets within the traces, where a segment set contains event sequences within traces that are bounded by the same program states, Si and Se. The event sequences bounded by Si and Se are called segments. We observe in Figure 2c that at least one segment set can be identified from the two traces, which contains the segments:

〈Valid.processEnt, r, Valid.processAtt, r〉
〈Valid.processAtt, r, Valid.processEnt, r〉

where the Si and Se are marked with bolded squares.

Second, we identify irrelevant variations in each

69

Page 4: [IEEE 2008 IEEE International Symposium on Software Reliability Engineering (ISSRE) - Seattle, WA, USA (2008.11.10-2008.11.14)] 2008 19th International Symposium on Software Reliability

segment set. The identification is based on the observation that segments in a segment set start in the same Si and, in spite of including different sequences of events, end up in the same Se. Hence, parts of those event sequences may constitute irrelevant variations. We target variations due to commutative events or event sequences, and collapsible event occurrences. Informally, we allow two events (or event sequences) to commute within a segment set when those events have appeared in different orders in at least two segments from that set. A sequence of events in a segment is collapsible if it can be reduced in length to match another segment. For our example, we note in Figure 2d that the two segments contain an irrelevant variation caused by commuting the Valid.processEnt-r and Valid.processAtt-r events.

The last step is to normalize the segments in each segment set. Normalization involves transforming each segment to order commutative events in a canonical order and to collapse sequences of repeated events to a minimal length. The normalization process operates in a greedy manner: each trace and its corresponding program states are parsed to identify an Si for a segment. When such an Si is found, we look for the first Se such that there exists at least one irrelevant variant that can be normalized in the segment bounded by states Si and Se. Then we apply the normalization operations as defined previously. The process continues from Si to find another segment until it reaches the end of the trace.

Continuing with our example, we may decide thatall occurrences of

〈Valid.processEnt, r, Valid.processAtt, r〉

should be normalized to

〈Valid.processAtt, r, Valid.processEnt, r〉

As shown in Figure 2e, one instance of this normalization can be applied in tp.

When considering the normalized traces, since the Valid class contains the call sequences Valid.parseDTD-Valid.processAtt and Valid.processAtt-Valid.processEnt in both traces, the fault isolation technique assigns it a score of zero. Meanwhile, the two differentiating call sequences for Parser remain unchanged, which means that it would still have a score of two, making Parser more likely than Valid to contain the fault. Given the fault location, we see how the normalized traces enhance the precision of the fault isolation client analysis.

2.1. Applicability and Tradeoffs

A dynamic client analysis (DA), such as the fault isolation analysis discussed above, requires the program under analysis P to be instrumented to log events

Σ. As the program is executed, a set of traces T = {t1, t2, ..., tm} is collected, where trace ti is a vector of events, e ∈ Σ, captured during one execution of P. In general, a DA operates on T to calculate properties of P (e.g., invariants in P) or of P's components (e.g., fault likelihood of P's classes).

Conceptually, our normalization approach aims to transform the traces in T to obtain a new trace set T′ such that DA(T′) is more precise than DA(T), and DA(T′) has recall that is at least that of DA(T).

The specific formulation of precision and recall, however, will depend on the DA. For the fault isolation technique, it is natural to compare techniques based on the extent to which the class ranking produced increases the ease with which a fault is found. To accommodate this type of rank comparison, we set a precision threshold by considering only the top X classes in the ranking as the information produced by the analysis. A technique has 100% recall if the faulty class is within the top X classes. For other client analyses, the formulation of precision and recall will be different. For example, the dynamic change impact analysis presented in Section 5 identifies sets of methods affected by a program change. Consequently, its precision is improved by eliminating information from T that allows it to report fewer methods that are not impacted by the change. It retains the recall of the original analysis as long as it reports at least as many actual changed methods. We provide a more detailed formulation of precision and recall for fault isolation and dynamic impact analysis in Section 5.
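The top-X recall threshold for the fault isolation client could be encoded as below; recallPercent and the list-based ranking are our own hypothetical illustration of the formulation in the text, not part of the paper's infrastructure.

```java
import java.util.*;

class RankEval {
    // Recall under a top-X precision threshold: 100% if the faulty class is
    // ranked within the top X classes of the ranking, 0% otherwise.
    static int recallPercent(List<String> ranking, String faultyClass, int x) {
        int idx = ranking.indexOf(faultyClass); // -1 if the class is unranked
        return (idx >= 0 && idx < x) ? 100 : 0;
    }
}
```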

Identifying transformations that can increase precision while retaining recall is challenging. Our approach discriminates trace variations that contribute to a DA's recall from the irrelevant variations that reduce the DA's precision. It performs this discrimination by identifying alternative event sequences (bounded by the same program states) that differ in the number or in the order of their events. We note that while our goal is to retain recall, there may be situations in which it is acceptable to sacrifice some recall to significantly increase precision. This is especially true when precision is achieved by greatly reducing the size of the trace pool or the amount of spurious information produced by the analysis.

In addition to precision and recall, an assessment of trace normalization must consider the cost of capturing program states 〈si,1, ..., si,j, ..., si,m〉 for each trace ti, where si,j is the program state generated by ti at checkpoint j. A program state can be viewed as an n-tuple of memory values, where each value corresponds to a "state" of a program's component, and a program's component can be a scalar variable, a reference variable



such as an object, a thread, or a program location.

Clearly, if not done carefully, capturing program

states during an execution may introduce significant overhead in the data collection. There are, however, several mechanisms at our disposal to control this overhead cost. First, we can reduce the space of the program states to capture, where the size of the program state space is determined by the number of program components and the range of their possible values [8]. Second, we can reduce the number of checkpoints, perhaps with the assistance of a static analysis technique [8, 22]. Third, we can use less expensive information to derive approximations of program state. For example, collected branch traces can be used to symbolically execute a program and approximate the range of program states visited during the execution [14]. Fourth, we can use abstractions of program state that are less expensive to collect, such as stack content or event spectra [1].

In the end, the choice of what program states to capture and when to do it involves balancing the cost of data collection and processing, the potential benefits of normalization, and the implications for the precision and recall of client analyses. Capturing a more complete program state, for example, may result in a more precise analysis, but would also require more expensive data collection and processing, and lead to the identification of fewer segment sets and fewer opportunities for normalization. We believe that there are opportunities to reduce the cost of normalizing a trace pool across different client analyses by choosing and abstracting parts of the program state that contain at least as much information as those required by the client analyses.

In addition to the program states, the normalization approach itself may sacrifice recall in order to significantly reduce the irrelevant variations in a trace pool, and consequently improve the precision of a client analysis. Section 3 will show how the normalization operations offer multiple degrees of freedom for controlling this tradeoff. For example, the most restrictive application of commutative event sequences requires the candidate segments to have identical suffixes, prefixes, and infixes. Such constraints can be relaxed to enable a more liberal normalization that gains precision but may sacrifice recall.

Overall, the choice of which parts of program state to capture, and of which normalization operations to apply and how to apply them to reduce trace segments, will influence the precision and recall of an analysis operating on the reduced trace pool. Still, as we shall see in our study of two client analyses that operate on traces with method call and return information, the normalized traces retained most if not all of the recall, achieved greater or equal precision, and reduced client analysis cost.

3. Defining the Key Concepts

Section 2 introduced the concepts of segments, segment sets, and commutative and collapsible events, which form the basis of our trace normalization approach. We define them more precisely here.

Intuitively, a trace segment defines a region of program execution that transits from an initial program state, Si, to an ending program state, Se. Collecting such segments and analyzing their similarities and differences is the foundation of trace normalization. Segments are sub-traces consisting of at least two events bounded by a designated pair of states (for commutativity and collapsibility to be applicable).

Definition 3.1 (Segment) Given two program states Si and Se, a trace t has a segment 〈Si β Se〉 if t = α Si β Se γ, where α, γ ∈ Σ* and β ∈ Σ^k with k > 1.

Segment sets are non-singleton sets of segments bounded by common program states. Note that a segment set can contain multiple segments belonging to just one trace.

Definition 3.2 (Segment Set) Given two program states Si and Se and a trace set T,

SEGSET_{Si,Se} = {〈Si β Se〉 | ∃t ∈ T : 〈Si β Se〉 is a segment of t}, with |SEGSET_{Si,Se}| > 1.

The identification of irrelevant variations is based on the notions of commutative and collapsible events.

Definition 3.3 (Commutative Events) Two events, A and B, commute in SEGSET_{Si,Se} if

∃s ∈ SEGSET_{Si,Se}, ∃α, β, γ ∈ Σ* : s = 〈Si α A β B γ Se〉, and
∃s′ ∈ SEGSET_{Si,Se}, ∃α′, β′, γ′ ∈ Σ* : s′ = 〈Si α′ B β′ A γ′ Se〉.

This definition requires that for events A and B to commute within a segment set, both A and B must appear, in differing orders, in at least two segments of the set. It is worth noting that this definition allows arbitrary freedom in matching a common prefix, infix, and suffix of the commuting events within a segment. The definition can be easily extended to fixed sequences of events A = a1, ..., an and B = b1, ..., bm.

Definition 3.4 (Collapsible Events) A sequence of consecutive occurrences of an event A, A^k where k > 1, is collapsible within a segment set SEGSET_{Si,Se} if

∃n, 1 ≤ n < k, ∃α, γ ∈ Σ* : 〈Si α A^n γ Se〉 ∈ SEGSET_{Si,Se}.

Definitions 3.3 and 3.4 require the existence of at least two segments within a set with the desired property. For a broader notion of normalization, one can require that more than two segments share the property.
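A literal reading of Definitions 3.3 and 3.4 can be sketched as follows. This is a simplification we introduce for illustration: segments are plain event lists with the bounding states left implicit, and commute only examines the first occurrence of each event in a segment.

```java
import java.util.*;

class VariantChecks {
    // Def. 3.3: A and B commute if some segment in the set orders A before B
    // and some segment orders B before A.
    static boolean commute(List<List<String>> segset, String a, String b) {
        boolean aThenB = false, bThenA = false;
        for (List<String> seg : segset) {
            int ia = seg.indexOf(a), ib = seg.indexOf(b); // first occurrences
            if (ia < 0 || ib < 0) continue;
            if (ia < ib) aThenB = true;
            else bThenA = true;
        }
        return aThenB && bThenA;
    }

    // Def. 3.4: a run A^k (k > 1) is collapsible if some segment in the set
    // contains a strictly shorter run A^n with 1 <= n < k.
    static boolean collapsible(List<List<String>> segset, String a, int k) {
        for (List<String> seg : segset) {
            int run = 0, longest = 0;
            for (String e : seg) {          // length of the longest run of A
                run = e.equals(a) ? run + 1 : 0;
                longest = Math.max(longest, run);
            }
            if (longest >= 1 && longest < k) return true;
        }
        return false;
    }
}
```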



Figure 3. Trace Normalization Infrastructure. (Figure not reproduced; it shows a pool of traces and program states flowing in Phase 1 through the Comparator, which uses matching states and comparison mechanisms, and the Decomposer, which produces segment sets, and in Phase 2 through the Analyzer, which applies sensitivity control to produce sets of segment-set variants, and the Trace Normalizer, which produces the normalized pool of traces consumed by a client analysis.)

4. Infrastructure Support

We have built an infrastructure to support trace normalization that allows us to explore several aspects of the approach. Figure 3 provides an overview of the infrastructure, which operates in two phases. The first phase involves the comparator and decomposer components. These components incrementally process one trace at a time, generating a collection of segment sets from a pool of traces and program states. In the second phase, these segment sets are processed by the analyzer and the normalizer to transform the original pool of traces by removing their irrelevant variations. We now discuss the infrastructure in more detail.

4.1. Comparator

The comparator identifies a set of candidate program states that may demark the beginning (Si) or the end (Se) of segments within traces. The comparator takes a pool of traces and their associated program states, and generates a map from each program state to pairs consisting of a trace identifier and a set of offsets within that trace. The offsets encode the locations within the trace that correspond to the checkpoints at which the program state was captured. To do this, traces are scanned and, whenever a candidate state is found, either new entries in the map are created or existing entries are updated with new trace and offset values. The map entries of candidate states that do not appear in multiple traces or locations are deleted.
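A minimal sketch of this map construction follows, with program states abstracted to strings; the representation in which states.get(t).get(j) stands for the state at checkpoint j of trace t is our assumption for illustration.

```java
import java.util.*;

class StateComparator {
    // Map each candidate state to {trace id -> offsets of the checkpoints at
    // which the state was captured}, then delete the entries of states seen
    // at only a single location, as they cannot bound a segment set.
    static Map<String, Map<Integer, List<Integer>>> build(List<List<String>> states) {
        Map<String, Map<Integer, List<Integer>>> map = new HashMap<>();
        for (int t = 0; t < states.size(); t++) {
            List<String> checkpoints = states.get(t);
            for (int off = 0; off < checkpoints.size(); off++) {
                map.computeIfAbsent(checkpoints.get(off), s -> new HashMap<>())
                   .computeIfAbsent(t, id -> new ArrayList<>()).add(off);
            }
        }
        map.values().removeIf(byTrace -> byTrace.size() == 1
                && byTrace.values().iterator().next().size() == 1);
        return map;
    }
}
```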

4.2. Decomposer

When presented with a trace pool and the program state-trace map provided by the comparator component, the decomposer utilizes Definitions 3.1 and 3.2 to construct the segment sets: for every pair of states, Si and Se, in the map, the decomposer looks for all the traces in which both states appear. Then, for each such trace, the decomposer considers the event segments bounded below by Si and bounded above by Se (which are identified by the stored offsets in the map), and adds each segment to SEGSET_{Si,Se} only if the length of the segment is greater than one.
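Under the simplifying assumption that the state captured at checkpoint j follows event j of the trace, the segment extraction for one trace and one (Si, Se) pair might look like:

```java
import java.util.*;

class Decomposer {
    // Collect the sub-traces of `trace` bounded by a checkpoint whose state
    // equals si and a later checkpoint whose state equals se; per Def. 3.1,
    // only segments containing more than one event are kept.
    static List<List<String>> segments(List<String> trace, List<String> states,
                                       String si, String se) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < states.size(); i++) {
            if (!states.get(i).equals(si)) continue;
            for (int j = i + 1; j < states.size(); j++) {
                if (states.get(j).equals(se) && j - i > 1) {
                    // events strictly after checkpoint i, up to checkpoint j
                    result.add(new ArrayList<>(trace.subList(i + 1, j + 1)));
                }
            }
        }
        return result;
    }
}
```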

An unoptimized implementation of the decomposer may be quadratic in the number of non-trivial program states and in the length of a trace, and linear in the number of traces. As previously discussed, the non-trivial cost of this algorithm can be mitigated by adjusting the subsets of the program heap to capture and the trigger mechanisms for capturing them, and by taking advantage of the opportunities for parallelization.

4.3. Analyzer

For each segment set, the analyzer uses Definitions 3.3 and 3.4 to identify patterns that expose potential sources of irrelevant variation among its segments. Specifically, we identify commutative events and collapsible events within segments. The analyzer can be configured to infer only the commutative properties, only the collapsible properties, or both, depending on the program states utilized to decompose the traces into segments.

Note that the two definitions allow the analyzer to be configured through the parameters that define the desired length of the common prefix, infix, and suffix of the commuting events within the segments in a segment set. The lengths of the prefix, infix, and suffix determine how conservative the approach will be in normalizing the potentially irrelevant variants. Generally, increasing the length of the matching sequences will lead to the identification of fewer irrelevant variants; however, it may also mean fewer spurious irrelevant variants. In our implementation, we vary the values of the prefix and suffix lengths when evaluating our approach. We set the infix length for commutative variants to 0 to ensure that only adjacent events are considered to be commutative.
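The commutative-event inference under this configuration can be sketched as follows. This is our reading of Definition 3.3, not the paper's code: adjacent pairs (A, B) commute when they appear in both orders within the same prefix/suffix context (infix fixed at 0).

```python
def commutative_pairs(segments, pre=1, suf=1):
    """Find adjacent event pairs (A, B) occurring in both orders across
    the segments of one segment set, with `pre` matching events before
    the pair and `suf` after it."""
    contexts = set()
    for seg in segments:
        for i in range(pre, len(seg) - 1 - suf):
            a, b = seg[i], seg[i + 1]
            ctx = (tuple(seg[i - pre:i]), tuple(seg[i + 2:i + 2 + suf]))
            contexts.add((ctx, a, b))
    # A pair is commutative if the reversed pair occurs in the same context.
    return {(a, b) for (ctx, a, b) in contexts
            if (ctx, b, a) in contexts and a != b}
```

Larger `pre`/`suf` values demand more matching context before declaring a pair commutative, which is the conservatism knob described above.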

When inferring the commutative and collapsible variants within a segment set, there is also some flexibility in the required number of segments in the set that exhibit the potentially irrelevant patterns. Intuitively, the greater the number of segments in a set that contain a pattern, the more confident one might be that the pattern is truly irrelevant. For example, if half of the segments have an A followed by a B and half of the segments have them in the reverse order, then we can be confident about the commutativity of A and B. In our study, we did not explore this type of variant; instead, we used Definitions 3.3 and 3.4 as stated.


Page 7: [IEEE 2008 IEEE International Symposium on Software Reliability Engineering (ISSRE) - Seattle, WA, USA (2008.11.10-2008.11.14)] 2008 19th International Symposium on Software Reliability

4.4. Trace Normalizer

This component generates a normalized version of a trace by utilizing the original trace and the set of variants associated with each segment set as output by the analyzer. The normalizer parses through the trace and its corresponding program states, attempting first to identify an Si. When such an Si is found, the normalizer scans forward for the first Se such that there exists at least one irrelevant variant that can be normalized in SEGSET_Si,Se. Only the irrelevant variants identified within this segment set can be applied to the corresponding segment. The normalizer then attempts to perform the transformation process.

The transformation process based on commutative events selects an order for those events and transforms each segment in the set so that occurrences of those events within that segment obey that order. Certain ordering heuristics, such as sorting the instances of the segment variants and always choosing the first instance of an event sequence for the commuting operation, should be applied to maintain the consistency of the normalization operations (we use a canonical ordering in our implementation). The transformation process based on collapsible events selects the shortest sequence of repeated events in a segment in a segment set and replaces all longer sequences in segments in the set with it.
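The two transformations above can be sketched on a single segment. This is an illustrative sketch under our assumptions: the canonical order is plain lexicographic order on event names, commutable adjacent pairs are bubbled into that order (each swap makes the sequence lexicographically smaller, so the loop terminates), and `min_run` stands for the precomputed shortest run of a repeated event in the set.

```python
def normalize_segment(seg, commutative, min_run=1):
    """Normalize one segment: (1) reorder known commutative adjacent
    pairs into a canonical (sorted) order, then (2) collapse runs of a
    repeated event down to `min_run` copies."""
    events = list(seg)
    changed = True
    while changed:
        changed = False
        for i in range(len(events) - 1):
            a, b = events[i], events[i + 1]
            if (a, b) in commutative and b < a:   # out of canonical order
                events[i], events[i + 1] = b, a
                changed = True
    out = []                                      # collapse repeats
    for e in events:
        run = sum(1 for prev in reversed(out) if prev == e
                  if True) if out and out[-1] == e else 0
        run = 0
        for prev in reversed(out):
            if prev == e:
                run += 1
            else:
                break
        if run < min_run:
            out.append(e)
    return out
```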

Once all possible variants are normalized within the segment, the normalizer looks for a new potential Si starting from Se and restarts the process. Once the target trace is normalized, a string comparison is performed against the other already-normalized traces to avoid duplication.

5. Empirical Study

This study investigates the performance of our trace normalization approach with respect to two client analyses: 1) lightweight fault isolation as described by Dallmeier et al. [4] and illustrated in Section 2, and 2) dynamic change impact analysis as described by Law et al. [15]. The second client is meant to assist in determining how a change to a program method impacts other methods by analyzing execution traces; the analysis returns an impact set comprised of every method called after the changed method and all methods that are on the stack when the changed method returns.

We conjecture that, in the case of fault isolation, our approach will lead to better localization, while in the case of change impact analysis it will result in a more precise impact set.

5.1. Independent Variables

We manipulate the normalization technique. A technique is either an instantiation of our approach that combines a program state type with a checkpointing trigger, or the CONTROL (the raw set of traces without any processing). We consider two types of program states: 1) a stack content snapshot (STCK), which records the method calls residing in the stack as a type of program state abstraction, and 2) a whole heap capture (HEAP), which records the values of all program variables in the heap (as generated by JPF [13]).

We consider two triggering strategies to determine when to capture the program state: after every observable event in the trace (ALL) and at every side-effect-free event (INS) (for this we consider only inspector method invocations and returns). The combinations of the two types of program states and the two trigger types result in four different normalization techniques. For each client analysis we compared the result of applying each of those techniques against the CONTROL.

5.2. Dependent Variables and Metrics

We measure the effect of the trace pool resulting from each technique on each of the two client analyses.

Fault Isolation. Irrelevant variations in traces can increase the number of call sequences characterizing the traces of passing and failing runs and lead to a less accurate class ranking (lower ranking of faulty classes). The probability of a class containing a fault is defined as the sum of the weights of its call sequences divided by the number of distinct call sequences associated with the class [4]. Given n, the number of passing runs, and k(c), the number of passing runs that contain a call sequence c, the weight of a call sequence, w(c), is k(c)/n if c does not appear in the failing run and 1 - k(c)/n otherwise. The classes are ranked in decreasing order of their probability score.
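The weighting and scoring just described can be sketched directly from the formula (variable names are ours; the weights follow the w(c) definition above):

```python
def class_score(class_sequences, passing_counts, n, failing_seqs):
    """Fault likelihood of a class: the average weight over the class's
    distinct call sequences, where w(c) = k(c)/n if c is absent from
    the failing run and 1 - k(c)/n otherwise."""
    distinct = set(class_sequences)
    total = 0.0
    for c in distinct:
        k = passing_counts.get(c, 0)      # k(c): passing runs containing c
        total += (1 - k / n) if c in failing_seqs else (k / n)
    return total / len(distinct)
```

Classes would then be ranked in decreasing order of this score.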

To evaluate the tradeoffs between precision and recall when normalized traces are used with the fault isolation client, we consider only the top X ranked classes, reflecting what a developer would do when looking at a fault likelihood ranking. This is similar to what Liu et al. have done [18]. Note that this method of evaluation is different from the one used by Dallmeier et al., where the approach is evaluated with the accuracy of the faulty class prioritization in mind. For this study we set X to 1, 2, 4, and 6 classes, which correspond to precisions of 94.7% ((1 - 1/19) * 100), 89.4%, 78.9%, and 68.4% respectively, when we consider the 19 NanoXML classes. For each precision value, we assign a recall value of 100% if the faulty class is included, and 0 otherwise.

Dynamic Change Impact Analysis. A dynamic change impact analysis is more precise when it is able to discard methods that are not impacted by the change, and it has a higher recall value when it retains methods that are truly impacted by the change.

Computing such precision and recall measures requires the exact impact set. Since we do not know the exact methods impacted by a change, we used the impact set of CONTROL as an initial approximation, and then pruned from that set the methods that did not appear in a conservative static impact analysis [21]. We call this our QEXACT (quasi-exact) set; in reality it is only an approximation to the exact information, so our precision and recall measures are, in some sense, relative. For a given technique, tech:

precision_tech = |QEXACT ∩ IS_tech| / |IS_tech|

recall_tech = |QEXACT ∩ IS_tech| / |QEXACT|

where IS_tech is the impact set of the technique tech.
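As a direct transcription of these formulas (the function name and set representation are ours):

```python
def precision_recall(impact_set, qexact):
    """Relative precision and recall of a technique's impact set
    against the QEXACT approximation."""
    hit = len(impact_set & qexact)                       # |QEXACT ∩ IS_tech|
    precision = hit / len(impact_set) if impact_set else 1.0
    recall = hit / len(qexact) if qexact else 1.0
    return precision, recall
```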

5.3. Artifact

The object of study is NanoXML, an XML parsing library for Java used in previous studies of both client analyses conducted by other researchers [4, 20].

NanoXML is available for download from the Subject Infrastructure Repository (SIR) [7], which provides a system test suite with 214 test cases and 6 versions of NanoXML with seeded faults. For the fault isolation client we randomly selected a version of NanoXML's component library, which has 19 Java classes and 7 seeded faults located in 5 distinct classes. For dynamic impact analysis we utilize all the available versions and identify the changed methods between them.

5.4. Study Setup and Design

There are several steps involved in setting up the study. The first is obtaining event traces corresponding to the artifact's test cases. In this study, we chose method calls and returns as our trace events because both client analyses operate at that granularity. We instrumented NanoXML to generate such traces. This process yielded six pools of traces of size 214, where each pool corresponds to a version of NanoXML. The average length of the traces is 2881 events, with minimum and maximum lengths of 21 and 4790 events respectively.

The second step consists of generating the snapshots of the various program state types and checkpointing triggers to be interleaved into the event traces. The STCK snapshots can be derived from the event traces by simulating the pushing and popping of the stack using the observed method calls and returns. Each unique stack content is then assigned an integer identifier. Storing the STCK snapshots of a trace requires on average 9KB of space. The generation of HEAP snapshots is more complex, and for convenience in our study we utilize Java PathFinder (JPF) [13] to capture it. We implemented a listener on top of JPF that outputs the hashed program state as we run NanoXML. The runtime overhead of capturing HEAP when the ALL trigger is used is 14% over the regular test suite execution time. The hashed program state of a trace can be represented as a sequence of integers that requires approximately 23KB of space to store.

Table 1. Seg. Sets Information of NanoXML

                      STCK-ALL   HEAP-ALL
# of segment sets        176        141
Avg. # of segments        14          6
Max. # of segments        57         27
Min. # of segments         2          2
# normalized traces       44         37
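The STCK derivation described in the second step (replaying call and return events against a simulated stack) can be sketched as follows; the event encoding ("call m" / "return m") is our assumption for illustration.

```python
def stck_snapshots(trace):
    """Derive STCK snapshots from an event trace by simulating the call
    stack; each unique stack content is assigned an integer identifier."""
    stack, ids, snapshots = [], {}, []
    for event in trace:
        kind, _method = event.split(" ", 1)
        if kind == "call":
            stack.append(_method)
        else:                         # "return": pop the finished method
            stack.pop()
        key = tuple(stack)            # current stack content
        snapshots.append(ids.setdefault(key, len(ids)))
    return snapshots
```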

The third step of the process is to decompose the traces into segment sets. Table 1 characterizes the results of this process through the number of segment sets and segments that can be derived from the NanoXML traces for the STCK and HEAP snapshots when the ALL triggering technique is used. The last row of Table 1 reports the number of distinct traces to which some normalization operations are applied when each technique is used. Note that in spite of NanoXML's relatively small trace pool, 21% and 17% of the traces are normalized when using STCK-ALL and HEAP-ALL, respectively.

The last step consists of finding irrelevant variations in the segment sets and applying the normalizations across the trace pool. When identifying opportunities for normalization, we consider exact suffix and prefix matching. Although we explored the use of different suffix and prefix sizes, the results for this artifact were not different enough to make them worth including in the limited space available.

Each client analysis was implemented as described by its authors, including their recommended parameters. For the fault isolation client, we implemented it with call sequences of outgoing methods using a sliding window of size 5 for each instantiated object.

5.5. Results: Fault Isolation Effort

To study the fault isolation client, we activated one fault at a time to avoid introducing additional sources of variation, ran all the tests, and then used the collected traces either directly or after normalization to calculate the class rankings. With the class rank, we then computed the recall value for each predefined precision threshold as discussed in Section 5.2.

Figure 4 plots the precision and recall values averaged across the 7 seeded faults for the CONTROL and the four normalization techniques (the values are joined with lines to show tendencies). As expected, independently of the technique, the recall value decreased as precision increased. As precision increased, the techniques lost the ability to retain the truly faulty class. When the precision is 89.4%, only the HEAP-based techniques retain the ability to identify more than 20% of the faulty classes. When the precision is 94.7%, only the HEAP-ALL technique can identify the faulty class.

Figure 4. Fault Isolation Recall and Precision (plot of recall, 0%-100%, against precision thresholds 68.4%, 78.9%, 89.4%, and 94.7%, for Control, STCK-ALL, STCK-INS, HEAP-ALL, and HEAP-INS)

The most interesting observation from Figure 4 is that when the client analysis consumes the trace pool normalized with any of the four techniques, it performs at least as well as, and often better than, when it consumes the CONTROL. Given the same level of precision, the four techniques have equal or greater recall (and, inversely, given the same recall, all techniques provide equal or better precision). The differences are apparent for the four techniques when the precision reaches 78.9%. At 89.4% precision, HEAP-ALL has 28% more recall than CONTROL. This shows that normalizing the traces increases the performance of the fault isolation client.

Normalizing traces also has a positive side effect in reducing the number of call chains that must be compared by the fault isolation analysis. For example, the HEAP-based techniques reduced the number of call chains by an average of 3% (32 call chains) when compared with the CONTROL technique. Although the number is modest, note that a reduction of one in the call chain set of the passing runs is compounded by the size of the failing runs processed by the client analysis, thereby increasing the efficiency of the client analysis.

5.6. Results: Dynamic Change Impact Analysis

We studied 6 versions of NanoXML, which resulted in 5 sets of changed methods between the consecutive versions. We executed the program test suite on each changed version to generate the corresponding trace pools, and then applied our four normalization techniques to the pool of traces for each changed version. Next, we performed whole program path-based dynamic impact analysis [15] on the traces collected from executing each program version with its corresponding set of changes.

Figure 5. Dynamic Change Impact Analysis Recall and Precision (plot of recall against precision, both ranging from 95% to 100%, for Control, STCK-ALL, STCK-INS, HEAP-ALL, and HEAP-INS)

For each program version, we calculated the precision and recall values of the dynamic impact analysis when consuming the CONTROL and the four normalized trace pools. The average precision and recall across the changed method sets are plotted in Figure 5. We observe that the CONTROL technique performs very well, providing 100% recall and 96% precision. Such high precision values leave little room for improvement. Still, our normalization techniques trade similarly slight percentages of recall for precision. With the HEAP-ALL technique, we were able to improve the precision by almost 3% when compared to CONTROL (HEAP-ALL identified up to 4 methods that were not in QEXACT, while CONTROL resulted in the identification of up to 8 methods erroneously labeled as impacted). This precision gain sacrificed the retention of 3 truly impacted methods. More generally, we observe trends similar to those of the fault isolation client, where the HEAP-based approaches have higher precision and recall values than the STCK-based ones. This is as expected, since a HEAP snapshot is more likely to detect variable dependencies than a STCK snapshot, at the cost of a more expensive generation.

6. Conclusions and Future Work

In this work we have introduced a novel approach to reducing irrelevant trace variations that may improve the precision of dynamic analyses while retaining recall. Our formulation of trace normalization provides a number of degrees of freedom for controlling its cost and effectiveness. We have implemented several instances of


the approach and, although of limited scope, our results indicate that the removal of irrelevant trace variation can be beneficial for two families of analyses. Our approach enables a fault isolation client to improve the ranking of classes' fault likelihood. For the dynamic change impact analysis client, the approach enables the generation of smaller change impact sets while still retaining most of the valuable information.

The experiences gained while instantiating and assessing the techniques suggest several directions for future work. We will investigate the notion of subsumption among segments, which may make the decomposition more efficient. We will explore other properties and relax some of the constraints we instituted; for example, we are interested in investigating event sequences that differ only because of event folding or unfolding. We will also explore opportunities to perform the normalization on-line and in combination with other analyses, such as side-effect analysis.

We also plan to perform similar studies on artifacts with larger trace pools to assess how the different instantiations of the approach scale and perform, and especially to evaluate the cost and overhead of the approach, which we only briefly investigated in this paper. Finally, we will continue extending the family of clients that can be successfully coupled with the approach.

Acknowledgments

This work was supported in part by the NSF through CAREER award 0347518 and awards 0429149, 0444167, 0454203, 0541263, and 0720654, and by the ARO DURIP award W911NF-04-1-0104. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies and companies.

References

[1] T. Ball and J. Larus. Optimally profiling and tracing programs. ACM Trans. on Programming Languages and Systems, 16(4):1319–1360, 1994.

[2] J. F. Bowring, J. M. Rehg, and M. J. Harrold. Active learning for automatic classification of software behavior. In Int. Symp. on Soft. Testing and Analysis, pages 195–205, 2004.

[3] H. Cleve and A. Zeller. Locating causes of program failures. In Int. Conf. on Soft. Eng., pages 342–351, 2005.

[4] V. Dallmeier, C. Lindig, and A. Zeller. Lightweight defect localization for Java. In European Conf. on OO Programming, pages 528–550, 2005.

[5] W. Dickinson, D. Leon, and A. Podgurski. Finding failures by cluster analysis of execution profiles. In Int. Conf. on Soft. Eng., pages 339–348, May 2001.

[6] M. Diep, S. Elbaum, and M. Dwyer. Reducing irrelevant trace variations. In Int. Conf. on Automated Soft. Eng. - Short Paper, pages 477–480, 2007.

[7] H. Do, S. Elbaum, and G. Rothermel. Supporting controlled experimentation with testing techniques: An infrastructure and its potential impact. Empirical Soft. Eng., 10(4):405–435, 2005.

[8] S. Elbaum, H. N. Chin, M. B. Dwyer, and J. Dokulil. Carving differential unit test cases from system test cases. In Found. of Soft. Eng., pages 253–264, 2006.

[9] S. Elbaum, S. Kanduri, and A. Andrews. Anomalies as precursors of field failures. In Int. Symp. of Soft. Reliability Eng., pages 108–118, 2003.

[10] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin. Dynamically discovering likely program invariants to support program evolution. IEEE Trans. Soft. Eng., 27(2):99–123, February 2001.

[11] P. Francis, D. Leon, M. Minch, and A. Podgurski. Tree-based methods for classifying software failures. In Int. Symp. of Soft. Reliability Eng., Nov. 2004.

[12] M. Haran, A. Karr, M. Last, and A. Sanil. Techniques for classifying executions of deployed software to support software engineering tasks. IEEE Trans. Softw. Eng., 33(5):287–304, 2007.

[13] Java PathFinder. http://javapathfinder.sourceforge.net/.

[14] J. C. King. Symbolic execution and program testing. Commun. of ACM, 19(7):385–394, 1976.

[15] J. Law and G. Rothermel. Whole program path-based dynamic impact analysis. In Int. Conf. on Soft. Eng., pages 308–318, 2003.

[16] D. Leon, W. Masri, and A. Podgurski. An empirical evaluation of test case filtering techniques based on exercising complex information flows. In Int. Conf. on Soft. Eng., May 2005.

[17] C. Liu and J. Han. Failure proximity: a fault localization-based approach. In Found. of Soft. Eng., pages 46–56, 2006.

[18] C. Liu, X. Yan, L. Fei, J. Han, and S. P. Midkiff. SOBER: statistical model-based bug localization. In Found. of Soft. Eng., pages 286–295, 2005.

[19] A. Mazurkiewicz. Trace theory. In Advances in Petri Nets 1986, Part II on Petri Nets: Applications and Relationships to Other Models of Concurrency, pages 279–324, 1987.

[20] A. Orso, T. Apiwattanapong, J. Law, G. Rothermel, and M. J. Harrold. An empirical comparison of dynamic impact analysis algorithms. In Int. Conf. on Soft. Eng., pages 491–500, 2004.

[21] Santos Laboratory. Indus. http://indus.projects.cis.ksu.edu/.

[22] G. Xu, A. Rountev, Y. Tang, and F. Qin. Efficient checkpointing of Java software using context-sensitive capture and replay. In Found. of Soft. Eng., pages 85–94, 2007.
