
Profiling Deployed Software: Assessing Strategies and Testing Opportunities

Sebastian Elbaum, Member, IEEE, and Madeline Diep

Abstract—An understanding of how software is employed in the field can yield many opportunities for quality improvements. Profiling released software can provide such an understanding. However, profiling released software is difficult due to the potentially large number of deployed sites that must be profiled, the transparency requirements at a user's site, and the remote data collection and deployment management process. Researchers have recently proposed various approaches to tap into the opportunities offered by profiling deployed systems and overcome those challenges. Initial studies have illustrated the application of these approaches and have shown their feasibility. Still, the proposed approaches, and the tradeoffs between overhead, accuracy, and potential benefits for the testing activity, have barely been quantified. This paper aims to overcome those limitations. Our analysis of 1,200 user sessions on a 155 KLOC deployed system substantiates the ability of field data to support test suite improvements, assesses the efficiency of profiling techniques for released software, and the effectiveness of testing efforts that leverage profiled field data.

Index Terms—Profiling, instrumentation, software deployment, testing, empirical studies.

1 INTRODUCTION

Software test engineers cannot predict, much less exercise, the overwhelming number of potential scenarios faced by their software. Instead, they allocate their limited resources based on assumptions about how the software will be employed after release. Yet, the lack of connection between in-house activities and how the software is employed in the field can lead to inaccurate assumptions, resulting in decreased software quality and reliability over the system's lifetime. Even if estimations are initially accurate, isolation from what happens in the field leaves engineers unaware of future shifts in user behavior or variations due to new environments until too late.

Approaches integrating in-house activities with field data appear capable of overcoming such limitations. These approaches must profile field data to continually assess and adapt quality assurance activities, considering each deployed software instance as a source of information. The increasing software pervasiveness and connectivity levels¹ of a constantly growing pool of users, coupled with these approaches, offers a unique opportunity to gain a better understanding of the software's potential behavior.

Early commercial efforts have attempted to harness this opportunity by including built-in reporting capabilities in deployed applications that are activated in the presence of certain failures (e.g., Software Quality Agent from Netscape [25], Traceback [17], Microsoft Windows Error Reporting API). More recent approaches, however, are designed to leverage the deployed software instances throughout their execution, and also to consider the levels of transparency required when profiling users' sites, the management of instrumentation across the deployed instances, and issues that arise from remote and large-scale data collection. For example:

. The Perpetual Testing project produced the residual testing technique to reduce the instrumentation based on previous coverage results [30], [32].

. The EDEM prototype provided a semiautomated way to collect user-interface feedback from remote sites when it does not meet an expected criterion [15].

. The Gamma project introduced an architecture to distribute and manipulate instrumentation across deployed software instances [14].

. The Bug Isolation project provides an infrastructure to efficiently collect field data and identify likely failure conditions [22].

. The Skoll project presented an architecture and a set of tools to distribute different job configurations across users [23], [34], [35].

Although these efforts present reasonable conjectures, we have barely begun to quantify their potential benefits and costs. Most publications have illustrated the application of isolated approaches [15], introduced supporting infrastructure [35], or explored a technique's feasibility under particular scenarios (e.g., [12], [18]). (Previous empirical studies are summarized in Section 2.5.) Given that feasibility has been shown, we must now quantify the observed tendencies, investigate the tradeoffs, and explore whether the previous findings are valid at a larger scale.

This paper's contributions are twofold. First, it presents a family of three empirical studies that quantify the efficiency and effectiveness of profiling strategies for released software. Second, through those studies, it proposes and assesses several techniques that employ field data to drive test suite improvements, and it identifies factors that can impact their potential cost and benefits.

. The authors are with the Department of Computer Science and Engineering, University of Nebraska-Lincoln, NE 68588-0115. E-mail: {elbaum, mhardojo}@cse.unl.edu.

Manuscript received 27 Oct. 2004; revised 31 Jan. 2005; accepted 28 Feb. 2005; published online DD Mmmm, YYYY. Recommended for acceptance by J. Knight. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TSE-0249-1004.

1. Nielsen reported that 35 million Americans had broadband Internet access in 2003 [26].

In the following section, we abstract the essential attributes of existing profiling techniques for released software to organize them into strategies and summarize the results of previous empirical studies. Section 3 describes the research questions, object preparation, design and implementation, metrics, and potential threats to validity. Section 4 presents results and analysis. Section 5 provides additional discussion and conclusions.

2 PROFILING STRATEGIES FOR RELEASED SOFTWARE

Researchers have enhanced the efficiency of profiling techniques through several mechanisms:

1. performing up-front analysis to optimize the amount of instrumentation required [3],

2. sacrificing accuracy by monitoring entities of higher granularity or by sampling program behavior [10], [11],

3. encoding the information to minimize memory and storage requirements [31], and

4. repeatedly targeting the entities that need to be profiled [1].

The enumerated mechanisms gain efficiency by reducing the amount of instrumentation inserted in a program. Such reduction is achieved through the analysis of the properties the program exhibited in a controlled in-house environment.

Profiling techniques for released software [14], [15], [30], [35], however, must consider the efficiency challenges and the new opportunities introduced by a potentially large pool of users. Efforts to profile deployed software are faced with the challenge of operating under the threshold of user noticeability. Collecting profile data from thousands of instances is faced with the challenge of handling and analyzing overwhelming amounts of data. The ultimate challenge is to create profiling strategies that overcome those challenges and, in our case, effectively leverage the field data to improve the testing activities.

This paper investigates three such profiling strategies. The strategies abstract the essential elements of existing techniques, helping to provide an integrated background and facilitating formalization, analysis of tradeoffs, and comparison of existing techniques and their implementations. The next section describes the full strategy, which constitutes our baseline technique; the following three sections describe the profiling strategies for released software; and Section 2.5 summarizes the related applications and empirical studies.

2.1 Full Profiling

Given a program P and a class of events to monitor C, this approach generates P' by incorporating instrumentation code into P to enable the capture of ALL events in C, and a transfer mechanism T to transmit the collected data to the organization at the end of each execution cycle (e.g., after each operation, when a fatal error occurs, or when the user terminates a session).

Capturing all the events at all user sites demands extensive instrumentation, increasing program size and execution overhead (reported to range from 10 percent to 390 percent [2], [22]). However, this approach also provides the maximum amount of data for a given set C, serving as a baseline for the analysis of the other three techniques specific to released software. We investigate the raw potential of field data through the full technique in Section 4.1.

2.2 Targeted Profiling

Before release, software engineers develop an understanding of the program's behavior. Factors such as software complexity and schedule pressure limit the levels of understanding they can achieve. This situation leads to different levels of certainty about the behavior of program components. For example, when testers validate a program with multiple configurations, they may be able to gain certainty about a subset of the most used configurations. Some configurations, however, might not be (fully) validated.

To increase profiling transparency in deployed instances, software engineers may target components whose behavior is not sufficiently understood. Continuing with our example of software with multiple configurations, engineers aware of the overhead associated with profiling deployed software could aim to profile just the least understood or most risky configurations.

More formally, given program P, a class of events to profile C, and a list of events observed (and sufficiently understood) in-house C_observed, where C_observed ⊆ C, this technique generates a program P' with additional instrumentation to profile all events in C_targeted = C - C_observed. Observe that |C_targeted| determines the efficiency of this technique by reducing the necessary instrumentation, but also bounds what can be learned from the field instances. In practice, the gains generated through targeted profiling are intrinsically tied to the engineers' ability to determine what is worth targeting. Note also that as the certainty about the behavior of the system or its components diminishes, |C_observed| → 0, and this strategy's performance approximates full.
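To make the targeted strategy concrete, the following minimal Python sketch derives the targeted instrumentation set from in-house coverage data. The event identifiers and the instrument callback are hypothetical placeholders, not part of the tool chain used in this paper.

```python
# Hypothetical sketch of targeted profiling: instrument only the events
# whose in-house behavior is not sufficiently understood.

def select_targeted_events(all_events, observed_in_house):
    """C_targeted = C - C_observed."""
    return set(all_events) - set(observed_in_house)

def build_targeted_version(program, all_events, observed_in_house, instrument):
    # 'instrument' is an assumed callback that inserts one probe per event
    # and returns the (conceptually) instrumented program.
    for event in select_targeted_events(all_events, observed_in_house):
        program = instrument(program, event)
    return program

# Example: functions covered in-house are dropped from field profiling.
all_functions = {"f_compose", "f_send", "f_filter", "f_ssl_auth"}
covered_in_house = {"f_compose", "f_send"}
print(select_targeted_events(all_functions, covered_in_house))
# -> {'f_filter', 'f_ssl_auth'} (set order may vary)
```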

As defined, this strategy includes the residual testing technique [30], where C_observed corresponds to statements observed in-house, and it includes the distribution scheme proposed by Skoll [23], [34], [35], where C_targeted corresponds to target software configurations that require field testing.

2.3 Profiling with Sampling

Statistical sampling is the process of selecting a suitable part of a population for determining the characteristics of the whole population. Profiling techniques have adapted the sampling concept to reduce execution costs by repeatedly sampling across space and time. Sampling across space consists of profiling a subset of the events following a certain criterion (e.g., hot paths). Sampling across time consists of obtaining a sample from the population of events at certain time intervals [4], [11], or better yet, at uniformly random execution intervals [22].


When considering released software, we find an additional sampling dimension: the instances of the program running in the field. We could, for example, sample across the population of deployed instances, profiling the behavior of a group of users. This is advantageous because it would only add the profiling overhead to a subset of the population. Still, the overhead on this subset of instances could substantially affect users' activities, biasing the collected information.

An alternative sampling mechanism could consider multiple dimensions. For example, we could stratify the population of events to be profiled following a given criterion (e.g., events from the same functionality) and then sample across the subgroups of events, generating a version of the program with enough instrumentation to capture just those sampled events. Then, by repeating the sampling process, different versions of P' can be generated for distribution. Potentially, each user could obtain a slightly different version aimed to capture a particular event sample.²

More formally, given program P and a class of events to monitor C, the stratified sampling strategy performs the following steps: 1) it identifies s strata of events C_1, C_2, ..., C_i, ..., C_s, 2) it selects a total of N events from C by randomly picking n_i from C_i, where n_i is proportional to the stratum size, and 3) it generates P' to capture the selected events. As N gets smaller, P' contains less instrumentation, enhancing transparency but possibly sacrificing accuracy. By repeating the sampling and generation process, P'', P''', ..., P^m are generated, resulting in versions with various suitable instrumentation patterns available for deployment.³
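The sketch below illustrates these three steps under simplifying assumptions: strata are given as lists of event identifiers, the proportional allocation is rounded, and each generated "version" is represented only by the set of events it would instrument. None of the names correspond to the actual Gamma or Skoll tooling.

```python
import random

def stratified_sample(strata, total_n, rng):
    """Pick roughly total_n events, proportionally to each stratum's size."""
    population = sum(len(s) for s in strata)
    sample = []
    for stratum in strata:
        n_i = max(1, round(total_n * len(stratum) / population))
        sample.extend(rng.sample(stratum, min(n_i, len(stratum))))
    return set(sample)

def generate_versions(strata, total_n, num_versions, seed=0):
    """Repeat the sampling to obtain P', P'', ..., one probe set per version."""
    rng = random.Random(seed)
    return [stratified_sample(strata, total_n, rng) for _ in range(num_versions)]

# Example: strata correspond to program components (e.g., source files).
strata = [["f1", "f2", "f3", "f4"], ["g1", "g2"], ["h1", "h2", "h3"]]
for i, probes in enumerate(generate_versions(strata, total_n=4, num_versions=3)):
    print(f"version {i}: instrument {sorted(probes)}")
```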

There is an important tradeoff between the event sample size N and the number of deployed instances. Maintaining N constant while the number of instances increases results in constant profiling transparency across sites. As the number of deployed instances increases, however, the level of overlap in the collected data across deployed sites is also likely to increase. Section 4.2 investigates how to leverage this overlap to reduce N, gaining transparency at each deployed site by collecting less data while compensating by profiling more deployed instances.

As defined, our sampling strategy provides a statistical framework for the various algorithms implemented by the Gamma research effort [18], [29]. One distinct advantage of this framework, however, is that it generalizes the procedure to sample across deployed sites which, as we shall see in Section 4.2, offers further opportunities to improve efficiency.

2.4 Trigger Data Transfers on Anomalies

Transferring data from deployed sites to the organization can be costly for both parties. For the user, data transfer implies at least additional computation cycles to marshal and package data, bandwidth to actually perform the transfer, and a likely decrease in transparency. For an organization with thousands of deployed software instances, data collection might become a bottleneck. Even if this obstacle could be overcome with additional collection devices (e.g., a cluster of collection servers), processing and maintaining such a data set could prove expensive. Triggering data transfers in the presence of anomalies can help to reduce these costs.

Employing anomaly detection implies the existence of a baseline behavior considered nominal or normal. When targeting released software, the nominal behavior is defined by what the engineers know or understand. For example, engineers could define an operational profile based on the execution probability exhibited by a set of beta testers [24]. A copy of this operational profile could be embedded into the released product so that deviations from its values trigger a data transfer. Deviations from this profile could indicate an anomaly in the product or in the usage of the product. Sessions that fit within the operational profile would only send a confirmation to the organization, increasing the confidence in the estimated profile; sessions that fall outside a specified operational profile range are completely transferred to the organization for further analysis (e.g., to diagnose whether the anomaly indicates a potential problem in the product or an exposure to an environment that was not tested, or to update the existing profile).

We now define the approach more formally. Given program P, a class of events to profile C, an in-house characterization of those events C_house, and a tolerance to deviations from the in-house characterization C_houseTolerance, this technique generates a program P' with additional instrumentation to monitor events in C, and some form of predicate Pred_house involving C_house, C_houseTolerance, and C_field that determines when the field behavior does not conform to the baseline behavior. When Pred_house does not hold, session data is transferred to the organization. Note that this definition of trigger by anomaly includes the type of behavioral "mismatch" trigger mechanism used by EDEM [15] (Pred_house = (C_field ∈ C_house), with C_houseTolerance = 0).
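A minimal sketch of such a trigger is shown below, assuming that the in-house characterization is an operational profile of per-operation execution probabilities and that the tolerance is an absolute deviation threshold; the names, thresholds, and callbacks are illustrative assumptions rather than the mechanisms used by EDEM or in our study.

```python
def pred_house(c_house, c_field, tolerance):
    """Return True when the field profile conforms to the in-house profile."""
    for event, expected in c_house.items():
        if abs(c_field.get(event, 0.0) - expected) > tolerance:
            return False
    # Events never characterized in-house are also treated as anomalies.
    return all(event in c_house for event in c_field)

def end_of_session(c_house, c_field, tolerance, transfer, confirm):
    # 'transfer' and 'confirm' are assumed callbacks that send the full session
    # data or just a lightweight confirmation to the organization.
    if pred_house(c_house, c_field, tolerance):
        confirm()
    else:
        transfer(c_field)

# Example: a session exercises the filter operation far more often than expected.
in_house = {"compose": 0.30, "read": 0.60, "filter": 0.10}
field = {"compose": 0.05, "read": 0.40, "filter": 0.55}
print(pred_house(in_house, field, tolerance=0.2))  # -> False, so data is transferred
```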

There are many interesting tradeoffs in defining and detecting deviations from C_house. For example, there is a tradeoff between the level of investment in the in-house software characterization and the number of false negatives reported from the field. Also, there are many algorithms to detect anomalies, trading detection sensitivity and effectiveness with execution overhead. We investigate some of these tradeoffs in Section 4.3.

2.5 Applications and Previous Empirical Studies

The efforts to develop profiling techniques for released software have aimed for different goals and have been supported by different mechanisms. Hilbert and Redmiles [15], [16] utilized walk-through scenarios to demonstrate the concept of Internet-mediated feedback. The scenarios illustrated how software engineers could improve user interfaces through the collected information. The scenarios reflected the authors' experiences in a real context, but they did not constitute empirical studies.


2. An alternative procedure could first stratify the user population, and then sample across events. However, we conjecture that finding strata of known events is easier than finding subgroups of an unknown or shifting user population.

3. In this work, we focus on two particular dimensions: space and deployed instances. However, note that sampling techniques on time could be applied on the released instances utilizing the same mechanism.


Pavlopoulou and Young empirically evaluated the efficiency gains of the residual testing technique on programs of up to 4 KLOC [30]. The approach considered instrumentation probe removal, where a probe is the snippet of code incorporated into P to profile a single event. Executed probes were removed after each test was executed, showing that instrumentation overhead to capture coverage can be greatly reduced under certain conditions (e.g., similar coverage patterns across test cases, non-linear program structure). Although incorporating field data into this process was discussed, this aspect was not empirically evaluated.

The Gamma group performed at least two empirical studies to validate the efficiency of their distributed instrumentation mechanisms. Bowring et al. [18] studied the variation in the number of instrumentation probes and interactions, and the coverage accuracy for two deployment scenarios. For these scenarios, they employed a 6 KLOC program and simulated users with synthetic profiles. A second study by the same group of researchers, led by Orso [28], employed the created infrastructure to deploy a 60 KLOC system and gathered profile information on 11 users (seven from the research team) to collect 1,100 sessions. The field data was then used for impact analysis and regression testing improvement. The findings indicate that field data can provide smaller impact sets than slicing and truly reflect the system utilization, while sacrificing precision. The study also highlighted the potential lack of accuracy of in-house estimates, which can also lead to a more costly regression testing process.

The Skoll group has also begun performing studies to assess the feasibility of their infrastructure, which utilizes user participation to improve the quality assurance process, on two large open source projects (ACE and TAO) [35]. The feasibility studies reported on the infrastructure's capability in detecting failures in several configuration settings. Yilmaz et al. have also proposed an approach using covering arrays to steer the sampling of the configuration settings to be tested [34]. Early assessments performed through simulations across a set of approximately 10 workstations revealed a reduction in the number of configurations to be tested while identifying some faulty configurations not found by the original developers [23].

Overall, when revisiting the previous studies in terms of the profiling strategies, we find that:

1. the transfer on anomaly strategy has not been validated through empirical studies,

2. the targeted profiling strategy has not been validated with deployment data and its efficiency analysis included just four small programs,

3. the strategy involving sampling has been validated more extensively in terms of efficiency but the effectiveness measures were limited to coverage, and

4. each assessment was performed in isolation.

Our studies address those weaknesses by improving on the following items:

. Target object and subjects. Our empirical studies are performed on a 155 KLOC program utilized by 30 users, providing the most comprehensive empirical setting yet to study this topic.

. Comparison and integration of techniques within the same scenarios. We analyze the benefits and costs of field data obtained with full instrumentation and compare them against techniques utilizing targeted profiling, sampling profiling, a combination of targeting and sampling, and anomaly-driven transfers.

. Assessment of effectiveness. We assess the potential of field data in terms of coverage gains, additional fault detection capabilities, potential for invariant refinements, and overall data capture.

. Assessment of efficiency. We count the number of instrumentation probes required and executed, and also measure the number of required data transfers (a problem highlighted but not quantified in [15]).

3 EMPIRICAL STUDY

This section introduces the research questions that serve to scope this investigation. The metrics, object of study, and the design and implementation follow. Last, we identify the threats to validity.

3.1 Research Questions

We are interested in the following research questions.

RQ1. What is the potential benefit of profiling deployed software instances? In particular, we investigate the coverage, fault detection effectiveness, and invariant refinements gained through the generation of a test suite based on field data.

RQ2. How effective and efficient are profiling techniques designed to reduce overhead at each deployed site? We investigate the tradeoffs between efficiency gains (as measured by the number of probes required to profile the target software and the probes executed in the field), coverage gains, and data loss.

RQ3. Can anomaly-based triggers reduce the number of data transfers? What is the impact on the potential gains? We investigate the effects of triggering transfers when a departure from an operational profile is detected, when an invariant is violated, and when a stochastic model predicts that the sequence of events is likely to lead to a failure state.

3.2 Object

We selected the popular⁴ program Pine (Program for Internet News and Email) as the object of the experiment. Pine is one of the numerous programs that perform mail management tasks. It has several advanced features such as support for automatic incorporation of signatures, Internet newsgroups, transparent access to remote folders, message filters, secure authentication through SSL, and multiple roles per user. In addition, it supports tens of platforms and offers flexibility for a user to personalize the program by customizing configuration files. Several versions of the Pine source code are publicly available. For our study, we primarily use the Unix build, version 4.03, which contains 1,373 functions and over 155 thousand lines of code including comments.

4. Pine had 23 million users worldwide as of March 2003 [27].

3.2.1 Test Suite

To evaluate the potential of field data to improve the in-house testing activity, we required an initial test suite on which improvements could be made. Since Pine does not come with a test suite, two graduate students, who were not involved in the current study, developed a suite by deriving requirements from Pine's man pages and user's manual, and then generating a set of test cases that exercised the program's functionality. Each test case was composed of three sections: 1) a setup section to set folders and configuration files, 2) a set of Expect [21] commands that test the interactive functionality, and 3) a cleanup section to remove all the test-specific settings. The test suite consisted of 288 automated test cases, containing an average of 34 Expect commands. The test suite took an average of 101 minutes to execute on a PC with an Athlon 1.3 processor, 512 MB of memory, running Redhat version 7.2. When function and block level instrumentation was inserted, the test suite execution required 103 and 117 minutes, respectively (2 percent and 14 percent overhead). Overall, the test suite covered 835 of 1,373 functions (approximately 61 percent) in the target platform.
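For illustration, the fragment below sketches the setup/commands/cleanup structure of such a test case using the Python pexpect library rather than classic Expect; the Pine prompts, keystrokes, and file paths are assumptions made for this example, not excerpts from the actual suite.

```python
import os
import shutil
import pexpect  # Python analogue of the Expect tool used in the actual suite

def run_sample_test(workdir="/tmp/pine-test"):
    # 1) Setup: create a throwaway work area and configuration file.
    os.makedirs(workdir, exist_ok=True)
    pinerc = os.path.join(workdir, "pinerc")
    with open(pinerc, "w") as f:
        f.write("personal-name=Test User\n")

    # 2) Interactive commands: drive Pine through its menus and check its output.
    child = pexpect.spawn(f"pine -p {pinerc}", timeout=30)
    child.expect("MAIN MENU")      # assumed banner text
    child.send("q")                # quit command
    child.expect("Really quit")    # assumed confirmation prompt
    child.send("y")
    child.expect(pexpect.EOF)

    # 3) Cleanup: remove all test-specific settings.
    shutil.rmtree(workdir)

if __name__ == "__main__":
    run_sample_test()
```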

3.2.2 Faults

To quantify potential gains in fault detection effectiveness, we required the existence of faults. We leveraged a parallel research effort by our team [7] that had resulted in 43 seeded faults in four subsequent versions of Pine: 4.04, 4.05, 4.10, and 4.20. (The seeding procedure follows the one detailed in [8].) We then utilized these faulty versions to quantify the fault detection effectiveness of the test suites that leveraged field data (analogous to a regression testing scenario).

3.2.3 Invariants

To quantify potential gains, we also considered the refinement of dynamically inferred invariants that can be achieved with the support of field data. Since invariants specify the program's operational behavior, we conjectured that novel operational behavior from the field could be employed to trigger data transfers and refine the set of potential invariants generated in-house. Because Pine did not have built-in invariants, we used our in-house test suite to dynamically discover its likely invariants, following a procedure similar to that utilized by Ernst et al. [9] with the assistance of Daikon [13]. The resulting potential invariants were located at the entry and exit points of methods and classes, called program points (PP). Utilizing Daikon's default settings, and after removing "zero-sample" program points (program points that are never exercised by the input) and redundant potential invariants, we obtained 2,113 program points and 56,212 likely potential invariants.

3.3 Study Design and Implementation

The overall empirical approach was driven by the research questions and constrained by the costs of collecting data for multiple deployed instances. Throughout the study design and implementation process, we strove to achieve a balance between the reliability and representativeness of the data collected from deployed instances under a relatively controlled environment, and the costs associated with obtaining such a data set. As we shall see, combining a controlled deployment and collection process with a posteriori simulation of different scenarios and techniques helped to reach such a balance (potential limitations of this approach are presented under threats to validity in Section 3.4).

We performed the study in three major phases: 1) object preparation, 2) deployment and data collection, and 3) processing and simulation.

The first phase consisted of instrumenting Pine to enable a broad range of data collection. The instrumentation is meant to capture functional coverage information,⁵ traces of operations triggered by the user through the menus, accesses to environmental variables, and changes in the configuration file occurring in a single session (a session is initiated when the program starts and finishes when the user exits). In addition, to enable further validation activities (e.g., test generation based on user session data), the instrumentation also enables the capture of various session attributes associated with user operations (e.g., folders, number of emails in folders, number and type of attachments, errors reported on input fields). At the end of each session, the collected data is time-stamped, marshaled, and transferred to the central repository. For anonymization purposes, the collected session data is packaged and labeled with the encrypted sender's name at the deployed site, and the process of receiving sessions is conducted automatically at the server to reduce the likelihood of associating a data package with its sender.
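The end-of-session handling described above can be pictured with the following sketch; the hash-based sender label, the JSON packaging, and the send_to_repository placeholder are our own illustrative assumptions, not details of the actual instrumentation.

```python
import hashlib
import json
import time

def package_session(events, user_name):
    """Time-stamp, label, and marshal one session's collected data."""
    # An anonymized sender label stands in for the encrypted name used in the study.
    label = hashlib.sha256(user_name.encode()).hexdigest()
    return json.dumps({"timestamp": time.time(), "sender": label, "events": events})

def send_to_repository(package):
    # Placeholder for the actual transfer performed at the end of each session.
    print(f"transferring {len(package)} bytes to the central repository")

# Example: a tiny session trace.
session_events = {"covered_functions": ["f_compose", "f_send"],
                  "operations": ["COMPOSE", "SEND", "QUIT"]}
send_to_repository(package_session(session_events, user_name="alice"))
```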

We conducted the second phase of the study in two steps. First, we deployed the instrumented version of Pine at five "friendly" sites. For two weeks, we used this preliminary deployment to verify the correctness of the installation scripts, the data capture process and content, the magnitude and frequency of data transfer, and the transparency of the deinstallation process. After this initial refinement period, we proceeded to expand the sample of users. The target population corresponded to the approximately 60 students in our Department's largest research lab. After promoting the study for a period of two weeks, 30 subjects volunteered to participate in the study (members from our group were not allowed to participate). The study's goal, setting, and duration (45 days) were explained to each one of the subjects, and the same fully instrumented version of the Pine package was made available for them to install. At the termination date, 1,193 user sessions had been collected, an average of one session per user per day (no sessions were received during six days due to data collection problems).

The last phase consisted of employing the collected data to support different studies and simulating different scenarios that could help us answer the research questions. The particular simulation details, such as the manipulated variables, the nuisance variables, and the assumptions, vary depending on the research question, so we address them individually within each study.

5. We made a decision to capture functional level data in the field because of its relatively low overhead (see Section 3.2.1 for details on overhead).

3.4 Threats to Validity

This study, like any other, has some limitations that could have influenced the results. Some of these limitations are unavoidable consequences of the decision to combine an observational study with simulation. However, given the cost of collecting field data and the fundamental exploratory questions we are pursuing, these two approaches offered us a good balance between data representativeness and power to manipulate some of the independent variables.

By collecting data from 30 deployed instances of Pine during a period of 45 days, we believe we have performed the most comprehensive study of this type. Our program of study is representative of many programs in the market, limiting threats to external validity. In spite of the gaps in the data collection process due to data corruption and server "down-time," we believe the large number of sessions collected overshadows the data loss and does not constitute a significant threat.

Regarding the selected pool of users, the subjects are students in our Department, but we did not exercise any control during the period of study, which gives us confidence that they behaved as any other user would under similar circumstances. Still, more studies with other programs and subjects are necessary to confirm the results we have obtained. For example, we must include subjects that exercise the configurable features of Pine and we need to distribute versions of the program for various configurations.

Our simplifying assumptions about deployment management are another threat to external validity. Although the assumptions are reasonable and could be restated for further simulation if needed, empirical studies specifically including various deployment strategies are required [33]. Furthermore, our studies assume that the incorporation of instrumentation probes and anomaly-based triggers does not result in additional program faults and that repeated deployments are technically and economically feasible in practice.

We are also aware of the potential impact of observational studies on a subject's behavior. Although we clearly stated our goals and procedures, subjects could have been afraid to send certain types of messages through our version of Pine. This was a risk we were willing to take to increase the chances of exploring different scenarios through simulation. Overall, gaining users' trust and willingness to be profiled is a key issue for the proposed approaches to succeed and should be the focus of future studies.

The in-house validation process helped to set a baseline for the assessment of the potential of field data (RQ1) and the anomaly-based triggers for data transfers (RQ3). As such, the quality of the in-house validation process is a threat to internal validity which could have affected our assessments. We partially studied this factor by carving weaker test suites from an existing suite (Section 4.1) and by considering different numbers of users to characterize the initial operational profiles (Section 4.3). Still, the scope of our findings is affected by the quality of the initial suite.

When implementing the techniques, we had to make certain choices. Some of these choices may limit the generality of our findings. For example, we implemented six test case generation techniques that utilize different source data or follow different procedures, we employed the invariant discovery tool with one particular configuration setting, and we chose one particular type of clustering algorithm. Different sources of data, settings, or algorithms could have resulted in a different outcome. To limit this threat, we considered the cost-benefits and potential applicability resulting from our choices.

Last, the metrics we have selected are just a subset of the potential metrics that could have been used to assess the potential of field data and the performance of profiling strategies. For example, our study was not designed to capture bandwidth consumption, throughput reduction due to overhead, or deployment costs. Instead, at this empirical stage, our assessment consisted of quality measures (e.g., fault detection, coverage gain, invariant refinement, data loss) and their tradeoffs with performance surrogate measures such as the number of probes required, the number of probes executed, or the number of transfers required.

4 STUDIES’ SETTING AND RESULTS

The following sections address our research questions. Each section contains two parts, one explaining the simulation setting and one presenting the findings.

4.1 RQ1 Study: Field Data Driven Improvement

This first study addresses RQ1, aiming to provide a better understanding of the potential of field data to improve a test suite. This study assesses three different mechanisms to harness the potential of field data for test suite improvement, as measured by coverage, fault detection, and invariant detection gains.

4.1.1 Simulation Setting

For this study, we assume a full profiling strategy was employed to collect coverage data on each deployed instance. We also assume that the potential benefits are assessed at the end of the collection process, without loss of generality (the same process could be conducted more often).

We consider three groups of test case generation procedures. These groups appear in Table 1. First, we defined a procedure that can generate test cases to reach all the entities executed by the users, including the ones missed by the in-house test suite. We call this hypothetical procedure "Potential" because it sets an upper bound on the performance of test suites generated based on field data.

Second, we consider four automated test case generation procedures that translate each user session into a test case. The procedures vary in the data they employ to recreate the conditions at the user site. For example, to generate a test case with the A3 procedure, the following field data is employed: 1) a sequence of commands replaying the ones employed by the user, with dummy values to complete necessary fields (e.g., the email's destination, subject, and content), 2) the user's initial Pine configuration, and 3) an initial setup where the configuration file and mailbox content are matched as closely as possible to those in the session. By definition, all the automated test case generation procedures result in 1,193 test cases (one test case per user session).

Third, the author of this paper who was not familiar with the location of the seeded faults enhanced the two best automatic test case generation procedures (A3 and A4) through the analysis of specific trace data. As would happen in a testing environment, we constrained the analysis to traces from field sessions that provided coverage gains not captured by the automated procedures. The test derivation process consisted of the automatic identification of operations and functions that made those field traces different, the selection of existing test cases exercising features like the ones observed in the field traces, and the modification of those test cases so that similar traces could be generated in-house. This enhancement resulted in 572 additional test cases for M1 and 315 additional test cases for M2.

Note that the automatically generated tests and the ones generated with the tester's assistance are not meant to exactly reproduce the behavior observed in the field; these procedures simply attempt to leverage field data to enhance an existing test suite with cases that approximate field behavior to a certain degree.

4.1.2 Results

Coverage. Fig. 1 shows how individual sessions contribute coverage. During the 1,193 collected user sessions, 128 functions (9.3 percent of the total) that were not exercised by the original test suite were executed in the field. Intuitively, as more sessions are gathered, we would expect it to become harder to cover new functions. This is corroborated by the growth of the cumulative coverage gain line. Initially, field behavior exposes many originally uncovered functions. This benefit, however, diminishes over time, suggesting perhaps that field data can benefit testing the most at the early stages after deployment.

TABLE 1. Test Case Generation Procedures Based on Field Data

Fig. 1. Coverage gain per session and cumulative.

Table 2 presents the gains for the three groups of test generation procedures. The first row presents the potential gain of field data if it is fully leveraged. Rows 2-5 present the fully automated translation approach to exploit field data, which can generate a test suite that provides at most 3.5 percent function coverage gain. The more elaborate test case generation procedures that require the tester's participation (rows 6 and 7 in Table 2) provide at most 6.9 percent gains in terms of additional functions covered.

Although we did not capture block coverage at the deployed instances ("na" cell in Table 2), we used the generated test cases to measure the block coverage gain. At the block level, the test cases that utilized field data exercised from 4.8 percent to 14.8 percent additional blocks, providing additional evidence about the value of data from deployed instances.

Still, as the in-house suite becomes more powerful and just the extremely uncommon execution patterns remain uncovered, we expect that the realization of potential gains will become harder because a more accurate reproduction of the user session may be required. For example, a user session might cover a new entity when special characters are used in an email address. Reproducing that class of scenario would require capturing the content of the email, which we refrained from doing for privacy and performance concerns. This type of situation may limit the potential gains that can actually be exploited.

Fault Detection. The test suites developed utilizing field data also generated gains over the in-house test suite in terms of fault detection capabilities. Table 3 reports the faults seeded in each version of Pine, the faults detected through the in-house test suite, and the additional faults found through the test cases generated based on field data. The test suites generated through the automatic procedures discovered up to 7 percent additional faults over the in-house test suite, and the procedures requiring the tester's participation discovered up to 26 percent additional faults. Field data raised the overall testing effectiveness on future versions (regression testing), as measured by fault detection, from 65 percent to 86 percent.

Likely Invariants Refinement. Table 4 describes the potential of field data to drive likely invariants' refinements. We compared the potential invariants resulting from the in-house test suite versus the potential invariants resulting from the union of the in-house test suite and the test suites generated with the procedures defined in Table 1. The comparison is performed in terms of the number of potential invariants and the number of program points removed and added with Daikon's default confidence settings.

To determine whether the refinements were due to just additional coverage, we also discriminated between invariants and program points added in the functions covered in-house, and the ones in newly executed functions. Potential invariants can also be generalized to always hold true for every possible exit point of a method in a program, instead of holding true only for a specific exit point in a method. We denote those as generalized PP and specific PP, respectively, and use them to further discriminate the refinement obtained with field data.

Table 4 reveals that all the test case generation procedures based on field data help to refine the existing potential invariants in the programs. The procedures more effective in terms of coverage and fault detection were also the most effective in providing likely invariants' refinements. For example, with the assistance of M2 we were able to discard 11 percent of the 56,212 potential invariants detected with the in-house test suite, add 66 percent new potential invariants (6 percent coming from newly covered functions), and add 21 percent generalized program points. Even with the simplest automated procedure A1, 4 percent of the potential invariants were removed, and 18 percent were added. Most differences between consecutive techniques (e.g., A1 versus A2) were less than 12 percent, with the exception of the "added invariants" criterion, for which increases were quadratic in some cases. For example, when utilizing A4 instead of A3, there is an increase of 14 percent in the number of potential invariants added in the covered functions. This indicates that a closer representation of the user setting is likely not only to result in a refined set of potential invariants, but also to drastically increase the characterization breadth of the system behavior.

TABLE 2. Coverage Data

TABLE 3. Fault Detection Effectiveness

Impact of initial test suite coverage. Intuitively, initial test suite coverage would affect the potential effectiveness of the field data to facilitate any improvement. For example, a weaker initial test suite is likely to benefit sooner from field data. To confirm that intuition, we carved several test suites from the initial suite with varying levels of coverage. Given the initial test suite T0 of size n0, we randomly chose n1 test cases from T0 to generate T1, where n1 = n0 × potencyRatio and potencyRatio values range from 0 to 1. We arbitrarily picked a fixed potencyRatio of 0.96 (equivalent to decreasing Ti by 10 test cases to create Ti+1), repeating this process using test suite Ti to generate Ti+1, where Ti ⊃ Ti+1. This process resulted in 29 test suites with decreasing coverage levels.
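A small sketch of this carving procedure follows, under the assumption that test cases are plain identifiers; the fixed step of 10 test cases mirrors the potencyRatio of 0.96 described above.

```python
import random

def carve_suites(initial_suite, step=10, seed=0):
    """Generate T1, T2, ... by repeatedly dropping 'step' random test cases."""
    rng = random.Random(seed)
    suites = [list(initial_suite)]
    while len(suites[-1]) > step:
        suites.append(rng.sample(suites[-1], len(suites[-1]) - step))
    return suites

# Example: carving the 288-test suite yields 29 suites of 288, 278, 268, ... tests.
t0 = [f"test_{i:03d}" for i in range(288)]
carved = carve_suites(t0)
print(len(carved), [len(t) for t in carved[:5]])  # -> 29 [288, 278, 268, 258, 248]
```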

TABLE 4. Likely Invariants Refinement

Fig. 2. Initial and gained coverage per test suite.

We then consider the impact of test suite coverage on potential gains. Fig. 2 shows that T0's initial coverage is 60 percent, its potential gain is 9.3 percent, and its overall coverage is 69 percent. On the other hand, T28 provides an initial coverage of 28 percent and potential gains of 35 percent, resulting in an overall coverage of 63 percent. Suites with lower initial coverage are likely to benefit more from field data. Note, however, that the overall coverage (initial plus gain) achieved by a weaker test suite is consistently inferior to that of a more powerful test suite. This indicates that, even if the full potential of field data is exploited to generate test cases, more sessions with specific (and perhaps unlikely) coverage patterns may be required to match the results from a stronger in-house suite.

4.2 RQ2 Study: Targeted and Sampled Profile

This study addresses RQ2, providing insights on the reduction of profiling overhead achieved by instrumenting a subset of the entities or sampling across the space of events and deployed instances.

4.2.1 Simulation Setting

We investigate the performance of two profiling techniques: targeted profiling (tp) and sampling profiling (sp). The results from the full profiling technique serve as the baseline.

For tp, we assume the software was deployed with enough instrumentation to profile all functions that were not executed during the testing process (considering our most powerful test suite T0). Since we stored the coverage information for each session, we can simulate the conditions in which only a subset of those elements are instrumented. We also assume the same instrumented version is deployed at all sites during the duration of the study.

For sp, we have to consider two parameters: the number of strata defined for the event population and the number of versions to deploy. For our study, we defined as many strata as program components (each of the 27 files constitutes a component). Regarding the versions of P, we generate multiple versions by incorporating different sets of instrumentation probes in each version. If we assume each user gets a different instrumented version of the program, then the maximum number of versions is 30 (one version per user). The minimum number of versions to apply the sampling across users strategy is two. We also simulated having 5, 10, 15, 20, and 25 deployed versions to observe the trends between those extremes. We denote these sp variations as sp2, sp5, sp10, sp15, sp20, sp25, and sp30. To get a better understanding of the variation due to sampling, we performed this process on each sp variation five times, and then performed the simulation. For example, for sp2, we generated five pairs of two versions, where each version in a pair had half of the functions instrumented, and the selection of functions was performed by randomly sampling functions across the identified components.

The simulation results for these techniques were evaluated through two measures. The number of instrumentation probes required and the ones executed in the field constitute our efficiency measures. The potential coverage gains generated by each technique constitute our effectiveness measure.

4.2.2 Results

Efficiency: number of probes required. The scatterplot in Fig. 3 depicts the number of instrumentation probes and the potential coverage gain for full, tp, and a family of sp techniques. There is one observation for the full and tp techniques, and the five observations for each sp (corresponding to the five samples that were simulated) have been averaged.

The full technique needs almost 1,400 probes to provide a 9.3 percent gain. By utilizing the tp technique, and given the in-house test suite we had developed, we were able to reduce the number of probes to 32 percent of the full technique (475 probes) without loss in potential gain, confirming previous conjectures about its potential. The sp techniques provided efficiency gains with a wide range of effectiveness. As more instrumented versions are deployed, the profiling efficiency as measured by the number of instrumentation probes increases. With the most aggressive implementation of the technique, sp30, the number of probes per deployed instance is reduced to 3 percent of the full technique. By the same token, a more extensive distribution of the instrumentation probes across a population of deployed sites diminished the potential coverage gains to approximately a third of the full technique (when each user gets a uniquely instrumented version, the coverage gain is below 3 percent).


Fig. 3. Coverage gains and probes for full, tp, sp, and hyb.


We then proceeded to combine tp and sp to create a hybrid technique, hyb, that performs sampling only on functions that were not covered by the initial test suite. Applying tp on sp2, sp5, sp10, sp15, sp20, sp25, and sp30 yielded hyb2, hyb5, hyb10, hyb15, hyb20, hyb25, and hyb30, respectively. Hybrid techniques (depicted in Fig. 3 as triangles) were as effective as sp techniques in terms of coverage gains but even more efficient. For example, hyb2 cut the number of probes by approximately 70 percent compared to sp2. However, when a larger number of instrumented versions are deployed, the differences between the hyb and the sp techniques become less obvious as the chances of obtaining gains at an individual deployed site diminish.

Efficiency: number of probes executed in the field. The number of inserted probes is a reasonable surrogate for assessing the transparency of the profiling activities that is independent of the particular field activity. The reduction in the number of inserted probes, however, does not necessarily imply a proportional reduction in the number of probes executed in the field. So, to complement the assessment from Fig. 3, we computed the reduction in the number of executed probes by analyzing the collected trace files.

Fig. 4 presents the boxplots for the tp and sp profiling techniques. Tp shows an average overhead reduction of approximately 96 percent. The sp techniques' efficiency improves as the number of deployed versions increases (the number of probes for each user decreases), with an average reduction ranging from 52 percent for sp2 to 98 percent for sp30. It is also interesting to note the larger efficiency variation of sp techniques when utilizing fewer versions; having fewer versions implies more probes per deployed version, which increases the chances of observing variations across users with different operational profiles.
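As an illustration of this measure, the following sketch computes the executed-probe reduction for one session from a recorded trace and the set of functions a given version instruments; the trace format and the names are assumptions for illustration, not the paper's instrumentation.

def executed_probe_reduction(trace, instrumented_functions):
    # Sketch of the executed-probes measure: under full profiling every
    # function entry in the trace fires a probe; a sampled version only
    # fires probes for the functions it instruments.
    full_hits = len(trace)
    kept_hits = sum(1 for fn in trace if fn in instrumented_functions)
    return 1.0 - kept_hits / full_hits if full_hits else 0.0

# e.g., a version instrumenting only two functions on a short trace
print(executed_probe_reduction(["open", "read", "read", "send"], {"open", "send"}))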

Sampling across users and events. Fig. 3 shows a considerable reduction in the number of probes through the family of stratified sampling techniques. As elaborated in Section 2.3, this strategy adds a sampling dimension by distributing multiple versions with different sampled events (different instrumentation probes) across the user pool. We now investigate whether the reduction in the number of probes observed was due merely to the sampling across the pool of events, or whether sampling across the pool of users had a meaningful effect on the observed reduction.

Since we are interested in observing trends, we decided to focus on sp2 because it has the largest number of probes of all sp techniques. Under sp2, two versions were generated, each one with half of the total probes required by the full technique. We relabeled this technique as sp2(50%). We also created five variations of this technique where each of the two versions gets a sample of the probes in sp2(50%): 80 percent results in sp2(41%), two-thirds results in sp2(34%), a half results in sp2(25%), a third results in sp2(16%), and a fifth results in sp2(10%). Then, we generated the analogous techniques that sample just across events within the same version, resulting in sp1(50%), sp1(41%), sp1(34%), sp1(25%), sp1(16%), and sp1(10%). We performed this process for sp1 and sp2 five times, and computed the average additional coverage gain and number of probes inserted.

The results are presented in Fig. 5. The gain decreases for both sets of techniques as the number of probes decreases. The figure also shows that adding the sampling dimension across users accounted for almost half of the gain at the 50 percent event sampling level. Sp2 was consistently more effective than sp1, even though the effectiveness gap diminished as the number of probes was reduced. Sp2 provides a broader (more events are profiled over all deployed instances) but likely shallower (fewer observations per event) understanding of the event population than sp1. Still, as the number of deployed instances grows, sp2 is likely to compensate by gaining multiple observation points per event across all instances.

Fig. 4. Executed probes for tp, sp, and hyb.

Sampling and data loss. To get a better understanding of how much information is lost when we sample across deployed instances, we identified the top 5 percent (68) most executed functions under the full technique and compared them against the same list generated when various sampling schemes are used. We executed each sp technique five times to avoid potential sampling bias. The results are summarized in Table 5.

Utilizing sp2 reduces the number of probes in each deployed instance by half, but it can still identify 65 of the top 68 executed functions. When the sampling became more aggressive, data loss increased. The magnitude of the losses caused by the sampling aggressiveness is clearly exemplified when sp30 is used and only 61 percent of the top 5 percent executed functions are correctly identified. Interestingly, probes were reduced at a higher rate than data was lost: for sp30, the number of probes was reduced to 3 percent of full profiling. It was also noticeable that more aggressive sampling did not necessarily translate into higher standard deviations due to the particular sample. In other words, the results are quite independent of which functions end up being sampled in each deployed version.

Overall, even the most extreme sampling across deployed instances did better than the in-house test suite in identifying the top executing functions. The in-house test suite, however, was not designed with this goal in mind. Still, it is interesting to realize that sp30, with just 45 probes per deployed version in a program with 1,373 functions, can identify 61 percent of the most executed functions.
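This data-loss comparison can be expressed with a short sketch that measures how many of the most executed functions under full profiling are recovered from the counts a sampled version observes; the counter-based input format is an assumption for illustration.

from collections import Counter

def top_function_overlap(full_counts, sampled_counts, top_n=68):
    # Sketch of the data-loss measure: how many of the top_n most executed
    # functions under full profiling are also identified under a sampling
    # scheme. Both arguments map function names to observed execution counts.
    top_full = {fn for fn, _ in Counter(full_counts).most_common(top_n)}
    top_sampled = {fn for fn, _ in Counter(sampled_counts).most_common(top_n)}
    return len(top_full & top_sampled) / top_n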

4.3 RQ3 Study: Anomaly-Based Triggers for Transfers

This study addresses RQ3, assessing the effect of detecting anomalies in program behavior to trigger data transfers. The objective is to trigger a field data transfer only when the released software is operating in ways we did not anticipate. We empirically evaluate various techniques and assess their impact on the coverage and fault detection capabilities of test suites generated from field data.

4.3.1 Simulation Setting

For this study, we concentrate on three families of anomaly-based detection techniques: departures from operational profile ranges, likely invariant violations, and unknown or failure-prone scenarios as defined by Markov models. In our analysis, we also include the full profiling technique, which triggers transfers irrespective of anomalies, providing a point of reference for comparison.

For each technique, we use the in-house test suite and a random group of users that simulate beta testers to derive the nominal behavior against which field behavior is evaluated. We explore three levels of characterization for the baseline, utilizing 3 (10 percent), 6 (20 percent), and 15 (50 percent) of the users.

When an anomaly is detected, the captured field behavior is sent to the server for the software engineers to determine how this anomaly should be treated (e.g., discard, reassess baseline, reproduce scenario). For each technique we evaluate two variations: one that keeps C_houseTolerance constant, and one that uses the information from the field to continually refine C_houseTolerance (simulating the scenario in which engineers classify a behavior as normal and refine the baseline).

We now describe the simulation settings for each technique individually.


Fig. 5. Sampling dimensions: gains for sp1 versus sp2.

TABLE 5. Sp Data Loss versus Full


Anomaly Detection through Operational Profile Ranges. An operational profile consists of a set of operations and their associated probabilities of occurrence [24]. For this anomaly detection technique, the set of events to profile is C = O, where O is the set of n operations o, and C_houseTolerance can be defined by various ranges of acceptable execution probability for the operations. To develop the operational profile baseline for Pine, we followed the methodology described by Musa [24]: 1) identify operation initiators; in our case, these were the regular users and the system controller; 2) create a list of operations by traversing the menus just as regular users would initiate actions; and 3) perform several pruning iterations, trimming operations that are not fit (e.g., operations that cannot exist in isolation). The final list contained 34 operations.

Although there are many ways to define C_houseTolerance based on operational profiles, in this paper we just report on the technique that has given us the best cost-benefit so far: Minmax.6 Minmax utilizes a predicate based on the range [minimum, maximum]_o, which contains the minimum and maximum execution probability per operation observed through the in-house testing process and the beta testers' usage. The technique assumes a vector with these ranges is embedded within each deployed instance.7 Every time an operation has a field execution probability field_o such that field_o ∉ [minimum, maximum]_o, an anomaly is triggered.

Given the variable duration of the users' sessions, we decided to control this source of variation so that the number of operations in each session did not impact field_o (e.g., a session containing just one operation would assign a 100 percent execution probability to that operation and most likely generate an outlier). The simulation considers partitions of 10 operations (the mean number of operations across all sessions) for each user: a session with fewer than 10 operations is clustered with the following session from the same deployed site until 10 operations are gathered, while longer sessions are partitioned to obtain partitions of the same size.
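A minimal sketch of this Minmax trigger, under the assumptions above, follows: a site's operations are grouped into windows of 10, each window's observed probabilities are checked against the embedded [minimum, maximum]_o vector, and any out-of-range operation raises an anomaly. The operation names and the dictionary-based range format are illustrative; the feedback variant discussed earlier would widen a range whenever engineers classify a flagged window as normal.

from collections import Counter

def minmax_anomalies(operation_stream, ranges, window=10):
    # Sketch of the Minmax operational-profile trigger. operation_stream is
    # the concatenated sequence of operations observed at one deployed site;
    # ranges maps an operation to its (minimum, maximum) execution
    # probability learned in-house and from beta testers.
    anomalies = []
    buffer = []
    for op in operation_stream:
        buffer.append(op)
        if len(buffer) == window:
            probs = {o: c / window for o, c in Counter(buffer).items()}
            for o, p in probs.items():
                lo, hi = ranges.get(o, (0.0, 1.0))
                if not (lo <= p <= hi):
                    anomalies.append((o, p))   # would trigger a field data transfer
            buffer = []
    return anomalies

# e.g., "compose" was never observed above a 0.2 probability in-house
ranges = {"open_folder": (0.0, 0.6), "compose": (0.0, 0.2)}
stream = ["compose"] * 5 + ["open_folder"] * 5
print(minmax_anomalies(stream, ranges))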

Anomaly Detection through Invariant Violation. For this anomaly detection technique, C = Inv, where Inv is the set of potential program invariants inferred dynamically from the in-house test suite and the beta testers, and C_houseTolerance = 0; that is, as soon as a potential invariant is violated a transfer is triggered.

Section 3.2 described how we inferred likely program invariants from the in-house test suite. To detect violations of these invariants in the field, such invariants could be incorporated into Pine in the form of assertions; when an assertion is violated in the field, a transfer is triggered. Contrary to operational profiles, we did not have assertions inserted during the data collection process, so we had to simulate that scenario. We employed the trace files generated from running the M2 test suite (the most powerful generation technique defined in Section 4.1) to approximate the trace files we would have observed in the field. Then, we analyzed the invariant information generated from the traces to determine which of the original potential invariants were violated in the field.
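The sketch below illustrates the assertion-style trigger this simulates: likely invariants inferred in-house (e.g., with Daikon) are turned into predicates over the values observed at program points, and the first violated predicate triggers a transfer since C_houseTolerance = 0. The program-point names, the predicate encoding, and the trace format are assumptions for illustration only.

def invariant_trigger(field_observations, inferred_invariants):
    # Sketch of the invariant-violation trigger. field_observations is a
    # sequence of (program_point, variable_values) pairs recorded in the
    # field; inferred_invariants maps a program point to predicates that
    # encode the likely invariants inferred in-house.
    for point, values in field_observations:
        for predicate in inferred_invariants.get(point, []):
            if not predicate(values):
                return ("transfer", point)     # tolerance is zero: first violation transfers
    return ("no transfer", None)

# e.g., an inferred invariant "msg_count >= 0" at entry to a folder routine
invariants = {"folder_open:entry": [lambda v: v["msg_count"] >= 0]}
print(invariant_trigger([("folder_open:entry", {"msg_count": 3})], invariants))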

Anomaly Detection through Markov Models. For this anomaly detection technique, we used a classification scheme based on Markov models to characterize the field behavior into three groups: pass, fail, and unknown. When a field behavior is classified as either fail or unknown, we trigger a transfer. To create such a classifier, we followed the process utilized by Bowring et al. [19], which combines Markov models to characterize program runs with clustering algorithms to group those runs. The clustering algorithms we used were inspired by the work of Dickinson et al. [5], which investigated several grouping procedures to filter program runs worth analyzing to support observational testing [20]. (We use clustering as a step in the classification of runs based on their most likely outcome.)

In this study, we targeted two sets of events, C = FunctionCall and C = OperationTransition, leading to the MarkovF and MarkovOp classifiers. The process to create a classifier began with the creation of a Markov model for each event trace generated in-house and each trace generated by beta testers. A Markov model's probabilities represented the likelihood of transitioning between the trace events. Each model also had an associated label indicating whether the trace had resulted in a pass or a fail. We then used a standard agglomerative hierarchical clustering procedure to incrementally join these models based on their similarities. To assess the models' similarities, we transformed the models' probabilities to 0s and 1s (nonzero probabilities were considered 1) and computed the Hamming distance across all models. We experimented with various clustering stopping criteria and settled on 25 percent (two models cannot be joined if they differ in more than a quarter of their events). As a result of this process, the classifier of pass behavior had 67 models for functions and 53 models for operations, while the classifier of fail behavior had 17 models for functions and five models for operations.

We assume the classifiers can be consulted by the profiling process before deciding whether a session should be transferred or not. A field execution is associated with the model under which it has the greatest probability of occurring. If that Markov model represents a fail type behavior, then the execution trace is classified as exhibiting a fail behavior. If the Markov model represents a pass type behavior, then the execution trace is classified as exhibiting a pass behavior. However, if the field trace does not generate a probability above 0.1 for any model, then it is considered unknown.
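The following sketch captures the classification step just described: it builds a first-order Markov model from a trace and scores a field trace against the labeled pass/fail models, falling back to unknown below the 0.1 threshold. The scoring function shown (average transition probability) and the trace representation are assumptions, and the clustering of models via Hamming distance with the 25 percent stopping criterion is omitted for brevity.

from collections import defaultdict

def markov_model(trace):
    # Build a first-order Markov model (transition probabilities) from one
    # event trace (function calls for MarkovF, operation transitions for MarkovOp).
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(trace, trace[1:]):
        counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def trace_score(trace, model):
    # A simple per-trace score: the average probability of the trace's
    # transitions under the model (the paper's exact scoring may differ).
    probs = [model.get(a, {}).get(b, 0.0) for a, b in zip(trace, trace[1:])]
    return sum(probs) / len(probs) if probs else 0.0

def classify(field_trace, labeled_models, threshold=0.1):
    # labeled_models is a list of ("pass" | "fail", model) pairs resulting
    # from the clustering step; a fail or unknown classification triggers
    # a field data transfer.
    best_label, best_score = "unknown", threshold
    for label, model in labeled_models:
        score = trace_score(field_trace, model)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_label in ("fail", "unknown")

# e.g., classify a field trace against one pass model built in-house
pass_model = markov_model(["open", "read", "read", "close"])
print(classify(["open", "read", "close"], [("pass", pass_model)]))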

4.3.2 Results

Table 6 summarizes the findings of this study. The first column identifies the triggering technique we employed. Column two contains the number of beta testers (and the percentage of all users) considered to define the baseline behavior considered normal. Column three reports the number of transfers required under each combination of technique and number of beta testers. Columns four and five correspond to the number of functions covered and the new faults found when we employed the test suite generated with field data.


6. For other techniques based on operational profiles, see [6].
7. Under Musa's approach, only one operational profile is used to characterize the system behavior, independent of the variation among users.


Independently of the anomaly detection technique, we observe some common trends. First, all anomaly detection techniques reduce, to different degrees, the number of transfers required by the full technique. Since an average of 250 kb of compressed data was collected in each session, reducing the number of transfers had a considerable impact on the required packaging process and bandwidth. These techniques also sacrifice, to different extents, the potential gains in coverage and fault detection. Second, having a better initial characterization of the system behavior through additional beta testers leads to a further reduction in the number of transfers. For example, with a characterization including just three beta testers, minmax triggers 28 percent of the transfers required by full; when 15 beta testers are utilized, that percentage is reduced to 2 percent. Third, utilizing feedback reduces the number of transfers. For example, for the Inv technique and a characterization of six beta testers, adding feedback resulted in 35 percent fewer transfers. The cost of such reductions in terms of coverage and fault detection varies across techniques.

Minmax is the technique that offers the greatest reduction in the number of transfers, requiring a minimum of 1 percent of the transfers utilized by full (with 15 beta testers and feedback). The coverage obtained even with such a small number of transfers is within 20 percent of full. However, a test suite utilizing the data from those transfers only detects two of nine faults. The utilization of feedback increased C_houseTolerance as new minimum and maximum execution probabilities for operations were observed in the field. Increasing C_houseTolerance reduced the number of transfers, but its impact was less noticeable as more beta testers were added. Note also that the efficiency of minmaxf will wear down as C_houseTolerance increases.

Inv techniques are more conservative than minmax in terms of transfer reduction, requiring 37 percent of the transfers employed by full in the best case. However, a test suite driven by the transferred field data would result in the same coverage as full8 with over 50 percent of its fault detection capability. Another difference with minmax is that Inv can leverage feedback to reduce transfers even with 15 beta testers (from 54 percent to 37 percent of the transfers). It is important to note that the overhead of this technique is proportional to the number of likely invariants incorporated into the program in the form of assertions; adjusting the type of invariants considered is likely to affect the transfer rate.

Markov techniques seem to offer better reduction capabilities than the Inv techniques, and still detect a greater number of faults. (Due to the lack of differences, we only present the results for techniques utilizing feedback.) For example, under the best potential characterization, MarkovF needs 64 percent of the transfers required by full while sacrificing only one fault. It is also interesting to note that while MarkovF performs worse than MarkovOp in reducing the number of transfers, MarkovF is generally more effective. Furthermore, by definition, MarkovF is as good as full in function coverage gains. We did observe, however, that MarkovF's efficiency is susceptible to functions with very high execution probabilities that skew the models. Higher-level observable abstractions, like the operations in MarkovOp, have more evenly distributed executions, which makes the probability scores more stable.


TABLE 6. The Effects of Anomaly-Based Triggers

8. We simulated the invariant anomaly detection in the field utilizing M2, which had a maximum functional coverage of 838. Still, by definition, this technique would result in a transfer when a new function is executed because that would result in a new PP.


Overall, minmax offers extreme reduction capabilities, which would fit settings where data transfers are too costly, where only a quick exploration of the field behavior is possible, or where the number of deployed instances is large enough that there is a need to filter the amount of data to be later analyzed in-house. Inv offers a more detailed characterization of the program behavior, which results in additional reports of departure from nominal behavior, but also in increased fault detection. Markov-based techniques provide a better combination of reduction and fault detection gains, but their on-the-fly computational complexity may not make them viable in all settings.

5 CONCLUSIONS AND FUTURE WORK

We have presented a family of empirical studies to investigate several implementations of the profiling strategies and their impact on testing activities. Our findings confirm the large potential of field data obtained through these profiling strategies to drive test suite improvements. However, our studies also show that the gains depend on many factors, including the quality of the in-house validation activities, the process to transform potential gains into actual gains, and the particular application of the profiling techniques. Furthermore, the potential gains tend to stabilize over time, pointing to the need for adaptable techniques that readjust the type and number of profiled events.

Our results also indicate that targeted and sampling profiling techniques can reduce the number of instrumentation probes by up to one order of magnitude, which greatly increases the viability of profiling released software. When coverage is used as the selection criterion, tp provides a safe approach (it does not lose potential coverage gains) to reduce the overhead. However, practitioners cannot expect to notice an important overhead reduction with tp in the presence of a poor in-house test suite, and it may not be realistic to assume that practitioners know exactly what to target when other criteria are used. On the other hand, the sampling profiling techniques were efficient independently of the in-house test suite attributes, but at the cost of some effectiveness. Although practitioners can expect to overcome some of the sp effectiveness loss by profiling additional sites, the representativeness achieved by the sampling process is crucial to guide the insertion of probes across events and sites. It was also noticeable that the simple sequential aggregation of the tp and sp techniques to create hyb provided further overhead reduction.

We also found that anomaly-based triggers can be effective at reducing the number of data transfers, a potential trouble spot as the amount of data collected or the number of profiled sites increases. It is reasonable for practitioners utilizing some of the simple techniques we implemented to reduce the number of transfers by an order of magnitude and still retain 92 percent of the coverage achieved by full profiling. More complex techniques that attempt to predict the pass-fail behavior of the software can obtain half of that transfer reduction and still provide 85 percent of the fault detection capability of full. Still, further efforts are necessary to characterize the appropriateness of the techniques under different scenarios and to study the adaptation of more accurate, even if expensive, anomaly detection techniques.

Techniques for profiling released software still have to tackle many difficult problems. First, one of the most significant challenges we found through our studies was to define how to evaluate the applicability, cost, and benefit of profiling techniques for released software. Our empirical methodology provides a viable model, combining observation with simulation, to investigate these techniques. Nevertheless, we still need to explore how to evaluate these strategies at a larger scale and to include additional assessment measures such as deployment costs and bandwidth requirements. Second, we have yet to exploit the full potential of field data. For example, our studies quantified simple mechanisms to transform field data into test cases. Given the appropriate data, these mechanisms could be refined to better approximate the behavior at deployed sites. Furthermore, in the presence of thousands of deployed instances, this process must happen continually, which introduces new challenges to filter, interpret, and aggregate a potentially overwhelming amount of data. Last, profiling deployed software raises issues not only about acceptable overhead but also about users' confidentiality and privacy. Future techniques and studies must take these factors into consideration.

ACKNOWLEDGMENTS

This work was supported in part by a CAREER Award 0347518 to the University of Nebraska-Lincoln. The authors are especially grateful to the users who volunteered for the study. They would like to thank the participants of the First Workshop on Remote Analysis and Measurement of Software Systems, Gregg Rothermel, and Michael Ernst for providing valuable feedback on earlier versions of this paper, and the ISSTA 2004 reviewers for their suggestions to improve the paper. They are also thankful to the Daikon group at MIT.

REFERENCES

[1] M. Arnold and B. Ryder, "A Framework for Reducing the Cost of Instrumented Code," Proc. Conf. Programming Language Design and Implementation, pp. 168-179, 2001.
[2] T. Ball and J. Larus, "Optimally Profiling and Tracing Programs," ACM Trans. Programming Languages and Systems, vol. 16, no. 4, pp. 1319-1360, 1994.
[3] T. Ball and J. Larus, "Optimally Profiling and Tracing Programs," Proc. Ann. Symp. Principles of Programming, pp. 59-70, Aug. 1992.
[4] B. Calder, P. Feller, and A. Eustace, "Value Profiling," Proc. Int'l Symp. Microarchitecture, pp. 259-269, Dec. 1997.
[5] W. Dickinson, D. Leon, and A. Podgurski, "Finding Failures by Cluster Analysis of Execution Profiles," Proc. Int'l Conf. Software Eng., pp. 339-348, May 2001.
[6] S. Elbaum and M. Hardojo, "An Empirical Study of Profiling Strategies for Released Software and Their Impact on Testing Activities," Proc. Int'l Symp. Software Testing and Analysis, pp. 65-75, July 2004.
[7] S. Elbaum, S. Kanduri, and A. Andrews, "Anomalies as Precursors of Field Failures," Proc. Int'l Symp. Software Reliability Eng., pp. 108-118, 2003.
[8] S. Elbaum, A.G. Malishevsky, and G. Rothermel, "Test Case Prioritization: A Family of Empirical Studies," IEEE Trans. Software Eng., vol. 28, no. 2, pp. 159-182, Feb. 2002.



[9] M. Ernst, J. Cockrell, W. Griswold, and D. Notkin, "Dynamically Discovering Likely Program Invariants to Support Program Evolution," IEEE Trans. Software Eng., vol. 27, no. 2, pp. 99-123, Feb. 2001.
[10] A. Glenn, T. Ball, and J. Larus, "Exploiting Hardware Performance Counters with Flow and Context Sensitive Profiling," ACM SIGPLAN Notices, vol. 32, no. 5, pp. 85-96, 1997.
[11] S. Graham and M. McKusick, "Gprof: A Call Graph Execution Profiler," ACM SIGPLAN, vol. 17, no. 6, pp. 120-126, June 1982.
[12] K. Gross, S. McMaster, A. Porter, A. Urmanov, and L. Votta, "Proactive System Maintenance Using Software Telemetry," Proc. Workshop Remote Analysis and Monitoring Software Systems, pp. 24-26, 2003.
[13] MIT Program Analysis Group, The Daikon Invariant Detector User Manual, http://pag.csail.mit.edu/daikon/download/doc/daikon.html, Year?
[14] M. Harrold, R. Lipton, and A. Orso, "Gamma: Continuous Evolution of Software After Deployment," cc.gatech.edu/aristotle/Research/Projects/gamma.html, Year?
[15] D. Hilbert and D. Redmiles, "An Approach to Large-Scale Collection of Application Usage Data over the Internet," Proc. Int'l Conf. Software Eng., pp. 136-145, 1998.
[16] D. Hilbert and D. Redmiles, "Separating the Wheat from the Chaff in Internet-Mediated User Feedback," 1998.
[17] InCert, "Rapid Failure Recovery to Eliminate Application Downtime," www.incert.com, June 2001.
[18] J. Bowring, A. Orso, and M. Harrold, "Monitoring Deployed Software Using Software Tomography," Proc. Workshop Program Analysis for Software Tools and Eng., pp. 2-9, 2002.
[19] J. Bowring, J. Rehg, and M.J. Harrold, "Active Learning for Automatic Classification of Software Behavior," Proc. Int'l Symp. Software Testing and Analysis, pp. 195-205, 2004.
[20] D. Leon, A. Podgurski, and L. White, "Multivariate Visualization in Observation-Based Testing," Proc. Int'l Conf. Software Eng., pp. 116-125, May 2000.
[21] D. Libes, Exploring Expect: A Tcl-Based Toolkit for Automating Interactive Programs. Sebastopol, Calif.: O'Reilly and Associates, Inc., Nov. 1996.
[22] B. Liblit, A. Aiken, Z. Zheng, and M. Jordan, "Bug Isolation via Remote Program Sampling," Proc. Conf. Programming Language Design and Implementation, pp. 141-154, June 2003.
[23] A. Memon, A. Porter, C. Yilmaz, A. Nagarajan, D.C. Schmidt, and B. Natarajan, "Skoll: Distributed Continuous Quality Assurance," Proc. Int'l Conf. Software Eng., pp. 449-458, May 2004.
[24] J. Musa, Software Reliability Engineering. McGraw-Hill, 1999.
[25] Netscape, "Netscape Quality Feedback System," home.netscape.com/communicator/navigator/v4.5/qfs1.html, year?
[26] Nielsen, "Nielsen Net Ratings: Nearly 40 Million Internet Users Connect Via Broadband," www.nielsen-netratings.com, 2003.
[27] University of Washington, Pine Information Center, http://www.washington.edu/pine/, Year?
[28] A. Orso, T. Apiwattanapong, and M.J. Harrold, "Leveraging Field Data for Impact Analysis and Regression Testing," Foundations of Software Eng., pp. 128-137, Sept. 2003.
[29] A. Orso, D. Liang, M. Harrold, and R. Lipton, "Gamma System: Continuous Evolution of Software after Deployment," Proc. Int'l Symp. Software Testing and Analysis, pp. 65-69, 2002.
[30] C. Pavlopoulou and M. Young, "Residual Test Coverage Monitoring," Proc. Int'l Conf. Software Eng., pp. 277-284, May 1999.
[31] S. Reiss and M. Renieris, "Encoding Program Executions," Proc. Int'l Conf. Software Eng., pp. 221-230, May 2001.
[32] D. Richardson, L. Clarke, L. Osterweil, and M. Young, "Perpetual Testing Project," http://www.ics.uci.edu/djr/edcs/PerpTest.html, Year?
[33] A. van der Hoek, R. Hall, D. Heimbigner, and A. Wolf, "Software Release Management," Proc. European Software Eng. Conf., M. Jazayeri and H. Schauer, eds., pp. 159-175, 1997.
[34] C. Yilmaz, M.B. Cohen, and A. Porter, "Covering Arrays for Efficient Fault Characterization in Complex Configuration Spaces," Proc. Int'l Symp. Software Testing and Analysis, pp. 45-54, July 2004.
[35] C. Yilmaz, A. Porter, and A. Schmidt, "Distributed Continuous Quality Assurance: The Skoll Project," Proc. Workshop Remote Analysis and Monitoring Software Systems, pp. 16-19, 2003.

Sebastian Elbaum received the PhD and MS degrees in computer science from the University of Idaho and a degree in systems engineering from the Universidad Catolica de Cordoba, Argentina. He is an assistant professor in the Department of Computer Science and Engineering at the University of Nebraska at Lincoln. He is a member of the Laboratory for Empirically based Software Quality Research and Development (ESQuaReD). He is a coeditor for the International Software Technology Journal and a member of the editorial board of the Software Quality Journal. He has served on the program committees for the International Conference on Software Engineering, International Symposium on Testing and Analysis, International Symposium on Software Reliability Engineering, International Conference on Software Maintenance, and the International Symposium on Empirical Software Engineering. His research interests include software testing, analysis, and maintenance with an empirical focus. He is a member of the IEEE and ACM.

Madeline Diep received the BS degree in 2001 and the MS degree in 2004 in computer science from the University of Nebraska at Lincoln (UNL). She is currently a PhD student in computer science at UNL, where she is a member of the Laboratory for Empirically based Software Quality Research and Development (ESQuaReD). Her research interests include software testing and verification, with emphasis on profiling released software, software analysis, and empirical studies.

