
Experiments with test case prioritization using relevant slices


Available online at www.sciencedirect.com

www.elsevier.com/locate/jss

The Journal of Systems and Software 81 (2008) 196–221

Experiments with test case prioritization using relevant slices

Dennis Jeffrey, Neelam Gupta *

Department of Computer Science, The University of Arizona, Tucson, AZ 85721, USA

Available online 25 May 2007

Abstract

Software testing and retesting occur continuously during the software development lifecycle to detect errors as early as possible and to gain confidence that changes to software do not introduce defects. Once developed, test suites are reused and updated frequently, and their sizes grow as software evolves. Due to time and resource constraints, an important goal during regression testing of software is to prioritize the execution of test cases in a suite so as to improve the chances of increasing the rate of fault detection. Prior techniques for test case prioritization are based on the total number of coverage requirements exercised by the test cases. In this paper, we present a new approach to prioritize test cases that takes into account the coverage requirements present in the relevant slices of the outputs of test cases. We have implemented three different heuristics based on our relevant slicing based approach to prioritize test cases and conducted experiments to compare the effectiveness of our techniques with those of the traditional techniques that only account for the total requirement coverage. Our detailed experimental study and results provide interesting insights into the effectiveness of using relevant slices for test case prioritization in terms of the ability to achieve a high rate of fault detection.
Published by Elsevier Inc.

Keywords: Test case prioritization; Relevant slices based model; Test suite management; Experimental studies

1. Introduction

Software testing is an important and expensive stage of software development. As software changes over time, test suites are developed and used to test the modified software to make sure that changes do not affect the existing functionality in unintended ways, and to test for new functionality. This process is called regression testing. Due to time and resource constraints, it may not be possible to continually execute all the tests in suites on every testing iteration. It is therefore important to prioritize (order) the execution of test cases in test suites so as to improve the chances of detecting more faults earlier in the testing process. Techniques for test case prioritization address this problem. In this paper, we present a new approach for prioritizing the execution of existing test cases with the goal of early detection of faults in the regression testing process.

0164-1212/$ - see front matter Published by Elsevier Inc.

doi:10.1016/j.jss.2007.05.006

* Corresponding author. E-mail addresses: [email protected] (D. Jeffrey), [email protected] (N. Gupta).

Several techniques (Elbaum et al., 2002; Rothermel et al., 2001; Srivastava and Thiagarajan, 2002) have been developed to prioritize the execution of existing test cases to expose faults early during the regression testing process. The technique developed in Srivastava and Thiagarajan (2002) takes into consideration the impacted blocks, i.e., the set of blocks that have been changed between the old and the new version of software. A heuristic is used to predict the impacted blocks that will be covered by each test case. The test cases are then prioritized in order of decreasing number of impacted blocks covered by the test case. Other test case prioritization techniques studied in Elbaum et al. (2002), Rothermel et al. (2001) are primarily based on variations of total requirement coverage and additional requirement coverage. For instance, total statement coverage prioritization orders test cases in decreasing order of the number of statements they exercise. Additional statement coverage prioritization orders test cases in decreasing order of the number of additional statements they exercise that have not yet been covered by the tests earlier in the prioritized sequence. The experimental studies reported in

¹ In order to compute the APFD measure for the execution of a prioritized test suite, we plot the percentage of faults detected so far (y-axis) versus the fraction of the test suite executed (x-axis). The area under the curve interpolating the points in this plot is the APFD measure. More area under the curve (a higher APFD value) indicates that more faults were detected when a smaller fraction of the prioritized suite was executed (more faults were detected earlier on during execution of the tests in the suite).


Elbaum et al. (2002), Rothermel et al. (2001) are based only on the information collected during the execution of the test cases for the previous version of software. Unlike the technique in Srivastava and Thiagarajan (2002), these techniques do not take the modifications made to the software into account while prioritizing the tests.
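The total-coverage and additional-coverage orderings described above can be sketched as follows. This is a simplified sketch: the function and variable names are ours, and it assumes the common convention of resetting the covered set when no remaining test adds new coverage.

```python
def prioritize_total(tests, covered):
    """Total coverage: order tests by decreasing number of covered requirements."""
    return sorted(tests, key=lambda t: len(covered[t]), reverse=True)

def prioritize_additional(tests, covered):
    """Additional coverage: repeatedly pick the test that adds the most
    not-yet-covered requirements; reset coverage once nothing new remains."""
    remaining, ordered, seen = list(tests), [], set()
    while remaining:
        best = max(remaining, key=lambda t: len(covered[t] - seen))
        if not covered[best] - seen:  # no remaining test adds new coverage
            seen = set()
            best = max(remaining, key=lambda t: len(covered[t] - seen))
        ordered.append(best)
        remaining.remove(best)
        seen |= covered[best]
    return ordered
```

Note that the two orderings can differ: a test with modest total coverage may rank high under additional coverage if its requirements are not yet covered by earlier tests.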

Another related topic is that of test case selection (Agrawal et al., 1993; Harrold et al., 1992; Rothermel and Harrold, 1997) for regression testing. A regression test selection technique chooses, from an existing regression test suite, a subset of test cases that are considered necessary to validate modified software. The approach in Harrold et al. (1992) uses static program slicing to detect definition-use associations that are affected by a program change. The work in Rothermel and Harrold (1997) constructs control flow graphs for a procedure or program and its modified version, and uses these graphs to select test cases, from the regression test suite, that execute changed code. In Agrawal et al. (1993), execution slice based, dynamic slice based and relevant slice based approaches are proposed to determine the test cases in the regression test suite on which the new and old programs may produce different outputs.

The regression test selection problem addressed in the above work (Agrawal et al., 1993; Harrold et al., 1992; Rothermel and Harrold, 1997) differs from the work on test case prioritization (Elbaum et al., 2002; Rothermel et al., 2001; Srivastava and Thiagarajan, 2002) in the following important aspect. The goal in test case prioritization is to order the execution of test cases in the regression test suite so as to execute, early on during regression testing, those test cases whose output is more likely to change. Such a prioritization is expected to help in early detection of faults during regression testing. However, the techniques in Elbaum et al. (2002), Rothermel et al. (2001) order the test cases based only on their total/additional coverage of requirements such as statements and branches; they do not take into consideration the statements or branches that actually influenced, or could potentially influence, the values of the program output. Neither do they take into consideration whether or not a test case traverses a modified statement while prioritizing the test cases.

It is intuitive to expect that the output of a test case that executes a larger number of statements that actually influence the output, or have the potential to influence the output, is more likely to be affected by a modification than the output of a test covering fewer such statements. Additionally, tests exercising modified statements should have higher priority than tests that do not traverse any modifications. In Jeffrey and Gupta (2006), we presented a new approach for prioritizing test cases that is based not only on total statement (branch) coverage, but that also takes into account the number of statements (branches) executed that influence, or have the potential to influence, the output produced by the test case. The set of statements that influence, or have the potential to influence, the output of a program when run on a particular test case corresponds to the relevant slice (Agrawal et al., 1993; Gyimothy et al., 1999; Korel and Laski, 1991) computed on the output of the program when executed by the test case. Our approach is based on the following observation: if a modification in the program is to affect the output of a test case in the regression test suite, it must affect some computation in the relevant slice of the output for that test case. Therefore, in Jeffrey and Gupta (2006), we presented a heuristic for prioritizing test cases that assigns a higher weight to a test case with a larger number of statements (branches) in its relevant slice of the output. In this paper, we propose two new techniques based on the above approach. We have implemented our techniques and performed a detailed experimental study using the programs from the Siemens suite (Hutchins et al., 1994) to evaluate the effectiveness of our techniques in early detection of faults during regression testing. Our results show improvements over prior approaches based on total statement (branch) coverage and provide interesting insights into the use of the size of relevant slices of outputs in determining the relative importance of test cases in detecting a change in the output during regression testing.

The remainder of this paper is organized as follows. The background for our work is explained in Section 2. Our approach and its motivation are explained in Section 3. Our detailed experimental study with the three heuristics based on using relevant slices for test case prioritization, along with results and analysis, is given in Section 4. Related work is discussed in Section 5, and our conclusions are given in Section 6.

2. Background

The test case prioritization problem as defined in Rothermel et al. (2001) is as follows.

2.1. Problem statement

Given a test suite T, the set PT consisting of all the permutations of test cases in T, and a function f from PT to the set of real numbers, find a T′ ∈ PT such that (∀T″)(T″ ≠ T′)[f(T′) ≥ f(T″)].

The specific focus of this paper and that of Rothermel et al. (2001) is to find an ordering of test cases T′ of a suite T with the goal of increasing the likelihood of revealing faults earlier in the testing process. A function f that quantifies the rate of fault detection for a test suite is the weighted "average percentage of faults detected" (APFD) value¹ computed during the execution of a test suite, as defined in Rothermel et al. (2001).
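The APFD measure has a standard closed form in terms of the position at which each fault is first detected, which is equivalent to the area-under-curve description above. A sketch follows; the function and variable names are ours, and we assume every fault is exposed by some test in the ordering.

```python
def apfd(ordering, faults_exposed):
    """Average Percentage of Faults Detected for a prioritized suite.
    ordering: test ids in prioritized execution order.
    faults_exposed: dict test id -> set of faults that test exposes.
    Assumes every fault is exposed by at least one test in the ordering.
    """
    n = len(ordering)
    all_faults = set().union(*faults_exposed.values())
    m = len(all_faults)
    # TF_f: 1-based position of the first test in the ordering exposing fault f
    first_pos = {}
    for pos, t in enumerate(ordering, start=1):
        for f in faults_exposed.get(t, ()):
            first_pos.setdefault(f, pos)
    return 1.0 - sum(first_pos[f] for f in all_faults) / (n * m) + 1.0 / (2 * n)
```

For instance, a suite whose first test exposes every fault attains the maximum possible APFD of 1 - 1/(2n).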

Fig. 1. An example program. Given input (a,b,c) = (1,5,4), the shaded lines indicate the relevant slice for output variable w at line 9; the darker-shaded lines distinguish those statements contained in the relevant slice that are not also contained in the dynamic slice.


The problem of test case prioritization as stated above is an optimization problem and has exponential complexity in the worst case. In Elbaum et al. (2002), Rothermel et al. (2001), the presented techniques are actually greedy heuristics in which test cases are ordered based upon their requirement coverage (such as the total number of statements or branches exercised). In their experimental studies in Elbaum et al. (2002), the Siemens suite (Hutchins et al., 1994) programs were used to evaluate the effectiveness of the above heuristics. Each program in the Siemens suite is associated with a set of faulty versions (each containing a seeded error) and a pool of test cases. Thus, the effectiveness of the above test case prioritization heuristics in early execution of test cases that affect the output was measured by computing the rate at which the seeded faults (program modifications) were exposed by the test cases in the prioritized sequence. This rate was quantitatively measured by computing the APFD values for the prioritized test suites.

We use the notion of a relevant slice (Agrawal et al., 1993; Gyimothy et al., 1999; Korel and Laski, 1991) to identify the set of statements or branches that either influence, or have the potential to influence, the value of the output produced by the test case.

2.2. Definition

Given a program P and its execution trace for a test case t, a statement s in the execution trace is in the relevant slice of the output if (1) at least one of the outputs produced by the execution of t is directly or indirectly data or control dependent upon s; or (2) s is either a predicate, or a data-dependency of a predicate, that may affect at least one output produced by the execution of t if the predicate evaluates to a different outcome.

Essentially, for a statement to be in the relevant slice, it must have either affected, or have the potential to affect (Agrawal et al., 1993; Gyimothy et al., 1999), the output produced by a test case.

3. Our approach

Intuitively, if a test case exercises a modified statement, and that modified statement is also included in the relevant slice of the output, then the output of the test case will most likely be affected by the modification. This is because every statement in the relevant slice either affects the output or has the potential to affect the output of the test case. Some modifications may nevertheless leave the output unchanged, such as when a modified value is incremented and subsequently decremented before it is used again. To illustrate this intuition, consider the example program given in Fig. 1, and suppose the input to the program is a test case T defined as (a,b,c) = (1,5,4).

The execution trace of the program for test T has branch B1 evaluating to false and branch B3 subsequently evaluating to true, and therefore the program outputs the value 1.

Notice in the figure that the shaded lines (both the lightly-shaded and the darker-shaded lines) indicate the relevant slice of output variable w at line 9. The darker-shaded lines indicate those statements that are in the relevant slice, but not also in the dynamic slice, of variable w at line 9. Branch B1 is not contained in the dynamic slice because the value of w at line 9 is neither directly nor indirectly data or control dependent on B1 for test T. However, B1 is contained in the relevant slice because, although it did not affect the value of the output in this case, it could potentially affect the output if the branch outcome had evaluated to true instead of false (because the true block of branch B1 contains a potential definition of variable w). Similarly, since B1 is data-dependent upon the definition of x in line 3 (but no other statements are data-dependent upon x defined in line 3), line 3 is also contained in the relevant slice but not in the dynamic slice (i.e., line 3 did not actually affect, but has the potential to affect, the value of variable w at line 9).

Now suppose that the statement at line 3 gets modified by mistake so that it changes from x := a + 1 into x := b + 1. Since line 3 is exercised and is also included in the relevant slice of output variable w at line 9, we intuitively expect that the error will be exposed by test case T. Indeed, with this modification it turns out that all three branches are exercised and evaluate to true. As a result, the value 2 is outputted (instead of the original value 1) and the erroneous modification is exposed. This example highlights the fact that the statement at line 3 certainly does have the potential to affect the program output.

While we might generally expect that an exercised modification that is also included in the relevant slice will affect the output of a program, there do exist cases in which exercising a modification that is contained in the relevant slice will actually not change the output value(s) of a program, and therefore may not lead to a fault being exposed. Consider the case in Fig. 1 in which line 3 is again modified, but now it changes from x := a + 1 into y := a + 1. With this modification, the execution actually follows the original path through the code in which B1 evaluates to false and B3 subsequently evaluates to true, and therefore the original computed value for w, 1, is outputted. This demonstrates that it is not always true that exercising a modified statement that has the potential to influence program output will actually cause the output value(s) to change. This implies that there is no guarantee that a test case exercising more modified statements that are contained in the relevant slice will necessarily expose more faults (or change the output) than another test case exercising fewer modifications that are in the relevant slice. Nevertheless, in practice we might expect a higher likelihood that exercising a modification that has the potential to affect program output will actually cause program output to change and therefore expose a fault.

Let us next consider the case in which a modified statement is not contained in the relevant slice. From Fig. 1, notice that line 5 is not contained in the relevant slice. This is because branch B2 is the only statement in the program that is data-dependent upon the definition of z in line 5, but branch B2 is not even exercised by our test T (thus, line 5 is not included in the relevant slice). Suppose that line 5 is changed from z := c + 1 into z := b + 1. With this modification, the execution trace follows the original path of B1 evaluating to false and B3 evaluating to true, and the output is not affected (since the modification affected the value of the definition which is not used in the relevant slice). A test case exercising a modified statement that is not contained in the relevant slice will likely change the program output only when the variable being defined by the modified statement is changed to another variable which is used in the relevant slice (e.g., by modifying the variable in the left-hand side of an assignment). To see this, suppose that line 5 is changed from z := c + 1 into y := c + 1. This actually causes the executed path to change: now, although B1 still evaluates to false, B3 subsequently evaluates to false too. This causes the outputted value of w to be 0, which is different from the original outputted value of 1. However, we believe this case (in which the left-hand side of an assignment is modified) to be the only case in which an exercised modification will be outside the relevant slice, yet may still affect the program output.

The above example motivates the following approach. While considering whether a test case is likely to change the output when executed on the modified program, we need to take into consideration the following factors.

1. The number of statements (branches) in the relevant slice of the output for the test case, because any modification must necessarily affect some computation in the relevant slice to be able to change the output for this test case.

2. The number of statements that are executed by the test case but are not in the relevant slice of the output, because changing the variable on the lhs of an assignment not in the relevant slice may affect a computation in the relevant slice and thus may change the output.

Based on the above factors, we propose the following heuristic for test case prioritization. We order the test cases in decreasing order of test case weight, where the weight for a test is determined as follows: test case weight = (# of requirements present in the relevant slice) + (total # of requirements exercised by the test case). Ties are broken arbitrarily. This criterion essentially gives "single" weight to those exercised requirements that are outside the relevant slice, and "double" weight to those exercised requirements that are contained in the relevant slice. We call this the "REG + OI + POI" technique for prioritization, where REG denotes the REGular statements (branches) executed by the test case, OI denotes the Output-Influencing statements (branches), and POI denotes the Potentially Output-Influencing statements (branches) executed by the test case.
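A minimal sketch of this weighting scheme follows. The names are ours: covered[t] is the set of requirements (statements or branches) exercised by test t, and relevant[t] is the subset of covered[t] lying in the relevant slice of t's output.

```python
def prioritize_reg_oi_poi(tests, covered, relevant):
    """REG + OI + POI: weight = (# requirements in the relevant slice)
    + (total # requirements exercised). Requirements inside the relevant
    slice are thus counted twice; ties are broken arbitrarily (stable sort).
    """
    weight = lambda t: len(relevant[t]) + len(covered[t])
    return sorted(tests, key=weight, reverse=True)
```

Note that a test whose coverage lies mostly inside the relevant slice can outrank a test with greater total coverage but a smaller relevant slice.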

The above technique does not take into account the statements that have been modified in the new version when prioritizing test cases. It is intuitive to expect that a test case that executes a larger number of program modifications is likely to expose more faults. Guided by this intuition and by building on the relevant slice based approach, we define two additional techniques ("GRP_REG + OI + POI" and "MOD * (REG + OI + POI)") for test case prioritization that take the number of modifications traversed by a test case into account while prioritizing test cases. In the next section, we discuss all three heuristics in detail and present detailed experiments and their analysis for evaluating the effectiveness of the above three heuristics for test case prioritization.

4. Experimental study

4.1. Experimental setup

We used the Siemens programs described in Table 1 as our subject programs. Each program is associated with a set of faulty versions (each containing a seeded error) and a pool of test cases. The subject programs and their associated faulty versions and test pools were obtained from the Software-artifact Infrastructure Repository.

The goal of our experiments is to see how well our prioritized suites for the heuristics proposed in Section 3 perform in terms of rate of fault detection. Our experimental approach is to generate test suites, prioritize their test cases according to various prioritization approaches, and then measure the resulting APFD value for each prioritized suite. Similar to the experimental setup in Elbaum et al. (2002), Rothermel et al. (2001), we generated 1000 branch-coverage adequate test suites from the provided test pools for each program. To generate each suite, we randomly selected test cases from the associated test case pool and added them to the suite as long as they increased the cumulative branch coverage of the suite. As soon as the suite achieved branch-coverage adequacy (defined as having the same branch coverage achieved by the entire test case pool), the generated suite was complete. Every test case within a particular generated suite is unique, but two different suites may share the same test case. Using this approach, average suite sizes ranged from just under 6 tests per suite for tcas to just over 18 tests per suite for replace. We refer to these branch-adequate suites as the "smaller suites."

Table 1
Siemens suite of programs

Program name   Lines of code   # Faulty versions   Test case pool size   Program description
tcas           138             41                  1608                  Altitude separation
totinfo        346             23                  1052                  Info accumulator
sched          299              8                  2650                  Priority scheduler
sched2         297             10                  2710                  Priority scheduler
ptok           402              7                  4130                  Lexical analyzer
ptok2          483             10                  4115                  Lexical analyzer
replace        516             32                  5542                  Pattern substituter

Table 2
For each subject program, the average time (in seconds) required to compute the relevant slices based upon the program output for each test case in the associated test case pool

Program name   tcas    totinfo   sched   sched2   ptok    ptok2   replace
Time (s)       0.002   0.311     0.765   0.934    1.099   0.302   1.435

² To measure fault detection, we measured the set of faults exposed by each test case by executing every test case on each faulty version of the corresponding subject program, and compared the output of these faulty versions to the output generated when the test was run on the base version (the "oracle") of the program. If the base version output differed from some faulty version output when both versions were run on a particular test case, this indicated that a fault was exposed.
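The smaller-suite generation procedure described above might be sketched as follows. This is a simplified sketch with hypothetical names; the actual experimental tooling differs.

```python
import random

def generate_adequate_suite(pool, branches_covered, rng=random.Random(0)):
    """Generate a branch-coverage adequate suite (sketch).
    pool: list of test ids; branches_covered[t]: set of branches test t covers.
    A randomly drawn test is kept only if it adds new branch coverage;
    generation stops once the suite covers every branch the whole pool covers.
    """
    target = set().union(*(branches_covered[t] for t in pool))
    suite, covered = [], set()
    candidates = pool[:]
    rng.shuffle(candidates)  # random selection order from the pool
    for t in candidates:
        new = branches_covered[t] - covered
        if new:
            suite.append(t)
            covered |= new
        if covered == target:
            break
    return suite
```

Because only tests that add new coverage are kept, every test case within a generated suite is unique, while independent runs (different random orders) may share test cases across suites.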

To see how results might change on larger test suites, we also generated 1000 "larger" suites for each program as follows. We randomly selected test cases into the suite, as long as they increased the cumulative branch coverage of the suite and exercised a new path not exercised by any other test case already in the suite, until branch-coverage adequacy was achieved. Then, we selected any additional tests as necessary in the same way for all-uses coverage, until all-uses adequacy was also achieved in the suite. Finally, we repeated all of this a second time to get another branch-adequate and all-uses adequate suite (ensuring still that no selected test case exercised the same path as a previously-selected test case), and we unioned those two suites together to obtain a single "larger" suite. Using this approach, average suite sizes ranged from 8 tests per suite for tcas to just under 60 tests per suite for replace. Program tcas contains no loops, and therefore there are a limited number of distinct paths through that program; this limits the generated suite sizes for tcas, since every test case in a suite exercises a unique path. We refer to these branch-adequate and all-uses adequate (times 2) suites as the "larger suites."

For each test case, we measured the total set of exercised statements and branches using instrumented versions of the subject programs as generated by the Aristotle program analysis tool (Harrold and Rothermel, 1997). For each test we also computed the set of statements and branches in the relevant slice based on the program output generated when executing the test case on the given subject program. To compute a relevant slice, we first computed the statements in the dynamic slice of the program output by computing the transitive closure of the data and control dependencies of the output exercised by the test case. Then, we augmented the dynamic slice with the additional branches/statements that were potentially output-influencing, along with their corresponding data dependencies, to obtain the relevant slices (this would normally involve static analysis, but we approximated it by examining the execution traces of all the tests in the large test pools provided with the Siemens programs to see whether changing a branch outcome would affect the program output or not).
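The transitive-closure step used to compute the dynamic slice can be sketched as follows. The representation is our own simplification: deps maps each executed statement to the statements it is directly data- or control-dependent upon.

```python
def dynamic_slice(output_stmts, deps):
    """Dynamic slice of the program output as the transitive closure of
    direct data and control dependencies (simplified sketch).
    output_stmts: statements producing the output.
    deps: dict statement -> iterable of statements it directly depends on.
    """
    worklist = list(output_stmts)
    sliced = set(output_stmts)
    while worklist:
        s = worklist.pop()
        for d in deps.get(s, ()):
            if d not in sliced:
                sliced.add(d)
                worklist.append(d)
    return sliced
```

The relevant slice would then be obtained by augmenting this set with potentially output-influencing predicates and their data dependencies, as described above.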

Computing the relevant slices based upon the program output for each test case in the Siemens test case pools was relatively quick on average for each test case, as indicated by the results shown in Table 2.

From this table, it can be seen that relevant slice computation times ranged from 0.002 s on average per test case for program tcas, to nearly 1.5 s on average per test case for replace. Moreover, the actual times to prioritize the suites according to the different prioritization approaches, given the relevant slicing information, never exceeded 5 ms (0.005 s) on average for any test suite (for both the smaller suites and the larger suites). Thus, although the time to compute relevant slices for each test case was rather small (never more than a few seconds on average), the total time required to process each test suite (measuring required coverage information and relevant slicing information and then prioritizing the suite) was dominated by the time required to compute relevant slices. Since the Siemens subject programs are relatively small compared to other "real-world" applications, relevant slicing times will increase for larger subject programs. However, prior work by Zhang and Gupta (2004) has described an algorithm for quickly computing dynamic slices that is based upon a dynamic dependence graph representation that is highly compact and rapidly traversable. With this slicing algorithm, the authors were able to conduct experiments in which slicing times were reduced from 4.69-25.21 min down to 1.74-36.25 s, for program executions on the order of 67 to 220 million executed statements. This approach can also be applied to help reduce the time to compute relevant slices. Thus, we believe the time required to compute relevant slices does not limit the practicality of our approach, even when larger subject programs are considered.

Given the sets of regular exercised statements and branches along with the corresponding relevant slices for each test case, we used this information to prioritize the test cases according to various approaches, and afterwards computed the APFD value for each prioritized suite in order to evaluate the suite's rate of fault detection.2 We examined

Table 4
Computed t values and the corresponding confidence with which the null hypothesis can be rejected when comparing the APFD values of suites prioritized using REG and REG + OI + POI, for the ''smaller suites''

Program    Statement                                  Branch
name       t value   % conf. to reject null           t value   % conf. to reject null
tcas       2.49      >95.0%                           4.68      >99.9%
totinfo    6.23      >99.9%                           7.58      >99.9%
sched      2.53      >95.0%                           2.57      >95.0%
sched2     2.45      >95.0%                           3.19      >99.5%
ptok       -3.40     >99.9%                           -0.04     <80.0%
ptok2      11.26     >99.9%                           0.49      <80.0%
replace    9.70      >99.9%                           3.83      >99.9%

D. Jeffrey, N. Gupta / The Journal of Systems and Software 81 (2008) 196–221 201

the types of errors introduced in the faulty versions of each subject program and identified six distinct categories of seeded errors: (1) changing the operator in an expression, (2) changing an operand in an expression, (3) changing the value of a constant, (4) removing code, (5) adding code, and (6) changing the logical behavior of the code (usually involving multiple other categories of error types simultaneously in a single faulty version). Thus, the faulty versions used in our experiments cover a wide variety of fault types.

We conducted experiments with each of the three heuristics proposed previously, using both the smaller and larger suites. For comparison, we also prioritized the tests using the approach in Elbaum et al. (2002), Rothermel et al. (2001) that only accounts for total requirement coverage, ordering the test cases in decreasing order of the number of exercised statements/branches covered by each test case (ties are broken arbitrarily). We call this the ''REG'' approach.
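As a concrete sketch of the REG baseline, the ordering rule above can be rendered minimally as follows (ties are broken by original position here, which is one arbitrary choice):

```python
def prioritize_reg(coverage):
    """coverage: dict mapping test id -> set of exercised requirements
    (statements or branches). Returns test ids in decreasing order of
    total coverage; Python's stable sort breaks ties by insertion order,
    which serves as the arbitrary tie-break."""
    return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)

# t2 covers the most requirements, so it runs first:
print(prioritize_reg({'t1': {1, 2}, 't2': {1, 2, 3, 4}, 't3': {5}}))
# ['t2', 't1', 't3']
```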

4.2. Heuristic I: prioritization using REG + OI + POI

Table 3 gives the APFD results computed from the prioritized suites for the REG + OI + POI approach, on average across all 1000 smaller suites for each subject program. For comparison, the table gives the average APFD results when prioritizing the suites according to the REG approach. Additionally, the table lists the average suite sizes among the suites generated for each subject program. Notice that the larger subject programs (such as the two ptok programs and replace) also have larger average suite sizes, since more tests are required to make these suites branch-coverage adequate.
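The APFD values reported throughout these tables follow the standard weighted-average metric of Rothermel et al.: for n tests and m faults, APFD = 1 - (TF1 + ... + TFm)/(n*m) + 1/(2n), where TFi is the position of the first test that exposes fault i. A sketch:

```python
def apfd(order, detects):
    """order: prioritized list of test ids; detects: dict mapping each
    fault to the set of tests that expose it (all assumed to appear in
    order). Higher APFD means faster fault detection."""
    n, m = len(order), len(detects)
    pos = {t: i + 1 for i, t in enumerate(order)}  # 1-based positions
    tf = sum(min(pos[t] for t in tests) for tests in detects.values())
    return 1.0 - tf / (n * m) + 1.0 / (2 * n)

# Two tests, two faults, each fault caught by a different test:
print(apfd(['t1', 't2'], {'f1': {'t1'}, 'f2': {'t2'}}))  # 0.5
```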

From this table, it can be seen that in all cases except for ptok-statement and ptok-branch, the REG + OI + POI prioritized suites consistently result in improved average APFD values as compared to their REG approach counterparts. For ptok-statement, the results are very slightly degraded on average using the REG + OI + POI approach as compared to REG, while for ptok-branch, the average results are identical for the two prioritization approaches. In the remaining cases, while the amount of average percentage improvement is relatively small (less than 1% improvement on average), we will show shortly that far more suites are improved than made worse in terms of APFD value when using the REG + OI + POI approach, as compared to the REG approach.

Table 3
Average suite sizes and APFD values for each subject program using the ''smaller suites'', when prioritizing according to the REG approach and the REG + OI + POI approach

Program    Suite    Statement                      Branch
name       size     REG      REG+OI+POI            REG      REG+OI+POI
tcas       5.71     76.73    77.11                 76.58    77.26
totinfo    7.30     73.38    74.09                 76.59    77.23
sched      7.31     56.24    56.64                 56.26    56.47
sched2     8.01     60.23    60.55                 62.66    63.00
ptok       15.76    76.40    76.28                 75.61    75.61
ptok2      11.77    76.25    76.79                 75.86    75.88
replace    18.63    73.07    73.43                 73.34    73.46

Experiments were conducted with respect to both statement and branch coverage.

To determine whether the improvement in average APFD results when using the REG + OI + POI approach over the REG approach is statistically significant, we conducted a t-test for paired observations3 (Snedecor and Cochran, 1967). For each of the 1000 smaller test suites, we created the pair (X, Y), where X is the APFD value of the suite prioritized using REG and Y is the APFD value of the suite prioritized using REG + OI + POI. We considered the null hypothesis that there is no difference in the mean APFD value of the REG and REG + OI + POI prioritized suites. Table 4 shows the resulting t values computed for our t-test, along with the percentage confidence with which we may reject the null hypothesis. We used as reference a table of critical values presented in Snedecor and Cochran (1967). Note that the larger the magnitude of the computed t value, the greater confidence we have in rejecting the null hypothesis. For our 999 degrees of freedom, it turns out that for t values greater than about 1.96, we can reject the null hypothesis with over 95% confidence (for t values greater than about 3.3, with over 99.9% confidence). Thus, from the table we see that in all cases where the average APFD value was improved when using REG + OI + POI over REG, the amount of average improvement is statistically significant, except for ptok2-branch, for which we have less than 80% confidence in rejecting the null hypothesis. Note also that for the ptok program, in which APFD

3 Also called a paired t-test, this is a statistical method for determining whether there is a statistically significant difference between the means of two populations, given samples where observations from one sample can be naturally paired with observations from the other. The procedure is to formulate a null hypothesis that assumes the population means are identical, then compute a t value from the paired data samples, which is referenced in a corresponding table of critical values to determine the confidence with which we may reject the null hypothesis.
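The t statistic described in this footnote reduces to the mean of the pairwise differences divided by their standard error; a self-contained sketch (determining the confidence level still requires a table of critical values, as noted above):

```python
import math

def paired_t(xs, ys):
    """t statistic for paired observations, with len(xs) - 1 degrees of
    freedom: t = mean(d) / (s_d / sqrt(n)), where d_i = y_i - x_i."""
    d = [y - x for x, y in zip(xs, ys)]
    n = len(d)
    mean = sum(d) / n
    var = sum((di - mean) ** 2 for di in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical APFD pairs (REG, REG+OI+POI) for four suites:
print(paired_t([1, 2, 3, 4], [2, 3, 4, 6]))  # 5.0
```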


results were degraded when using REG + OI + POI, the amount of degradation was not statistically significant in the case of ptok-branch.

Average APFD results for the larger suites are reported in Table 5. The trends in the data for the larger suites are similar to those of the smaller suites, with the exception that results using the larger suites are also slightly degraded on average using the REG + OI + POI approach for programs sched-branch and ptok2-branch.

Table 6 shows the results of conducting a t-test for paired observations comparing the APFD values of the larger test suites prioritized using REG and REG + OI + POI. We can see that for tcas-statement, we do not have strong evidence to justify rejecting the null hypothesis, and therefore the average APFD improvement in this case is not statistically significant. Further, for the case of ptok-branch, in which results were degraded when using REG + OI + POI, we have less than 95% confidence that the amount of average degradation is statistically significant. In all other cases, the amount of improvement (or degradation) when using REG + OI + POI over REG is statistically significant, as we can reject the null hypothesis with over 99.9% confidence.

The reason REG + OI + POI is able to show consistent average improvement over REG in nearly all cases is that REG + OI + POI also takes relevant slicing information into account instead of just the total coverage information of tests. We expect that the statements/branches inside the relevant slice are those that have potential to

Table 5
Similar results as Table 3, but these are for the ''larger suites''

Program    Suite    Statement                      Branch
name       size     REG      REG+OI+POI            REG      REG+OI+POI
tcas       8.00     73.95    74.06                 73.95    74.54
totinfo    16.36    82.82    83.22                 87.06    87.18
sched      18.22    66.39    67.02                 66.05    65.85
sched2     20.87    76.54    77.07                 78.32    78.98
ptok       30.63    83.49    83.31                 82.71    82.68
ptok2      41.59    93.44    93.51                 93.51    93.45
replace    58.86    76.11    76.31                 76.30    76.34

Table 6
Computed t values and the corresponding confidence with which the null hypothesis can be rejected when comparing the APFD values of suites prioritized using REG and REG + OI + POI, for the ''larger suites''

Program    Statement                                  Branch
name       t value   % conf. to reject null           t value   % conf. to reject null
tcas       0.91      <80.0%                           6.22      >99.9%
totinfo    6.95      >99.9%                           3.31      >99.9%
sched      6.23      >99.9%                           -4.65     >99.9%
sched2     7.50      >99.9%                           11.88     >99.9%
ptok       -8.31     >99.9%                           -1.73     <95.0%
ptok2      4.73      >99.9%                           -4.94     >99.9%
replace    16.14     >99.9%                           3.60      >99.9%

influence program output (except for the case of modifying the left-hand side of an assignment, in which case a relevant slice may not contain that statement), so a test exercising more of these OI/POI requirements should have higher weight. The fact that our results show near-consistent average improvement when going from the REG approach to the REG + OI + POI approach is promising for the potential of our approach, and shows that there may be something to be gained by accounting for OI and POI coverage during test case prioritization.
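The exact weighting used by REG + OI + POI is defined earlier in the paper and is not reproduced in this section; the sketch below assumes the simplest reading consistent with the discussion above, ranking primarily by the number of exercised requirements that are OI or POI (i.e., that fall in the test's relevant slice), with total coverage as a tie-breaker. This is an assumption, not the authors' exact formula.

```python
def prioritize_reg_oi_poi(coverage, rel_slice):
    """coverage: test -> exercised requirements; rel_slice: test -> the
    requirements in that test's relevant slice (OI or POI). Assumed
    ranking: OI/POI count first, then total coverage."""
    def weight(t):
        return (len(coverage[t] & rel_slice[t]), len(coverage[t]))
    return sorted(coverage, key=weight, reverse=True)

# 'b' exercises fewer requirements overall, but more of them are OI/POI:
print(prioritize_reg_oi_poi({'a': {1, 2, 3}, 'b': {1, 2}},
                            {'a': {1}, 'b': {1, 2}}))  # ['b', 'a']
```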

An observation we have made about the ptok program is that, relative to the other Siemens programs, the test cases used in the smaller suites with ptok have the highest percentage of exercised requirements that are also OI or POI. On average, we would expect the majority of exercised statements and branches in a program to also have potential to influence the output, and this is indicated by the results in Table 7.

However, in this table ptok happens to have the highest percentage of OI/POI statements/branches among all the subject programs. In particular, for statement coverage, ptok has about 4% more statements in the relevant slice per test than the next-closest programs, totinfo and ptok2. This may help to explain why results were slightly degraded on average for ptok in the smaller suites: in this case, relatively few statements are outside the relevant slices as compared to the other programs, and so there may be less opportunity to witness the benefits of accounting for relevant slicing information, since almost all exercised requirements are contained in the relevant slice anyway for program ptok. This is likely the same reason why results for ptok are also degraded in the larger suites. Additionally, the slightly degraded results for sched-branch and ptok2-branch in the larger suites may be explained by the fact that even in the small suites, these two programs involve test cases with quite high percentages of OI/POI requirements (over 88% for sched-branch and over 94% for ptok2-branch), and this is likely similar in the larger suites because tests were selected from the same test pool for both the smaller and larger suites. We expect that our approach would have more potential to show benefit in the cases where even more statements are able to be

Table 7
On average per test in a ''smaller suite'': the percentage of exercised statements/branches that are actually OI or POI, along with the total number of exercised statements/branches

Program    % OI + POI requirements    Total # exercised reqs
name       Stmt      Branch           Stmt      Branch
tcas       88.23     95.38            33.67     10.92
totinfo    90.71     89.65            84.01     45.89
sched      78.57     88.68            88.49     44.31
sched2     77.25     84.02            84.55     51.80
ptok       94.59     96.64            99.32     59.48
ptok2      90.60     94.10            103.09    83.01
replace    82.00     94.45            100.58    59.66

[Fig. 2 comprises seven plots, one per subject program (tcas, totinfo, sched, sched2, ptok, ptok2, replace); y-axis: APFD diff (REG+OI+POI) - (REG); x-axis: test suite.]

Fig. 2. The difference between the APFD values of suites prioritized using REG + OI + POI, and the corresponding APFD values of the suites prioritized using REG, sorted by suite in the positive x direction in decreasing order of the amount of improvement of the REG + OI + POI approach (suites in which both REG + OI + POI and REG resulted in the same APFD value are not plotted). These plots are for prioritization when statement coverage is taken into account, using the ''smaller suites''.


identified as not POI, and may then be excluded from the relevant slices.

To gain insight into how much benefit our approach provides for suites individually, we present the plots in Figs. 2 and 3 (for the smaller suites) and Figs. 4 and 5 (for the larger suites). These figures illustrate the benefit of the REG + OI + POI approach over the REG approach in terms of promoting improved APFD values for prioritized suites. Specifically, for each plot, we consider every test suite for the given program in which the REG + OI + POI approach and the REG approach resulted in different computed APFD values (suites with unchanged APFD values between the two approaches are ignored). Then, for each APFD-changed suite, we compute the difference between the REG + OI + POI APFD value and the REG APFD value, to obtain a measure of the improvement of the REG + OI + POI approach (negative differences occur when REG + OI + POI performs worse than REG). Finally, we order these suites in decreasing order of the amount of benefit of the REG + OI + POI approach, and then plot the data.4

4 In a graph, each unit along the x-axis represents a suite, and the y-axis represents the percentage difference in APFD values for a suite between the REG + OI + POI and the REG approaches. Each graph also includes the line y = 0, because the point at which the curve crosses this axis, as one moves in the positive x direction, is the point at which suites transition from being better to being worse using the REG + OI + POI prioritization approach. Also, each graph includes a vertical line at the midpoint along the horizontal axis to show the median suite in the ordered sequence of suites along the x-axis.

Figs. 2 and 4 show the results when suites are prioritized with respect to statement coverage, and Figs. 3 and 5 show the results when prioritization is carried out with respect to branch coverage. From the smaller suite plots, we can see that in all cases except for ptok-statement, the curve crosses the line y = 0 to the right of the vertical line occurring at the median suite. This implies that in these cases, there are more suites with improved APFD values than worsened APFD values when using the REG + OI + POI approach. It is interesting to note that for ptok-branch with the smaller suites, average APFD values are identical for REG and REG + OI + POI, even though there are slightly more suites with improved APFD values in this case (the curve crosses the line y = 0 to the right of the vertical line). From the larger suite plots, we can see that the curve crosses the line y = 0 to the right of the vertical line in all cases except for ptok (statement and branch), sched-branch, and ptok2-branch. This is consistent with the results previously seen in the corresponding Table 5 of average APFD values for the larger suites.

Notice further that the area underneath the curve both above and below the line y = 0 can be respectively thought of as a measure of the benefit and the degradation experienced when using the REG + OI + POI approach. The plots seem to suggest that the area under the curve above the line y = 0 is generally greater than the area under the curve below the line y = 0, in the cases when the curve crosses the line y = 0 to the right of the vertical line occurring at the median suite. To show this, we present the results in Tables 8 and 9.

[Fig. 3 comprises seven plots, one per subject program (tcas, totinfo, sched, sched2, ptok, ptok2, replace); y-axis: APFD diff (REG+OI+POI) - (REG); x-axis: test suite.]

Fig. 3. This figure is similar to Fig. 2, except the plots here are for prioritization when branch coverage is taken into account.


Table 8 is meant to accompany Figs. 2 and 3, and Table 9 is meant to accompany Figs. 4 and 5. In these tables, we present values computed for the area5 under the curve of each plot, including the areas both above and below the line y = 0. From the data in these tables, it is clear that, as the plots suggested, the area under the curve above the line y = 0 is greater than the corresponding area under the curve below the line y = 0 in most cases where the curve crosses the line y = 0 to the right of the median vertical line. The exception is the case of ptok-branch in Table 8, in which there is very slightly more area under the curve below the line y = 0, even though the curve crosses y = 0 to the right of the median vertical line in the corresponding Fig. 3. The columns of the tables labeled ''Ratio above:below'' show the ratio of the area above y = 0 to the area below y = 0. In most cases, this ratio is greater than 1. Moreover, in 8 cases the ratio is greater than 2 (in two such cases the ratio is greater than 4, and in another case the ratio is greater than 6), indicating that for these particular plots, there is more than twice the amount of area above the line y = 0 as below it. These results suggest that the cases in which REG + OI + POI leads to an improvement over REG significantly outnumber the cases in which degraded results are witnessed.

In Tables 10 and 11, we complement our graphs by showing the percentage of suites with improved (higher), worsened (lower), and same APFD results when using the REG + OI + POI approach as opposed to the REG approach.

5 The area under each curve was computed in an approximate manner by summing up the areas of individual rectangles below each curve, where the width of each rectangle is one unit (one suite) and the height of each rectangle is the (absolute value of the) APFD difference for that suite.
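The rectangle summation in this footnote amounts to summing the absolute APFD differences on each side of y = 0; a small sketch of that tally:

```python
def curve_areas(diffs):
    """diffs: per-suite APFD differences (REG+OI+POI minus REG), one unit
    of width each. Returns (area above y=0, area below y=0, ratio)."""
    above = sum(d for d in diffs if d > 0)
    below = sum(-d for d in diffs if d < 0)
    return above, below, (above / below if below else float('inf'))

print(curve_areas([3.0, -1.0, 2.0, -1.0]))  # (5.0, 2.0, 2.5)
```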

From Table 10, we can see that in all cases except for ptok-statement, more of the smaller suites were improved by the REG + OI + POI approach than were made worse, in terms of APFD value. We see from Table 11 that more of the larger suites were improved than worsened in all cases except for ptok (statement and branch), sched-branch, and ptok2-branch. These results are consistent with the earlier plots and average APFD results when using the REG + OI + POI approach. However, it is true that in most cases the majority of suites had the same APFD value when both the REG and REG + OI + POI approaches were used (although for the larger suites, there were fewer suites with the ''same'' APFD value than for the smaller suites). Many ''same'' suites can result because, even though the specific priority values computed for each test case will likely vary between the two approaches, the relative ordering of the priorities between the tests may often still be unchanged and would therefore lead to the same prioritized suites. Since larger suites contain more test cases, there is more opportunity for the relative ordering of tests to change in the larger suites than in the smaller suites, which explains why fewer of the larger suites were unchanged as compared to the smaller suites. Despite the number of cases in which suites have the same APFD value, the potential benefits of our approach are evident because significantly more suites are improved than made worse when output influences and potential output influences are taken into account during prioritization.
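The better/worse/same percentages in Tables 10 and 11 can be tallied directly from the paired APFD values; a sketch:

```python
def better_worse_same(pairs):
    """pairs: (REG apfd, REG+OI+POI apfd) per suite; returns the
    percentage of suites improved, worsened, and unchanged."""
    n = len(pairs)
    better = sum(1 for x, y in pairs if y > x)
    worse = sum(1 for x, y in pairs if y < x)
    return 100 * better / n, 100 * worse / n, 100 * (n - better - worse) / n

print(better_worse_same([(1, 2), (2, 1), (3, 3), (0, 0)]))  # (25.0, 25.0, 50.0)
```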

An observation about our experiments above is that there is no consideration taken for distinguishing between tests that actually do or do not exercise modified

[Fig. 4 comprises seven plots, one per subject program (tcas, totinfo, sched, sched2, ptok, ptok2, replace); y-axis: APFD diff (REG+OI+POI) - (REG); x-axis: test suite.]

Fig. 4. Similar to Fig. 2, but this is for the ''larger suites''.

[Fig. 5 comprises seven plots, one per subject program (tcas, totinfo, sched, sched2, ptok, ptok2, replace); y-axis: APFD diff (REG+OI+POI) - (REG); x-axis: test suite.]

Fig. 5. Similar to Fig. 3, but this is for the ''larger suites''.


statements. Tests that may cover large portions of code and yet do not actually traverse a modified statement may therefore get higher priority even though they will not be able to expose new faults. Since in regression testing, information may be available about the modifications made in the subsequent version of software, this information should be taken into account during prioritization as well.

4.3. Heuristic II: prioritization using GRP_REG + OI + POI

Tests that do not traverse any modifications should be pushed back to the end of the prioritized sequences since they cannot expose any new faults in the software. To see how our results may change when we take the factor of modifications exercised by test cases into account, we

Table 10
The percentage of suites for each subject program such that the REG + OI + POI approach yielded better, worse, and same APFD values as opposed to the REG approach, for both the statement and branch prioritizations, using the ''smaller suites''

Program    Statement                     Branch
name       Better   Worse   Same         Better   Worse   Same
tcas       25.8     19.8    54.4         27.1     16.7    56.2
totinfo    27.2     16.6    56.2         26.5     10.9    62.6
sched      19.0     14.7    66.3         10.9     6.9     82.2
sched2     9.4      7.0     83.6         7.3      3.1     89.6
ptok       7.3      12.3    80.4         12.3     11.5    76.2
ptok2      23.3     5.8     70.9         12.3     10.2    77.5
replace    37.7     17.9    44.4         27.0     17.8    55.2

Table 8
Areas under the curves of the plots in Figs. 2 and 3, both above the line y = 0 and below it, along with the corresponding ratio values

Program    Statement                               Branch
name       Above y=0   Below y=0   Ratio           Above y=0   Below y=0   Ratio
tcas       1563.26     1178.07     1.33            1635.96     951.73      1.72
totinfo    1287.07     578.56      2.22            967.96      327.51      2.96
sched      1446.00     1044.22     1.38            605.02      394.12      1.54
sched2     875.55      557.27      1.57            640.38      299.07      2.14
ptok       169.35      294.62      0.57            268.64      270.13      0.99
ptok2      653.20      108.19      6.04            287.45      266.51      1.08
replace    541.00      185.14      2.92            305.87      192.20      1.59

Table 9
Areas under the curves of the plots in Figs. 4 and 5, both above the line y = 0 and below it, along with the corresponding ratio values

Program    Statement                               Branch
name       Above y=0   Below y=0   Ratio           Above y=0   Below y=0   Ratio
tcas       1259.34     1151.31     1.09            1195.36     605.53      1.97
totinfo    806.38      409.41      1.97            383.07      270.38      1.42
sched      1438.26     811.35      1.77            260.92      464.94      0.56
sched2     974.87      444.23      2.19            853.92      195.63      4.36
ptok       100.49      278.19      0.36            109.52      138.81      0.79
ptok2      179.80      107.43      1.67            75.49       134.38      0.56
replace    261.45      65.09       4.02            128.82      93.71       1.37

Table 11
Similar results as in Table 10, but these are for the ''larger suites''

Program    Statement                     Branch
name       Better   Worse   Same         Better   Worse   Same
tcas       37.2     27.3    35.5         34.5     18.9    46.6
totinfo    43.8     30.8    25.4         34.0     24.6    41.4
sched      47.2     28.7    24.1         14.7     20.9    64.4
sched2     40.4     18.1    41.5         35.8     10.3    53.9
ptok       10.9     26.8    62.3         12.9     14.3    72.8
ptok2      28.1     20.7    51.2         16.5     24.5    59.0
replace    65.5     25.7    8.8          47.6     35.0    17.4

Table 12
Average suite sizes and APFD values for each subject program using the ''smaller suites'', when prioritizing according to the REG approach and the GRP_REG + OI + POI approach

Program    Suite    Statement                        Branch
name       size     REG      GRP_REG+OI+POI          REG      GRP_REG+OI+POI
tcas       5.71     76.73    77.11                   76.58    77.26
totinfo    7.30     73.38    74.09                   76.59    77.23
sched      7.31     56.24    56.64                   56.26    56.47
sched2     8.01     60.23    60.55                   62.66    63.00
ptok       15.76    76.40    76.28                   75.61    75.61
ptok2      11.77    76.25    76.87                   75.86    75.96
replace    18.63    73.07    73.43                   73.34    73.46

Experiments were conducted with respect to both statement and branch coverage.


grouped the test cases for each suite into two groups: (1) those that exercise at least one modification, and (2) those that do not exercise any modifications. The tests within each group were then prioritized according to our original REG + OI + POI criterion, and all tests in group (1) were put earlier into the prioritization than the tests in group (2). We call this approach the GRP_REG + OI + POI approach. We simply checked the statement number modified by the seeded fault in the program and then checked whether the execution trace of the test case traversed that statement when it was executed on the correct version of the program. Note that if none of the statements covered by a test case are modified, it cannot change the output unless the program modifications introduce some new code on the path traversed by the test case.

There were only three modifications in which new statements were added among all the seeded errors in the Siemens suite programs; for these added statements, no tests were considered to be modification-traversing. So, after grouping the test cases into the above two sets, we put the set of test cases traversing modifications ahead of those not traversing modifications in the prioritized sequence. Next, we applied our REG + OI + POI heuristic to order the test cases within each set.
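The grouping step can be sketched as follows; the within-group ranking reuses an assumed REG + OI + POI weight (the authors' exact formula is defined earlier in the paper, so the weight below is illustrative only):

```python
def prioritize_grp(coverage, rel_slice, modified):
    """Tests covering at least one modified statement are placed first;
    each group is then ordered by the assumed REG+OI+POI weight."""
    def weight(t):
        return (len(coverage[t] & rel_slice[t]), len(coverage[t]))
    hits = [t for t in coverage if coverage[t] & modified]
    misses = [t for t in coverage if not coverage[t] & modified]
    return (sorted(hits, key=weight, reverse=True)
            + sorted(misses, key=weight, reverse=True))

# Only 'a' traverses the modified statement 1, so it leads the order:
print(prioritize_grp({'a': {1}, 'b': {2, 3}, 'c': {2}},
                     {'a': {1}, 'b': {2, 3}, 'c': {2}}, {1}))
# ['a', 'b', 'c']
```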

Average APFD results for this approach are shown in Table 12 for the smaller suites and in Table 13 for the larger suites.

It turns out the results for GRP_REG + OI + POI are almost identical to the results for REG + OI + POI, except that we noticed significant improvement by this approach over the APFD values computed by REG using the smaller suites for the ptok2 program. For program ptok2 only, the smaller test suites on average contained between 2 and 5 test cases that did not exercise any modification. For the other programs, the modifications were such that almost all test cases in the test suites exercised

Table 13
Similar results as Table 12, but these are for the ''larger suites''

Program    Suite    Statement                        Branch
name       size     REG      GRP_REG+OI+POI          REG      GRP_REG+OI+POI
tcas       8.00     73.95    74.06                   73.95    74.54
totinfo    16.36    82.82    83.22                   87.06    87.18
sched      18.22    66.39    67.02                   66.05    65.85
sched2     20.87    76.54    77.07                   78.32    78.98
ptok       30.63    83.49    83.31                   82.71    82.68
ptok2      41.59    93.44    93.51                   93.51    93.45
replace    58.86    76.11    76.31                   76.30    76.34

Table 14
Computed t values and the corresponding confidence with which the null hypothesis can be rejected when comparing the APFD values of suites prioritized using REG and GRP_REG + OI + POI, for the ''smaller suites''

Program    Statement                                  Branch
name       t value   % conf. to reject null           t value   % conf. to reject null
tcas       2.49      >95.0%                           4.68      >99.9%
totinfo    6.23      >99.9%                           7.58      >99.9%
sched      2.53      >95.0%                           2.57      >95.0%
sched2     2.45      >95.0%                           3.19      >99.5%
ptok       -3.40     >99.9%                           -0.04     <80.0%
ptok2      12.34     >99.9%                           2.20      >95.0%
replace    9.70      >99.9%                           3.83      >99.9%

Table 15
Computed t values and the corresponding confidence with which the null hypothesis can be rejected when comparing the APFD values of suites prioritized using REG and GRP_REG + OI + POI, for the ''larger suites''

Program    Statement                                  Branch
name       t value   % conf. to reject null           t value   % conf. to reject null
tcas       0.91      <80.0%                           6.22      >99.9%
totinfo    6.95      >99.9%                           3.31      >99.9%
sched      6.29      >99.9%                           -4.65     >99.9%
sched2     7.50      >99.9%                           11.88     >99.9%
ptok       -8.31     >99.9%                           -1.73     <95.0%
ptok2      4.74      >99.9%                           -4.93     >99.9%
replace    16.14     >99.9%                           3.60      >99.9%


the modifications, and therefore the average APFD values did not change much with the above variant of the basic approach. In fact, the average APFD values for the larger suites are identical for GRP_REG + OI + POI and REG + OI + POI. This is likely because larger suites imply a greater likelihood of exposing all faults when a smaller fraction of the suite has been exercised (evidenced by the higher APFD values of the larger suites in most cases as compared to the smaller suites). Thus, there are more opportunities for some tests to be moved to the back of the prioritizations of the larger suites that in fact do not influence the APFD value of the suite, due to other earlier tests in the prioritization that happen to already expose all faults that are exposed by the suite anyway.

Therefore, our particular experimental subjects had little opportunity to show the benefits of the GRP_REG + OI + POI approach (few to none of the tests could be pushed back into the lower-priority group (2)). Indeed, for two of the programs (tcas and sched), there were no test cases that did not exercise any modifications. In practice, with larger programs in which the modifications are localized to a particular module or small group of modules in a smaller fraction of the code, we believe that many more test cases will be identified as not traversing any modifications, and therefore our GRP_REG + OI + POI approach will have much more opportunity in these situations to show improvement in terms of the rate of fault detection of prioritized suites.

Table 14 shows the results of conducting a t test for paired observations comparing the APFD values of the smaller test suites prioritized using REG and GRP_REG + OI + POI. Table 15 shows the same, but for the larger suites. The values shown in Table 14 (for the smaller suites) comparing REG to GRP_REG + OI + POI are identical to those shown in the corresponding Table 4 comparing REG to REG + OI + POI, except for program ptok2. In this case, the computed t values have increased for both statements and branches, and thus the improvement in the APFD values when using GRP_REG + OI + POI over REG is statistically significant for both ptok2-statement and ptok2-branch (greater than 95% confidence in rejecting the null hypothesis). The values shown in Table 15 (for the larger suites) are also identical to those shown in the corresponding Table 6, but the t values have changed very slightly in the cases of sched-statement, ptok2-statement, and ptok2-branch. However, these slight changes in t values have not altered any percentages with which the null hypothesis can be rejected.
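The paired t test used in these comparisons computes, over the per-suite APFD differences, the statistic t = d̄ / (s_d / √n). A minimal standard-library sketch is below; the authors' exact procedure (e.g., one- vs. two-tailed confidence lookup) may differ in details:

```python
import math
import statistics

def paired_t(xs, ys):
    """t statistic for paired observations: the mean of the pairwise
    differences divided by the standard error of those differences."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean_d = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample (n - 1) standard deviation
    return mean_d / (sd / math.sqrt(n))
```

The resulting t value is then compared against the t distribution with n − 1 degrees of freedom to obtain the confidence levels reported in the tables.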

Similar to Figs. 2 and 3, the plots in Figs. 6 and 7 show the benefit of using GRP_REG + OI + POI over REG with the smaller suites. The only noticeable difference between the two sets of plots occurs for program ptok2, in which the curve crosses the line y = 0 a little further to the right along the x-axis in the GRP_REG + OI + POI plots as opposed to the REG + OI + POI plots (indicating that more suites were improved for ptok2 using GRP_REG + OI + POI). Similar plots are shown in Figs. 8 and 9 for the larger suites, but these appear identical to the corresponding Figs. 4 and 5 from when using the REG + OI + POI approach.

To accompany Figs. 6 and 7, we present Table 16, in which we show the area both above and below the line y = 0 for each plot. Again, the results are nearly identical to those given in Table 8. The only exception is for program ptok2, in which for both statements and branches, the area under the curve above y = 0 has increased, the area below y = 0 has decreased, and therefore the ratio of the area above to the area below has increased.
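The area figures in these tables can be approximated from the sorted per-suite APFD differences. We do not know the exact integration rule the authors used; a simple unit-width Riemann-sum sketch is:

```python
def areas_about_zero(diffs):
    """Split the area between the APFD-difference curve and y = 0 into the
    part above the line and the part below it. Each suite contributes a
    unit-width column of height |diff|."""
    above = sum(d for d in diffs if d > 0)
    below = -sum(d for d in diffs if d < 0)
    ratio = above / below if below else float("inf")
    return above, below, ratio
```

A ratio greater than 1 then indicates that, in aggregate, the prioritization being evaluated improved more APFD "mass" than it degraded.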

To accompany Figs. 8 and 9 for the larger suites, we present Table 17. Here, when compared to the corresponding results of the larger suites when using the REG + OI + POI approach, there is a slight improvement in the ratio value in the case of ptok2-statement. However, this slight difference in ratio value did not translate into a changed average value, as was seen in Table 13, in which the average APFD values for the larger suites using GRP_REG + OI + POI were identical to those when using REG + OI + POI on the larger suites.

[Figure: ‘‘The Benefit of GRP_REG+OI+POI Over REG: Statement Coverage’’ — seven panels (tcas, totinfo, sched, sched2, ptok, ptok2, replace) plotting APFD diff: (GRP_REG+OI+POI)-(REG) against Test Suite.]

Fig. 6. This figure is similar to Fig. 2, except the plots here compare the GRP_REG + OI + POI approach to the REG approach.

[Figure: ‘‘The Benefit of GRP_REG+OI+POI Over REG: Branch Coverage’’ — seven panels (tcas, totinfo, sched, sched2, ptok, ptok2, replace) plotting APFD diff: (GRP_REG+OI+POI)-(REG) against Test Suite.]

Fig. 7. This figure is similar to Fig. 6, except the plots here are for prioritization when branch coverage is taken into account.

208 D. Jeffrey, N. Gupta / The Journal of Systems and Software 81 (2008) 196–221

[Figure: ‘‘The Benefit of GRP_REG+OI+POI Over REG: Statement Coverage’’ (larger suites) — seven panels (tcas, totinfo, sched, sched2, ptok, ptok2, replace) plotting APFD diff: (GRP_REG+OI+POI)-(REG) against Test Suite.]

Fig. 8. Similar to Fig. 6, but this is for the ‘‘larger suites’’.

[Figure: ‘‘The Benefit of GRP_REG+OI+POI Over REG: Branch Coverage’’ (larger suites) — seven panels (tcas, totinfo, sched, sched2, ptok, ptok2, replace) plotting APFD diff: (GRP_REG+OI+POI)-(REG) against Test Suite.]

Fig. 9. Similar to Fig. 7, but this is for the ‘‘larger suites’’.

Finally, we present Tables 18 and 19 to show the percentage of suites that were improved, worsened, and the same in terms of APFD value when comparing the GRP_REG + OI + POI approach to the REG approach. The results are similar to those given in the corresponding Tables 10 and 11 (using the REG + OI + POI approach). The only difference is for program ptok2 in Table 18 (using the smaller suites). For this program, as compared to the results using the REG + OI + POI approach, the number of improved suites has increased, the number of degraded suites has decreased, and at the same time the number of suites with the same APFD score has also decreased, when using the GRP_REG + OI + POI approach instead of the REG + OI + POI approach. In Table 19 (using the larger suites), results for GRP_REG + OI + POI are identical to those seen in Table 11 for REG + OI + POI.

Overall, our results show that there is potential benefit to be gained by accounting for the modifications traversed by test cases when prioritizing, because tests that may cover a large portion of the code, yet do not exercise any modifications and hence cannot expose any new errors, will get pushed to the end of the prioritized sequences to ensure that other tests that exercise at least one modification have higher priority. Our particular experiments on the rather small Siemens programs with relatively many modifications (such as tcas, which has fewer than 150 lines of code yet 41 different modifications) have provided little opportunity to demonstrate improvements using GRP_REG + OI + POI. However, our results for program ptok2, in which relatively more tests were able to be pushed back, illustrate the potential benefit of this approach. On much larger programs, where an arbitrary test case stands a relatively smaller chance of traversing a modification, the potential benefits of the GRP_REG + OI + POI approach will likely be much higher.

Table 16
Areas under the curves of the plots in Figs. 6 and 7, both above the line y = 0 and below it, along with the corresponding ratio values (GRP_REG + OI + POI over REG)

            -------- Statement --------    --------- Branch ----------
Program     Above     Below     Ratio      Above     Below     Ratio
tcas        1563.26   1178.07   1.33       1635.96   951.73    1.72
totinfo     1287.07   578.56    2.22       967.96    327.51    2.96
sched       1446.00   1044.22   1.38       605.02    394.12    1.54
sched2      875.55    557.27    1.57       640.38    299.07    2.14
ptok        169.35    294.62    0.57       268.64    270.13    0.99
ptok2       727.88    104.60    6.96       361.89    261.57    1.38
replace     541.00    185.14    2.92       305.87    192.20    1.59

‘‘Above’’ and ‘‘Below’’ denote the area above and below the line y = 0; ‘‘Ratio’’ is above:below.

Table 17
Areas under the curves of the plots in Figs. 8 and 9, both above the line y = 0 and below it, along with the corresponding ratio values (GRP_REG + OI + POI over REG)

            -------- Statement --------    --------- Branch ----------
Program     Above     Below     Ratio      Above     Below     Ratio
tcas        1259.34   1151.31   1.09       1195.36   605.53    1.97
totinfo     806.38    409.41    1.97       383.07    270.38    1.42
sched       1438.26   811.35    1.77       260.92    464.94    0.56
sched2      974.87    444.23    2.19       853.92    195.63    4.36
ptok        100.49    278.19    0.36       109.52    138.81    0.79
ptok2       180.03    107.43    1.68       75.49     134.14    0.56
replace     261.45    65.09     4.02       128.82    93.71     1.37

These results are for the ‘‘larger suites’’.

Table 18
The percentage of suites for each subject program such that the GRP_REG + OI + POI approach yielded better, worse, and same APFD values as opposed to the REG approach, for both the statement and branch prioritizations, using the ‘‘smaller suites’’ (% suites better/worse/same)

            -------- Statement --------    --------- Branch ----------
Program     Better    Worse     Same       Better    Worse     Same
tcas        25.8      19.8      54.4       27.1      16.7      56.2
totinfo     27.2      16.6      56.2       26.5      10.9      62.6
sched       19.0      14.7      66.3       10.9      6.9       82.2
sched2      9.4       7.0       83.6       7.3       3.1       89.6
ptok        7.3       12.3      80.4       12.3      11.5      76.2
ptok2       25.7      5.6       68.7       14.5      10.0      75.5
replace     37.7      17.9      44.4       27.0      17.8      55.2

Table 19
Similar results as in Table 18, but these are for the ‘‘larger suites’’ (GRP_REG + OI + POI over REG: % suites better/worse/same)

            -------- Statement --------    --------- Branch ----------
Program     Better    Worse     Same       Better    Worse     Same
tcas        37.2      27.3      35.5       34.5      18.9      46.6
totinfo     43.8      30.8      25.4       34.0      24.6      41.4
sched       47.2      28.7      24.1       14.7      20.9      64.4
sched2      40.4      18.1      41.5       35.8      10.3      53.9
ptok        10.9      26.8      62.3       12.9      14.3      72.8
ptok2       28.1      20.7      51.2       16.5      24.5      59.0
replace     65.5      25.7      8.8        47.6      35.0      17.4

When accounting for the modifications traversed by a test case, rather than simply determining whether a test case does or does not traverse at least one modification (as the GRP_REG + OI + POI approach does), it may be more effective for prioritization purposes to account for the number of modifications traversed by each test case. The intuition is that tests covering more modified statements will have a greater chance of exposing a fault and should therefore have higher priority. We analyze this idea next.

4.4. Heuristic III: prioritization using MOD * (REG + OI + POI)

Given our intuition that tests exercising more modifications should have higher priority, we prioritized suites following a new approach in which we used our original REG + OI + POI criterion, except that now we consider the REG + OI + POI value to be proportional to the number of modifications traversed by the given test case. Notice that under this approach, tests that do not cover any modifications are still pushed back to the end of the prioritized sequences (they cover 0 modifications, so their computed weights become 0). We refer to this approach as the MOD * (REG + OI + POI) approach, and the average APFD results for each subject program when prioritizing the smaller suites with this approach are provided in Table 20. Table 21 shows the results when the larger suites are prioritized using MOD * (REG + OI + POI).
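This weighting scheme can be sketched in a few lines. The per-test REG + OI + POI values and modification counts below are illustrative inputs, not the authors' implementation:

```python
def mod_prioritize(tests, reg_oi_poi, mods_traversed):
    """Order tests by decreasing MOD * (REG + OI + POI) weight. A test
    that traverses zero modifications gets weight 0 and falls to the
    back, regardless of how many requirements it covers."""
    weight = {t: mods_traversed[t] * reg_oi_poi[t] for t in tests}
    return sorted(tests, key=lambda t: weight[t], reverse=True)
```

For example, `mod_prioritize(['a', 'b', 'c'], {'a': 100, 'b': 10, 'c': 20}, {'a': 0, 'b': 2, 'c': 3})` yields `['c', 'b', 'a']`: test ‘a’ has the highest raw coverage value but traverses no modifications, so it drops to the end.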

From Table 20 for the smaller suites, we observe that as compared to the results for REG, the results using MOD * (REG + OI + POI) are improved in all cases except for totinfo-branch and replace (statement and branch), in which results were slightly degraded when using MOD * (REG + OI + POI). Indeed, in the cases where results are improved, the amount of improvement is always greater than that achieved by either the REG + OI + POI approach or the GRP_REG + OI + POI approach (except for the case of totinfo-statement, in which average results improve when using MOD * (REG + OI + POI), but not by as much as when using REG + OI + POI). Moreover, the amount of improvement can be relatively significant (between 2% and 5% improvement in certain cases for programs sched2, ptok, and ptok2).

Table 20
Average suite sizes and APFD values for each subject program using the ‘‘smaller suites’’, when prioritizing according to the REG approach and the MOD * (REG + OI + POI) approach

                      ------- Statement -------    -------- Branch ---------
Program     Suite     REG      MOD * (REG          REG      MOD * (REG
name        size               + OI + POI)                  + OI + POI)
tcas        5.71      76.73    78.11               76.58    78.38
totinfo     7.30      73.38    73.45               76.59    75.14
sched       7.31      56.24    57.29               56.26    57.47
sched2      8.01      60.23    65.42               62.66    65.26
ptok        15.76     76.40    77.99               75.61    77.62
ptok2       11.77     76.25    79.80               75.86    79.42
replace     18.63     73.07    72.15               73.34    72.21

Experiments were conducted with respect to both statement and branch coverage.

Table 21
Similar results as Table 20, but these are for the ‘‘larger suites’’

                      ------- Statement -------    -------- Branch ---------
Program     Suite     REG      MOD * (REG          REG      MOD * (REG
name        size               + OI + POI)                  + OI + POI)
tcas        8.00      73.95    76.33               73.95    76.38
totinfo     16.36     82.82    82.09               87.06    84.47
sched       18.22     66.39    67.86               66.05    67.50
sched2      20.87     76.54    80.43               78.32    80.12
ptok        30.63     83.49    84.19               82.71    83.84
ptok2       41.59     93.44    93.68               93.51    93.71
replace     58.86     76.11    75.49               76.30    75.49

Table 22
Computed t values and the corresponding confidence with which the null hypothesis can be rejected when comparing the APFD values of suites prioritized using REG and MOD * (REG + OI + POI), for the ‘‘smaller suites’’

            --------- Statement ---------    ---------- Branch -----------
Program     t value   % Confidence of        t value   % Confidence of
name                  rejecting null                   rejecting null
                      hypothesis                       hypothesis
tcas        8.32      >99.9%                 12.44     >99.9%
totinfo     0.49      <80.0%                 -11.15    >99.9%
sched       5.61      >99.9%                 7.76      >99.9%
sched2      16.13     >99.9%                 10.89     >99.9%
ptok        10.35     >99.9%                 13.05     >99.9%
ptok2       15.73     >99.9%                 16.05     >99.9%
replace     -10.05    >99.9%                 -12.37    >99.9%

Table 23
Computed t values and the corresponding confidence with which the null hypothesis can be rejected when comparing the APFD values of suites prioritized using REG and MOD * (REG + OI + POI), for the ‘‘larger suites’’

            --------- Statement ---------    ---------- Branch -----------
Program     t value   % Confidence of        t value   % Confidence of
name                  rejecting null                   rejecting null
                      hypothesis                       hypothesis
tcas        16.68     >99.9%                 18.05     >99.9%
totinfo     -7.95     >99.9%                 -26.37    >99.9%
sched       12.36     >99.9%                 16.48     >99.9%
sched2      18.09     >99.9%                 11.39     >99.9%
ptok        6.28      >99.9%                 10.09     >99.9%
ptok2       2.79      >99.0%                 2.45      >95.0%
replace     -15.54    >99.9%                 -20.02    >99.9%

From Table 21 for the larger suites, we see that the results using MOD * (REG + OI + POI) are improved in all cases except for totinfo (statement and branch) and replace (statement and branch). This trend is similar to that seen for the smaller suites, with the exception that now, results are degraded in the case of totinfo-statement. In the cases where results are improved when using MOD * (REG + OI + POI) instead of REG, the amount of average improvement is always greater than that achieved by either REG + OI + POI or GRP_REG + OI + POI.

Our results therefore suggest that taking into account the number of modifications traversed by tests, in addition to relevant slicing information, can lead to significantly improved results in certain cases when prioritizing tests.

Table 22 shows the results of conducting a t test for paired observations comparing the APFD values of the smaller test suites prioritized using REG and MOD * (REG + OI + POI). Table 23 shows the same, but for the larger suites. From Table 22 for the smaller suites, we see that the slight average APFD improvement witnessed in the case of totinfo-statement is not statistically significant. However, in all other cases in which results were improved (or degraded) when using MOD * (REG + OI + POI) over REG, the amount of improvement (or degradation) was statistically significant, such that we can reject the null hypothesis with greater than 99.9% confidence. From Table 23 for the larger suites, we can see that in all cases, the difference in APFD values when using MOD * (REG + OI + POI) as opposed to REG is statistically significant; we can reject the null hypothesis with at least 95% confidence in all cases.

We next present Figs. 10 and 11 for the smaller suites, and Figs. 12 and 13 for the larger suites. These are similar to the previous plots shown, except that now the plots compare the MOD * (REG + OI + POI) approach to the REG approach.

As expected, the plots illustrate the generally higher improvement of the MOD * (REG + OI + POI) approach over REG, as compared to the REG + OI + POI approach over REG, for most subject programs. The major exceptions are, as expected, programs totinfo and replace. Also, for program ptok2 with the larger-suite plots, the curve crosses the line y = 0 at about the median suite, indicating that about as many suites were improved as made worse for program ptok2 using the larger suites. However, it appears from the plots that there is still more area under the curve above the line y = 0 than below it for the ptok2 program in this case. More insight is gained from the computed areas under the curves as given in Tables 24 and 25.

[Figure: ‘‘The Benefit of MOD*(REG+OI+POI) Over REG: Statement Coverage’’ — seven panels (tcas, totinfo, sched, sched2, ptok, ptok2, replace) plotting APFD diff: (MOD*(REG+OI+POI))-(REG) against Test Suite.]

Fig. 10. This figure is similar to Fig. 6, except the plots here compare the MOD * (REG + OI + POI) approach to the REG approach.

[Figure: ‘‘The Benefit of MOD*(REG+OI+POI) Over REG: Branch Coverage’’ — seven panels (tcas, totinfo, sched, sched2, ptok, ptok2, replace) plotting APFD diff: (MOD*(REG+OI+POI))-(REG) against Test Suite.]

Fig. 11. This figure is similar to Fig. 10, except the plots here are for prioritization when branch coverage is taken into account.

[Figure: ‘‘The Benefit of MOD*(REG+OI+POI) Over REG: Statement Coverage’’ (larger suites) — seven panels (tcas, totinfo, sched, sched2, ptok, ptok2, replace) plotting APFD diff: (MOD*(REG+OI+POI))-(REG) against Test Suite.]

Fig. 12. Similar to Fig. 10, but this is for the ‘‘larger suites’’.

[Figure: ‘‘The Benefit of MOD*(REG+OI+POI) Over REG: Branch Coverage’’ (larger suites) — seven panels (tcas, totinfo, sched, sched2, ptok, ptok2, replace) plotting APFD diff: (MOD*(REG+OI+POI))-(REG) against Test Suite.]

Fig. 13. Similar to Fig. 11, but this is for the ‘‘larger suites’’.

From these tables, the general degradation of results for totinfo and replace using both the smaller and larger suites is evident from the ratio values that are less than one (except for totinfo-statement using the smaller suites, in which case the ratio is just slightly above 1). Also, for program ptok2 using the larger suites, we see that indeed there is significantly more area under the curve above the line y = 0 than below it. Notice that for the other subject programs in which results were improved, the amount of improvement is also very clear: in more than half of the improved cases, the computed ratio value is greater than or equal to 3! This is a significant improvement over the results obtained when using either the REG + OI + POI approach or the GRP_REG + OI + POI approach. A similar observation can be made from Tables 26 and 27, in which the subject programs showing improvement often had even more suites improved and fewer suites worsened than when using either the REG + OI + POI or the GRP_REG + OI + POI approaches. Notice also that the MOD * (REG + OI + POI) approach is more likely to lead to changed APFD results as compared to the REG approach, since the percentages of ‘‘same’’ suites in Table 26 are generally much lower than those listed in Table 10 (and similarly for Table 27 as compared to Table 11). This suggests that when the MOD * (REG + OI + POI) approach is able to alter

the actual ordering of tests in the prioritized suites as compared to REG, then there is a strong likelihood that the altered ordering is for the better when using the MOD * (REG + OI + POI) approach to achieve a high rate of fault detection.

Table 24
Areas under the curves of the plots in Figs. 10 and 11, both above the line y = 0 and below it, along with the corresponding ratio values (MOD * (REG + OI + POI) over REG)

            -------- Statement --------    --------- Branch ----------
Program     Above     Below     Ratio      Above     Below     Ratio
tcas        2127.15   742.17    2.87       2276.46   479.41    4.75
totinfo     1322.90   1254.18   1.05       563.51    2021.52   0.28
sched       2107.56   1057.55   1.99       1817.43   606.68    3.00
sched2      5934.94   746.33    7.95       3193.45   594.95    5.37
ptok        2497.98   905.74    2.76       2791.12   779.99    3.58
ptok2       4431.00   875.66    5.06       4380.45   818.90    5.35
replace     639.57    1564.84   0.41       569.27    1699.33   0.33

Table 25
Areas under the curves of the plots in Figs. 12 and 13, both above the line y = 0 and below it, along with the corresponding ratio values (MOD * (REG + OI + POI) over REG)

            -------- Statement --------    --------- Branch ----------
Program     Above     Below     Ratio      Above     Below     Ratio
tcas        3086.72   707.20    4.36       3060.25   625.98    4.89
totinfo     699.75    1428.81   0.49       230.85    2826.80   0.08
sched       2215.45   746.53    2.97       1761.54   308.15    5.72
sched2      4730.25   842.86    5.61       2689.10   896.57    3.00
ptok        1612.70   911.89    1.77       1894.37   771.11    2.46
ptok2       1161.14   917.93    1.26       1118.54   909.21    1.23
replace     241.48    863.75    0.28       196.51    1010.34   0.19

Table 26
The percentage of suites for each subject program such that the MOD * (REG + OI + POI) approach yielded better, worse, and same APFD values as opposed to the REG approach, for both the statement and branch prioritizations, when using the ‘‘smaller suites’’ (% suites better/worse/same)

            -------- Statement --------    --------- Branch ----------
Program     Better    Worse     Same       Better    Worse     Same
tcas        34.1      10.1      55.8       36.1      8.8       55.1
totinfo     27.1      27.2      45.7       18.9      37.2      43.9
sched       24.7      14.9      60.4       23.5      9.0       67.5
sched2      40.7      6.7       52.6       26.3      6.2       67.5
ptok        50.0      20.6      29.4       55.1      17.9      27.0
ptok2       60.3      22.5      17.2       60.8      22.0      17.2
replace     33.6      55.3      11.1       29.6      58.7      11.7

Table 27
Similar results as in Table 26, but these are for the ‘‘larger suites’’ (MOD * (REG + OI + POI) over REG: % suites better/worse/same)

            -------- Statement --------    --------- Branch ----------
Program     Better    Worse     Same       Better    Worse     Same
tcas        63.2      14.5      22.3       62.6      14.4      23.0
totinfo     32.8      54.4      12.8       17.7      74.4      7.9
sched       55.5      24.9      19.6       49.5      12.9      37.6
sched2      60.6      24.0      15.4       50.3      25.7      24.0
ptok        49.1      33.0      17.9       55.3      28.1      16.6
ptok2       45.8      45.1      9.1        44.5      46.5      9.0
replace     28.2      70.2      1.6        24.5      74.0      1.5

Table 28
Among all the test cases in the ‘‘smaller suites’’, the percentage of times in which a test case exercising a modification actually leads to a fault being exposed due to a change in program output

Program     % of Exercised modifications
name        that lead to exposed faults
tcas        5.41
totinfo     14.56
sched       10.31
sched2      5.63
ptok        8.33
ptok2       33.37
replace     3.85

To explain why results can sometimes be degraded when using the MOD * (REG + OI + POI) approach, we next analyze our intuition that ‘‘tests exercising more modifications are likely to expose more faults’’.

Contrary to our intuition, it turns out that for the Siemens subject programs, it is often not the case that traversing a modification will expose a fault in the software. Indeed, Table 28 shows that, at best, only about a third of the cases in which a modification is exercised by a test case actually lead to a fault being exposed. In some cases, the likelihood of exposing an error by traversing a modification is extremely small (such as for replace, in which the chance of exposing a fault when traversing a modification is less than 4% on average).

The reason why many exercised modifications do not lead to a fault being exposed is evident from examining some of the particular modifications introduced in the subject programs. For example, in one case for tcas, a statement of the form a = a > b has an operator change to become a = a >= b. However, this can only potentially cause the defined value for a to change when a happens to equal b. As another example, for program sched2, there is a statement of the form if (x = func()) return ERROR, which is changed simply to func(). In this case, the output can only be affected when the if evaluates to true, which in turn may only occur in erroneous executions. As a final example, with program replace, a statement of the form if (x && y) is changed to if (x), which will not have an effect on the program execution if y happens to have value true. It turns out that many of the Siemens modifications are such that output can be affected only in certain cases, as in the above examples.

The program tcas outputs only either an error message, or one of three possible output values. Programs sched and sched2 simulate the execution of processes in a system, but the output consists only of the processes in the order in which they finish in the system. Program replace outputs the same data that was input, with the exception that all occurrences of a specified string are replaced with a new string. Given the specific output of these programs, it is understandable that output values may be less likely to be altered when a modified statement is executed (for example, perhaps a modification adjusts the priority of a process in the sched program, but the relative order in which processes finish in the system may remain unchanged). This observation explains why expected improvements were not obtained by our approach when compared to the REG approach. When considering the relevant slicing information for test case prioritization, programs that are more computationally intensive (more ‘‘mathematical’’ in the sense that the output depends strongly upon each individual computation) may have greater potential to show benefit from our approach, since exercising a modification may more easily expose a fault. We now provide detailed analysis and insight into the results for each individual subject program, to explain why the results in Table 28 are as they are, and how this can influence the results we obtained when using the MOD * (REG + OI + POI) prioritization approach.
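The tcas operator change described above can be reproduced in a few lines to show how rarely traversal implies exposure. The values here are hypothetical; the actual tcas code differs:

```python
def original(a, b):
    return a > b

def mutated(a, b):
    return a >= b  # seeded change: > becomes >=

# The two versions disagree only when a == b; every test with a != b
# traverses the modified statement yet cannot expose the fault.
disagreements = [(a, b) for a in range(5) for b in range(5)
                 if original(a, b) != mutated(a, b)]
```

Over these 25 input pairs, only the 5 pairs with a equal to b expose the seeded fault, even though all 25 traverse the modified statement.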

(i) REPLACE: The program replace is a pattern matching system that replaces all occurrences of a specified ‘‘source’’ pattern with a specified ‘‘destination’’ pattern in some text. Notice that for both replace-statement and replace-branch, the average APFD results were slightly degraded when using MOD * (REG + OI + POI) as opposed to REG on both the smaller and larger suites. We observe that among all of our subject programs, replace on average has the least likelihood of leading to a fault being exposed when a modification is exercised by a test. This implies that, for the test cases of the replace program in particular, many tests are given higher priority because they traverse more modifications in the software, but there is actually only a relatively small chance that traversing these modifications will lead to exposed faults. In effect, many tests may be included earlier in the prioritized sequences using MOD * (REG + OI + POI) because they exercise more modifications, even though they may actually expose few to no faults, thus worsening the computed rate of fault detection of the suites as opposed to the REG approach, which does not account for the modifications traversed by tests. A closer analysis of the individual faulty versions associated with replace provides an explanation as to why this is happening.

Among the faulty versions for replace, there are two faulty versions in particular, versions v12 and v21, which include modifications that are traversed by tests far more often than for any of the other faulty versions. Studying these two particular faulty versions can shed some light on our results for program replace. Faulty version v12 in particular involves modified statements that are traversed by tests in over 10 times more situations than for all of the other faulty versions except v21. This faulty version involves changing the value of a defined constant ‘‘MAXPAT’’ from 100 to 50; it turns out this constant value is used in many of the statements in the program, and therefore this modification is considered traversed by tests in a very high number of cases. This implies that for many suites, a significant number of test cases get marked with higher priority due to traversing this particular seeded error. However, we found that in only 7.6% of the situations in which this modification is traversed is the fault actually exposed in the program output. This is because the constant ‘‘MAXPAT’’ defines a maximum pattern length, and as long as the source and destination patterns specified by a test case do not exceed 50 characters (or if the patterns exceed 100 characters), this seeded error will not be exposed by the test. Only if a pattern length exceeds 50 characters, but does not surpass 100 characters, will a fault potentially be exposed. As expected, inspection reveals that the majority of test cases for replace have quite small pattern lengths of less than 50 characters, so for the majority of test cases, it is not possible for faulty version v12 to be exposed.
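The v12 analysis above can be checked with a small sketch: with MAXPAT reduced from 100 to 50, behavior can diverge only for pattern lengths in the window (50, 100]. Beyond the two constants named in the text, the values here are hypothetical:

```python
MAXPAT_ORIGINAL, MAXPAT_V12 = 100, 50

def v12_can_expose(pattern_len):
    """The v12 fault can be exposed only when a pattern fits under the
    original limit but not under the reduced one, i.e. when
    50 < pattern_len <= 100."""
    fits_original = pattern_len <= MAXPAT_ORIGINAL
    fits_v12 = pattern_len <= MAXPAT_V12
    return fits_original != fits_v12  # behavior diverges only here
```

Since most replace test cases use patterns well under 50 characters, they traverse the modified statements while falling outside this exposure window, consistent with the 7.6% exposure rate reported above.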

216 D. Jeffrey, N. Gupta / The Journal of Systems and Software 81 (2008) 196–221

Another faulty version, v21, involves modified statements that are traversed by tests in over 15 times more situations than for all of the other faulty versions except v12. Faulty version v21 actually involves 4 changes to the software: (1) constant "MAXPAT" has its value changed from 100 to 99; (2) the "maximum size" second of three parameters to an "fgets" function call is decremented by one, from constant value 100 to value 99, when this function call is used to read a line of text from standard input representing the base text in which textual replacements of matching patterns may occur; (3) the >= operator in a condition is changed to the > operator, where the right-hand side of the conditional operator is a variable that happens to always contain the "MAXPAT" constant value; and (4) a compound condition with two predicates separated by a logical "and" has the second predicate removed. This particular faulty version, by far, has more modified statements that are more frequently traversed by tests than any other faulty version for program replace. The reason is that changing the constant "MAXPAT" in change (1) above, as was seen earlier concerning faulty version v12, actually changes many statements in the program that use that constant value, and on top of that, three additional statements in the program are changed (changes (2) through (4) above). Despite the very high number of times a modified statement is traversed by a test for this faulty version, this faulty version actually has the least likelihood of exposing a fault when one of the modified statements is traversed (namely, a mere 0.01% chance!). The reason this value is so low can be attributed to the semantics of changes (1)-(4) as listed above. Change (1), like faulty version v12, involves changing the value of constant "MAXPAT". However, here the value is merely changed from 100 to 99, and this may only possibly lead to an exposed fault if the length of a pattern is exactly 100 characters; otherwise the behavior of the program remains unchanged. Change (2) involves reading a line of text from standard input, and rather than reading at most 100 characters per line using "fgets", at most only 99 characters are read per line. This functionality occurs in a loop which continues reading lines until all input has been read. Certainly, if the input lines for a particular test are each no more than 99 characters long, then the behavior of the "fgets" function call remains unchanged and a fault will not be exposed. Indeed, we found that among 4397 distinct inputs that test cases provided for this call to function "fgets", only 39 of them contained at least one input line with 100 or more characters. So for nearly all test cases, exercising change (2) cannot lead to any exposed fault. Change (3) involves a conditional that checks whether a pattern length exceeds the maximum pattern length, and the change simply involves removing the = from a >= operator. Since "MAXPAT" in this faulty version is simply changed from 100 to 99, and since most test cases have pattern lengths that are significantly less than 99 characters, then again, this change will not affect the behavior of the program in most cases. Finally, change (4) involves removing the second predicate from a compound condition as follows:

BEFORE: if ((arg[i] == EOL) && (arg[i + 1] == delim))
AFTER:  if ((arg[i] == EOL))

In the above code, "EOL" happens to refer to the constant '$' character, and "delim" happens to refer to the null (string termination) character. The above code is also contained within a loop that examines every character in a specified pattern string. The modification, therefore, can only possibly lead to an exposed fault if a pattern string contains the '$' character, and if that character is not the last character in the pattern string before the terminating null character. We found that among all 5542 test cases for program replace, only 871 test cases contained a '$' character in either the source or destination pattern strings, and among those, merely 423 contained a '$' character that was also followed immediately by some character other than the terminating null character. In other words, in only a relatively small number of test cases can exercising change (4) possibly lead to an exposed fault.

Based on the above analysis for each of the four changes in faulty version v21, we can see why the chance of leading to an exposed fault when a modified statement in v21 is exercised is so small. Thus, in many situations a test case will be given higher priority due to traversing the changes for this particular faulty version, but the test will fail to expose the fault in virtually all cases. This can explain how our overall results using MOD * (REG + OI + POI) can be degraded as opposed to the results of REG when experimenting with program replace.

(ii) TOTINFO: Program totinfo is a mathematical data table totaler that computes statistics from a set of input data tables and outputs the results. Results for totinfo using MOD * (REG + OI + POI), similar to program replace, were generally degraded as opposed to the results when using the REG approach (except for the case of totinfo-statement using the smaller suites). We observe that this program also has a less-than-15% chance on average of leading to an exposed fault when a modification is traversed.

One particular faulty version for this program happens to involve modifications that are traversed by tests in over twice as many situations as for the other faulty versions. This particular faulty version, v10, involves changing the defined type of a variable "N" from "double" to "float" inside a function. This variable is then used in several statements within the function, and that is why there are relatively more situations where this modification is considered traversed by a test than for the other faulty versions. However, it turns out that in only 1.78% of cases does traversing this error lead to an exposed fault in the software output. The reason this percentage is so low is that the variable "N", whose precision is reduced from "double" to "float", is used during the computation of a value for another variable "info" which has "double" precision; the variable "info" is subsequently printed to standard output to only 2 decimal places of precision, and "info" is also used to compute another output value which is printed to only 4 decimal places of precision. Therefore, even though the precision of variable "N" is reduced in this faulty version, any slight changes to the computed value of the output variable "info" are often not propagated to the actual output of the program, since the program's output precision is limited in the print statement.

Since this particular faulty version is traversed by tests far more often than any of the other faulty versions for totinfo, but there is a very small chance that traversing these modified statements will lead to an exposed fault, many tests get higher priority in the totinfo suites due to this particular faulty version even though they are not providing any fault detection effectiveness with respect to this particular fault. This can explain how results can be degraded using the MOD * (REG + OI + POI) approach for totinfo. On the other hand, there is one other faulty version for this program which exposes the fault 100% of the times it is traversed by a test case. This faulty version, v1, involves removing a "goto" statement in the software. However, the total number of times this fault is traversed by a test is the least among all the faulty versions for program totinfo, and therefore the benefits of our MOD * (REG + OI + POI) approach are limited in this case.

(iii) TCAS: The program tcas and the remainder of the Siemens suite programs show improvement on average using the MOD * (REG + OI + POI) approach as opposed to the REG approach. Program tcas is the simplest of the Siemens suite, containing no loops and the fewest lines of code. It simply takes as input a sequence of 12 integers, performs a series of checks, and outputs either a static error message or else one of three possible output integers. For this program, there are actually no particular faulty versions that stand out among all others as being traversed by tests in the most number of situations. It is true that the average chance of exposing a fault for tcas when a modification is traversed is quite small (just over a 5% chance as given in Table 28). The reason this average percentage is small is that tcas has only four possible output values. As a result, multiple test cases that may exercise different situations in the program will still map to the same output value in the end. Therefore, there is a relatively higher likelihood (as compared to other programs whose output domain may be much larger or even infinite) that a test case traversing a modification in tcas will still be mapped in the end to the same output value. That is why, on average among the Siemens subjects, tcas has the second-lowest percentage chance of leading to an exposed fault when a modification is traversed. Despite this, it is also true that tcas has the most total faulty versions available among all the Siemens programs, and so there is still a relatively high total (not average) number of faulty versions for tcas that have a relatively higher chance of leading to an exposed fault when a modification is exercised. In particular, 12 of the 41 faulty versions have a greater-than-10% chance of a traversed modification leading to an exposed fault (12 is a higher number of faulty versions than the total number of faulty versions available for four of the other six Siemens programs!). This implies that many of the tcas faulty versions involve modifications that can alter the behavior of a test case to an extent sufficient to lead to a different output being produced.

As one example, faulty version v36 for tcas leads to a fault being exposed every time the modification is traversed by a test, and the total number of times this modification is traversed is about average as compared to the other faulty versions for this program. The reason this particular faulty version has a 100% chance of leading to an exposed fault when the modification is exercised is that the faulty version involves changing the value of an output constant from 2 to 1. Thus, for tcas there are still a significant number of cases in which a traversed modification will lead to an exposed fault (even though the average percentage of cases is relatively small), and therefore there are still some good opportunities for our MOD * (REG + OI + POI) approach to show potential benefit over the REG approach in terms of rate of fault detection for prioritized suites.

(iv) SCHED and (v) SCHED2: The two sched programs are process priority schedulers that simulate the scheduling of processes in a system and output the processes in the order in which they finish in the system. Both sched and sched2, like tcas, do not have any particular faulty versions that stand out as being traversed by tests in far more cases than any of the other faulty versions. However, there are some observations worth mentioning that help to explain the improvement in results using MOD * (REG + OI + POI) for these two experimental subjects.

For program sched, the two faulty versions with the most total number of situations in which the modification is traversed, versions v9 and v2, respectively have 10.96% and 13.49% chances of leading to an exposed fault when the modification is traversed. These percentages are also relatively higher than those for the other faulty versions of sched. Faulty version v9 involves the removal of a "+1" term in a condition:

BEFORE: if (argc < (MAXPRIO + 1))
AFTER:  if (argc < (MAXPRIO))

If the above condition evaluates to true, then the program terminates with an error message due to too few input arguments being specified. However, if at least the correct number of input arguments is specified (according to the base version of the program), then the above condition will evaluate to false both before and after the modification, and the program will proceed as normal even after the modification. By inspection of the test cases for sched, a large majority of the test cases do indeed specify the correct number of input arguments, and that is why only about 1 in 10 test cases are able to expose the above fault (even though every test case traverses the above fault). However, compared to the other faulty versions of sched except for version v2, version v9 still has a relatively higher chance of leading to an exposed fault, and it is traversed in the most number of situations.

Faulty version v2 involves moving a "+1" term from one statement to another:

BEFORE: count = block_queue->mem_count;
        n = (int)(count * ratio + 1);

AFTER:  count = block_queue->mem_count + 1;
        n = (int)(count * ratio);

In the above code, the defined variable "count" is only used in the subsequent computation of variable "n". Before the modification, "n" is defined to have value "(count * ratio) + 1", and after the modification, "n" is effectively defined to have value "(count + 1) * ratio". It turns out that both definitions of "n" lead to the same value (and so cannot lead to an exposed fault) if and only if "ratio" has value 1, or some similar value close to 1 that causes both definitions of "n" to be equal after the computed results are cast to an integer. We found that among the total times that a ratio value was specified in the program input, the ratio value was "close to 1" (defined as greater than 0 but less than 2) in 11,519 cases. By comparison, the ratio was not "close to 1" in only 854 cases. As a result, there is a relatively strong likelihood that traversing the above modification will not lead to an exposed fault, and that is why the fault was exposed in only 13.49% of the cases in which the modification was traversed. Nevertheless, among all the other faulty versions for sched, this is the one with the greatest likelihood of leading to an exposed fault, and it is traversed in the second-most number of situations.

Since many test cases traverse the above two faults, and each traversal stands a relatively high chance of leading to an exposed fault as compared to the other faulty versions, our MOD * (REG + OI + POI) approach has a good opportunity to show potential improvement with the sched program.

For program sched2, the faulty version with the greatest likelihood of leading to an exposed fault when the modification is traversed (a 15.82% chance) also happens to be the faulty version that is traversed in more situations than any of the other faulty versions for sched2. The faulty version in question is version v8, which involves completely removing the following code:

if (prio > MAXPRIO || prio < 0)
    return (BADPRIO);

Even though traversing the above modification has the greatest chance of leading to an exposed fault among all the faulty versions for sched2, there is still a less than 1-in-5 chance of exposing the fault. This is because the code snippet above checks for an error condition in which the specified priority value "prio" is out of bounds. Since this error condition occurs in relatively fewer test cases than when it does not occur, most test cases do not expose this fault when the modification is traversed (the specified "prio" value is correct in most cases). However, because the above faulty version happens to be traversed in more situations than any other faulty version for sched2, and the version is also the one with the highest likelihood of leading to an exposed fault when the modification is exercised, this can explain why results are so much improved for sched2 as compared to many of the other Siemens programs. Here, many tests are given higher priority due to traversing this modification, and relatively many of those tests do happen to expose the fault as well! This would explain the improved rates of fault detection witnessed for many suites when prioritizing according to the MOD * (REG + OI + POI) approach for program sched2.

(vi) PTOK and (vii) PTOK2: The two ptok programs are lexical analyzers that identify tokens from a stream of input characters and output each identified token. For program ptok, faulty version v6 is traversed by tests in over twice as many situations as any of the other faulty versions. This particular faulty version involves changing several integers contained in two statically-defined integer arrays that are used in several statements in the software; this is why relatively many situations occur in which one of these modified statements is traversed. It turns out that there is just over a 10% chance that exercising one of these modified statements leads to an exposed fault. The reason there is only about a 1-in-10 chance of leading to an exposed fault here is that the two statically-defined arrays in question each contain about 360 integers, and only a handful of those integers within each array are modified. When those arrays are used in the software, only on some occasions will the actual modified integers within an array be referenced; in the cases where the array is used but only unmodified integers in the array are referenced (the majority of cases), the fault will not be exposed. However, it turns out that this 10% chance is relatively high compared to most of the other faulty versions for ptok (except version v3, which will be discussed shortly), and this can explain why our APFD results showed benefit when using the MOD * (REG + OI + POI) approach.

The faulty version with the next-highest number of situations in which the modification was traversed is version v1, but here, there is only a 0.47% chance of exposing the fault when the modification is traversed. This faulty version involves changing the location of a "case 32:" clause within a "switch" block (but only the "case" clause is moved, not the block of statements falling under that "case") so that the fall-through behavior of control in the "switch" can be affected if control enters that particular "case" clause. It turns out that control enters this particular "case" clause in relatively few situations (these situations are the only ones in which a fault could potentially be exposed). However, this modification is considered traversed in far more situations because many other empty "case" clauses immediately above the original "case 32:" clause happen to fall through to the same first statement that is contained under the "case 32:" clause. This is the reason for the relatively small percentage chance of leading to an exposed fault when this modification is exercised. Since this particular faulty version is traversed in the second-highest number of situations among the faulty versions for this program, the amount of improvement witnessed for program ptok is limited, and this can help to explain why the improvement witnessed for program ptok using the smaller suites, while positive, is still less than that of ptok2 using the smaller suites (discussed shortly).

One final observation about program ptok is that one of the faulty versions, v3, leads to an exposed fault every time the modification is traversed. This faulty version completely removes a call to function "unget_char", which effectively pushes a read character back onto the input stream (if the character is instead not pushed back, this will definitely affect the tokens that are processed from the input stream, and therefore it will also affect the program output). However, among all the faulty versions for ptok, this is the one that is traversed in the fewest number of situations. Therefore, this limits the amount of potential improvement that can be gained by our MOD * (REG + OI + POI) approach in this case.

For program ptok2, faulty version v1 is the version in which the modification is traversed in the most number of situations, and further, this faulty version has almost a 45% chance of leading to an exposed fault when the modification is traversed (this faulty version involves the complete removal of an "if" block along with the three statements inside of it, one of which is a "return" statement). Moreover, three other faulty versions have an even higher chance of leading to an exposed fault when the error is traversed: version v6 has a 67.9% chance, and versions v2 and v5 have a 100% chance. Indeed, 8 of the 10 faulty versions for ptok2 have a greater-than-10% chance of leading to an exposed fault when the modifications are traversed.

On the other hand, one faulty version, v3, never leads to an exposed fault when it is traversed (this faulty version involves the removal of a "return" statement, but the return happens anyhow, without any other side effects, later on in the function if none of the remaining "if" statements in the function evaluate to true). However, the modified statement in this faulty version is traversed in only relatively few situations. Thus, overall, the ptok2 program has much potential to show benefit when using the MOD * (REG + OI + POI) approach, because the modified statements that are most frequently exercised by tests also tend to have a relatively higher chance of leading to an exposed fault. This can explain the improved APFD results we witnessed for ptok2, as well as why greater improvement was witnessed for ptok2 than for ptok using the smaller test suites.

Summary: By analyzing each of the Siemens programs and each of their faulty versions individually, we were able to find a correlation between the amount of improvement witnessed by the MOD * (REG + OI + POI) approach and the likelihood that the most frequently exercised modifications would expose a fault in the software. The cases in which the most frequently traversed modifications were relatively unlikely to expose faults in the software were the cases in which results were the poorest with MOD * (REG + OI + POI). For example, we saw that replace involves two particular faulty versions whose modifications are traversed in significantly more cases (10-15 times more cases) than any of the other faulty versions for that subject program. Further, we showed that these two particular faults happen to be quite difficult to detect, with the most frequently traversed modification exposing the fault in a mere 0.01% of cases. This was due to the fact that the semantics of these particular modifications would only lead to changed behavior during program execution if relatively uncommon cases occurred (e.g., when specified pattern lengths were very long, or when a particular input character occurred in an incorrect position in the input). Similarly, for totinfo, the one faulty version that stands out as being traversed in twice as many situations as the other faulty versions leads to an exposed fault in less than 2% of the times in which a modified statement is exercised. In this case, that was due to reducing the precision of an intermediate floating-point variable used for computation, while the actual output value had limited precision, such that in many cases not enough decimal places were output to witness any difference in actual output caused by the reduced precision.

On the other hand, the cases in which the most frequently traversed modifications were relatively likely to expose faults in the software were the cases in which results were the best with MOD * (REG + OI + POI). For instance, we saw in sched2 that the most frequently traversed modification was one involving the removal of a conditional check for an out-of-bounds input value. Even though there were certainly more cases than not in which the input value was in the proper range, there were still a sufficient number of cases in which the input value was out of bounds to cause this particular fault to be exposed in the greatest percentage of situations (as compared to the other faulty versions for sched2). Similarly, for ptok2, we saw that the modification traversed the most number of times also led to an exposed fault in nearly half of the times it was traversed (this was due to the modification being relatively large, involving the complete removal of an "if" statement along with the three statements inside of it).

Our analysis therefore suggests the following: among all of the modified statements present in a subsequent version of the software, if we consider the number of times each modification is traversed by a test along with the relative ease with which exercising the modification might lead to an exposed fault, and if the most frequently traversed modifications are relatively likely to lead to an exposed fault due to the semantics of the particular modification made, then we might expect the potential benefit of the MOD * (REG + OI + POI) approach over the REG approach to be higher. A future extension to this approach is to incorporate a weight for the type of modification in addition to the number of modifications traversed by a test case. For example, if the modification is to a used variable in a definition with a 1-to-1 mapping between the used and defined variables, it is more likely to affect the output than a modification to a used variable in a definition with a many-to-1 mapping between the used variables and the defined variable. This information could be collected from profiling information available for the previous version of the program, or could be provided by the developer. In our experiments, the most frequently traversed modifications were not always those that were most likely to lead to exposed faults (namely, for programs replace and totinfo), and that is why our results were sometimes degraded using the MOD * (REG + OI + POI) approach. However, it is encouraging that our experimental results generally show significant improvement when using MOD * (REG + OI + POI), despite the fact that the Siemens faults are relatively difficult to detect on average, as was illustrated earlier in Table 28. Moreover, in cases where modifications are more localized to a smaller fraction of the code, such that more test cases can be identified as not traversing any modifications, we expect our approach to be able to show even more improvement as opposed to when the modifications are not taken into account at all, since more tests that are guaranteed not to expose any faults may be identified and pushed back in the prioritized sequences.


Overall, our experimental results suggest that accounting for relevant slicing information, along with information about the modifications traversed by each test case, has much potential when used as part of the test case prioritization process for regression testing. Our results also illustrate that results are dependent upon the semantics of the particular programs used and the modifications made in the subsequent versions, including the distribution of the modifications and the ease with which exercising a modification might lead to an exposed fault in the software. Accordingly, our MOD * (REG + OI + POI) approach has been shown to be an improvement over the REG approach, in terms of rate of fault detection of prioritized suites, for most, but not all, of the Siemens suite of experimental subjects.

We would also like to mention some factors that may influence the validity of our experimental results. In our experiments, we do not control for the structure of the subject programs, or for the locations where errors are seeded in the faulty versions. Further, the errors in the faulty versions may or may not be representative of errors that typically occur in practice. Also, the set of programs used in our experiments may or may not be representative of other programs that are used in practice. In addition, the test suite sizes used in our experiments may be smaller than those used in other applications. To account for these threats, we conducted experiments with branch-adequate suites as well as somewhat larger suites. We also performed statistical analysis of our experimental results, using the t-test for paired observations to show that the improvements obtained by our approach are indeed statistically significant.

5. Related work

Several techniques for test case prioritization (Elbaum et al., 2002; Kim and Porter, 2002; Rothermel and Elbaum, 2003; Rothermel et al., 2001; Srivastava and Thiagarajan, 2002) have been proposed. The technique developed in Srivastava and Thiagarajan (2002) prioritizes test cases in decreasing order of the number of impacted blocks (impacted blocks consist of old modified blocks and new blocks) that are likely to be covered by the tests; a heuristic is used to determine which impacted blocks are likely to be covered by each test. A test case prioritization technique that uses historical execution data was developed in Kim and Porter (2002). Under this approach, tests are prioritized according to their previous execution histories (how often the test has been run lately), fault detection effectiveness (how many faults the test has revealed recently), and coverage of program entities (which particular coverage requirements have been exercised by the test case).

The techniques discussed in Elbaum et al. (2002), Rothermel and Elbaum (2003), and Rothermel et al. (2001) prioritize test cases based on the total and additional requirement coverage information of the tests, as collected during previous testing iterations of the software. In these studies, the prioritized test suites outperformed their unprioritized counterparts in terms of improving the rate of fault detection, but the relative performance of the prioritization techniques varied with the characteristics of test suites, programs, and faults. For example, in some cases total coverage techniques performed better than additional coverage techniques, while the reverse was true in other cases.

Korel and Laski (1991) first introduced the notion of relevant slices. Agrawal et al. (1993) presented an improved version of the definition of a relevant slice, and Gyimothy et al. (1999) presented an algorithm for computing relevant slices. These authors suggest that looking at potential dependencies, besides just actual dependencies, is useful for debugging when modifications are made to software.

Our own work combines the concept of relevant slices (Agrawal et al., 1993; Gyimothy et al., 1999) with the coverage-based prioritization techniques presented in Elbaum et al. (2002), Rothermel and Elbaum (2003), and Rothermel et al. (2001) to obtain a new approach (Jeffrey and Gupta, 2006) for test case prioritization that accounts for the output-influencing and potentially-output-influencing coverage of test cases. To our knowledge, this approach is the first attempt at incorporating the idea of relevant slices into new test case prioritization techniques. In Jeffrey and Gupta (2006), we presented detailed experiments with one heuristic (REG + OI + POI) based on this approach; that heuristic does not take into account the statements that have been modified when prioritizing the test cases. In this paper, we present two additional test case prioritization techniques based on the above approach, along with results of extensive experiments on prioritizing test suites for the Siemens suite programs using all three heuristics. We also give a detailed analysis of the results for each of these programs in the context of their semantics and the various types of modifications made to them. The Siemens suite programs have been used quite extensively in the literature (Elbaum et al., 2002; Leon et al., 2005; Rothermel and Elbaum, 2003; Rothermel et al., 1998; Rothermel et al., 2001; Zhang et al., 2005) on test case prioritization, test case minimization, and fault localization. However, none of the prior work provides a detailed analysis of the behavior of these programs for the various types of modifications corresponding to their faulty versions available with the Siemens suite. The detailed analysis provided in this paper can help other researchers who use the Siemens suite programs by providing insight into comparing the effectiveness of new techniques for test case selection, minimization, and prioritization with that of existing techniques.
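To illustrate the difference between total-coverage prioritization and the relevant-slice-based idea described above, the following sketch (with hypothetical helper names; this is a simplification, not the exact heuristics evaluated in our experiments) orders tests by the number of their covered requirements that fall in the relevant slice of the test's output:

```python
# Hypothetical sketch: prioritize tests by the number of covered
# requirements that lie in the relevant slice of each test's output,
# rather than by total requirements covered.

def prioritize_by_relevant_slice(tests, covered, in_relevant_slice):
    """Order tests by decreasing count of covered requirements that
    influence (or may potentially influence) the test's output; ties are
    broken by total coverage, then by test id for determinism.

    covered[t]           -- set of requirements exercised by test t
    in_relevant_slice[t] -- subset of covered[t] in the relevant slice
    """
    return sorted(
        tests,
        key=lambda t: (-len(in_relevant_slice[t]), -len(covered[t]), t),
    )

# Toy example: t2 covers fewer requirements overall, but more of its
# covered requirements influence its output, so it is scheduled first.
covered = {"t1": {1, 2, 3, 4, 5}, "t2": {1, 2, 3, 4}}
in_slice = {"t1": {1, 2}, "t2": {1, 2, 3}}
order = prioritize_by_relevant_slice(["t1", "t2"], covered, in_slice)
print(order)  # ['t2', 't1']
```

Under total statement coverage prioritization, t1 would run first; weighting by relevant-slice membership reverses the order.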

The topic of test suite minimization (Harrold et al., 1993; Rothermel et al., 1998; Wong et al., 1998) is related to test case prioritization because both types of techniques attempt to reduce the cost of testing by providing testers with a way of identifying ‘‘important’’ test cases. However, unlike prioritization techniques, test suite minimization techniques permanently discard a subset of test cases from each suite. Prioritization techniques, on the other hand, only reorder all the test cases in a test suite without discarding any tests.
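The contrast can be made concrete with a minimal greedy minimization sketch in the spirit of Harrold et al. (1993) (hypothetical names; a simplification of the published heuristic), which keeps only a coverage-preserving subset and discards the rest:

```python
# Hypothetical sketch of greedy test suite minimization: repeatedly keep
# the test that covers the most not-yet-covered requirements, then
# discard all remaining tests. Unlike prioritization, the discarded
# tests are permanently removed from the suite.

def minimize(tests, covered):
    """Return a subset of tests covering every requirement covered by
    the full suite (covered[t] is the set of requirements test t hits)."""
    remaining = set().union(*(covered[t] for t in tests))
    kept = []
    while remaining:
        best = max(tests, key=lambda t: len(covered[t] & remaining))
        kept.append(best)
        remaining -= covered[best]
    return kept

covered = {"t1": {1, 2}, "t2": {2, 3, 4}, "t3": {1}}
print(minimize(["t1", "t2", "t3"], covered))  # ['t2', 't1']
```

Here t3 is discarded for good; a prioritization technique would instead schedule it last but still run it.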

Another related topic is that of safe regression test case selection. Safe regression test case selection techniques (Frankl et al., 2003; Harrold et al., 1992; Rothermel and Harrold, 1997) guarantee that the selected subset of test cases will include all those test cases whose output can possibly change due to the given modifications to the program, and thus will not omit faults that could have been exposed by the entire suite. In comparison, our technique is for test case prioritization, which only reorders the execution of the test cases in the suite and thus does not omit any test case from the original suite.

6. Conclusions

In this paper, we have presented a new approach for test case prioritization that takes into account the output-influencing and potentially-output-influencing statements and branches executed by the tests, as determined through the computation of relevant slices. We presented an experimental study comparing the effectiveness of three different heuristics based on our approach with the traditional prioritization techniques that only account for the total (regular) statement and branch coverage of tests. Our experimental results show that our new prioritization approach has considerable potential for ordering the tests in suites so as to detect faults early in the testing process.
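The rate of fault detection achieved by a prioritized ordering is commonly quantified with the APFD (Average Percentage of Faults Detected) metric of Rothermel et al. (2001), cited below. As an illustrative sketch (not our exact experimental harness): for n tests and m faults, APFD = 1 − (TF₁ + ⋯ + TFₘ)/(nm) + 1/(2n), where TFᵢ is the 1-based position of the first test exposing fault i.

```python
# Illustrative APFD computation (Rothermel et al., 2001): higher values
# mean faults are detected earlier in the prioritized ordering.

def apfd(order, detects):
    """order   -- prioritized list of test ids
    detects -- mapping from fault id to the set of tests exposing it"""
    n, m = len(order), len(detects)
    position = {t: i + 1 for i, t in enumerate(order)}  # 1-based ranks
    tf_sum = sum(min(position[t] for t in exposing)
                 for exposing in detects.values())
    return 1 - tf_sum / (n * m) + 1 / (2 * n)

# A suite of 4 tests and 2 faults: the ordering that exposes both
# faults sooner scores higher.
detects = {"f1": {"t3"}, "f2": {"t1", "t4"}}
print(apfd(["t3", "t1", "t2", "t4"], detects))  # 0.75
print(apfd(["t2", "t4", "t1", "t3"], detects))  # 0.375
```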

References

Agrawal, H., Horgan, J.R., Krauser, E.W., London, S., 1993. Incremental regression testing. In: Proceedings of the IEEE Conference on Software Maintenance, pp. 348–357.

Elbaum, S., Malishevsky, A.G., Rothermel, G., 2002. Test case prioritization: a family of empirical studies. IEEE Transactions on Software Engineering 28 (2), 159–182.

Frankl, P.G., Rothermel, G., Sayre, K., Vokolos, F.I., 2003. An empirical comparison of two safe regression test selection techniques. In: International Symposium on Empirical Software Engineering, pp. 195–204.

Gyimothy, T., Beszedes, A., Forgacs, I., 1999. An efficient relevant slicing method for debugging. In: Foundations of Software Engineering, pp. 303–321.

Harrold, M.J., Gupta, R., Soffa, M.L., 1992. An approach to regression testing using slicing. In: IEEE-CS International Conference on Software Maintenance, pp. 299–308.

Harrold, M.J., Gupta, R., Soffa, M.L., 1993. A methodology for controlling the size of a test suite. ACM Transactions on Software Engineering and Methodology 2 (3), 270–285.

Harrold, M.J., Rothermel, G., 1997. Aristotle: a system for research on and development of program analysis based tools. Technical Report OSU-CISRC-3/97-TR17, Ohio State University.

Hutchins, M., Foster, H., Goradia, T., Ostrand, T., 1994. Experiments on the effectiveness of dataflow- and controlflow-based test adequacy criteria. In: Proceedings of the 16th International Conference on Software Engineering, pp. 191–200.

Jeffrey, D., Gupta, N., 2006. Test case prioritization using relevant slices. In: Proceedings of the 30th Annual International Computer Software and Applications Conference (COMPSAC 2006), Chicago, USA, September 18–21.

Kim, J.-M., Porter, A., 2002. A history-based test prioritization technique for regression testing in resource constrained environments. In: Proceedings of the 24th International Conference on Software Engineering, pp. 119–129.

Korel, B., Laski, J., 1991. Algorithmic software fault localization. In: Proceedings of the Twenty-Fourth Annual Hawaii International Conference on System Sciences, pp. 246–252.

Leon, D., Masri, W., Podgurski, A., 2005. An empirical evaluation of test case filtering techniques based on exercising complex information flows. In: Proceedings of the International Conference on Software Engineering (ICSE 2005), pp. 412–421.

Rothermel, G., Elbaum, S., 2003. Putting your best tests forward. IEEE Software 20 (5), 74–77.

Rothermel, G., Harrold, M.J., 1997. A safe, efficient regression test selection technique. ACM Transactions on Software Engineering and Methodology 6 (2), 173–210.

Rothermel, G., Harrold, M.J., Ostrin, J., Hong, C., 1998. An empirical study of the effects of minimization on the fault detection capabilities of test suites. In: International Conference on Software Maintenance, November, pp. 34–43.

Rothermel, G., Untch, R.H., Chu, C., Harrold, M.J., 2001. Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering 27 (10), 929–948.

Snedecor, G.W., Cochran, W.G., 1967. Statistical Methods, sixth ed. Iowa State University Press, Ames, IA.

Software-artifact Infrastructure Repository. Available from: http://www.cse.unl.edu/~galileo/sir.

Srivastava, A., Thiagarajan, J., 2002. Effectively prioritizing tests in development environment. In: Proceedings of the 2002 ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 97–106.

Wong, W.E., Horgan, J.R., London, S., Mathur, A.P., 1998. Effect of test set minimization on fault detection effectiveness. Software – Practice and Experience 28 (4), 347–369.

Zhang, X., Gupta, R., 2004. Cost effective dynamic program slicing. In: ACM SIGPLAN Conference on Programming Language Design and Implementation, Washington, DC, June, pp. 94–106.

Zhang, X., He, H., Gupta, N., Gupta, R., 2005. Experimental evaluation of using dynamic slices for fault location. In: SIGSOFT-SIGPLAN Sixth International Symposium on Automated and Analysis-Driven Debugging, Monterey, CA, September, pp. 33–42.