Erik Hansson TRITA-NA-E04034 Coverage Based Regression Test Selection – Selecting Regression Tests for a JVM



Erik Hansson

TRITA-NA-E04034

Coverage Based Regression Test Selection – Selecting Regression Tests for a JVM


NADA
Numerisk analys och datalogi / Department of Numerical Analysis and Computer Science
KTH / Royal Institute of Technology
SE-100 44 Stockholm, Sweden

Erik Hansson

TRITA-NA-E04034

Master’s Thesis in Computer Science (20 credits) at the School of Computer Science and Engineering, Royal Institute of Technology, year 2004

Supervisor at Nada was Kjell Lindqvist

Examiner was Stefan Arnborg

Coverage Based Regression Test Selection – Selecting Regression Tests for a JVM


Abstract In large-scale software development, the process of regression testing tends to consume large amounts of resources and require extensive effort. As new versions of a program are conceived, old tests are re-executed to ensure that functionality that worked on the previous version has not been broken.

This report presents a project where a simple regression test selection algorithm – an algorithm aiming to select a subset of the available tests for execution – was evaluated for the development of BEA WebLogic JRockit, a server-side Java Virtual Machine. Code coverage data from tests used by the JRockit developers, showing which parts of JRockit were executed, has been analyzed, and the possible use of a regression test selection scheme has been investigated.

The results indicate that time and resources can be saved by means of regression test selection, but not without a price. The deployment and use of a selection scheme may be costly, and the risk of missing introduced faults is increased.


Kodtäckningsbaserat val av regressionstester: Att välja regressionstester för en JVM

Sammanfattning (Swedish abstract, translated): In large-scale software development, regression testing is often very costly, both in the resources and in the time it consumes. When new versions of a program are produced, old tests are rerun to ensure that functionality that previously existed still works.

This report presents a degree project in which a simple algorithm for selecting regression tests was evaluated for the development of BEA WebLogic JRockit. JRockit is a Java virtual machine tuned for high performance in server applications. Code coverage data, i.e. data describing which parts of a program were executed during one or more runs, from tests used by the JRockit developers has been analyzed, and the use of a test selection system has been investigated.

The results suggest that time and resources can be saved with the aid of a test selection system. Introducing and using such a system can, however, be costly, and the risk that faults remain undetected increases.


Acknowledgements I would like to thank all the people that have helped me to complete this report, which has been taking up most of my time for the past months. In particular, I’d like to thank my advisors at BEA Systems and KTH, Mattias Bertilsson and Kjell Lindqvist, who have taken the time to help me through this process. I’d also like to thank Boris Chen at BEA Systems, for helping me find a subject for my thesis, and giving me the chance to see it through at BEA.


Table of Contents

1 INTRODUCTION
  1.1 Problem
  1.2 Objective
  1.3 Scope
  1.4 Approach
  1.5 Summary of Results

2 BACKGROUND
  2.1 BEA WebLogic JRockit
    2.1.1 Progressive Optimization
    2.1.2 Garbage Collectors
      2.1.2.1 Singlecon
      2.1.2.2 Gencon
      2.1.2.3 Parallel
  2.2 Code Coverage
    2.2.1 How Coverage Data is Obtained
  2.3 Notation
  2.4 Regression Testing
    2.4.1 Test Suite Reduction
    2.4.2 Test Case Prioritization
    2.4.3 Selective Retest Techniques
  2.5 Regression Test Selection
    2.5.1 Properties of Tests
    2.5.2 Controlled Regression Testing Assumption
    2.5.3 Properties of Selection Techniques
      2.5.3.1 Inclusiveness
      2.5.3.2 Precision
      2.5.3.3 Efficiency
      2.5.3.4 Generality
      2.5.3.5 Accountability
    2.5.4 Test Suite Granularity
      2.5.4.1 Selection Granularity

3 METHOD
  3.1 The Selection Algorithm
  3.2 Environment
  3.3 Building & Instrumentation
    3.3.1 Software Used
    3.3.2 Compiler Settings
      3.3.2.1 Debug Information
      3.3.2.2 Incremental Building
      3.3.2.3 Relocation
    3.3.3 Instrumentation
  3.4 Gathering Coverage Data
    3.4.1 The Tests
    3.4.2 PureCoverage Reports
    3.4.3 Running the Suite
  3.5 Analyzing the Coverage Data
    3.5.1 Objective
    3.5.2 Problems
    3.5.3 The Coverage API
    3.5.4 Coverage
    3.5.5 Function Characteristics
    3.5.6 Weighted Function Characteristics
    3.5.7 Line Characteristics
    3.5.8 Function Coverage Reproducibility
  3.6 Implementing the Selection Tool
  3.7 Selection Evaluation

4 RESULTS
  4.1 Coverage
  4.2 Coverage Characteristics
    4.2.1 Function Coverage Characteristics
      4.2.1.1 Influence of the Garbage Collector
    4.2.2 Weighted Function Coverage Characteristics
    4.2.3 Line Coverage Characteristics
  4.3 Reproducibility
  4.4 Savings on Real Changes

5 DISCUSSION
  5.1 Observations
    5.1.1 No Safe Selection
    5.1.2 Selection Efficiency
  5.2 Risks
    5.2.1 Test Suite
    5.2.2 Instrumentation
    5.2.3 Testing Environment
    5.2.4 Used Modifications
  5.3 Selection for the JRockit Team
    5.3.1 The Examined Algorithm
    5.3.2 Test Selection for Verification
    5.3.3 Selection for Continuous Testing
    5.3.4 Justifying Risks
    5.3.5 Coverage Data Decay
    5.3.6 Alternative Approaches to Selection
    5.3.7 Conclusion

6 FUTURE WORK

REFERENCES

APPENDIX A: THE TESTS


1 Introduction When developing software, regression testing – the practice of rerunning existing tests to ensure that new versions of a software system do not introduce new bugs in old functionality – often stands out as an expensive and time-consuming activity. To reduce the cost of regression testing, several different strategies have been devised. These strategies attempt to either reduce the number of tests to be executed or to execute the same tests more effectively. Among these strategies we find such methods as test suite reduction, regression test selection and test case prioritization.

Test suite reduction aims to reduce the size of the regression test suite, normally by excluding tests that are not necessary to reach a certain coverage criterion. Regression test selection, by contrast, selects a subset of the available tests when testing, leaving the test-suite intact. Many selection techniques select tests based on changes made to the system, attempting to run only the tests that are most likely to reveal faults in the modified code. Test case prioritization attempts to order the execution so that the expected time to find faults is minimized.

This report presents an evaluation of a simple regression test selection algorithm for the development of BEA WebLogic JRockit, a Virtual Machine for Java. A number of tests for JRockit have been used to estimate how well such a regression test selection might pay off for the development and testing efforts of this JVM. The project is described in the remainder of this chapter. Theoretical background, implementation, results, analysis and discussion are presented in subsequent chapters.


1.1 Problem The cost of testing in general, and of regression testing in particular, represents a substantial part of overall software development costs. Regression testing is performed on a modified system to instill confidence that no new faults have been introduced and consists, in practice, of the re-execution of available tests. The most widely used regression testing scheme is currently the trivial retest all approach, in which all available tests are re-executed every time testing needs to be performed [rothermel96]. While this method is among the best at uncovering faults, it may use excessive time and resources, especially as a program matures and the available test suite grows large [rothermel02a].

To address this problem, regression test selection can be used to reduce the amount of testing that needs to be performed. By rerunning only some of the available tests, less testing effort is required and the cost reduced.

The data needed to facilitate a simple regression test selection is available in regular code coverage data – data describing which parts of a program are exercised by a test or a test suite. The gathering of such data can be motivated by uses other than regression test selection, since it can aid the testing process in other ways. Coverage data can, for instance, be used to identify areas of code that are in need of further testing. An in-depth explanation of code coverage can be found in section 2.2.

1.2 Objective The purpose of this project was to investigate the use of a simple code coverage based regression test selection scheme for the development of BEA WebLogic JRockit; one in which code coverage data is used to determine which tests exercise which functions, and then select only those tests that exercise modified functions. A formal description of this algorithm can be found in section 3.1.

Regression testing is used extensively to test new versions of JRockit. Tests are re-executed on a regular basis to detect new faults and ensure that the program is still operating as it is meant to. Although this testing is time consuming and would make a suitable target for streamlining, it would most likely be better served by a selection algorithm other than the one primarily discussed in this report. This is discussed in further detail in section 5.3.

One use for the examined algorithm is to provide highly focused check-in testing. As changes are made to the JRockit code, it is desirable to ensure that the changes are sound. By using regression test selection, the time this verification takes can be reduced, or the number of tests that are available to the verification process can be increased. This is achieved by running only those tests that exercise the modified parts of the code.

Since the time required to rerun all available tests for JRockit is substantial, regression test selection schemes can increase effectiveness by reducing the amount of unnecessary testing that is performed, and the time required to identify faults in modified code. The time and resources spent on rerunning superfluous tests could then be used more constructively, for instance on developing and running new tests to further increase product quality.

If this selection mechanism proved ineffective or improper, the reasons were to be discussed and alternative solutions suggested. Other possible test selection schemes have been considered as alternatives to the chosen one. Benefits and drawbacks with different approaches are discussed in chapter 5.

1.3 Scope The aim of this project was to obtain and analyze enough data to draw some meaningful conclusions on the viability of using regression test selection based on code coverage data for the development of JRockit. Integration of the selection system with the environment used by the JRockit team was not a part of this project, and the tools were constructed primarily to support an estimation of the possible benefits of a test selection system.

It was decided that only a manageable subset of the test-suite should be used for this project. For the purpose of evaluating selection, this should be acceptable. This cutoff does, however, affect the obtained coverage data. This data may serve as sample data, not as a description of the JRockit test suite.

Although JRockit exists for a number of different platforms, all work for this project was conducted on a regular 32-bit Windows machine. The different platforms were not considered, partly because the obtained results are likely to be similar for all platforms, and partly because covering every platform would have required too much work. Selection of platforms on which to run tests is a possible use for an automated test selection algorithm that should not be disregarded entirely: a change in the code for JRockit on one platform does not require the other platforms to be retested. This selection, however, depends more on analysis of the source changes than on the execution coverage that was the main consideration for this project.

1.4 Approach To investigate the benefit of a regression test selection scheme, coverage data was extracted for a number of JRockit tests. This data was analyzed to estimate how well a simple regression test selection algorithm could work. Further, a regression test selection system prototype was built, which could select tests based on actual code changes. This prototype was evaluated using actual changes in the code for JRockit, thus indicating the possible benefit of using such a selection scheme.

For each test, code coverage data was obtained by running the test with the instrumented version of JRockit. A list of all functions that each test exercised was extracted from this coverage data. When a change was made to one of those functions, that test would be selected.
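The per-function selection just described can be sketched in a few lines of Python. The test names and function names below are invented for illustration; this is not the thesis's actual tool:

```python
def select_tests(coverage, modified_functions):
    """coverage: dict mapping test name -> set of functions it exercised.
    modified_functions: set of functions changed in the new version.
    Returns the subset of tests to rerun: those whose exercised
    functions intersect the modified ones."""
    return {
        test for test, funcs in coverage.items()
        if funcs & modified_functions  # non-empty intersection
    }

# Hypothetical per-test coverage, extracted as in the approach above:
coverage = {
    "gc_stress":  {"gc_mark", "gc_sweep", "alloc"},
    "jit_basic":  {"compile_method", "alloc"},
    "class_load": {"load_class", "verify_class"},
}

selected = select_tests(coverage, {"gc_sweep"})
print(sorted(selected))  # only the test exercising gc_sweep is selected
```

A change to a widely exercised function such as `alloc` would select both `gc_stress` and `jit_basic`, which hints at why this scheme saves the most when changes touch rarely exercised code.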


This algorithm was chosen for two reasons; its simplicity and the fact that, under certain circumstances, the algorithm is safe. Safe selection algorithms are explained in section 2.5.3.1. The implemented algorithm is described in section 3.1.

1.5 Summary of Results Although regression test selection was implemented, the data obtained through coverage analysis indicate that this regression test selection is far from safe, since the execution traces for JRockit are nondeterministic. This was confirmed during evaluation of the prototype selection system.

To proceed with a test selection scheme for JRockit, one of two paths must be taken: either safety must be sacrificed to reduce resource usage, effectively accepting the possibility of missing introduced faults by selecting fewer tests, or a more advanced regression test selection scheme must be implemented, which would most likely produce smaller savings.

Since a sophisticated regression test selection scheme is nontrivial and may in itself become a source of problems, my conclusion is that the most beneficial path is to either accept a certain uncertainty in the selection and proceed with a scheme that could reduce the test load substantially, or to simply not use regression test selection at all. Possible approaches to take to facilitate a better regression test selection for JRockit are discussed in chapter 5.


2 Background This chapter describes the necessary background knowledge required to fully comprehend this report. It introduces the JRockit JVM, code coverage and previous work in the field of regression test selection techniques. Commonly used models, terms and assumptions are described to provide a foundation for the reader on which to base understanding of the following chapters. The notation used in this report is described after the necessary background concepts have been introduced, and before this notation is used.

2.1 BEA WebLogic JRockit Since the introduction of Java in 1995, the language has become one of the most widely used programming languages. In recent years, its use in server-side applications has increased substantially.

Java programs run in a virtual machine, or JVM, a program that emulates a theoretical machine. This allows Java programs to be executed on multiple platforms.

BEA WebLogic JRockit, hereafter JRockit, is a server-side JVM optimized to provide high performance for long-running applications with high throughput. It provides progressive code optimization and multiple garbage collectors for improved performance [jrockit03].

2.1.1 Progressive Optimization

To improve performance, JRockit monitors running Java programs, and the most used methods are recompiled and aggressively optimized. To determine which parts of an application require optimization, a system called the hotspot detector monitors the execution, attempting to identify “hot spots”, or code that is heavily exercised. This makes sense since almost all the time spent in a typical application is spent in just a small fraction of its methods [jrockit03].

2.1.2 Garbage Collectors

In addition to optimization of methods in the running program, JVM performance is highly dependent on the garbage collector. JRockit provides a number of different garbage collectors, suitable for different applications. When testing, the tests are generally run with several different GC options to test the collectors.

2.1.2.1 Singlecon

The singlecon garbage collector is a single generation garbage collector that runs in a separate GC thread, mostly concurrently with the application. Singlecon is suitable for applications that use relatively few live objects and where pause times may be a problem.

2.1.2.2 Gencon

Gencon, like singlecon, runs mostly concurrently with the program, but also provides a nursery. Newly allocated objects are placed in the nursery, a separate section of the heap, which is garbage collected more often than the rest of the heap. This often reduces the amount of processing required for garbage collection.

2.1.2.3 Parallel

Finally, the parallel garbage collector stops the running program when doing a garbage collection. When the application is paused, it uses multiple parallel threads to quickly clean the heap of unreachable objects. This garbage collector is the fastest since it does not need the overhead required to safely perform a garbage collection when the program is running, but it causes the program to halt for longer than the others do. Parallel does not use a nursery.

2.2 Code Coverage This section briefly describes code coverage, or test coverage as it is commonly called. Code coverage is a measure of how large a proportion of a program is exercised by a test suite, usually expressed as a percentage.

There are many measures of code coverage. Among the most common are line coverage and function coverage, which measure how many of the lines or functions in the source code are exercised, respectively. Other kinds of coverage include statement coverage, path coverage, branch coverage and condition coverage [kaner99, zhu97].

For example, 100% line coverage indicates that a test suite exercises every line in the source of a program. A line is exercised if it is executed during the test.


100% path coverage indicates that every path through the program has been taken. This is usually extremely difficult, if at all possible, to achieve.
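As a small illustration with made-up numbers, line (or function) coverage reduces to a ratio of covered items to total items:

```python
def coverage_percent(covered, total):
    """Coverage as a percentage of exercised items (lines or functions)."""
    return 100.0 * len(covered) / len(total)

# Hypothetical program: 200 source lines, 150 of them exercised by the suite.
all_lines = set(range(1, 201))
covered_lines = set(range(1, 151))
print(coverage_percent(covered_lines, all_lines))  # 75.0
```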

As intuition suggests, code coverage is related to defect coverage for a test suite. This implies that by achieving a high degree of code coverage, in as many measures as possible, we can detect a large portion of the bugs that exist within a software system. Using coverage measures as guidelines for test development is often more efficient than creating tests randomly [malaiya94, malaiya96].

Code coverage is often used to identify weaknesses in a test suite, and thus aid the development of new tests. Developing tests to increase code coverage is essentially equivalent to developing tests for previously untested code. When developing tests, coverage criteria are often used as goals or milestones. This can serve as a way to formally introduce rules that require the software to be thoroughly tested.

Definition 1. Let Coverage Criteria mean statements concerning some coverage measure. Such criteria can for instance state that the test suite must cover 90% of the lines in the tested program.

2.2.1 How Coverage Data is Obtained

Coverage data is typically obtained through running the tests with an instrumented version of the software. This means that special tracing statements, probes, are inserted into the program, normally at compile- or link-time. During test execution, those probes signal which parts of the code are exercised, providing an execution trace that can be used to measure coverage [tikir02].
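The probe idea can be illustrated with Python's standard `sys.settrace` hook, which invokes a callback on every executed line. This is only a toy stand-in for the real compile- or link-time binary instrumentation discussed here:

```python
import sys

executed = set()  # (function name, line number) pairs hit during the run

def probe(frame, event, arg):
    # Record each executed line, analogous to an inserted probe.
    if event == "line":
        executed.add((frame.f_code.co_name, frame.f_lineno))
    return probe

def program_under_test(x):
    if x > 0:
        return "positive"
    return "non-positive"

sys.settrace(probe)          # "instrument" subsequently called functions
program_under_test(5)
sys.settrace(None)           # remove the instrumentation

# The trace now shows which lines of program_under_test were exercised;
# the branch returning "non-positive" was never hit.
print(len(executed) > 0)  # True
```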

The instrumentation naturally affects the performance of the target program. How much differs between implementations, but the slowdown is typically significant and may alter timing-dependent behavior in the program. For instance, one of the coverage tools tested in the preliminary phase of this project, True Coverage, caused performance to drop to about 6% when running a particular test. While this is certainly an extreme example, it highlights the potentially high impact that instrumentation can have.

2.3 Notation For brevity and clarity, a mathematical notation is used to describe measurements and properties throughout this report. This section provides a brief explanation of this notation.

When referring to a program and, possibly, a modified version of that same program, these are called P and P′. A test suite is generally denoted T and a single test t. If the test is a part of a test suite, this is written t ∈ T. A subset of T with selected tests is denoted T′.

Coverage data from running a test with P is denoted C_P(t), and from a test suite C_P(T). This can be viewed as a set of lines or functions exercised by the tests given as the parameter. Merged coverage from running two test suites looks like this: C_P(T_1) ∪ C_P(T_2). We say that a function f or a line l is covered by a test suite and a program with f ∈ C_P(T) and l ∈ C_P(T) respectively. Often a single program is discussed. In such cases, the program index is generally omitted.

For example, the set of functions covered by T_1 but not by T_2 is written as (2.1). The coverage obtained from two runs of a test t is denoted C_{P,1}(t) and C_{P,2}(t) respectively, with the index denoting the run.

{ f | f ∈ C_P(T_1) − C_P(T_2) }    (2.1)

Where line or function coverage is specifically required, this is denoted C^l or C^f respectively. The set of lines that make up the functions exercised by a test is thus written as (2.2). The set of lines is denoted L and the set of functions F. The corresponding sets of covered lines and functions are CL and CF.

{ l | l ∈ C^f_P(t) }    (2.2)
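This notation maps directly onto set operations in code. A small Python illustration, with invented function names standing in for covered functions:

```python
# C_T1 and C_T2 play the roles of C(T1) and C(T2): the sets of functions
# covered by two test suites (hypothetical names).
C_T1 = {"alloc", "gc_mark", "compile_method"}
C_T2 = {"alloc", "load_class"}

merged = C_T1 | C_T2    # merged coverage, C(T1) ∪ C(T2)
only_T1 = C_T1 - C_T2   # expression (2.1): covered by T1 but not by T2

print(sorted(merged))   # ['alloc', 'compile_method', 'gc_mark', 'load_class']
print(sorted(only_T1))  # ['compile_method', 'gc_mark']
```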

2.4 Regression Testing Software testing is the process of exposing a software system to a number of tests with the hope of revealing faults. Testing has been used as the principal technique to determine whether or not software systems are working, and represents one of the largest costs associated with software development today [chen94, rothermel02a].

Definition 2. A fault is an error or a bug in a software system.

Definition 3. A failure is a manifestation of a fault.

When changes have been made to a program, we normally wish to ensure that the new version does not contain faults. For this purpose, old tests are re-executed with the new version of the program. We call this process regression testing.

Definition 4. Regression Testing is the process of rerunning tests on a modified version of a software system to ensure that no new faults have been introduced in previously working functionality.

As more and more tests are created for a software system, regression testing can become increasingly time consuming and costly. To reduce this cost, various regression testing techniques have been proposed that aim to improve the effectiveness of the regression testing process [bates93, chen94, rothermel94a, rothermel94b, rothermel96, vokolos97, vokolos98, wong95, zhu97]. Among these, we find techniques such as test suite reduction, test case prioritization and regression test selection. Naturally, any technique must, in order to be useful, provide a greater benefit than its associated costs. For regression testing techniques, the main potential benefits are that less testing needs to be performed and that faults can be detected earlier. The cost lies in employing the techniques themselves and the risk of not detecting faults which would otherwise have been detected [rothermel02a].

The most widely used regression testing technique is currently the trivial retest all technique [rothermel02b]. As the name implies, it simply reruns all available tests whenever regression testing is to be performed. Although both simple and safe, this method is costly if the time and resource requirements to rerun the entire test suite are high.

2.4.1 Test Suite Reduction

Test suite reduction techniques attempt to permanently reduce the size of a test suite by removing redundant tests. Commonly, redundancy is determined by use of coverage criteria as described above. For instance, tests that do not contribute any coverage can be considered redundant and may be removed from the test suite.

It is worth noting that most such criteria do not ensure that the tests will not uniquely detect faults in the software. The example above simply ensures that the coverage will not decrease as a result of our removing the test from the test suite.

Test suite reduction techniques have been used to significantly reduce the size of a test suite with only a small reduction in fault detection effectiveness [wong95].
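One common reduction heuristic, shown here as an illustrative sketch rather than any specific published algorithm, greedily keeps the test that adds the most new coverage until no remaining test adds any:

```python
def reduce_suite(coverage):
    """Greedy reduction: keep tests until total coverage stops growing.
    coverage: dict mapping test name -> set of covered items."""
    kept, covered = [], set()
    remaining = dict(coverage)
    while remaining:
        # Pick the test contributing the most not-yet-covered items.
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            break  # no test adds coverage; the rest are redundant
        kept.append(best)
        covered |= remaining.pop(best)
    return kept

# Hypothetical suite: t2 adds nothing beyond t1, so it is dropped.
coverage = {
    "t1": {1, 2, 3},
    "t2": {2, 3},
    "t3": {4},
}
print(reduce_suite(coverage))  # ['t1', 't3']
```

As the text notes, the dropped test `t2` might still uniquely detect a fault; the criterion only guarantees that total coverage is unchanged.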

2.4.2 Test Case Prioritization

Test case prioritization techniques retain the entire test suite, but attempt to order the tests in such a way as to minimize the time required to meet testing requirements. Tests can, for instance, be ordered to quickly achieve a certain degree of coverage or to cover bug-prone parts of the system early. This may help find faults earlier during the testing process. Since all tests are rerun when using prioritization, this method is inherently safe.
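A prioritization can be sketched with the same greedy idea as reduction, except that every test is kept and only the execution order changes (illustrative only, not one of the cited techniques):

```python
def prioritize(coverage):
    """Order all tests so total coverage grows as fast as possible.
    coverage: dict mapping test name -> set of covered items."""
    order, covered = [], set()
    remaining = dict(coverage)
    while remaining:
        # Next test: the one adding the most not-yet-covered items.
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order

# Hypothetical suite: 'b' covers most, 'c' adds item 4, 'a' adds nothing new
# but is still run last rather than dropped.
coverage = {"a": {1}, "b": {1, 2, 3}, "c": {3, 4}}
print(prioritize(coverage))  # ['b', 'c', 'a']
```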

2.4.3 Selective Retest Techniques

Regression test selection techniques select a subset of the existing test suite at test-execution time, leaving the test suite unchanged. The basis for selection can be coverage criteria, as is often the case with test suite reduction. More interestingly, it can be a combination of test execution behavior and source code modifications [rothermel02a]. Since this is the technique studied in this project, it is explained in further detail below.

2.5 Regression Test Selection The purpose of regression test selection techniques is to identify a subset of a test suite that sufficiently tests changes made to a program. There exist a number of different algorithms and methods for this purpose. These can be broadly divided into three categories: minimization techniques, coverage techniques and safe techniques [rothermel94a].


Minimization techniques aim to select a minimal subset that satisfies some coverage criterion. Such techniques can often reduce the test set substantially, but risk missing faults that would otherwise have been detected.

Coverage techniques, like minimization techniques, use coverage criteria to select tests. The difference is that coverage techniques do not attempt to minimize the selected set.

Finally there are safe techniques. Safe selection techniques do not primarily use coverage criteria, but instead aim to select every test that will cause the modified program to produce a different output than the unmodified program. The algorithm studied in this project uses code coverage data to determine which tests may produce different output.

When choosing a strategy for selective retesting, the different approaches are suitable in different situations. For example, if the time required to run tests must be shortened as much as possible, a minimization technique is appropriate, whereas a safe approach is required if it cannot be afforded to miss any faults that would otherwise have been uncovered.

Further, a safe technique can be expected to perform poorly under certain circumstances, for instance if there are large differences between the two versions or if every test covers large parts of the code.

Rothermel and Harrold present a framework for evaluating selective retest techniques in [rothermel94a]. The terminology used in this report is mostly adopted from that paper.

2.5.1 Properties of Tests

We define three important properties of tests with respect to P and P′. Assume that all tests work correctly with P.

Definition 5. A test is fault-revealing if it produces a failure with P′.

The regression test selection problem is really a matter of identifying the set of fault-revealing tests. If it could be quickly determined which tests would uncover faults, only those tests would need to be executed. This, however, is in general not possible [rothermel94a].

Definition 6. A test is modification-revealing iff it produces different output for P and P′.

It should be clear that all fault-revealing tests are modification-revealing given the assumption above. Although it is impossible to determine, in the general case, if tests are modification-revealing [rothermel94a], one can, given execution trace information, determine if a test is modification-traversing.

Definition 7. A test is modification-traversing if it executes a new or modified statement in P′, or misses a statement in P′ that was executed in P.
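Given per-statement (or per-probe) coverage sets for the two versions, Definition 7 reduces to a set comparison. The following is a minimal sketch, with invented statement identifiers; it is not the thesis tooling, and identifying corresponding statements across P and P′ is assumed to be done already:

```java
import java.util.Set;

public class ModificationTraversing {
    // A test is modification-traversing if its executions of P and P'
    // differ: it reaches a new or modified statement in P', or misses a
    // statement in P' that it covered in P. Given comparable statement
    // identifiers, this is just an inequality of the two covered sets.
    static boolean isModificationTraversing(Set<String> coveredInP,
                                            Set<String> coveredInPPrime) {
        return !coveredInP.equals(coveredInPPrime);
    }

    public static void main(String[] args) {
        Set<String> p = Set.of("stmt1", "stmt2");
        Set<String> pPrime = Set.of("stmt1", "stmt2", "stmtNew");
        System.out.println(isModificationTraversing(p, pPrime)); // true
    }
}
```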


Note that the set of fault-revealing tests is a subset of the modification-revealing tests. Also, the set of modification-revealing tests is a subset of the modification-traversing tests.

2.5.2 Controlled Regression Testing Assumption

A problem when using the above classification is that we cannot, in general, determine if a test is modification-traversing given only execution data from P . Consider a program that chooses one of two paths randomly. When one of the two paths is modified, it is impossible to know beforehand if the test will be modification-traversing.

If all factors can be held constant, except for the program itself, we have what is called controlled regression testing [rothermel96]. This implies that the program will execute in a fully deterministic fashion, yielding the same results in the same way every time we run the program.

Definition 8. Controlled Regression Testing implies that ∀(i, j): C_i(T) = C_j(T).

Note that a test must operate under the controlled regression testing assumption for safe selection techniques to operate as expected. If this is not the case, a test may be modification-revealing in one run, and not in another. Also, as stated above, there is no way of knowing if a test is modification-traversing in advance of a given execution. If the execution is controlled, coverage data can be gathered from a test case once, and that same coverage will be obtained each time that test is executed.

If the controlled regression testing assumption holds, and P works correctly with all tests in T, we can safely select tests by selecting all modification-traversing tests [rothermel96]. This works because any change, in order to affect the output of P′, must be exercised.

2.5.3 Properties of Selection Techniques

In [rothermel94a], as mentioned above, Rothermel and Harrold propose a framework for comparing and evaluating regression test selection techniques. In particular, they introduce five interesting properties of regression test selection techniques: inclusiveness, precision, efficiency, generality and accountability.

2.5.3.1 Inclusiveness

Inclusiveness is the extent to which a selection technique selects tests that are modification-revealing. A technique that always selects all modification-revealing tests is 100% inclusive. Such a technique is also safe.

Definition 9. A regression test selection technique is safe iff it selects all tests that will cause the output from the modified program to be different from the output from the original program.

Inclusiveness is defined relative to a program, a modified version of this program and a test set, since the degree of inclusiveness may vary with any of those. It is not possible to determine the inclusiveness of a technique for arbitrary programs and tests [rothermel94a].

Definition 10. The inclusiveness of a regression test selection technique relative to P, P′ and T is 100(|M′| / |M|)%, where M is the set of modification-revealing tests and M′ is the selected subset of M.

Although the inclusiveness of a technique cannot be determined in general, the measure can be useful for comparing selection techniques with each other. It is often possible, for instance, to prove that one technique selects a superset of the tests selected by another. We can also determine that a technique is safe if it selects a superset of the modification-revealing tests.

2.5.3.2 Precision

Precision is the extent to which a selection technique omits tests that are not modification-revealing. Like inclusiveness, precision is defined only relative to a particular program, modified program and test set. It is defined as the percentage of the non-modification-revealing tests that are omitted from selection. A technique that omits every test that is not modification-revealing is precise.

Definition 11. The precision of a technique relative to P, P′ and T is 100(|N − N′| / |N|)%, where N is the set of non-modification-revealing tests and N′ is the subset of N that is selected.

Definition 12. A precise selection algorithm is an algorithm that never selects a test that is not modification-revealing.

As with inclusiveness, there is no way to determine the precision of a technique relative to arbitrary programs and tests.
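For a concrete P, P′ and T where the sets M, N and the selected tests are known, both measures are simple set arithmetic. An illustrative computation with invented test names:

```java
import java.util.HashSet;
import java.util.Set;

public class SelectionMetrics {
    // Inclusiveness (Definition 10): percentage of the
    // modification-revealing tests M that the technique selected.
    static double inclusiveness(Set<String> m, Set<String> selected) {
        if (m.isEmpty()) return 100.0; // nothing to miss
        Set<String> mSelected = new HashSet<>(m);
        mSelected.retainAll(selected);
        return 100.0 * mSelected.size() / m.size();
    }

    // Precision (Definition 11): percentage of the
    // non-modification-revealing tests N that the technique omitted.
    static double precision(Set<String> n, Set<String> selected) {
        if (n.isEmpty()) return 100.0;
        Set<String> nOmitted = new HashSet<>(n);
        nOmitted.removeAll(selected);
        return 100.0 * nOmitted.size() / n.size();
    }

    public static void main(String[] args) {
        Set<String> m = Set.of("t1", "t2");       // modification-revealing
        Set<String> n = Set.of("t3", "t4", "t5"); // not modification-revealing
        Set<String> selected = Set.of("t1", "t2", "t3");
        System.out.println(inclusiveness(m, selected)); // 100.0: safe on this instance
        System.out.println(precision(n, selected));     // ~66.7: t3 selected needlessly
    }
}
```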

2.5.3.3 Efficiency

The efficiency of a selective retest technique is measured by its space and time requirements along with its automatability. Essentially, the efficiency is a measure of the cost to use the technique.

For test selection to be profitable, the cost of running T − T′ must be greater than the cost of using the selection technique itself.

2.5.3.4 Generality

Generality is the ability of a selective retest technique to function in different situations and environments. Techniques that are specifically adapted to a particular class of programs have poor generality, whereas those that work on a wide range of programs and in different situations are general.

2.5.3.5 Accountability

Accountability represents the ability of a selection technique to support structural coverage criteria, for instance by identifying unused program components or selecting tests that maximize coverage. The accountability measurement is useful since it highlights uses of the selection technique that are not directly related to selection.

2.5.4 Test Suite Granularity

When performing regression test selection, the test suite's granularity has a substantial impact on the effectiveness [rothermel02b]. The test suite granularity is a measure of how specialized the test cases are. A test suite with few, large tests has a low granularity, whereas a suite with many small tests has a high granularity.

Definition 13. A test suite has high granularity if it consists of many small tests, and low granularity if it consists of few, large tests.

As intuition suggests, change-based selection algorithms perform better on test suites with a high granularity, since each of the many tests is less likely to exercise the altered parts of the code.

It is interesting to note, however, that the execution of a low granularity test suite with the same testing capability is generally faster than that of a high granularity suite. This is because startup and shutdown times are smaller. Also, complex faults are often easier to provoke using more complex tests [rothermel02b].

2.5.4.1 Selection Granularity

Selection granularity refers to the coarseness of the measurements used to determine which tests to select. In an algorithm where we select modification-traversing tests, for instance, a low granularity selection algorithm may select all tests that exercise code in the same files as the changes, whereas a high granularity selection analyses the code and selects only tests that exercise, for example, changed lines. Selection will generally be more effective with higher selection granularity.

Definition 14. High selection granularity means that selection is based on small units of measurement, whereas low selection granularity indicates larger units, such as source files or modules.

The algorithm discussed in this report uses a function-based selection. It is thus of higher granularity than the source-file based algorithm, but of lower granularity than a line-based selection algorithm.


3 Method

This chapter describes the work performed throughout this project, the methods used to acquire and analyze the coverage data, the implementation of the required tools and the choices and tradeoffs that were made along the way. The obtained results and their implications are discussed in the following chapter.

3.1 The Selection Algorithm

The selection algorithm investigated in this project is one of the simplest selection algorithms available. Consider a test t that works as expected with an instrumented program P. The function coverage data obtained from running t with P, C_P^f(t), is the set of functions that are executed.

Consider further that changes are made to a set of functions F_M. The test is selected for re-execution with P′, the modified version of P, iff:

C_P^f(t) ∩ F_M ≠ ∅ (3.1)

This same check is repeated for every test case available, yielding a subset of the suite T. As noted in sections 2.5.1 and 2.5.2, this selection algorithm is safe if the controlled regression testing assumption holds.
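The check (3.1) is one set intersection per test case. A minimal sketch of the scheme, with invented test and function names (the real tool works from PureCoverage reports):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FunctionLevelSelection {
    // Select every test whose function coverage intersects the set of
    // modified functions F_M (equation 3.1). Under the controlled
    // regression testing assumption this selection is safe.
    static List<String> select(Map<String, Set<String>> coverageByTest,
                               Set<String> modifiedFunctions) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : coverageByTest.entrySet()) {
            boolean intersects = e.getValue().stream()
                    .anyMatch(modifiedFunctions::contains);
            if (intersects) selected.add(e.getKey());
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> cov = Map.of(
                "gcTest",  Set.of("gc_collect", "heap_alloc"),
                "jitTest", Set.of("jit_compile", "heap_alloc"),
                "ioTest",  Set.of("file_open"));
        // A change touching only the JIT selects just the tests reaching it.
        System.out.println(select(cov, Set.of("jit_compile"))); // [jitTest]
    }
}
```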

3.2 Environment

The data gathering and analysis performed for this project was done on a regular 32-bit Windows 2000 workstation with a single 1200 MHz CPU and 512 MB of RAM. Although JRockit is developed for multiple platforms, including Linux and 64-bit Windows, no attempts have been made to evaluate the selection on these platforms. Mainly, this is motivated by the considerable gain in simplicity compared to managing multiple platforms, but also by noting that the results are likely to be similar on all platforms. A discussion of the chosen platform's possible effect on the results can be found in chapter 5.

The software used has, as far as possible, been the same as used for the development of JRockit. Building has been done with Microsoft Visual Studio, PureCoverage was used to gather coverage data and JRockit itself was used to run the utilities needed to analyze the data.

3.3 Building & Instrumentation

This section describes the building and instrumentation performed to produce the version of JRockit used throughout this project. Since the building and instrumentation process necessarily affects the program, a risk exists that it also affects the acquired coverage data, and thus the results in this report.

3.3.1 Software Used

JRockit was built using Microsoft Visual Studio .NET 2003. The compiled program was instrumented with PureCoverage, and used to run the tests and collect coverage data.

As implied above, a custom version of JRockit was built for this project. The code used for this version is known internally as load 22 of Ariane.

3.3.2 Compiler Settings

To allow instrumentation, some compiler settings needed to be changed from the defaults. Incremental building was turned off, and the program needed to be built to allow relocation. In addition, PureCoverage requires debug information, so a debug build was chosen.

3.3.2.1 Debug Information

Any Windows executable or DLL may have an accompanying program database file (a PDB file), containing information about the program. This information is not necessary for the execution of the program, but is needed to map instructions to source code, find variable names and so on. Since PureCoverage reports which lines of code are executed, this information is required. As the name implies, this information is most commonly used when debugging the program.

3.3.2.2 Incremental Building

Incremental building is a linker option that allows the compiler to recompile only those functions in an exe or dll that have been modified since the last build. The code generated when this option is turned on is larger than it would otherwise have been [MSDN-library], to allow possibly larger functions to be inserted into the program. PureCoverage does not support programs compiled using incremental building, so this option had to be turned off.

3.3.2.3 Relocation

Finally, the program has to allow relocation. Relocation occurs during program startup when two modules collide in the process address space. All executable code has a preferred base address where it is normally loaded. If this memory is occupied, the module is loaded at another address, and the code needs to be patched to reflect this change. This increases startup time and memory requirements for a program.

If the program is compiled with a fixed base address, the relocation information can be stripped from the file. The resulting program will be smaller and will not be able to load at an address other than its preferred base address [MSDN-library, richter99]. Relocation was required for PureCoverage. Since the instrumentation (see below) inserts statements into the code, the addresses will be offset just as when the program is relocated, and the relocation information is needed to compensate for this.

3.3.3 Instrumentation

When instrumenting regular Windows programs, PureCoverage uses Object Code Insertion. This means that it instruments the compiled program, as opposed to the source code. The instrumentation process inserts additional statements, called probes, into the program; these are used to trace the execution of the tests. This is explained in section 2.2.1.

The instrumentation substantially affects the performance of JRockit. When running Volano – a Java benchmark simulating a client/server chat – with 50 rooms, 20 users and 100 iterations, the instrumented version operates at about 46% of the speed of the uninstrumented version. This indicates what performance degradation can be expected from the instrumentation.

The time required to run the regression tests is thus increased when gathering coverage data. Also, behavior that is timing dependent is likely to change when running with the instrumented version of JRockit. This issue is further discussed in section 5.2.2.

The instrumentation provides line and function coverage data for the tests. Although line coverage is not needed for the simple test selection algorithm used for this project, it was extracted to support more refined selection and to increase the usefulness of the coverage data for purposes other than regression test selection. In particular, much of the analysis performed and presented in this report requires line coverage data.

The code length, measured in lines, also provides a reasonable weighting of the functions. It seems reasonable to assume, for instance, that a function with 50 lines of code in general is more likely to be changed than one containing only a single line of code.


As noted below, the PureCoverage definition of a line is closer to a statement than one might initially suspect. This makes sense, since a function containing 50 blank lines is no more complex than an equivalent function without those 50 empty lines.

3.4 Gathering Coverage Data

This section describes how the coverage data used in this project was collected. First the tests that were used are presented, and then the data itself. Finally, the process used to gather the data is described.

3.4.1 The Tests

To create a test suite for this project, 20 test classes were selected from Echelon, the in-house test suite for JRockit. Each of these is a Java class written for the test framework used. The test classes may in turn contain a number of test methods designed to test some specific aspect of JRockit. Every test class contains at least one test method.

When JRockit is tested, each test is normally executed several times with different options sent to JRockit. This allows the testing of the different garbage collectors and optimization modes. Since those options normally alter the execution behavior of the JVM, and thus alter the produced coverage, they present an important aspect to consider when selecting tests. It was decided to include each test class three times, each time with a different garbage collector. Each such (test class, GC) pair is considered a test case in this report.

Definition 15. Let a test case be a pair of a test and some JVM settings, specifically a test class and a garbage collector.

Selection aims to select such test cases to rerun. A list of the used tests can be found in Appendix A.

3.4.2 PureCoverage Reports

When running a test with the instrumented version of JRockit, a report is produced. This report contains information about what parts of the program were executed, and how many times. Essentially, it contains a list of each probe in the program, and a counter for how many times that probe was reached.

Definition 16. We define a probe as a counter, tracked by PureCoverage. Empty lines, lines with declarations or lines with only comments are not traced. Multiple probes can trace complex textual lines, such as for-loops.

In this project, these probes, rather than the lines they represent, were used as a basis for the “line” coverage. This is syntactically convenient and the results are almost the same. The main difference is that there are more probes than lines. In JRockit, there are 141,432 probes, but only 80,107 lines with code.


In addition, the report also contains some statistical information, notably a list of all functions in the program, their location, number of lines and their coverage statistics. From these reports, a convenient representation of the JRockit source could be constructed, and the execution data could be attached to this representation. This is described in further detail below.

3.4.3 Running the Suite

To produce the required coverage data, the 60 test cases were executed ten times each. The repetition allowed the comparison of different runs of the tests with each other, indicating how stable the results from the test runs are with respect to the achieved coverage. As discussed in section 2.5, this is directly related to how safe the chosen algorithm is. For the controlled regression testing assumption (section 2.5.2) to hold, the coverage must be identical for all ten runs. The opposite is not true, since the same coverage can be obtained if the code is executed in a different order.

A few simple scripts were constructed to run the tests and organize the coverage data. The extraction took a little less than 24 hours and resulted in about 1.8 GB of coverage reports. A single coverage report for JRockit is about 3 MB.

3.5 Analyzing the Coverage Data

This section describes the analysis performed during this project. It starts with a general description of the work and the reason for performing it, and then proceeds to define what, exactly, was sought. The obtained results and a discussion of their implications can be found in chapter 4.

3.5.1 Objective

The point of analyzing the code coverage data is mainly to answer questions that cannot easily be answered by running the prototype test-selection tool. By looking at the obtained coverage, observations can be made about the program under test and about the possibilities of using a regression test selection mechanism.

Specifically, the expected behavior from random changes to the JRockit code, and whether or not better performance can be expected from a more sophisticated algorithm, are examined. Also, the stability of the coverage data indicates how well the controlled regression testing assumption holds.

3.5.2 Problems

Although conceptually simple, the large amounts of data make it difficult to answer questions about the coverage data directly. Collecting the necessary data and extracting meaningful information from the coverage reports required additional tools.


3.5.3 The Coverage API

For this purpose, a small Java API was written. This API allows for processing of the reports and the extraction of relevant information. Given a report from PureCoverage, this API allows for the simple construction of a model of the source, to which the coverage data can be attached and accessed.

The use of a Java API to extract information is in many ways impractical, since it requires the construction of a utility program to extract data. Still, the advantage of not having to spend a lot of time constructing general purpose data analyzers, and the power provided by such an API made it a suitable method for this project.

To keep memory consumption down, the coverage API was designed to allow the source structure to be represented separately from the coverage data. A tree structure representing the source was built from a single PureCoverage report, the inner nodes representing source files and directories, and the leaf nodes representing functions. Lists of the significant line numbers were attached to each function node.

Definition 17. Let significant line mean a piece of code tracked by a probe.

With this tree in place, the coverage data for each report can be attached to the leaf nodes. To further simplify this, the nodes were numbered to allow storage of this data in regular arrays. A parsed report is thus, in essence, nothing more than an array of execution counters for each significant line in the program.
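That separation of structure and data can be sketched as follows: the source model assigns each tracked unit an index once, and a parsed report is then just an int array of hit counters over those indices. The class and method names are illustrative, not the actual Coverage API, and counters are kept per function here for brevity where the thesis keeps them per significant line:

```java
import java.util.HashMap;
import java.util.Map;

public class CoverageModel {
    // Immutable source structure: tracked unit name -> index, built once.
    final Map<String, Integer> functionIndex = new HashMap<>();

    // Assign the next free index to a unit the first time it is seen.
    int intern(String functionName) {
        return functionIndex.computeIfAbsent(functionName,
                k -> functionIndex.size());
    }

    // One parsed report: execution counters, one slot per indexed unit.
    int[] newReport() {
        return new int[functionIndex.size()];
    }

    public static void main(String[] args) {
        CoverageModel model = new CoverageModel();
        int fGc = model.intern("gc_collect");
        int fJit = model.intern("jit_compile");

        int[] run1 = model.newReport();
        run1[fGc] = 17; // gc_collect hit 17 times in this run
        run1[fJit] = 0; // jit_compile never reached

        System.out.println(run1[fGc] > 0); // true: unit covered
    }
}
```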

In addition, a number of utility classes were written. Given a source tree structure and a number of parsed reports, these classes can be used to extract information that is not immediately available. Generally this is done by constructing some aggregate data from the reports, such as the coverage statistics for subtrees of the source, or how many of the reports cover each function.

As mentioned above, utility programs needed to be written to extract the information from the reports. Such a program typically parses a number of reports, initializes one or a few utility classes, hands the reports to those classes, and then scans the constructed aggregate data for the sought-after information.

3.5.4 Coverage

One of the most readily available, yet interesting aspects of the acquired data is the total coverage of the selected tests. Since this is exactly the data we obtain directly from the coverage tool, no special work was required to extract this data.

Three coverage measures are examined. The first (3.2) is the resulting coverage from running the test suite once:

C_1(T) (3.2)

The second (3.3) is the total merged coverage from running the tests ten times. Finally, the coverage from running a HelloWorld application is examined.


⋃_{i=1}^{10} C_i(T) (3.3)
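The merged coverage (3.3) is simply the union of the per-run covered sets. A small helper sketch (names invented):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MergedCoverage {
    // Union of the covered-function sets from several runs (eq. 3.3).
    static Set<String> merge(List<Set<String>> runs) {
        Set<String> union = new HashSet<>();
        for (Set<String> run : runs) union.addAll(run);
        return union;
    }

    public static void main(String[] args) {
        // A function covered in any run counts toward the merged coverage.
        System.out.println(merge(List.of(Set.of("f"), Set.of("f", "g"))));
    }
}
```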

3.5.5 Function Characteristics

Function characteristics is a measure of how many of the tests exercise the different functions. The set of functions is partitioned into five subsets, based on how many tests exercise those functions. For each function f, the value of (3.4) is used to determine which category the function falls into.

|{ t : f ∈ C(t), t ∈ T }| (3.4)

The five categories are “none”, “low”, “medium”, “high“ and “all”, and correspond to 0, 1-10, 11-49, 50-59 and 60 test cases respectively. Recall that there are 60 test cases altogether used for this analysis. These ranges are summarized in Table 1.

Category   Tests
None       0
Low        1–10
Medium     11–49
High       50–59
All        60

Table 1. Number of tests in the categories.

We define FC_category as the size of the respective category. As an example, FC_low is defined as:

FC_low = |{ f : 1 ≤ |{ t : f ∈ C(t), t ∈ T }| ≤ 10 }| (3.5)

This is of immediate interest when studying safe, function-based regression test selection, since it indicates what size can be expected of the selection when a change is made to a single function. A change to a function in the “low” class will, for instance, result in the selection of 1 through 10 tests. Selection is of most use when the “none”, “low” and “medium” categories are large, since this will provide the largest payoff. For testing purposes, however, the “none” category ought to be as small as possible, since it represents untested functionality. When it is too large, effort needs to be made to write new tests, effectively moving functions from “none” to “low” or “medium”.

On the other end we have the “all” category, which indicates that those functions are exercised by all tests. This renders function-based, safe selection useless, since it must select every available test.
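Computing the FC sizes amounts to counting, for each function, how many test cases cover it (3.4) and bucketing that count using the boundaries of Table 1. A sketch with invented coverage data:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class FunctionCharacteristics {
    // Category boundaries from Table 1, for 60 test cases in total.
    static String category(int testsCovering) {
        if (testsCovering == 0)  return "none";
        if (testsCovering <= 10) return "low";
        if (testsCovering <= 49) return "medium";
        if (testsCovering <= 59) return "high";
        return "all";
    }

    // FC_category: count the tests covering each function (eq. 3.4)
    // and tally how many functions land in each category.
    static Map<String, Integer> characteristics(
            Set<String> allFunctions,
            Map<String, Set<String>> coverageByTest) {
        Map<String, Integer> sizes = new HashMap<>();
        for (String f : allFunctions) {
            int covering = 0;
            for (Set<String> covered : coverageByTest.values())
                if (covered.contains(f)) covering++;
            sizes.merge(category(covering), 1, Integer::sum);
        }
        return sizes;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> cov = Map.of(
                "t1", Set.of("f"),
                "t2", Set.of("f", "g"));
        // f is exercised by 2 tests, g by 1, h by none.
        System.out.println(characteristics(Set.of("f", "g", "h"), cov));
    }
}
```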


3.5.6 Weighted Function Characteristics

The same measure is also considered with the functions weighted by their respective number of lines. This essentially partitions each line in JRockit into the five categories. As an example, the number of lines making up functions of the “low” category is:

WFC_low = |{ l : 1 ≤ |{ t : l ∈ C^f(t), t ∈ T }| ≤ 10 }| (3.6)

This measure is useful since it allows the changing of a single line rather than a function to be considered. This makes sense if faults are more likely to occur in functions with much code than in functions with only a few lines.

3.5.7 Line Characteristics

A similar measure not based on function coverage at all was also obtained. The line characteristics were obtained using the same method as above, but based on single statements. This gives an indication of the potential for improvement by using a finer granularity selection algorithm, compared to the function-based one examined in this report.

LC_low = |{ l : 1 ≤ |{ t : l ∈ C^l(t), t ∈ T }| ≤ 10 }| (3.7)

The function-based selection, naturally, disregards differences in the execution trace inside the functions. A finer granularity selection could take these differences into account and thus achieve a higher precision. The results were classified the same way as the function characteristics, and can be compared directly with the weighted function coverage.

The higher granularity measure always yields a lower coverage. This is because uncovered lines appearing in covered functions are no longer counted as covered when using the higher granularity measure.

3.5.8 Function Coverage Reproducibility

It was mentioned earlier that the reliability of the test selection process is highly dependent on the repeatability of the test execution. If a test exercises a function only sporadically, we might exclude tests that could otherwise reveal faults. To assess the likelihood of this scenario, the tests were executed ten times. The resulting sets of covered functions were then examined and the number of sporadically covered functions determined.

Definition 18. A sporadic function is a function that is covered by a test case during at least one, but not all, executions.

A large number of sporadic functions clearly indicates that using test selection based on data from a single execution entails a substantial risk of not selecting all relevant tests. Arguably, the risk of missing these faults existed either way, since the tests only exercised the functions sporadically. However, since selection can only reduce the number of tests to run, the risk is increased when selection is used.

The set of functions F was partitioned into three subsets:

F_U = { f : ∀(i, t), t ∈ T → f ∉ C_i(t) }
F_S = { f : ∃(i, j, t), t ∈ T, f ∈ C_i(t), f ∉ C_j(t) }
F_C = { f : ∀(i, t), t ∈ T → f ∈ C_i(t) }    (3.8)

Here, F_U is the set of all uncovered functions, F_S the set of sporadically covered functions and F_C the set of constantly covered functions over our ten test runs.
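The partition (3.8) can be computed directly from the per-run coverage sets of a test case. A sketch, assuming the runs are available as sets of covered function names:

```java
import java.util.List;
import java.util.Set;

public class SporadicFunctions {
    // Classify a function over several runs: covered in every run ->
    // constant (F_C), covered in some but not all -> sporadic (F_S),
    // covered in none -> uncovered (F_U).
    static String classify(String f, List<Set<String>> runs) {
        long hits = runs.stream().filter(r -> r.contains(f)).count();
        if (hits == 0) return "uncovered";
        if (hits == runs.size()) return "constant";
        return "sporadic";
    }

    public static void main(String[] args) {
        List<Set<String>> runs = List.of(
                Set.of("f", "g"),
                Set.of("f"),
                Set.of("f", "g"));
        System.out.println(classify("f", runs)); // constant
        System.out.println(classify("g", runs)); // sporadic
        System.out.println(classify("h", runs)); // uncovered
    }
}
```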

3.6 Implementing the Selection Tool

To evaluate the selection algorithm with the gathered data and real changes, a simple test selection system was constructed. This system consists of two simple utilities: a tool that builds an index file mapping test cases to exercised functions, and another that, given a change number, fetches the differences and the modified source code corresponding to that change and determines which functions are affected by the change.

The mapping between changes and functions was implemented in a simple fashion. Perforce, the versioning system, provides diffs between the two versions of the source file, as well as the source itself for either of them. A simple utility mapping line numbers to function names was implemented, and the line numbers from the diff were used to translate this information into a set of altered functions.

Changes that are not localized inside a particular function currently cause the entire test-suite to be selected. Such changes may potentially alter the behavior of multiple functions, and a choice was made not to attempt to resolve what functions could, in fact, be affected.

If all changes are found within functions, the test selection algorithm is run. The intersection between the set of altered functions and the set of exercised functions for each test case is checked. Whenever such an intersection is non-empty, the corresponding test case is selected for re-execution.
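The line-to-function mapping step can be sketched as a floor lookup in an index of function start lines. The names and line ranges here are invented for illustration; the real tool reads Perforce diffs:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

public class ChangeMapper {
    // A function occupies a line range [startLine, endLine] in its file.
    record Func(String name, int endLine) {}

    // startLine -> function; a floor lookup maps each changed line
    // onto the function enclosing it, if any.
    static Set<String> changedFunctions(NavigableMap<Integer, Func> index,
                                        List<Integer> changedLines) {
        Set<String> changed = new HashSet<>();
        for (int line : changedLines) {
            Map.Entry<Integer, Func> e = index.floorEntry(line);
            if (e != null && line <= e.getValue().endLine())
                changed.add(e.getValue().name());
            // A change outside every function would make the tool
            // fall back to selecting the whole suite.
        }
        return changed;
    }

    public static void main(String[] args) {
        NavigableMap<Integer, Func> index = new TreeMap<>();
        index.put(10, new Func("gc_collect", 40));
        index.put(50, new Func("jit_compile", 90));
        // Lines 12 and 55 fall inside the two indexed functions.
        System.out.println(changedFunctions(index, List.of(12, 55)));
    }
}
```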

As noted above, the current mapping of changes to functions is very primitive and hardly suitable for real use, as it severely limits the use of the selection utility. The mapping utility is purely based on textual differences, meaning that any difference is detected and treated as a change in the program. Clearly, this is not always the case. For instance, if only a comment is added or modified, or if a simple formatting change is made within the source, the program remains unchanged. Further, such a non-change outside the boundaries of a function causes the process to “fail” at mapping the changes to functions, rendering the selection useless. It is clearly possible to improve this process by comparing the source files after converting them into a canonical form, for instance by stripping comments and whitespace. Better implementations exist [chen94, vokolos97], but are more difficult to construct.

3.7 Selection Evaluation

To test the selection, 100 changes affecting the JRockit source directory were used. Of those, only 22 contain changes solely within functions. The other 78 changes currently result in the selection of the entire test suite, since the change-to-function mapping fails.

Selection was executed on the 22 well-behaved changes, and the selected test cases were saved. The selection was executed ten times, based on the coverage data obtained from the ten test runs.

It is important to note that the well-behaved changes are not likely to be typical changes. That only these changes were used as a basis for the evaluation severely limits the conclusions that can be drawn from the selection evaluation.

4 Results

This chapter presents the specific results and their implications. A discussion of the results in general and the outcome of the project can be found in the following chapter.

First, general measures are explored, which describe the coverage data for the selected tests. This is followed by coverage characteristic measures, which essentially describe the expected outcome of a safe test selection on random changes within the system. Next, the repeatability of the execution data is examined. The differences in coverage clearly indicate that the controlled regression test assumption fails for JRockit, and that achieving safe regression test selection will be difficult at best. Finally, the results of the selection evaluation are presented. The corresponding measures have been defined in sections 3.5.4 through 3.5.8 and 3.7.

4.1 Coverage

To begin with, the line and function coverage obtained from running the used tests were studied. This is in itself not very interesting, since only a part of the test suite used for JRockit was included in this project. However, in addition to the coverage from running the test suite, the coverage obtained from running HelloWorld with the default flags was examined.

Runs              C_F    F      C_F/F   C_L     L       C_L/L
Suite, one run    3241   6267   0.52    70401   141432  0.50
Suite, ten runs   3274   6267   0.52    71377   141432  0.50
HelloWorld        2187   6267   0.35    42982   141432  0.30

Table 2. Coverage statistics for used suite and HelloWorld.

In Table 2, the coverage obtained from running the test suite once (3.2), from running the suite ten times (3.3) and from running HelloWorld can be seen.

The relatively high coverage obtained from running HelloWorld is interesting. Although the program does next to nothing, almost a third of the code in the JVM is exercised. This suggests that a large part of the code will be exercised almost regardless of what program is executed.

With a safe selection algorithm, changes made in this code will cause the entire test suite to be selected, thus not yielding any savings at all. That a trivial program covers such a large portion of the code indicates that a safe selection algorithm might not provide a very high payoff.

The increase in coverage obtained when the entire suite was used indicates the portions of the code that are exercised specifically by the tests. Thus, when studying the tests used for this report, it is reasonable to expect that a change in those 17% of the code could yield an interesting selection. This is verified in section 4.2.1 for all but about 50 of the functions; most likely, these are exercised by the test framework, and thus fall into the "all" category.

Directory               C_F    F      C_F/F   C_L     L       C_L/L
\jvm                    45     243    0.19    690     2579    0.27
\jvm\code               83     131    0.63    1383    2746    0.50
\jvm\code\cache         12     148    0.08    74      3855    0.02
\jvm\code\gen           641    1203   0.53    24591   41109   0.60
\jvm\code\ir            425    603    0.70    7930    12510   0.63
\jvm\code\opt           253    333    0.76    9534    13596   0.70
\jvm\code\platform      22     36     0.61    174     411     0.42
\jvm\mm                 93     170    0.55    1073    2887    0.37
\jvm\mm\alloc           46     79     0.58    793     1677    0.47
\jvm\mm\gc              320    421    0.76    5512    8450    0.65
\jvm\mm\lowlevel        53     101    0.52    591     1576    0.38
\jvm\model              38     85     0.45    682     1715    0.40
\jvm\model\classload    122    166    0.73    2696    4218    0.64
\jvm\model\reflect      137    212    0.65    2119    3899    0.54
\jvm\model\strings      45     59     0.76    625     927     0.67
\jvm\native             12     44     0.27    197     1085    0.18
\jvm\native\exceptions  26     79     0.33    455     1872    0.24
\jvm\native\jni         218    750    0.29    2308    10608   0.22
\jvm\native\rni         186    608    0.31    2995    11905   0.25
\jvm\native\system      9      14     0.64    91      277     0.33
\jvm\threads            137    169    0.81    1459    2178    0.67
\jvm\threads\context    17     19     0.89    136     187     0.73
\jvm\threads\fatthreads 56     68     0.82    810     1187    0.68
\jvm\threads\pd         33     33     1.00    356     460     0.77
\jvm\threads\sync       13     34     0.38    191     789     0.24
\jvm\util               169    344    0.49    1996    5079    0.39

Table 3. Coverage statistics for the directory tree. Based on the first test suite execution.

The fact that we further increase coverage by rerunning the test suite more than once indicates that the controlled regression test assumption fails for JRockit. This is discussed further in chapter 5.

To get a sense of which components of JRockit are well covered by the tests, coverage statistics for specific directories in the source tree were also obtained. The directory statistics are presented in Table 3. We can see, for instance, that the /jvm/code/cache directory was poorly covered. As can be seen in Appendix A, none of the used tests test code caching, so this is to be expected.

The coverage data presented in Table 3 is based on the first run through the test suite. To keep the overview manageable, only the first two directory levels are shown; deeper subdirectories have been recursively included in the statistics of their nearest shown ancestor.

4.2 Coverage Characteristics

The coverage characteristics measure considers the number of tests that exercise a given function or line. This has an obvious application to safe regression test selection based on modification traversal: if a part of the code that is traversed by N tests is changed, those N tests should be selected.

Although function coverage is used for the regression test selection algorithm, the line coverage measure can be interesting as an indication of the potential improvement in precision with increased coverage granularity. If the line coverage based characteristics are considerably better than the function coverage based characteristics, higher granularity selection is likely to provide notably better selection.

4.2.1 Function Coverage Characteristics

The function coverage characteristics provide the best indication of the expected performance of our regression test selection algorithm when changes are made to a random function in JRockit. Recall that the set of functions was partitioned into five subsets: FC_none, FC_low, FC_medium, FC_high and FC_all (formulas (3.4) and (3.5)).

The results of classifying the functions this way, based on the first run through the tests, can be found in Figure 1.

If a function is in the "none" category, no tests will be selected when it is changed. This indicates a weakness in the test suite; ideally, new tests should be written to test the altered functionality. In the "low", "medium" and "high" categories, some tests will be selected and others left out. Ideally, all functions should be in these categories, since every change could then be tested without re-executing the entire test suite.

[Figure 1. Function coverage characteristics. Number of functions in the FC categories: None 3026 (48%), Low 356 (6%), Medium 640 (10%), High 4 (0%), All 2241 (36%).]

Finally, changes made to a function in the “all” category will result in every test being selected. This means that the selection process will not provide any savings whatsoever. If changes affect functions in this category, the test selection algorithm becomes useless.

As can be seen in Figure 1, 36% of the functions are exercised by every test available. This is understandable given that the HelloWorld program covered 30% of the JRockit code (see Table 2). The additional 6% can have a number of explanations; most likely, it is largely an effect of the common test framework.

It is worth noting that including more tests in the categorization will invariably increase the size of the "low", "medium" and "high" categories at the expense of "none" and "all", which can never grow with the addition of new tests. In the extreme case of a suite with a single test case, all functions fall into those two categories.

We could drastically increase the theoretical efficiency of our selection by disregarding functions in "all", replacing the safe selection with some other selection for changes in those functions. This makes sense because the core functions, being exercised by every test, are likely to be well tested either way. However, faults in the core functions could be argued to be more serious, since those functions are used by all applications.

By the same reasoning, that the core functions are likely to be the best tested, we can assume that the fault density in those functions is lower than in the rest of the system. As the product stabilizes and the number of faults decreases, those functions ought to reach a higher quality faster than the less often tested ones, and thus be the target of modifications less frequently.

4.2.1.1 Influence of the Garbage Collector

As mentioned in chapter 3, the test cases used for this project were a set of 20 tests, each executed with three different garbage collectors. This was done since the garbage collector is a central part of the JVM, and it is also the only major component that can easily be switched between invocations. Assuming that the garbage collectors have at least partially separate implementations, switching garbage collectors should also change the coverage. As a consequence, a change made in one of the garbage collectors should result in tests using the other collectors being omitted.

Category   Tests
None       0
Low        1-3
Medium     4-16
High       17-19
All        20

Table 4. Number of tests in the modified categories.

To investigate the impact of the garbage collection algorithm on the function characteristics, the same classification was made separately for the test cases using each garbage collector. Since there were only 20 tests per collector, the boundaries of the classification were changed accordingly; see Table 4.
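The classification with the Table 4 boundaries can be sketched as follows; the function name and the hard-coded boundaries for the 20-test case are taken from Table 4, everything else is an illustrative assumption:

```python
def classify(n_covering_tests: int, n_tests: int = 20) -> str:
    """Place a function in a coverage-characteristics category based on
    how many of the test cases exercised it. Boundaries follow Table 4
    (the 20-test, single-garbage-collector case)."""
    if n_covering_tests == 0:
        return "none"          # exercised by no test
    if n_covering_tests <= 3:
        return "low"           # 1-3 tests
    if n_covering_tests <= 16:
        return "medium"        # 4-16 tests
    if n_covering_tests < n_tests:
        return "high"          # 17-19 tests
    return "all"               # every test

[classify(n) for n in (0, 2, 10, 18, 20)]
# one sample from each category, in order
```

Counting how many functions land in each category over the whole function set reproduces the kind of distribution shown in Figures 1 and 2.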

If the used garbage collection algorithm substantially affects the coverage, the resulting characteristics should be less favorable for selection when the algorithms are examined in isolation. For instance, if many of the functions in the “medium” category in Figure 1 are specific for the different garbage collectors, these should appear in the “all” category when only tests using a single garbage collector are examined.

[Figure 2. Function coverage characteristics, suite partitioned by GC (Parallel, Gencon, Singlecon). Parallel: None 3210 (51%), Low 188 (3%), Medium 620 (10%), High 3 (0%), All 2246 (36%). Gencon: None 3124 (50%), Low 208 (3%), Medium 663 (11%), High 1 (0%), All 2271 (36%). Singlecon: None 3177 (51%), Low 153 (2%), Medium 680 (11%), High 3 (0%), All 2254 (36%).]

The characteristics are similar when considering the test cases for each GC separately (Figure 2). The “none” and “all” categories have grown slightly, as they are expected to when fewer and more similar tests are grouped together. Although several differences can be noted, the characteristics are largely the same even with the garbage collector constant. This indicates that the difference in coverage between test cases is not primarily due to the choice of garbage collector.

4.2.2 Weighted Function Coverage Characteristics

As discussed in section 3.5.6, the weighted function coverage characteristics model the probability that a random change in the code triggers certain selection outcomes.

[Figure 3. Function coverage characteristics, functions weighted by number of lines: None 53707 (38%), Few 8402 (6%), Medium 21275 (15%), High 40 (0%), All 58008 (41%).]

In Figure 3 it can be seen that the commonly covered functions carry a higher weight than the uncommonly covered ones. This suggests, not surprisingly, that long functions are more likely to be tested. Comparing with Figure 1, we can see, for instance, that the 2241 functions exercised by all tests contain more lines than the 3026 uncovered functions. From Figure 3 we can conclude that a series of random changes to single functions, with selection for each change, would select approximately half of the tests.

The expected outcome of selection based on this measure can be calculated more precisely by considering the average number of tests selected for a change at each sampling point.

    (1/|L|) Σ_{l ∈ L} |{ t ∈ T : l ∈ C_f(t) }|        (4.1)

This formula yields 27.62 when applied to the acquired coverage data. As expected, random changes will on average result in a little less than half of the 60 test cases being selected.
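A direct computation of this average over toy data can look as follows; the data layout and all names are hypothetical, and the tiny data set is for illustration only, not the thesis data:

```python
def average_selection(line_universe, coverage_f):
    """Average, over all lines l in L, of the number of tests t whose
    function-level coverage C_f(t) contains l (cf. formula (4.1)).

    line_universe: iterable of all lines L in the program.
    coverage_f: dict test -> set of lines in functions the test exercised.
    """
    lines = list(line_universe)
    total = sum(sum(1 for cov in coverage_f.values() if line in cov)
                for line in lines)
    return total / len(lines)

# Toy data: 4 lines, 3 tests.
cov = {"t1": {1, 2}, "t2": {1, 2, 3}, "t3": {1}}
average_selection([1, 2, 3, 4], cov)  # (3 + 2 + 1 + 0) / 4 = 1.5
```

Running the same computation with line-level coverage sets C_l(t) instead of C_f(t) gives the line-granularity estimate used later in the chapter.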

4.2.3 Line Coverage Characteristics

As mentioned above, the line coverage characteristics can serve as an indication of the potential for improvement with a similar regression test selection scheme based on units smaller than functions. Although mapping changes to line coverage information, to determine which tests exercise modified code, is nontrivial, there are systems that facilitate such selection, notably Pythia [vokolos97], a regression test selection tool likewise based on textual differences in the source code. Several other methods facilitating higher precision selection have also been implemented [rothermel96].

Comparing Figure 4 to Figure 3, it can be seen that the expected selection is substantially lower when line coverage data is used compared to when function coverage data is used.

[Figure 4. Line coverage characteristics: None 71031 (50%), Few 8612 (6%), Medium 16712 (12%), High 43 (0%), All 45034 (32%).]

This can be verified by calculating the corresponding average selection, but based on line coverage instead of function coverage:

    (1/|L|) Σ_{l ∈ L} |{ t ∈ T : l ∈ C_l(t) }|        (4.2)

From the same coverage data, this yields 21.49 tests. A selection algorithm based on line coverage as opposed to function coverage would, in this instance, reduce the size of the selection set by 22%. This implies that, as expected, a higher granularity selection would produce better results.

4.3 Reproducibility

As discussed in section 2.5, the reproducibility of test runs determines whether or not safe selection can be facilitated by selecting modification-traversing tests. This is why the variations in function coverage across multiple executions were examined.

It was described in section 3.5.8 how the set of functions was partitioned into three subsets, depending on how many of the test executions exercised them. The results of this classification can be seen in Figure 5.

939 functions were found to be sporadic by running our 60 test cases ten times. For each test case, an average of 77 sporadic functions was detected, with a maximum of 389 and a minimum of 1. No test case yielded the same function coverage for all ten executions.
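The partitioning into uncovered, stable and sporadic functions can be sketched as follows; the representation of a run as a set of exercised functions, and all names, are illustrative assumptions:

```python
def partition(function_names, runs):
    """Partition functions into uncovered / stable / sporadic based on
    several coverage-gathering runs (cf. section 3.5.8).

    runs: list of sets, each holding the functions exercised in one run.
    """
    result = {"uncovered": set(), "stable": set(), "sporadic": set()}
    for fn in function_names:
        hits = sum(1 for run in runs if fn in run)
        if hits == 0:
            result["uncovered"].add(fn)     # never exercised
        elif hits == len(runs):
            result["stable"].add(fn)        # exercised in every run
        else:
            result["sporadic"].add(fn)      # exercised only sometimes
    return result

runs = [{"a", "b"}, {"a", "b", "c"}, {"a", "b"}]
partition({"a", "b", "c", "d"}, runs)
# a and b are stable, c is sporadic, d is uncovered
```

The sporadic set is what undermines a modification-traversing selection: a change to such a function may or may not look covered, depending on which run supplied the coverage data.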

As can be seen from Figure 5, large parts of the JRockit code display what we call sporadic behavior, and changes to such code may cause selection algorithms based on modification traversal to miss tests. The amount of sporadic code found is a lower bound; more sporadic code may be found by running more tests.

[Figure 5. Left: the execution behavior of functions, as observed during ten runs of the used 60 test case suite: Sporadic 939 (15%), Uncovered 2991 (48%), Stable 2337 (37%). Right: the execution behavior of lines: Sporadic 28208 (20%), Uncovered 52774 (37%), Stable 60450 (43%).]

The indeterministic behavior of JRockit can be explained by its dependency on multiple concurrently running threads interacting in a timing-dependent manner. Notably, the progressive optimization executes concurrently with the program, sampling execution progress at various times to determine which methods need optimizing.

Obviously, the results of such a procedure depend on how the operating system schedules the threads: how far they manage to execute and in what order they run. In short, we are far from controlled regression testing. To make matters worse, all tests were run on a single machine, with a single CPU and the same settings each time; if anything, this ought to make prediction easier.

Although the timing sensitivity of JRockit was not unknown, this did not necessarily mean that the obtained coverage would be unstable. How often and to what degree this occurred was previously not known at all.

4.4 Savings on Real Changes

Finally, in spite of the indeterministic behavior of JRockit, the proposed selection algorithm was tested on some actual changes. As mentioned in section 3.7, 22 modifications that could be parsed successfully were identified. For each of those, ten selections were made, based on the coverage data from the ten runs.

The total number of tests selected for all 22 modifications was calculated for each of the runs; it varied between 154 and 160. Only two of the modifications affected functions exercised by all tests, and only another two affected functions that were exercised at all; the remaining 18 changes resulted in empty selections. For comparison, 22 random changes to single functions would be expected to trigger about 600 tests (22 changes, each selecting about half of the 60 tests).

The low number of tests selected can have a number of causes, as mentioned in section 3.7. It is worth considering the possibility, however, that the line of reasoning in section 4.2.1 is correct; in other words, that faults are less common in the most tested parts of the program. Since the functions exercised by many or all of the tests are tested frequently, it is not unreasonable to expect them to contain fewer faults, and thus to be the target of modifications less often.

Modifications in new functionality that is not yet covered by the test suite will also result in empty selections. Many of the 18 modifications may fall into this category, since the test suite used for this project was so small.

On a final note, the unsafe nature of the selection algorithm was shown during the sample selections. Had the selection of modification-traversing tests been safe, the same tests would have been selected in each of the ten attempts.

5 Discussion

This chapter discusses the conclusions drawn and observations made during the project, as well as possible sources of error. The possible use of the evaluated regression test selection scheme is discussed, along with possible improvements and alternatives.

5.1 Observations

This section describes the main observations made throughout the project: that most safe selection algorithms will not work with JRockit due to its indeterministic execution behavior, and that a selection scheme can be used to reduce the test load substantially.

5.1.1 No Safe Selection

The coverage data obtained from JRockit clearly indicates that its execution behavior is indeterministic. As shown in section 4.3, and discussed in section 3.5.8, large parts of the JRockit code cannot be used to reliably select tests. The sporadic code may not be exercised when coverage data is gathered, and tests that would potentially expose faults when this code is modified may then not be selected.

Clearly, the controlled regression testing assumption does not hold for JRockit. Although most factors are held constant, timing-dependent behaviors such as thread scheduling affect the execution, and thus the resulting coverage data. The set of modification-revealing tests for P′ is necessarily a subset of the modification-traversing tests for P only when this assumption holds [rothermel96].

In addition, any practical application of a coverage based regression test selection algorithm is likely to suffer from coverage data decay, resulting in additional uncertainty with the selection. This is further discussed in section 5.3.5.

5.1.2 Selection Efficiency

Analysis of the coverage data in section 4.2.2 indicated that a single random change in the code would on average cause about half of the available tests to be selected. Since real modifications often affect more than one part of the code, this should be considered a low estimate of the expected selection size. The result sets from the performed selection were substantially smaller (section 4.4), indicating a higher actual efficiency with real changes than expected from random changes.

Although the selection evaluation was limited in scope, and the changes used were those that happened to work with the primitive change-to-function mapping mechanism, this at least suggests that changes are less likely to occur in commonly covered code. This makes sense, since the code exercised by all tests is likely to be thoroughly tested much earlier than the rest of the code. It also suggests that selection would become more and more beneficial over time, as the core functionality stabilizes.

Selection is also likely to become more and more useful as more tests are developed. When the test suite grows, the potential benefit for selection becomes larger. This is particularly true when tests become less aimed at testing core functionality and more focused on peripheral code.

Whatever the reason, the selected test sets for the evaluated changes were surprisingly small. This suggests that a similar selection algorithm may be of practical use. However, the used change set limits the conclusions that can be drawn from these results.

5.2 Risks

This section discusses the primary sources of potential misinformation in the report. It is an attempt to identify risks and pinpoint weaknesses in the work that has been done.

In general, regarding the results with a certain skepticism might be a good idea. The small scope of the project limits the conclusions that can safely be drawn from the observations. Nevertheless, the results can be useful for directing future work.

5.2.1 Test Suite

The test suite used during the project, as discussed in section 3.4.1, consisted of 20 test classes. Although not an exceptionally small set of tests, it is still only a small part of the suite used to test JRockit. As always, such a choice risks selecting tests that are not representative of the entire suite.

Perhaps more importantly, the obtained coverage data is guaranteed not to be representative of the entire suite. The coverage characteristics for a large test suite will place less code in the "uncovered" and "covered by all" (or "none" and "all") categories than a smaller suite will. The total coverage is also likely smaller with a small subset of the suite.

5.2.2 Instrumentation

The JRockit version used was instrumented with PureCoverage. As mentioned earlier, it is not unlikely that this instrumentation affected the runtime behavior of the tests. Timing-dependent behavior, such as the optimization, is likely to be affected, since the actual code that executes is altered. As mentioned in section 3.3.3, the instrumented version of JRockit executed at less than half the speed of the uninstrumented version.

How much of an impact this has on the coverage data obtained is difficult to estimate, since the coverage is unknown when running an uninstrumented version.

5.2.3 Testing Environment

The coverage data and test selections used in this project were all obtained from a single computer. If selection is to be used in practice, this might not be a good idea.

As mentioned above, timing-dependent behavior is not unlikely to affect the obtained coverage data, and thus the selection itself. This suggests that coverage data might also be affected by hardware or software configuration. In particular, the timing-dependent behavior can be expected to change between single-CPU and multi-CPU machines.

It can be argued that this is a general problem with testing software that is in some way indeterministic. A test that works on a particular machine is not guaranteed to work properly on another. Failures might be detected on fast servers but not on older workstations, or vice versa. Worse, a test may produce failures only on some computers, on rare occasions. Unfortunately, this problem can be made worse by test selection, since such tests then risk being executed less often and on fewer configurations.

5.2.4 Used Modifications

As mentioned in section 3.7, the modifications used to evaluate the selection tool were selected based on their suitability for the implemented tool. This must be kept in mind, since they are very likely not a representative subset. To improve the quality of the obtained results, a better mapping algorithm is required. In the absolute worst case, the used changes represent the 22 simplest changes; even then, the results serve as an indication of how selection performs on this category of changes.

Further, the used modifications are all from within a single week, and may thus reflect some temporary condition. To draw safe conclusions, modifications from many different points in time, based on coverage data from those same times, would be required. Ideally, a selection mechanism could be set up and data collected continually over a long period of time.

5.3 Selection for the JRockit Team

This section discusses the use of regression test selection for JRockit. First, the examined algorithm is discussed, followed by reasoning around regression test selection in general, and the possible alternatives that might be of use.

5.3.1 The Examined Algorithm

The selection algorithm investigated in this report, described in section 3.1, could be used to approximate safe regression test selection for JRockit if the implementation were improved to correctly handle all changes. This could be done, for instance, by comparing the source files after running them through a source code formatter, and by actually mapping changes to data structures to the possibly affected functions.

It is important to note that the examined selection algorithm, although it could certainly be used, is by no means safe. It can be used to focus testing efforts toward test cases where the altered functions are known to have been exercised in the past. It is uncertain whether they will be exercised when the tests are re-executed, or whether all test cases that would exercise the modified functions are selected.

5.3.2 Test Selection for Verification

When changes are made to the JRockit code base, it is desirable to ensure, before they are submitted, that no serious new faults are introduced. For such a verification procedure, it is desirable to limit the time the procedure may take, to provide fast feedback to the developers.

With limited time available for verification, regression test selection can be used to select a subset of the available test suite, attempting to maximize the chance of detecting faults in the available time. If few tests exercising the modified code are found, those tests may be stressed further, hopefully increasing the chance of detecting faults.

5.3.3 Selection for Continuous Testing

As suggested in section 1.2, the periodic re-execution of tests is not likely to be best served by the examined selection algorithm. Since many changes are generally committed between each of the full regression testing runs, the size of the selected test sets is likely to be very large.

Selection based on source modifications is better suited the fewer and smaller the changes that have been made. The more changes that have been made, the closer the selection will come to the full test suite, and the entire procedure becomes a waste of time.

5.3.4 Justifying Risks

It should be clear that a coverage based regression test selection scheme could be used to substantially cut the number of tests to rerun when time is critical. It should be equally clear that such a procedure does, in fact, introduce a risk of not detecting faults that would otherwise have been detected.

This risk can largely be offset by also running the entire test suite regularly. This testing, however, can be done when there is free machine time available. The risk of missing faults is thus reduced to a risk of finding them at a later time.

5.3.5 Coverage Data Decay

The coverage data used as a basis for test selection becomes obsolete as changes are made to the program. To maintain reasonable selection data, the tests need to be regularly re-executed with new, instrumented versions of JRockit. Ideally, this should be done after every change is submitted. However, since running the tests with the instrumented version of JRockit is significantly slower than with the uninstrumented version, this would increase the test load substantially and reduce response time only for those changes tested after the coverage data is ready.

The less often this update can be performed, the larger the risk that the coverage data is obsolete, which may result in poor selection. Although updating the coverage data may consume a considerable amount of machine time, it is necessary for maintaining a reliable selection. If this work can be performed continuously, but with a lower priority than the testing, fresh coverage data can be maintained and response time reduced. The more machines that are available, the more up to date the data can be kept.

5.3.6 Alternative Approaches to Selection

As the execution behavior of JRockit does not allow for safe coverage-based test selection, a minimization or coverage technique might be more appropriate. Such selection techniques can reduce the set of tests further than the safe techniques, ensuring coverage rather than the execution of all detected relevant tests. In particular, this makes sense when the algorithms that select modification-traversing tests select all or close to all tests.

A third option is to deploy a selection mechanism adapted specifically to JRockit. For instance, a hybrid approach could be used in which an approximately safe selection, like the one investigated here, is applied to changes in functions covered by less than some percentage of the tests, while a coverage criterion is used to select tests for changes in more commonly covered functions. This would avoid rerunning the entire test suite when changes are made in the most commonly executed functions, assuming instead that these functions will be thoroughly tested shortly either way.
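A minimal sketch of such a hybrid scheme, assuming a hypothetical 50% threshold and invented test and function names: a function covered by fewer than half the tests triggers safe-style selection of every covering test, while a commonly covered function contributes only a single test toward coverage.

```java
import java.util.*;

// Minimal sketch of the hybrid scheme. The 50% threshold is a made-up
// value that would have to be tuned for the actual test suite.
public class HybridSelector {
    private static final double THRESHOLD = 0.5;

    // coverage: test name -> functions that test executed.
    public static Set<String> select(Map<String, Set<String>> coverage,
                                     Set<String> changed) {
        int suiteSize = coverage.size();
        Set<String> selected = new TreeSet<>();
        for (String fn : changed) {
            List<String> covering = new ArrayList<>();
            for (Map.Entry<String, Set<String>> e : coverage.entrySet())
                if (e.getValue().contains(fn)) covering.add(e.getKey());
            if (covering.isEmpty()) continue;
            Collections.sort(covering);
            if ((double) covering.size() / suiteSize < THRESHOLD) {
                // Rarely covered function: approximately safe selection,
                // rerun every test that covers it.
                selected.addAll(covering);
            } else {
                // Commonly covered function: a coverage criterion only
                // requires one test that reaches it.
                selected.add(covering.get(0));
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> cov = new TreeMap<>();
        cov.put("T1", Set.of("f", "g"));
        cov.put("T2", Set.of("g"));
        cov.put("T3", Set.of("g", "h"));
        cov.put("T4", Set.of("h"));
        System.out.println(select(cov, Set.of("f", "h"))); // prints [T1, T3]
    }
}
```

In the example run, "f" is covered by a single test and so selects it safely, while "h" is covered by half the suite and contributes only one representative test.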


5.3.7 Conclusion

As noted in section 2.5, the choice of selection algorithm is best made once the use for such selection has been identified. Although speculation on possible solutions can provide rich soil for theories and ideas, the choice of algorithm depends heavily on what use is in mind. An aggressive algorithm might be best for change validation, whereas a less precise but more inclusive algorithm is better suited for regular but less frequent testing.

I believe that the use of regression test selection could improve the testing process for the JRockit team. It is not certain, however, that such an investment would be profitable, especially not in the short term.


6 Future Work

The limited scope of this project leaves many interesting questions unanswered. A more thorough evaluation of selection algorithms would be particularly interesting: implementing and integrating a selection system and running it alongside a retest-all setup, comparing the results over an extended period of time. This would allow a comparison not only of the size of the selected test sets, but also of running times, fault detection ability and required computation time for selection or background processing.

In general, such research is in short supply. The practical application of the many regression test techniques that have been proposed during the last decades is often not sufficiently evaluated. Although some experiments have been made [chen94, liu99, rothermel96, rothermel97, rothermel02b, vokolos98], large scale or real life evaluation is lacking.

Questions such as whether test selection techniques are profitable to employ for large-scale, or even small-scale, software development projects thus remain unanswered. Comparative studies of one or a few regression test selection techniques on a real software project would be most welcome.


References

[bates93] Samuel Bates, Susan Horwitz. Incremental Program Testing using Program Dependence Graphs. Proceedings of the 20th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 384 to 396. March 1993.

[chen94] Yih-Farn Chen, David Rosenblum, Kiem-Phong Vo. TestTube: A System for Selective Regression Testing. Proc. 16th Int’l Conference on Software Engineering, IEEE Computer Society, pages 211 to 220. May 1994.

[jrockit03] BEA WebLogic JRockit™: Java for the Enterprise, Technical Whitepaper, Bea Systems, http://www.bea.com, 2003.

[kaner99] Cem Kaner, Jack Falk, Hung Quoc Nguyen. Testing Computer Software. Wiley Computer Publishing, 1999. ISBN 0-471-35846-0.

[liu99] Yi Liu. Regression Testing Experiments and Infrastructure. Research project at Oregon State University, June 1999.

[malaiya94] Yashwant K. Malaiya. Software Reliability Growth with Test Coverage. IEEE Transactions on Reliability, Vol. 51, No. 4, December 2002.

[richter99] Jeffrey Richter. Programming Applications for Microsoft Windows. Microsoft Press, 1999. ISBN 1-57231-996-8.

[rothermel94a] Gregg Rothermel, Mary Jean Harrold. A Framework for Evaluating Regression Test Selection Techniques. Proc. 16th Int’l Conference on Software Engineering, pages 201 to 210. May 1994.


[rothermel94b] Gregg Rothermel, Mary Jean Harrold. Selecting Tests and Identifying Test Coverage Requirements for Modified Software. Proc. ACM Int’l Symposium on Software Testing and Analysis, pages 169 to 184. August 1994.

[rothermel96] Gregg Rothermel, Mary Jean Harrold. Analyzing Regression Test Selection Techniques. IEEE Transactions on Software Engineering, Vol. 22, No. 8, pages 529 to 551, August 1996.

[rothermel02a] Gregg Rothermel, Alexey G. Malishevsky, Sebastian Elbaum. Modeling the Cost-Benefits Tradeoffs for Regression Testing Techniques. Proc. Int'l Conference on Software Maintenance, pages 204 to 213, October 2002.

[rothermel02b] Gregg Rothermel, Sebastian Elbaum, Alexey Malishevsky, Praveen Kallakuri, Brian Davia. The Impact of Test Suite Granularity on the Cost-Effectiveness of Regression Testing. Proc. 24th Int’l Conference on Software Engineering, pages 230 to 240. May 2002.

[tikir02] Mustafa Tikir, Jeffrey Hollingsworth. Efficient Instrumentation for Code Coverage Testing. ACM SIGSOFT Software Engineering Notes, Proc. Intl. symposium on Software testing and analysis, Vol. 27, Issue 4. July 2002.

[vokolos97] Filippos Vokolos, Phyllis Frankl. Pythia: A Regression Test Selection Tool based on Textual Differencing. Int’l Conference on Reliability, Quality and Safety of Software-Intensive Systems (ENCRESS ’97). May 1997.

[vokolos98] Filippos Vokolos, Phyllis Frankl. Empirical Evaluation of the Textual Differencing Regression Testing Technique. Int’l Conference on Software Maintenance. November 1998.

[wong95] Eric Wong, Joseph Horgan, Saul London, Aditya Mathur. Effect of Test Set Minimization on Fault Detection Effectiveness. Proc. 17th Int’l Conference on Software Engineering, April 1995.

[zhu97] Hong Zhu, Patrick Hall, John May. Software Unit Test Coverage and Adequacy. ACM Computing Surveys, Vol. 29, No. 4, December 1997.


Appendix A: The Tests

Below is a list of the test classes that were used to evaluate regression test selection for JRockit in this project. Each entry contains the name of the test class (bold), the parameters used (italics) and a brief description of what the test does (normal).

1. jrockit.qa.tests.codegen.classload.ReflectTest cName=jrockit.qa.tests.codegen.classload.RF2 Tests reflection: Loads classes through reflection and ensures that the obtained classes are described as expected when examined using reflection.

2. jrockit.qa.tests.gc.HashValueRegressionTest iter=5 minNoObj=50000 maxNoObj=100000 rmMode=2 prob=50 timing=true randomSeed=2 beVerbose=false Tests hash value computation, Hashtable behavior and synchronization of Objects with cached hashcodes.

3. jrockit.qa.tests.nio.channels.BlockingReadWriteTest channelfactory=jrockit.qa.tests.nio.channels.SocketChannelFactory bufferfactory=jrockit.qa.tests.nio.DirectByteBufferFactory size=500 maxbuffer=17 readers=1 writers=1 Tests the blocking reading and writing operations of NIO SocketChannels.


4. jrockit.qa.tests.misc.Hinkar None General test. Solves the “Bucket-problem”. This test uses a variety of common basic constructs.

5. jrockit.qa.tests.nio.BufferTest factory=jrockit.qa.tests.nio.DirectByteBufferFactory Tests the java.nio.ByteBuffer using a direct buffer.

6. jrockit.qa.tests.nio.BufferTest factory=jrockit.qa.tests.nio.LightweightByteBufferFactory Tests the java.nio.ByteBuffer using a buffer that is not direct.

7. jrockit.qa.tests.codegen.reflect.AbstractMethods None Tests that implicit methods, meaning inherited methods, are reported by Class.getMethods().

8. jrockit.qa.tests.codegen.exceptions.ExceptionInInitializer None Loads classes with errors to ensure that the correct exceptions are thrown. Validates error handling when there are problems with class loading.

9. jrockit.qa.tests.io.MultiReadZipTest iter=1 Tests reading from a zip-file with multiple concurrent threads.

10. jrockit.qa.tests.thread.WaitTest iter=8 Tests the wait/notify and synchronization of java.lang.Object.

11. jrockit.qa.tests.misc.RegisterKill None Times various arithmetic and hashing operations.


12. jrockit.qa.tests.io.RevLookupTest nohosts=1 hostname1=maker.jrpg.bea.com hostname2=www.yahoo.com hostname3=home.beasys.com hostname4=www.google.com hostname5=yp.yahoo.com hostname6=finance.yahoo.com Tests the InetAddress reverse lookup.

13. jrockit.qa.tests.thread.DivideAndConquerTest maxarraysize=1000 Sorts a random array using a divide-and-conquer algorithm that spawns new threads with each split.

14. jrockit.qa.tests.codegen.api.CharsetTest None Tests that all reported charsets can be loaded properly.

15. jrockit.qa.tests.codegen.classload.CLGCTest iter=1 Tests garbage collection of classes and class loaders.

16. jrockit.qa.tests.serialization.MicroBenchmark numob=5000 obsize=100 numobref=0 numiter=100 topology=noref Tests correctness and performance of serialization. Serializes a graph without references.

17. jrockit.qa.tests.serialization.MicroBenchmark numob=50000 obsize=0 numobref=2 numiter=10 topology=multiref Tests correctness and performance of serializing. Serializes a graph with multiple references.

18. jrockit.qa.tests.thread.ThreadEnds None Checks that a ThreadDeathException is thrown as expected.

19. jrockit.qa.tests.thread.Pong iter=1 Tests object creation and synchronization.


20. jrockit.qa.tests.codegen.classload.StringInternTest nt=2 folderdiff=build Tests loading classes dynamically. Checks methods and fields of the loaded classes.