When Tests Collide: Evaluating and Coping with the Impact of Test Dependence Wing Lam, Sai Zhang, Michael D. Ernst University of Washington


Page 1:

When Tests Collide: Evaluating and Coping with the Impact of Test Dependence

Wing Lam, Sai Zhang, Michael D. Ernst

University of Washington

Page 2:

Two tests:

    createFile("foo") ...
    readFile("foo") ...

Executing them in the default order: (the intended test results)

Executing them in a different order: the tests are order-dependent, and the dependent test yields a different result.
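A minimal JUnit 4 sketch of such a pair (hypothetical names and file contents, not the paper's code); the second test passes only if the first has already run:

    import static org.junit.Assert.assertEquals;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    import org.junit.Test;

    public class FileTests {
        private static final Path FOO = Paths.get("foo");

        @Test
        public void testCreateFile() throws IOException {
            // Side effect shared with testReadFile: leaves "foo" on disk.
            Files.write(FOO, "contents".getBytes());
        }

        @Test
        public void testReadFile() throws IOException {
            // Passes in the default order (after testCreateFile); fails if
            // run first or alone, because "foo" does not exist yet.
            assertEquals("contents", new String(Files.readAllBytes(FOO)));
        }
    }

Any technique that separates or reorders the two tests (selection, prioritization, parallelization) can silently change testReadFile's result.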

Page 3:

Why should we care about test dependence?

• Makes test behaviors inconsistent

• Affects downstream testing techniques

[Figure: the downstream techniques affected: test prioritization, test selection, and test parallelization across multiple CPUs (CPU 1, CPU 2)]

Page 4:

• Test independence is assumed by:
‒ Test selection
‒ Test prioritization
‒ Test parallel execution
‒ Test factoring
‒ Test generation
‒ …

Conventional wisdom: test dependence is not a significant issue

31 papers in ICSE, FSE, ISSTA, ASE, ICST, TSE, and TOSEM (2000 – 2013)

Page 5:

• Test independence is assumed by:
‒ Test selection
‒ Test prioritization
‒ Test parallel execution
‒ Test factoring
‒ Test generation
‒ …

Conventional wisdom: test dependence is not a significant issue

31 papers in ICSE, FSE, ISSTA, ASE, ICST, TSE, and TOSEM (2000 – 2013): 27 of the 31 assume test independence without justification; the rest consider test dependence, at most as a threat to validity.

Page 6:

Recent work

• Illinois MIR work on flaky tests and test dependences [Luo FSE'14, Gyori ISSTA'15, Gligoric ISSTA'15]
‒ Tests revealing inconsistent results

A dependent test is a special type of flaky test.

• UW PLSE work empirically revisiting the test independence assumption [Zhang et al. ISSTA'14]
‒ Test dependence should not be ignored

Page 7:

Is the test independence assumption valid? No!

• Does test dependence arise in practice?
‒ Yes, in both human-written and automatically-generated suites

• What repercussions does test dependence have?
‒ It affects downstream testing techniques

• How can we nullify the impact of test dependence?
‒ A general algorithm adds/reorders tests for techniques such as prioritization, etc.

Page 8:

Is the test independence assumption valid?

• Does test dependence arise in practice?
‒ Yes, in both human-written and automatically-generated suites

• What repercussions does test dependence have?
‒ It affects downstream testing techniques

• How can we nullify the impact of test dependence?
‒ A general algorithm adds/reorders tests for techniques such as prioritization, etc.

Page 9:

Methodology

Reported dependent tests: from 5 issue tracking systems

New dependent tests: from 5 real-world projects

Page 10:

Methodology

Reported dependent tests: from 5 issue tracking systems

• Search for 4 key phrases: ("dependent test", "test dependence", "test execution order", "different test outcome")
• Manually inspect 450 matched bug reports
• Identify 96 distinct dependent tests

Characteristics studied:
‒ Root cause
‒ Developers' action

Page 11:


Root cause

96 dependent tests

Page 12:

Root cause

Of the 96 dependent tests:
‒ 59 static variable
‒ 23 file system
‒ 10 database
‒ 4 unknown

At least 61% are due to side-effecting access to static variables.
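As an illustration of this dominant root cause, here is a hypothetical sketch (not drawn from the inspected bug reports) of two tests coupled through a static variable:

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class StaticStateTests {
        static int hits = 0;  // shared static state, never reset between tests

        @Test
        public void testRecordHit() {
            hits++;
            assertEquals(1, hits);  // holds only if this test runs first
        }

        @Test
        public void testNoHitsYet() {
            assertEquals(0, hits);  // fails whenever testRecordHit ran earlier
        }
    }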

Page 13:

Developers' action

98% of the reported tests are marked as major or minor issues

91% of the dependences have been fixed, by:
‒ Improving documentation
‒ Fixing test code or source code

Page 14:

Methodology

New dependent tests: from 5 real-world projects

• Human-written test suites: 6413 tests → 37 (0.6%) dependent tests

• Automatically-generated test suites, using Randoop [Pacheco'07]: 11002 tests → 608 (5.5%) dependent tests

• Subjects were selected from a previous project [Zhang et al. ISSTA'14] that identified dependent tests in them

Page 15:

Is the test independence assumption valid?

• Does test dependence arise in practice?
‒ Yes, in both human-written and automatically-generated suites

• What repercussions does test dependence have?
‒ It affects downstream testing techniques

• How can we nullify the impact of test dependence?
‒ A general algorithm adds/reorders tests for techniques such as prioritization, etc.

Page 16:

Test prioritization

A test execution order → a new test execution order

Goals: achieve coverage faster, improve fault detection rate, …

Each test should yield the same result.

Page 17:

Four test prioritization techniques [Elbaum et al. ISSTA 2000]:
‒ Prioritize on coverage of statements
‒ Prioritize on coverage of statements not yet covered
‒ Prioritize on coverage of methods
‒ Prioritize on coverage of methods not yet covered

• Record the number of dependent tests yielding different results

Total: 37 human-written and 608 automatically-generated dependent tests, from 5 real-world projects
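The four techniques boil down to two greedy strategies ("total" and "additional" coverage), applied at statement or method granularity. A sketch under assumed data structures (a map from test name to its set of covered elements); the class and method names are illustrative, not Elbaum et al.'s implementation:

    import java.util.*;

    public class Prioritizer {
        // "Total" strategy: order tests by how many elements each covers.
        static List<String> byTotalCoverage(Map<String, Set<Integer>> cov) {
            List<String> order = new ArrayList<>(cov.keySet());
            order.sort((a, b) -> cov.get(b).size() - cov.get(a).size());
            return order;
        }

        // "Additional" strategy: repeatedly pick the test covering the most
        // elements not yet covered by the tests already chosen.
        static List<String> byAdditionalCoverage(Map<String, Set<Integer>> cov) {
            List<String> order = new ArrayList<>();
            Set<String> remaining = new HashSet<>(cov.keySet());
            Set<Integer> covered = new HashSet<>();
            while (!remaining.isEmpty()) {
                String best = null;
                int bestGain = -1;
                for (String t : remaining) {
                    Set<Integer> gain = new HashSet<>(cov.get(t));
                    gain.removeAll(covered);
                    if (gain.size() > bestGain) { bestGain = gain.size(); best = t; }
                }
                remaining.remove(best);
                covered.addAll(cov.get(best));
                order.add(best);
            }
            return order;
        }
    }

Either strategy freely reorders tests, which is exactly what exposes order dependences.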

Page 18:

Evaluating test prioritization techniques

Out of 37 human-written dependent tests, the number of tests that yield different results:
‒ Prioritize on coverage of statements: 5 (13.5%)
‒ Prioritize on coverage of statements not yet covered: 9 (24.3%)
‒ Prioritize on coverage of methods: 7 (18.9%)
‒ Prioritize on coverage of methods not yet covered: 6 (16.2%)

• Implication: on average, an 18% chance that test dependence would affect test prioritization on human-written tests

Page 19:

Evaluating test prioritization techniques

Out of 608 automatically-generated dependent tests, the number of tests that yield different results:
‒ Prioritize on coverage of statements: 372 (61.2%)
‒ Prioritize on coverage of statements not yet covered: 331 (54.4%)
‒ Prioritize on coverage of methods: 381 (62.3%)
‒ Prioritize on coverage of methods not yet covered: 357 (58.7%)

• Implication: on average, a 59% chance that test dependence would affect test prioritization on automatically-generated tests

Page 20:

Test selection

A test execution order → a subset of the test execution order

Goal: runs faster, …

Each test should yield the same result.

Page 21:

Six test selection techniques [Harrold et al. OOPSLA 2001], by selection granularity and ordering:
‒ Statement granularity, ordered by test id (no re-ordering)
‒ Statement granularity, ordered by number of elements tests cover
‒ Statement granularity, ordered by number of uncovered elements tests cover
‒ Function granularity, ordered by test id (no re-ordering)
‒ Function granularity, ordered by number of elements tests cover
‒ Function granularity, ordered by number of uncovered elements tests cover

• Record the number of dependent tests yielding different results

Total: 37 human-written and 608 automatically-generated dependent tests, from 5 real-world projects
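The selection step itself can be sketched as keeping only the tests whose coverage intersects the changed elements, then ordering the survivors by one of the criteria above. The coverage map and changed-element set are assumptions for illustration, not Harrold et al.'s implementation:

    import java.util.*;

    public class Selector {
        // Keep tests that cover at least one changed element (statements or
        // functions, depending on the chosen granularity).
        static List<String> select(Map<String, Set<Integer>> cov, Set<Integer> changed) {
            List<String> selected = new ArrayList<>();
            for (Map.Entry<String, Set<Integer>> e : cov.entrySet()) {
                if (!Collections.disjoint(e.getValue(), changed)) {
                    selected.add(e.getKey());  // test exercises changed code
                }
            }
            Collections.sort(selected);  // "test id (no re-ordering)" variant;
            return selected;             // other variants re-sort by coverage counts
        }
    }

Dropping tests is what breaks dependences here: a selected test may lose the unselected test it silently relied on.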

Page 22:

Evaluating test selection techniques

Out of 37 human-written dependent tests, the number of tests that yield different results:
‒ Statement granularity, ordered by test id (no re-ordering): 1 (2.7%)
‒ Statement granularity, ordered by number of elements tests cover: 1 (2.7%)
‒ Statement granularity, ordered by number of uncovered elements tests cover: 1 (2.7%)
‒ Function granularity, ordered by test id (no re-ordering): 1 (2.7%)
‒ Function granularity, ordered by number of elements tests cover: 1 (2.7%)
‒ Function granularity, ordered by number of uncovered elements tests cover: 2 (5.4%)

• Implication: on average, a 3.2% chance that test dependence would affect test selection on human-written tests

Page 23:

Evaluating test selection techniques

Out of 608 automatically-generated dependent tests, the number of tests that yield different results:
‒ Statement granularity, ordered by test id (no re-ordering): 95 (15.6%)
‒ Statement granularity, ordered by number of elements tests cover: 109 (17.9%)
‒ Statement granularity, ordered by number of uncovered elements tests cover: 109 (17.9%)
‒ Function granularity, ordered by test id (no re-ordering): 266 (44.0%)
‒ Function granularity, ordered by number of elements tests cover: 294 (48.4%)
‒ Function granularity, ordered by number of uncovered elements tests cover: 297 (48.8%)

• Implication: on average, a 32% chance that test dependence would affect test selection on automatically-generated tests

Page 24:

Test parallelization

A test execution order → a schedule of the test execution order across multiple CPUs (CPU 1, CPU 2, …)

Goal: reduce test latency, …

Each test should yield the same result.

Page 25:

Two test parallelization techniques [1]:
‒ Parallelize on test id
‒ Parallelize on test execution time

• Record the number of dependent tests yielding different results

Total: 37 human-written and 608 automatically-generated dependent tests, from 5 real-world projects

[1] Executing unit tests in parallel on a multi-CPU/core machine in Visual Studio. http://blogs.msdn.com/b/vstsqualitytools/archive/2009/12/01/executing-unit-tests-in-parallel-on-a-multi-cpu-core-machine.aspx
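A sketch of the two schedulers, under assumed inputs (test names and measured per-test times); parallelize-on-test-id deals tests out round-robin, while parallelize-on-execution-time greedily places the slowest remaining test on the least-loaded CPU. Names are illustrative, not the Visual Studio implementation:

    import java.util.*;

    public class ParallelScheduler {
        // Round-robin by position in the test-id order.
        static List<List<String>> byTestId(List<String> tests, int cpus) {
            List<List<String>> machines = new ArrayList<>();
            for (int i = 0; i < cpus; i++) machines.add(new ArrayList<>());
            for (int i = 0; i < tests.size(); i++) {
                machines.get(i % cpus).add(tests.get(i));
            }
            return machines;
        }

        // Greedy longest-time-first onto the currently least-loaded CPU.
        static List<List<String>> byExecutionTime(Map<String, Long> timeMs, int cpus) {
            List<List<String>> machines = new ArrayList<>();
            long[] load = new long[cpus];
            for (int i = 0; i < cpus; i++) machines.add(new ArrayList<>());
            List<String> tests = new ArrayList<>(timeMs.keySet());
            tests.sort((a, b) -> Long.compare(timeMs.get(b), timeMs.get(a)));
            for (String t : tests) {
                int min = 0;  // index of the least-loaded CPU so far
                for (int i = 1; i < cpus; i++) if (load[i] < load[min]) min = i;
                machines.get(min).add(t);
                load[min] += timeMs.get(t);
            }
            return machines;
        }
    }

Either scheduler can place two order-dependent tests on different CPUs, where their relative order is no longer guaranteed.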

Page 26:

Evaluating test parallelization techniques

Out of 37 human-written dependent tests, the number of tests that yield different results:
‒ Parallelize on test id: 2 (5.4%) at #CPUs = 2; 13 (35.1%) at #CPUs = 16
‒ Parallelize on test execution time: 14 (37.8%) at #CPUs = 2; 14 (37.8%) at #CPUs = 16

(#CPUs = 4 and 8 were evaluated but omitted for space reasons)

• Implication: on average, when #CPUs = 2, a 27% chance that test dependence would affect test parallelization for human-written tests; when #CPUs = 16, a 36% chance

Page 27:

Evaluating test parallelization techniques

Out of 608 automatically-generated dependent tests, the number of tests that yield different results:
‒ Parallelize on test id: 194 (31.9%) at #CPUs = 2; 349 (57.4%) at #CPUs = 16
‒ Parallelize on test execution time: 360 (59.2%) at #CPUs = 2; 433 (71.2%) at #CPUs = 16

(#CPUs = 4 and 8 were evaluated but omitted for space reasons)

• Implication: on average, when #CPUs = 2, a 46% chance that test dependence would affect test parallelization for automatically-generated tests; when #CPUs = 16, a 64% chance

Page 28:

Impact of test dependence

Chance of impact by test dependence, per technique and test suite type:
‒ Prioritization, human-written: Low
‒ Prioritization, automatically-generated: High
‒ Selection, human-written: Low
‒ Selection, automatically-generated: Moderate
‒ Parallelization, human-written: Moderate
‒ Parallelization, automatically-generated: High

Chance = average % of dependent tests exposed: Low = 0–25%, Moderate = 25–50%, High = over 50%

Dependent tests do affect downstream testing techniques, especially for automatically-generated test suites!

Page 29:

Is the test independence assumption valid?

• Does test dependence arise in practice?
‒ Yes, in both human-written and automatically-generated suites

• What repercussions does test dependence have?
‒ It affects downstream testing techniques

• How can we nullify the impact of test dependence?
‒ A general algorithm adds/reorders tests for techniques such as prioritization, etc.

Page 30:

General algorithm to nullify test dependence

A test suite + known test dependences → a reordered/amended test suite

A test suite:
‒ The product of test prioritization, selection, or parallelization

Known test dependences:
‒ Can be generated through approximate algorithms [Zhang et al. ISSTA'14], or start out empty
‒ Reusable across different testing techniques and when developers change their code
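The core idea can be sketched as follows (illustrative names, not the paper's exact algorithm): walk the order proposed by the downstream technique and, before emitting each test, first emit any known prerequisites that have not yet been scheduled, adding them back even if the technique dropped them. The sketch assumes the dependence relation is acyclic:

    import java.util.*;

    public class DependenceAwareRunner {
        // deps maps a test to the tests that must run before it.
        static List<String> fixUp(List<String> proposed, Map<String, List<String>> deps) {
            List<String> result = new ArrayList<>();
            Set<String> emitted = new HashSet<>();
            for (String t : proposed) {
                emit(t, deps, emitted, result);
            }
            return result;
        }

        private static void emit(String t, Map<String, List<String>> deps,
                                 Set<String> emitted, List<String> result) {
            if (!emitted.add(t)) {
                return;  // already scheduled earlier in the order
            }
            for (String pre : deps.getOrDefault(t, Collections.emptyList())) {
                emit(pre, deps, emitted, result);  // prerequisites run first
            }
            result.add(t);
        }
    }

Because the fixup only consults the dependence map, the same map can be reused across prioritization, selection, and parallelization.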

Page 31:

Prioritization algorithm to nullify test dependence

A test suite → prioritization → prioritized test suite → general algorithm (with known test dependences) → reordered test suite

Measured average area under the curve (APFD) for the percentage of faults detected over the life of the test suite:
‒ APFD of the original prioritization algorithms was 89.1%.
‒ APFD of this dependence-aware algorithm was 88.1%, a negligible difference.
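For reference, APFD (average percentage of faults detected) has a standard definition in the prioritization literature:

    APFD = 1 - \frac{TF_1 + TF_2 + \cdots + TF_m}{n \cdot m} + \frac{1}{2n}

where n is the number of tests in the suite, m is the number of faults, and TF_i is the position in the execution order of the first test that reveals fault i. Higher is better, so the 1.0-point drop above means the dependence-aware reordering barely delays fault detection.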

Page 32:

Selection algorithm to nullify test dependence

A test suite → selection → selected test suite → general algorithm (with known test dependences) → reordered/amended test suite

Measured the number of tests selected:
‒ The original selection algorithms selected 41.6% of tests on average.
‒ This dependence-aware algorithm selected 42.2%, a negligible difference.
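Dependence-aware selection then reduces to one call to the general fixup: the technique's selected subset is amended with any missing prerequisites. A usage sketch reusing the hypothetical DependenceAwareRunner from the general-algorithm slide:

    import java.util.*;

    public class SelectionDemo {
        public static void main(String[] args) {
            // testRead is known to depend on testCreate having run first.
            Map<String, List<String>> deps = new HashMap<>();
            deps.put("testRead", Arrays.asList("testCreate"));

            // The selection technique kept only testRead.
            List<String> selected = Arrays.asList("testRead");

            // Prints [testCreate, testRead]: the prerequisite is added back.
            System.out.println(DependenceAwareRunner.fixUp(selected, deps));
        }
    }

Amending explains why slightly more tests run (41.6% vs. 42.2%) while each keeps its intended outcome.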

Page 33:

Parallelization algorithm to nullify test dependence

A test suite → parallelization → subsequences of the test suite → general algorithm (with known test dependences) → reordered/amended test suites

Measured the time taken by the slowest machine, and the average speedup compared to unparallelized suites:
‒ Average speedup of the original parallelization algorithms was 41%.
‒ This dependence-aware algorithm's speedup was 55%.
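For parallelization, the fixup must also keep each test on the same CPU as its prerequisites so that no ordering constraint crosses machines. One possible grouping heuristic (an assumption, not the paper's exact algorithm), again reusing the hypothetical fixUp:

    import java.util.*;

    public class DependenceAwareScheduler {
        static List<List<String>> schedule(List<String> tests,
                                           Map<String, List<String>> deps, int cpus) {
            List<List<String>> machines = new ArrayList<>();
            for (int i = 0; i < cpus; i++) {
                machines.add(new ArrayList<>());
            }
            Map<String, Integer> placedOn = new HashMap<>();  // test -> machine index
            int rr = 0;  // round-robin fallback for independent groups
            for (String t : tests) {
                if (placedOn.containsKey(t)) {
                    continue;
                }
                // The group is t plus its transitive prerequisites, in a valid order.
                List<String> group = DependenceAwareRunner.fixUp(
                        Collections.singletonList(t), deps);
                // Join the machine of an already-placed prerequisite, if any
                // (simplified: assumes a group never straddles two machines).
                int target = -1;
                for (String g : group) {
                    Integer m = placedOn.get(g);
                    if (m != null) { target = m; break; }
                }
                if (target < 0) {
                    target = rr++ % cpus;
                }
                for (String g : group) {
                    if (!placedOn.containsKey(g)) {
                        machines.get(target).add(g);
                        placedOn.put(g, target);
                    }
                }
            }
            return machines;
        }
    }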

Page 34:


Future work

• For test selection, measure the time it takes for our dependence-aware test suites to run compared to the dependence-unaware test suites

• Evaluate our effectiveness at incrementally recomputing test dependences when developers make code changes

Page 35:

Contributions

• Evaluating and coping with the impact of test dependence:
‒ Test dependence arises in practice
‒ Test dependence does affect downstream testing techniques
‒ Our general algorithm is effective in practice at nullifying the impact of test dependence

• Our tools, experiments, etc.: https://github.com/winglam/dependent-tests-impact/

Page 36:


[Backup slides]

Page 37:

Why more dependent tests in automatically-generated test suites?

• Manual test suites:
‒ Developers' understanding of the code and their testing goals help them build well-structured tests
‒ Developers often try to initialize and destroy the shared objects each unit test may use

• Automatically-generated test suites:
‒ Most tools are not "state-aware"
‒ The generated tests often "misuse" APIs, e.g., setting up the environment incorrectly
‒ Most tools cannot generate environment setup/teardown code

Page 38:

Dependent tests vs. nondeterministic tests

• Nondeterminism does not imply dependence:
‒ A program may execute nondeterministically, yet its tests may deterministically succeed.

• Test dependence does not imply nondeterminism:
‒ A program may have no sources of nondeterminism, yet its tests can still depend on each other.