When Tests Collide: Evaluating and Coping with the Impact of Test Dependence Wing Lam, Sai Zhang, Michael D. Ernst University of Washington


Page 1:

When Tests Collide: Evaluating and Coping with the Impact of Test Dependence

Wing Lam, Sai Zhang, Michael D. Ernst

University of Washington

Page 2:

Two tests:

    createFile("foo") ...
    readFile("foo") ...

Executing them in the default order: (the intended test results)

Executing them in a different order: the tests are order-dependent, and the dependent test yields a different result.
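A minimal JUnit 4 sketch of such a pair (hypothetical names and file contents, not the paper's code); the second test passes only if the first has already run:

    import static org.junit.Assert.assertEquals;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    import org.junit.Test;

    public class FileTests {
        private static final Path FOO = Paths.get("foo");

        @Test
        public void testCreateFile() throws IOException {
            // Side effect shared with testReadFile: leaves "foo" on disk.
            Files.write(FOO, "contents".getBytes());
        }

        @Test
        public void testReadFile() throws IOException {
            // Passes in the default order (after testCreateFile); fails if
            // run first or alone, because "foo" does not exist yet.
            assertEquals("contents", new String(Files.readAllBytes(FOO)));
        }
    }

Any technique that separates or reorders the two tests (selection, prioritization, parallelization) can silently change testReadFile's result.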

Page 3:

Why should we care about test dependence?

• Makes test behaviors inconsistent

• Affects downstream testing techniques

[Figure: the downstream techniques affected: test prioritization, test selection, and test parallelization across multiple CPUs (CPU 1, CPU 2)]

Page 4:

• Test independence is assumed by:
‒ Test selection
‒ Test prioritization
‒ Test parallel execution
‒ Test factoring
‒ Test generation
‒ …

Conventional wisdom: test dependence is not a significant issue

31 papers in ICSE, FSE, ISSTA, ASE, ICST, TSE, and TOSEM (2000 – 2013)

Page 5:

• Test independence is assumed by:
‒ Test selection
‒ Test prioritization
‒ Test parallel execution
‒ Test factoring
‒ Test generation
‒ …

Conventional wisdom: test dependence is not a significant issue

31 papers in ICSE, FSE, ISSTA, ASE, ICST, TSE, and TOSEM (2000 – 2013): 27 of the 31 assume test independence without justification; the rest consider test dependence, at most as a threat to validity.

Page 6:

Recent work

• Illinois MIR work on flaky tests and test dependences [Luo FSE'14, Gyori ISSTA'15, Gligoric ISSTA'15]
‒ Tests revealing inconsistent results

A dependent test is a special type of flaky test.

• UW PLSE work empirically revisiting the test independence assumption [Zhang et al. ISSTA'14]
‒ Test dependence should not be ignored

Page 7:

Is the test independence assumption valid? No!

• Does test dependence arise in practice?
‒ Yes, in both human-written and automatically-generated suites

• What repercussions does test dependence have?
‒ It affects downstream testing techniques

• How can we nullify the impact of test dependence?
‒ A general algorithm adds/reorders tests for techniques such as prioritization, etc.

Page 8:

Is the test independence assumption valid?

• Does test dependence arise in practice?
‒ Yes, in both human-written and automatically-generated suites

• What repercussions does test dependence have?
‒ It affects downstream testing techniques

• How can we nullify the impact of test dependence?
‒ A general algorithm adds/reorders tests for techniques such as prioritization, etc.

Page 9:

Methodology

Reported dependent tests: from 5 issue tracking systems

New dependent tests: from 5 real-world projects

Page 10:

Methodology

Reported dependent tests: from 5 issue tracking systems

• Search for 4 key phrases: ("dependent test", "test dependence", "test execution order", "different test outcome")
• Manually inspect 450 matched bug reports
• Identify 96 distinct dependent tests

Characteristics studied:
‒ Root cause
‒ Developers' action

Page 11:


Root cause

96 dependent tests

Page 12:

Root cause

Of the 96 dependent tests:
‒ 59 static variable
‒ 23 file system
‒ 10 database
‒ 4 unknown

At least 61% are due to side-effecting access to static variables.
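As an illustration of this dominant root cause, here is a hypothetical sketch (not drawn from the inspected bug reports) of two tests coupled through a static variable:

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class StaticStateTests {
        static int hits = 0;  // shared static state, never reset between tests

        @Test
        public void testRecordHit() {
            hits++;
            assertEquals(1, hits);  // holds only if this test runs first
        }

        @Test
        public void testNoHitsYet() {
            assertEquals(0, hits);  // fails whenever testRecordHit ran earlier
        }
    }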

Page 13:

Developers' action

98% of the reported tests are marked as major or minor issues

91% of the dependences have been fixed, by:
‒ Improving documentation
‒ Fixing test code or source code

Page 14:

Methodology

New dependent tests: from 5 real-world projects

• Human-written test suites: 6413 tests → 37 (0.6%) dependent tests

• Automatically-generated test suites, using Randoop [Pacheco'07]: 11002 tests → 608 (5.5%) dependent tests

• Subjects were selected from a previous project [Zhang et al. ISSTA'14] that identified dependent tests in them

Page 15:

Is the test independence assumption valid?

• Does test dependence arise in practice?
‒ Yes, in both human-written and automatically-generated suites

• What repercussions does test dependence have?
‒ It affects downstream testing techniques

• How can we nullify the impact of test dependence?
‒ A general algorithm adds/reorders tests for techniques such as prioritization, etc.

Page 16:

Test prioritization

A test execution order → a new test execution order

Goals: achieve coverage faster, improve fault detection rate, …

Each test should yield the same result.

Page 17:

Four test prioritization techniques [Elbaum et al. ISSTA 2000]:
‒ Prioritize on coverage of statements
‒ Prioritize on coverage of statements not yet covered
‒ Prioritize on coverage of methods
‒ Prioritize on coverage of methods not yet covered

• Record the number of dependent tests yielding different results

Total: 37 human-written and 608 automatically-generated dependent tests, from 5 real-world projects
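The four techniques boil down to two greedy strategies ("total" and "additional" coverage), applied at statement or method granularity. A sketch under assumed data structures (a map from test name to its set of covered elements); the class and method names are illustrative, not Elbaum et al.'s implementation:

    import java.util.*;

    public class Prioritizer {
        // "Total" strategy: order tests by how many elements each covers.
        static List<String> byTotalCoverage(Map<String, Set<Integer>> cov) {
            List<String> order = new ArrayList<>(cov.keySet());
            order.sort((a, b) -> cov.get(b).size() - cov.get(a).size());
            return order;
        }

        // "Additional" strategy: repeatedly pick the test covering the most
        // elements not yet covered by the tests already chosen.
        static List<String> byAdditionalCoverage(Map<String, Set<Integer>> cov) {
            List<String> order = new ArrayList<>();
            Set<String> remaining = new HashSet<>(cov.keySet());
            Set<Integer> covered = new HashSet<>();
            while (!remaining.isEmpty()) {
                String best = null;
                int bestGain = -1;
                for (String t : remaining) {
                    Set<Integer> gain = new HashSet<>(cov.get(t));
                    gain.removeAll(covered);
                    if (gain.size() > bestGain) { bestGain = gain.size(); best = t; }
                }
                remaining.remove(best);
                covered.addAll(cov.get(best));
                order.add(best);
            }
            return order;
        }
    }

Either strategy freely reorders tests, which is exactly what exposes order dependences.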

Page 18:

Evaluating test prioritization techniques

Out of 37 human-written dependent tests, the number of tests that yield different results:
‒ Prioritize on coverage of statements: 5 (13.5%)
‒ Prioritize on coverage of statements not yet covered: 9 (24.3%)
‒ Prioritize on coverage of methods: 7 (18.9%)
‒ Prioritize on coverage of methods not yet covered: 6 (16.2%)

• Implication: on average, an 18% chance that test dependence would affect test prioritization on human-written tests

Page 19:

Evaluating test prioritization techniques

Out of 608 automatically-generated dependent tests, the number of tests that yield different results:
‒ Prioritize on coverage of statements: 372 (61.2%)
‒ Prioritize on coverage of statements not yet covered: 331 (54.4%)
‒ Prioritize on coverage of methods: 381 (62.3%)
‒ Prioritize on coverage of methods not yet covered: 357 (58.7%)

• Implication: on average, a 59% chance that test dependence would affect test prioritization on automatically-generated tests

Page 20:

Test selection

A test execution order → a subset of the test execution order

Goal: runs faster, …

Each test should yield the same result.

Page 21:

Six test selection techniques [Harrold et al. OOPSLA 2001], by selection granularity and ordering:
‒ Statement granularity, ordered by test id (no re-ordering)
‒ Statement granularity, ordered by number of elements tests cover
‒ Statement granularity, ordered by number of uncovered elements tests cover
‒ Function granularity, ordered by test id (no re-ordering)
‒ Function granularity, ordered by number of elements tests cover
‒ Function granularity, ordered by number of uncovered elements tests cover

• Record the number of dependent tests yielding different results

Total: 37 human-written and 608 automatically-generated dependent tests, from 5 real-world projects
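The selection step itself can be sketched as keeping only the tests whose coverage intersects the changed elements, then ordering the survivors by one of the criteria above. The coverage map and changed-element set are assumptions for illustration, not Harrold et al.'s implementation:

    import java.util.*;

    public class Selector {
        // Keep tests that cover at least one changed element (statements or
        // functions, depending on the chosen granularity).
        static List<String> select(Map<String, Set<Integer>> cov, Set<Integer> changed) {
            List<String> selected = new ArrayList<>();
            for (Map.Entry<String, Set<Integer>> e : cov.entrySet()) {
                if (!Collections.disjoint(e.getValue(), changed)) {
                    selected.add(e.getKey());  // test exercises changed code
                }
            }
            Collections.sort(selected);  // "test id (no re-ordering)" variant;
            return selected;             // other variants re-sort by coverage counts
        }
    }

Dropping tests is what breaks dependences here: a selected test may lose the unselected test it silently relied on.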

Page 22:

Evaluating test selection techniques

Out of 37 human-written dependent tests, the number of tests that yield different results:
‒ Statement granularity, ordered by test id (no re-ordering): 1 (2.7%)
‒ Statement granularity, ordered by number of elements tests cover: 1 (2.7%)
‒ Statement granularity, ordered by number of uncovered elements tests cover: 1 (2.7%)
‒ Function granularity, ordered by test id (no re-ordering): 1 (2.7%)
‒ Function granularity, ordered by number of elements tests cover: 1 (2.7%)
‒ Function granularity, ordered by number of uncovered elements tests cover: 2 (5.4%)

• Implication: on average, a 3.2% chance that test dependence would affect test selection on human-written tests

Page 23:

Evaluating test selection techniques

Out of 608 automatically-generated dependent tests, the number of tests that yield different results:
‒ Statement granularity, ordered by test id (no re-ordering): 95 (15.6%)
‒ Statement granularity, ordered by number of elements tests cover: 109 (17.9%)
‒ Statement granularity, ordered by number of uncovered elements tests cover: 109 (17.9%)
‒ Function granularity, ordered by test id (no re-ordering): 266 (44.0%)
‒ Function granularity, ordered by number of elements tests cover: 294 (48.4%)
‒ Function granularity, ordered by number of uncovered elements tests cover: 297 (48.8%)

• Implication: on average, a 32% chance that test dependence would affect test selection on automatically-generated tests

Page 24:

Test parallelization

A test execution order → a schedule of the test execution order across multiple CPUs (CPU 1, CPU 2, …)

Goal: reduce test latency, …

Each test should yield the same result.

Page 25:

Two test parallelization techniques [1]:
‒ Parallelize on test id
‒ Parallelize on test execution time

• Record the number of dependent tests yielding different results

Total: 37 human-written and 608 automatically-generated dependent tests, from 5 real-world projects

[1] Executing unit tests in parallel on a multi-CPU/core machine in Visual Studio. http://blogs.msdn.com/b/vstsqualitytools/archive/2009/12/01/executing-unit-tests-in-parallel-on-a-multi-cpu-core-machine.aspx
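A sketch of the two schedulers, under assumed inputs (test names and measured per-test times); parallelize-on-test-id deals tests out round-robin, while parallelize-on-execution-time greedily places the slowest remaining test on the least-loaded CPU. Names are illustrative, not the Visual Studio implementation:

    import java.util.*;

    public class ParallelScheduler {
        // Round-robin by position in the test-id order.
        static List<List<String>> byTestId(List<String> tests, int cpus) {
            List<List<String>> machines = new ArrayList<>();
            for (int i = 0; i < cpus; i++) machines.add(new ArrayList<>());
            for (int i = 0; i < tests.size(); i++) {
                machines.get(i % cpus).add(tests.get(i));
            }
            return machines;
        }

        // Greedy longest-time-first onto the currently least-loaded CPU.
        static List<List<String>> byExecutionTime(Map<String, Long> timeMs, int cpus) {
            List<List<String>> machines = new ArrayList<>();
            long[] load = new long[cpus];
            for (int i = 0; i < cpus; i++) machines.add(new ArrayList<>());
            List<String> tests = new ArrayList<>(timeMs.keySet());
            tests.sort((a, b) -> Long.compare(timeMs.get(b), timeMs.get(a)));
            for (String t : tests) {
                int min = 0;  // index of the least-loaded CPU so far
                for (int i = 1; i < cpus; i++) if (load[i] < load[min]) min = i;
                machines.get(min).add(t);
                load[min] += timeMs.get(t);
            }
            return machines;
        }
    }

Either scheduler can place two order-dependent tests on different CPUs, where their relative order is no longer guaranteed.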

Page 26:

Evaluating test parallelization techniques

Out of 37 human-written dependent tests, the number of tests that yield different results:
‒ Parallelize on test id: 2 (5.4%) at #CPUs = 2; 13 (35.1%) at #CPUs = 16
‒ Parallelize on test execution time: 14 (37.8%) at #CPUs = 2; 14 (37.8%) at #CPUs = 16

(#CPUs = 4 and 8 were evaluated but omitted for space reasons)

• Implication: on average, when #CPUs = 2, a 27% chance that test dependence would affect test parallelization for human-written tests; when #CPUs = 16, a 36% chance

Page 27:

Evaluating test parallelization techniques

Out of 608 automatically-generated dependent tests, the number of tests that yield different results:
‒ Parallelize on test id: 194 (31.9%) at #CPUs = 2; 349 (57.4%) at #CPUs = 16
‒ Parallelize on test execution time: 360 (59.2%) at #CPUs = 2; 433 (71.2%) at #CPUs = 16

(#CPUs = 4 and 8 were evaluated but omitted for space reasons)

• Implication: on average, when #CPUs = 2, a 46% chance that test dependence would affect test parallelization for automatically-generated tests; when #CPUs = 16, a 64% chance

Page 28:

Impact of test dependence

Chance of impact by test dependence, per technique and test suite type:
‒ Prioritization, human-written: Low
‒ Prioritization, automatically-generated: High
‒ Selection, human-written: Low
‒ Selection, automatically-generated: Moderate
‒ Parallelization, human-written: Moderate
‒ Parallelization, automatically-generated: High

Chance = average % of dependent tests exposed: Low = 0–25%, Moderate = 25–50%, High = over 50%

Dependent tests do affect downstream testing techniques, especially for automatically-generated test suites!

Page 29:

Is the test independence assumption valid?

• Does test dependence arise in practice?
‒ Yes, in both human-written and automatically-generated suites

• What repercussions does test dependence have?
‒ It affects downstream testing techniques

• How can we nullify the impact of test dependence?
‒ A general algorithm adds/reorders tests for techniques such as prioritization, etc.

Page 30:

General algorithm to nullify test dependence

A test suite + known test dependences → a reordered/amended test suite

A test suite:
‒ The product of test prioritization, selection, or parallelization

Known test dependences:
‒ Can be generated through approximate algorithms [Zhang et al. ISSTA'14], or start out empty
‒ Reusable across different testing techniques and when developers change their code
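The core idea can be sketched as follows (illustrative names, not the paper's exact algorithm): walk the order proposed by the downstream technique and, before emitting each test, first emit any known prerequisites that have not yet been scheduled, adding them back even if the technique dropped them. The sketch assumes the dependence relation is acyclic:

    import java.util.*;

    public class DependenceAwareRunner {
        // deps maps a test to the tests that must run before it.
        static List<String> fixUp(List<String> proposed, Map<String, List<String>> deps) {
            List<String> result = new ArrayList<>();
            Set<String> emitted = new HashSet<>();
            for (String t : proposed) {
                emit(t, deps, emitted, result);
            }
            return result;
        }

        private static void emit(String t, Map<String, List<String>> deps,
                                 Set<String> emitted, List<String> result) {
            if (!emitted.add(t)) {
                return;  // already scheduled earlier in the order
            }
            for (String pre : deps.getOrDefault(t, Collections.emptyList())) {
                emit(pre, deps, emitted, result);  // prerequisites run first
            }
            result.add(t);
        }
    }

Because the fixup only consults the dependence map, the same map can be reused across prioritization, selection, and parallelization.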

Page 31:

Prioritization algorithm to nullify test dependence

A test suite → prioritization → prioritized test suite → general algorithm (with known test dependences) → reordered test suite

Measured average area under the curve (APFD) for the percentage of faults detected over the life of the test suite:
‒ APFD of the original prioritization algorithms was 89.1%.
‒ APFD of this dependence-aware algorithm was 88.1%, a negligible difference.
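For reference, APFD (average percentage of faults detected) has a standard definition in the prioritization literature:

    APFD = 1 - \frac{TF_1 + TF_2 + \cdots + TF_m}{n \cdot m} + \frac{1}{2n}

where n is the number of tests in the suite, m is the number of faults, and TF_i is the position in the execution order of the first test that reveals fault i. Higher is better, so the 1.0-point drop above means the dependence-aware reordering barely delays fault detection.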

Page 32:

Selection algorithm to nullify test dependence

A test suite → selection → selected test suite → general algorithm (with known test dependences) → reordered/amended test suite

Measured the number of tests selected:
‒ The original selection algorithms selected 41.6% of tests on average.
‒ This dependence-aware algorithm selected 42.2%, a negligible difference.
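Dependence-aware selection then reduces to one call to the general fixup: the technique's selected subset is amended with any missing prerequisites. A usage sketch reusing the hypothetical DependenceAwareRunner from the general-algorithm slide:

    import java.util.*;

    public class SelectionDemo {
        public static void main(String[] args) {
            // testRead is known to depend on testCreate having run first.
            Map<String, List<String>> deps = new HashMap<>();
            deps.put("testRead", Arrays.asList("testCreate"));

            // The selection technique kept only testRead.
            List<String> selected = Arrays.asList("testRead");

            // Prints [testCreate, testRead]: the prerequisite is added back.
            System.out.println(DependenceAwareRunner.fixUp(selected, deps));
        }
    }

Amending explains why slightly more tests run (41.6% vs. 42.2%) while each keeps its intended outcome.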

Page 33:

Parallelization algorithm to nullify test dependence

A test suite → parallelization → subsequences of the test suite → general algorithm (with known test dependences) → reordered/amended test suites

Measured the time taken by the slowest machine, and the average speedup compared to unparallelized suites:
‒ Average speedup of the original parallelization algorithms was 41%.
‒ This dependence-aware algorithm's speedup was 55%.
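For parallelization, the fixup must also keep each test on the same CPU as its prerequisites so that no ordering constraint crosses machines. One possible grouping heuristic (an assumption, not the paper's exact algorithm), again reusing the hypothetical fixUp:

    import java.util.*;

    public class DependenceAwareScheduler {
        static List<List<String>> schedule(List<String> tests,
                                           Map<String, List<String>> deps, int cpus) {
            List<List<String>> machines = new ArrayList<>();
            for (int i = 0; i < cpus; i++) {
                machines.add(new ArrayList<>());
            }
            Map<String, Integer> placedOn = new HashMap<>();  // test -> machine index
            int rr = 0;  // round-robin fallback for independent groups
            for (String t : tests) {
                if (placedOn.containsKey(t)) {
                    continue;
                }
                // The group is t plus its transitive prerequisites, in a valid order.
                List<String> group = DependenceAwareRunner.fixUp(
                        Collections.singletonList(t), deps);
                // Join the machine of an already-placed prerequisite, if any
                // (simplified: assumes a group never straddles two machines).
                int target = -1;
                for (String g : group) {
                    Integer m = placedOn.get(g);
                    if (m != null) { target = m; break; }
                }
                if (target < 0) {
                    target = rr++ % cpus;
                }
                for (String g : group) {
                    if (!placedOn.containsKey(g)) {
                        machines.get(target).add(g);
                        placedOn.put(g, target);
                    }
                }
            }
            return machines;
        }
    }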

Page 34:


Future work

• For test selection, measure the time it takes for our dependence-aware test suites to run compared to the dependence-unaware test suites

• Evaluate our effectiveness at incrementally recomputing test dependences when developers make code changes

Page 35:

Contributions

• Evaluating and coping with the impact of test dependence:
‒ Test dependence arises in practice
‒ Test dependence does affect downstream testing techniques
‒ Our general algorithm is effective in practice at nullifying the impact of test dependence

• Our tools, experiments, etc.: https://github.com/winglam/dependent-tests-impact/

Page 36:


[Backup slides]

Page 37:

Why more dependent tests in automatically-generated test suites?

• Manual test suites:
‒ Developers' understanding of the code and their testing goals help them build well-structured tests
‒ Developers often try to initialize and destroy the shared objects each unit test may use

• Automatically-generated test suites:
‒ Most tools are not "state-aware"
‒ The generated tests often "misuse" APIs, e.g., setting up the environment incorrectly
‒ Most tools cannot generate environment setup/teardown code

Page 38:

Dependent tests vs. nondeterministic tests

• Nondeterminism does not imply dependence:
‒ A program may execute nondeterministically, yet its tests may deterministically succeed.

• Test dependence does not imply nondeterminism:
‒ A program may have no sources of nondeterminism, yet its tests can still depend on each other.