When Tests Collide: Evaluating and Coping with the Impact of Test Dependence
Wing Lam, Sai Zhang, Michael D. Ernst
University of Washington
Two tests:
createFile("foo")...
readFile("foo")...
Executing them in default order: the intended test results.
Executing them in a different order: a different outcome; the tests are order dependent, making the second a dependent test.
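A minimal sketch of such a pair in JUnit 4 (hypothetical class and test names; the slide gives only createFile/readFile). JUnit's default method order is deterministic but unspecified, so any tool that reorders tests can expose the dependence:

```java
import static org.junit.Assert.assertTrue;

import java.io.File;
import java.io.IOException;
import org.junit.Test;

// Hypothetical order-dependent test pair modeled on the slide's example.
public class FileTests {
    @Test
    public void testCreateFile() throws IOException {
        File f = new File("foo");
        // Side effect on shared state (the file system): "foo" now exists.
        assertTrue(f.createNewFile() || f.exists());
    }

    @Test
    public void testReadFile() {
        // Passes only if testCreateFile already ran; run first or alone,
        // the file is missing and this dependent test fails.
        assertTrue(new File("foo").exists());
    }
}
```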
Why should we care about test dependence?
• Makes test behaviors inconsistent
• Affects downstream testing techniques
[Figure: downstream testing techniques affected by test dependence: test prioritization, test selection, and test parallelization across CPUs (CPU 1, CPU 2).]
• Test independence is assumed by:
– Test selection
– Test prioritization
– Test parallel execution
– Test factoring
– Test generation
– …
Conventional wisdom: test dependence is not a significant issue.
31 papers in ICSE, FSE, ISSTA, ASE, ICST, TSE, and TOSEM (2000 – 2013)
Of these 31 papers, 27 assume test independence without justification; the remaining 4 either mention test dependence only as a threat to validity or consider it explicitly.
Recent work
• Illinois MIR work on flaky tests and test dependences [Luo FSE'14, Gyori ISSTA'15, Gligoric ISSTA'15]
– Tests revealing inconsistent results
– A dependent test is a special type of flaky test
• UW PLSE work empirically revisiting the test independence assumption [Zhang et al. ISSTA'14]
– Test dependence should not be ignored
Is the test independence assumption valid? No!
• Does test dependence arise in practice?
– Yes, in both human-written and automatically-generated suites
• What repercussions does test dependence have?
– It affects downstream testing techniques
• How can we nullify the impact of test dependence?
– A general algorithm adds/reorders tests for techniques such as prioritization
Methodology
• Reported dependent tests: from 5 issue tracking systems
• New dependent tests: from 5 real-world projects
Methodology: reported dependent tests (5 issue tracking systems)
• Search for 4 key phrases: “dependent test”, “test dependence”, “test execution order”, “different test outcome”
• Manually inspect 450 matched bug reports
• Identify 96 distinct dependent tests
• Characteristics studied:
– Root cause
– Developers’ action
Root cause (96 dependent tests)
• Static variable: 59
• File system: 23
• Database: 10
• Unknown: 4
At least 61% are due to side-effecting access to static variables.
Developers’ action
• 98% of the reported tests are marked as major or minor issues
• 91% of the dependences have been fixed, by:
– Improving documentation
– Fixing test code or source code
Methodology: new dependent tests (5 real-world projects)
• Human-written test suites
– 6413 tests, of which 37 (0.6%) are dependent tests
• Automatically-generated test suites
– generated with Randoop [Pacheco'07]
– 11002 tests, of which 608 (5.5%) are dependent tests
• Subjects were selected from a previous project [Zhang et al. ISSTA'14] that identified dependent tests in them
Next question: what repercussions does test dependence have?
Test prioritization: given a test execution order, produce a new test execution order that achieves coverage faster, improves the fault detection rate, etc. Each test should yield the same result under the new order.
Four test prioritization techniques [Elbaum et al. ISSTA 2000]:
• Prioritize on coverage of statements
• Prioritize on coverage of statements not yet covered
• Prioritize on coverage of methods
• Prioritize on coverage of methods not yet covered
For each technique, record the number of dependent tests yielding different results.
Total: 37 human-written and 608 automatically-generated dependent tests from 5 real-world projects.
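As an illustration of the second technique, here is a minimal sketch of the greedy "additional coverage" strategy: repeatedly pick the test covering the most statements not yet covered. The types and the per-test coverage map are assumptions for illustration, not the tooling used in the paper:

```java
import java.util.*;

// Sketch of "prioritize on coverage of statements not yet covered"
// (the greedy "additional" heuristic). Coverage is assumed to be given
// as a set of statement IDs per test; ties are broken arbitrarily.
public class AdditionalCoveragePrioritizer {
    public static List<String> prioritize(Map<String, Set<Integer>> coverage) {
        List<String> order = new ArrayList<>();
        Set<Integer> covered = new HashSet<>();
        Set<String> remaining = new HashSet<>(coverage.keySet());
        while (!remaining.isEmpty()) {
            String best = null;
            int bestGain = -1;
            for (String test : remaining) {
                Set<Integer> gain = new HashSet<>(coverage.get(test));
                gain.removeAll(covered); // statements not yet covered
                if (gain.size() > bestGain) { bestGain = gain.size(); best = test; }
            }
            order.add(best);
            covered.addAll(coverage.get(best));
            remaining.remove(best);
        }
        return order;
    }
}
```

Note that nothing in this loop knows about test dependences, which is why reordering can change test outcomes.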
Evaluating test prioritization techniques (out of 37 human-written dependent tests)
Number of tests that yield different results, per technique:
• Prioritize on coverage of statements: 5 (13.5%)
• Prioritize on coverage of statements not yet covered: 9 (24.3%)
• Prioritize on coverage of methods: 7 (18.9%)
• Prioritize on coverage of methods not yet covered: 6 (16.2%)
Implication: on average, an 18% chance that test dependence affects test prioritization of human-written tests.
Evaluating test prioritization techniques (out of 608 automatically-generated dependent tests)
Number of tests that yield different results, per technique:
• Prioritize on coverage of statements: 372 (61.2%)
• Prioritize on coverage of statements not yet covered: 331 (54.4%)
• Prioritize on coverage of methods: 381 (62.3%)
• Prioritize on coverage of methods not yet covered: 357 (58.7%)
Implication: on average, a 59% chance that test dependence affects test prioritization of automatically-generated tests.
Test selection: given a test execution order, run only a subset of the test execution order, so testing runs faster. Each test should yield the same result in the subset.
Six test selection techniques [Harrold et al. OOPSLA 2001], by selection granularity and ordering:
• Statement granularity, ordered by test id (no re-ordering)
• Statement granularity, ordered by number of elements tests cover
• Statement granularity, ordered by number of uncovered elements tests cover
• Function granularity, ordered by test id (no re-ordering)
• Function granularity, ordered by number of elements tests cover
• Function granularity, ordered by number of uncovered elements tests cover
For each technique, record the number of dependent tests yielding different results.
Total: 37 human-written and 608 automatically-generated dependent tests from 5 real-world projects.
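A minimal sketch of coverage-based selection at statement granularity, under the common assumption that a test is re-run when it covers a changed statement (hypothetical types; the paper's infrastructure may differ):

```java
import java.util.*;

// Sketch: keep every test whose statement coverage intersects the set of
// changed statements; "test id (no re-ordering)" keeps the original order.
public class StatementLevelSelector {
    public static List<String> select(List<String> originalOrder,
                                      Map<String, Set<Integer>> coverage,
                                      Set<Integer> changedStatements) {
        List<String> selected = new ArrayList<>();
        for (String test : originalOrder) {
            Set<Integer> covered = coverage.getOrDefault(test, Collections.emptySet());
            if (!Collections.disjoint(covered, changedStatements)) {
                selected.add(test); // test exercises changed code, so keep it
            }
        }
        return selected;
    }
}
```

If a selected test depends on a deselected one, its outcome can change, which is what the next two slides quantify.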
Evaluating test selection techniques (out of 37 human-written dependent tests)
Number of tests that yield different results, per technique:
• Statement granularity, ordered by test id (no re-ordering): 1 (2.7%)
• Statement granularity, ordered by number of elements tests cover: 1 (2.7%)
• Statement granularity, ordered by number of uncovered elements tests cover: 1 (2.7%)
• Function granularity, ordered by test id (no re-ordering): 1 (2.7%)
• Function granularity, ordered by number of elements tests cover: 1 (2.7%)
• Function granularity, ordered by number of uncovered elements tests cover: 2 (5.4%)
Implication: on average, a 3.2% chance that test dependence affects test selection of human-written tests.
Evaluating test selection techniques (out of 608 automatically-generated dependent tests)
Number of tests that yield different results, per technique:
• Statement granularity, ordered by test id (no re-ordering): 95 (15.6%)
• Statement granularity, ordered by number of elements tests cover: 109 (17.9%)
• Statement granularity, ordered by number of uncovered elements tests cover: 109 (17.9%)
• Function granularity, ordered by test id (no re-ordering): 266 (44.0%)
• Function granularity, ordered by number of elements tests cover: 294 (48.4%)
• Function granularity, ordered by number of uncovered elements tests cover: 297 (48.8%)
Implication: on average, a 32% chance that test dependence affects test selection of automatically-generated tests.
Test parallelization: given a test execution order, schedule the test execution across multiple CPUs (CPU 1, CPU 2, ...) to reduce test latency. Each test should yield the same result under the parallel schedule.
Two test parallelization techniques [1]:
• Parallelize on test id
• Parallelize on test execution time
For each technique, record the number of dependent tests yielding different results.
Total: 37 human-written and 608 automatically-generated dependent tests from 5 real-world projects.
[1] Executing unit tests in parallel on a multi-CPU/core machine in Visual Studio. http://blogs.msdn.com/b/vstsqualitytools/archive/2009/12/01/executing-unit-tests-in-parallel-on-a-multi-cpu-core-machine.aspx
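A minimal sketch of the second technique as a greedy longest-processing-time scheduler; the balancing heuristic and the runtime map are assumptions for illustration:

```java
import java.util.*;

// Sketch of "parallelize on test execution time": sort tests by descending
// past runtime, then assign each to the currently least-loaded CPU.
public class TimeBalancedScheduler {
    public static List<List<String>> schedule(Map<String, Long> runtimeMs, int cpus) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < cpus; i++) buckets.add(new ArrayList<>());
        long[] load = new long[cpus];

        List<String> tests = new ArrayList<>(runtimeMs.keySet());
        tests.sort((a, b) -> Long.compare(runtimeMs.get(b), runtimeMs.get(a)));

        for (String test : tests) {
            int min = 0; // index of the least-loaded CPU so far
            for (int i = 1; i < cpus; i++) if (load[i] < load[min]) min = i;
            buckets.get(min).add(test);
            load[min] += runtimeMs.get(test);
        }
        return buckets; // one ordered sublist of tests per CPU
    }
}
```

Either technique splits the original order across CPUs, so a test can run before the test it depends on, or on a different CPU entirely.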
Evaluating test parallelization techniques (out of 37 human-written dependent tests)
Number of tests that yield different results:
• Parallelize on test id: 2 (5.4%) with #CPUs = 2; 13 (35.1%) with #CPUs = 16
• Parallelize on test execution time: 14 (37.8%) with #CPUs = 2; 14 (37.8%) with #CPUs = 16
(#CPUs = 4 and 8 were evaluated but omitted for space reasons.)
Implication: on average, when #CPUs = 2, a 27% chance that test dependence affects test parallelization of human-written tests; when #CPUs = 16, a 36% chance.
Evaluating test parallelization techniques (out of 608 automatically-generated dependent tests)
Number of tests that yield different results:
• Parallelize on test id: 194 (31.9%) with #CPUs = 2; 349 (57.4%) with #CPUs = 16
• Parallelize on test execution time: 360 (59.2%) with #CPUs = 2; 433 (71.2%) with #CPUs = 16
(#CPUs = 4 and 8 were evaluated but omitted for space reasons.)
Implication: on average, when #CPUs = 2, a 46% chance that test dependence affects test parallelization of automatically-generated tests; when #CPUs = 16, a 64% chance.
Impact of test dependence
Chance of impact by test dependence, per technique and test suite type:
• Prioritization, human-written: Low
• Prioritization, automatically-generated: High
• Selection, human-written: Low
• Selection, automatically-generated: Moderate
• Parallelization, human-written: Moderate
• Parallelization, automatically-generated: High
Legend (average % of dependent tests exposed): Low = 0-25%, Moderate = 25-50%, High = over 50%.
Dependent tests do affect downstream testing techniques, especially for automatically-generated test suites!
Next question: how can we nullify the impact of test dependence?
General algorithm to nullify test dependence
Input: a test suite (the product of test prioritization, selection, or parallelization) and known test dependences.
Output: a reordered/amended test suite.
Known test dependences:
– can be generated through approximate algorithms [Zhang et al. ISSTA'14], or start out empty
– are reusable across different testing techniques and when developers change their code
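A minimal sketch of one way such an algorithm can work (not necessarily the paper's exact algorithm): walk the technique's output order and, before each test, emit any known prerequisites that have not run yet, adding them back if a technique such as selection dropped them. Dependences are assumed acyclic:

```java
import java.util.*;

// Sketch of a dependence-aware fixer: mustRunBefore maps each test to the
// tests that must execute earlier than it (known test dependences).
public class DependenceAwareReorderer {
    public static List<String> fix(List<String> order,
                                   Map<String, List<String>> mustRunBefore) {
        List<String> result = new ArrayList<>();
        Set<String> scheduled = new HashSet<>();
        for (String test : order) emit(test, mustRunBefore, result, scheduled);
        return result;
    }

    private static void emit(String test, Map<String, List<String>> deps,
                             List<String> result, Set<String> scheduled) {
        if (!scheduled.add(test)) return; // already placed earlier
        for (String pre : deps.getOrDefault(test, Collections.emptyList()))
            emit(pre, deps, result, scheduled); // prerequisites go first
        result.add(test);
    }
}
```

Because the fixer only consults the dependence map, the same map can be reused across prioritization, selection, and parallelization, as the slide notes.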
Prioritization algorithm to nullify test dependence
Measured: the average area under the curve for the percentage of faults detected over the life of the test suite (APFD).
– APFD of the original prioritization algorithms was 89.1%.
– APFD of this dependence-aware algorithm was 88.1%, a negligible difference.
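For reference, APFD is the standard metric from Elbaum et al.: with n tests, m faults, and TF_i the position in the order of the first test that reveals fault i,

\mathrm{APFD} = 1 - \frac{TF_1 + TF_2 + \cdots + TF_m}{n\,m} + \frac{1}{2n}

Higher is better; a 1.0-point drop means the dependence-aware reordering barely delays fault detection.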
[Figure: a test suite goes through prioritization; the general algorithm then reorders the prioritized test suite using known test dependences.]
Selection algorithm to nullify test dependence
Measured: the number of tests selected.
– The original selection algorithms selected 41.6% of tests on average.
– This dependence-aware algorithm selected 42.2%, a negligible difference.
[Figure: a test suite goes through selection; the general algorithm then reorders/amends the selected test suite using known test dependences.]
Parallelization algorithm to nullify test dependence
Measured: the time taken by the slowest machine, and the average speedup compared to unparallelized suites.
– Average speedup of the original parallelization algorithms was 41%.
– This dependence-aware algorithm's speedup was 55%.
[Figure: a test suite is parallelized into subsequences; the general algorithm then reorders/amends each subsequence using known test dependences.]
Future work
• For test selection, measure the time it takes for our dependence-aware test suites to run compared to the dependence-unaware test suites
• Evaluate our effectiveness at incrementally recomputing test dependences when developers make code changes
Contributions
• Evaluating and coping with the impact of test dependence:
– Test dependence arises in practice
– Test dependence does affect downstream testing techniques
– Our general algorithm is effective in practice at nullifying the impact of test dependence
• Our tools, experiments, etc.: https://github.com/winglam/dependent-tests-impact/
[Backup slides]
Why more dependent tests in automatically-generated test suites?
• Manual test suites:
– Developers' understanding of the code and their testing goals helps them build well-structured tests
– Developers often try to initialize and destroy the shared objects each unit test may use
• Auto-generated test suites:
– Most tools are not "state-aware"
– The generated tests often "misuse" APIs, e.g., setting up the environment incorrectly
– Most tools cannot generate environment setup/teardown code
Dependent tests vs. nondeterministic tests
• Nondeterminism does not imply dependence
– A program may execute nondeterministically, but its tests may deterministically succeed.
• Test dependence does not imply nondeterminism
– A program may have no sources of nondeterminism, but its tests can still be dependent on each other.
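A hypothetical example of the second point: the code below is fully deterministic, yet the tests' outcomes depend only on the order in which they run (assumed names, JUnit 4):

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Deterministic code whose tests are order dependent: both tests touch the
// same static field, so the outcome is fixed by the execution order alone.
public class CounterTests {
    static int counter = 0;

    @Test
    public void testStartsAtZero() {
        // Fails deterministically whenever testIncrement ran first.
        assertEquals(0, counter);
    }

    @Test
    public void testIncrement() {
        counter++;
        assertEquals(1, counter);
    }
}
```

Every run under a fixed order gives the same result; there is no nondeterminism, only dependence.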