From Continuous Delivery to
Continuous Recovery
Fixing the intermittent failures in your CI pipeline
Greg Law
Co-Founder and CTO, Undo
From 1990 to 2005, development hardly changed
In the last ten years, everything has changed
Test OK?
What does this mean??
100% test coverage?
(obviously not.)
100% reliable test-suite?
Absolutely!
The productivity vs quality tradeoff
[chart: productivity plotted against quality]
Arithmetic Lesson
● 50 tests, run once per week; half solid, half 99% reliable
   → 25 × 4 = 100 runs per month of the 99% tests ≈ 1 failure per month
● 50,000 test runs per hour; half solid, half 99% reliable
   → 25,000 × 0.01 = 250 failures per hour; 250 × 24 × 7 × 4.333 ≈ 182,000 per month
● 50,000 test runs per hour; 99% solid, the other 1% are 99.9% reliable
   → 50,000 × 0.01 × 0.001 = 0.5 failures per hour, i.e. one every 2 hours; 168 ÷ 2 = 84 per week; 84 × 4.333 ≈ 364 per month
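The same arithmetic, spelled out as a small standalone C program (a worked restatement of the slide's numbers, not code from the talk):

```c
#include <stdio.h>

int main(void) {
    /* Scenario 1: 50 tests run once a week; 25 solid, 25 only 99% reliable. */
    double monthly_runs = 25 * 4;                 /* 100 runs/month of the 99% tests */
    printf("weekly suite:   %.0f failure(s) per month\n", monthly_runs * 0.01);

    /* Scenario 2: 50,000 test runs per hour; half solid, half 99% reliable. */
    double per_hour = 25000 * 0.01;               /* 250 failures per hour */
    printf("hourly suite:   %.0f failures/hour, ~%.0f per month\n",
           per_hour, per_hour * 24 * 7 * 4.333);

    /* Scenario 3: 50,000 test runs per hour; 99% solid, the rest 99.9% reliable. */
    double rare = 50000 * 0.01 * 0.001;           /* 0.5 failures per hour */
    printf("improved suite: %.1f failures/hour, %.0f per week, ~%.0f per month\n",
           rare, rare * 168, rare * 168 * 4.333);
    return 0;
}
```

Even with every individual test 99.9% reliable or better, an hourly suite of this size still fails hundreds of times a month.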
The productivity vs quality tradeoff
[chart: productivity plotted against quality]
Go green, stay green
Going green is the hard bit.
But it’s essential.
Step 1: exclude flaky tests
An ever-growing backlog of tests that are flaky, where no one understands why.
The CI/CD vision assumes reliable, repeatable testing. What to do?
Remove the flaky tests?
Viable, but has the obvious flaw of reducing coverage. The flaky tests are often the most interesting.
Write only deterministic tests?
Not viable, because deterministic tests cannot catch non-deterministic errors such as race conditions (a minimal example follows below). It also excludes fuzz testing and other powerful techniques.
Fix the flaky tests?
Gee thanks, great advice(!)
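To make the race-condition point concrete, here is a minimal illustrative C example (not from the talk): a test asserting on the final counter value can pass or fail depending on thread scheduling, which is exactly the kind of bug a single deterministic run cannot demonstrate.

```c
/* Compile with: gcc -pthread race.c
 * Two threads increment a shared counter without synchronization.
 * Whether the program prints 2000000 depends on scheduling: some runs
 * pass, others lose updates, so a test on the total fails intermittently. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;

static void *bump(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                      /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return counter == 2000000 ? 0 : 1;  /* non-zero exit = test failure */
}
```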
Intermittent test failures kill us
● Thousands more test runs every hour.
● Even a 0.1% failure rate is very bad news.
● Most of them probably don’t really matter.
● So we’ll come back to them later; it should be less hectic next week.
An ever-growing backlog of tests that are flaky, where no one understands why.
University of Cambridge Study
The majority of software firms struggle to keep their test suites green and to reproduce failing tests.
Does your CI run uncover failures less than 1% of the time?
How much faster could you deliver releases if reproducing failures were not an issue?
What is the biggest barrier to finding and fixing bugs in your backlog faster?
Software engineers spend 13 hours, on average, fixing a failing test from the backlog.
● Fixing bugs (once reproduced)
● Automating the re-run of tests
● Writing tests
● Extending and adapting tools
● Getting the bug to reproduce
“With LiveRecorder, we were able to dramatically cut down the analysis time that is required to understand the root cause of very complex software defects.”
— Dr. Alexander Böhm, Chief Development Architect, SAP HANA
CI Stress Testing of SAP HANA
• SAP HANA is an enterprise-class, in-memory database management system
• OLTP and OLAP, relational and NoSQL functionality in a single system
• Complex codebase
• Very strict quality and governance processes
• Sophisticated continuous integration platform
• Large functional and performance test harness (see Rehmann@RDSS 2014)
• “Regular” tests plus highly parallel, multi-user stress tests (PMUT)
• Arbitrary database operations (DML, DDL, etc.) in parallel
• High stress on system resources
• Complements other tests with explorative/non-deterministic testing (a miniature sketch follows below)
• Similar approaches exist for other systems (“chaos monkey”)
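A hypothetical miniature of that idea (nothing like SAP's actual PMUT harness; every name here is made up for illustration): worker threads apply randomly chosen operations to shared state, and the check is an invariant rather than a fixed expected output, so each run explores different interleavings.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define WORKERS        4
#define OPS_PER_WORKER 100000

/* Shared state: two "accounts" whose total must stay constant. */
static long balance_a = 1000, balance_b = 1000;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    unsigned seed = (unsigned)(size_t)arg;        /* per-thread RNG seed */
    for (int i = 0; i < OPS_PER_WORKER; i++) {
        long amount = rand_r(&seed) % 10;         /* random operation mix */
        pthread_mutex_lock(&lock);
        if (rand_r(&seed) % 2) { balance_a -= amount; balance_b += amount; }
        else                   { balance_b -= amount; balance_a += amount; }
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[WORKERS];
    for (int i = 0; i < WORKERS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)(size_t)(i + 1));
    for (int i = 0; i < WORKERS; i++)
        pthread_join(threads[i], NULL);

    /* Invariant check: transfers must never change the total. */
    if (balance_a + balance_b != 2000) {
        printf("FAIL: total = %ld\n", balance_a + balance_b);
        return 1;
    }
    printf("PASS\n");
    return 0;
}
```

Remove the locking (or narrow it incorrectly) and the invariant check starts failing on some runs only, which is precisely the class of failure such stress tests are designed to surface.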
Software Flight Recording Technology
The solution
● SAP uses LiveRecorder from Undo to record multi-user stress test (PMUT) runs
● When a failure occurs, the recording is kept and handed over to developers to diagnose
● This turns a sporadic problem into a 100% reproducible one
● SAP developers use LiveRecorder’s interactive reversible debugger, UndoDB, on the recording to diagnose the root cause of the problem
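A hedged illustration of why a recording plus reverse execution shortens diagnosis (a toy example, not SAP's code; the program, variable names, and commands are assumptions for illustration): in the sketch below the corrupting write happens long before the symptom, so on a recording you could set a watchpoint on the corrupted field and run backwards to the write that changed it, e.g. `watch cfg->limit` followed by `reverse-continue` in GDB's or UndoDB's reverse-debugging mode.

```c
/* Toy use-after-free: the corruption and the symptom are far apart. */
#include <stdio.h>
#include <stdlib.h>

struct config { int limit; };

int main(void) {
    int *scratch = malloc(sizeof *scratch);
    free(scratch);                        /* scratch is now dangling */

    struct config *cfg = malloc(sizeof *cfg);
    cfg->limit = 100;

    *scratch = -1;                        /* use-after-free: may clobber cfg->limit
                                             if the allocator reused the freed chunk */

    if (cfg->limit != 100)                /* the symptom appears here...              */
        printf("corrupted limit: %d\n",   /* ...but the root cause is the write above */
               cfg->limit);
    free(cfg);
    return 0;
}
```

Running forwards you only see the symptom; with a recording, the watchpoint plus reverse execution lands you directly on the write that corrupted the value, however long ago it happened.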
Captured in test and diagnosed with LiveRecorder
● A number of sporadic memory leaks and memory corruption defects
● Several issues in the networking code, including incorrect flushing of a receive buffer and sporadic release of channels on timeout, which resulted in queries incorrectly aborting
● Incorrect parallel access to a shared data structure, which resulted in very subtle sporadic problems that were hard to reproduce
● A very sporadic race condition in SAP HANA’s asynchronous garbage collection for in-memory table structures during table unloads under heavy system load
● A race condition in SAP HANA’s transaction management cache with the potential of incorrectly reusing cached session data
Questions?
@gregthelaw
https://undo.io/resources/gdb-watchpoint/