From Continuous Delivery to
Continuous Recovery
Fixing the intermittent failures in your CI pipeline
Greg Law
Co-Founder and CTO, Undo
From 1990 to 2005, development hardly changed
In the last ten years, everything has changed
Test OK?
What does this mean??
100% test coverage?
(obviously not.)
100% reliable test-suite?
Absolutely!
The productivity vs quality tradeoff
[chart: productivity plotted against quality]
Arithmetic Lesson
● 50 tests, run once per week; half solid, half 99% reliable
   → 25 × 4 = 100 runs per month of the 99% tests ≈ 1 failure per month
● 50,000 test runs per hour; half solid, half 99% reliable
   → 25,000 × 0.01 = 250 failures per hour; 250 × 24 × 7 × 4.333 ≈ 182,000 per month
● 50,000 test runs per hour; 99% solid, the other 1% are 99.9% reliable
   → 50,000 × 0.01 × 0.001 = 0.5 failures per hour, i.e. one every 2 hours; 168 ÷ 2 = 84 per week; 84 × 4.333 ≈ 364 per month
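The same arithmetic, spelled out as a small standalone C program (a worked restatement of the slide's numbers, not code from the talk):

```c
#include <stdio.h>

int main(void) {
    /* Scenario 1: 50 tests run once a week; 25 solid, 25 only 99% reliable. */
    double monthly_runs = 25 * 4;                 /* 100 runs/month of the 99% tests */
    printf("weekly suite:   %.0f failure(s) per month\n", monthly_runs * 0.01);

    /* Scenario 2: 50,000 test runs per hour; half solid, half 99% reliable. */
    double per_hour = 25000 * 0.01;               /* 250 failures per hour */
    printf("hourly suite:   %.0f failures/hour, ~%.0f per month\n",
           per_hour, per_hour * 24 * 7 * 4.333);

    /* Scenario 3: 50,000 test runs per hour; 99% solid, the rest 99.9% reliable. */
    double rare = 50000 * 0.01 * 0.001;           /* 0.5 failures per hour */
    printf("improved suite: %.1f failures/hour, %.0f per week, ~%.0f per month\n",
           rare, rare * 168, rare * 168 * 4.333);
    return 0;
}
```

Even with every individual test 99.9% reliable or better, an hourly suite of this size still fails hundreds of times a month.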
The productivity vs quality tradeoff
[chart: productivity plotted against quality]
Go green, stay green
Going green is the hard bit.
But it’s essential.
Step 1: exclude flaky tests
An ever-growing backlog of tests that are flaky, where no one understands why.
The CI/CD vision assumes reliable, repeatable testing. What to do?
Remove the flaky tests?
Viable, but has the obvious flaw of reducing coverage. The flaky tests are often the most interesting.
Write only deterministic tests?
Not viable, because deterministic tests cannot catch non-deterministic errors such as race conditions (a minimal example follows below). It also excludes fuzz testing and other powerful techniques.
Fix the flaky tests?
Gee thanks, great advice(!)
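To make the race-condition point concrete, here is a minimal illustrative C example (not from the talk): a test asserting on the final counter value can pass or fail depending on thread scheduling, which is exactly the kind of bug a single deterministic run cannot demonstrate.

```c
/* Compile with: gcc -pthread race.c
 * Two threads increment a shared counter without synchronization.
 * Whether the program prints 2000000 depends on scheduling: some runs
 * pass, others lose updates, so a test on the total fails intermittently. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;

static void *bump(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                      /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return counter == 2000000 ? 0 : 1;  /* non-zero exit = test failure */
}
```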
Intermittent test failures kill us
● Thousands more test runs every hour.
● Even a 0.1% failure rate is very bad news.
● Most of them probably don’t really matter.
● So we’ll come back to them later; it should be less hectic next week.
An ever-growing backlog of tests that are flaky, where no one understands why.
University of Cambridge Study
The majority of software firms struggle to keep their test suites green and to reproduce failing tests.
Does your CI run uncover failures less than 1% of the time?
How much faster could you deliver releases if reproducing failures were not an issue?
What is the biggest barrier to finding and fixing bugs in your backlog faster?
Software engineers spend 13 hours, on average, fixing a failing test from the backlog.
● Fixing bugs (once reproduced)
● Automating the re-run of tests
● Writing tests
● Extending and adapting tools
● Getting the bug to reproduce
“With LiveRecorder, we were able to dramatically cut down the analysis time that is required to understand the root cause of very complex software defects.”
— Dr. Alexander Böhm, Chief Development Architect, SAP HANA
CI Stress Testing of SAP HANA
• SAP HANA is an enterprise-class, in-memory database management system
• OLTP and OLAP, relational and NoSQL functionality in a single system
• Complex codebase
• Very strict quality and governance processes
• Sophisticated continuous integration platform
• Large functional and performance test harness (see Rehmann@RDSS 2014)
• “Regular” tests plus highly parallel, multi-user stress tests (PMUT)
• Arbitrary database operations (DML, DDL, etc.) in parallel
• High stress on system resources
• Complements other tests with explorative/non-deterministic testing (a miniature sketch follows below)
• Similar approaches exist for other systems (“chaos monkey”)
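A hypothetical miniature of that idea (nothing like SAP's actual PMUT harness; every name here is made up for illustration): worker threads apply randomly chosen operations to shared state, and the check is an invariant rather than a fixed expected output, so each run explores different interleavings.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define WORKERS        4
#define OPS_PER_WORKER 100000

/* Shared state: two "accounts" whose total must stay constant. */
static long balance_a = 1000, balance_b = 1000;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    unsigned seed = (unsigned)(size_t)arg;        /* per-thread RNG seed */
    for (int i = 0; i < OPS_PER_WORKER; i++) {
        long amount = rand_r(&seed) % 10;         /* random operation mix */
        pthread_mutex_lock(&lock);
        if (rand_r(&seed) % 2) { balance_a -= amount; balance_b += amount; }
        else                   { balance_b -= amount; balance_a += amount; }
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[WORKERS];
    for (int i = 0; i < WORKERS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)(size_t)(i + 1));
    for (int i = 0; i < WORKERS; i++)
        pthread_join(threads[i], NULL);

    /* Invariant check: transfers must never change the total. */
    if (balance_a + balance_b != 2000) {
        printf("FAIL: total = %ld\n", balance_a + balance_b);
        return 1;
    }
    printf("PASS\n");
    return 0;
}
```

Remove the locking (or narrow it incorrectly) and the invariant check starts failing on some runs only, which is precisely the class of failure such stress tests are designed to surface.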
Software Flight Recording Technology
The solution
● SAP uses LiveRecorder from Undo to record multi-user stress test (PMUT) runs
● When a failure occurs, the recording is kept and handed over to developers to diagnose
● This turns a sporadic problem into a 100% reproducible one
● SAP developers use LiveRecorder’s interactive reversible debugger, UndoDB, on the recording to diagnose the root cause of the problem
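A hedged illustration of why a recording plus reverse execution shortens diagnosis (a toy example, not SAP's code; the program, variable names, and commands are assumptions for illustration): in the sketch below the corrupting write happens long before the symptom, so on a recording you could set a watchpoint on the corrupted field and run backwards to the write that changed it, e.g. `watch cfg->limit` followed by `reverse-continue` in GDB's or UndoDB's reverse-debugging mode.

```c
/* Toy use-after-free: the corruption and the symptom are far apart. */
#include <stdio.h>
#include <stdlib.h>

struct config { int limit; };

int main(void) {
    int *scratch = malloc(sizeof *scratch);
    free(scratch);                        /* scratch is now dangling */

    struct config *cfg = malloc(sizeof *cfg);
    cfg->limit = 100;

    *scratch = -1;                        /* use-after-free: may clobber cfg->limit
                                             if the allocator reused the freed chunk */

    if (cfg->limit != 100)                /* the symptom appears here...              */
        printf("corrupted limit: %d\n",   /* ...but the root cause is the write above */
               cfg->limit);
    free(cfg);
    return 0;
}
```

Running forwards you only see the symptom; with a recording, the watchpoint plus reverse execution lands you directly on the write that corrupted the value, however long ago it happened.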
Captured in test and diagnosed with LiveRecorder
● A number of sporadic memory leaks and memory corruption defects
● Several issues in the networking code, including incorrect flushing of a receive buffer and sporadic release of channels on timeout, which resulted in queries incorrectly aborting
● Incorrect parallel access to a shared data structure, which resulted in very subtle sporadic problems that were hard to reproduce
● A very sporadic race condition in SAP HANA’s asynchronous garbage collection for in-memory table structures during table unloads under heavy system load
● A race condition in SAP HANA’s transaction management cache with the potential of incorrectly reusing cached session data
Questions?
@gregthelaw
https://undo.io/resources/gdb-watchpoint/