Copyright© 2016 NTT Corp. All Rights Reserved.

Flaky Tests and Bugs in Apache Software (e.g. Hadoop)

Akihiro Suda <suda.akihiro@lab.ntt.co.jp>

NTT Software Innovation Center

ApacheCon Core North America (May 12, 2016, at Vancouver)

Who am I

• Software Engineer at NTT Corporation
  • NTT: the largest telecom in Japan
• Engaged in improving the reliability of distributed systems
• Some contributions to ZooKeeper / Hadoop, including critical bug fixes (non-committer)
• GitHub: https://github.com/AkihiroSuda

Agenda

• Current "flakiness" in Apache software
• Why do flaky tests matter?
• What causes a flaky test?
• How can we find, reproduce, and fix a flaky test?
• Existing work in Apache communities
• Our work: Namazu (鯰, catfish)
  https://github.com/osrg/namazu

Good news: Apache software is well tested!

Software     Production code (LOC)   Test code (LOC)
MapReduce    95K                     87K
YARN         178K                    121K
HDFS         152K                    150K
ZooKeeper    33K                     27K
HBase        571K                    222K
Spark        167K                    128K
Flume        46K                     34K
Cassandra    168K                    78K

Data measured on 14/01/2016, using CLOC

Bad news: https://builds.apache.org/job/%s-trunk/

[Jenkins build-status charts for the MapReduce, YARN, HDFS, ZooKeeper, and HBase trunk builds; blue = success, red = failure. Data captured on 14/01/2016.]

I've never seen a fully successful Hadoop build, even on my local machine...

Bad news: JIRA QL: project = ? AND text ~ "test fail*"

Software     #Matched      #All Issues
MapReduce    2,441 (38%)   6,373
YARN         2,290 (63%)   4,756
HDFS         5,141 (53%)   9,672
ZooKeeper    828 (35%)     2,384
HBase        6,595 (42%)   15,542
Spark        794 (6%)      14,047
Flume        342 (12%)     2,882
Cassandra    1,656 (15%)   11,430

Data captured on 4/4/2016 (just for approximation)

Roughly speaking, half of Hadoop development is dedicated to debugging test failures. Interestingly, the flakiness does not seem uniform across projects (discussed later).

Not all test failures are critical for production...

• 97% of unit test failures in Apache software are said to be harmless for production ("false alarms")
• Source: "An Empirical Study of Bugs in Test Code" (A. Vahabzadeh et al., ICSME'15)

So flaky tests don't matter, as they don't affect production?

It still matters!

For developers:
• It's a barrier to the promotion of CI: if many tests are flaky, developers tend to ignore CI failures and overlook real bugs
• It's also a psychological barrier to contribution: a developer may be blamed for a test failure

For users:
• It's a barrier to risk assessment for production: no one can tell flaky tests from real bugs

SemaphoreCI suggests a "no broken windows" strategy for flaky tests:
https://semaphoreci.com/community/tutorials/how-to-deal-with-and-eliminate-flaky-tests

image: http://guides.lib.jjay.cuny.edu/nypd/brokenwindows

Basic cause: async operations

• A typical flaky test is caused by a malformed async operation like this
  (A. Vahabzadeh et al., ICSME'15 / Q. Luo et al., ACM FSE'14 / YARN-4478):

    invokeAsyncOperation();
    // some tests lack even this sleep
    sleep(certainHardcodedTimeout);
    assertTrue(checkSomethingGoodHasHappened());

• Basically it can be fixed by increasing timeouts & retries
• But it's not easy to find a reasonable timeout value (e.g. YARN-{4804, 4807, 4929...})
• A long timeout is expensive
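For contrast, one common mitigation is to poll the condition with a bounded deadline instead of a single fixed sleep, so the test succeeds as soon as the async operation completes and only pays the full timeout on failure. A minimal sketch (the helper name and the timeout values are illustrative, not taken from any Hadoop test):

```java
import java.util.function.BooleanSupplier;

public class AwaitExample {
    // Poll `condition` every pollMs until it holds or deadlineMs elapses.
    static boolean await(BooleanSupplier condition, long deadlineMs, long pollMs)
            throws InterruptedException {
        long start = System.currentTimeMillis();
        while (System.currentTimeMillis() - start < deadlineMs) {
            if (condition.getAsBoolean()) {
                return true; // success as soon as the async operation completes
            }
            Thread.sleep(pollMs);
        }
        return condition.getAsBoolean(); // last chance at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        final long t0 = System.currentTimeMillis();
        // Simulated async operation that "completes" after 200 ms.
        boolean ok = await(() -> System.currentTimeMillis() - t0 >= 200, 5000, 50);
        System.out.println(ok);
    }
}
```

The deadline stays large for slow CI machines, while a fast local run finishes in a fraction of it.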

Testbeds (e.g. CI) can cause test failures as well

• Host configuration
• Host performance
• Docker is great! But it still has some issues

CI host configuration can cause test failures

• HADOOP-12687
  • Many YARN tests fail when /etc/hosts has multiple loopback entries
• ZOOKEEPER-2252
  • Test: nslookup("a") should fail
  • It does not fail when there actually is a host named "a"
• INFRA-11811
  • The JDK was not set up properly on a Jenkins slave
• Such a test can fail when the job is assigned to a specific buildbot, and it looks like a flaky test

CI host performance: they're not made equal

• Hadoop's buildbot: https://builds.apache.org/computer/
  (data captured on 25/04/2016)

CI host performance: they're not made equal

• Spark's buildbot: https://amplab.cs.berkeley.edu/jenkins/computer/

CI host performance: they're not made equal

• Significant difference in response time!

Target    Average   Max      Min
Hadoop    1163ms    1482ms   30ms
Spark     3ms       6ms      0ms

• Maybe related to the fact that Spark has only a small number of test-related issues (e.g. YARN 63% vs Spark 6% in the JIRA table above)

Docker issues

Docker is great for testing!
• Some Apache software uses Docker on its CI (via Apache Yetus)
• Apache BigTop also utilizes Docker for provisioning Hadoop
• People also love Docker for setting up testbeds on their workstations and laptops
• Of course, me too

Docker #18180: Java VM becomes an unkillable zombie

• Mentioned in several Apache-related issue tickets:
  • jupyter/docker-stacks#75: Spark hanging
  • docker-library/cassandra#43, #46
  • docker-solr/docker-solr#4
  • ALLURA-8039
  • AMBARI-14706
  • IGNITE-2377
  • YETUS-229 ...
• Fortunately the Apache buildbot (Yetus) didn't hit the bug, but it made people's local testbeds flaky in a weird way
• Fixed in recent kernels (so, strictly speaking, it's not a Docker issue)

Other potential Docker-related issues

• AUFS: fcntl(F_SETFL, O_APPEND) was not supported (#20199)
  • Can cause data corruption (Dovecot is known to be affected)
  • Fixed in recent AUFS
• Overlay: you should not open a file O_RDWR and O_RDONLY simultaneously (#10180)
  • Can cause data corruption (RPM is known to be affected)
  • Expected behavior; won't be fixed
• More information: https://github.com/AkihiroSuda/docker-issues

Flaky tests are not limited to xUnit in CI...

• Some issues can occur only in a deployed environment rather than in CI
  • e.g. TCP packet corruption
• Very flaky and critical

TCP packet corruption

• https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/
  • The TCP checksum was ignored in some IPsec configurations
  • ZooKeeper became weird intermittently due to corrupted TCP packets
• https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19#.gq8chzply
  • The TCP checksum was ignored in some veth configurations
  • Mesos and Kubernetes are affected

TCP packet corruption

• It's very hard to notice (and reproduce) flaky TCP packet corruption...
• Should distributed systems be tolerant of TCP corruption?
  • The probability is very low in regular environments (32-bit Ethernet CRC + 16-bit TCP checksum), but it is not zero
• JIRA issues: ZOOKEEPER-2175, HDFS-8161...

Efforts to find/reproduce flaky tests

• determine-flaky-tests-hadoop.py
• Apache Kudu's CI (dist_test)
• Google's TAP
• Our work: Namazu
  https://github.com/osrg/namazu
• ...and similar great tools

determine-flaky-tests-hadoop.py

• Picks up failed tests using the Jenkins API
• Included in hadoop.git/dev-support (HADOOP-11045)

    $ determine-flaky-tests-hadoop.py --job Hadoop-YARN-trunk
    ****Recently FAILED builds in url: https://builds.apache.org/job/Hadoop-YARN-trunk
    ...
    Among 15 runs examined, all failed tests <#failedRuns: testName>:
        7: TestContainerManagerRecovery.testApplicationRecovery
    ...

determine-flaky-tests-hadoop.py

• A great tool, but it doesn't support running a specific test repeatedly
• There is also a Maven dependency issue (YARN-4478):
  • B depends on A
  • TestB is never executed if TestA fails
  • So if TestA is flaky, we can't evaluate the flakiness of TestB!

Kudu's CI: flaky test dashboard

• http://dist-test.cloudera.org:8080/ (Apr 25)
• Recently open-sourced and introduced at Apache: Big Data (Monday)
  https://github.com/cloudera/dist_test

Kudu's CI: flaky test dashboard

• Tests are run repeatedly on CI to find flaky tests
  • KUDU_FLAKY_TEST_ATTEMPTS
  • KUDU_FLAKY_TEST_LIST

From https://github.com/apache/incubator-kudu/commit/1a24338a:

    Fix flakiness of client_failover-itest
    The reason this test was flaky is that there is a race between...
    Looped 100x and they all passed:
    http://dist-test.cloudera.org/job?job_id=mpercy.1454486819.10566

    Author: Mike Percy, Jan 29, 2016 8:01 AM
    Committer: Todd Lipcon, Feb 4, 2016 2:14 PM
    Commit: 1a24338ad60a8842d1ae5e227f8f03e58faea8c0

Google's TAP

• Google's internal CI
• 1.6M test failures per day
  • 73K (4.5%) are flaky
• Repeats a failing test 10 times to label flaky tests
• Source: "An Empirical Analysis of Flaky Tests" (Q. Luo et al., ACM FSE'14)

Challenge: poor non-determinism

• Modern CIs run jobs repeatedly to find / reproduce flaky tests
• But they don't control non-determinism, so they can:
  • Overlook a flaky test
  • Fail to reproduce a failure, and thus be unable to analyze it
• Our suggestion: increase non-determinism for finding and reproducing flaky tests

NAMAZU: PROGRAMMABLE FUZZY SCHEDULER

https://github.com/osrg/namazu

NOTE: Namazu was formerly named "Earthquake"

Namazu: programmable fuzzy scheduler
https://github.com/osrg/namazu

• Increases non-determinism for finding and reproducing flaky tests
• [Figure: events from the filesystem, packets, Java threads, Linux threads, and Go (planned) are fed into a fuzzed (randomized) schedule]
• 鯰 (namazu) means "catfish" in Japanese

Namazu: programmable fuzzy scheduler
https://github.com/osrg/namazu

Namazu uses non-invasive techniques:
• Filesystem: FUSE
• Packet: Netfilter, OpenFlow
• Java: Byteman, AspectJ
• Go [planned]: AspectGo [wip]
  https://github.com/AkihiroSuda/golang-exp-aspectgo
• Linux threads: sched_setattr(2)

• It can be easily applied to any environment
• It can avoid false positives

Namazu targets

• xUnit tests
  • 😃 Easy to get started; just run `mvn`
  • 😃 Can reproduce test failures observed in CI
  • 😞 Limited testable scope
• Integration tests on a distributed cluster
  • 😃 Can test everything
  • 😞 Need to write a script to set up the cluster
    • But Docker helps us a lot!

Namazu targets

We support both scenarios:
• Single-node mode (for xUnit tests): run `mvn test` directly
• Distributed mode (for integration tests): nodes connect to an orchestrator via RPC

NAMAZU + XUNIT TESTS

$ mvn test

Namazu + xUnit tests

• Namazu is a comprehensive framework...
• Quick start: "renice" threads for xUnit tests
  • POSIX.1 requires that threads share a single nice (priority) value, but the actual Linux implementation (NPTL) does not
  • Not always effective, but it's generic and easy to get started

Namazu + xUnit tests

    $ cd hadoop; ./start-build-env.sh
    [container]$ mvn test -Dtest=TestFoo#testBar

    $ PID=$(docker inspect $(docker ps -q -f ancestor=hadoop-build-ubuntu) | jq .[0].State.Pid)
    $ sudo nmz inspectors proc -pid $PID

• Namazu periodically sets random nice values for all the child processes and threads under $PID
• It also utilizes non-default kernel schedulers (e.g. SCHED_BATCH)

Namazu + xUnit tests: reproducibility

Testcase                                         Traditional   Namazu
YARN-4548        RM/TestCapacityScheduler        11%           82%
YARN-4556        RM/TestFifoScheduler            2%            44%
ZOOKEEPER-2137   ReconfigTest                    2%            16%
YARN-4168        NM/TestLogAggregationService    1%            8%
YARN-1978        NM/TestLogAggregationService    0%            4%
YARN-4543        NM/TestNodeStatusUpdater        0%            1%

• More information: osrg/namazu#125

Namazu + xUnit tests: reproducibility

Testcase                                         Traditional   Namazu
ZOOKEEPER-2080   ReconfigRecoveryTest            14.0%         61.9%

• "Renicing" is not always effective...
• But even when renicing is ineffective, you can sometimes reproduce the flaky test by injecting delays or reordering packets:

    $ sudo iptables ... -j NFQUEUE --queue-num 42
    $ sudo nmz inspectors ethernet -nfq-number 42

NAMAZU + INTEGRATION TESTS

Namazu + integration tests

• ZooKeeper: distributed coordination service
  • Used in Hadoop, Spark, Mesos, Kafka...
• ZooKeeper 3.5 (alpha) introduced dynamic reconfiguration
• We performed an integration test to evaluate the reliability of the reconfiguration
• We found a flaky bug!

Namazu + integration tests

• We permuted certain Ethernet packets in random order using Namazu
  • TCP retransmissions are eliminated to reduce the possible state space
• Setup: ZooKeeper cluster on Open vSwitch + Ryu SDN Framework + Namazu

Found ZOOKEEPER-2212

• Bug: a new node cannot participate in the ZK cluster properly, and cannot become a leader of the ZK cluster itself (more technically, it keeps being an "observer")
• Cause: a distributed race (ZAB packet vs FLE packet)
  • ZAB: atomic broadcast protocol for data [2888/tcp]
  • FLE: leader election protocol for the ZK cluster itself [3888/tcp]
  • They use different TCP connections between the leader and the new node, so the packet order is non-deterministic

Found ZOOKEEPER-2212

[Screenshot of the JIRA ticket; data captured on 22/01/2016]

Found ZOOKEEPER-2212

• Expected: the ZK cluster works even when ⌊N/2⌋ nodes crash
• Reality: a single node failure can terminate the 3-node ensemble
  • The new node is not participating properly (it keeps being an "observer")

How hard is it to reproduce?

• Reproducibility: 0.0% → 21.8% (tested 1,000 times)
• We could not reproduce the bug even after 5,000 runs of traditional testing (60 hours!)
• It is even reproducible by "renicing" threads, but the reproducibility is just 0.7%

Why can we hit the bug?

We define the distributed execution pattern based on code coverage:

    P = ( p_{1,1} ... p_{1,N} )
        (  ...    ...   ...   )
        ( p_{L,1} ... p_{L,N} )

• L: LOC
• N: number of nodes (== 3 in this case)
• p_{i,j}: 1 if node j covers the branch in line i, otherwise 0
• We used JaCoCo, a Java code coverage library (patch: ZOOKEEPER-2266)

Namazu achieves faster pattern growth. That's why we can hit the bug.
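The pattern-growth metric can be sketched as follows (an illustrative sketch only; the class and method names are assumptions, not Namazu's actual bookkeeping): each run is reduced to an L×N boolean coverage matrix, and "pattern growth" is the count of distinct matrices observed so far.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CoveragePatterns {
    // Tracks distinct execution patterns, where each pattern is an L x N
    // boolean matrix: pattern[i][j] == true iff node j covered line i.
    private final Set<String> seen = new HashSet<>();

    // Returns true if this run exhibited a previously unseen pattern.
    boolean record(boolean[][] pattern) {
        return seen.add(Arrays.deepToString(pattern));
    }

    int uniquePatterns() {
        return seen.size();
    }

    public static void main(String[] args) {
        CoveragePatterns cp = new CoveragePatterns();
        boolean[][] runA = {{true, false, true}, {false, false, true}};
        boolean[][] runB = {{true, true, true}, {false, false, true}};
        System.out.println(cp.record(runA)); // new pattern
        System.out.println(cp.record(runB)); // new pattern
        System.out.println(cp.record(runA)); // already seen
        System.out.println(cp.uniquePatterns());
    }
}
```

A fuzzed scheduler drives `uniquePatterns()` up faster per run than repeated deterministic runs, which is the sense in which Namazu "grows" patterns faster.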

HOW TO USE NAMAZU?

How to use Namazu?

Easy to install:

    $ sudo apt-get install lib{netfilter-queue,zmq3}-dev
    $ go get github.com/osrg/namazu/nmz

Easy to get started:
• Provides a Docker-like CLI
• No code instrumentation needed
• No configuration needed (default: just renice threads)

    $ sudo nmz container run -it -v /foo:/foo ubuntu
    [container]$ cd /foo && mvn test

How to use Namazu?

For threads ("renicing"):

    $ sudo nmz inspectors proc -pid $TARGET_PID

For the filesystem:

    $ sudo nmz inspectors fs -mount-point /nmzfs

For network packets:

    $ sudo iptables ... -j NFQUEUE --queue-num 42
    $ sudo nmz inspectors ethernet -nfq-number 42

Need distributed mode (for integration testing)?
Just add `--orchestrator-url http://foobar:10080/api/v3` to the CLI.

Namazu API (Go)

    type ExplorePolicy interface {
        QueueEvent(Event)
        ActionChan() chan Action
    }

    func (p *MyPolicy) QueueEvent(event Event) {
        action := event.DefaultAction()
        p.timeBoundedQ.Enqueue(action,
            10*Millisecond, 30*Millisecond)
    }

    func (p *MyPolicy) ActionChan() chan Action {
        return p.timeBoundedQ.DequeueChan
    }

• An event can contain Ethernet packet bytes
• The action is randomly fired in [10 ms, 30 ms]
• You can also inject fault actions here
• Namazu defines a REST API, so you can also use other languages

API use case: found YARN-4301

• We found a bug: YARN cannot detect disk failures in cases where mkdir()/rmdir() blocks
  • YARN handles the case where mkdir() returns EIO explicitly, but not the case where mkdir() blocks
• We noticed that the bug could occur theoretically while reading the code, and actually produced the bug using Namazu
• When the fault should be injected is known in advance, so we manually wrote a concrete scenario using the Namazu API
• Much more realistic than JUnit + mocking

API use case: found YARN-4301

    func (p *MyPolicy) signalHandler() {
        signal.Notify(sigChan, syscall.SIGUSR1)
        for {
            <-sigChan
            p.sleep = 10 * time.Minute // fault: blocks for 10 minutes
        }
    }
    go p.signalHandler()
    func (p *MyPolicy) QueueEvent(event Event) { .. }
    func (p *MyPolicy) ActionChan() chan Action { .. }

    $ go run mypolicy.go inspectors fs -mount-point /nmzfs

• Set "yarn.nodemanager.local-dirs" to "/nmzfs/nm-local-dir", and send SIGUSR1 to Namazu when you (and YARN) are ready
• Interactive testing is often easier than writing a JUnit testcase
• We use SIGUSR1 here, but it would also be interesting to implement a human-friendly CLI or GUI for interactive testing

Another API use case: "semi"-deterministic replay

• If you have knowledge of the protocol, you can compute a hash for each packet
  • Note that you have to exclude time-dependent and random bytes when you hash the packet
• Using the hash and the Namazu API, you can "semi"-deterministically replay the scenario
  • Not fully deterministic; it just makes a best effort
• Record-less! You only need to remember the "seed" for replaying
• PoC: ZOOKEEPER-2212: up to 65% reproducibility
• More information: osrg/namazu#137
• See also (for Go): https://github.com/AkihiroSuda/go-replay
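The seed-based idea can be sketched like this (a hypothetical illustration, not Namazu's actual code): derive each scheduling decision deterministically from a global seed combined with a content hash of the packet's stable bytes, so rerunning with the same seed reproduces the same per-packet delays without recording anything.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

public class SemiDeterministicReplay {
    private final long seed;

    SemiDeterministicReplay(long seed) {
        this.seed = seed;
    }

    // Hash only the stable part of the packet (time-dependent and random
    // bytes must already be masked out by the caller).
    long delayMillisFor(byte[] stablePacketBytes) {
        long h = 1125899906842597L; // simple polynomial hash
        for (byte b : stablePacketBytes) {
            h = 31 * h + b;
        }
        // The decision depends only on (seed, packet hash), so replaying
        // with the same seed yields the same delay for the same packet.
        Random r = new Random(seed ^ h);
        return 10 + r.nextInt(21); // delay in [10, 30] ms
    }

    public static void main(String[] args) {
        byte[] pkt = "FLE:proposal:node2".getBytes(StandardCharsets.UTF_8);
        SemiDeterministicReplay run1 = new SemiDeterministicReplay(42L);
        SemiDeterministicReplay run2 = new SemiDeterministicReplay(42L);
        // Same seed + same packet => same decision (replay).
        System.out.println(run1.delayMillisFor(pkt) == run2.delayMillisFor(pkt));
    }
}
```

The replay is only "semi"-deterministic because anything not captured by the hash (kernel scheduling, retransmissions) can still diverge.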

SIMILAR GREAT TOOLS

Similar great tool: Jepsen

• Network partitioner + linearizability tester
• Famous for the "Call Me Maybe" blog: http://jepsen.io/
  • "Call Me Maybe" by Carly Rae Jepsen (Vevo): https://www.youtube.com/watch?v=fWNaR-rxAic
• Randomly injects network partitions using iptables
• "Linearizability" ∈ "strong consistency"
• Integration tests on a flaky network, rather than flaky xUnit tests

Similar great tool: Jepsen

• Has been used to test several Apache projects:
  • Cassandra: 9851, 10001, 10068, 10231, 10413, 10674
    • http://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
  • HBase
  • Kafka
  • Solr: 6530, 6583, 6610
    • http://lucidworks.com/blog/2014/12/10/call-maybe-solrcloud-jepsen-flaky-networks
  • ZooKeeper

Namazu + Jepsen?

• Jepsen: causes network partitions, tests linearizability
• Namazu: increases non-determinism, injects filesystem faults
• Namazu is much more generalized
  • The bugs we found/reproduced are basically beyond the scope of Jepsen (threads, disks...)
• Namazu can also be combined with Jepsen! That will be our next work...

Similar great tool: CharybdeFS

• Makes the filesystem flaky using FUSE
• Used in testing ScyllaDB (an Apache Cassandra clone)
  • https://github.com/scylladb/charybdefs
• Similar to Namazu FS
  • Both support an API
• Also similar to PetardFS (not active since 2007)
• CharybdeFS can also be combined with Namazu
  • CharybdeFS is specialized in FS; Namazu is much more comprehensive

Similar great tool: DEMi (appeared in NSDI'16)

https://github.com/NetSys/demi

• Found some akka-raft bugs and reproduced a few Spark bugs
• A challenge is reducing false positives related to instrumentation
• DEMi and Namazu are complementary to each other
  • DEMi is powerful, but has some limitations
  • Namazu is comprehensive and easy to get started with

                           Namazu                           DEMi
Target                     Generic (network, filesystem,    Akka
                           threads...)
Getting started            Easy                             Need to write AspectJ code
Deterministic replay?      No                               Yes
Bug cause minimization?    No                               Yes

SO... HOW CAN WE FIX FLAKY TESTS?

How can we fix flaky tests?

• Namazu finds/reproduces flaky tests, but it doesn't automatically fix them 😞
• Basic approach for async-related flakiness: adjust the values for sleep() and retries in the test code

    invokeAsyncOperation();
    // some tests lack even this sleep
    sleep(certainHardcodedTimeout);
    assertTrue(checkSomethingGoodHasHappened());

How can we fix flaky tests?

• Suggestion: the timeout (& retries) should be a configurable parameter rather than a hard-coded value

Timeout value   Cost (time)   Risk (timeout)   Appropriate for
Long            High          Low              Slow machine (e.g. CI); conservative person
Short           Low           High             Fast machine; risk-appetite person
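A minimal sketch of this suggestion (the property name `test.async.timeout.ms` and the default are illustrative, not from any Hadoop test): read the timeout from a system property with a short default, so slow CI machines can pass a longer value via a `-D` flag while fast local runs stay cheap.

```java
public class ConfigurableTimeout {
    // Read the timeout from a system property so CI can run with a longer
    // value (e.g. -Dtest.async.timeout.ms=30000) while fast local machines
    // keep a short default.
    static long timeoutMillis() {
        return Long.getLong("test.async.timeout.ms", 5_000L);
    }

    public static void main(String[] args) {
        System.out.println(timeoutMillis());            // default
        System.setProperty("test.async.timeout.ms", "30000");
        System.out.println(timeoutMillis());            // overridden, as CI would
    }
}
```

The test body then calls `sleep(timeoutMillis())` (or, better, polls up to that deadline) instead of a hard-coded constant.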

CONCLUSION

Conclusion

• Apache software is well tested
  • But it is flaky
• Let's improve it
  • Improve asynchronous code
  • Repeat tests
• Our tool can control non-determinism so as to reproduce flaky tests
  https://github.com/osrg/namazu