
Distributed Systems at Scale: Reducing the Fail


Page 1: Distributed Systems at Scale:  Reducing the Fail

DISTRIBUTED SYSTEMS AT SCALE: REDUCING THE FAIL

Kim Moir, Mozilla, @kmoir. URES, November 13, 2015

Page 2: Distributed Systems at Scale:  Reducing the Fail

Water pipe. I often think of our continuous integration system as analogous to a municipal water system. Some days, a sewage system. We control the infrastructure that we provide, but it is constrained: if someone overloads the system with inputs, we will have problems.

Picture by wili_hybrid - Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)

https://www.flickr.com/photos/wili/3958348428/sizes/l

Page 3: Distributed Systems at Scale:  Reducing the Fail

I recently read a book called “Thinking in Systems”. It’s a generalized look at complex systems and how they work. It’s not specific to computer science.

Picture: https://upload.wikimedia.org/wikipedia/commons/b/bf/Slinky_rainbow.jpg Creative commons 2.0

Page 4: Distributed Systems at Scale:  Reducing the Fail

–Donella H. Meadows, Thinking in Systems

“A system is a set of things…interconnected in such a way that they produce their own pattern of behaviour over time.”

This system may be impacted by outside forces. The response to these forces is characteristic of the system itself. That response is seldom simple in the real world. The same inputs would produce a different behaviour in a different system.

Page 5: Distributed Systems at Scale:  Reducing the Fail

WHAT DO WE OPTIMIZE FOR?

One of the questions the book asks is “What are we optimizing for?”

Page 6: Distributed Systems at Scale:  Reducing the Fail

WHAT DO WE OPTIMIZE FOR?

• Budget

• Shipping

• End to end time (Developer productivity)

How much budget do we have available to spend on our continuous integration farm?

Can we ship a release to fix a 0-day security issue in less than 24 hours?

Are developers getting their test results quickly enough that they remain productive?

Page 7: Distributed Systems at Scale:  Reducing the Fail

WHAT ARE THE CONSTRAINTS?

Another question the book asks is: what are the constraints on the system?

Page 8: Distributed Systems at Scale:  Reducing the Fail

WHAT ARE THE CONSTRAINTS?

• Budget

• Time

Budget for in-house hardware pools and the AWS bill. Time for us to optimize the system. Time for developers to wait for their results.

I’m going to talk now about the pain points in this large distributed system, and how it can fail in spectacular fashion.

Page 9: Distributed Systems at Scale:  Reducing the Fail

1. UNPREDICTABLE INPUT


Picture is a graph of monthly branch load. We have daily spikes as Mozillians across the world come online and start pushing code. The troughs are weekends.

It is a complex system, and release engineering does not control all the inputs. For example: someone increases test load by 50% on one platform without increasing the hardware pool by a corresponding amount.

Is someone abusing the try server? Pushes are not coalesced on try.

Occasionally someone will (re)trigger a large number of jobs on a single changeset. People who do this often and with good reason usually do so on weekends when there is less contention for the infrastructure. If not, the pending counts can get very high, especially for in-house pools where we can’t burst capacity.

Solution: implement smarter (semi-automatic) test selection. See cmanchester's work: http://chmanchester.github.io/blog/2015/08/06/defining-semi-automatic-test-prioritization/ Bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1184405
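As a rough illustration of the idea (not cmanchester's actual implementation): a scheduler could rank test suites by how often they have historically failed when the files in a push were touched, and run only the top of the list by default. The history mapping, file paths, and function names below are hypothetical.

```python
from collections import defaultdict

def prioritize_suites(changed_files, history, all_suites, budget):
    """Return up to `budget` suites, most historically relevant first."""
    scores = defaultdict(int)
    for path in changed_files:
        for suite in all_suites:
            # history maps (source path, suite) -> past failure count
            scores[suite] += history.get((path, suite), 0)
    ranked = sorted(all_suites, key=lambda s: scores[s], reverse=True)
    return ranked[:budget]

# Example: a push touching layout code runs reftests first.
history = {("layout/base/nsFrame.cpp", "reftest"): 12,
           ("layout/base/nsFrame.cpp", "xpcshell"): 1}
print(prioritize_suites(["layout/base/nsFrame.cpp"], history,
                        ["reftest", "xpcshell", "mochitest"], budget=2))
```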

Page 10: Distributed Systems at Scale:  Reducing the Fail

2. NO CANARY TESTING

Every night, we generate new AMIs from our Puppet configs for the Amazon instance types we use. These images are used to instantiate new instances on Amazon. We have scripts to recycle the instances with old AMIs after the new ones become available. Which is great; however, we don’t have any canary testing for the new AMIs.

So we can have something happen like this:
1) Someone releases a Puppet patch that passes tests.
2) However, the AMIs it generates have a permission issue.
3) This prevents all new instances from starting the process that connects the test instances to their server.
4) So we have thousands of instances up, burning money, that aren’t doing anything.
5) It looks like there are plenty of machines, but pending counts continue to rise.

Failure: All the AWS images are coming up but no builds are running.

Solution: We need to implement canary testing for AMIs. (Implement a methodology for rolling out new AMIs in a tiered, trackable fashion, and add automated testing and tiered roll-outs to golden AMI generation: https://bugzilla.mozilla.org/show_bug.cgi?id=1146369)
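A minimal sketch of what a tiered rollout could look like, assuming hypothetical helpers launch_instances, fraction_taking_jobs, and recycle_pool (this is not our existing tooling, and the thresholds are illustrative):

```python
import time

CANARY_COUNT = 5
CANARY_WAIT = 15 * 60      # seconds to let canaries pick up work
REQUIRED_HEALTHY = 0.8     # fraction of canaries that must take jobs

def rollout(new_ami_id, launch_instances, fraction_taking_jobs, recycle_pool):
    """Boot a small canary batch from the new AMI; only recycle the rest
    of the pool if the canaries actually start taking jobs."""
    canaries = launch_instances(new_ami_id, CANARY_COUNT)
    time.sleep(CANARY_WAIT)
    healthy = fraction_taking_jobs(canaries)
    if healthy < REQUIRED_HEALTHY:
        # e.g. the permission bug above: instances boot but never connect
        raise RuntimeError("AMI %s failed canary: only %.0f%% taking jobs"
                           % (new_ami_id, healthy * 100))
    recycle_pool(new_ami_id)   # safe to replace the old instances now
```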

Picture by ross_strachan https://www.flickr.com/photos/ross_strachan/6176512880/sizes/l Creative Commons 2.0

Page 11: Distributed Systems at Scale:  Reducing the Fail

3. NOT ALL AUTO SCALING


Several works in progress on this front. Our infrastructure does not autoscale for some platforms, for example Mac tests. (We can’t run tests on Macs in a virtualized environment on non-Mac hardware due to Apple licensing restrictions.) So we have racks of Mac minis. AWS doesn’t offer the licenses for the Windows versions we need to test. Also, we can’t run performance tests on cloud instances because the results are not consistent.

We cannot easily move test machines between pools dynamically. The importance of a platform shifts over time, and we need to reallocate machines between testing pools; this is largely a manual process.

We decided to focus on Linux 64 perf testing, freeing up Linux 32 machines for use in Windows testing. This required code changes, imaging new machines, and fixing bugs in the imaging process.

Solutions: Run Windows builds in AWS (now in production on limited branches). Cross-compile Mac builds on Linux. Todo: add bug references. Still, Mac and Windows tests are a problem because we either need in-house capacity for peak load, which is expensive, or we don’t buy for peak load and deal with a backlog of pending jobs.

Picture"DB-Laaeks25804366678-7". Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DB-Laaeks25804366678-7.JPG#/media/File:DB-

Page 12: Distributed Systems at Scale:  Reducing the Fail

4. SELF-INFLICTED DENIAL OF SERVICE

Engineering for the long game: Managing complexity in distributed systems. Astrid Atkinson https://www.youtube.com/watch?v=p0jGmgIrf_M

18:47 “Google itself is the biggest source of denial of service attacks”

At Mozilla, we are no different.

We retry jobs when they fail for infrastructure reasons. That is okay because perhaps a machine is wonky and needs to be removed from the pool, and the next time the job runs, it will land on a machine that is in a clean state.

Human error: what has changed? Permissions, network, and DNS. We run automatic tests on all commits applied to our production infrastructure. Example: IT redirected a server name to a new host where we didn’t have SSH keys deployed. We understood the change as a redirect to a new CNAME, not a new host. Jobs spiked because they kept retrying while failing to fetch a zip from that host.

Solution: better monitoring of retries and better communication; check for spikes in retried jobs vs. regular jobs and alert.
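For example, a check along these lines could alert on a retry spike. The baseline ratio and thresholds below are illustrative, not our actual monitoring configuration:

```python
def check_retry_spike(retried_jobs, completed_jobs,
                      normal_ratio=0.02, factor=5):
    """Return an alert string, or None if the retry rate looks normal."""
    if completed_jobs == 0:
        return None
    ratio = retried_jobs / float(completed_jobs)
    if ratio > normal_ratio * factor:
        # a broad retry spike usually means an infra change broke something
        return ("retry spike: %.1f%% of jobs retried (baseline %.1f%%)"
                % (ratio * 100, normal_ratio * 100))
    return None

print(check_retry_spike(retried_jobs=300, completed_jobs=2000))
```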

Picture http://www.flickr.com/photos/ervins_strauhmanis/9554405492/sizes/l/

Page 13: Distributed Systems at Scale:  Reducing the Fail

5. TOO MUCH TEST LOAD

We run builds on commit. There is some coalescing on branches. Since we build and test for many platforms, we can run up to 500 jobs (all platforms, excluding Talos) on a single commit.

Too many jobs for our infrastructure to handle leads to high pending counts. How can we intelligently shed load?

Solution: Do we really need to run every test on every commit, given that many of them don’t historically reveal problems? We have a project called SETA which analyzes historical test data, and we have implemented changes to our scheduler to accommodate this. Basically, we can reduce the frequency of specified test runs on a per-platform, per-branch basis. This allows us to shed test load and increase the throughput of the system.
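A rough sketch of the scheduling idea, with made-up job names and thresholds (SETA's real analysis of historical data is more sophisticated): jobs that rarely catch anything unique run only every Nth push per platform and branch.

```python
def jobs_to_schedule(push_number, all_jobs, low_value_jobs, every_n=7):
    """High-value jobs run on every push; low-value jobs run every Nth push."""
    scheduled = []
    for job in all_jobs:
        if job in low_value_jobs and push_number % every_n != 0:
            continue  # skip this push; the job will run on a later one
        scheduled.append(job)
    return scheduled

low_value = {"linux64-mochitest-3", "win7-reftest-2"}
print(jobs_to_schedule(43, ["linux64-mochitest-3", "linux64-xpcshell",
                            "win7-reftest-2"], low_value))
```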

Picture: https://www.flickr.com/photos/encouragement/14759554777/

Page 14: Distributed Systems at Scale:  Reducing the Fail

6. SYSTEM RESOURCE LEAKS

I often think that managing a large scale distributed system is like being a water or sanitation engineer for a city. Ironically, LinkedIn thinks so too and advertises these jobs to me. Where are the leaks happening? Where they would look for wasted water, we look for wasted computing resources.

We recently had a problem where our Windows test pending counts spiked quite drastically. We initially thought this was just due to some new test jobs being added to the platform in parallel while the old corresponding tests were not disabled. In fact, the major root cause was that the additional tests caused additional overhead on the server responsible for managing jobs on the test machines. Basically, the time between a test machine finishing a task and the server responding was getting very long; multiplied by hundreds of machines and thousands of jobs, that leads to a long backlog. Solution: this issue was resolved by adding additional servers to the pool that services these test machines, upgrading the RAM on each of them, and increasing the interval at which they are restarted.
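A back-of-envelope illustration of why that overhead matters, with invented numbers and assuming the server handles job completions roughly serially:

```python
machines = 300
jobs_per_machine_per_hour = 3
overhead_per_job_s = 30            # server response time after a job ends

jobs_per_hour = machines * jobs_per_machine_per_hour             # 900
server_overhead_per_hour_s = jobs_per_hour * overhead_per_job_s  # 27000 s
print("server needs %.1f hours of work per wall-clock hour"
      % (server_overhead_per_hour_s / 3600.0))  # > 1.0 means backlog grows
```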

We run many tests in parallel to reduce the end-to-end time that a build and its associated tests take. It is hard to balance startup time against the time spent running actual tests: splitting tests into more parallel chunks means more time spent on per-chunk startup overhead.
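A minimal sketch of chunking by run time (not our actual chunking code): greedily assign the longest tests to the currently lightest chunk, with a fixed per-chunk startup cost that shows why more chunks is not always faster. The test names and times are invented.

```python
import heapq

def chunk_by_runtime(test_times, n_chunks, startup_cost=120):
    """test_times: {test_name: seconds}. Returns (chunks, slowest_chunk_time)."""
    heap = [(startup_cost, i) for i in range(n_chunks)]   # (load, chunk id)
    heapq.heapify(heap)
    chunks = [[] for _ in range(n_chunks)]
    # place longest tests first, always into the lightest chunk so far
    for name, secs in sorted(test_times.items(), key=lambda kv: -kv[1]):
        load, idx = heapq.heappop(heap)
        chunks[idx].append(name)
        heapq.heappush(heap, (load + secs, idx))
    return chunks, max(load for load, _ in heap)

times = {"a": 300, "b": 280, "c": 60, "d": 50, "e": 40}
print(chunk_by_runtime(times, n_chunks=2))
```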

Solution: Implement more monitoring each time a system resource leak causes an outage

Test chunking: chunking by run time. Picture by Korona Lacasse - Creative Commons Attribution 2.0 Generic https://www.flickr.com/photos/korona4reel/14107877324/sizes/l

Page 15: Distributed Systems at Scale:  Reducing the Fail

Another system resource leak was plugged by the use of a tool called runner. Runner is a project that manages starting tasks in a defined order: https://github.com/mozilla/build-runner. Basically, it ensures that the machines are in a sane state to run a job. If they are in a good state, we don’t reboot them as often, which increases the overall throughput of the system.
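In spirit, runner does something like the following loop. The task list and checks here are illustrative, not the real build-runner configuration; see the repository above for the actual implementation.

```python
import shutil
import subprocess

# Illustrative pre-job health tasks, run in a defined order.
def check_disk_space(min_gb=10):
    return shutil.disk_usage("/").free / 1e9 >= min_gb

def cleanup_old_builds():
    return subprocess.call(["find", "/tmp", "-maxdepth", "1",
                            "-name", "build-*", "-mtime", "+2",
                            "-delete"]) == 0

TASKS = [check_disk_space, cleanup_old_builds]

def machine_is_sane(tasks=TASKS):
    """Run each task in order; stop at the first failure."""
    return all(task() for task in tasks)

if machine_is_sane():
    print("take the next job without a reboot, keeping throughput up")
else:
    print("reboot the machine or pull it from the pool for buildduty")
```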

Page 16: Distributed Systems at Scale:  Reducing the Fail

7. TIME TO RECOVER FROM FAILURE

Our CI system is not a bank’s transaction system or a large commerce website, but it still has most of the characteristics of a highly available system. We have some issues regarding failover, and we still have single points of failure in our system.

A few weeks ago, I watched a talk by Paul Hinze of HashiCorp on the primitives of high availability (at 42:20): https://speakerdeck.com/phinze/smoke-and-mirrors-the-primitives-of-high-availability

He states that it is inevitable that the system will fail, but the real measure is how quickly we can recover from it.

In terms of managing failure, we have managed to decouple some components in the way that we manage our code repositories. If, for instance, bad code landed on mozilla-inbound, which causes all the tests to fail, our code sheriffs can close this tree, and leave other repositories open.

However, we still have many single points of failure, for instance the hardware in our data centre and its associated network. Given that jobs automatically retry if they fail for infrastructure reasons, we can bring the system back up without a lot of intervention.

Solution: Distributed failure - branching model and closing trees

Picture by Mike Green - Creative Commons 2.0 Attribution-NonCommercial 2.0 Generic https://www.flickr.com/photos/30751204@N06/7328288188/sizes/l

Page 17: Distributed Systems at Scale:  Reducing the Fail

8. MONITORING

• Nagios, which alerts to IRC
• Papertrail
• Email alerts
• Dashboard
• Treeherder
• New Relic (doesn’t really apply to releng)

Picture is Mystery by ©Stuart Richards, Creative Commons by-nc-sa 2.0 https://www.flickr.com/photos/left-hand/6883500405/

Page 18: Distributed Systems at Scale:  Reducing the Fail

We have a dedicated channel for alerts with the state of our build farm. Nagios alerts send a message to the #buildduty channel. Due to the large number of devices on our build farm, most of these alerts are threshold alerts. We don’t care if a single machine goes down; it will be automatically rebooted and, if this doesn’t work, a bug will be opened for a person to look at it. However, we do care if 200 of them suddenly stop taking jobs. You can see from this page that we have threshold alerts for the number of pending jobs. If there is a sustained spike, we need to look at it in further detail.
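A simplified version of such a threshold check might look like this; the warning and critical numbers and the sustained-window logic are illustrative, not our actual Nagios configuration:

```python
def pending_alert(samples, warn=2000, crit=4000, sustained=3):
    """samples: most recent pending-count readings for a pool, newest last."""
    if len(samples) >= sustained and all(s > crit for s in samples[-sustained:]):
        # short spikes are normal; only a sustained backlog pages someone
        return "CRITICAL: sustained pending backlog %s" % samples[-sustained:]
    if samples and samples[-1] > warn:
        return "WARNING: pending=%d" % samples[-1]
    return "OK"

print(pending_alert([1500, 4200, 4500, 4800]))   # CRITICAL
```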

We also have alerts for things like checking that the golden images we create each night for our Amazon instances are actually completing, and that the process that kills old Amazon instances when their capacity is not needed is indeed running. We use Papertrail as a log aggregator that allows us to quickly search our logs for issues.

Page 19: Distributed Systems at Scale:  Reducing the Fail

We use graphite for analytics that allow us to look at long term trends and spikes. For instance, this graph looks at our overall infrastructure time, green is EC2, blue is in-house hardware.

Problem: All of this data is sometimes overwhelming. Every time we have an outage around something we haven’t monitored previously, we add another alert or additional monitoring. I don’t really know the solution for dealing with the flood of data, other than to alert only on important things and aggregate alerts by machine class.

Page 20: Distributed Systems at Scale:  Reducing the Fail

9. DUPLICATE JOBS

Duplicate bits, wasted time, resources

We currently build a release twice - once in CI, once as a release job. This is inefficient and makes releng a bottleneck to getting a release out the door. To fix this - implement release promotion! https://bugzilla.mozilla.org/show_bug.cgi?id=1118794. Same thing applies to nightly builds

Picture: Creative Commons https://upload.wikimedia.org/wikipedia/commons/c/c6/DNA_double_helix_45.PNG

By Jerome Walker, Dennis Myts (Own work) [Public domain], via Wikimedia Commons

Page 21: Distributed Systems at Scale:  Reducing the Fail

10. SCHEDULING

Adding a new platform or new suites of tests currently requires release engineering intervention. We want to make this more self-serve, and allow developers to add new tests and platforms.

Solution: We are currently migrating to a new system that manages task queuing, scheduling, execution, and provisioning of resources for our CI system. This system is called TaskCluster. It will allow developers to schedule new tests in-tree, and it will make standing up new platforms much easier. It’s a microservices architecture; jobs run in Docker images, which allows developers to have the same environment on their desktop as the CI system runs.
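To give a flavour of what in-tree scheduling looks like: a task is just data that describes the Docker image and command to run. The field names and values below are simplified and illustrative rather than the exact TaskCluster schema.

```python
# Illustrative, simplified task description (not the exact TaskCluster schema).
task = {
    "provisionerId": "aws-provisioner",
    "workerType": "desktop-test",            # illustrative worker type
    "payload": {
        "image": "taskcluster/tester:0.2.7", # same image a dev can run locally
        "command": ["bash", "-c", "./mach test ..."],
        "maxRunTime": 3600,                  # seconds
    },
    "metadata": {
        "name": "linux64 mochitest",
        "description": "example test task defined in the tree",
    },
}
```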

http://docs.taskcluster.net/

Picture by hehaden - Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0) https://www.flickr.com/photos/hellie55/5083003751/sizes/l

Page 22: Distributed Systems at Scale:  Reducing the Fail

CONCLUSION

• Caring for a large distributed system is like taking care of a city’s water and sewage

• Increase throughput by constraining inputs based on the outputs you want to optimize

• May need to migrate to something new while keeping the existing system working

You need to identify system leaks and implement monitoring

For instance, in our case, we reduce test jobs to optimize end to end time

Buildbot->taskcluster

Page 23: Distributed Systems at Scale:  Reducing the Fail

FURTHER READING

• All Your Base 2014: Laura Thomson, Director of Engineering, Cloud Services Engineering and Operations, Mozilla – Many moving parts: monitoring complex systems: https://vimeo.com/album/3108317/video/110088288

• Velocity Santa Clara 2015: Astrid Atkinson, Director Software Engineering, Google - Engineering for the long game: Managing complexity in distributed systems. https://www.youtube.com/watch?v=p0jGmgIrf_M

• Mountain West Ruby Conference, Paul Hinze, HashiCorp - Primitives of High Availability: https://speakerdeck.com/phinze/smoke-and-mirrors-the-primitives-of-high-availability

• Strange Loop 2015, Camille Fournier - Hopelessness and Confidence in Distributed Systems Design http://www.slideshare.net/CamilleFournier1/hopelessness-and-confidence-in-distributed-systems-design