Antifragility and testing distributed systems
Approaches for testing and improving resiliency
Failure
It’s inevitable
Microservice Architectures
■ Bounded contexts
■ Deterministic in nature
■ Simple behaviour
■ Independently testable (e.g. Pact)
Distributed Architectures
Conversely…
■ Unbounded context
■ Non-determinism
■ Exhibit chaotic behaviour
■ Emergent behaviour
■ Complex testing
Problems with traditional approaches
■ Integration test hell
■ Need to get by without E2E environments
■ Learnings are non-representative anyway
■ Slower
■ Costly (effort + $$)
Alternative?
Create an isolated, simulated environment
■ Run locally or on a CI environment
■ Fast - no need to set up complex test data, scenarios etc.
■ Enables single-variable hypothesis testing
■ Automatable
Lab Testing w/ Docker Compose
Hypothesis testing simulated environments
Docker Compose
■ Docker container orchestration tool
■ Run locally or remotely
■ Works across platforms (Windows, Mac, *nix)
■ Easy to use
Nginx
Let’s take a practical, real-world example: Nginx as an API Proxy.
Simulating failure with Muxy
“A tool to help simulate distributed systems failures”
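The idea of simulating failure can be illustrated without any network at all. The sketch below is not Muxy's API or configuration; it is a hypothetical in-process stand-in (the names `with_faults` and `InjectedFault` are assumptions made for illustration) that wraps a call and injects latency and errors, which is the kind of disturbance a fault-injecting proxy places between a client and its upstream.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised when the wrapper decides to simulate a failed call."""


def with_faults(
    call: Callable[[], T],
    delay_s: float = 0.0,
    error_rate: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[], T]:
    """Return a version of `call` that is slower and sometimes fails.

    delay_s    -- simulated latency added before every call
    error_rate -- probability in [0, 1] that a call raises InjectedFault
    rng, sleep -- injectable so that experiments can be made repeatable
    """
    rng = rng or random.Random()

    def wrapped() -> T:
        sleep(delay_s)  # simulated network latency
        if rng.random() < error_rate:
            raise InjectedFault("simulated upstream failure")
        return call()

    return wrapped
```

Changing one parameter at a time (only `delay_s`, or only `error_rate`) keeps each experiment a single-variable test.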
Hypothesis testing
Our job is to hypothesise, test, learn, change, and repeat
Nginx Testing
H0 = Introducing network latency does not cause errors
Test setup:
● Nginx running locally, with Production configuration
● DNSMasq used to resolve production urls to other Docker containers
● Muxy container setup, proxying the API
● A test harness to hit the API via Nginx n times, expecting 0 failures
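The test harness in the setup above might look roughly like the following. This is a minimal sketch, not the harness used in the demo: the function names and the URL in `hit_api` are hypothetical, and any exception (timeout, connection error, HTTP 4xx/5xx) is counted as a failure.

```python
import urllib.request
from typing import Callable


def count_failures(call: Callable[[], object], n: int) -> int:
    """Invoke `call` n times and return how many invocations raised."""
    failures = 0
    for _ in range(n):
        try:
            call()
        except Exception:
            failures += 1
    return failures


def hypothesis_holds(call: Callable[[], object], n: int) -> bool:
    """H0 survives the experiment only if none of the n calls failed."""
    return count_failures(call, n) == 0


def hit_api() -> None:
    # Hypothetical address of the local Nginx container; adjust to your setup.
    # urlopen raises on timeouts, connection errors and HTTP error statuses.
    with urllib.request.urlopen("http://localhost:8080/", timeout=2) as resp:
        resp.read()
```

With the containers running, `hypothesis_holds(hit_api, 100)` returning `False` means the hypothesis is rejected for the fault currently being injected.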
Demo
Fingers crossed...
Knobs and Levers
We now have a number of levers to pull. What if we...
● Want to improve on our SLA?
● Want to see how it performs if the API is hard down?
● ...
Antifragility
Failure is inevitable, let’s make it normal
Titanic Architectures
“Titanic architectures are architectures that are good in theory, but haven’t been put into practice”
Anti-titanic architectures?
“What doesn’t kill you makes you stronger”
Antifragility
“The resilient resists shocks and stays the same; the antifragile gets better” - Nassim Taleb
Chaos Engineering
● We expect our teams to build resilient applications
  ○ Fault tolerance across and within service boundaries
● We expect servers and dependent services to fail
● Let’s make that normal
● Production is a playground
● Levelling up
Chaos Engineering - Principles
1. Build a hypothesis around Steady State Behavior
2. Vary real-world events
3. Run experiments in production
4. Automate experiments to run continuously
Requires the ability to measure - you need metrics!!
http://www.principlesofchaos.org/
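The first principle, together with the need for metrics, can be made concrete with a small comparison of error rates. The sketch below is an assumption-laden illustration: the `Sample` type, the `steady_state_holds` function and the 1% default tolerance are hypothetical, and real steady-state definitions usually involve richer business metrics than a single error rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Aggregate traffic observed over one measurement window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # An empty window is treated as error-free rather than dividing by zero.
        return self.errors / self.requests if self.requests else 0.0


def steady_state_holds(
    baseline: Sample, experiment: Sample, tolerance: float = 0.01
) -> bool:
    """True if the experiment's error rate rose by at most `tolerance`
    (absolute) over the baseline measured before the fault was injected."""
    return experiment.error_rate - baseline.error_rate <= tolerance
```

Measuring the baseline first is what turns "it seemed fine" into a hypothesis that an experiment can reject.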
Production Hypothesis Testing
H0 = Loss of an AWS region does not result in errors
Test setup:
● Multi-region application setup for the video playing API
● Apply Chaos Kong to us-west-2
● Measure aggregate production traffic for ‘normal’ levels
Kill an AWS region
http://techblog.netflix.com/2015/09/chaos-engineering-upgraded.html
Go/Hystrix API Demo
H0 = Introducing network latency does not cause API errors
Test setup:
● API1 running with Hystrix circuit breaker enabled if API2 does not respond within SLAs
● Muxy container setup, proxying upstream API2
● A test harness to hit API1 n times, expecting 0 failures
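For readers unfamiliar with the pattern, the circuit breaker in the setup above can be sketched as follows. This is not Hystrix's API and not the demo's Go code; it is a simplified, hypothetical Python illustration that assumes the primary call raises when API2 is slow or down, and it omits the half-open probing state that production breakers typically have.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then serves the
    fallback until `reset_after_s` seconds have passed."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so tests need not wait
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Cool-down elapsed: close the circuit and let calls through again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # fail fast without touching the upstream
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success resets the consecutive-failure count
        return result
```

Because every failed or short-circuited call returns the fallback, a harness hitting API1 can still see 0 failures while API2 is degraded, which is the behaviour H0 is probing.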
Human Factors
Technology is only part of the problem, can we test that too?
Chernobyl
● Worst nuclear disaster of all time (1986)
● Public information sketchy
● Estimated > 3M Ukrainians affected
● Radioactive clouds sent over Europe
● Combination of system + human errors
● Series of seemingly logical steps -> catastrophe
What we know about human factors
● Accidents happen
● 1am - 8am = higher incidence of human errors
● Humans will ignore directions
  ○ They sometimes need to (e.g. override)
  ○ Other times they think they need to (mistake)
● Computers are better at following processes
Let’s use a Production deployment as a key example:
● CI -> CD pipeline used to deploy
● Production incident occurs 6 hours later (2am)
● ...what do we do?
● We trust the build pipeline, avoid non-standard actions
These events help us understand and improve our systems
Translation
“A game day exercise is where we intentionally try to break our system, with the goal of being able to understand it better and learn from it”
Game Day Exercises
Prerequisites:
● A game plan
● All team members and affected staff aware of it
● Close collaboration between Dev, Ops, Test, Product people etc.
● An open mind
● Hypotheses
● Metrics
● Bravery
Game Day Exercises
● Get the entire team together
● Make a simple diagram of the system on a whiteboard
● Come up with ~5 failure scenarios
● Write down hypotheses for each scenario
● Back up any data you can’t lose
● Induce each failure and observe the results
Game Day Exercises
https://stripe.com/blog/game-day-exercises-at-stripe
Examples of things that fail:
● Application dies
● Hard disk fails
● Machine dies < AZ < Region…
● Github/Source control goes down
● Build server dies
● Loss of / degraded network connectivity
● Loss of dependent API
● ...
Wrapping up
I hope I didn’t fail
■ Apply the scientific method
■ Use metrics to learn and make decisions
■ Docker-compose + Muxy to automate failure
■ Build resilience into software & architecture
■ Regularly test Production resilience until it’s normal
■ Production outages are opportunities to learn
■ Start small!
Thank you
PRESENTED BY:
@matthewfellows
References
■ Antifragility (https://en.wikipedia.org/wiki/Antifragile)
■ Chaos Engineering (http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html)
■ Principles of Chaos (http://www.principlesofchaos.org/)
■ Human factors in large-scale technological systems' accidents: Three Mile Island, Bhopal, Chernobyl (http://oae.sagepub.com/content/5/2/133.abstract)
Code / Tool References
■ Docker Compose (https://www.docker.com/docker-compose)
■ Muxy (https://github.com/mefellows/muxy)
■ Nginx resilience testing with Docker Compose (www.onegeek.com.au/articles/resilience-testing-nginx-with-docker-dnsmasq-and-muxy)
■ Golang + Hystrix resilience testing with Docker Compose (https://github.com/mefellows/muxy/tree/mst-meetup-demo/examples/hystrix)