(PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

DESCRIPTION

Complex distributed systems fail. They fail more frequently, and in different ways, as they scale and evolve over time. In this session, you learn how Netflix embraces failure to provide high service availability. Netflix discusses their motivations for inducing failure in production, the mechanics of how Netflix does this, and the lessons they learned along the way. Come hear about the Failure Injection Testing (FIT) framework and suite of tools that Netflix created and currently uses to induce controlled system failures in an effort to help discover vulnerabilities, resolve them, and improve the resiliency of their cloud environment.

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

November 12th, 2014 | Las Vegas, NV

PFC305

Embracing Failure: Fault Injection and Service Resilience at Netflix

Josh Evans and Naresh Gopalani, Netflix

• ~50 million members, ~50 countries

• > 1 billion hours per month

• > 1000 device types

• 3 AWS regions, hundreds of services

• Hundreds of thousands of requests/second

• CDN serves petabytes of data at terabits/second

Netflix Ecosystem

[Diagram: the Netflix ecosystem spans service partners, static content served by Akamai, the Netflix CDN, the AWS/Netflix control plane, and the Internet]

Availability means that members can

● sign up

● activate a device

● browse

● watch

What keeps us up at night

Failures can happen any time

• Disks fail

• Power outages

• Natural disasters

• Software bugs

• Human error

We design for failure

• Exception handling

• Fault tolerance and isolation

• Fallbacks and degraded experiences

• Auto-scaling clusters

• Redundancy

Testing for failure is hard

• Web-scale traffic

• Massive, changing data sets

• Complex interactions and request patterns

• Asynchronous, concurrent requests

• Complete and partial failure modes

Constant innovation and change

What if we regularly inject failures into our systems under controlled circumstances?

Blast Radius

• Unit of isolation

• Scope of an outage

• Scope a chaos exercise

[Diagram: concentric blast radii, from instance to zone to region to global]

An Instance Fails

[Diagram: an edge cluster calling clusters A, B, C, and D; a single instance fails]

Chaos Monkey

• Monkey loose in your DC

• Run during business hours

• What we learned

– Auto-replacement works

– State is problematic
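The mechanics can be sketched in a few lines with the AWS SDK for Java: during business hours, pick one instance at random from a cluster and terminate it, then let auto-replacement prove itself. This is only a minimal illustration, not the Simian Army implementation; cluster discovery, scheduling, and opt-outs are omitted, and the caller is assumed to supply the cluster's instance IDs.

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.TerminateInstancesRequest;
    import java.time.LocalTime;
    import java.util.List;
    import java.util.Random;

    // Minimal Chaos Monkey-style sketch: terminate one random instance from a
    // cluster, but only during business hours so engineers are around to respond.
    public final class MiniChaosMonkey {

        private static final Random RANDOM = new Random();

        // clusterInstanceIds: the EC2 instance IDs of one cluster (assumed to be
        // supplied by the caller; the real tool discovers clusters itself).
        public static void unleash(List<String> clusterInstanceIds) {
            LocalTime now = LocalTime.now();
            if (now.isBefore(LocalTime.of(9, 0)) || now.isAfter(LocalTime.of(15, 0))) {
                return; // outside business hours: do nothing
            }
            String victim = clusterInstanceIds.get(RANDOM.nextInt(clusterInstanceIds.size()));
            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
            ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(victim));
            System.out.println("Chaos Monkey terminated " + victim);
        }
    }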

A State of Xen - Chaos Monkey & Cassandra

Out of our 2700+ Cassandra nodes:

• 218 rebooted

• 22 did not reboot successfully

• Automation replaced failed nodes

• 0 downtime due to reboot

An Availability Zone Fails

[Diagram: regions EU-West, US-East, and US-West; within a region, one of the availability zones (AZ1, AZ2) fails]

Chaos Gorilla

Simulate an Availability Zone outage

• 3-zone configuration

• Eliminate one zone

• Ensure that the others can handle the load and nothing breaks
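A rough sketch of eliminating a zone with the AWS SDK for Java follows; the real exercise also drains traffic away from the zone and verifies the survivors, which is omitted here, and result pagination is ignored for brevity.

    import com.amazonaws.services.ec2.AmazonEC2;
    import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
    import com.amazonaws.services.ec2.model.DescribeInstancesRequest;
    import com.amazonaws.services.ec2.model.Filter;
    import com.amazonaws.services.ec2.model.Instance;
    import com.amazonaws.services.ec2.model.Reservation;
    import com.amazonaws.services.ec2.model.TerminateInstancesRequest;
    import java.util.ArrayList;
    import java.util.List;

    // Rough Chaos Gorilla-style sketch: take out every instance in one
    // Availability Zone and let the remaining zones absorb the load.
    public final class MiniChaosGorilla {

        public static void eliminateZone(String availabilityZone) {
            AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
            DescribeInstancesRequest describe = new DescribeInstancesRequest()
                    .withFilters(new Filter("availability-zone").withValues(availabilityZone));

            // Collect the instance IDs running in the target zone.
            List<String> victims = new ArrayList<>();
            for (Reservation reservation : ec2.describeInstances(describe).getReservations()) {
                for (Instance instance : reservation.getInstances()) {
                    victims.add(instance.getInstanceId());
                }
            }
            if (!victims.isEmpty()) {
                ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(victims));
            }
        }
    }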


Challenges

• Rapidly shifting traffic

– LBs must expire connections quickly

– Lingering connections to caches must be addressed

• Service configuration

– Not all clusters auto-scaled or pinned

– Services not configured for cross-zone calls

– Mismatched timeouts – fallbacks prevented fail-over

A Region Fails

[Diagram: geo-located traffic is routed to EU-West, US-East, or US-West; within each region, requests pass through regional load balancers and Zuul (traffic shaping/routing) across AZ1, AZ2, and AZ3 to the data tier]

Chaos Kong

[Diagram: Chaos Kong fails an entire region; customer device traffic is redirected through the regional load balancers and Zuul to a surviving region's zones and data tier]

Challenges

● Rapidly shifting traffic

○ Auto-scaling configuration

○ Static configuration/pinning

○ Instance start time

○ Cache fill time

Challenges

● Service Configuration

○ Timeout configurations

○ Fallbacks fail or don't provide the desired experience

● No minimal (critical) stack

○ Any service may be critical!

A Service Fails

[Diagram: the blast radius rings (zone, region, global) with a single service highlighted]

Services Slow Down and Fail

Simulate latent/failed service calls

• Inject arbitrary latency and errors at the service level

• Observe for effects

Latency Monkey

[Diagram: request path from a device over the Internet through the ELB, Zuul, and the edge service to services A, B, and C]
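Conceptually, latency and error injection at the service level amounts to wrapping the service call; the sketch below is a minimal illustration, not the actual Latency Monkey, which operated through the shared client/server layers and was centrally configured.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ThreadLocalRandom;

    // Minimal latency/error injection sketch: wrap a service call and, for a
    // configured fraction of requests, add artificial delay or throw an error.
    public final class LatencyInjector {

        private final double impactedFraction;   // e.g. 0.01 = 1% of calls
        private final long addedLatencyMillis;   // artificial delay to inject
        private final boolean throwErrors;       // also simulate hard failures

        public LatencyInjector(double impactedFraction, long addedLatencyMillis, boolean throwErrors) {
            this.impactedFraction = impactedFraction;
            this.addedLatencyMillis = addedLatencyMillis;
            this.throwErrors = throwErrors;
        }

        public <T> T call(Callable<T> serviceCall) throws Exception {
            if (ThreadLocalRandom.current().nextDouble() < impactedFraction) {
                Thread.sleep(addedLatencyMillis);                    // injected latency
                if (throwErrors) {
                    throw new RuntimeException("injected failure");  // injected error
                }
            }
            return serviceCall.call();                               // normal path
        }
    }

A dependency call would be wrapped as injector.call(() -> client.get(...)), where client is whatever outbound client the caller uses, to observe how the caller behaves when that dependency slows down or fails.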

Challenges

• Startup resiliency is an issue

• Service owners don't know all dependencies

• Fallbacks can fail too

• Second order effects not easily tested

• Dependencies are in constant flux

• Latency Monkey tests function and scale

– Not a staged approach

– Lots of opt-outs

More Precise and Continuous Service Failure Testing: FIT

Distributed Systems Fail

● Complex interactions at scale

● Variability across services

● Byzantine failures

● Combinatorial complexity

Any service can cause cascading failures


Fault Injection Testing (FIT)

[Diagram: request-level simulations; a device or account override is applied at the edge (ELB/Zuul), and FIT failure metadata travels with the request to services A, B, and C]

Failure Injection Points

• IPC

• Cassandra client

• Memcached client

• Service container

• Fault tolerance

FIT Details

● Common Simulation Syntax

● Single Simulation Interface

● Transported via an HTTP request header

Integrating Failure

[Diagram: a request from Service A to Service B; the FIT context is carried through each hop's server-receive filter, the service itself, and the Ribbon client-send hook, and returns with the response]

[sendRequestHeader] >> fit.failure: 1|fit.Serializer|2|
[[{"name": "failSocial",
"whitelist": false,
"injectionPoints": ["SocialService"]}, {}
]],
{"Id": "252c403b-7e34-4c0b-a28a-3606fcc38768"}]]

Failure Scenarios

● Set of injection points to fail

● Defined based on

○ Past outages

○ Specific dependency interactions

○ Whitelist of a set of critical services

○ Dynamic tracing of dependencies
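One way to picture a scenario is as a named set of injection points with a whitelist flag, mirroring the fields in the header example (name, whitelist, injectionPoints); the exact semantics of the whitelist flag below are an assumption, not FIT's actual model.

    import java.util.List;

    // Sketch of a failure scenario as a named set of injection points.
    public record FailureScenario(String name, boolean whitelist, List<String> injectionPoints) {

        // whitelist == false: fail only the listed points.
        // whitelist == true:  fail everything except the listed (critical) points.
        public boolean shouldFail(String injectionPoint) {
            return whitelist != injectionPoints.contains(injectionPoint);
        }
    }

For example, new FailureScenario("failSocial", false, List.of("SocialService")) mirrors the header shown earlier, while a whitelist scenario built from the critical services fails everything outside that set.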

FIT Insights: Salp

● Distributed tracing inspired by the Dapper paper

● Provides insight into dependencies

● Helps define & visualize scenarios

Functional Validation

● Isolated synthetic transactions

○ Set of devices

Validation at Scale

● Dial up customer traffic - % based

● Simulation of full service failure
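Percentage-based dial-up can be sketched as deterministic bucketing of customers, so the same customer stays in (or out of) the simulation as the dial is raised; the bucketing scheme below is an assumption, not Netflix's implementation.

    import java.util.UUID;

    // Sketch of percentage-based dial-up of a failure simulation.
    public final class FailureDial {

        private final int percentEnrolled; // 0..100, raised gradually toward "Chaos!"

        public FailureDial(int percentEnrolled) {
            this.percentEnrolled = percentEnrolled;
        }

        // Deterministic: a given customer is enrolled at 5% and stays enrolled at 10%.
        public boolean isEnrolled(String customerId) {
            int bucket = Math.floorMod(customerId.hashCode(), 100);
            return bucket < percentEnrolled;
        }

        public static void main(String[] args) {
            FailureDial dial = new FailureDial(5); // simulate the failure for ~5% of customers
            System.out.println(dial.isEnrolled(UUID.randomUUID().toString()));
        }
    }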

Dialing Up Failure

Chaos!

Continuous Validation

[Diagram: continuous validation runs synthetic transactions against both critical and non-critical services]

Don’t Fear The Monkeys

Take-aways

• Don't wait for random failures

– Cause failure to validate resiliency

– Remove uncertainty by forcing failures regularly

– Better to fail at 2pm than 2am

• Test design assumptions by stressing them

Embrace Failure

The Simian Army is part of the Netflix open source cloud platform

http://netflix.github.com

Netflix talks at re:Invent

Talk      Time                 Title

BDT-403   Wednesday, 2:15pm    Next Generation Big Data Platform at Netflix

PFC-306   Wednesday, 3:30pm    Performance Tuning Amazon EC2

DEV-309   Wednesday, 3:30pm    From Asgard to Zuul, How Netflix's Proven Open Source Tools Can Accelerate and Scale Your Services

ARC-317   Wednesday, 4:30pm    Maintaining a Resilient Front-Door at Massive Scale

PFC-304   Wednesday, 4:30pm    Effective InterProcess Communications in the Cloud: The Pros and Cons of Microservices Architectures

ENT-209   Wednesday, 4:30pm    Cloud Migration, Dev-Ops and Distributed Systems

APP-310   Friday, 9:00am       Scheduling Using Apache Mesos in the Cloud

Please give us your feedback on this session.

Complete session evaluations and earn re:Invent swag.

http://bit.ly/awsevals

Josh Evans

jevans@netflix.com

@josh_evans_nflx

Naresh Gopalani

ngopalani@netflix.com