© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
November 12th, 2014 | Las Vegas, NV
PFC305: Embracing Failure: Fault Injection and Service Resilience at Netflix
Josh Evans and Naresh Gopalani, Netflix

(PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

DESCRIPTION

Complex distributed systems fail. They fail more frequently, and in different ways, as they scale and evolve over time. In this session, you learn how Netflix embraces failure to provide high service availability. Netflix discusses their motivations for inducing failure in production, the mechanics of how Netflix does this, and the lessons they learned along the way. Come hear about the Failure Injection Testing (FIT) framework and suite of tools that Netflix created and currently uses to induce controlled system failures in an effort to help discover vulnerabilities, resolve them, and improve the resiliency of their cloud environment.

Page 1: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

November 12th, 2014 | Las Vegas, NV

PFC305

Embracing Failure: Fault Injection and Service Resilience at Netflix

Josh Evans and Naresh Gopalani, Netflix

Page 2: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

• ~50 million members, ~50 countries

• > 1 billion hours per month

• > 1000 device types

• 3 AWS regions, hundreds of services

• Hundreds of thousands of requests/second

• CDN serves petabytes of data at terabits/second

Netflix Ecosystem

[Diagram labels: Service Partners, Static Content (Akamai), Netflix CDN, AWS/Netflix Control Plane, Internet]

Page 3: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Availability means that members can

● sign up

● activate a device

● browse

● watch

Page 4: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

What keeps us up at night

Page 5: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Failures can happen any time

• Disks fail

• Power outages

• Natural disasters

• Software bugs

• Human error

Page 6: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

We design for failure

• Exception handling

• Fault tolerance and isolation

• Fall-backs and degraded experiences

• Auto-scaling clusters

• Redundancy
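
The "fall-backs and degraded experiences" item can be illustrated with a minimal sketch. It is not Netflix's implementation: the function names, the row contents, and the 50 ms timeout are all hypothetical, and a plain thread pool stands in for whatever isolation mechanism a real service would use.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

_pool = ThreadPoolExecutor(max_workers=4)

def call_with_fallback(primary, fallback, timeout_s):
    """Run primary(); if it times out or raises, return fallback() instead."""
    future = _pool.submit(primary)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort only; an already running call is not interrupted
        return fallback()
    except Exception:
        return fallback()

def personalized_rows():
    # Hypothetical dependency call that is currently too slow.
    time.sleep(0.5)
    return ["Top picks for you"]

def popular_rows():
    # Degraded experience: an unpersonalized default.
    return ["Popular titles"]

rows = call_with_fallback(personalized_rows, popular_rows, timeout_s=0.05)
print(rows)
```

The caller still gets a usable answer when the dependency is slow, which is the property that failure injection is later used to verify.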

Page 7: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Testing for failure is hard

• Web-scale traffic

• Massive, changing data sets

• Complex interactions and request patterns

• Asynchronous, concurrent requests

• Complete and partial failure modes

Constant innovation and change

Page 8: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

What if we regularly inject failures

into our systems under controlled

circumstances?

Page 9: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014
Page 10: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Blast Radius

• Unit of isolation

• Scope of an outage

• Scope a chaos exercise

[Diagram labels: Instance, Zone, Region, Global]

Page 11: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

An Instance Fails

[Diagram labels: Edge Cluster, Cluster A, Cluster B, Cluster C, Cluster D]

Page 12: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Chaos Monkey

• Monkey loose in your DC

• Run during business hours

• What we learned

– Auto-replacement works

– State is problematic
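
A minimal sketch of the selection logic behind a Chaos Monkey style exercise, under stated assumptions: the cluster names, instance IDs, and the 9:00 to 15:00 weekday window are hypothetical, and the code only chooses candidates. It does not terminate anything and is not the Simian Army code.

```python
import random
from datetime import datetime

def in_business_hours(now, start_hour=9, end_hour=15):
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour

def pick_victims(clusters, now, rng):
    """Pick one random instance per non-empty cluster, only during business hours."""
    if not in_business_hours(now):
        return {}
    return {name: rng.choice(instances)
            for name, instances in clusters.items() if instances}

clusters = {
    "cluster-a": ["i-a1", "i-a2", "i-a3"],
    "cluster-b": ["i-b1"],
    "cluster-c": [],
}
rng = random.Random(42)  # seeded so a run can be reproduced
victims = pick_victims(clusters, datetime(2014, 11, 12, 14, 0), rng)
print(victims)
```

Restricting the run to business hours means engineers are at their desks when an instance disappears.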

Page 13: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

A State of Xen - Chaos Monkey & Cassandra

Out of our 2700+ Cassandra nodes

• 218 rebooted

• 22 did not reboot successfully

• Automation replaced failed nodes

• 0 downtime due to reboot
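
The figures on this slide can be put in proportion with a short calculation. The only assumption is that "2700+" is treated as exactly 2700, so the fleet share is a slight overestimate.

```python
total_nodes = 2700    # slide says "2700+", so the fleet share below is an upper bound
rebooted = 218
failed_reboot = 22

share_rebooted = rebooted / total_nodes   # fraction of the fleet that was rebooted
share_failed = failed_reboot / rebooted   # fraction of rebooted nodes that did not come back

print(f"{share_rebooted:.1%} of nodes rebooted; {share_failed:.1%} of those did not reboot successfully")
```

Roughly 8% of the fleet was rebooted and about one in ten of those nodes did not come back, yet the slide reports zero downtime because automation replaced the failed nodes.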

Page 14: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

An Availability Zone Fails

[Diagram labels: regions US-East, US-West, EU-West; zones AZ1, AZ2]

Page 15: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Chaos Gorilla

Simulate an Availability Zone outage

• 3-zone configuration

• Eliminate one zone

• Ensure that others can handle the load and nothing breaks
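
The "others can handle the load" check can be sketched as a capacity calculation. This assumes the lost zone's traffic is spread evenly over the survivors, which real load balancing may not do, and all zone names and request rates are hypothetical.

```python
def load_after_zone_loss(zone_load, lost_zone):
    """Redistribute the lost zone's load evenly across the surviving zones."""
    survivors = {z: load for z, load in zone_load.items() if z != lost_zone}
    extra = zone_load[lost_zone] / len(survivors)
    return {z: load + extra for z, load in survivors.items()}

def zones_over_capacity(zone_load, capacity, lost_zone):
    """Return the surviving zones whose post-failure load exceeds their capacity."""
    after = load_after_zone_loss(zone_load, lost_zone)
    return sorted(z for z, load in after.items() if load > capacity[z])

load = {"az1": 1000.0, "az2": 1000.0, "az3": 1000.0}      # requests/second, hypothetical
capacity = {"az1": 1600.0, "az2": 1400.0, "az3": 1600.0}  # requests/second, hypothetical
over = zones_over_capacity(load, capacity, lost_zone="az1")
print(over)
```

In a balanced 3-zone configuration, each surviving zone must absorb 50% more traffic, so a zone provisioned with less than 1.5x headroom shows up in the result.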

Page 16: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Challenges

• Rapidly shifting traffic

– LBs must expire connections quickly

– Lingering connections to caches must be addressed

• Service configuration

– Not all clusters auto-scaled or pinned

– Services not configured for cross-zone calls

– Mismatched timeouts – fallbacks prevented fail-over

Page 17: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

A Region Fails

[Diagram labels: regions EU-West, US-East, US-West]

Page 18: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Chaos Kong

[Diagram, two regions, each showing: Regional Load Balancers; Zuul – Traffic Shaping/Routing; AZ1, AZ2, AZ3; Data per zone. Other labels: Customer Device, Geo-located]

Page 19: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Challenges

● Rapidly shifting traffic

○ Auto-scaling configuration

○ Static configuration/pinning

○ Instance start time

○ Cache fill time

Page 20: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Challenges

● Service Configuration

○ Timeout configurations

○ Fallbacks fail or don’t provide the desired experience

● No minimal (critical) stack

○ Any service may be critical!

Page 21: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

A Service Fails

[Diagram labels: Service, Zone, Region, Global]

Page 22: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Services Slow Down and Fail

Latency Monkey: simulate latent/failed service calls

• Inject arbitrary latency and errors at the service level

• Observe for effects
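
Injecting latency and errors at the service level can be sketched as a wrapper around a service call. This is an illustration only, not Latency Monkey itself: `with_fault_injection`, `InjectedFailure`, and `get_ratings` are hypothetical names.

```python
import random
import time

class InjectedFailure(Exception):
    """Raised in place of the real call when an error is injected."""

def with_fault_injection(call, delay_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap call so it is delayed by delay_s and fails with probability error_rate."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if delay_s > 0:
            sleep(delay_s)                  # simulate a latent dependency
        if rng.random() < error_rate:       # simulate a failed dependency
            raise InjectedFailure(f"injected failure in {call.__name__}")
        return call(*args, **kwargs)

    return wrapped

def get_ratings(title_id):
    # Hypothetical service call.
    return {"title": title_id, "stars": 4}

slow = with_fault_injection(get_ratings, delay_s=0.05)
broken = with_fault_injection(get_ratings, error_rate=1.0)
print(slow(7))
```

Callers of `slow` and `broken` can then be observed to see whether their timeouts and fallbacks behave as designed.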

Page 23: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Latency Monkey

[Diagram labels: Device, Internet, ELB, Zuul, Edge, Service A, Service B, Service C]

Page 24: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Challenges

• Startup resiliency is an issue

• Service owners don’t know all dependencies

• Fallbacks can fail too

• Second order effects not easily tested

• Dependencies are in constant flux

• Latency Monkey tests function and scale

– Not a staged approach

– Lots of opt-outs

Page 25: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

More Precise and Continuous

Page 26: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Service Failure Testing: FIT

Page 27: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Distributed Systems Fail

● Complex interactions at scale

● Variability across services

● Byzantine failures

● Combinatorial complexity

Page 28: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Any service can cause cascading failures

ELB

Page 29: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Failure Injection Testing (FIT)

[Diagram labels: Device, Internet, ELB, Zuul, Edge, Service A, Service B, Service C; annotations: Device or Account Override, Request-level simulations]

Page 30: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Failure Injection Points

● IPC

● Cassandra Client

● Memcached Client

● Service Container

● Fault Tolerance

Page 31: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

FIT Details

● Common Simulation Syntax

● Single Simulation Interface

● Transported via HTTP request header
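
Carrying a simulation in a request header might look like the sketch below. The header name `fit.failure` and the `name` and `injectionPoints` fields come from the deck's example header. The plain-JSON payload, the `handle` function, and the service names other than `SocialService` are assumptions made for illustration and do not reproduce FIT's actual serialization.

```python
import json

FIT_HEADER = "fit.failure"  # header name as shown in the deck's example

class InjectedFailure(Exception):
    """Raised when the current injection point is named in the scenario."""

def read_scenario(headers):
    """Return the failure scenario carried by the request, or None."""
    raw = headers.get(FIT_HEADER)
    return json.loads(raw) if raw else None

def handle(headers, injection_point, call_downstream):
    """Fail here if the scenario names this point; otherwise forward the header."""
    scenario = read_scenario(headers)
    if scenario and injection_point in scenario["injectionPoints"]:
        raise InjectedFailure(f"{scenario['name']}: failing {injection_point}")
    # Forward the header unchanged so downstream services see the same scenario.
    outbound = {FIT_HEADER: headers[FIT_HEADER]} if scenario else {}
    return call_downstream(outbound)

scenario = {"name": "failSocial", "injectionPoints": ["SocialService"]}
inbound = {FIT_HEADER: json.dumps(scenario)}
forwarded = handle(inbound, "EdgeService", lambda headers: headers)
print(forwarded)
```

Because the scenario travels with the request, only requests that carry the header are affected, and every service on the path can apply the same check.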

Page 32: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Integrating Failure

[Diagram labels: Service A and Service B, each with Service, Filter, Ribbon; ServerRcv, ClientSend; request, response]

[sendRequestHeader] >>fit.failure: 1|fit.Serializer|2|[[{"name":"failSocial",
"whitelist":false,
"injectionPoints":["SocialService"]},{}]],
{"Id":"252c403b-7e34-4c0b-a28a-3606fcc38768"}]]

Page 33: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Failure Scenarios

● Set of injection points to fail

● Defined based on

○ Past outages

○ Specific dependency interactions

○ Whitelist of a set of critical services

○ Dynamic tracing of dependencies
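
One way to read the "whitelist of a set of critical services" item is to fail every traced dependency that is not on the whitelist. The sketch below follows that reading, which is an assumption rather than something the slides spell out. The function name and every service name except `SocialService` are hypothetical.

```python
def scenario_from_whitelist(name, traced_dependencies, critical_services):
    """Build a scenario that fails every traced dependency not on the whitelist."""
    to_fail = sorted(set(traced_dependencies) - set(critical_services))
    return {"name": name, "injectionPoints": to_fail}

# Dependencies as they might be reported by request tracing (duplicates are expected).
traced = ["SocialService", "Ratings", "Playback", "Recommendations", "Playback"]
critical = ["Playback"]

scenario = scenario_from_whitelist("failNonCritical", traced, critical)
print(scenario)
```

Running such a scenario tests whether the service still works when only its critical dependencies are available.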

Page 34: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

FIT Insights: Salp

● Distributed tracing inspired by the Dapper paper

● Provides insight into dependencies

● Helps define & visualize scenarios

Page 35: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Dialing Up Failure

Functional Validation

● Isolated synthetic transactions

○ Set of devices

Validation at Scale

● Dial up customer traffic - % based

● Simulation of full service failure

Chaos!
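
A percentage-based dial can be sketched by hashing a stable identifier into buckets. The slides do not say how FIT selects customers, so the hashing scheme, the bucket count, and the customer IDs below are all assumptions.

```python
import hashlib

def bucket(customer_id, buckets=10_000):
    """Map a customer id to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def in_experiment(customer_id, percent):
    """True if the customer falls inside the first `percent` of the 10,000 buckets."""
    return bucket(customer_id) < percent * 100  # 10,000 buckets give 0.01% steps

customers = [f"customer-{i}" for i in range(20_000)]
at_1 = {c for c in customers if in_experiment(c, 1.0)}
at_5 = {c for c in customers if in_experiment(c, 5.0)}
print(len(at_1), len(at_5))
```

Because the bucket is derived from the ID, raising the dial from 1% to 5% keeps the original customers in scope and only adds new ones, so impact grows gradually.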

Page 36: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Continuous Validation

[Diagram labels: Critical Services, Non-critical Services, Synthetic Transactions]

Page 37: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Don’t Fear The Monkeys

Page 38: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Take-aways

• Don’t wait for random failures

– Cause failure to validate resiliency

– Remove uncertainty by forcing failures regularly

– Better to fail at 2pm than 2am

• Test design assumptions by stressing them

Embrace Failure

Page 39: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

The Simian Army is part of the Netflix open source cloud platform

http://netflix.github.com

Page 40: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Netflix talks at re:Invent

Talk | Time | Title
BDT-403 | Wednesday, 2:15pm | Next Generation Big Data Platform at Netflix
PFC-306 | Wednesday, 3:30pm | Performance Tuning Amazon EC2
DEV-309 | Wednesday, 3:30pm | From Asgard to Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services
ARC-317 | Wednesday, 4:30pm | Maintaining a Resilient Front-Door at Massive Scale
PFC-304 | Wednesday, 4:30pm | Effective InterProcess Communications in the Cloud: The Pros and Cons of Microservices Architectures
ENT-209 | Wednesday, 4:30pm | Cloud Migration, Dev-Ops and Distributed Systems
APP-310 | Friday, 9:00am | Scheduling using Apache Mesos in the Cloud

Page 41: (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Please give us your feedback on this session.

Complete session evaluations and earn re:Invent swag.

http://bit.ly/awsevals

Josh Evans

@josh_evans_nflx

Naresh Gopalani
