© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Netflix Development Patterns for Rapid Iteration, Scale, Performance, & Availability
Neil Hunt, Netflix
November 13, 2013

Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

DESCRIPTION

This session explains how Netflix is using the capabilities of AWS to balance the rate of change against the risk of introducing a fault. Netflix uses a modular architecture with fault isolation and fallback logic for dependencies to maximize availability. This approach allows for rapid independent evolution of individual components to maximize the pace of innovation and A/B testing, and offers nearly unlimited scalability as the business grows. Learn how we balance managing change to (or subtraction from) the customer experience, while aggressively scraping barnacle features that add complexity for little value.

Page 1: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Netflix Development Patterns for Rapid Iteration, Scale, Performance, & Availability

Neil Hunt, Netflix

November 13, 2013

Page 2: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Are You Designing Systems That Are:
• Web-scale
• Global
• Highly-available
• Consumer-facing
• Cloud Native

Page 3: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Cloud Native
• Service oriented architecture
• Redundancy
• Statelessness
• NoSQL
• Eventual consistency

Page 4: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Assumptions
[Figure: 2x2 quadrant chart with axes Speed and Scale. Quadrant titles: Slowly Changing Large Scale; Rapid Change Large Scale; Slowly Changing Small Scale; Rapid Change Small Scale. Organization labels: Enterprise IT; Telcos; Startups; Web-Scale. Assumption labels: Everything works; Everything is Broken; Hardware will fail; Software will fail.]

Page 5: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Netflix Cloud Goals: Availability, Scale, Performance

Page 6: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Performance
• Reduce session start by 1s
  – Save 1 human lifetime per day!
  – Win more moments of truth
• Suggest choices 1% better
  – 500k hours/day additional value delivered

Page 7: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Scale
• 50% y/y traffic growth
• 50 Countries, 3 continents
• Tens of thousands of instances at peak
• 4 AWS regions, 12 datacenters
• ~$.001 per start

Page 8: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Availability
• Aspire to 4 x nines (99.99% of starts successful)
• Per Quarter:
  – Downtime: < 3 mins (peak time)
  – Successful starts: 9.999B
  – Failures: 1M (frustration, calls, lost business)
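
The per-quarter figures above can be cross-checked with a few lines of arithmetic. This is a rough sketch: the total number of starts is assumed to be the sum of the two slide figures, and the 91-day quarter with evenly spread traffic is an assumption made only for illustration.

```python
# Figures taken from the slide.
successful_starts = 9_999_000_000   # 9.999B successful starts per quarter
failed_starts = 1_000_000           # 1M failed starts per quarter

# Assumption: total starts is simply the sum of the two figures.
total_starts = successful_starts + failed_starts
success_rate = successful_starts / total_starts
print(f"success rate: {success_rate:.4%}")

# Assumption: a 91-day quarter with traffic spread evenly over time.
minutes_per_quarter = 91 * 24 * 60
downtime_budget = minutes_per_quarter * (1 - success_rate)
print(f"equivalent full outage at average load: {downtime_budget:.1f} minutes")
```

At average load the failure budget corresponds to about 13 minutes of full outage per quarter. The slide's tighter figure is stated for peak time, when starts per minute are presumably well above the average, so the same number of failures accumulates in fewer minutes.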

Page 9: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Availabilities Compound
With N chained dependencies, each 99.99% available (99.99% × 99.99% × 99.99% × …), system availability is (99.99%)^N:

N Service Dependencies | Availability
2                      | .9998
10                     | .999
100                    | .99
1000                   | .9
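
The table values can be reproduced with a short calculation. This sketch assumes the dependencies fail independently and that the system needs every one of them to be up; the function name is hypothetical.

```python
def compound_availability(per_dependency: float, n: int) -> float:
    """Availability of a chain of n independent dependencies that must all work."""
    return per_dependency ** n


# Each dependency is 99.99% available, as on the slide.
table = {n: compound_availability(0.9999, n) for n in (2, 10, 100, 1000)}
for n, availability in table.items():
    print(f"{n:>5} dependencies -> {availability:.4f}")
```

The results round to the slide's figures: roughly .9998, .999, .99, and .9.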

Page 10: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Availabilities Compound
To achieve 99.99% availability with 1000 components requires:
• 99.99999% availability for each dependency, where component failure leads to system failure (at 99.9999% each, 1000 chained components reach only about 99.9%)
or
• Isolation for independence, where component failure leads to degradation rather than system failure
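
For the first option, where any component failure takes the system down, the per-component requirement is the N-th root of the system target. The sketch below assumes independent failures; the helper names are hypothetical.

```python
import math


def required_per_component(target: float, n: int) -> float:
    """Per-component availability needed so that n chained components meet target."""
    return target ** (1.0 / n)


def nines(availability: float) -> float:
    """Number of nines, e.g. 0.9999 -> 4.0."""
    return -math.log10(1.0 - availability)


needed = required_per_component(0.9999, 1000)
print(f"needed per component: {needed:.7%} (about {nines(needed):.1f} nines)")

# For comparison: what 1000 six-nines components deliver when chained.
six_nines_chain = 0.999999 ** 1000
print(f"1000 components at 99.9999% each: {six_nines_chain:.4%}")
```

The requirement works out to roughly seven nines per component, while six-nines components compound to about 99.9% across 1000 dependencies. That gap is why the isolation option matters.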

Page 11: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Availability, Scale, Performance Are Not Enough!

Page 12: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Rapid Iteration – Rate of Change
• Running tests
• Rolling out tests
  – Engineering the winning test experience for scale
• Adding features
• Scaling up
• Removing features, simplifying, minimizing

Page 13: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Testing
• Up to 1,000 changes per day!

Page 14: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Rate of Change
• Change leads to bugs
  – New features
  – New configurations
  – New types of inputs
  – Scaling up
• Availability is in tension with rate of change

Page 15: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Availability / Rate of Change Tradeoff
[Figure: availability (99% to 99.999%) plotted against rate of change (1 to 1000), with a curve labeled "Frontier of availability/change".]

Page 16: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Availability / Rate of Change Tradeoff
[Figure: availability (99% to 99.999%) plotted against rate of change (1 to 1000), with a curve labeled "Frontier of availability/change".]

Page 17: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Shifting the Curve…
[Figure: availability (99% to 99.999%) plotted against rate of change (1 to 1000).]

Page 18: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Shifting the Curve
• Must break the chained dependencies that compound in cascading system failure
• Subsystem isolation:
  – Failure in one component should never result in cascading system failure

Page 19: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Isolating Subsystems
Redundant systems with timeout & failover
• Failure of instance
• Failure of network
• Latency Monkey to test
[Diagram labels: Dependent System; Dependence; Timeout]

Page 20: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Isolating Subsystems
Redundant systems with timeout & failover
• Failure of instance
• Failure of network
• Latency Monkey to test
[Diagram labels: Higher Tier System (longer timeout); Dependent System (short timeout); Dependence]
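
The timeout-and-failover pattern can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Netflix's implementation: the replicas are plain Python callables, the function and constant names are hypothetical, and the rule that the higher tier's longer timeout should cover one short timeout per replica is an assumption about how the two timeouts relate.

```python
import concurrent.futures as cf
import time


def call_with_failover(replicas, timeout_s):
    """Try each redundant replica in turn, giving up on one after timeout_s seconds."""
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(replicas)))
    try:
        for replica in replicas:
            future = pool.submit(replica)
            try:
                return future.result(timeout=timeout_s)
            except Exception:
                # Timeout or replica error: fail over to the next replica.
                continue
        raise RuntimeError("all replicas failed")
    finally:
        # Do not block on calls that are still hanging.
        pool.shutdown(wait=False)


def slow_replica():
    time.sleep(0.3)  # simulates an instance or network problem
    return "slow"


def healthy_replica():
    return "ok"


replicas = [slow_replica, healthy_replica]
SHORT_TIMEOUT_S = 0.05
# Hypothetical budget rule: the higher tier waits long enough for the
# lower tier to time out on every replica before giving up itself.
LONGER_TIMEOUT_S = SHORT_TIMEOUT_S * len(replicas) + 0.05

result = call_with_failover(replicas, timeout_s=SHORT_TIMEOUT_S)
print(result, LONGER_TIMEOUT_S)
```

Injecting artificial delay, as `slow_replica` does here, is the kind of condition the slide's Latency Monkey is meant to exercise.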

Page 21: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Isolating Subsystems
Timeout with fallback default response
• Network failure
• Software bug
[Diagram labels: Dependent System; Dependence; Timeout & Default response]
Default response: { status=mem, plan=4, device=true }
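
A fallback default response can be sketched as a wrapper that returns a canned value when the dependency times out or raises. The default values below are copied from the slide; the function name, the timeout value, and the stand-in dependencies are hypothetical.

```python
import concurrent.futures as cf
import time

# Default response values as shown on the slide.
DEFAULT_RESPONSE = {"status": "mem", "plan": 4, "device": True}


def call_with_default(dependency, timeout_s, default):
    """Return the dependency's answer, or the default on timeout or error."""
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(dependency)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Network failure, software bug, or timeout: degrade instead of failing.
        return default
    finally:
        pool.shutdown(wait=False)


def buggy_dependency():
    raise ValueError("software bug")


def hung_dependency():
    time.sleep(0.3)  # simulates a network failure
    return {"status": "unknown"}


from_bug = call_with_default(buggy_dependency, 0.05, DEFAULT_RESPONSE)
from_hang = call_with_default(hung_dependency, 0.05, DEFAULT_RESPONSE)
print(from_bug, from_hang)
```

The caller keeps working with a plausible answer, so the failure shows up as a degraded experience rather than an error.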

Page 22: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Isolating Subsystems
Canary Push
• Network failure
• Software bug
[Diagram labels: Dependent System; Dependence; Timeout; Canary instance (new code)]
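
One way to picture a canary push is as a comparison between the canary instance running new code and the existing fleet. The slide does not say how the comparison is made, so the error-rate rule, the `tolerance` parameter, and the function name below are all assumptions for illustration.

```python
def canary_verdict(baseline_errors, baseline_requests,
                   canary_errors, canary_requests, tolerance=2.0):
    """Compare the canary's error rate with the baseline fleet's error rate.

    Hypothetical rule: promote the new code only if the canary's error rate
    is at most `tolerance` times the baseline rate.
    """
    if baseline_requests <= 0 or canary_requests <= 0:
        raise ValueError("need traffic on both baseline and canary")
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    return "promote" if canary_rate <= baseline_rate * tolerance else "roll back"


# Baseline: 10 errors in 10,000 requests. Canary: 1 error in 100 requests.
bad = canary_verdict(10, 10_000, 1, 100)
# Canary with no errors in 100 requests.
good = canary_verdict(10, 10_000, 0, 100)
print(bad, good)
```

Because only one instance carries the new code, a bad push affects a small share of traffic before it is caught.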

Page 23: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Isolating Subsystems
Red/Black deployment
• Software bugs
[Diagram labels: Dependent System; Dependence V2.3 (bad code pushed); Dependence V2.2 (fail back to old code)]
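
Red/black deployment keeps the previous version available so traffic can be switched back when the new code misbehaves. The sketch below models only the traffic switch; the class and method names are hypothetical, and the version strings are the ones on the slide.

```python
class RedBlackRouter:
    """Tracks which version receives traffic and which one is kept on standby."""

    def __init__(self, live_version):
        self.live = live_version
        self.previous = None

    def push(self, new_version):
        # Keep the old version around instead of tearing it down.
        self.previous, self.live = self.live, new_version

    def fail_back(self):
        if self.previous is None:
            raise RuntimeError("no previous version to fail back to")
        self.live, self.previous = self.previous, self.live


router = RedBlackRouter("V2.2")
router.push("V2.3")          # bad code pushed
after_push = router.live
router.fail_back()           # fail back to old code
after_fail_back = router.live
print(after_push, after_fail_back)
```

Failing back is a routing change rather than a rebuild, which keeps recovery from a bad push fast.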

Page 24: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Isolating Subsystems
Standby Blue system
• Independent implementation
• Simplified logic
[Diagram labels: Dependent System; Dependence V2.3; Static reference implementation; Fail to static version]

Page 25: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Isolating Subsystems
Zone isolation
• Infrastructure failure (e.g. power outage)
• Chaos Gorilla
[Diagram labels: Load Balancer; Zone A and Zone B, each containing Dependent System and Dependence]
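
Zone isolation can be pictured as a load balancer that only routes to zones currently reporting healthy. This is a simplified sketch: the health map, the round-robin rule, and the function name are assumptions, and real load balancers make this decision from health checks rather than a static dictionary.

```python
def route(zone_health, request_id):
    """Pick a healthy zone for a request using simple round-robin."""
    healthy = sorted(zone for zone, ok in zone_health.items() if ok)
    if not healthy:
        raise RuntimeError("no healthy zone available")
    return healthy[request_id % len(healthy)]


# Normal operation: traffic alternates between zones.
both_up = {"Zone A": True, "Zone B": True}
normal = [route(both_up, i) for i in range(4)]

# Simulated zone outage (the kind of failure Chaos Gorilla exercises).
zone_b_down = {"Zone A": True, "Zone B": False}
during_outage = [route(zone_b_down, i) for i in range(4)]
print(normal, during_outage)
```

The surviving zone absorbs all traffic, so it needs enough spare capacity for this to work in practice.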

Page 26: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Isolating Subsystems
Region isolation
• Infrastructure software bugs (e.g. load balancer fail)
• Chaos Kong
[Diagram labels: DNS; Region E and Region W, each with a Load Balancer in front of Zone A and Zone B; each zone contains Dependent System and Dependence]

Page 27: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Isolating Subsystems

Dependency Mode                   | Isolating Technique
Instance failure, Network failure | Redundant systems with failover and timeout; Timeout with default response
Network failure, Software bug     | Canary push; Red-black deployment; Blue systems
Infrastructure failure            | Zone isolation
Cross-zone software bugs          | Region isolation

Page 28: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

Trying Harder Won’t Cut It
• Trying harder gets a linear return on an exponential problem
• Need to be great at execution AND have the right architecture
• What architectural features are you using to ensure availability, scale, performance, & rapid rate of change?
