Maintaining the Netflix Front Door - Presentation at Intuit Meetup

DESCRIPTION

This presentation goes into detail on the key principles behind the Netflix API, including design, resiliency, scaling, and deployment. Among other things, I discuss our migration from our REST API to what we call our Experience-Based API design. It also shares several of our open source efforts such as Zuul, Scryer, Hystrix, RxJava and the Simian Army.

Maintaining the Front Door to Netflix

Daniel Jacobson
@daniel_jacobson

http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson

Global Streaming Video for TV Shows and Movies

More than 48 Million Subscribers

More than 40 Countries

Netflix Accounts for >34% of Peak Downstream Traffic in North America

Netflix subscribers are watching more than 1 billion hours a month

Netflix Accounts for >6% of Peak Upstream Traffic in North America

Team Focus: Build the Best Global Streaming Product

Three aspects of the Streaming Product:
• Non-Member
• Discovery
• Streaming

The Netflix API - Background

Netflix API Requests by Audience at Launch in 2008

[Chart legend: Netflix Devices, Open API Developers]

Netflix API Requests by Audience from 2011

[Chart legend: Netflix Devices, Open API Developers]

Current Emphasis of Netflix API

Netflix Devices

Netflix API : Key Responsibilities

• Broker data between services and Devices

• Provide features and business logic

• Maintain a resilient front-door

• Scale the system

• Maintain high velocity

• Provide detailed insights into the system health

Netflix API : Key Responsibilities

• Broker data between services and Devices

APIs Do Lots of Things!

Data Gathering

Data Formatting

Data Delivery

Security

Authorization

Authentication

System Scaling

Discoverability

Data Consistency

Translations

Throttling

Orchestration

These are some of the many things APIs do.

These three are at the core: Data Gathering, Data Formatting, and Data Delivery. All others ultimately support them.

Definitions

• Data Gathering
  – Retrieving the requested data from one or many local or remote data sources

• Data Formatting
  – Preparing a structured payload for the requesting agent

• Data Delivery
  – Delivering the structured payload to the requesting agent
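A minimal sketch of how the three core functions defined above could be kept as separate steps. Everything named here (`gather`, `format_payload`, `deliver`, the sample sources and their data) is a hypothetical illustration and not part of the Netflix API.

```python
import json
from typing import Callable, Dict, List

# Hypothetical data sources; a real system would call local or remote services.
SOURCES: Dict[str, Callable[[str], dict]] = {
    "ratings": lambda user_id: {"ratings": [5, 3]},
    "queue": lambda user_id: {"queue": ["title-1", "title-2"]},
}

def gather(user_id: str, wanted: List[str]) -> dict:
    """Data gathering: retrieve the requested data from one or many sources."""
    data: dict = {}
    for name in wanted:
        data.update(SOURCES[name](user_id))
    return data

def format_payload(data: dict) -> str:
    """Data formatting: prepare a structured payload (JSON in this sketch)."""
    return json.dumps(data, sort_keys=True)

def deliver(payload: str, send: Callable[[str], None]) -> None:
    """Data delivery: hand the payload to the requesting agent via `send`."""
    send(payload)

# Usage: the three steps compose, and each can be swapped independently.
outbox: List[str] = []
deliver(format_payload(gather("u1", ["ratings", "queue"])), outbox.append)
```

Because each step only depends on the output of the previous one, a different formatter or delivery mechanism could be plugged in without touching how the data is gathered.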

Meanwhile…

There are two players in APIs

API Provider API Consumer

API Provider

PROVIDES

API Consumer

CONSUMES

Traditional API Interactions

API Provider

PROVIDES EVERYTHING

API Consumer

CONSUMES WHAT IS PROVIDED

Everything means the API Provider does:
• Data Gathering
• Data Formatting
• Data Delivery
• (among other things)

Why do most API providers provide everything?

• API design tends to be easier for teams closer to the source

• Centralizing API functions makes them easier to support

• Many APIs have a large set of unknown and external developers

At Netflix, we see it a different way…

Separation of Concerns

To be a better provider, the API should address the separation of concerns of the three core functions

Data Gathering
• API Consumer: Doesn’t care how data is gathered, as long as it is gathered
• API Provider: Cares a lot about how the data is gathered

Data Formatting
• API Consumer: Each consumer cares a lot about the format for that specific use
• API Provider: Only cares about the format to the extent it is easy to support

Data Delivery
• API Consumer: Each consumer cares a lot about how the payload is delivered
• API Provider: Only cares about delivery method to the extent it is easy to support

Because of our separation of concerns, the Netflix API team is able to focus on different charters

Brokering Data to 1,000+ Device Types

Screen Real Estate

Controller

Technical Capabilities

One-Size-Fits-All API

[Diagram: many separate requests to the One-Size-Fits-All API]

Courtesy of South Florida Classical Review

Resource-Based API

vs.

Experience-Based API

Resource-Based Requests

• /users/<id>/ratings/title
• /users/<id>/queues
• /users/<id>/queues/instant
• /users/<id>/recommendations
• /catalog/titles/movie
• /catalog/titles/series
• /catalog/people

[Diagram: OSFA (One-Size-Fits-All) API. Services behind the API: Recommendations, Movie Data, Similar Movies, Auth, Member Data, A/B Tests, Start-Up, Ratings. Split at the network border: server code (data gathering, formatting, and delivery) and client code (user interface rendering).]

Experience-Based Requests

• /ps3/homescreen
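A sketch contrasting the two request styles. With resource-based endpoints the device issues one request per resource and assembles the screen itself; with an experience-based endpoint such as /ps3/homescreen it issues a single request. The `fetch` function, the request log, and the returned payloads are hypothetical stand-ins for real network calls.

```python
from typing import Dict, List

request_log: List[str] = []  # records each request that crosses the network border

def fetch(path: str) -> dict:
    """Hypothetical network call from the device to the API."""
    request_log.append(path)
    return {"path": path}

def resource_based_home(user_id: str) -> Dict[str, dict]:
    # The device asks for each resource separately and combines the results.
    paths = [
        f"/users/{user_id}/recommendations",
        f"/users/{user_id}/queues/instant",
        f"/users/{user_id}/ratings/title",
    ]
    return {p: fetch(p) for p in paths}

def experience_based_home() -> dict:
    # The device asks once; gathering and formatting happen on the server side.
    return fetch("/ps3/homescreen")

resource_based_home("42")
resource_calls = len(request_log)      # one request per resource

request_log.clear()
experience_based_home()
experience_calls = len(request_log)    # a single request for the whole screen
```

The saving grows with the number of resources a screen needs, which matters most on devices with slow or high-latency connections.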

[Diagram: Java API with a Groovy layer. Services behind the API: Recommendations, Movie Data, Similar Movies, Auth, Member Data, A/B Tests, Start-Up, Ratings. Layers: Java API (server code; data gathering), client adapter code written by client teams and dynamically uploaded to the server (data formatting and delivery), and client code (user interface rendering).]

Netflix API : Key Responsibilities

• Maintain a resilient front-door

1000+ Device Types

Dozens of Dependencies

[Diagram: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine]

Dependency Relationships

[Diagram: the API connected to the same dependencies]

2,000,000,000 Incoming Requests Per Day to the Netflix API

30 Distinct Dependent Services for the Netflix API

~500 Dependency jars Slurped into the Netflix API

14,000,000,000 Netflix API Outbound Calls Per Day to those Dependent Services

0 Dependent Services with 100% SLA

(99.99%)^30 ≈ 99.7%

0.3% of 2B = 6M failures per day

2+ Hours of Downtime Per Month

(99.9%)^30 ≈ 97%

3% of 2B = 60M failures per day

20+ Hours of Downtime Per Month
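The arithmetic behind these figures can be sketched as follows. It assumes that a request needs all 30 dependencies to succeed, that the dependencies fail independently, and that a month has 30 days; the function names are illustrative.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month

def compound_availability(per_service: float, services: int) -> float:
    """Availability of a request that needs every dependency to succeed."""
    return per_service ** services

def daily_failures(availability: float, requests_per_day: int) -> float:
    """Expected number of failed requests per day."""
    return (1.0 - availability) * requests_per_day

def monthly_downtime_hours(availability: float) -> float:
    """Equivalent downtime per month for the given availability."""
    return (1.0 - availability) * HOURS_PER_MONTH

for sla in (0.9999, 0.999):
    overall = compound_availability(sla, 30)
    print(
        f"{sla:.2%} per service -> {overall:.2%} overall, "
        f"{daily_failures(overall, 2_000_000_000) / 1e6:.1f}M failures/day, "
        f"{monthly_downtime_hours(overall):.1f} hours/month"
    )
```

Dropping one "nine" per service costs roughly ten times as many failures, because the per-service error is multiplied across all 30 dependencies.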

Circuit Breaker Dashboard

Call Volume and Health / Last 10 Seconds

Call Volume / Last 2 Minutes

Successful Requests

Successful, But Slower Than Expected

Short-Circuited Requests, Delivering Fallbacks

Timeouts, Delivering Fallbacks

Thread Pool & Task Queue Full, Delivering Fallbacks

Exceptions, Delivering Fallbacks

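Error Rate = (short-circuited + timeouts + thread pool & task queue full + exceptions) / (successful + short-circuited + timeouts + thread pool & task queue full + exceptions)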

Status of Fallback Circuit

Requests per Second, Over Last 10 Seconds

SLA Information
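One way to read the dashboard's error rate is the share of requests in the window that ended in a fallback category. The sketch below computes it under that assumption; the `WindowCounts` class, its field names, and the sample numbers are hypothetical and are not the dashboard's own code.

```python
from dataclasses import dataclass

@dataclass
class WindowCounts:
    """Request outcomes counted over a rolling window (10 seconds on the dashboard)."""
    success: int = 0
    short_circuited: int = 0
    timeout: int = 0
    rejected: int = 0    # thread pool and task queue full
    exception: int = 0

    def errors(self) -> int:
        """Requests that ended in any of the four fallback categories."""
        return self.short_circuited + self.timeout + self.rejected + self.exception

    def error_rate(self) -> float:
        """Errors divided by all requests in the window (0.0 for an empty window)."""
        total = self.success + self.errors()
        return self.errors() / total if total else 0.0

counts = WindowCounts(success=950, short_circuited=20, timeout=15, rejected=5, exception=10)
rate = counts.error_rate()
```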

[Diagram: the API and its dependencies (Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine)]

Fallback
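A minimal sketch of the circuit breaker and fallback idea. This is an illustration of the general pattern, not Hystrix's API: the class, its parameters (`max_failures`, `reset_after`), and the sample ratings call are all assumptions.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")

class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; retries after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: allow a trial call; one more failure reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return False
        return True

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()            # short-circuited: skip the dependency
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()            # failure: deliver the fallback instead
        self.failures = 0
        return result

# Usage: a dependency that always times out is only attempted twice.
attempts: List[int] = []

def flaky_ratings() -> str:
    attempts.append(1)
    raise TimeoutError("ratings service timed out")

breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
results = [breaker.call(flaky_ratings, lambda: "default ratings") for _ in range(5)]
```

Once the circuit is open, callers get the fallback immediately, which protects both the caller's threads and the struggling dependency.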

Netflix API : Key Responsibilities

• Scale the system

Netflix API : Requests Per Month

[Chart: requests in billions per month (axis 0 to 35), Jan-10 through Jul-11]

50x growth in 18 months

AWS Cloud

Autoscaling

Scryer : Predictive Auto Scaling

Not yet…

Typical Traffic Patterns Over Five Days

Predicted RPS Compared to Actual RPS

Scaling Plan for Predicted Workload

What is Scryer Doing?

• Evaluates needs based on historical data
  – Week over week, month over month metrics

• Adjusts instance minimums based on algorithms

• Relies on Amazon Auto Scaling for unpredicted events
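A rough sketch of the predictive idea in the bullets above. It is not Scryer's algorithm, which the slides do not describe; the averaging rule, the `headroom` margin, the per-instance capacity, and the sample history are all assumptions made for illustration.

```python
import math
from typing import Sequence

def predict_rps(same_hour_in_past_weeks: Sequence[float]) -> float:
    """Naive week-over-week prediction: average of the same hour in previous weeks."""
    return sum(same_hour_in_past_weeks) / len(same_hour_in_past_weeks)

def instance_minimum(predicted_rps: float, rps_per_instance: float,
                     headroom: float = 0.25) -> int:
    """Instances needed for the predicted load plus a safety margin."""
    return math.ceil(predicted_rps * (1.0 + headroom) / rps_per_instance)

# Hypothetical requests per second for this hour over the last three weeks.
history = [9_000.0, 10_000.0, 11_000.0]
minimum = instance_minimum(predict_rps(history), rps_per_instance=500.0)
```

The computed minimum would be applied ahead of the expected load, while reactive auto scaling stays in place to cover anything the prediction misses.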

Results

Results : Load Average

[Chart panels: Reactive, Predictive]

Results : Response Latencies

[Chart panels: Reactive, Predictive]

Results : Outage Recovery

Results : AWS Costs

Scaling Globally

More than 48 Million Subscribers

More than 40 Countries

Zuul

Gatekeeper for the Netflix Streaming Application

Zuul *

• Multi-Region Resiliency
• Insights
• Stress Testing
• Canary Testing
• Dynamic Routing
• Load Shedding
• Security
• Static Response Handling
• Authentication

* Most closely resembles an API proxy
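To make two of the capabilities listed above concrete (load shedding and static response handling), here is a small sketch of a gateway that runs a chain of filters before routing to the origin. It does not use Zuul's actual filter API; the types, function names, thresholds, and paths are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Request = Dict[str, str]
Response = Dict[str, object]
# A filter returns a Response to stop the chain, or None to let the request continue.
Filter = Callable[[Request], Optional[Response]]

def load_shedding(max_in_flight: int, in_flight: Callable[[], int]) -> Filter:
    """Reject requests while too many are already being processed."""
    def apply(request: Request) -> Optional[Response]:
        if in_flight() >= max_in_flight:
            return {"status": 503, "body": "shedding load"}
        return None
    return apply

def static_response(paths: Dict[str, str]) -> Filter:
    """Answer selected paths directly at the gateway without calling the origin."""
    def apply(request: Request) -> Optional[Response]:
        if request["path"] in paths:
            return {"status": 200, "body": paths[request["path"]]}
        return None
    return apply

def handle(request: Request, filters: List[Filter],
           route: Callable[[Request], Response]) -> Response:
    """Run filters in order; route to the origin only if none responds early."""
    for check in filters:
        early = check(request)
        if early is not None:
            return early
    return route(request)

def origin(request: Request) -> Response:
    return {"status": 200, "body": "from origin " + request["path"]}

filters = [load_shedding(100, in_flight=lambda: 7), static_response({"/healthcheck": "ok"})]
health = handle({"path": "/healthcheck"}, filters, origin)
home = handle({"path": "/ps3/homescreen"}, filters, origin)
```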

All of these approaches are designed to prevent failures…

But sometimes the best way to prevent failures is to force them!

Chaos Monkey: I randomly terminate instances in production to identify dormant failures.

Chaos Gorilla: I simulate an outage of an entire Amazon availability zone.

Chaos Kong: I simulate an outage in an AWS region.

Conformity Monkey: I find instances that don’t adhere to best practices.

Security Monkey: I extend Conformity Monkey to find security violations.

Doctor Monkey: I detect unhealthy instances and remove them from service.

Janitor Monkey: I clean up the clutter and waste that runs in the cloud.

Latency Monkey: I induce artificial delays and errors into services to determine how upstream services will respond.
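The fault-injection idea behind Latency Monkey can be sketched as a wrapper that delays a call and sometimes makes it fail. This is an illustration only and not the Simian Army's implementation; the wrapper name, its parameters, and the injectable `sleep` function are assumptions.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")

def with_injected_faults(call: Callable[[], T], delay_s: float, error_rate: float,
                         rng: random.Random,
                         sleep: Callable[[float], None] = time.sleep) -> Callable[[], T]:
    """Wrap `call` so each invocation is delayed and fails with probability `error_rate`."""
    def wrapped() -> T:
        sleep(delay_s)                       # artificial delay
        if rng.random() < error_rate:        # artificial error
            raise RuntimeError("injected failure")
        return call()
    return wrapped

# Usage: record the delays instead of really sleeping, so the example runs instantly.
delays: List[float] = []
always_fail = with_injected_faults(lambda: "ok", 0.5, 1.0, random.Random(0), delays.append)
never_fail = with_injected_faults(lambda: "ok", 0.5, 0.0, random.Random(0), delays.append)
```

Wrapping a dependency call this way in a test environment shows whether callers time out, fall back, or fail outright when that dependency degrades.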

Netflix API : Key Responsibilities

• Maintain high velocity

Testing Philosophy:

Act Fast, React Fast

That Doesn’t Mean We Don’t Test

Automated Delivery Pipeline

Cloud-Based Deployment Techniques

[Diagram sequence: Current Code In Production receives API Requests from the Internet. A Single Canary Instance tests new code with production traffic (around 1% or less of traffic), checked by Canary Analysis Automation. Outcomes shown: "Error!" and "Perfect!"]
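A sketch of the canary idea: send a small share of production traffic to one instance running the new code, then compare its health with the baseline. The routing rule, the `tolerance` threshold, and the single error-rate metric are assumptions; the slides do not say which metrics the canary analysis automation actually compares.

```python
import random

def choose_target(rng: random.Random, canary_fraction: float = 0.01) -> str:
    """Send roughly `canary_fraction` of requests to the canary instance."""
    return "canary" if rng.random() < canary_fraction else "baseline"

def canary_passes(baseline_error_rate: float, canary_error_rate: float,
                  tolerance: float = 0.005) -> bool:
    """Hypothetical check: the canary may exceed the baseline error rate by at most `tolerance`."""
    return canary_error_rate <= baseline_error_rate + tolerance

# Usage: route simulated requests and measure the share that reached the canary.
rng = random.Random(2014)
targets = [choose_target(rng) for _ in range(100_000)]
canary_share = targets.count("canary") / len(targets)
```

Keeping the canary's share of traffic small limits how many requests are affected if the new code turns out to be faulty.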

Stress Test with Zuul

[Diagram sequence: alongside Current Code In Production, which receives API Requests from the Internet, New Code is Getting Prepared for Production. Outcomes shown: "Error!" and "Perfect!"]

Stress Test with Zuul

[Diagram sequence: Current Code In Production and New Code Getting Prepared for Production are shown together with API Requests from the Internet; the final frame shows only API Requests from the Internet and the New Code.]

https://www.github.com/Netflix
