Netflix Edge Engineering Open House Presentations - June 9, 2016

Daniel Jacobson (@daniel_jacobson)

Satish Gudiboina (@sgudiboina)

Suudhan Rangarajan (@suudhan)

Vasanth Asokan (@vasanthasokan)

Edge Engineering Open House - June 9, 2016

190 Countries (not China and a few others)

81+ Million Subscribers

1000+ Different Device Types

Over 42 Billion Hours Streamed in 2015

Streaming Hours Per Year in Billions

Over 42 Billion Successes!

Of Course, There Are Failures Too…

Two Primary Drivers Behind Our Successes

People Desire to Watch Netflix

Systems Scale to Meet Desires

Sign-Up

Discovery / Browse

Playback

Edge Engineering provides data and functionality to support these three experiences

Designing APIs

Enabling Playback

Scaling

Routing

Insights

DX

Resiliency

Tools

[Architecture diagram, built up over several slides: DEVICES → ROUTING → API (multiple API instances) → SERVICES]

SERVICES: Recs, Member, Ratings, Playback Lifecycle, Authn/z, A/B, Search, Identity, Metadata, Playback Data, DRM

Owned by Edge Engineering: API, Playback Lifecycle, Authn/z, Playback Data, DRM, INSIGHTS, TOOLS, DX

2015: 42 Billion Hours

Future: 200 Billion Hours

[Chart: Netflix’s AWS Cloud Footprint by %; one segment label begins "The rest of"]

Talking About the Future of Edge Engineering

Satish Gudiboina: API and Upcoming Re-Architecture

Suudhan Rangarajan: Playback Experience

Vasanth Asokan: Developer Tools, Velocity and Experience

The Netflix API Platform for Server-Side Scripting

Current and the Future

Satish Gudiboina

The Netflix API

Streaming Hours Per Year in Billions

Scale is multi-faceted

Growing number of users ( → RPS)

Growing number of device types

Growing number of A/B tests

Growing number of languages

Growing number of countries

What we need to build for

Velocity

Resiliency

Other requirements: Performance, Great developer experience, Operational insights, Tooling

Today’s architecture

[Diagram: Client A, Client B, Client C ... Client Y, Client Z reach the API Server JVM across the network boundary. Inside the JVM, scripts sit on top of the SERVICE LAYER, which calls Netflix Microservices. Language labels: JS (mostly), Java.]

Resiliency with Hystrix
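
The slide names Hystrix but does not show how it is used. As a rough illustration of the circuit-breaker-with-fallback pattern that Hystrix is known for, here is a minimal Python sketch. It is not the Hystrix API (Hystrix is a Java library); the class name, parameters, and thresholds are all hypothetical.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: opens after `max_failures` consecutive errors."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so the sketch is easy to test
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, command, fallback):
        # While open, skip the dependency until the reset window has passed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()
            self.opened_at = None   # half-open: let one trial call through
        try:
            result = command()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0           # a success closes the circuit again
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_recs, lambda: [])`, where `fetch_recs` is a placeholder for any remote call; while the circuit is open the fallback is served without touching the failing dependency.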

Developer Velocity: Decoupled deployments of versions

[Diagram: across Day 1 to Day 5, the API and the scripts for device 1, device 2, device 3, and device 4 each move through their own version sequence (labels n to n+3, i to i+4, j to j+1, k to k+1, l).]

Changing risk profile

Growing number of users ( → RPS)

Growing number of devices

Growing number of A/B tests

Growing number of languages

Growing number of countries

Growing number and complexity of scripts (scripts → apps)

Today’s system (T-3yrs)

[Same diagram as "Today’s architecture", with fewer script boxes inside the API Server JVM.]

few, small scripts; fewer uploads

Today’s system (T)

[Same diagram as "Today’s architecture", with many more script boxes inside the API Server JVM.]

hundreds of more complex scripts, 10-50 uploads per day

What we need

Velocity

Resiliency?

Lack of process isolation is a growing risk.

Moving toward our ideal API: What will change

Scripts will run in containers

Scripts will call API remotely

The (near) future

[Diagram: Client A, Client B, Client C ... Client Y, Client Z call node scripts, each running in node.js with process isolation, outside the API Server JVM. The scripts cross a network boundary to reach the API Server JVM, whose SERVICE LAYER calls Netflix Microservices. Language labels: JS (mostly), Java.]

node for device teams

Why containers?

Process isolation

Fast startup

Consistent developer experience across environments

Isolated failures: scripts don’t affect each other

[Diagram: API with separate scripts for device 1, device 2, device 3, and device 4; one script is marked "Temporarily unavailable!"]

Independent autoscaling

[Diagram: API with separate scripts for device 1, device 2, device 3, and device 4]

Fast startup

New API server: minutes

New container: seconds

Fast rollout, fast rollback, fast MTTR

The Netflix API

Edge Developer Experience

Translating developer productivity to Netflix customer delight

Developer Experience?

DEVELOP (rapidly)

DEPLOY (reliably)

OPERATE (effectively)

Experimentation driven innovation

~700 apps, dozens of pushes a day

15+ client teams, ~200 developers

~50 direct services, 100s of AB tests, dozens of new features

The Innovation Funnel

API

Devices

Netflix Services

Client Adaptor Applications

Why care about DevEx?

Developer Productivity

Product Innovation

Tools

Automation

Insights

Customer Satisfaction

App Development and Management

DEVELOP (rapidly)

DEPLOY (reliably)

OPERATE (effectively)

Developer Ergonomics

[Diagram: apps run inside the API SERVER JVM with CLIENT LIBRARIES and the SERVICE LAYER; Netflix Microservices sit across the WAN Boundary. Labels: js, java. Annotation: Large / Complex.]

Developer Ergonomics ...

[Diagram: apps run in DOCKER CONTAINERS with CLIENT LIBRARIES and a REMOTE SERVICE LAYER, separate from the SERVICE LAYER in the API SERVER JVM; Netflix Microservices sit across the WAN Boundary. Labels: js, java.]

Setup Canary

Support

Prod Push

Pre-Prod

Metrics

Tracing

Lifecycle

Alerts

Build

Bootstrap

API Discovery

REPL

Unit Test

SDK Debug Logging

Profiling

Audits

Security

Custom Routing

Dependency Management

Client Application Development Critical Component!

Dx Developer Experience

$ newt init

Just bring your JavaScript business logic

NeWT: Netflix Workflow Toolkit

Continuous Integration

Deployment Pipelines

Autoscaling

Dashboards

Alerting

Logging

Lifecycle Management

Audits and Analytics

Container tooling

Canaries

Dependency Management

Titus

ATLAS

NeWT: Netflix Workflow Toolkit

Edge PaaS UI

$ newt auto-deploy -d

Node.js project

Docker Machine

node-inspector

Debugger

File watcher / live reload trigger

File watcher agent

NeWT: Local Container Development

Local Container

docker build / run

NeWT: Local Container Development

$ newt auto-deploy -d

[Diagram labels: Local Container, Docker Machine, Discovery Agent, Cloud Proxy (terminate security), Service Discovery, Cloud Microservices. Regions: Local System, Cloud.]

App Operations and Insights

DEVELOP (rapidly)

DEPLOY (reliably)

OPERATE (effectively)

Mantis - Stream Processing Platform

• Low Latency, High throughput, Highly Efficient
• Handle bursty or large scale loads
• Extensible programming model

600 jobs in production, 8M messages/sec at peak, 100Gbps network throughput

Monitoring facets of aggregate application health, globally

Aggregate Insights

Analyze in real-time, requests matching a precise set of conditions

Surgical Insights

Surgical Insights - Real-time Stream Queries
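
The slides do not show the query language itself. To illustrate the idea of selecting only those requests that match a precise set of conditions, here is a minimal Python sketch of a lazy filter over an event stream. It is not the Mantis API; the `stream_query` function and the event fields (`device`, `country`, `status`) are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator

Event = Dict[str, object]


def stream_query(events: Iterable[Event],
                 *conditions: Callable[[Event], bool]) -> Iterator[Event]:
    """Lazily yield only the events that satisfy every condition."""
    for event in events:
        if all(cond(event) for cond in conditions):
            yield event


# Hypothetical request events; the field names are illustrative only.
events = [
    {"device": "tv", "country": "BR", "status": 500},
    {"device": "phone", "country": "BR", "status": 200},
    {"device": "tv", "country": "US", "status": 500},
]

# "Server errors from TVs in Brazil" expressed as three conditions.
matches = list(stream_query(
    events,
    lambda e: e["device"] == "tv",
    lambda e: e["country"] == "BR",
    lambda e: e["status"] >= 500,
))
```

Because `stream_query` is a generator, it can sit on an unbounded stream and emit matches as they arrive rather than waiting for a batch to complete.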

Monitoring server side calling pattern and internal application profile

Session Tracing

Session Tracing - Request Profile

Session Tracing - Per Node Profile

Automatic monitoring of high cardinality data across multiple dimensions

Real-time Anomaly Detection
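
The talk does not say which detection algorithm is used. As one simple stand-in, the sketch below keeps a rolling window per dimension combination and flags values that sit far from the recent mean (a rolling z-score). The class name, window size, and threshold are assumptions chosen for illustration.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Per-key rolling z-score check; a stand-in, not the production method."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.threshold = threshold
        self.min_samples = min_samples
        # One bounded history per key, e.g. per (device, country) pair.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, key, value):
        """Return True if `value` is anomalous for `key`, then record it."""
        past = self.history[key]
        anomalous = False
        if len(past) >= self.min_samples:
            mu, sigma = mean(past), pstdev(past)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
            else:
                anomalous = value != mu   # flat history: any change stands out
        past.append(value)
        return anomalous
```

Keying the history by a tuple such as `("tv", "BR")` is how the sketch handles multiple dimensions; with high-cardinality data the number of keys grows quickly, which is presumably why the slide stresses automatic monitoring.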

In conclusion...

• Scaling developer productivity with business growth
• Provide fully managed PaaS experience to client developers
• Shift Left Insights to power smart development
• Curated, blended visualizations that simplify devops

Tech Soup

Scaling Playback Services

Suudhan Rangarajan, Senior Software Engineer, Playback Features

@suudhan

Playback Lifecycle

DECIDE

COLLECT & LEARN

AUTHORIZE

Decide

MANIFEST (Tracks and URLs)

Authorize

LICENSE

❏ Content usage / resolution policies

❏ Plan / device limits enforcement

❏ DRM / License generation

Collect & Learn

Bookmarks & Hours Watched

Streaming Errors and Metrics

Quality Of Experience metrics

Let's look at Play Decisions

DECIDE

MANIFEST

AUTHORIZE

COLLECT & LEARN

LICENSE

SESSION

Huge number of Streams

Resolutions - 720p, 1080p, 4K etc.
Codecs - H.264, HEVC etc.
Bitrates - 230, 780, 3000 etc.

Channels - Stereo, Surround Sound
Languages - English, French etc.

Types - Subtitles, Closed Captions, Forced Narratives
Languages - English, French etc.

Streams to Tracks

- H.264 Main Profile
- English 5.1 Audio
- No Subtitle

- HEVC Dash Profile
- French 2.0 Audio
- English CC

- HDR Dash Profile
- Spanish AAC Audio
- English Forced Narrative

Decide & Filter

MANIFEST SERVICE

Many Many Dimensions

PLAYBACK MANIFEST

USER PREFERENCES

TITLE METADATA

COUNTRY

DEVICE

NETWORK
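
To make the "decide and filter" step concrete, the sketch below narrows a title's streams down to those a given device can play. It covers only the device dimension; the real manifest service weighs many more inputs. The `Stream` and `Device` types, their field names, and the kbps unit for bitrates are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Stream:
    codec: str
    height: int          # vertical resolution, e.g. 720, 1080, 2160
    bitrate_kbps: int    # unit is an assumption; the slide gives bare numbers


@dataclass
class Device:
    codecs: Set[str]     # codecs the device can decode
    max_height: int      # highest resolution the device can display


def filter_streams(streams: List[Stream], device: Device) -> List[Stream]:
    """Keep streams the device can decode and display, lowest bitrate first."""
    eligible = [
        s for s in streams
        if s.codec in device.codecs and s.height <= device.max_height
    ]
    return sorted(eligible, key=lambda s: s.bitrate_kbps)
```

Further dimensions such as country, network, or user preferences would each add another predicate to the same filter, which is one reason the combined space of possible manifests grows so quickly.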

Big Opportunity

Rich playback experiences

Tremendous increase in scale

Customer growth

Challenge: Efficient Scaling

Targeting sub-linear growth

# of Requests

Cloud Costs

Predictable Viewing Patterns

Key Insight

[Chart: PLAY REQUESTS (y-axis) against CONTENT RANK (x-axis)]

Also.. Manifest Request for one title

[Chart: PLAY REQUESTS (y-axis) against TIME (x-axis)]

Current: Completely Real-time

Real-time manifest generation

With Caching

Real-time manifest generation

80% Cached, 20% Real-time

Challenges

How do we determine the optimal combination of attributes to cache on?

Challenges

Cache Considerations:
● When to populate?
● When to bust?
● How to scale for cache-miss or failures?
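
The caching questions above (which attributes to key on, when to populate, when to bust, and what happens on a miss) can be made concrete with a small sketch. It populates on a miss, expires entries after a TTL, and falls back to real-time generation. The `ManifestCache` class, the attribute names, and the TTL policy are hypothetical and not the design described in the talk.

```python
import time


class ManifestCache:
    """Toy manifest cache keyed on a chosen subset of request attributes."""

    def __init__(self, generate, key_attrs, ttl_s=300.0, clock=time.monotonic):
        self.generate = generate            # real-time generator, used on a miss
        self.key_attrs = tuple(key_attrs)   # attributes the cache is keyed on
        self.ttl_s = ttl_s
        self.clock = clock
        self.entries = {}                   # key -> (expires_at, manifest)
        self.hits = 0
        self.misses = 0

    def get(self, request):
        key = tuple(request[a] for a in self.key_attrs)
        entry = self.entries.get(key)
        if entry is not None and entry[0] > self.clock():
            self.hits += 1
            return entry[1]
        self.misses += 1
        manifest = self.generate(request)   # populate on miss
        self.entries[key] = (self.clock() + self.ttl_s, manifest)
        return manifest

    def bust(self, **attrs):
        """Drop entries whose key matches all given attribute values."""
        idx = {a: i for i, a in enumerate(self.key_attrs)}
        stale = [k for k in self.entries
                 if all(k[idx[a]] == v for a, v in attrs.items())]
        for k in stale:
            del self.entries[k]
        return len(stale)
```

The choice of `key_attrs` is the trade-off the first challenge points at: fewer attributes raise the hit rate but are only safe if the omitted attributes never change the manifest, while more attributes are safer but fragment the cache.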

Potential Win

10x increase in requests with only 4x increase in costs

Optimize computation

Can we re-imagine our service processing to dramatically increase throughput?

Anatomy of a Playback Manifest Request

Metadata Access

27%

36%

Tracks Generation

16%

Streams Filtering

21%

Serialization

Potential Win

10x increase in requests with just 2x increase in service costs
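
Both "Potential Win" slides describe sub-linear growth, meaning cost grows more slowly than request volume. Dividing the cost multiplier by the request multiplier gives the implied cost per request relative to today. The function name below is made up; the multipliers are the ones stated on the slides.

```python
def relative_cost_per_request(request_growth: float, cost_growth: float) -> float:
    """Cost per request after growth, as a fraction of today's cost per request."""
    return cost_growth / request_growth


# Caching slide: 10x requests for 4x cost.
with_caching = relative_cost_per_request(10, 4)

# Optimized computation slide: 10x requests for 2x service cost.
with_optimized_compute = relative_cost_per_request(10, 2)

# Growth is sub-linear whenever this ratio is below 1.
is_sublinear = with_caching < 1 and with_optimized_compute < 1
```

Under these figures, each request would cost 40% of today's per-request cost with caching, and 20% with the optimized computation.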

Two-pronged Strategy to Scaling

Cache Manifests

Re-architect code to reduce processing time

Scaling Problems Across Services

Decide Authorize Collect & Learn

Playback Features

Playback Access

Playback Data Systems

Thanks!

@suudhan

Come Talk to Us!

Image Attribution

All images used are under Creative Commons or public domain license:

● Video icon - http://simpleicon.com/wp-content/uploads/video-camera-1.png
● Speaker icon - https://upload.wikimedia.org/wikipedia/commons/thumb/2/21/Speaker_Icon.svg/1024px-Speaker_Icon.svg.png
● Subtitle icon - https://thenounproject.com/term/subtitles/78795/
● Uptrend image - https://pixabay.com/en/chart-line-line-chart-diagram-trend-148256/
● Funnel image - https://commons.wikimedia.org/wiki/File:Funnel_Mech.svg
● Business Intelligence image - https://pixabay.com/static/uploads/photo/2015/04/14/23/17/it-business-722950_960_720.png
● Key icon - https://pixabay.com/static/uploads/photo/2014/04/03/10/55/key-311738_960_720.png
● Person icon - https://pixabay.com/static/uploads/photo/2015/12/22/04/00/photo-1103596_960_720.png
● Mobile icon - https://upload.wikimedia.org/wikipedia/commons/thumb/1/14/Mobile_phone_font_awesome.svg/1024px-Mobile_phone_font_awesome.svg.png
● Globe image - https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Simple_Globe.svg/1024px-Simple_Globe.svg.png
● Devices icon - https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Simple_Globe.svg/1024px-Simple_Globe.svg.png
● wifi icon - https://pixabay.com/static/uploads/photo/2016/01/03/11/32/wireless-signal-1119306_960_720.png
● cell tower - https://pixabay.com/static/uploads/photo/2012/04/13/00/23/tower-31235_960_720.png