The Netflix API Platform for Server-Side Scripting

The risks of modifying running production servers

Problem identified: new servers aren’t coming up healthy!

Ugh! There’s a problem. Errors from API are up.

Escalating customer impact

Stream starts per second drift further and further from the expected value.

[Chart legend: expected value; actual value]

Resolving the issue

Finally root-caused! Now restarting all unhealthy servers.

Back to normal!

Stream starts per second are also back to normal.

[Chart legend: expected value; actual value]

The Netflix API

Access to Netflix mid-tier services

Today’s system

[Diagram labels: JS (mostly); Java; Client A, Client B, Client C, ..., Client Y, Client Z; Netflix microservices; network boundary; API Server JVM]

Today’s system (simplified)

What we need

[Diagram: Today's architecture. Labels: JS (mostly); Java; Client A, Client B, Client C, ..., Client Y, Client Z; Netflix microservices; scripts (Groovy, ~700 active); network boundary; API Server JVM]

Flexibility for devices

[...]

Device1VideoCommon.formatKidsSeason(apiRequest, [...], imageUrl)

[...]

[...]

Device2Common.formatAllSeasons([...])

[...]

[...]

dataPublishingService.getShowFeedbackBuilder(user, video)

[...]

What we need

Developer Velocity: Decoupled deployments of versions

[Diagram: independently advancing version sequences: n, n+1, n+2, n+3; i, i+1, i+2, i+3, i+4; j, j+1; k, k+1; l]

What we need

Challenge #1: Resiliency

[Diagram labels: JS (mostly); Java; Client A, Client B, Client C, ..., Client Y, Client Z; Netflix microservices; scripts; network boundary; API Server JVM]

Resiliency in today’s system

Strong resiliency with Hystrix
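
Hystrix is a Java library, and the slide does not show how it is configured here. As a rough illustration of the circuit-breaker idea it is known for, the following is a minimal sketch in JavaScript. It is not the Hystrix API: the class name, the option names (`failureThreshold`, `resetTimeoutMs`, `now`), and the simulated failing call are all hypothetical.

```javascript
// Minimal circuit-breaker sketch (illustrative only; NOT the Hystrix API).
// All names (CircuitBreaker, failureThreshold, resetTimeoutMs, now) are hypothetical.
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 5000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;       // injectable clock, so the behavior can be tested
    this.failures = 0;    // consecutive failures seen so far
    this.openedAt = null; // null means the circuit is closed
  }

  // Runs `command`. If it throws, or if the circuit is open, returns `fallback()`.
  execute(command, fallback) {
    if (this.openedAt !== null) {
      const elapsed = this.now() - this.openedAt;
      if (elapsed < this.resetTimeoutMs) {
        return fallback(); // open: skip the dependency entirely
      }
      // Reset timeout elapsed: fall through and allow one trial call.
    }
    try {
      const result = command();
      this.failures = 0;
      this.openedAt = null; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // open (or re-open) the circuit
      }
      return fallback();
    }
  }
}

// Usage with a fake clock and a dependency that always fails.
let clock = 0;
let calls = 0;
const breaker = new CircuitBreaker({ failureThreshold: 2, resetTimeoutMs: 1000, now: () => clock });
const failing = () => { calls += 1; throw new Error('mid-tier service down'); };
const fallback = () => 'fallback';

breaker.execute(failing, fallback); // failure 1
breaker.execute(failing, fallback); // failure 2: the circuit opens
breaker.execute(failing, fallback); // short-circuited: `failing` is not invoked
```

The point of the pattern is that once a dependency is failing, callers stop waiting on it and get a fallback immediately, which keeps one unhealthy mid-tier service from tying up the caller.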

What about resiliency on this side (the Groovy scripts)?

Example: memory usage

[Chart annotations: periodic cleanup; each new upload increases memory usage]

[Diagram: 1-2 years ago. Labels: JS (mostly); Java; Client A, Client B, Client C, ..., Client Y, Client Z; Netflix microservices; scripts (Groovy); network boundary; API Server JVM. Few, small scripts; fewer uploads.]

[Diagram: Today. Labels: JS (mostly); Java; Client A, Client B, Client C, ..., Client Y, Client Z; Netflix microservices; scripts/apps (Groovy); network boundary; API Server JVM. ~700 more complex scripts/apps; 10-50 uploads per day.]

Streaming Hours Per Year in Billions

Changing risk profile

Lack of process isolation is a growing risk.

Some possible mitigations

Velocity vs. Resiliency

Moving toward our ideal API: What will change

[Diagram: The (near) future. Labels: JS (mostly); Java; Client A, Client B, Client C, ..., Client Y, Client Z; Netflix microservices; node scripts (Node.js, process isolation); network boundary; API Server JVM]

Why containers?

Isolated failures: scripts don’t affect each other (usually)

[Diagram labels: API; "Temporarily unavailable!"]

Independent autoscaling

[Diagram label: API]

Fast startup
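
The isolated-failures point can be illustrated without containers at all. The sketch below uses plain operating-system processes as a stand-in for containers, which is an assumption made for brevity and not the setup described in the talk. The helper name `runIsolated` and the inline script sources are hypothetical.

```javascript
// Sketch: each "script" runs in its own Node.js process, so a crash in one
// does not take down the others or the parent. OS processes stand in for
// containers here; `runIsolated` is a hypothetical helper name.
const { spawnSync } = require('child_process');

function runIsolated(source) {
  // `node -e <source>` evaluates the source in a fresh process.
  const result = spawnSync(process.execPath, ['-e', source], {
    encoding: 'utf8',
    timeout: 10000, // guard against a hung script
  });
  return {
    ok: result.status === 0, // a non-zero exit code means the script failed
    stdout: (result.stdout || '').trim(),
  };
}

const healthy = runIsolated('console.log("device A response")');
const crashing = runIsolated('throw new Error("bug in device B script")');
const healthyAgain = runIsolated('console.log("device A response")');
```

Here `crashing.ok` is false while both healthy runs still succeed. Inside a single shared runtime, by contrast, a script that exhausts memory affects everything else in that runtime.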

Challenge #2: Great developer experience

Step-through-debugging (today)

[Diagram: Local development (future). Labels: local project; live reload file watcher; docker build / run; Docker Machine; Local Container; File watcher agent; Proxy; NetworkAgent; node-inspector; debugger]

Run-time debugging/optimization (today)

[Diagram labels: JS (mostly); Java; Client A, Client B, Client C, ..., Client Y, Client Z; Netflix microservices; scripts (Groovy); network boundary; API Server JVM]

Problems are hard to root-cause, and performance is hard to measure and optimize.

Script → API interaction (today)

[Diagram labels: API; device server-side script; device client]

Script → API interaction (future)

[Diagram labels: API; device server-side script]

Platform for device teams

Default configuration

Titus

ATLAS

NeWT: Netflix Workflow Toolkit (CLI)

Corresponding UI

Versioning

Easy access to instances

Rollback

Initial impressions

[Diagram labels: Client A, Client B, Client C, Client E; Netflix microservices; node script; network boundary; API Server JVM]

First end-to-end implementation and shadow traffic
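
The slide does not say how shadow traffic was wired up. As a generic sketch of the technique: each request is served by the existing path, a copy is also sent to the new path, and the new path's result is only compared and recorded, never returned to the client. All names below (`withShadow`, `onCompare`, the handlers) are hypothetical, and a production setup would normally run the shadow call asynchronously, off the request's critical path.

```javascript
// Sketch of shadow traffic (hypothetical names; not Netflix's implementation).
// The client only ever sees the primary response. Shadow failures are recorded
// and never propagate to the caller.
function withShadow(primary, shadow, onCompare) {
  return function handle(request) {
    const primaryResponse = primary(request);
    try {
      const shadowResponse = shadow(request);
      // Naive comparison: JSON.stringify is sensitive to key order.
      const match = JSON.stringify(primaryResponse) === JSON.stringify(shadowResponse);
      onCompare({ request, match });
    } catch (err) {
      onCompare({ request, match: false, error: err.message });
    }
    return primaryResponse;
  };
}

// Usage: the shadow path fails for one request, and the client never notices.
const mismatches = [];
const handle = withShadow(
  (req) => ({ title: req.id.toUpperCase() }), // stands in for the existing path
  (req) => {                                   // stands in for the new node script
    if (req.id === 'bad') throw new Error('shadow failed');
    return { title: req.id.toUpperCase() };
  },
  (report) => { if (!report.match) mismatches.push(report); }
);

const first = handle({ id: 'show1' });
const second = handle({ id: 'bad' });
```

After these two calls, `mismatches` holds a single report for the `bad` request, and both clients received the primary response.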

Problem isolation (ex: memory leak)

Memory leak makes RSL blow up. Clearer idea of where the problem is.

node.js

Same with node script.

Request tracing: clearer picture of fan-out

[Diagram labels: JS (mostly); Client A, Client B, Client C, ..., Client Y, Client Z; Netflix microservices; node scripts (Node.js); network boundary; API Server JVM]

How not to compromise what we’re good at

What we need

Other Netflix talks at QCon New York

Thank you!

Script management

Outtakes

Limited device-server chattiness

Operational insights

[Diagram labels: JS (mostly); Java; Client A, Client B, Client C, ..., Client Y, Client Z; Netflix microservices; node scripts (Node.js, process isolation); network boundary; API Server JVM]
