20140708 - Jeremy Edberg: How Netflix Delivers Software


How Netflix Delivers Software

July 8th, 2014

Email: jedberg@{gmail,netflix}.com

Twitter: @jedberg

Web: www.jedberg.net

Facebook: facebook.com/jedberg

LinkedIn: www.linkedin.com/in/jedberg

When your software fails...

will your system survive?

The Netflix way

• Fully automated build tools to test and make packages

• Fully automated machine image bakery

• Fully automated image deployment

• Everything is “built for three”

• Independent teams responsible for both Dev and Ops

• Redundancy through multi-region deployment

The Netflix way

Philosophy

• We hire responsible adults and keep rules and policies to a minimum

• Developers can change any code in production at any time

• And things don’t break (usually)

Freedom and Responsibility

Automate all the things!

http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html

• Application startup

• Configuration

• Code deployment

• System deployment

Automate all the things!

• Standard base image

• Tools to manage all the systems

• Reduce errors through reproducibility

Automation

Shared state should be stored in a shared service


Data on an instance should be replicated to other instances

“Build for three”

We hold a boot camp for new engineers to teach them how to build for a highly distributed environment.

2B requests per day into the Netflix API

12B outbound requests per day to API dependencies

[Diagram: Discovery API and Streaming API. Services shown: Movie Ratings, Personalization Engine, User Info, Movie Metadata, Similar Movies, Reviews, A/B Test Engine, Content Encoding, CDN Management, QOS Logging, DRM, OpenConnect Edge Locations. Customer flow: Browse, Play, Watch.]

• Services are built by different teams who work together to figure out what each service will provide.

• The service owner publishes an API that anyone can use.

Highly aligned, loosely coupled

• Easier auto-scaling

• Easier capacity planning

• Identify problematic code-paths more easily

• Narrow down the effects of a change

• More efficient local caching

Advantages to a Service Oriented Architecture

• Developers deploy when they want

• They also manage their own capacity and autoscaling

• And fix anything that breaks at 4am!

Freedom and Responsibility

All system choices assume some part will fail at some point.

• Simulate things that go wrong

• Find things that are different

The Monkey Theory

Execution

[Stack diagram, bottom to top: AWS, Netflix OSS, Netflix Application Code]

[Stack diagram, bottom to top: AWS, Netflix OSS, YOUR Application Code]

• Instances

• Machine Images

• Elastic IPs

• Load Balancers

• Security groups / Autoscaling

What AWS Provides


• Service Oriented Architecture

• HTTP/REST interfaces between services

Netflix built a global PaaS

Netflix OSS

• Supports all regions and zones

• Multiple accounts

• Cross region/account replication

• Internationalized, localized and GeoIP routed

• Advanced key management

• Autoscaling with 1000s of instances

• Monitoring and alerting on millions of metrics

Netflix PaaS features

Open Source at Netflix

Netflix OSS

Be liberal in what you accept, strict in what you send

Circuit Breakers (Hystrix)
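The circuit-breaker idea can be sketched in a few lines. This is a minimal illustration of the general pattern, not Hystrix's API (Hystrix is a Java library); the class name, the thresholds, and the injectable clock are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: opens after repeated failures, retries after a cooldown."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures   # hypothetical threshold
        self.reset_after = reset_after     # hypothetical cooldown in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None              # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()          # open: fail fast, skip the dependency
            # Cooldown elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately instead of waiting on a dependency that is already known to be failing.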

• Simulate things that go wrong

• Find things that are different

The Monkey Theory

• Chaos -- Kills random instances

• Chaos Gorilla -- Kills zones

• Chaos Kong -- Kills regions

• Latency -- Degrades network and injects faults

• Conformity -- Looks for outliers

The simian army

• Circus -- Kills and launches instances to maintain zone balance

• Doctor -- Fixes unhealthy resources

• Janitor -- Cleans up unused resources

• Howler -- Yells about bad things like Amazon limit violations

• Security -- Finds security issues and expiring certificates
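The "kills random instances" idea can be sketched as a selection step. This is a minimal sketch, not the Simian Army's code: the function name, the group and instance names, and the rule of sparing single-instance groups are assumptions for illustration, and actually terminating the chosen instance is left to the caller.

```python
import random


def pick_victims(groups, rng=random):
    """Pick one random instance from each group that can afford to lose one."""
    victims = {}
    for group, instances in groups.items():
        if len(instances) > 1:   # assumption: never kill a group's last instance
            victims[group] = rng.choice(sorted(instances))
    return victims


# Hypothetical groups and instance IDs.
groups = {"api": ["i-1", "i-2", "i-3"], "web": ["i-9"]}
victims = pick_victims(groups, random.Random(0))
```

Passing a seeded `random.Random` makes a run reproducible, which helps when the selection logic itself is under test.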

Netflix OSS

• Blueprint for the rest of the platform libraries

• Pluggable architecture

• On instance software load balancer

• Zone aware / Zone affinity

• Handles retry logic

• Global variables

• Support for staged rollout

• Feature flags
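The load-balancer bullets above (on-instance, zone affinity, retry logic) might look roughly like the following. It is a sketch under stated assumptions, not the Netflix library's API: the function name, the server dictionaries, and the attempt limit are all made up for the example.

```python
import random


def call_with_zone_affinity(servers, my_zone, request, max_attempts=3, rng=random):
    """Try servers in the caller's zone first, then other zones, retrying on failure."""
    local = [s for s in servers if s["zone"] == my_zone]
    remote = [s for s in servers if s["zone"] != my_zone]
    rng.shuffle(local)     # spread load across servers within a zone
    rng.shuffle(remote)
    last_error = None
    for server in (local + remote)[:max_attempts]:
        try:
            return request(server)
        except ConnectionError as err:
            last_error = err   # remember the failure and try the next server
    raise RuntimeError("all attempts failed") from last_error
```

Because the selection runs inside the calling process, no separate load-balancer hop sits between two services.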

Netflix OSS

• Application to instance mapping

• Heartbeat to keep track of health
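Application-to-instance mapping with heartbeats can be sketched as a small registry. This is an illustrative toy, not the Netflix service's API; the class name, the 90-second expiry, and the injectable clock are assumptions.

```python
import time


class Registry:
    """Maps application names to instances and drops instances whose heartbeats stop."""

    def __init__(self, ttl=90.0, clock=time.monotonic):
        self.ttl = ttl            # hypothetical expiry in seconds
        self.clock = clock
        self.last_seen = {}       # (app, instance_id) -> time of last heartbeat

    def heartbeat(self, app, instance_id):
        self.last_seen[(app, instance_id)] = self.clock()

    def instances(self, app):
        """Instances of `app` that have sent a heartbeat within the TTL."""
        now = self.clock()
        return sorted(
            iid for (name, iid), seen in self.last_seen.items()
            if name == app and now - seen <= self.ttl
        )
```

An instance that dies simply stops sending heartbeats and ages out of the lookup results, so nothing has to deregister it explicitly.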

[Diagram labels: Suro; DQ, Transport, Routing; Eventbus; Druid; etc.]

Netflix OSS

Why Bake?

Traditional (Generic AMI to Instance):

• launch OS

• install packages

• install app

Netflix (App AMI to Instance):

• launch OS+app

Getting Baked

[Diagram: build pipeline. Source and libraries come from Perforce / Git. Jenkins drives Ant targets with Ivy (Groovy all over). Steps shown: sync, resolve, compile, build, test, report, publish. App bundles and snapshot / release libraries / apps go to Artifactory.]

Base Image Baking

[Diagram: on ec2 slave instances, the Bakery mounts the foundation AMI (Linux: CentOS, Fedora, Ubuntu) from S3 / EBS, installs RPMs (Apache, Java...) via Yum / Apt, and snapshots the result as the base AMI, ready for app bake.]

App Image Baking

[Diagram: on ec2 slave instances, the Bakery mounts the base AMI (Linux, Apache, Java, Tomcat) from S3 / EBS, installs the app bundle from Jenkins / Yum / Artifactory, and snapshots the result as the app AMI, ready to launch.]

app AMI

• Linux Base AMI (CentOS or Ubuntu)

• Java

• Tomcat

• Optional Apache

• Monitoring

• Log Rotation to S3

• GC and thread dump logging

• Application war file, base servlet, platform, interface jars for dependent services

• Healthcheck, status servlets, JMX interface, Servo autoscale


app AMI

• Linux Base AMI (CentOS or Ubuntu)

• Java

• JBoss

• Optional Apache

• Monitoring

• Log Rotation to S3

• GC and thread dump logging

• Application war file, base servlet, platform, interface jars for dependent services

• Healthcheck, status servlets, JMX interface, Servo autoscale

app AMI

• Linux Base AMI (CentOS or Ubuntu)

• Python

• Bottle

• Optional Apache

• Monitoring

• Log Rotation to S3

• Logging

• Application file, base server, platform, interface libs for dependent services

Netflix OSS

Deploying Code; Step 1

Auto Scaling Group

Launch Configuration

Security Group

Amazon Machine Image

Instances

Load Balancer

Netflix has moved the granularity from the instance to the cluster
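One way to picture cluster-level granularity is a deploy that adds a whole new Auto Scaling Group built from a new machine image, instead of changing running instances. The sketch below is only a data model under that assumption: the class names, the group naming scheme, and the AMI IDs are hypothetical, and no AWS API is called.

```python
from dataclasses import dataclass, field


@dataclass
class AutoScalingGroup:
    name: str
    ami: str            # Amazon Machine Image the launch configuration points at
    desired_size: int


@dataclass
class Cluster:
    app: str
    groups: list = field(default_factory=list)

    def deploy(self, ami):
        """Roll out a new AMI as a whole new group instead of patching instances."""
        # Assumption: the new group starts at the size of the current one.
        size = self.groups[-1].desired_size if self.groups else 1
        version = len(self.groups)
        new_group = AutoScalingGroup(f"{self.app}-v{version:03d}", ami, size)
        self.groups.append(new_group)
        return new_group


cluster = Cluster("api")
first = cluster.deploy("ami-old")     # hypothetical image IDs
first.desired_size = 6
second = cluster.deploy("ami-new")
```

Keeping the previous group in the list until the new one is trusted leaves an obvious rollback path, though the slides do not say how rollback is handled.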

Data is the most important asset Netflix has. It’s what differentiates us from our competitors.

Netflix OSS

EVCache

• Wrapper on top of memcached

• Automatically replicates writes to multiple regions

• Pulls cache data intelligently via zone affinity
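The two EVCache behaviours listed above can be illustrated with in-memory dictionaries standing in for memcached. This is a toy sketch, not EVCache's API: the class name is made up, replication here is synchronous for simplicity, and the zone names are only examples.

```python
class ReplicatedCache:
    """Toy cache: writes go to every zone in every region, reads prefer the caller's zone."""

    def __init__(self, zones_by_region):
        # zones_by_region: {"us-east-1": ["us-east-1a", ...], ...}
        self.zone_region = {z: r for r, zs in zones_by_region.items() for z in zs}
        self.stores = {z: {} for z in self.zone_region}   # one dict per zone

    def set(self, key, value):
        for store in self.stores.values():    # replicate the write everywhere
            store[key] = value

    def get(self, key, my_zone):
        region = self.zone_region[my_zone]
        # Zone affinity: local zone first, then the other zones of the same region.
        order = [my_zone] + [z for z, r in self.zone_region.items()
                             if r == region and z != my_zone]
        for zone in order:
            if key in self.stores[zone]:
                return self.stores[zone][key]
        return None
```

Reads never cross regions in this sketch, which is the point of replicating writes: every region already holds its own copy.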

Cassandra

• Availability over consistency

• Writes over reads

• We know Java

• Open source + support

Why Cassandra?

Using Cassandra at Netflix

• Priam

    • Zero touch auto-config

    • State management

    • Token assignment

    • Node replacement

    • Backup/restore to/from S3

• Astyanax

    • OO abstraction to Cassandra

    • Multi-region support

Cassandra Architecture

Going Multi-region

• 100% uptime is theoretically possible.

• You have to replicate your data

• This will cost money

Leveraging Multi-region

[Diagram, shown in three steps: regions us-east-1, us-west-2, eu-west-1, etc.]

What’s going on?!

Alert Systems

[Diagram labels: Atlas alerting, api, CORE Agent, Other Team’s Agent, Central Event Gateway, Paging Service, Amazon SES]

Central Event Gateway

• Parse raw alerts, match application to owner

• Add image captures and links to related graphs for easy mobile use

• Send to the right service based on priority

• Register the event in Chronos, the timeline application

• Correlate low priority alerts and generate new high priority alerts
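The gateway steps above (match owner, route by priority, correlate low-priority alerts) can be sketched as a single pass over incoming alerts. Everything named here is hypothetical: the owner table, the channel names, and the rule that three low-priority alerts for one application produce a high-priority one are assumptions for illustration, not the gateway's real rules.

```python
from collections import Counter

OWNERS = {"api": "api-team", "playback": "streaming-team"}   # hypothetical mapping


def route(alerts, escalate_at=3):
    """Attach owners, pick a channel by priority, and escalate repeated low-priority alerts."""
    routed = []
    low_counts = Counter()
    for alert in alerts:
        app, priority = alert["app"], alert["priority"]
        owner = OWNERS.get(app, "unowned")
        channel = "page" if priority == "high" else "email"
        routed.append({"app": app, "owner": owner, "channel": channel})
        if priority == "low":
            low_counts[app] += 1
            if low_counts[app] == escalate_at:
                # Correlation: several low-priority alerts become one high-priority alert.
                routed.append({"app": app, "owner": owner, "channel": "page"})
    return routed
```

A real gateway would also bound the correlation by time; this sketch counts alerts without a window to stay short.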

Metrics in Production

• 796B daily metric points

• Peaks at 1.4B / min

• 50% daily metric churn

What is a metric?

com.netflix.eds.nccp.successful.requests.uiversion.nccprt-authorization.devtypid-101.clver-PHL_0AB.uiver-UI_169_mid.geo-US
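One plausible reading of the example metric name is a dotted base name followed by `key-value` segments. The parser below is a sketch under that assumption only; the slides do not define the naming scheme, and the function name is made up.

```python
def parse_metric(metric):
    """Split a dotted metric name into a base name and key-value tags."""
    name_parts = []
    tags = {}
    for segment in metric.split("."):
        key, sep, value = segment.partition("-")
        if sep:                      # assumption: "key-value" segments are tags
            tags[key] = value
        else:
            name_parts.append(segment)
    return ".".join(name_parts), tags


name, tags = parse_metric(
    "com.netflix.eds.nccp.successful.requests.uiversion."
    "nccprt-authorization.devtypid-101.clver-PHL_0AB.uiver-UI_169_mid.geo-US"
)
```

Read this way, each distinct combination of tag values is its own metric, which may help explain the high daily churn.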

How we built it

• Built our own big data system

• Based on S3 and EMR

• Fewer copies, lower resolution, and slower retrieval as data ages
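Age-based storage can be expressed as a small lookup table. The tiers and numbers below are invented for illustration; the slides give no actual retention figures.

```python
RETENTION_TIERS = [
    # (max age in days, copies kept, resolution in seconds) -- hypothetical numbers
    (1, 3, 60),
    (14, 2, 300),
    (365, 1, 3600),
]


def tier_for(age_days):
    """Return (copies, resolution_seconds) for data of the given age, or None if expired."""
    for max_age, copies, resolution in RETENTION_TIERS:
        if age_days <= max_age:
            return copies, resolution
    return None
```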

Self Serve is the Key

• Developers choose what metrics to submit

• What graphs they put on their dashboards

• What to alert on

Example Alert Config

Atlas

When something breaks...

Breakdown of an outage

Is something wrong? Alerting

Where is the problem? Telemetry and Dashboards

What changed? ???

Breakdown of an outage

Is something wrong? Alerting

Where is the problem? Telemetry and Dashboards

What changed? Change control?

Change control, the good

• Tells you what changed

• Tells you what’s about to change

• Great for coordination when one change gates another change

Change control, the bad

• It’s manual

• It expresses intent, not reality

• It forces you to serialize your changes to an extent

Breakdown of an outage

Is something wrong? Alerting

Where is the problem? Telemetry and Dashboards

What changed? Chronos
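A timeline that answers "what changed?" can be sketched as a sorted event log queried by time window. This is a toy model, not Chronos itself; the class, the method names, and the event descriptions are assumptions.

```python
from bisect import bisect_left, bisect_right


class Timeline:
    """Log of change events, kept sorted by time and queryable by window."""

    def __init__(self):
        self.times = []     # sorted event timestamps (seconds)
        self.events = []    # descriptions, parallel to self.times

    def record(self, timestamp, description):
        index = bisect_right(self.times, timestamp)   # keep both lists sorted
        self.times.insert(index, timestamp)
        self.events.insert(index, description)

    def changes_before(self, outage_time, window):
        """Events in [outage_time - window, outage_time], oldest first."""
        lo = bisect_left(self.times, outage_time - window)
        hi = bisect_right(self.times, outage_time)
        return self.events[lo:hi]
```

Because events are recorded as they happen rather than declared in advance, the log reflects reality rather than intent, which is the gap in manual change control noted above.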

(Some of) Netflix is open source:

https://netflix.github.io

Just a quick reminder...

Netflix is hiring!

If you like what you see here, feel free to reach out!

Questions?
