49
Now Playing on Netflix: Adventures in a Cloudy Future CMG November 2013 Adrian Cockcroft @adrianco @NetflixOSS http://www.linkedin.com/in/adriancockcroft

Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Embed Size (px)

DESCRIPTION

Flowcon keynote was a few days before CMG, a few tweaks and some extra content added at the start and end. Opening Keynote talk for both conferences on how Speed Wins and how Netflix is doing Continuous Delivery

Citation preview

Page 1: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Now Playing on Netflix:Adventures in a Cloudy Future

CMG November 2013Adrian Cockcroft@adrianco @NetflixOSS

http://www.linkedin.com/in/adriancockcroft

Page 2: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Netflix Member Web Site Home PagePersonalization Driven – How Does It Work?

Page 3: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

How Netflix Used to Work

Customer Device (PC, PS3, TV…)

Monolithic Web App

Oracle

MySQL

Monolithic Streaming App

Oracle

MySQL

Limelight/Level 3 Akamai CDNs

Content Management

Content Encoding

Consumer Electronics

AWS Cloud Services

CDN Edge Locations

Datacenter

Page 4: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

How Netflix Streaming Works Today

Customer Device (PC, PS3, TV…)

Web Site or Discovery API

User Data

Personalization

Streaming API

DRM

QoS Logging

OpenConnect CDN Boxes

CDN Management and Steering

Content Encoding

Consumer Electronics

AWS Cloud Services

CDN Edge Locations

Datacenter

Page 5: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Nov2012StreamingBandwidth

March2013

MeanBandwidth+39% 6mo

Page 6: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Netflix Scale

• Tens of thousands of instances on AWS– Typically 4 core, 30GByte, Java business logic– Thousands created/removed every day

• Thousands of Cassandra NoSQL storage nodes– Mostly 8 core, 60Gbyte, 2TByte of SSD– 65 different clusters, over 300TB data, triple zone– Over 40 are multi-region clusters (6, 9 or 12 zone)– Biggest 288 nodes, 300K rps, 1.3M wps

Page 7: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Reactions over time

2009 “You guys are crazy! Can’t believe it”

2010 “What Netflix is doing won’t work”

2011 “It only works for ‘Unicorns’ like Netflix”

2012 “We’d like to do that but can’t”

2013 “We’re on our way using Netflix OSS code”

Page 8: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

"This is the IT swamp draining manual for anyone who is neck deep in alligators."- Adrian Cockcroft, Cloud Architect at Netflix

Page 9: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery
Page 10: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Mainframe

Client-Server

Commodity

Web-scaleCloud

Page 11: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Goal of Traditional IT:Reliable hardware

running stable software

Page 12: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery
Page 13: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

SCALEBreaks hardware

Page 14: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

….SPEEDBreaks software

Page 15: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

SPEED at SCALE

Breaks everything

Page 16: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery
Page 17: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Incidents – Impact and Mitigation

PRX Incidents

CSXX Incidents

Metrics impact – Feature disableXXX Incidents

No Impact – fast retry or automated failoverXXXX Incidents

Public Relations Media Impact

High Customer Service Calls

Affects AB Test Results

Y incidents mitigated by Active Active, game day practicing

YY incidents mitigated by

better tools and practices

YYY incidents mitigated by better

data tagging

Page 18: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Web Scale Architecture

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

Regional Load Balancers

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

Regional Load Balancers

UltraDNSDynECT

DNS

AWS Route53

DNSAutomation

Page 19: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

CIO Says Speed IT Up!

Page 20: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Colonel Boyd, USAF

“Get inside your adversaries' OODA loop to disorient them”

Page 21: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Observe

Orient

Decide

Act

Land grab opportunity Competitive

Move

Customer Pain Point

Analysis

Get Buy-in

Plan Response

Commit Resources

Implement

Deliver

Engage customers

Model Hypotheses

Measure Customers

Colonel Boyd, USAF

“Get inside your adversaries'

OODA loop to disorient them”

Page 22: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Observe

Orient

Decide

Act

Territory Expansion Foreign

Competition

Customer Pain Point

Systems Analysis

Board Level Buy-in

5 year PlanVendor

Evaluation

Customize Vendor SW

Upgrade Mainframe

Print Ad Campaign

Capacity Model

Measure Revenue

Mainframe Era - 1 year

cycle

Page 23: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

80’s Mainframe Innovation Cycle

• Cost $1M to $100M• Duration 1 to 5 years• Bet the whole company• Cost of failure – bankrupt or bought• Cobol and DB2 on MVS

Page 24: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Observe

Orient

Decide

Act

Territory Expansion Foreign

Competition

Customer Pain Point

Data Warehouse

CIO Level Buy-in

1 year PlanVendor

Evaluation

Customize Vendor SW

Install Servers

TV Advert Campaign

Capacity Estimate

Measure Revenue

Client/Server Era – 3

month cycle

Page 25: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

90’s Client Server Innovation Cycle

• Cost $100K to $10M• Duration 3 – 12 months• Bet a product line or division• Cost of failure – revenue hit, CIO’s job• C++ and Oracle on Solaris

Page 26: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Observe

Orient

Decide

Act

Territory Expansion Competitive

Moves

Customer Pain Point

Data Warehouse

Business Buy-in

2 Week Plan

Feature Priority

Code Feature

Install Capacity

Web Display Ads

Capacity Estimate

Measure Sales

Commodity Era – 2 week

agile train

Page 27: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

00’s Commodity Agile Innovation Cycle

• Cost $10K to $1M• Duration 2 – 12 weeks• Bet a product feature• Cost of failure – product mgr reputation• Java and MySQL on RedHat Linux

Page 28: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Train Model Process Hand-Off Steps

Product Manager

Developer

QA Integration Team

Operations Deploy Team

BI Analytics Team

Page 29: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

What Happened?

Rate of change increased

Cost and size and risk of

change reduced

Page 30: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Cloud Native

Construct a highly agile and highly available service from ephemeral and

assumed broken components

Page 31: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Real Web Server Dependencies Flow(Netflix Home page business transaction as seen by AppDynamics)

Start Here

memcached

Cassandra

Web service

S3 bucket

Personalization movie group choosers (for US, Canada and Latam)

Each icon is three to a few hundred instances across three AWS zones

Page 32: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Continuous Deployment

No time for handoff to IT

Page 33: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Developer Self Service

Freedom and Responsibility

Page 34: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Developers run what they wrote

Root access and pagerduty

Page 35: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

IT is a Cloud API

DEVops automation

Page 36: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Github all the things!

Leverage social coding

Page 37: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery
Page 38: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Putting it all together…

Page 39: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Observe

Orient

Decide

Act

Land grab opportunity Competitive

Move

Customer Pain Point

Analysis

JFDI

Plan Response

Share Plans

Increment Implement

Automatic Deploy

Launch AB Test

Model Hypotheses

BIG DATA

INNOVATION

CULTURE

CLOUD

Measure Customers

Continuous Delivery on

Cloud

Page 40: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Continuous Innovation Cycle

• Cost near zero, variable expense• Duration hours to days• Bet a decoupled microservice code push• Cost of failure – near zero, instant rollback• Clojure/Scala/Python on NoSQL on Cloud

Page 41: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Continuous Deploy Hand-Off Steps

Product ManagerA/B test setup and enableSelf service hypothesis test results

DeveloperAutomated testSelf service deploy, on callSelf service analytics

Page 42: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Continuous Deploy Automation

Check in code, Jenkins build

Bake AMI, launch in test env

Functional and performance test

Production canary test

Production red/black push

Page 43: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Bad Canary Signature

Page 44: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Happy Canary Signature

Page 45: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Global Deploy Automation

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

West Coast Load Balancers

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

East Coast Load Balancers

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

Europe Load Balancers

Afternoon in CaliforniaNight-time in Europe

Next day on East CoastNext day on West Coast After peak in Europe

If passes test suite, canary then deploy

Canary then deployCanary then deploy

Page 46: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Ephemeral Instances

• Largest services are autoscaled• Average lifetime of an instance is 36 hours

Push

Autoscale UpAutoscale Down

Page 47: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

(New Today!) Predictive Autoscaling

More morning loadSat/Sun high traffic

Lower load on Weds

24 Hours predicted traffic vs. actual

Prediction driving AWS Autoscaler to plan capacity

Page 48: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Inspiration

Page 49: Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is doing Continuous Delivery

Takeaway

Speed WinsAssume Broken

Cloud Native AutomationGithub is your “app store” and resumé

@adrianco @NetflixOSShttp://netflix.github.com