The Flowcon keynote was given a few days before CMG, with a few tweaks and some extra content added at the start and end: the opening keynote talk for both conferences, on how Speed Wins and how Netflix does Continuous Delivery.
Now Playing on Netflix: Adventures in a Cloudy Future
CMG, November 2013
Adrian Cockcroft
@adrianco @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
Netflix Member Web Site Home Page
Personalization Driven – How Does It Work?
How Netflix Used to Work
[Architecture diagram, spanning consumer electronics, AWS cloud services, CDN edge locations and the datacenter: customer devices (PC, PS3, TV…) talk to a monolithic web app and a monolithic streaming app, each backed by Oracle and MySQL in the datacenter; content management and content encoding feed the Limelight/Level 3/Akamai CDNs]
How Netflix Streaming Works Today
[Architecture diagram, spanning consumer electronics, AWS cloud services, CDN edge locations and the datacenter: customer devices (PC, PS3, TV…) talk to the web site or discovery API (user data, personalization) and the streaming API (DRM, QoS logging), all running on AWS; CDN management and steering plus content encoding drive the Open Connect CDN boxes at the edge]
[Chart: streaming bandwidth, November 2012 vs. March 2013 – mean bandwidth up 39% in 6 months]
Netflix Scale
• Tens of thousands of instances on AWS
– Typically 4 core, 30 GByte, Java business logic
– Thousands created/removed every day
• Thousands of Cassandra NoSQL storage nodes
– Mostly 8 core, 60 GByte, 2 TByte of SSD
– 65 different clusters, over 300 TB of data, triple zone
– Over 40 are multi-region clusters (6, 9 or 12 zones)
– Biggest: 288 nodes, 300K reads/sec, 1.3M writes/sec
Reactions over time
2009 “You guys are crazy! Can’t believe it”
2010 “What Netflix is doing won’t work”
2011 “It only works for ‘Unicorns’ like Netflix”
2012 “We’d like to do that but can’t”
2013 “We’re on our way using Netflix OSS code”
"This is the IT swamp draining manual for anyone who is neck deep in alligators."- Adrian Cockcroft, Cloud Architect at Netflix
Mainframe
Client-Server
Commodity
Web-scale Cloud
Goal of Traditional IT: reliable hardware running stable software
SCALE breaks hardware
…SPEED breaks software
SPEED at SCALE breaks everything
Incidents – Impact and Mitigation
PR: X incidents – public relations media impact; Y incidents mitigated by active-active operation and game day practicing
CS: XX incidents – high customer service call volume; YY incidents mitigated by better tools and practices
Metrics impact (feature disable): XXX incidents – affects A/B test results; YYY incidents mitigated by better data tagging
No impact: XXXX incidents – fast retry or automated failover
Web Scale Architecture
[Diagram: two AWS regions, each with regional load balancers in front of Cassandra replicas in Zones A, B and C; DNS automation spans UltraDNS, DynECT DNS and AWS Route 53]
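Driving three DNS vendors through one interface is what Netflix OSS's Denominator library is for; below is a minimal Python sketch of the idea, assuming a hypothetical provider interface (the class and function names are illustrative, not Denominator's actual Java API).

```python
# Hypothetical sketch of multi-vendor DNS failover; all names here are
# illustrative assumptions, not the Netflix OSS (Denominator) API.
from abc import ABC, abstractmethod

class DnsProvider(ABC):
    """Common interface over vendors such as UltraDNS, DynECT and Route 53."""
    @abstractmethod
    def set_cname(self, name: str, target: str, ttl: int = 60) -> None: ...

class Route53Provider(DnsProvider):
    def set_cname(self, name, target, ttl=60):
        # stand-in for a real vendor API call
        print(f"route53: {name} CNAME {target} (ttl={ttl})")

def fail_over(providers, healthy_region_lb):
    # Repoint the service CNAME at a healthy region's load balancer on every
    # vendor, so resolution survives a DNS-provider outage as well.
    for p in providers:
        p.set_cname("api.example.com", healthy_region_lb)

fail_over([Route53Provider()], "lb.us-west-2.example.com")
```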
CIO Says Speed IT Up!
Colonel Boyd, USAF
“Get inside your adversaries' OODA loop to disorient them”
[OODA loop diagram: Observe (land grab opportunity, competitive move, customer pain point) → Orient (analysis, model hypotheses) → Decide (get buy-in, plan response, commit resources) → Act (implement, deliver, engage customers); measuring customers closes the loop back to Observe]
Mainframe Era – 1 year cycle
[The same OODA loop, mainframe style: Observe (territory expansion, foreign competition, customer pain point) → Orient (systems analysis, capacity model) → Decide (board-level buy-in, 5-year plan, vendor evaluation) → Act (customize vendor SW, upgrade mainframe, print ad campaign); measuring revenue closes the loop]
80’s Mainframe Innovation Cycle
• Cost $1M to $100M
• Duration 1 to 5 years
• Bet the whole company
• Cost of failure – bankrupt or bought
• COBOL and DB2 on MVS
Client/Server Era – 3 month cycle
[OODA loop: Observe (territory expansion, foreign competition, customer pain point) → Orient (data warehouse, capacity estimate) → Decide (CIO-level buy-in, 1-year plan, vendor evaluation) → Act (customize vendor SW, install servers, TV advert campaign); measuring revenue closes the loop]
90’s Client Server Innovation Cycle
• Cost $100K to $10M
• Duration 3 – 12 months
• Bet a product line or division
• Cost of failure – revenue hit, CIO’s job
• C++ and Oracle on Solaris
Commodity Era – 2 week agile train
[OODA loop: Observe (territory expansion, competitive moves, customer pain point) → Orient (data warehouse, capacity estimate) → Decide (business buy-in, 2-week plan, feature priority) → Act (code feature, install capacity, web display ads); measuring sales closes the loop]
00’s Commodity Agile Innovation Cycle
• Cost $10K to $1M
• Duration 2 – 12 weeks
• Bet a product feature
• Cost of failure – product mgr reputation
• Java and MySQL on RedHat Linux
Train Model Process Hand-Off Steps
Product Manager → Developer → QA Integration Team → Operations Deploy Team → BI Analytics Team
What Happened?
Rate of change increased; cost, size and risk of change reduced
Cloud Native
Construct a highly agile and highly available service from ephemeral and assumed-broken components
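"Assumed broken" in practice means every remote call is bounded, retried fast, and degrades to a fallback rather than failing the whole page – the pattern Netflix OSS ships as Hystrix. A minimal Python sketch of the idea, assuming a hypothetical personalization call; this is illustrative, not the Hystrix API:

```python
import random

def call_with_fallback(request, fallback, attempts=2, timeout_s=0.2):
    """Fast retry against an assumed-broken dependency, then degrade
    gracefully instead of failing the whole page."""
    for _ in range(attempts):
        try:
            return request(timeout_s)   # bounded call: never hang on a dead node
        except TimeoutError:
            continue                    # fast retry likely hits another instance
    return fallback()                   # e.g. an unpersonalized default row

def flaky_personalizer(timeout_s):
    if random.random() < 0.3:           # simulate an assumed-broken component
        raise TimeoutError
    return ["row 1", "row 2"]

print(call_with_fallback(flaky_personalizer, lambda: ["default row"]))
```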
Real Web Server Dependencies Flow
(Netflix Home page business transaction as seen by AppDynamics)
[Dependency flow diagram: starting at the web server, the home page transaction fans out to memcached, Cassandra, other web services and an S3 bucket]
Personalization movie group choosers (for US, Canada and Latam)
Each icon is three to a few hundred instances across three AWS zones
Continuous Deployment
No time for handoff to IT
Developer Self Service
Freedom and Responsibility
Developers run what they wrote
Root access and PagerDuty
IT is a Cloud API
DevOps automation
GitHub all the things!
Leverage social coding
Putting it all together…
Continuous Delivery on Cloud
[OODA loop: Observe (land grab opportunity, competitive move, customer pain point – INNOVATION) → Orient (analysis, model hypotheses – BIG DATA) → Decide (plan response, share plans, JFDI – CULTURE) → Act (implement, increment, automatic deploy, launch A/B test – CLOUD); measuring customers closes the loop]
Continuous Innovation Cycle
• Cost near zero, variable expense
• Duration hours to days
• Bet a decoupled microservice code push
• Cost of failure – near zero, instant rollback
• Clojure/Scala/Python on NoSQL on Cloud
Continuous Deploy Hand-Off Steps
Product Manager: A/B test setup and enable; self-service hypothesis test results
Developer: automated test; self-service deploy, on call; self-service analytics
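One reason A/B test setup can be self service: cell allocation needs no coordination if it is a deterministic hash of the user and test name, so every service computes the same cell independently. A minimal sketch; the cell names and hashing scheme are assumptions, not Netflix's test framework:

```python
import hashlib

def ab_cell(user_id: str, test_name: str, cells=("control", "new_row_order")):
    """Deterministically bucket a user: hash(user, test) picks the cell,
    so any service can evaluate the test without a lookup service."""
    h = int(hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest(), 16)
    return cells[h % len(cells)]

print(ab_cell("user-42", "homepage-rows-2013"))
```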
Continuous Deploy Automation
1. Check in code, Jenkins build
2. Bake AMI, launch in test env
3. Functional and performance test
4. Production canary test
5. Production red/black push
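Strung together, the five steps look roughly like the sketch below. Every function is a print() stub standing in for real tooling (Jenkins for CI, an AMI bakery such as Aminator, a deploy tool such as Asgard); the names and counts are illustrative assumptions, not the actual Netflix pipeline code.

```python
def jenkins_build(commit):
    print(f"build {commit}")                 # check in code -> Jenkins CI build
    return f"pkg-{commit}"

def bake_ami(pkg):
    print(f"bake AMI from {pkg}")            # immutable machine image
    return f"ami-{pkg}"

def launch(ami, env, count=1):
    print(f"launch {count} x {ami} in {env}")
    return f"asg-{ami}-{env}"

def tests_pass(group):
    print(f"functional and performance test {group}")
    return True

def canary_healthy(group):
    print(f"compare {group} metrics against the production baseline")
    return True

def shift_traffic(group):
    print(f"red/black: route all traffic to {group}")

def deploy(commit):
    ami = bake_ami(jenkins_build(commit))
    if not tests_pass(launch(ami, env="test")):
        raise RuntimeError("failed in test env")
    if not canary_healthy(launch(ami, env="prod", count=1)):
        raise RuntimeError("bad canary signature")
    shift_traffic(launch(ami, env="prod", count=100))
    # the old (black) group stays registered but idle, so rollback is instant

deploy("abc123")
```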
[Metrics screenshots: a bad canary signature, where the canary diverges from the baseline, vs. a happy canary signature, where it tracks the baseline]
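The canary decision itself can reduce to a simple comparison: run a handful of new-code instances next to the current fleet and check that their key metrics stay close to the baseline. A minimal sketch, with the thresholds and metric names as assumptions (Netflix's real canary scoring is more elaborate):

```python
from statistics import mean

def canary_ok(canary_metrics, baseline_metrics, tolerance=1.2):
    """A canary is 'happy' when its error rate and latency stay within
    tolerance of the production baseline; otherwise stop the push."""
    for metric in ("error_rate", "latency_ms"):
        if mean(canary_metrics[metric]) > mean(baseline_metrics[metric]) * tolerance:
            return False   # bad canary signature: degradation vs. baseline
    return True

baseline = {"error_rate": [0.010, 0.012], "latency_ms": [85, 90]}
canary   = {"error_rate": [0.011, 0.013], "latency_ms": [88, 92]}
print(canary_ok(canary, baseline))  # True -> proceed to red/black push
```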
Global Deploy Automation
[Diagram: three regions (West Coast, East Coast, Europe), each with load balancers in front of Cassandra replicas in Zones A, B and C. The rollout follows the sun: in the afternoon in California (night-time in Europe), a build that passes the test suite is canaried then deployed in Europe; after the European peak it is canaried then deployed the next day on the East Coast, then the next day on the West Coast]
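A sketch of that follow-the-sun schedule; the region names, ordering and the canary_then_deploy callback are illustrative assumptions:

```python
ROLLOUT_ORDER = [  # deploy each region during its local off-peak window
    ("eu-west-1", "night-time in Europe (afternoon in California)"),
    ("us-east-1", "next day on the East Coast, after the European peak"),
    ("us-west-2", "the following day on the West Coast"),
]

def global_deploy(ami, canary_then_deploy):
    for region, when in ROLLOUT_ORDER:
        print(f"{region}: {when}")
        if not canary_then_deploy(ami, region):  # canary then deploy, per region
            print(f"halting rollout at {region}")
            return False
    return True

global_deploy("ami-123", lambda ami, region: True)
```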
Ephemeral Instances
• Largest services are autoscaled
• Average lifetime of an instance is 36 hours
[Chart: instance count over time – a code push, then autoscale up and autoscale down with the daily traffic cycle]
(New Today!) Predictive Autoscaling
[Chart: 24 hours of predicted vs. actual traffic – more morning load, high Sat/Sun traffic, lower load on Wednesdays; the prediction drives the AWS Autoscaler to plan capacity]
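The core idea reduces to forecasting each hour from the same hour in recent weeks, then setting desired capacity ahead of the spike rather than reacting to it. A minimal sketch, assuming a hypothetical per-instance throughput and headroom factor (Netflix's actual predictive autoscaler is far more sophisticated):

```python
from statistics import median

def predict_rps(history, day_of_week, hour):
    """history[(day, hour)] -> list of requests/sec from past weeks."""
    return median(history[(day_of_week, hour)])

def plan_capacity(history, day, hour, rps_per_instance=1000, headroom=1.25):
    desired = int(predict_rps(history, day, hour) * headroom / rps_per_instance)
    # stand-in for an AWS autoscaling call such as set_desired_capacity()
    print(f"{day} {hour:02d}:00 -> desired capacity {desired}")
    return desired

history = {("Sat", 20): [510_000, 530_000, 520_000]}  # Sat/Sun high traffic
plan_capacity(history, "Sat", 20)
```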
Inspiration
Takeaway
Speed Wins – Assume Broken
Cloud Native Automation
GitHub is your “app store” and resumé
@adrianco @NetflixOSS
http://netflix.github.com