devops-chicago
Jeremy Edberg: How Netflix Delivers Software
How Netflix Delivers Software
July 8th, 2014
Email: jedberg@{gmail,netflix}.com
Twitter: @jedberg
Web: www.jedberg.net
Facebook: facebook.com/jedberg
Linkedin: www.linkedin.com/in/jedberg
When your software fails...
will your system survive?
The Netflix way
• Fully automated build tools to test and make packages
• Fully automated machine image bakery
• Fully automated image deployment
• Everything is “built for three”
• Independent teams responsible for both Dev and Ops
• Redundancy through multi-region deployment
Philosophy
• We hire responsible adults and keep rules and policies to a minimum
• Developers can change any code in production at any time
• And things don’t break (usually)
Freedom and Responsibility
Automate all the things!
http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html
• Application startup
• Configuration
• Code deployment
• System deployment
Automate all the things!
• Standard base image
• Tools to manage all the systems
• Reduce errors through reproducibility
Automation
Shared state should be stored in a shared service
Data on an instance should be replicated to other instances
“Build for three”
We hold a boot camp for new engineers to teach them how to
build for a highly distributed environment.
12B outbound requests per day to API dependencies:
• Movie Ratings
• Personalization Engine
• User Info
• Movie Metadata
• Similar Movies
• Reviews
• A/B Test Engine
2B requests per day into the Netflix API
Discovery API
Streaming API
Content Encoding
CDN Management
QOS Logging
DRM
OpenConnect Edge Locations
Browse
Play
Watch
• Services are built by different teams who work together to figure out what each service will provide.
• The service owner publishes an API that anyone can use.
Highly aligned, loosely coupled
• Easier auto-scaling
• Easier capacity planning
• Identify problematic code-paths more easily
• Narrow down the effects of a change
• More efficient local caching
Advantages to a Service Oriented Architecture
• Developers deploy when they want
• They also manage their own capacity and autoscaling
• And fix anything that breaks at 4am!
Freedom and Responsibility
All systems choices assume
some part will fail at some point.
• Simulate things that go wrong
• Find things that are different
The Monkey Theory
Execution
AWS
Netflix OSS
Netflix Application Code
AWS
Netflix OSS
YOUR Application Code
• Instances
• Machine Images
• Elastic IPs
• Load Balancers
• Security groups / Autoscaling
What AWS Provides
• Service Oriented Architecture
• HTTP/REST interfaces between services
Netflix built a global PaaS
Netflix OSS
• Supports all regions and zones
• Multiple accounts
• Cross region/account replication
• Internationalized, localized and GeoIP routed
• Advanced key management
• Autoscaling with 1000s of instances
• Monitoring and alerting on millions of metrics
Netflix PaaS features
Open Source at Netflix
Netflix OSS
Be liberal in what you accept, strict in what you send
Circuit Breakers (Hystrix)
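The circuit breaker idea can be sketched in a few lines. This is a minimal illustration in Python, not Hystrix's API (Hystrix is a Java library); the class name, the `max_failures` and `reset_after` parameters, and the fallback convention are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures, then allows a trial call after a cooldown."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # hypothetical threshold
        self.reset_after = reset_after    # hypothetical cooldown in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        # While open, short-circuit to the fallback until the cooldown has passed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

A caller wraps each remote call, for example `breaker.call(fetch_ratings, lambda: [])`, so that a failing dependency degrades to a default response instead of tying up the caller.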
• Simulate things that go wrong
• Find things that are different
The Monkey Theory
• Chaos -- Kills random instances
• Chaos Gorilla -- Kills zones
• Chaos Kong -- Kills regions
• Latency -- Degrades network and injects faults
• Conformity -- Looks for outliers
• Circus -- Kills and launches instances to maintain zone balance
• Doctor -- Fixes unhealthy resources
• Janitor -- Cleans up unused resources
• Howler -- Yells about bad things like Amazon limit violations
• Security -- Finds security issues and expiring certificates
The simian army
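As a rough illustration of the "kills random instances" idea, the sketch below only selects victims; it does not call any cloud API. The function name, the `groups` mapping, and the `probability` parameter are hypothetical and are not taken from the Simian Army code.

```python
import random


def pick_victims(groups, probability, rng=random):
    """Pick at most one instance per group, each group with the given probability."""
    victims = []
    for name, instances in sorted(groups.items()):
        # Skip empty groups; otherwise roll the dice for this group.
        if instances and rng.random() < probability:
            victims.append((name, rng.choice(instances)))
    return victims
```

Keeping the selection separate from the termination call makes the policy easy to test with a seeded random generator.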
Netflix OSS
• Blueprint for the rest of the platform libraries
• Pluggable architecture
• On instance software load balancer
• Zone aware / Zone affinity
• Handles retry logic
• Global variables
• Support for staged rollout
• Feature flags
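The zone affinity and retry bullets can be illustrated with a small client-side load balancer. This is a sketch under assumptions: servers are plain dicts with a `zone` key, `send` is a caller-supplied function, and none of the names come from the Netflix libraries.

```python
import random


def choose_server(servers, my_zone, rng=random):
    """Prefer servers in the caller's zone; fall back to all zones if none are local."""
    local = [s for s in servers if s["zone"] == my_zone]
    return rng.choice(local or servers)


def call_with_retry(servers, my_zone, send, attempts=3, rng=random):
    """Try up to `attempts` servers, dropping each failed server from the candidates."""
    candidates = list(servers)
    last_error = None
    for _ in range(attempts):
        if not candidates:
            break
        server = choose_server(candidates, my_zone, rng)
        try:
            return send(server)
        except ConnectionError as err:
            last_error = err
            candidates.remove(server)
    raise last_error or ConnectionError("no servers available")
```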
Netflix OSS
• Application to instance mapping
• Heartbeat to keep track of health
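A registry that maps applications to instances and uses heartbeats for health might look roughly like this. The class, its method names, and the 90-second `ttl` default are assumptions for illustration, not the real service's interface.

```python
import time


class Registry:
    """Maps application names to instances and ignores instances with stale heartbeats."""

    def __init__(self, ttl=90.0, clock=time.monotonic):
        self.ttl = ttl  # hypothetical: seconds an instance stays listed without a heartbeat
        self.clock = clock
        self.last_seen = {}  # (app, instance_id) -> time of last heartbeat

    def heartbeat(self, app, instance_id):
        # Registration and renewal are the same operation in this sketch.
        self.last_seen[(app, instance_id)] = self.clock()

    def instances(self, app):
        now = self.clock()
        return sorted(
            inst for (name, inst), seen in self.last_seen.items()
            if name == app and now - seen <= self.ttl
        )
```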
DQ Transport Routing
Suro
etc
Eventbus
Druid
Netflix OSS
Why Bake?
Traditional (generic AMI to instance):
• launch OS
• install packages
• install app
Netflix (app AMI to instance):
• launch OS+app
Getting Baked
Perforce / Git
libraries
source
Ant targets
Ivy
Groovy all over
app bundles
Jenkins
sync
resolve
build
compile
report
publish
test
Artifactory
snapshot / release libraries / apps
Base Image Baking
Yum / Apt
Linux: CentOS, Fedora, Ubuntu
RPMs: Apache, Java...
ec2 slave instances
S3 / EBS
foundation AMI
base AMI
Bakery
mount
install
Ready for app bake
snapshot
AWS
App Image Baking
Jenkins / Yum / Artifactory
Linux, Apache, Java, Tomcat
AWS
app bundle
ec2 slave instances
S3 / EBS
base AMI
app AMI
Bakery
mount
install
Ready to launch!
snapshot
app AMI
Linux Base AMI (CentOS or Ubuntu)
Java
Tomcat
Optional Apache
Monitoring
Log Rotation to S3
GC and thread dump logging
Application war file, base servlet, platform, interface jars for dependent services
Healthcheck, status servlets, JMX interface, Servo autoscale
app AMI
Linux Base AMI (CentOS or Ubuntu)
Java
JBoss
Optional Apache
Monitoring
Log Rotation to S3
GC and thread dump logging
Application war file, base servlet, platform, interface jars for dependent services
Healthcheck, status servlets, JMX interface, Servo autoscale
app AMI
Linux Base AMI (CentOS or Ubuntu)
Python
Bottle
Optional Apache
Monitoring
Log Rotation to S3
logging
Application file, base server, platform, interface libs for dependent services
Netflix OSS
Deploying Code; Step 1
Auto Scaling Group
Launch Configuration
Security Group
Amazon Machine Image
Instances
Load Balancer
Netflix has moved the granularity
from the instance to the cluster
Data is the most important asset Netflix
has. It’s what differentiates us from our competitors.
Netflix OSS
EVCache
• Wrapper on top of memcached
• Automatically replicates writes to multiple regions
• Pulls cache data intelligently via zone affinity
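The replicate-on-write, read-locally pattern described in these bullets can be sketched as follows. Plain dicts stand in for memcached clusters, and the class and method names are hypothetical, not EVCache's client API. The sketch replicates across zones; the same shape would apply to regions.

```python
class ZoneReplicatedCache:
    """Writes go to every zone's cache; reads try the local zone first."""

    def __init__(self, zones, local_zone):
        # One plain dict per zone stands in for a memcached cluster.
        self.stores = {zone: {} for zone in zones}
        self.local_zone = local_zone

    def set(self, key, value):
        for store in self.stores.values():
            store[key] = value

    def get(self, key):
        # Zone affinity: local zone first, then the remaining zones as fallback.
        order = [self.local_zone] + [z for z in self.stores if z != self.local_zone]
        for zone in order:
            if key in self.stores[zone]:
                return self.stores[zone][key]
        return None
```

Because every zone holds a copy, losing the local cache costs a slower read rather than a miss.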
Cassandra
• Availability over consistency
• Writes over reads
• We know Java
• Open source + support
Why Cassandra?
• Priam
• Zero touch auto-config
• State management
• Token assignment
• Node replacement
• Backup/restore to/from S3
Using Cassandra at Netflix
• Astyanax
• OO abstraction to Cassandra
• Multi-region support
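To give a feel for what "token assignment" automates, the sketch below computes evenly spaced tokens on a ring of 2**127 positions, the range used by Cassandra's RandomPartitioner. It is a simplified illustration only: the function name is made up, and the slides do not describe how Priam actually assigns tokens across zones and regions.

```python
def ring_tokens(node_count, ring_size=2**127):
    """Evenly spaced tokens for `node_count` nodes on a ring of `ring_size` positions."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return [i * ring_size // node_count for i in range(node_count)]
```

Doing this by hand for every node replacement is error-prone, which is one reason to automate it.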
Cassandra Architecture
Going Multi-region
• 100% uptime is theoretically possible.
• You have to replicate your data
• This will cost money
Leveraging Multi-region
us-east-1 us-west-2 etc
eu-west-1
What’s going on?!
Atlas
[Alerting diagram. Labels: api, Central Event Gateway, Paging Service, Amazon SES, CORE Agent, Other Team’s Agent, Alert Systems]
Central Event Gateway
• Parse raw alerts, match application to owner
• Add image captures and links to related graphs for easy mobile use
• Send to the right service based on priority
• Register the event in Chronos, the timeline application
• Correlate low priority alerts and generate new high priority alerts
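Two of the gateway's steps, owner matching with priority routing and correlation of low priority alerts, can be sketched like this. The alert fields, the `pager` and `email` destinations, and the threshold of three are assumptions chosen for the example.

```python
def route_alert(alert, owners):
    """Attach the owning team and pick a destination based on priority."""
    owner = owners.get(alert["app"], "unowned")
    # Hypothetical policy: page for high priority, email for everything else.
    destination = "pager" if alert["priority"] == "high" else "email"
    return {**alert, "owner": owner, "destination": destination}


def correlate(alerts, threshold=3):
    """Emit one high priority alert per app with `threshold` or more low priority alerts."""
    counts = {}
    for alert in alerts:
        if alert["priority"] == "low":
            counts[alert["app"]] = counts.get(alert["app"], 0) + 1
    return [
        {"app": app, "priority": "high", "message": f"{n} correlated low priority alerts"}
        for app, n in sorted(counts.items()) if n >= threshold
    ]
```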
Metrics in Production
• 796B daily metric points
• Peaks at 1.4B / min
• 50% daily metric churn
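A quick back-of-the-envelope check relates the daily total to the stated peak. The variable names are illustrative; the two inputs are the figures from the slide.

```python
daily_points = 796e9      # metric points per day (from the slide)
peak_per_minute = 1.4e9   # peak metric points per minute (from the slide)

# Average rate over the 1,440 minutes in a day, and how far the peak sits above it.
average_per_minute = daily_points / (24 * 60)
peak_to_average = peak_per_minute / average_per_minute
```

The average works out to roughly 553 million points per minute, so the peak is about 2.5 times the average.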
What is a metric?
com.netflix.eds.nccp.successful.requests.uiversion.nccprt-authorization.devtypid-101.clver-PHL_0AB.uiver-UI_169_mid.geo-US
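One way to read a name like this is as a base metric followed by key-value dimensions. The sketch below assumes that every dot-separated segment containing a hyphen is a `key-value` pair; that convention is inferred from the example name and is not stated in the slides.

```python
def parse_metric(name):
    """Split a dotted metric name into a base name and key-value dimensions."""
    base, dims = [], {}
    for part in name.split("."):
        # Assumption: "key-value" segments are dimensions, everything else is the base name.
        key, sep, value = part.partition("-")
        if sep:
            dims[key] = value
        else:
            base.append(part)
    return ".".join(base), dims
```

Under that assumption, the example name yields the base `com.netflix.eds.nccp.successful.requests.uiversion` and the dimensions `nccprt`, `devtypid`, `clver`, `uiver`, and `geo`.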
How we built it
• Built our own big data system
• Based on S3 and EMR
• Fewer copies, lower resolution, and slower retrieval as the data ages
Self Serve is the Key
• Developers choose what metrics to submit
• What graphs they put on their dashboards
• What to alert on
Example Alert Config
Atlas
When something breaks...
Breakdown of an outage
Is something wrong? Alerting
Where is the problem? Telemetry and Dashboards
What changed? ???
Breakdown of an outage
Is something wrong? Alerting
Where is the problem? Telemetry and Dashboards
What changed? Change control?
Change control, the good
• Tells you what changed
• Tells you what’s about to change
• Great for coordination when one change gates another change
Change control, the bad
• It’s manual
• It expresses intent, not reality
• It forces you to serialize your changes to an extent
Breakdown of an outage
Is something wrong? Alerting
Where is the problem? Telemetry and Dashboards
What changed? Chronos
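The "what changed?" question amounts to querying a timeline of recorded events for the period just before an alert. The sketch below is a hypothetical illustration of that query, not Chronos's interface; the event fields and the one-hour default window are assumptions.

```python
def changes_before(events, alert_time, window=3600):
    """Return events recorded in the `window` seconds before `alert_time`, newest first."""
    recent = [e for e in events if alert_time - window <= e["time"] <= alert_time]
    return sorted(recent, key=lambda e: e["time"], reverse=True)
```

Because events are recorded automatically as they happen, the result reflects what actually changed rather than what someone intended to change.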
(Some of) Netflix is open source:
https://netflix.github.io
Just a quick reminder...
Netflix is hiring!
If you like what you see here, feel free to reach out!
Questions?