Ariel Tseitlin, Director of the Netflix Cloud, presented on the elasticity and redundancy of its Cloud service.
@atseitlin
Netflix Cloud Platform
Netflix's evolution in the cloud
Ariel Tseitlin
http://www.linkedin.com/in/atseitlin @atseitlin
About Netflix
Netflix is the world’s leading Internet television network with nearly 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1]
[1] http://ir.netflix.com/
Original Content
Critical Acclaim
A complex distributed system
How Netflix Streaming Works
Customer Device (PC, PS3, TV…)
Web Site or Discovery API
User Data
Personalization
Streaming API
DRM
QoS Logging
OpenConnect CDN Boxes
CDN Management and Steering
Content Encoding
Consumer Electronics
AWS Cloud Services
CDN Edge Locations
Browse
Play
Watch
Highly Available Architecture
Micro-services, redundancy, resiliency
Web Server Dependencies Flow
Home page business transaction
Start Here
memcached
Cassandra
Web service
S3 bucket
Personalization movie group chooser
Each icon is three to a few hundred instances across three AWS zones
Component Micro-Services
Test With Chaos Monkey, Latency Monkey
Three Balanced Availability Zones
Test with Chaos Gorilla
Cassandra and Evcache Replicas
Zone A
Cassandra and Evcache Replicas
Zone B
Cassandra and Evcache Replicas
Zone C
Load Balancers
Triple Replicated Persistence
Cassandra maintenance affects individual replicas
Cassandra and Evcache Replicas
Zone A
Cassandra and Evcache Replicas
Zone B
Cassandra and Evcache Replicas
Zone C
Load Balancers
Isolated Regions
Will someday test with Chaos Kong
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
Failure Modes and Effects
Failure Mode | Probability | Current Mitigation Plan
Application Failure | High | Automatic degraded response
AWS Region Failure | Low | Wait for region to recover
AWS Zone Failure | Medium | Continue to run on 2 out of 3 zones
Datacenter Failure | Medium | Migrate more functions to cloud
Data store failure | Low | Restore from S3 backups
S3 failure | Low | Restore from remote archive
Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
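The zone-failure row ("continue to run on 2 out of 3 zones") implies a capacity rule: the surviving zones must be able to absorb peak load on their own. A minimal sketch of that arithmetic, assuming evenly balanced zones; the function name and the figure of 300 instances are illustrative and not from the talk.

```python
import math

def instances_per_zone(peak_instances_needed: int, zones: int = 3, zones_lost: int = 1) -> int:
    """Instances each zone must run so the surviving zones still cover peak load."""
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    return math.ceil(peak_instances_needed / surviving)

# Hypothetical service that needs 300 instances at peak.
per_zone = instances_per_zone(300)   # each of the 3 zones runs this many
total = per_zone * 3                 # fleet size with all zones healthy
overprovision = total / 300          # headroom factor paid for zone tolerance
```

Under these assumptions, tolerating the loss of one zone out of three means running about 1.5 times the capacity that peak load strictly requires.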
Application Resilience
Run what you wrote, Rapid detection, Rapid response
Fail often
Run What You Wrote
• Make developers responsible for failures
  – Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix
  – Make sure it’s not about finding “who to blame”
• Keep timeouts short, fail fast
  – Don’t let cascading timeouts stack up
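The "keep timeouts short, fail fast" advice can be sketched as a wrapper that gives a dependency a small time budget and returns a degraded fallback when the budget is exceeded. This is a minimal illustration using only the Python standard library, not Netflix's implementation; `call_with_timeout`, `slow_personalization`, and the timing values are hypothetical.

```python
import concurrent.futures
import time

# Shared worker pool for dependency calls (size chosen arbitrarily for the sketch).
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread; return fallback if it is too slow or raises."""
    future = _POOL.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The caller stops waiting; the worker thread itself is not interrupted.
        return fallback
    except Exception:
        return fallback

def slow_personalization():
    """Hypothetical dependency that is responding slowly."""
    time.sleep(0.5)
    return ["personalized row"]

rows = call_with_timeout(slow_personalization, timeout_s=0.05, fallback=["popular titles"])
```

The caller gets an answer within its own budget regardless of how slow the dependency is, so a slow service further down the chain does not stack its delay onto every caller above it.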
Rapid Detection
• If your pilot had no instrument panel, would you ever fly on the plane?
  – Never run your service blind
• Monitor services, not instances
  – Make instance failure a non-event
• Don’t pay people to watch screens
  – Instead pay them to build alerting
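One way to read "monitor services, not instances" is to alert on a metric aggregated across the whole fleet, so that a single bad instance does not page anyone. The sketch below assumes per-instance request and error counts are already collected; the `InstanceStats` type, the 5% threshold, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class InstanceStats:
    instance_id: str
    requests: int
    errors: int

def service_error_rate(stats):
    """Error rate of the service as a whole, weighted by traffic per instance."""
    total_requests = sum(s.requests for s in stats)
    total_errors = sum(s.errors for s in stats)
    return total_errors / total_requests if total_requests else 0.0

def should_alert(stats, threshold=0.05):
    """Alert only when the service-level error rate crosses the threshold."""
    return service_error_rate(stats) > threshold

fleet = [
    InstanceStats("i-a", requests=1000, errors=2),
    InstanceStats("i-b", requests=1000, errors=3),
    InstanceStats("i-c", requests=10, errors=10),  # one failing instance
]
alert = should_alert(fleet)
```

Here the failing instance barely moves the service-level rate, so no alert fires; replacing that instance can be left to automation.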
Rapid Rollback
• Use a new Autoscale Group to push code
• Leave existing ASG in place, switch traffic
• If OK, auto-delete old ASG a few hours later
• If “whoops”, switch traffic back in seconds
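The push-and-rollback flow above can be modeled as a small state machine. This is an in-memory toy, not Asgard or the AWS API; the `Router` class and the ASG names are hypothetical.

```python
class Router:
    """Toy traffic switch that tracks which ASG is serving traffic."""

    def __init__(self, active):
        self.active = active    # ASG currently receiving traffic
        self.standby = None     # previous ASG, kept running for rollback

    def push(self, new_asg):
        """Bring up new_asg next to the old one, then switch traffic to it."""
        self.standby, self.active = self.active, new_asg

    def rollback(self):
        """Switch traffic back to the previous ASG, which is still running."""
        if self.standby is None:
            raise RuntimeError("no previous ASG to roll back to")
        self.active, self.standby = self.standby, self.active

    def retire_old(self):
        """Delete the old ASG once the new one has proven healthy."""
        self.standby = None

router = Router("api-v041")
router.push("api-v042")   # new code takes traffic, old ASG stays up
router.rollback()         # "whoops": traffic returns to the old ASG
```

Rollback is fast in this model because it only changes where traffic is sent; nothing has to be rebuilt or redeployed.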
Asgard http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
Made possible in the cloud
APIs, Elasticity, Efficiency
APIs
• Control everything (start, terminate, scale)
• Inject failure
• Monitor & audit
• Automate operations
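Because instances can be terminated through an API, failure injection can itself be automated. The sketch below shows the general idea of terminating one randomly chosen instance in a group; it is not Chaos Monkey's code, and `inject_failure` with its `terminate` callback is a hypothetical stand-in for a real cloud API call.

```python
import random

def pick_victim(instances, rng):
    """Choose one instance at random from a group, or None if the group is empty."""
    return rng.choice(instances) if instances else None

def inject_failure(group, rng, terminate):
    """Terminate one random instance via the supplied callback and return its id."""
    victim = pick_victim(group, rng)
    if victim is not None:
        terminate(victim)
    return victim

terminated = []                      # records calls instead of touching real servers
group = ["i-1", "i-2", "i-3"]
victim = inject_failure(group, random.Random(0), terminated.append)
```

Passing the termination call in as a callback keeps the selection logic testable without access to any cloud account.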
Elasticity
• Capacity planning replaced with forecasting
• Dynamic load-based auto-scaling
• New data centers at the click of a button
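Load-based auto-scaling can be illustrated with a simple proportional rule: resize the group so that average load per instance returns to a target. This is a generic sketch, not the policy Netflix uses; the function name, the load units (percent), and the bounds are assumptions.

```python
import math

def desired_capacity(current_instances, avg_load, target_load, min_size, max_size):
    """Group size that brings average per-instance load back to target_load."""
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    wanted = math.ceil(current_instances * avg_load / target_load)
    # Clamp to the group's configured bounds.
    return max(min_size, min(max_size, wanted))

# 10 instances at 90% load with a 60% target: scale out.
scale_out = desired_capacity(10, avg_load=90, target_load=60, min_size=3, max_size=100)
# 10 instances at 10% load: scale in, but never below the minimum.
scale_in = desired_capacity(10, avg_load=10, target_load=60, min_size=3, max_size=100)
```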
Efficiency
• ~10x trough to peak ratio. Fill trough with batch workloads
• Optimize machine class for each service
• Highly available red/black deployments
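The "fill trough with batch workloads" point can be made concrete with a little arithmetic. The sketch assumes a fixed pool of already-paid-for capacity sized for peak, which the slide does not state; the demand figures are invented to show a 10x trough-to-peak ratio.

```python
def batch_capacity(hourly_demand, reserved):
    """Instances left over for batch work in each hour, given a fixed reserved pool."""
    return [max(0, reserved - demand) for demand in hourly_demand]

# Hypothetical streaming demand in instances: peak 100, trough 10.
demand = [100, 40, 10, 60]
spare = batch_capacity(demand, reserved=100)
```

Under this assumption the deepest trough leaves 90 of 100 instances free, which is the capacity batch jobs could use at no extra cost.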
Coming soon to a cloud near you
Billing & Payments, Big Data & Analytics, SaaS
Billing & Payments
• PCI compliance
• Privacy & security
• Intermediate step of cache in the cloud
Big Data & Analytics
• On deck for cloud migration
• ETL already in cloud with EMR (Hadoop)
• Many cloud alternatives but not yet as mature as the old guard
Corporate systems moving to SaaS
• Email (Exchange -> Google Apps)
• Expense Management (Concur -> Workday)
• Document sharing (File Servers -> Box)
• Goal is 100% SaaS
Open Source Projects
Legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon
Priam: Cassandra as a Service
Astyanax: Cassandra client for Java
CassJMeter: Cassandra test suite
Cassandra: Multi-region EC2 datastore support
Aegisthus: Hadoop ETL for Cassandra
Ice: Spend analytics
Governator: Library lifecycle and dependency injection
Odin: Cloud orchestration
Blitz4j: Async logging
Exhibitor: Zookeeper as a Service
Curator: Zookeeper Patterns
EVCache: Memcached as a Service
Eureka / Discovery: Service Directory
Archaius: Dynamics Properties Service
Edda: Config state with history
Denominator
Ribbon: REST Client + mid-tier LB
Karyon: Instrumented REST Base Serve
Servo and Autoscaling Scripts
Genie: Hadoop PaaS
Hystrix: Robust service pattern
RxJava: Reactive Patterns
Asgard: AutoScaleGroup based AWS console
Chaos Monkey: Robustness verification
Latency Monkey
Janitor Monkey
Bakeries / Aminator
Our Current Catalog of Releases
Free code available at http://netflix.github.com
We’re hiring!
• Simian Army
• Cloud Tools
• NetflixOSS
• Cloud Operations
• Reliability Engineering
• Many, many more
jobs.netflix.com
Takeaways
Netflix has built and deployed a scalable, global, and highly available Platform as a Service and open sourced it (NetflixOSS)
The Cloud enables elasticity, efficiency and fine-grained control via APIs
Credit cards, Big Data, and the rest of the corporate systems are next to move to the Cloud
http://netflix.github.com http://techblog.netflix.com http://slideshare.net/Netflix
http://www.linkedin.com/in/atseitlin
@atseitlin @NetflixOSS
Thank you!
Any questions?
Ariel Tseitlin http://www.linkedin.com/in/atseitlin