Self-Healing Amazon AWS-Based Open Source Distributed Rendering System
Presentation to Internet Tech Forum
George Nassef, Chief Technology Officer
June 2, 2014
Our Environment
• LAMPR (Linux, Apache, MySQL, PHP, and Redis)
• Millions of iOS and Android client mobile devices
• RESTful API, MVC
• Ad-hoc video rendering jobs submitted with real-time expectations.
• Need for job queue, dispatch, processing, and most importantly, self-healing recovery.
The Problem
• I wanted to deploy only Amazon Spot instances to save money.
• However, Spot instances can come and go and be terminated mid-job at any time due to price/availability changes.
• Also wanted full horizontal scaling with queue-based communications.
• Any render job which did not finish must be rapidly identified and requeued.
• However, geo-distribution makes traditional requeue software moot.
Solution
• Use Redis for the queue by creating work queue keys.
• Place to-be-done work in the “Messages” key.
• Have each render server pick up work and move the job ID to the “In Work” queue.
• Created an outboard MongoDB system to capture all system and application logs.
• A separate process intelligently watches the MongoDB system and interprets program alerts, system messages, warnings, and errors.
• The “watching” processor requeues render jobs before they fail.
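The queue flow above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's actual code. The slides name the “Messages” and “In Work” keys but do not say which Redis commands were used, so the RPOPLPUSH/LREM pattern here is one common choice. FakeRedis is a hypothetical in-memory stand-in so the example runs without a server, and the log record shape ({"job_id", "level"}) is also an assumption.

```python
from collections import deque

MESSAGES = "Messages"  # pending-work key, named in the slides
IN_WORK = "In Work"    # claimed-work key, named in the slides


class FakeRedis:
    """Hypothetical in-memory stand-in for the few Redis list commands used here."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        self.lists.setdefault(key, deque()).appendleft(value)

    def rpoplpush(self, src, dst):
        # Pop from the tail of src and push onto the head of dst.
        if not self.lists.get(src):
            return None
        value = self.lists[src].pop()
        self.lpush(dst, value)
        return value

    def lrem(self, key, count, value):
        # Simplified: removes at most one matching entry; count is ignored.
        items = self.lists.get(key, deque())
        if value in items:
            items.remove(value)
            return 1
        return 0


def submit(r, job_id):
    """API side: place to-be-done work in the pending list."""
    r.lpush(MESSAGES, job_id)


def claim(r):
    """Render server: take the oldest job and record it as in work."""
    return r.rpoplpush(MESSAGES, IN_WORK)


def complete(r, job_id):
    """Render server: drop a finished job from the in-work list."""
    r.lrem(IN_WORK, 1, job_id)


def requeue(r, job_id):
    """Watcher: move a claimed job back to the pending list (back of the line)."""
    if r.lrem(IN_WORK, 1, job_id):
        r.lpush(MESSAGES, job_id)
        return True
    return False


def watch(r, log_records, bad_levels=("warning", "error")):
    """Requeue every in-work job whose log records look unhealthy.

    log_records stands in for documents read from the MongoDB log store.
    """
    requeued = []
    for rec in log_records:
        if rec["level"] in bad_levels and requeue(r, rec["job_id"]):
            requeued.append(rec["job_id"])
    return requeued


r = FakeRedis()
submit(r, "job-1")
submit(r, "job-2")
first = claim(r)  # oldest job moves from "Messages" to "In Work"
logs = [{"job_id": first, "level": "warning"}]
requeued = watch(r, logs)
print(first, requeued, list(r.lists[MESSAGES]), list(r.lists[IN_WORK]))
```

Against a real server, the same three method names exist on the redis-py client, though it returns bytes unless created with decode_responses=True. The remove-then-push step in requeue is not atomic there, so a production version would likely wrap it in a transaction or script.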
Progress
• After hundreds of thousands of submitted render jobs, only 2 failures.
• Failures were unrelated to the “chaos” of the environment or Spot instances.
• Real-time analysis of MongoDB logs catches and closes security holes and system errors, and provides data and user response time information.
• Huge savings in AWS costs from being able to use ANY availability zone worldwide and spin instances up or down at will without regard to workload.
• System “self-heals” as it scales.
Challenges / Choices
• AWS's hosted Redis is ElastiCache.
• Accessible seamlessly across Availability Zones, but not Regions.
• Autoscale groups for Spot instances take time to “spin up.”
• CloudWatch metrics may not fit your workload for rapid spin-up.
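As an illustration of a workload-specific scaling signal, the sketch below derives a target instance count from the backlog of queued render jobs. The slides do not describe this approach. The function name, the jobs-per-worker ratio, and the min/max bounds are all hypothetical values chosen for the example.

```python
import math


def desired_workers(backlog, jobs_per_worker=4, min_workers=1, max_workers=50):
    """Target instance count for a given number of queued render jobs.

    All parameters are illustrative assumptions, not values from the talk.
    """
    needed = math.ceil(backlog / jobs_per_worker)
    # Clamp to the allowed fleet size.
    return max(min_workers, min(max_workers, needed))


for backlog in (0, 3, 9, 1000):
    print(backlog, desired_workers(backlog))
```

In practice the backlog might be read from the length of the “Messages” list, and the result fed to whatever mechanism sets the group's capacity.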