Self-Healing Amazon AWS-Based Open Source Distributed Rendering System. Presentation to Internet Tech Forum. George Nassef, Chief Technology Officer. June 2, 2014.

Self-Healing AWS-Based Open Source Distributed Rendering System by George Nassef



Self-Healing Amazon AWS-Based Open Source Distributed Rendering System

Presentation to Internet Tech Forum

George Nassef, Chief Technology Officer

June 2, 2014


Our Environment

• LAMPR (Linux, Apache, MySQL, PHP and Redis)

• Millions of iOS and Android client mobile devices

• RESTful API, MVC

• Ad-hoc video rendering jobs submitted with real-time expectations.

• Need for job queue, dispatch, processing and, most importantly, self-healing recovery.


The Problem

• I wanted to deploy only Amazon Spot Instances to save money.

• However, Spot instances can come and go, and can be terminated mid-job at any time due to price or availability changes.

• I also wanted full horizontal scaling with queue-based communications.

• Any render job that did not finish must be rapidly identified and requeued.

• However, geo-distribution makes traditional requeue software moot.


Solution

• Use Redis for the queue by creating work queue keys.

• Place to-be-done work in the “Messages” key.

• Have each render server pick up work and move the job ID to the “In Work” queue.

• Created an outboard MongoDB system to capture all system and application logs.

• A separate process intelligently watches the MongoDB system and interprets program alerts, system messages, warnings and errors.

• The “watching” processor requeues render jobs before they fail.
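The flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system presented here: the two queues are modeled in memory rather than in Redis, and the names `RenderQueue`, `watch`, `ALERT_LEVELS` and the log record fields `worker` and `level` are hypothetical. Against a real Redis server, the claim step would typically be an atomic list move such as `RPOPLPUSH` (or `LMOVE` in newer Redis versions), and the log records would come from MongoDB.

```python
from collections import deque


class RenderQueue:
    """In-memory stand-in for the "Messages" and "In Work" keys."""

    def __init__(self):
        self.messages = deque()  # job IDs waiting to be rendered
        self.in_work = {}        # job ID -> worker that claimed it

    def submit(self, job_id):
        self.messages.append(job_id)

    def claim(self, worker_id):
        """Move the oldest pending job to In Work; return it, or None if idle."""
        if not self.messages:
            return None
        job_id = self.messages.popleft()
        self.in_work[job_id] = worker_id
        return job_id

    def complete(self, job_id):
        self.in_work.pop(job_id, None)

    def requeue_worker(self, worker_id):
        """Put every job held by a suspect worker back on the Messages queue."""
        lost = [job for job, owner in self.in_work.items() if owner == worker_id]
        for job_id in lost:
            del self.in_work[job_id]
            self.messages.appendleft(job_id)  # front of the queue: retried first
        return lost


# Assumed severity labels; the real log schema is not given in the slides.
ALERT_LEVELS = {"warning", "error", "critical"}


def watch(queue, log_entries):
    """Scan log records and requeue jobs of any worker that reports trouble."""
    requeued = []
    for entry in log_entries:
        if entry["level"] in ALERT_LEVELS:
            requeued.extend(queue.requeue_worker(entry["worker"]))
    return requeued
```

A short usage example: submit three jobs, let two workers each claim one, then feed `watch` a log record with level `error` for one of the workers. That worker's job returns to the front of the pending queue while the other worker's job stays in work.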


Progress

• After hundreds of thousands of submitted render jobs, only 2 failures.

• Failures were unrelated to the “chaos” of the environment or Spot instances.

• Real-time analysis of MongoDB logs catches and closes security holes and system errors, and provides data and user response-time information.

• Huge savings in AWS costs by being able to use ANY Availability Zone worldwide and spin instances up or down at will, without regard to workload.

• System “self-heals” as it scales.


Challenges / Choices

• AWS's hosted Redis is ElastiCache.

• Accessible seamlessly across Availability Zones, but not across Regions.

• Auto Scaling groups for Spot instances take time to “spin up.”

• CloudWatch metrics may not fit your workload for rapid spin-up.
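One possible response to the last point is to size the render fleet from the queue backlog itself rather than from a generic host metric. The slides do not say how scaling decisions were made, so the sketch below is purely an assumption: the function `desired_workers`, its parameters and the default limits are all hypothetical.

```python
import math


def desired_workers(queue_depth, jobs_per_worker_per_min, target_drain_min,
                    min_workers=1, max_workers=50):
    """Estimate how many render workers would drain the backlog in time.

    All inputs are illustrative: queue_depth is the number of pending jobs,
    jobs_per_worker_per_min is an assumed per-worker throughput, and
    target_drain_min is how quickly the backlog should be cleared.
    """
    if queue_depth <= 0:
        return min_workers
    capacity_per_worker = jobs_per_worker_per_min * target_drain_min
    needed = math.ceil(queue_depth / capacity_per_worker)
    # Clamp to the allowed fleet size.
    return max(min_workers, min(max_workers, needed))
```

For example, with 120 pending jobs, 2 jobs per worker per minute and a 5-minute drain target, each worker can clear 10 jobs in the window, so the estimate is 12 workers. Because instances take time to spin up, a real target would likely need extra headroom on top of this figure.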