Embracing Failure: Fault Injection and Service Resilience at Netflix
Josh Evans, Director of Operations Engineering, Netflix



Page 1

Embracing Failure: Fault Injection and Service Resilience at Netflix

Josh Evans, Director of Operations Engineering, Netflix

Page 2

Josh Evans
● 24 years in technology
● Tech support, tools, test automation, IT & QA management
● Time at Netflix: ~15 years
● Ecommerce, streaming, tools, services, operations
● Current role: Director of Operations Engineering

Page 3

• 48 million members, 41 countries
• > 1 billion hours per month
• > 1000 device types
• 3 AWS regions, hundreds of services
• Hundreds of thousands of requests/second
• Partner-provided services (Xbox Live, PSN)
• CDN serving petabytes of data at terabits/second

Netflix Ecosystem

(Diagram: service partners, static content on Akamai, the Netflix CDN, and the AWS/Netflix control plane, reached via ISPs.)

Page 4

Availability vs. Rate of Change

(Chart: availability in nines, 0 to 6, on the y-axis; rate of change, 1 to 1000 on a log scale, on the x-axis.)

Our focus is on quality and velocity.

Downtime allowed per year at each availability level:

  99.9999%   31.5 seconds
  99.999%    5.26 minutes
  99.99%     52.56 minutes
  99.9%      8.76 hours
  99%        3.65 days
  90%        36.5 days
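The downtime column is just the shortfall from 100% applied to a year; a quick sketch of the arithmetic:

```python
# Downtime per year implied by an availability target.
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

for availability in (0.90, 0.99, 0.999, 0.9999, 0.99999, 0.999999):
    downtime_s = (1 - availability) * SECONDS_PER_YEAR
    print(f"{availability:.4%}: {downtime_s / 60:.2f} minutes/year")
```

For example, 99.99% leaves 0.0001 × 31,536,000 s = 3,153.6 s ≈ 52.56 minutes per year.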

Page 5

Availability vs. Rate of Change

(Same chart as above.)

We seek 99.99% availability for starts, i.e. about 53 minutes of downtime per year.

Page 6

Availability vs. Rate of Change

(Same chart as above.)

Our goal is to shift the curve through:
● Engineering
● Operations
● Best practices
● Continuous improvement

Page 7

Availability means that members can:
● sign up
● activate a device
● browse
● watch

Page 8

What keeps us up at night

Page 9

Failures happen all the time
• Disks fail
• Power goes out, and your generator fails
• Software bugs
• Human error
• Failure is unavoidable

Page 10

We design for failure
• Exception handling
• Auto-scaling clusters
• Redundancy
• Fault tolerance and isolation
• Fallbacks and degraded experiences
• Protect the customer from failures
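As an illustration of the "fallbacks and degraded experiences" bullet, a minimal sketch (the client, cache, and default list are hypothetical):

```python
# Hypothetical fallback chain: personalized -> cached -> generic default.
POPULAR_TITLES = ["popular-1", "popular-2"]  # precomputed, non-personalized

def get_recommendations(user_id, recs_client, cache):
    """Protect the customer: degrade gracefully instead of erroring."""
    try:
        return recs_client.personalized(user_id)    # primary path
    except Exception:
        cached = cache.get(("recs", user_id))       # fallback: last good result
        return cached if cached is not None else POPULAR_TITLES
```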

Page 11

Is that enough?

Page 12

No.

• How do we know if we've succeeded?
• Does the system work as designed?
• Is it as resilient as we believe?
• How do we prevent drifting into failure?

Page 13

We test for failure
• Unit testing
• Integration testing
• Stress/load testing
• Simulation matrices

Page 14

Testing increases confidence, but… is that enough?

Page 15

Testing distributed systems is hard
• Massive, changing data sets
• Web-scale traffic
• Complex interactions and information flows
• Asynchronous requests
• 3rd-party services
• All while innovating and improving our service

Page 16

What if we regularly inject failures into our systems under controlled circumstances?

Page 17

Embracing Failure in Production
• Don't wait for random failures
• Cause failure to validate resiliency
• Test design assumptions by stressing them
• Remove uncertainty by forcing failures regularly

Page 18

Page 19

Two Key Concepts

Page 20

Auto-Scaling
● Virtual instance clusters that scale and shrink with traffic
● Reactive and predictive mechanisms
● Auto-replacement of bad instances
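As a concrete illustration of the reactive mechanism, a minimal sizing policy might look like this (thresholds and names are hypothetical, not Netflix's actual policy):

```python
import math

# Hypothetical reactive policy: size the cluster from the request rate.
def desired_instance_count(requests_per_sec: float,
                           target_rps_per_instance: float = 1000.0,
                           min_count: int = 2,
                           max_count: int = 500) -> int:
    """Instances needed to keep per-instance load near the target,
    clamped so the cluster never shrinks below a safe floor."""
    needed = math.ceil(requests_per_sec / target_rps_per_instance)
    return max(min_count, min(max_count, needed))
```

In practice the input is a smoothed metric, and scale-down is more conservative than scale-up to avoid flapping.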

Page 21

Circuit Breakers
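The pattern: wrap calls to a dependency, trip open after repeated failures, fail fast to a fallback, and probe again after a cooldown. Netflix's open-source Hystrix library implements this in production; here is a minimal Python sketch of the idea:

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: open after repeated failures, fail fast
    while open, then allow a probe again after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                return fallback()           # open circuit: fail fast
            self.opened_at = None           # half-open: allow one probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()               # degrade instead of cascading
        self.failures = 0
        return result
```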

Page 22

An Instance Fails (Chaos Monkey)
• A monkey loose in your DC
• Run during business hours
• Instances fail all the time

What we learned:
– State is problematic
– Auto-replacement works
– Surviving a single instance failure is not enough
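In spirit, the monkey's core loop is tiny (a toy sketch; the real tool, part of the open-source Simian Army, adds schedules, opt-outs, and safety checks):

```python
import random

def release_the_monkey(clusters, terminate):
    """Toy Chaos Monkey: terminate one random instance per cluster.

    `clusters` maps cluster name -> list of instance ids; `terminate`
    is a callback that actually kills an instance (e.g. a cloud API call).
    """
    for name, instances in clusters.items():
        if len(instances) > 1:  # leave single-instance clusters alone
            victim = random.choice(instances)
            print(f"Chaos Monkey: terminating {victim} in {name}")
            terminate(victim)
```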

Page 23

A Data Center Fails (Chaos Gorilla)
Simulate an availability zone outage:
• 3-zone configuration
• Eliminate one zone
• Ensure that the others can handle the load and nothing breaks

Page 24

What we encountered

Page 25

What we learned
• Large-scale events are hard to simulate
  – Hundreds of clusters, thousands of instances
• Rapidly shifting traffic is error-prone
  – LBs must expire connections quickly
  – Lingering connections to caches must be addressed
  – Not all clusters pre-scaled for additional load

Page 26

What we learned
• Hidden assumptions & configurations
  – Some apps not configured for cross-zone calls
  – Mismatched timeouts: fallbacks prevented fail-over
  – REST client "preservation mode" prevented fail-over
• Cassandra works as expected

Page 27

Regrouping
• From zone outage to zone evacuation
  – Carefully deregistered instances
  – Staged traffic shifts
• Resuming true outage simulations soon
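A hedged sketch of what the "staged traffic shifts" above can look like (stage fractions and helper names are hypothetical):

```python
# Hypothetical staged zone evacuation: reduce the zone's traffic share
# in steps, checking health before going further.
STAGES = [0.75, 0.50, 0.25, 0.0]   # fraction of normal traffic left in zone

def evacuate_zone(zone, set_zone_weight, error_rate_ok):
    for fraction in STAGES:
        set_zone_weight(zone, fraction)   # e.g. LB weights, discovery status
        if not error_rate_ok():
            set_zone_weight(zone, 1.0)    # roll back if errors spike
            raise RuntimeError(f"evacuation of {zone} aborted")
```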

Page 28

Regions Fail (Chaos Kong)

(Diagram: a customer device reaches geo-located regional load balancers; within each region, Zuul handles traffic shaping/routing in front of three availability zones (AZ1-AZ3), each with its own data. Two such regions are shown; Chaos Kong simulates the failure of an entire region.)
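A toy sketch of the failover idea in the diagram (region names and the health check are hypothetical):

```python
# Hypothetical region failover: prefer the client's nearest region,
# fall back to any healthy one when Chaos Kong takes a region down.
REGION_PREFERENCE = {
    "us-east-client": ["us-east-1", "us-west-2"],
    "us-west-client": ["us-west-2", "us-east-1"],
}

def pick_region(client_geo, healthy_regions):
    for region in REGION_PREFERENCE.get(client_geo, []):
        if region in healthy_regions:
            return region
    raise RuntimeError("no healthy region available")
```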

Page 29

What we learned
• It works!
• Disable predictive auto-scaling
• Use instance counts from the previous day

Page 30

Room for Improvement
• Not a true regional outage simulation
  – Staged migration
  – No "split brain"

Page 31

Not everything fails completely (Latency Monkey)
Simulate latent service calls:
• Inject arbitrary latency and errors at the service level
• Observe the effects

Page 32

Service Architecture

(Diagram: Device → Internet → ELB → Zuul → Edge → Services A, B, and C, all running in AWS.)

Page 33

Latency Monkey

(Same service-architecture diagram, with faults injected at the service level.)

• Server-side URI filters
  – All requests
  – URI pattern match
  – Percentage of requests
• Arbitrary delays or responses
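A minimal sketch of such a server-side filter (class and method names are hypothetical, not Latency Monkey's actual code):

```python
import random
import re
import time

class LatencyInjectionFilter:
    """Hypothetical filter in the spirit of Latency Monkey: for requests
    whose URI matches a pattern, delay or fail a configured percentage."""

    def __init__(self, uri_pattern, percent, delay_s=0.0, error_status=None):
        self.uri_re = re.compile(uri_pattern)
        self.percent = percent            # 0-100: share of matching requests
        self.delay_s = delay_s            # injected latency in seconds
        self.error_status = error_status  # e.g. 503, or None for delay only

    def before_request(self, uri):
        """Return a status code to short-circuit the request, else None."""
        if self.uri_re.search(uri) and random.random() * 100 < self.percent:
            if self.delay_s:
                time.sleep(self.delay_s)
            return self.error_status
        return None
```

For example, LatencyInjectionFilter(r"^/recs", percent=5, delay_s=2.0) would delay 5% of matching requests by two seconds.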

Page 34

What we learned
• Startup resiliency is an issue
• Service owners don't know all their dependencies
• Fallbacks can fail too
• Second-order effects are not easily tested
• Dependencies change over time
• A holistic view is necessary
• Some teams opt out

Page 35

Fault Injection Testing (FIT)

(Diagram: the same Device → Internet → ELB → Zuul → Edge → Services A/B/C path. At Zuul, a "device or account override?" check tags matching requests for request-level simulations in the services downstream.)
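A hedged sketch of how such a request-scoped override might be carried downstream (the header name and fields are hypothetical, not FIT's actual protocol):

```python
import json

FIT_HEADER = "X-Fit-Context"  # hypothetical header name

def tag_request(headers, device_id, target_service, delay_ms=0, fail=False):
    """At the edge (e.g. in Zuul), attach a fault context to requests
    from one device or account only."""
    headers[FIT_HEADER] = json.dumps({
        "device_id": device_id,
        "service": target_service,
        "delay_ms": delay_ms,
        "fail": fail,
    })

def injected_fault(headers, my_service):
    """In a downstream service, honor the context only if it targets us."""
    raw = headers.get(FIT_HEADER)
    if raw:
        ctx = json.loads(raw)
        if ctx.get("service") == my_service:
            return ctx
    return None
```

Scoping the failure to a single device or account is what makes continuous testing in production safe.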

Page 36

Benefits
• Confidence building for Latency Monkey testing
• Continuous resilience testing in test and production
• Testing of minimum viable service, fallbacks
• Device resilience evaluation

Page 37

Device Resiliency Matrix

Page 38

Is that it?
• Fault injection isn't enough to catch:
  – Bad code/deployments
  – Configuration mishaps
  – Byzantine failures
  – Memory leaks
  – Performance degradation

Page 39

A Multi-Faceted Approach
• Continuous Build, Delivery, Deployment
• Test Environments, Infrastructure, Coverage
• Automated Canary Analysis
• Staged Configuration Changes
• Crisis Response Tooling & Operations
• Real-time Analytics, Detection, Alerting
• Operational Insight - Dashboards, Reports
• Performance and Efficiency Engagements
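As one illustration from this list, automated canary analysis compares a new build's metrics against the current build's under the same traffic; a toy sketch (metric names and the threshold are hypothetical):

```python
def canary_passes(canary, baseline, tolerance=0.10):
    """Pass the canary only if each key metric stays within `tolerance`
    (relative) of the baseline cluster serving the same traffic mix."""
    for metric in ("error_rate", "latency_p99_ms", "cpu_util"):
        base = baseline[metric]
        if base > 0 and (canary[metric] - base) / base > tolerance:
            return False                 # canary regressed on this metric
    return True
```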

Page 40

It’s also about people and culture

Page 41

Technical Culture
• You build it, you run it
• Each failure is an opportunity to learn
• Blameless incident reviews
• Commitment to continuous improvement

Page 42

Context and Collaboration

Context engages partners:
• Data and root causes
• Global vs. local
• Urgent vs. important
• Long-term vision

Collaboration yields better solutions and buy-in.

Page 43

The Simian Army is part of the Netflix open source cloud platform: http://netflix.github.com

Page 44

Josh Evans, Director of Operations Engineering, Netflix

[email protected], @josh_evans_nflx