30
SRECon EU, July 2016 Greg Burek Lessons from Automatic Incident Resolution for a Million Databases

Lessons from Automatic Incident Resolution for a Million Databases

Embed Size (px)

Citation preview

Page 1: Lessons from Automatic Incident Resolution for a Million Databases

SRECon EU, July 2016Greg Burek

Lessons from Automatic Incident Resolution for a Million Databases

Page 2: Lessons from Automatic Incident Resolution for a Million Databases
Page 3: Lessons from Automatic Incident Resolution for a Million Databases

The Twelve-Factor App

Page 4: Lessons from Automatic Incident Resolution for a Million Databases
Page 5: Lessons from Automatic Incident Resolution for a Million Databases
Page 6: Lessons from Automatic Incident Resolution for a Million Databases

Department of Data

Page 7: Lessons from Automatic Incident Resolution for a Million Databases

PostgresqlRedisKafka

Page 8: Lessons from Automatic Incident Resolution for a Million Databases

~ Million Databases

Tens of thousands of AWS Instances

Page 9: Lessons from Automatic Incident Resolution for a Million Databases

Some Databases

Hundreds of AWS Instances

Page 10: Lessons from Automatic Incident Resolution for a Million Databases
Page 11: Lessons from Automatic Incident Resolution for a Million Databases

“The goal is to build systems that can scale

linearly with machines & sub-linearly with people” -

Caitie McCaffreyTackling Alert Fatigue

Page 12: Lessons from Automatic Incident Resolution for a Million Databases

Monitor and alert on your business

Page 13: Lessons from Automatic Incident Resolution for a Million Databases
Page 14: Lessons from Automatic Incident Resolution for a Million Databases

Monitor and alert on your business

Usually, don’t alert on machine specific metrics

Page 15: Lessons from Automatic Incident Resolution for a Million Databases

Write runbooks and playbooks

Page 16: Lessons from Automatic Incident Resolution for a Million Databases

Turn playbooks into code

Page 17: Lessons from Automatic Incident Resolution for a Million Databases
Page 18: Lessons from Automatic Incident Resolution for a Million Databases

“The goal is not to never get paged, the goal is to never get paged for the

same thing twice” - Astrid Atkinson

Engineering for the long game

Page 19: Lessons from Automatic Incident Resolution for a Million Databases

Verify monitoring before restarting the world

Page 20: Lessons from Automatic Incident Resolution for a Million Databases
Page 21: Lessons from Automatic Incident Resolution for a Million Databases

Circuit breakers

Page 22: Lessons from Automatic Incident Resolution for a Million Databases

Automation can’t handle the unknown

Wake someone up on exceptions and timeouts

Page 23: Lessons from Automatic Incident Resolution for a Million Databases
Page 24: Lessons from Automatic Incident Resolution for a Million Databases

Have a REPL/console

Page 25: Lessons from Automatic Incident Resolution for a Million Databases

Aggregate and review trends

Page 26: Lessons from Automatic Incident Resolution for a Million Databases
Page 27: Lessons from Automatic Incident Resolution for a Million Databases

Humans can break

Automation can be simplistic

Humans + Automation for a resilient and operable

system

Page 28: Lessons from Automatic Incident Resolution for a Million Databases

1.Monitor and alert on your business

2.Write playbooks3.Make playbooks into automation4.Checks and balances of

automation5.Circuit breakers6.Alert on exceptions and timeouts7.Admin console8.Aggregate and review trends

Page 29: Lessons from Automatic Incident Resolution for a Million Databases

[email protected]@gregburek

Page 30: Lessons from Automatic Incident Resolution for a Million Databases

State Machines