Embracing Failure (not my life story)
Setting the Mood
• Understand that failures WILL happen
• Failures are not binary
• Impact determines importance
• Deadlines for fixes are variable
Terminology
• Website
• Production
• Downtime
Monitor Failures
What is Monitoring?
• Graphs. Everywhere.
• Alerts on failures
  • phone calls
  • texts
• Answers: Are we failing?
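One way to picture the "Are we failing?" question is a rolling error-rate check. The sketch below is only an illustration and is not from the talk: the class name `FailureMonitor`, the window size, and the 5% threshold are all assumptions. A real setup would feed graphs and trigger the phone calls or texts through whatever paging service is in use.

```python
from collections import deque


class FailureMonitor:
    """Hypothetical monitor: tracks recent outcomes and answers 'are we failing?'."""

    def __init__(self, window=100, threshold=0.05):
        # Keep only the most recent `window` outcomes (True means the request failed).
        self.outcomes = deque(maxlen=window)
        # Assumed alerting threshold: the fraction of failures we tolerate.
        self.threshold = threshold

    def record(self, failed):
        self.outcomes.append(bool(failed))

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def is_failing(self):
        # Failures are not binary: alert only once the rate exceeds the threshold.
        return self.error_rate() > self.threshold
```

The threshold reflects the earlier point that impact determines importance: a single failed request does not page anyone, but a sustained failure rate does.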
Postmortems (fool me once, shame on you. fool me twice, shame on me.)
Postmortems
1. Reconstruct the factual timeline
2. Root cause analysis
3. Remediation items
Postmortems
• Why did we fail?
• Blameless
• Moderated
Gamedays (You wouldn’t wing a talk. Don’t wing a hot fix.)
Gameday
• Best defense is a good offense
• Simulate possible failures
• Do it in production
kill -9
1. Draw a block diagram
2. Cut every connection
3. Watch the fireworks
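The kill -9 step can be rehearsed on a single machine before trying it in production. The sketch below is an assumption-laden illustration, not the speaker's tooling: it starts a stand-in worker (a sleeping child process), sends it SIGKILL, and reports how it died. The function names are hypothetical, and the code assumes a POSIX system.

```python
import signal
import subprocess
import sys


def kill_and_observe(timeout=10.0):
    """Start a stand-in worker process, SIGKILL it, and return its exit code."""
    # Hypothetical "service": a child Python process that just sleeps.
    worker = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # Equivalent of `kill -9 <pid>`: SIGKILL cannot be caught or handled.
    worker.send_signal(signal.SIGKILL)
    # On POSIX, a process ended by signal N reports return code -N.
    return worker.wait(timeout=timeout)


def died_from_sigkill(returncode):
    """True if the return code shows the process was ended by SIGKILL."""
    return returncode == -signal.SIGKILL
```

In a real gameday the interesting part is what happens around the killed process: whether monitoring notices, and whether the components on the other side of each cut connection cope.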
SafeMachine (like a state machine … but safer)
Try, Try, Try again
• What if we could just retry failures?
• Side effects are the root of all evil
• Safe failures vs Unsafe failures
What’s in a SafeMachine
• Actions
• States
[Diagram: states START -> Computed File -> Uploaded File -> END, connected by the actions compute, upload, record]
[Diagram labels: successful, initialize_succeeded, initialize_failed, initialize_inprogress, computed_succeeded]
[Diagram: START to END through actions a1, a2, a3]
The Pipeline
[Diagram: START -> Computed File -> Uploaded File -> END; the three steps are labeled Safe, Unsafe, Safe, in that order]
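The slides do not show how SafeMachine is implemented, so the following is only a minimal sketch of the idea under stated assumptions: each action moves the pipeline to its next state, actions marked safe (no side effects) are retried on failure, and a failure in an unsafe action stops the run so a human can decide what to do. The names `Action`, `UnsafeFailure`, and `run_pipeline`, and the choice of three attempts, are hypothetical. Treating compute and record as safe and upload as unsafe is a reading of the Safe/Unsafe/Safe labels, not something the talk states outright.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    name: str
    run: Callable[[dict], None]  # mutates the context; raises on failure
    next_state: str              # state reached once the action succeeds
    safe: bool                   # assumed meaning: no side effects, so a retry is harmless


class UnsafeFailure(Exception):
    """Raised when an action that may have had side effects fails."""


def run_pipeline(actions: List[Action], context: dict, max_attempts: int = 3) -> str:
    state = "START"
    for action in actions:
        # Only safe actions get more than one attempt.
        attempts = max_attempts if action.safe else 1
        for attempt in range(1, attempts + 1):
            try:
                action.run(context)
                break
            except Exception as exc:
                if not action.safe:
                    # Do not retry blindly: the side effect may have partly happened.
                    raise UnsafeFailure(
                        f"{action.name} failed in state {state}"
                    ) from exc
                if attempt == attempts:
                    raise  # safe action kept failing; give up
        state = action.next_state
    return state


# Usage: a compute step that fails twice with a transient error, then succeeds.
calls = {"compute": 0}


def compute(ctx):
    calls["compute"] += 1
    if calls["compute"] < 3:
        raise RuntimeError("transient failure")
    ctx["file"] = "computed"


def upload(ctx):
    ctx["uploaded"] = True


def record(ctx):
    ctx["recorded"] = True


pipeline = [
    Action("compute", compute, "Computed File", safe=True),
    Action("upload", upload, "Uploaded File", safe=False),
    Action("record", record, "END", safe=True),
]
final_state = run_pipeline(pipeline, {})
```

Keeping the current state explicit is what makes a resume possible: after an unsafe failure, someone can inspect the state, check whether the side effect happened, and restart from the right step.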
Embracing Failure
• Monitor
• Postmortems
• Gamedays - you wouldn’t wing a talk?
• SafeMachine
@chriswu_
Additional resources
• Postmortems - https://codeascraft.com/2012/05/22/blameless-postmortems/
• Gamedays - https://stripe.com/blog/game-day-exercises-at-stripe
  • links at the bottom of this post are also great
• Error Tracking - https://getsentry.com/welcome/