Embracing Failure - chriswu.me/talks/embracing-failure.pdf


Embracing Failure (not my life story)

Setting the Mood

•Understand that failures WILL happen
•Failures are not binary
•Impact determines importance
•Deadlines for fixes are variable

Terminology

•Website
•Production
•Downtime

Monitor Failures

What is Monitoring?

•Graphs. Everywhere.
•Alerts on failures
  •phone calls
  •texts
•Answers: Are we failing?
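
A minimal sketch of "alerts on failures", in Python. It is not from the talk: `run_checks`, `check`, `alert` and `threshold` are hypothetical names, and a real monitor would sleep between checks and page someone instead of calling a plain function. The idea is to alert once per outage, after several failed checks in a row, so a single blip does not wake anyone up.

```python
from typing import Callable


def run_checks(check: Callable[[], bool],
               alert: Callable[[str], None],
               attempts: int,
               threshold: int = 3) -> int:
    """Run `check` `attempts` times and return how many alerts were sent.

    An alert fires once when `threshold` consecutive checks have failed,
    and not again until a check succeeds and the streak resets.
    """
    consecutive = 0
    alerts = 0
    for _ in range(attempts):
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if ok:
            consecutive = 0
            continue
        consecutive += 1
        if consecutive == threshold:
            alert(f"{threshold} consecutive failed checks")
            alerts += 1
    return alerts
```

Here `check` could wrap an HTTP request to the production website and `alert` could send a text or place a phone call.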

healthcare.gov

•Know when you’re down before CNN

Postmortems (fool me once. shame on you. fool me twice. shame on me.)

Postmortems

1. Reconstruct the factual timeline

2. Root cause analysis

3. Remediation items

Postmortems

•Why did we fail?
•Blameless
•Moderated

Gamedays (You wouldn’t wing a talk. Don’t wing a hot fix)

Gameday

•Best defense is a good offense

•Simulate possible failures
•Do it in production

kill -9

1. Draw a block diagram

2. Cut every connection

3. Watch the fireworks
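
The three steps above can be rehearsed on paper before anyone runs kill -9. The Python sketch below is an assumption-laden planning aid, not the gameday itself: the block diagram is a list of hypothetical connections (`web`, `api`, `db`, `cache` are made-up names), and `cut_every_connection` removes one connection at a time and reports which blocks the entry point can no longer reach.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (caller, callee) in the block diagram


def reachable(edges: List[Edge], start: str) -> Set[str]:
    """Breadth-first search: every block reachable from `start`."""
    graph: Dict[str, List[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def cut_every_connection(edges: List[Edge], entry: str) -> Dict[Edge, Set[str]]:
    """For each connection, list the blocks lost when only that one is cut."""
    baseline = reachable(edges, entry)
    report: Dict[Edge, Set[str]] = {}
    for i, edge in enumerate(edges):
        remaining = edges[:i] + edges[i + 1:]
        report[edge] = baseline - reachable(remaining, entry)
    return report


# Hypothetical block diagram, for illustration only.
DIAGRAM = [("web", "api"), ("api", "db"), ("api", "cache"), ("web", "cache")]

if __name__ == "__main__":
    for edge, lost in cut_every_connection(DIAGRAM, "web").items():
        print(edge, "->", sorted(lost) or "nothing lost")
```

Connections whose cut loses nothing have a redundant path; the others are the ones worth killing first in a real gameday.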

SafeMachine (like a state machine … but safer)

Try, Try, Try again

•What if we could just retry failures?
•Side effects are the root of all evil
•Safe failures vs Unsafe failures
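
One way to read "safe failures vs unsafe failures", sketched in Python. The slides do not define the terms, so this assumes a safe failure is one that happened before any side effect (retrying is harmless) and an unsafe failure is one that may have left a partial side effect. `SafeFailure`, `UnsafeFailure` and `retry_safe` are hypothetical names.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class SafeFailure(Exception):
    """Assumed meaning: nothing was changed, so retrying is harmless."""


class UnsafeFailure(Exception):
    """Assumed meaning: a side effect may have partially happened."""


def retry_safe(action: Callable[[], T], max_attempts: int = 3) -> T:
    """Retry `action` on SafeFailure only; anything else propagates at once."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except SafeFailure:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last safe failure
    raise AssertionError("unreachable")
```

An `UnsafeFailure` is deliberately not caught: retrying it blindly could repeat the side effect.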

What’s in a SafeMachine

•Actions •States

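[Diagram: states START, Computed File, Uploaded File, END; actions compute, upload, record; label: successful]

[Diagram: state names initialize_succeeded, initialize_failed, initialize_inprogress, computed_succeeded]

[Diagram: START, then actions a1 (shown twice), a2 (shown three times), a3 (shown three times), then END]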

The Pipeline


[Diagram: START, Computed File, Uploaded File, END; the three transitions between them are labeled Safe, Unsafe, Safe]
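
The slides do not show how SafeMachine is implemented, so the Python below is only a guess at the shape of the idea. Assumptions: each action records a status (inprogress, succeeded, failed, echoing state names like initialize_inprogress) in a store that survives crashes; safe actions are retried; an unsafe action that was interrupted or failed is handed to a human. `run_pipeline`, `NeedsHuman` and the dict-based store are hypothetical, as is the mapping of compute, upload, record to Safe, Unsafe, Safe.

```python
from typing import Callable, Dict, List, Tuple

# (action name, function to run, is the action safe to retry?)
Step = Tuple[str, Callable[[], None], bool]


class NeedsHuman(Exception):
    """An unsafe action may have partially run; do not retry automatically."""


def run_pipeline(steps: List[Step], store: Dict[str, str],
                 max_attempts: int = 3) -> None:
    """Run or resume a pipeline. `store` maps action name to its last status."""
    for name, func, is_safe in steps:
        status = store.get(name)
        if status == "succeeded":
            continue  # finished in an earlier run, skip it
        if status is not None and not is_safe:
            raise NeedsHuman(f"{name} was left {status}")
        attempts = max_attempts if is_safe else 1
        for _ in range(attempts):
            store[name] = "inprogress"  # recorded before the side effect
            try:
                func()
            except Exception as exc:
                store[name] = "failed"
                if not is_safe:
                    raise NeedsHuman(f"{name} failed") from exc
                continue  # safe action: try again
            store[name] = "succeeded"
            break
        else:
            raise RuntimeError(f"{name} still failing after {attempts} attempts")
```

Because the status is written before and after every action, a process restarted after kill -9 can call `run_pipeline` again with the same store and pick up where it stopped.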

Embracing Failure

•Monitor
•Postmortems
•Gamedays - you wouldn’t wing a talk?
•SafeMachine

@chriswu_

Additional resources

• Postmortems - https://codeascraft.com/2012/05/22/blameless-postmortems/

• Gamedays - https://stripe.com/blog/game-day-exercises-at-stripe

  • links at the bottom of this post are also great

• Error Tracking - https://getsentry.com/welcome/
