HERDING CATS TO A FIREFIGHT THE EVOLUTION OF AN ENGINEERING ON-CALL TEAM G. CHANG @GREYSCHALE EURUKO 2016

Voxxed Days Thesaloniki 2016 - Herding cats to a firefight

HERDING CATS TO A FIREFIGHT

THE EVOLUTION OF AN ENGINEERING ON-CALL TEAM

G. CHANG @GREYSCHALE

EURUKO 2016

THE YEAR 1 B.C. (BEFORE CATS)

In the beginning,

there was only

darkness.

But suddenly,

out of the darkness,

there came a sound...

(pager noises)

One person was on-call. All day. And night. Every day. Every week. Forever.

(not really, but close enough)

Why not start having a rotation?

"We don't need no stinkin' on-call rotation!"

Bullshit.

"Hi, sorry to be calling at this hour. I'm from Yammer, I work with _____. Can I please speak with him?"

Date: Friday, xxth of XXX, 2013
Time: 03:00 AM GMT -0800

THE YEAR 1 A.D. (AFTER DISASTER)

How to maths?!?!

• Given:

• Given:

• Given: (1 + 15 ± 5 ) * 2

• ((4 / 1) * ((1 + 15 ± 5) * 2)) = ???

Answer:

How to acronyms?!?!

MTBF MTTR AAR SLA

• MTBF: Mean Time Between Failures

• MTTR: Mean Time To Recovery

• SLA: Service Level Agreement

• AAR: After Action Review

• IR: Incident Report

• OMGWTFBBQAFK
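
A minimal sketch of how the first two acronyms could be computed from an incident log. The `Incident` record and the sample timestamps are hypothetical, not data from the talk. It assumes MTTR is the mean incident duration and MTBF is the mean uptime between the end of one incident and the start of the next; other definitions exist.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record: when the failure started and when service recovered.
    started: datetime
    recovered: datetime


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time To Recovery: average duration of an incident."""
    total = sum((i.recovered - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents: list[Incident]) -> timedelta:
    """Mean Time Between Failures: average uptime between consecutive incidents.

    Needs at least two incidents, otherwise there is no gap to average.
    """
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [b.started - a.recovered for a, b in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Made-up sample log, for illustration only.
log = [
    Incident(datetime(2013, 1, 1, 3, 0), datetime(2013, 1, 1, 3, 30)),
    Incident(datetime(2013, 1, 3, 3, 30), datetime(2013, 1, 3, 4, 30)),
    Incident(datetime(2013, 1, 6, 4, 30), datetime(2013, 1, 6, 5, 0)),
]
print("MTTR:", mttr(log))  # 40 minutes for this sample
print("MTBF:", mtbf(log))  # 2.5 days for this sample
```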

MTBF

• less frequent

• requires more stable systems

• engineers interrupted less often

• possibly more disastrous issues

MTTR

• faster recovery

• needs good response training

• engineers gain broad knowledge

• possibly more frequent issues

• Google Docs Forms ❎ (hard to read reports)

• Yammer Notes ❎ (hard to analyse)

• JIRA ✅ (not perfect... but sort of works)

Hey, we're starting to get this!...

Actually, not yet.

• System grows faster than we can learn about it

• Silos appear when you don't share knowledge

• Who's cleaning up this mess, anyway?

• Burnout is real

THE RENAISSANCE (GROWING PAINS)

Do more by doing less

• Split responsibilities by stack

• Added London office for follow-the-sun coverage

• Onboard everybody to the process

• Practice, practice, practice

All hands on deck

• Keep all alerts in a configuration repo

• Managers aren't doing anything, anyway -- make them Incident Managers!

• Runbooks, runbooks everywhere (and a unified one)

• Make the initial response as simple as possible
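
One way to picture alerts kept in a configuration repo, each tied to a runbook, is a small check that could run before a change is merged. This is only a sketch: the `AlertRule` fields, the alert names, and the runbook paths are all hypothetical, and a real repo would follow the schema of whatever monitoring tool it uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    # All fields are hypothetical, chosen for illustration.
    name: str
    service: str
    owner_team: str
    runbook: str  # path or link to the response steps


def validate(rules: list[AlertRule]) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    seen = set()
    for rule in rules:
        if rule.name in seen:
            problems.append(f"{rule.name}: duplicate alert name")
        seen.add(rule.name)
        if not rule.runbook.strip():
            problems.append(f"{rule.name}: no runbook")
        if not rule.owner_team.strip():
            problems.append(f"{rule.name}: no owning team")
    return problems


# Made-up rules: the second one is missing both a runbook and an owner.
rules = [
    AlertRule("api-5xx-rate", "api", "backend", "runbooks/api-5xx.md"),
    AlertRule("queue-depth", "workers", "", ""),
]
for problem in validate(rules):
    print(problem)
```

Rejecting an alert that has no runbook or owner keeps the initial response simple: whoever is paged always has a first step to follow and a team to escalate to.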

BACK TO THE FUTURE (THE PRESENT)

Combined schedules

• Fewer rotations

• Team is unified, so schedules should be too

Post-mortems and retrospectives

• What? Where? Who? Why? How?

• NO blame game

Weekly hand-overs and monthly reviews

• Previous week engineers to current week engineers

• Track top alerts and resolutions (or lack of)

• Focus on the noisiest services

• Timezones are hard
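
Tracking top alerts for a hand-over can be as simple as counting pages per service. The sketch below assumes a hypothetical export of the week's pages as `(service, was_resolved)` pairs; the service names and numbers are made up.

```python
from collections import Counter

# Hypothetical week of pages as (service, was_resolved) pairs.
pages = [
    ("search", True), ("search", False), ("search", False),
    ("feeds", True), ("uploads", True), ("feeds", False),
]


def weekly_summary(pages):
    """Return (service, alerts fired, alerts unresolved), noisiest service first."""
    fired = Counter(service for service, _ in pages)
    unresolved = Counter(service for service, resolved in pages if not resolved)
    return [(svc, count, unresolved[svc]) for svc, count in fired.most_common()]


for service, count, open_count in weekly_summary(pages):
    print(f"{service}: {count} alerts, {open_count} unresolved")
```

Sorting by volume puts the noisiest services at the top of the hand-over, which is where the review would focus first.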

Bi-monthly surveys

• Summarise overall preparedness

• Make sure we're improving

• ...and that nobody is actually burned out

Fix ALL the alerts

• Noisy

• Flaky

• Real

WHERE ARE THE CATS NOW?!

The end game

• 1 alert per person per day

• Service owners are on-call for those services

• The world is full of kittens!

Isn't on-call just for Ops?

• No

• Responsibility for our code

• Pride in our code

• No pain, no gain

After all...

we are all cats being herded.

THANK YOU

@GREYSCHALE