Voxxed Days Thessaloniki 2016 - Herding cats to a firefight

HERDING CATS TO A FIREFIGHT

THE EVOLUTION OF AN ENGINEERING ON-CALL TEAM

G. CHANG @GREYSCHALE

EURUKO 2016

THE YEAR 1 B.C. (BEFORE CATS)

In the beginning,

there was only

darkness.

But suddenly,

out of the darkness,

there came a sound...

(pager noises)

One person was on-call. All day. And night. Every day. Every week. Forever.

(not really, but close enough)

Why not start having a rotation?

"We don't need no stinkin' on-call rotation!"

Bullshit.

"Hi, sorry to be calling at this hour. I'm from Yammer, I work with _____. Can I please speak with him?"

Date: Friday, xxth of XXX, 2013

Time: 03:00 AM GMT -0800

THE YEAR 1 A.D. (AFTER DISASTER)

How to maths?!?!

• Given:

• Given:

• Given: (1 + 15 ± 5 ) * 2

• ((4 / 1) * ((1 + 15 ± 5) * 2)) = ???

Answer:
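Read purely as arithmetic, the expression above can be evaluated directly. The sketch below assumes that ± denotes a symmetric range around the central value; the slide does not say what the individual numbers measure, so the function and variable names are placeholders, not the speaker's terms.

```python
# Assumption: "15 ± 5" means a value anywhere between 10 and 20.
def evaluate(spread):
    """Evaluate ((4 / 1) * ((1 + 15 + spread) * 2)) for a given offset."""
    return (4 / 1) * ((1 + 15 + spread) * 2)

low, mid, high = (evaluate(s) for s in (-5, 0, 5))
print(low, mid, high)  # 88.0 128.0 168.0
```

Under that reading the result is 128 ± 40.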

How to acronyms?!?!

MTBF MTTR AAR SLA

• MTBF: Mean Time Between Failures

• MTTR: Mean Time To Recovery

• SLA: Service Level Agreement

• AAR: After Action Review

• IR: Incident Report

• OMGWTFBBQAFK
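To make the first two acronyms concrete, here is a minimal sketch of how MTTR and MTBF could be computed from an incident log. The incident timestamps are invented for illustration, and MTBF is taken here as the mean gap between one recovery and the next failure, which is one common convention and not necessarily the one the team used.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (failure started, service recovered).
INCIDENTS = [
    (datetime(2013, 1, 1, 3, 0), datetime(2013, 1, 1, 4, 0)),
    (datetime(2013, 1, 3, 3, 0), datetime(2013, 1, 3, 3, 30)),
    (datetime(2013, 1, 6, 3, 30), datetime(2013, 1, 6, 5, 0)),
]

def mttr(incidents):
    """Mean Time To Recovery: average duration of an incident."""
    durations = [end - start for start, end in incidents]
    return sum(durations, timedelta()) / len(durations)

def mtbf(incidents):
    """Mean Time Between Failures: average gap from one recovery
    to the start of the next failure (needs at least two incidents)."""
    ordered = sorted(incidents)
    gaps = [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return sum(gaps, timedelta()) / len(gaps)

print(mttr(INCIDENTS))  # 1:00:00
print(mtbf(INCIDENTS))  # 2 days, 11:30:00
```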

MTBF: less frequent

• requires more stable systems

• engineers interrupted less often

• possibly more disastrous issues

MTTR: faster recovery

• needs good response training

• engineers gain broad knowledge

• possibly more frequent issues

• Google Docs Forms ❎ (hard to read reports)

• Yammer Notes ❎ (hard to analyse)

• JIRA ✅ (not perfect...but sort of works)

Hey, we're starting to get this! ...

Actually, not yet.

• System grows faster than we can learn about it

• Silos appear when you don't share knowledge

• Who's cleaning up this mess, anyway?

• Burnout is real

THE RENAISSANCE (GROWING PAINS)

Do more by doing less

• Split responsibilities by stack

• Added London office for follow-the-sun coverage

• Onboard everybody to the process

• Practice, practice, practice

All hands on deck

• Keep all alerts in a configuration repo

• Managers aren't doing anything, anyway -- make them Incident Managers!

• Runbooks, runbooks everywhere (and a unified one)

• Make the initial response as simple as possible
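One way to read the points above together: if alerts live in a repository as data, a small check can enforce that each one ships with a runbook and a simple first response step. The sketch below is an assumption about what that could look like; the `AlertRule` fields, alert names, conditions, and runbook paths are all hypothetical and do not describe the team's actual tooling.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlertRule:
    name: str        # unique alert identifier
    service: str     # service the alert belongs to
    condition: str   # human-readable trigger condition
    runbook: str     # path to the runbook inside the repo
    first_step: str  # the one thing a responder does first

# Hypothetical alert definitions kept under version control.
ALERTS = [
    AlertRule("api-5xx-rate", "api", "5xx rate > 2% for 5m",
              "runbooks/api-5xx-rate.md", "Check the latest deploy"),
    AlertRule("queue-backlog", "workers", "backlog > 10000 jobs",
              "runbooks/queue-backlog.md", "Check worker process count"),
]

def validate(rules):
    """Return a list of problems; an empty list means the config is acceptable."""
    problems = []
    seen = set()
    for rule in rules:
        if rule.name in seen:
            problems.append(f"{rule.name}: duplicate alert name")
        seen.add(rule.name)
        if not rule.runbook:
            problems.append(f"{rule.name}: missing runbook")
        if not rule.first_step:
            problems.append(f"{rule.name}: missing first response step")
    return problems

print(validate(ALERTS))  # []
```

Running a check like this on every change to the repo would keep new alerts from landing without a runbook.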

BACK TO THE FUTURE (THE PRESENT)

Combined schedules

• Fewer rotations

• Team is unified, so schedules should be too

Post-mortems and retrospectives

• What? Where? Who? Why? How?

• NO blame game

Weekly hand-overs and monthly reviews

• Previous week's engineers hand over to the current week's engineers

• Track top alerts and resolutions (or lack thereof)

• Focus on the noisiest services

• Timezones are hard
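Tracking top alerts for a hand-over can be as simple as counting. The sketch below assumes a week of alerts is available as (service, resolved) pairs; the service names and data are made up for illustration.

```python
from collections import Counter

# Hypothetical week of alerts: (service name, was a resolution recorded?).
WEEK = [
    ("search", True), ("search", False), ("search", True),
    ("feeds", True), ("feeds", False),
    ("auth", True),
]

def noisiest_services(alerts, top=3):
    """Services ranked by number of alerts fired, noisiest first."""
    return Counter(service for service, _ in alerts).most_common(top)

def unresolved_by_service(alerts):
    """Count alerts per service that have no recorded resolution."""
    return Counter(service for service, resolved in alerts if not resolved)

print(noisiest_services(WEEK, top=2))  # [('search', 3), ('feeds', 2)]
print(unresolved_by_service(WEEK))
```

The first list says where to focus; the second shows which alerts were acknowledged but never actually fixed.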

Bi-monthly surveys

• Summarise overall preparedness

• Make sure we're improving

• ...and that nobody is actually burned out

Fix ALL the alerts

• Noisy

• Flaky

• Real

WHERE ARE THE CATS NOW?!

The end game

• 1 alert per person per day

• Service owners are on-call for those services

• The world is full of kittens!

Isn't on-call just for Ops?

• No

• Responsibility for our code

• Pride in our code

• No pain, no gain

After all...

we are all cats being herded.

THANK YOU

@GREYSCHALE
