In recent years it’s become evident that alerting is one of the biggest challenges facing modern Operations Engineers. Conference talks, hallway tracks, meetups, etc. are rife with discussions about poor signal-to-noise in alerts, fatigue from false positives, and a general lack of actionability. Our talk (informed by real-world experience designing, building, and maintaining our distributed, multi-tenant metrics/alerting service) takes a fundamental approach and examines alerting requirements and practices in the abstract. We put forth a comprehensive abstract model with best practices that should be followed and implemented by your team regardless of your tool of choice. This talk is equal parts cultural and technical, encompassing both computational capabilities and social practices, like:
• Defining organizational policy about where and when to set alerts
• Ensuring the on-call engineer is armed with the proper information to take action
• Best practices for configuring an alert
• Fire-fighting after an alert has triggered
• Performing analysis across your organization-wide history of alerts
hi.
@davejosephsen
github: djosephsen
Signal Through the Noise
WAT?
AAAGHHHHH!!!
ALERTS AREN’T FREE
Business Projects
IT Projects
Changes
Unplanned Work
(eeew Comic Sans)
Alerting
Tax the Ammunition
THE CONTENT OF YOUR ALERTS MATTERS
What did he just say?
• Notifications are expensive; they hurt people and productivity
• Make people work harder to send them by requiring run books (see the sketch after this list)
• Run books add context to alerts. Other types of context are awesome too
• Like graphs
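One way to make "run books required" concrete is to refuse alert definitions that arrive without context. This is a minimal, hypothetical Python sketch; the AlertDefinition fields and the example URLs are assumptions for illustration, not part of the talk or any particular monitoring tool.

    from dataclasses import dataclass

    @dataclass
    class AlertDefinition:
        name: str
        query: str          # metric expression the alert watches
        threshold: float
        runbook_url: str    # what the on-call engineer should actually do
        graph_url: str      # where to see the underlying telemetry

    def validate(alert: AlertDefinition) -> None:
        # Refuse to register an alert that would page a human without context.
        missing = [f for f in ("runbook_url", "graph_url") if not getattr(alert, f)]
        if missing:
            raise ValueError(f"{alert.name}: refusing alert without {missing}")

    validate(AlertDefinition(
        name="api_error_rate",
        query="rate(http_5xx[5m])",
        threshold=0.05,
        runbook_url="https://wiki.example.com/runbooks/api_error_rate",
        graph_url="https://graphs.example.com/d/api_error_rate",
    ))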
WHY do we Monitor?
Telemetry Data
Command Signal
1. Identify Operational Limitations
Y < 160 bpm
X < 7m km/h
2. Monitor those limitations
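As a rough illustration of that two-step loop (identify a limit, then monitor against it), here is a tiny hypothetical Python sketch. The heart-rate limit comes from the slide; the names and the interpretation of the speed figure's units are assumptions.

    # Step 1: write the operational limits down explicitly.
    OPERATIONAL_LIMITS = {
        "heart_rate_bpm": 160,      # Y < 160 bpm (from the slide)
        "speed_km_h": 7_000_000,    # X < 7m km/h (units assumed here)
    }

    # Step 2: monitoring is just checking samples against those limits.
    def within_limits(metric: str, value: float) -> bool:
        return value < OPERATIONAL_LIMITS[metric]

    assert within_limits("heart_rate_bpm", 122)
    assert not within_limits("heart_rate_bpm", 175)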
A Balancer ?!
Balancer
>66% Host Availability
% hosts alive VS % IO per instance
(Hint: one of these things measures balancing)
% hosts alive: does not measure balancing
% IO per instance: measures balancing
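A quick hypothetical sketch of why the two metrics differ: with three hosts all alive, "% hosts alive" stays at 100% even when one instance handles nearly all of the IO, while the per-instance IO share exposes the imbalance. The numbers are made up for illustration.

    def pct_hosts_alive(alive: int, total: int) -> float:
        return 100.0 * alive / total

    def io_share_per_instance(io_ops: list[float]) -> list[float]:
        total = sum(io_ops)
        return [100.0 * ops / total for ops in io_ops]

    # Three hosts, all alive, but one does almost all the work:
    io = [980.0, 10.0, 10.0]
    print(pct_hosts_alive(alive=3, total=3))   # 100.0 -> ">66% host availability" looks fine
    print(io_share_per_instance(io))           # [98.0, 1.0, 1.0] -> clearly not balanced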
IT Monitoring != Feedback
Feedback != some silly balancer
WE CAN REDUCE ALERTS BY IMPROVING OUR TELEMETRY SIGNAL
What did he just say?
• Monitoring isn't a thing. It's just part of the engineering process
• We're treating it like a thing that only some types of engineers might want to do, and that's giving us broken feedback
• Aerospace engineers are rad; they don't do that
• Fix your monitoring and your alerts will follow
Own YOUR problem
Some Graph in the War Room
WHAT YOU MONITOR MATTERS
(diagram: metrics x_k from components a, b, and c, each checked against an operational limit)
EVERYBODY OWNS MONITORING
What did he just say?
• Choose metrics that tell you about the things you care about
• Alert when the things you care about hit limits you understand
• All alerts < critical go to chat rooms, ticket systems, or dashboards
• Critical alerts use an automated escalation service that enforces on-call policy
• Escalated alerts require acknowledgement
• Escalated alerts require run book URLs and/or links to graphs of the metric (see the routing sketch after this list)
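The routing policy in the bullets above could be enforced mechanically. This is a hedged sketch assuming a severity enum and a placeholder escalation service; none of the names are real APIs.

    from enum import IntEnum

    class Severity(IntEnum):
        INFO = 1
        WARNING = 2
        CRITICAL = 3

    def route_alert(severity: Severity, message: str, runbook_url: str = "") -> str:
        if severity < Severity.CRITICAL:
            # Below critical: low-noise destinations humans read on their own schedule.
            return f"chatroom/ticket/dashboard <- {message}"
        if not runbook_url:
            # Critical pages must carry context before they are allowed to wake anyone.
            raise ValueError("critical alerts require a run book URL or graph link")
        # Placeholder for an escalation service that pages on-call and requires acknowledgement.
        return f"escalation-service(ack required) <- {message} ({runbook_url})"

    print(route_alert(Severity.WARNING, "disk 80% full"))
    print(route_alert(Severity.CRITICAL, "api error rate > 5%",
                      runbook_url="https://wiki.example.com/runbooks/api_error_rate"))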
ALERT ON WHAT YOU SEE
EVERYONE OWNS ALERTS (and dashboards)
The Ultimate Recap
• Enforce a notification policy that requires context
• Make monitoring an engineering process
• Use the same signal for all metrics introspection and notification
• Encourage everyone to rely on telemetry data (graphs or it didn't happen!)
• Everyone who collects a metric gets keys to dashboard and alert design
Questions?
Office Hours: 1:15pm