Velocity NY 2014: Signal through the noise

DESCRIPTION

In recent years it has become evident that alerting is one of the biggest challenges facing modern operations engineers. Conference talks, hallway tracks, and meetups are rife with discussions about poor signal-to-noise in alerts, fatigue from false positives, and a general lack of actionability. Our talk (informed by real-world experience designing, building, and maintaining our distributed, multi-tenant metrics/alerting service) takes a fundamental approach and examines alerting requirements and practices in the abstract. We put forth a comprehensive abstract model, with best practices that your team should follow regardless of your tool of choice. This talk is equal parts cultural and technical, encompassing both computational capabilities and social practices, such as:

•Defining organizational policy about where and when to set alerts

•Ensuring the on-call engineer is armed with the proper information to take action

•Best practices for configuring an alert

•Fire-fighting after an alert has triggered

•Performing analysis across your organization-wide history of alerts

Page 1: Velocity NY 2014: Signal through the noise

hi.

github: djosephsen

Page 2: Velocity NY 2014: Signal through the noise

[email protected]

@davejosephsen

github: djosephsen

How We Computer

Page 5: Velocity NY 2014: Signal through the noise

Signal Through the Noise

Page 12: Velocity NY 2014: Signal through the noise

WAT?

Page 14: Velocity NY 2014: Signal through the noise

AAAGHHHHH!!!

Page 15: Velocity NY 2014: Signal through the noise

ALERTS AREN’T FREE

Page 16: Velocity NY 2014: Signal through the noise

Business Projects

IT Projects

Changes

Unplanned Work

Page 17: Velocity NY 2014: Signal through the noise

Unplanned Work

(eeew Comic Sans)

Page 20: Velocity NY 2014: Signal through the noise

Alerting

Page 21: Velocity NY 2014: Signal through the noise

Tax the Ammunition

Page 25: Velocity NY 2014: Signal through the noise

THE CONTENT OF YOUR ALERTS MATTERS

Page 26: Velocity NY 2014: Signal through the noise

What did he just say?

•Notifications are expensive: they hurt people and productivity

•Make people work harder to send them by requiring run books

•Run books add context to alerts. Other types of context, like graphs, are awesome too
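
One way to read "make people work harder to send them" is a check that refuses to let an alert notify anyone unless it carries context. The following is a minimal sketch under that assumption. The `AlertRule` record, its field names, and `missing_context` are hypothetical and do not come from the talk or from any particular tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertRule:
    """Hypothetical alert definition; field names are illustrative only."""
    name: str
    notifies_humans: bool
    runbook_url: Optional[str] = None
    graph_url: Optional[str] = None  # optional extra context, e.g. a graph


def missing_context(rule: AlertRule) -> List[str]:
    """Return the reasons a rule may not notify; an empty list means it passes."""
    problems = []
    # Assumed policy: anything that notifies a person must link a run book.
    if rule.notifies_humans and not rule.runbook_url:
        problems.append("no run book")
    return problems
```

A rule that only feeds a dashboard passes without a run book; the cost is paid only by rules that interrupt people.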

Page 27: Velocity NY 2014: Signal through the noise

WHY do we Monitor?

Page 32: Velocity NY 2014: Signal through the noise

Telemetry Data

Command Signal

Page 34: Velocity NY 2014: Signal through the noise

1. Identify Operational Limitations

Y < 160 bpm

X < 7m km/h

Page 35: Velocity NY 2014: Signal through the noise

1. Identify Operational Limitations

2. Monitor those limitations

Y < 160 bpm

X < 7m km/h
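
The two steps above can be sketched as a table of limits plus a check of each telemetry sample against it. This is only an illustration: the 160 bpm bound is taken from the slide, while the metric name `heart_rate_bpm`, the `LIMITS` table, and the `breached` function are assumptions made for the example.

```python
from typing import Dict, List

# Step 1: identify operational limitations (metric name -> upper bound).
# The 160 bpm figure is from the slide; the metric name is invented here.
LIMITS: Dict[str, float] = {"heart_rate_bpm": 160.0}


def breached(sample: Dict[str, float], limits: Dict[str, float] = LIMITS) -> List[str]:
    """Step 2: return the metrics in the sample that are at or above their limit."""
    return sorted(
        name
        for name, limit in limits.items()
        if name in sample and sample[name] >= limit
    )
```

Metrics with no stated limit are ignored rather than guessed at, which keeps the check tied to limitations someone actually identified.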

Page 37: Velocity NY 2014: Signal through the noise

A Balancer ?!

Page 38: Velocity NY 2014: Signal through the noise

Balancer

>66% Host Availability

Page 40: Velocity NY 2014: Signal through the noise

% IO per instance

Page 41: Velocity NY 2014: Signal through the noise

% hosts alive

VS

% IO per instance

(Hint: one of these things measures balancing)

Page 42: Velocity NY 2014: Signal through the noise

% hosts alive (66): does not measure balancing

VS

% IO per instance (.2): measures balancing
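
To make the contrast concrete, here is a rough sketch of both measurements. The choice of coefficient of variation as the imbalance figure is an assumption made for this example, not something the slides specify, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def percent_hosts_alive(alive: Sequence[bool]) -> float:
    """Share of hosts passing a health check; says nothing about load spread."""
    return 100.0 * sum(alive) / len(alive)


def io_imbalance(io_per_instance: Sequence[float]) -> float:
    """Coefficient of variation of per-instance IO: 0.0 means perfectly even."""
    avg = mean(io_per_instance)
    if avg == 0:
        return 0.0
    return pstdev(io_per_instance) / avg
```

With three healthy hosts where one takes all the traffic, `percent_hosts_alive` still reports 100 while `io_imbalance` is well above zero, which is the failure a balancer exists to prevent.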

Page 48: Velocity NY 2014: Signal through the noise

IT Monitoring != Feedback

Page 50: Velocity NY 2014: Signal through the noise

some silly balancer!=

Page 51: Velocity NY 2014: Signal through the noise

WE CAN REDUCE ALERTS BY IMPROVING OUR TELEMETRY SIGNAL

Page 52: Velocity NY 2014: Signal through the noise

What did he just say?

•Monitoring isn't a thing. It’s just part of the engineering process

•We’re treating it like a thing that only some types of engineers might want to do, and that’s giving us broken feedback

•Aerospace engineers are rad, they don’t do that.

•Fix your monitoring and your alerts will follow

Page 65: Velocity NY 2014: Signal through the noise

Own YOUR problem

Page 67: Velocity NY 2014: Signal through the noise

Some Graph in the War Room

Page 75: Velocity NY 2014: Signal through the noise

WHAT YOU MONITOR MATTERS

Page 82: Velocity NY 2014: Signal through the noise

EVERYBODY OWNS MONITORING

Page 96: Velocity NY 2014: Signal through the noise

What did he just say?

•Choose metrics that tell you about the things you care about

•Alert when the things you care about hit limits you understand

•All alerts below critical go to chatrooms, ticket systems, or dashboards

•Critical alerts use an automated escalation service that enforces on-call policy

•Escalated alerts require acknowledgement

•Escalated alerts require run book URLs and/or links to graphs of the metric
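
The routing part of this policy might look something like the sketch below. The severity levels, destination names, and function names are placeholders chosen for illustration; a real setup would map them onto whatever chat, ticketing, and escalation tools the team already runs.

```python
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    """Hypothetical severity scale; only CRITICAL escalates to a person."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


def route(severity: Severity) -> List[str]:
    """Pick destinations for an alert; destination names are placeholders."""
    if severity >= Severity.CRITICAL:
        # Critical alerts go through the escalation service, which is
        # assumed to enforce the on-call policy.
        return ["escalation_service"]
    # Everything below critical lands somewhere passive.
    return ["chatroom", "ticket_system", "dashboard"]


def needs_acknowledgement(severity: Severity) -> bool:
    """Only escalated (critical) alerts require a human to acknowledge them."""
    return severity >= Severity.CRITICAL
```

Keeping the rule in one function means the question "who gets woken up by this?" has a single answer that anyone can read.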

Page 97: Velocity NY 2014: Signal through the noise

ALERT ON WHAT YOU SEE

Page 101: Velocity NY 2014: Signal through the noise

EVERYONE OWNS ALERTS (and dashboards)

Page 106: Velocity NY 2014: Signal through the noise

The Ultimate Recap

• Enforce a notification policy that requires context

• Make monitoring an engineering process

• Use the same signal for all metrics introspection and notification

• Encourage everyone to rely on telemetry data (graphs or it didn’t happen!)

• Everyone who collects a metric gets keys to dashboard and alert design

Page 107: Velocity NY 2014: Signal through the noise

Questions?

Office Hours: 1:15pm