In recent years it’s become evident that alerting is one of the biggest challenges facing modern Operations Engineers. Conference talks, hallway tracks, meetups, etc. are rife with discussions about poor signal-to-noise in alerts, fatigue from false positives, and a general lack of actionability. Our talk (informed by real-world experience designing, building, and maintaining our distributed, multi-tenant metrics/alerting service) takes a fundamental approach and examines alerting requirements and practices in the abstract. We put forth a comprehensive abstract model with best practices that should be followed and implemented by your team regardless of your tool of choice. This talk is equal parts cultural and technical, encompassing both computational capabilities and social practices, like:
• Defining organizational policy about where and when to set alerts
• Ensuring the on-call engineer is armed with the proper information to take action
• Best practices for configuring an alert
• Fire-fighting after an alert has triggered
• Performing analysis across your organization-wide history of alerts
hi.
@davejosephsen
github: djosephsen
Signal Through the Noise
WAT?
AAAGHHHHH!!!
ALERTS AREN’T FREE
Business Projects
IT Projects
Changes
Unplanned Work
(eeew Comic Sans)
Alerting
Tax the Ammunition
THE CONTENT OF YOUR ALERTS MATTERS
What did he just say?
• Notifications are expensive; they hurt people and productivity
• Make people work harder to send them by requiring run books (see the sketch after this list)
• Run books add context to alerts. Other types of context are awesome too
• Like graphs
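One way to make "run books required" concrete is to refuse alert definitions that arrive without context. This is a minimal, hypothetical Python sketch; the AlertDefinition fields and the example URLs are assumptions for illustration, not part of the talk or any particular monitoring tool.

    from dataclasses import dataclass

    @dataclass
    class AlertDefinition:
        name: str
        query: str          # metric expression the alert watches
        threshold: float
        runbook_url: str    # what the on-call engineer should actually do
        graph_url: str      # where to see the underlying telemetry

    def validate(alert: AlertDefinition) -> None:
        # Refuse to register an alert that would page a human without context.
        missing = [f for f in ("runbook_url", "graph_url") if not getattr(alert, f)]
        if missing:
            raise ValueError(f"{alert.name}: refusing alert without {missing}")

    validate(AlertDefinition(
        name="api_error_rate",
        query="rate(http_5xx[5m])",
        threshold=0.05,
        runbook_url="https://wiki.example.com/runbooks/api_error_rate",
        graph_url="https://graphs.example.com/d/api_error_rate",
    ))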
WHY do we Monitor?
Telemetry Data
Command Signal
1. Identify Operational Limitations
Y < 160 bpm
X < 7m km/h
2. Monitor those limitations
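As a rough illustration of that two-step loop (identify a limit, then monitor against it), here is a tiny hypothetical Python sketch. The heart-rate limit comes from the slide; the names and the interpretation of the speed figure's units are assumptions.

    # Step 1: write the operational limits down explicitly.
    OPERATIONAL_LIMITS = {
        "heart_rate_bpm": 160,      # Y < 160 bpm (from the slide)
        "speed_km_h": 7_000_000,    # X < 7m km/h (units assumed here)
    }

    # Step 2: monitoring is just checking samples against those limits.
    def within_limits(metric: str, value: float) -> bool:
        return value < OPERATIONAL_LIMITS[metric]

    assert within_limits("heart_rate_bpm", 122)
    assert not within_limits("heart_rate_bpm", 175)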
A Balancer ?!
Balancer
>66% Host Availability
% hosts alive VS % IO per instance
(Hint: one of these things measures balancing)
% hosts alive: does not measure balancing
% IO per instance: measures balancing
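A quick hypothetical sketch of why the two metrics differ: with three hosts all alive, "% hosts alive" stays at 100% even when one instance handles nearly all of the IO, while the per-instance IO share exposes the imbalance. The numbers are made up for illustration.

    def pct_hosts_alive(alive: int, total: int) -> float:
        return 100.0 * alive / total

    def io_share_per_instance(io_ops: list[float]) -> list[float]:
        total = sum(io_ops)
        return [100.0 * ops / total for ops in io_ops]

    # Three hosts, all alive, but one does almost all the work:
    io = [980.0, 10.0, 10.0]
    print(pct_hosts_alive(alive=3, total=3))   # 100.0 -> ">66% host availability" looks fine
    print(io_share_per_instance(io))           # [98.0, 1.0, 1.0] -> clearly not balanced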
IT Monitoring != Feedback
Feedback != some silly balancer
WE CAN REDUCE ALERTS BY IMPROVING OUR TELEMETRY SIGNAL
What did he just say?
• Monitoring isn't a thing. It's just part of the engineering process
• We're treating it like a thing that only some types of engineers might want to do, and that's giving us broken feedback
• Aerospace engineers are rad; they don't do that
• Fix your monitoring and your alerts will follow
Own YOUR problem
Some Graph in the War Room
WHAT YOU MONITOR MATTERS
(diagram: metrics x_k from components a, b, and c, each checked against an operational limit)
EVERYBODY OWNS MONITORING
What did he just say?
• Choose metrics that tell you about the things you care about
• Alert when the things you care about hit limits you understand
• All alerts < critical go to chat rooms, ticket systems, or dashboards
• Critical alerts use an automated escalation service that enforces on-call policy
• Escalated alerts require acknowledgement
• Escalated alerts require run book URLs and/or links to graphs of the metric (see the routing sketch after this list)
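The routing policy in the bullets above could be enforced mechanically. This is a hedged sketch assuming a severity enum and a placeholder escalation service; none of the names are real APIs.

    from enum import IntEnum

    class Severity(IntEnum):
        INFO = 1
        WARNING = 2
        CRITICAL = 3

    def route_alert(severity: Severity, message: str, runbook_url: str = "") -> str:
        if severity < Severity.CRITICAL:
            # Below critical: low-noise destinations humans read on their own schedule.
            return f"chatroom/ticket/dashboard <- {message}"
        if not runbook_url:
            # Critical pages must carry context before they are allowed to wake anyone.
            raise ValueError("critical alerts require a run book URL or graph link")
        # Placeholder for an escalation service that pages on-call and requires acknowledgement.
        return f"escalation-service(ack required) <- {message} ({runbook_url})"

    print(route_alert(Severity.WARNING, "disk 80% full"))
    print(route_alert(Severity.CRITICAL, "api error rate > 5%",
                      runbook_url="https://wiki.example.com/runbooks/api_error_rate"))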
ALERT ON WHAT YOU SEE
EVERYONE OWNS ALERTS (and dashboards)
The Ultimate Recap
• Enforce a notification policy that requires context
• Make monitoring an engineering process
• Use the same signal for all metrics introspection and notification
• Encourage everyone to rely on telemetry data (graphs or it didn't happen!)
• Everyone who collects a metric gets keys to dashboard and alert design
Questions?
Office Hours: 1:15pm