Alert Fatigue - and what to do about it
Elik Eizenberg, VP R&D
http://www.bigpanda.io
alert fatigue
noun
A constant flood of noisy, non-actionable alerts, generated by your monitoring stack.
Synonyms: alert overload, alert spam
Poor Signal-to-Noise Ratio
Delayed Response
Wrong Prioritization
Constant Context Switching
Common Pitfalls
Alert Per Host
What you see: 20 critical Nagios / Zabbix alerts, all at once
What happened:
- Unexpected traffic to your app
- You get an alert from practically every host in the cluster
In an ideal world:
- 1 alert, indicating 80% of the cluster has problems
- Don’t wake me up unless at least some % of the cluster is down
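A minimal sketch of the "one alert per cluster" idea, assuming you can list the hosts in a cluster and collect their per-host alerts in one place. The names `HostAlert`, `cluster_alert` and `min_fraction` are hypothetical and not part of Nagios or Zabbix.

```python
from dataclasses import dataclass


@dataclass
class HostAlert:
    host: str
    check: str
    critical: bool


def cluster_alert(alerts, cluster_hosts, min_fraction=0.5):
    """Collapse per-host alerts into one cluster-level summary.

    Returns None when fewer than min_fraction of the hosts are affected,
    so nobody is woken up for a handful of hosts.
    """
    if not cluster_hosts:
        return None
    # Count each affected host once, no matter how many checks fired on it.
    affected = {a.host for a in alerts if a.critical and a.host in cluster_hosts}
    fraction = len(affected) / len(cluster_hosts)
    if fraction < min_fraction:
        return None
    return (
        f"{fraction:.0%} of cluster affected "
        f"({len(affected)}/{len(cluster_hosts)} hosts)"
    )


hosts = {f"web{i}" for i in range(25)}
storm = [HostAlert(f"web{i}", "http", True) for i in range(20)]
print(cluster_alert(storm, hosts))
```

The threshold is a fraction rather than a host count, so the same rule keeps working when the cluster grows or shrinks.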
Important != Urgent
What you see: Low disk space alert on a MongoDB host
What happened:
- DB disk is slowly filling up as expected
- Will become urgent in a few weeks
In an ideal world:
- No need for an alert at all!
- Automatically issue a Jira ticket and assign it to me
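One way to separate important from urgent is to estimate how long you have left and route on that. This is a rough sketch, assuming one disk usage sample per day and roughly linear growth. The function names and the two-day paging horizon are hypothetical, and the ticket itself would be created by whatever integration you already use.

```python
def days_until_full(used_gb_samples, capacity_gb):
    """Estimate days until the disk is full from daily usage samples.

    Uses the average daily growth between the first and last sample.
    Returns None when there is too little data or usage is not growing.
    """
    if len(used_gb_samples) < 2:
        return None
    days_observed = len(used_gb_samples) - 1
    growth_per_day = (used_gb_samples[-1] - used_gb_samples[0]) / days_observed
    if growth_per_day <= 0:
        return None
    return (capacity_gb - used_gb_samples[-1]) / growth_per_day


def route(days_left, page_within_days=2):
    """Page only when the problem is close; otherwise open a ticket."""
    if days_left is None:
        return "ignore"
    return "page" if days_left <= page_within_days else "ticket"


remaining = days_until_full([700, 710, 720, 730], capacity_gb=1000)
print(remaining, route(remaining))
```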
Non-Adaptive Thresholds
What you see: The same high-load alerts, every Monday after lunch
What happened:
- Monday is busy by definition
- You can’t use the same thresholds every day
In an ideal world:
- Dynamically update your thresholds
- Or focus only on anomalies (e.g. etsy/skyline)
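A simple form of dynamic thresholds is a separate baseline per weekday and hour. The sketch below assumes you have historical load samples tagged with weekday and hour. It uses a plain mean plus k standard deviations rule, which is not how etsy/skyline works, and all names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_baselines(history):
    """history: iterable of (weekday, hour, value) samples.

    Returns {(weekday, hour): (mean, stdev)} for slots with at least 2 samples.
    """
    buckets = defaultdict(list)
    for weekday, hour, value in history:
        buckets[(weekday, hour)].append(value)
    return {
        slot: (mean(values), stdev(values))
        for slot, values in buckets.items()
        if len(values) >= 2
    }


def is_anomalous(baselines, weekday, hour, value, k=3.0):
    """True when value exceeds the slot's own mean by more than k stdevs."""
    if (weekday, hour) not in baselines:
        return False  # no baseline yet, so stay quiet
    mu, sigma = baselines[(weekday, hour)]
    return value > mu + k * sigma


# Weekday 0 is Monday; Monday 13:00 is normally much busier than Tuesday 13:00.
history = [(0, 13, 8.0), (0, 13, 9.0), (0, 13, 10.0),
           (1, 13, 2.0), (1, 13, 3.0), (1, 13, 4.0)]
baselines = build_baselines(history)
print(is_anomalous(baselines, 0, 13, 9.5))  # usual Monday load
print(is_anomalous(baselines, 1, 13, 9.5))  # same load on a Tuesday
```

The same load value is judged against the history of its own time slot, so a busy Monday stops looking like an incident.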
Same Issue, Different System
What you see: Incoming alerts from Nagios, Pingdom, NewRelic, Keynote & Splunk…
What happened:
- Data corruption in a couple of Mongo nodes
- Resulting in heavy disk IO and some transaction errors
- This kind of error shows up at the server, application & user levels
In an ideal world:
- Auto correlate highly-related alerts from different systems
- Show me one high-level incident, instead of low-level alerts
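A very rough sketch of correlation: group alerts that concern the same service and arrive close together in time, regardless of which tool sent them. It assumes every alert already carries a `service` tag, which is a big assumption in practice. The `Alert` type, `correlate` and the five-minute window are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    source: str       # e.g. "Nagios", "Pingdom"
    service: str      # assumed tag shared across monitoring tools
    timestamp: float  # seconds


def correlate(alerts, window_seconds=300):
    """Group alerts into incidents: same service, gaps of at most window_seconds."""
    incidents = []
    latest = {}  # service -> most recent incident for that service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest.get(alert.service)
        if (current is not None
                and alert.timestamp - current[-1].timestamp <= window_seconds):
            current.append(alert)
        else:
            current = [alert]
            incidents.append(current)
            latest[alert.service] = current
    return incidents


alerts = [
    Alert("Nagios", "mongo", 0), Alert("Pingdom", "mongo", 60),
    Alert("NewRelic", "mongo", 120), Alert("Splunk", "mongo", 200),
    Alert("Nagios", "web", 100), Alert("Nagios", "mongo", 5000),
]
for incident in correlate(alerts):
    print(incident[0].service, [a.source for a in incident])
```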
Transient Alerts
What you see: Issue pops up for a couple of minutes, then disappears.
What happened:
- Maybe a cronjob overutilizes the network
- Or a random race-condition in the app
- Or a rarely-used product feature that causes the backend to crash
In an ideal world:
- No need for an alert every time it happens
- Give me a monthly report of common short-lived alerts
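The report of short-lived alerts can be as simple as a count per check. This sketch assumes you keep the open and resolve time of every alert. The function name and the five-minute cutoff are hypothetical.

```python
from collections import Counter


def transient_report(events, max_duration_seconds=300):
    """events: iterable of (check_name, opened_at, resolved_at), in seconds.

    Returns (check_name, count) pairs for alerts that resolved within
    max_duration_seconds, most frequent first.
    """
    counts = Counter(
        name
        for name, opened_at, resolved_at in events
        if resolved_at - opened_at <= max_duration_seconds
    )
    return counts.most_common()


events = [
    ("net-util", 0, 120),
    ("net-util", 1000, 1100),
    ("backend-crash", 50, 90),
    ("disk-space", 0, 4000),  # long-lived, so not part of the report
]
print(transient_report(events))
```

Run it over a month of alert history and review the top entries, instead of being paged for each one.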
Give us a try - http://www.bigpanda.io
http://twitter.com/bigpanda
Thanks for listening!