
Five Causes of Alert Fatigue -- and how to prevent them



Page 1: Five Causes of Alert Fatigue -- and how to prevent them

Alert Fatigue - and what to do about it

Elik Eizenberg, VP R&D

http://www.bigpanda.io

Page 2: Five Causes of Alert Fatigue -- and how to prevent them

alert fatigue

noun

A constant flood of noisy, non-actionable alerts, generated by your monitoring stack.

Synonyms: alert overload, alert spam


Page 3: Five Causes of Alert Fatigue -- and how to prevent them


Poor Signal-to-Noise Ratio

Delayed Response

Wrong Prioritization

Constant Context Switching

Page 4: Five Causes of Alert Fatigue -- and how to prevent them


Common Pitfalls

Page 5: Five Causes of Alert Fatigue -- and how to prevent them

What you see: 20 critical Nagios / Zabbix alerts, all at once

What happened:

- An unexpected spike in traffic to your app

- You get an alert from practically every host in the cluster

In an ideal world:

- 1 alert, indicating 80% of the cluster has problems

- Don’t wake me up unless at least a certain % of the cluster is down (see the sketch below)


Alert Per Host
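To make the "ideal world" concrete, here is a minimal roll-up sketch (not BigPanda's actual logic): it assumes alerts arrive as dicts with hypothetical cluster/host fields, and emits a single summary only once a configurable fraction of a cluster is alerting.

```python
# Sketch: roll per-host alerts up into one cluster-level alert.
# Assumes each alert is a dict with hypothetical "cluster" and "host" keys,
# and that cluster_sizes maps cluster name -> number of hosts in it.
from collections import defaultdict

def cluster_rollup(alerts, cluster_sizes, wake_threshold=0.5):
    """Return one summary per cluster whose affected-host ratio crosses the threshold."""
    affected = defaultdict(set)
    for alert in alerts:
        affected[alert["cluster"]].add(alert["host"])

    summaries = []
    for cluster, hosts in affected.items():
        ratio = len(hosts) / cluster_sizes[cluster]
        if ratio >= wake_threshold:
            summaries.append(
                f"{cluster}: {ratio:.0%} of hosts alerting "
                f"({len(hosts)} of {cluster_sizes[cluster]})"
            )
    return summaries

# Example: 16 Nagios/Zabbix alerts from a 20-host web cluster become one line.
alerts = [{"cluster": "web", "host": f"web-{i:02d}"} for i in range(16)]
print(cluster_rollup(alerts, {"web": 20}))  # ['web: 80% of hosts alerting (16 of 20)']
```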

Page 6: Five Causes of Alert Fatigue -- and how to prevent them

What you see: Low disk space alert on a MongoDB host

What happened:

- DB disk is slowly filling up as expected

- It will only become urgent in a few weeks

In an ideal world:

- No need for an alert at all!

- Automatically issue a Jira ticket and assign it to me (see the sketch below)


Important != Urgent
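One possible way to route important-but-not-urgent alerts into a ticket queue instead of a page is a small hook against Jira's REST v2 "create issue" endpoint. This is only a sketch: the base URL, credentials, project key, issue type and assignee below are placeholders, not a prescribed integration.

```python
# Sketch: turn a low-urgency alert into a Jira ticket instead of a page.
# POST /rest/api/2/issue is Jira's standard create-issue endpoint; all
# field values here are placeholders.
import requests

JIRA_URL = "https://jira.example.com"        # placeholder
AUTH = ("alert-bot", "api-token-goes-here")  # placeholder credentials

def open_ticket(summary, description, assignee="elik"):
    payload = {
        "fields": {
            "project": {"key": "OPS"},       # hypothetical project key
            "summary": summary,
            "description": description,
            "issuetype": {"name": "Task"},
            "assignee": {"name": assignee},
        }
    }
    resp = requests.post(f"{JIRA_URL}/rest/api/2/issue", json=payload, auth=AUTH)
    resp.raise_for_status()
    return resp.json()["key"]                # e.g. "OPS-123"

# Called from the monitoring pipeline instead of paging the on-call:
# open_ticket("MongoDB disk 85% full on mongo-03",
#             "Filling up slowly; needs attention within a few weeks.")
```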

Page 7: Five Causes of Alert Fatigue -- and how to prevent them

What you see: The same high-load alerts, every Monday after lunch

What happened:

- Monday is busy by definition

- You can’t use the same thresholds every day

In an ideal world:

- Dynamically update your thresholds (a simple version is sketched below)

- Or focus only on anomalies (e.g. etsy/skyline)


Non-Adaptive Thresholds
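A dynamic threshold can be as simple as comparing each new sample against a rolling mean plus a few standard deviations, in the spirit of (but far simpler than) etsy/skyline. The window size, warm-up length and sigma multiplier below are arbitrary assumptions.

```python
# Sketch: adaptive threshold via rolling mean + N standard deviations,
# instead of a single static "load > X" rule.
from collections import deque
from statistics import mean, stdev

class AdaptiveThreshold:
    def __init__(self, window=288, sigmas=3.0):  # e.g. 288 five-minute samples = 1 day
        self.history = deque(maxlen=window)
        self.sigmas = sigmas

    def is_anomalous(self, value):
        anomalous = False
        if len(self.history) >= 30:              # wait for enough history first
            mu, sd = mean(self.history), stdev(self.history)
            anomalous = value > mu + self.sigmas * sd
        self.history.append(value)
        return anomalous

# Monday's post-lunch load stays quiet once it is part of the learned baseline,
# while a genuine spike still fires.
detector = AdaptiveThreshold(window=50, sigmas=3.0)
baseline = [1.0, 1.2, 0.9, 1.1, 1.0] * 10                 # normal noisy load
print(any(detector.is_anomalous(v) for v in baseline))    # False after warm-up
print(detector.is_anomalous(6.0))                         # True: a real spike
```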

Page 8: Five Causes of Alert Fatigue -- and how to prevent them

What you see: Incoming alerts from Nagios, Pingdom, NewRelic, Keynote & Splunk…

What happened:

- Data corruption in a couple of Mongo nodes

- Resulting in heavy disk IO and some transaction errors

- This kind of error manifests itself at the server, application & user levels

In an ideal world:

- Auto correlate highly-related alerts from different systems

- Show me one high-level incident, instead of low-level alerts (see the sketch below)


Same Issue, Different System
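A toy illustration of the correlation idea (real correlation engines are far more involved): alerts from different tools that share a host and arrive within a short window are folded into one incident. The field names and the 5-minute window are assumptions.

```python
# Sketch: fold alerts from different monitoring tools into one incident
# when they hit the same host within a short time window.
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def correlate(alerts):
    """alerts: dicts with 'source', 'host', 'time' (datetime)."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        for incident in incidents:
            if (alert["host"] in incident["hosts"]
                    and alert["time"] - incident["last_seen"] <= WINDOW):
                incident["alerts"].append(alert)
                incident["sources"].add(alert["source"])
                incident["last_seen"] = alert["time"]
                break
        else:
            incidents.append({
                "hosts": {alert["host"]},
                "sources": {alert["source"]},
                "alerts": [alert],
                "last_seen": alert["time"],
            })
    return incidents

# Nagios disk IO + NewRelic errors + Pingdom check on the same Mongo node
# collapse into a single incident with three sources.
t0 = datetime(2014, 6, 1, 12, 0)
raw = [
    {"source": "nagios",   "host": "mongo-02", "time": t0},
    {"source": "newrelic", "host": "mongo-02", "time": t0 + timedelta(minutes=2)},
    {"source": "pingdom",  "host": "mongo-02", "time": t0 + timedelta(minutes=4)},
]
incidents = correlate(raw)
print(len(incidents), sorted(incidents[0]["sources"]))
# 1 ['nagios', 'newrelic', 'pingdom']
```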

Page 9: Five Causes of Alert Fatigue -- and how to prevent them

What you see: An issue pops up for a couple of minutes, then disappears.

What happened:

- Maybe a cron job that over-utilizes the network

- Or a random race-condition in the app

- Or a rarely-used product feature that causes the backend to crash

In an ideal world:

- No need for an alert every time it happens

- Give me a monthly report of common short-lived alerts (see the sketch below)


Transient Alerts
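One possible way to demote transient alerts: keep a log of resolved alerts, treat anything that cleared within a few minutes as transient, and summarize counts per check instead of paging each time. The 10-minute cutoff and the field names are assumptions.

```python
# Sketch: suppress short-lived alerts and report them in bulk instead.
# An alert counts as "transient" if it resolved within TRANSIENT_CUTOFF of opening.
from collections import Counter
from datetime import datetime, timedelta

TRANSIENT_CUTOFF = timedelta(minutes=10)

def transient_report(resolved_alerts):
    """resolved_alerts: dicts with 'check', 'opened', 'resolved' (datetimes)."""
    counts = Counter(
        a["check"]
        for a in resolved_alerts
        if a["resolved"] - a["opened"] <= TRANSIENT_CUTOFF
    )
    return "\n".join(f"{check}: flapped {n} times" for check, n in counts.most_common())

t0 = datetime(2014, 6, 1, 3, 0)
history = [
    {"check": "nightly-backup net saturation", "opened": t0,
     "resolved": t0 + timedelta(minutes=4)},
    {"check": "nightly-backup net saturation", "opened": t0 + timedelta(days=1),
     "resolved": t0 + timedelta(days=1, minutes=3)},
    {"check": "api 5xx burst", "opened": t0, "resolved": t0 + timedelta(hours=2)},
]
print(transient_report(history))
# nightly-backup net saturation: flapped 2 times
```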

Page 10: Five Causes of Alert Fatigue -- and how to prevent them


Give us a try - http://www.bigpanda.io | http://twitter.com/bigpanda

Thanks for listening!