Monitoring by Zabbix: The Final Frontier

Detect problems way before end users notice

Agenda

Programming languages we use to build our software

Standard approach to monitoring

How does Zabbix do it?

Who am I?

Alexei Vladishev

Creator of Zabbix

CEO and Architect

@avladishev

Riga | Tokyo | New York

Runtime issues

Memory leaks

Uninitialised pointers

Require discipline!

Runtime issues

Out of memory

GC affects execution

Runtime issues

Out of memory

Slow execution

Hard to predict resource usage

No guarantees: performance, resource usage, availability, etc.

Confluence KB: How to fix out of memory errors by increasing available memory?

We aren't really able to give a concrete recommendation for the amount of memory to allocate, because that will depend greatly on your server setup, the size of your user base, and their behaviour. You will need to find a value that works for you, ie no noticeable GC pauses, and no OutOfMemory errors.

Solution: Increase Xmx in small increments (eg 512mb at a time), until you no longer experience the OutOfMemory error.

Too many bad things may happen at runtime

That’s why we need monitoring!

Monitoring is about describing abnormal behaviour of our systems

How to detect it?

Typical approach

[Chart: CPU load (0 to 10) from 10:00 to 10:50, with a threshold at 5; annotated with three "Problem" and two "Recovery" events]

CPU load > 5

Too sensitive: flapping
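
The flapping on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `naive_events` and the sample readings are made up for the example, and it is not how Zabbix evaluates triggers internally.

```python
def naive_events(samples, threshold=5.0):
    """Return (index, event) pairs for a plain 'value > threshold' check."""
    events = []
    in_problem = False
    for i, value in enumerate(samples):
        if value > threshold and not in_problem:
            in_problem = True
            events.append((i, "PROBLEM"))
        elif value <= threshold and in_problem:
            in_problem = False
            events.append((i, "RECOVERY"))
    return events


# Hypothetical CPU load readings hovering around the threshold of 5.
cpu_load = [4.8, 5.2, 4.9, 5.3, 4.7, 5.1]
events = naive_events(cpu_load)
for index, event in events:
    print(index, event)
```

With readings this close to the threshold, almost every sample flips the state, so the check produces three problems and two recoveries from six samples.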

Zabbix does it the smart way

[Diagram: Zabbix server, comprising Data collection, History, Analysis and Alerts]

Analyse history

[Chart: CPU load (0 to 10) from 10:00 to 11:10, annotated "Problem!" and "Recovery"]

CPU load for the last 10 minutes > 5
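
One way to read "CPU load for the last 10 minutes > 5" is that every sample in the window must exceed the threshold, that is, the window minimum is above 5. The sketch below assumes that reading; the function name `sustained_above` and the sample histories are hypothetical, and an average over the window would be an equally valid interpretation.

```python
def sustained_above(history, now, window=600, threshold=5.0):
    """True if every sample in the last `window` seconds exceeds `threshold`.

    `history` is a list of (timestamp_in_seconds, value) pairs.
    """
    recent = [value for ts, value in history if now - window < ts <= now]
    # An empty window is treated as "no problem" rather than raising an error.
    return bool(recent) and min(recent) > threshold


# Hypothetical histories with one sample per minute.
short_spike = [(60 * i, 2.0) for i in range(10)] + [(600, 9.0)]
sustained_load = [(60 * i, 6.0 + 0.1 * i) for i in range(11)]

spike_is_problem = sustained_above(short_spike, now=600)
sustained_is_problem = sustained_above(sustained_load, now=600)
print(spike_is_problem, sustained_is_problem)
```

A single spike to 9.0 does not raise a problem, while ten minutes of load above 5 does.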

Problem disappeared != problem is resolved

Problem: free disk space <= 10%

Now free disk space is 10.001%

Have we resolved our problem?

Different conditions

[Chart: CPU load (0 to 10) from 10:00 to 10:50, annotated "Problem!" and "Recovery"]

Problem: CPU load > 5

Recovery: CPU load < 1

No flapping!
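
Using separate problem and recovery conditions is a form of hysteresis. A minimal sketch, assuming the thresholds from the slide (problem above 5, recovery below 1); the function name `hysteresis_events` and the readings are invented for illustration.

```python
def hysteresis_events(samples, problem_above=5.0, recover_below=1.0):
    """Return (index, event) pairs using separate problem and recovery thresholds."""
    events = []
    in_problem = False
    for i, value in enumerate(samples):
        if not in_problem and value > problem_above:
            in_problem = True
            events.append((i, "PROBLEM"))
        elif in_problem and value < recover_below:
            in_problem = False
            events.append((i, "RECOVERY"))
    return events


# Hypothetical readings: noisy around 5 first, then a real drop in load.
cpu_load = [4.8, 5.2, 4.9, 5.3, 4.7, 5.1, 2.0, 0.8, 0.5]
events = hysteresis_events(cpu_load)
print(events)
```

The noise around 5 now yields a single problem, and the recovery fires only once the load has genuinely dropped below 1.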

Smarter approach

Problem if free disk space < 10%

Recovery if free disk space > 30% for the last 15 minutes

Problem if 3 consecutive checks of REST service failed

Recovery if 10 consecutive checks of REST service are OK
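
The consecutive-check rule for the REST service can be sketched as a small state machine. The class name `ConsecutiveCheckTrigger` and the sequence of check results are hypothetical; the counts (3 failures to raise a problem, 10 successes to recover) come from the slide.

```python
class ConsecutiveCheckTrigger:
    """Problem after N consecutive failures, recovery after M consecutive successes."""

    def __init__(self, fail_to_problem=3, ok_to_recover=10):
        self.fail_to_problem = fail_to_problem
        self.ok_to_recover = ok_to_recover
        self.in_problem = False
        self.last_ok = None  # result of the previous check
        self.streak = 0      # length of the current run of identical results

    def update(self, ok):
        """Feed one check result and return True while in the problem state."""
        if ok == self.last_ok:
            self.streak += 1
        else:
            self.last_ok = ok
            self.streak = 1
        if not self.in_problem and not ok and self.streak >= self.fail_to_problem:
            self.in_problem = True
        elif self.in_problem and ok and self.streak >= self.ok_to_recover:
            self.in_problem = False
        return self.in_problem


# Hypothetical check results: True means the REST check succeeded.
results = [True, False, False, True, False, False, False] + [True] * 9 + [False] + [True] * 10
trigger = ConsecutiveCheckTrigger()
states = [trigger.update(ok) for ok in results]
print(states)
```

Two isolated failures do not raise a problem, and nine successes followed by one failure do not clear it; only an unbroken run of ten successes does.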

Anomaly detection

[Chart: metric value (0 to 10) from 10:00 to 11:10, annotated "Anomaly!"]

Compare current system state with the past
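
The slide does not say how the comparison with the past is made. One simple possibility, shown here purely as an assumption, is to flag a value that lies more than a few standard deviations away from the mean of past values for the same period. The function name `is_anomaly`, the `sigmas` parameter and the baseline data are all hypothetical.

```python
from statistics import mean, stdev


def is_anomaly(current, baseline, sigmas=3.0):
    """Flag `current` if it deviates from the baseline mean by more than `sigmas` std devs."""
    mu = mean(baseline)
    sd = stdev(baseline)
    if sd == 0:
        # A perfectly flat baseline: any different value counts as anomalous.
        return current != mu
    return abs(current - mu) > sigmas * sd


# Hypothetical CPU load at the same time of day over the previous seven days.
baseline = [2.1, 2.4, 1.9, 2.2, 2.0, 2.3, 2.1]

normal_reading = is_anomaly(2.3, baseline)
unusual_reading = is_anomaly(7.5, baseline)
print(normal_reading, unusual_reading)
```

Unlike a fixed threshold, this check adapts to whatever is normal for the system at that time, at the cost of needing enough history to form a baseline.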

Forecasting

[Chart: metric value (0 to 50) from 7:00 to 21:00, with a fitted trend line]

y = -2.9455x + 48.309

Predict when a threshold will be reached, and the value after a period of time

Problem in the future
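
The trend line on the slide is a linear fit. A minimal sketch of the two questions it answers (what the value will be after a period of time, and when a threshold will be reached) using ordinary least squares; the function names and the sample data are hypothetical, and other fitting methods may suit real data better.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept) of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def value_at(slope, intercept, x):
    """Predicted value at time x."""
    return slope * x + intercept


def time_to_reach(slope, intercept, threshold):
    """Time x at which the fitted line crosses `threshold`, or None for a flat line."""
    if slope == 0:
        return None
    return (threshold - intercept) / slope


# Hypothetical free disk space (%) sampled once an hour.
hours = [0, 1, 2, 3, 4]
free_pct = [48.0, 45.0, 42.0, 39.0, 36.0]

slope, intercept = fit_line(hours, free_pct)
predicted = value_at(slope, intercept, 10)
hits_threshold_at = time_to_reach(slope, intercept, 10.0)
print(slope, intercept, predicted, hits_threshold_at)
```

A result from `time_to_reach` that lies before the latest sample means the line crossed the threshold in the past, so callers should compare it against the current time before raising a "problem in the future".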

Conclusion

Monitoring by Zabbix is your best friend

Use smart problem detection, do not spam DevOps

Detect problems way before end users notice

Anomalies

Forecasting

Thank you!

Learn more about Zabbix at our booth!

@avladishev

Email: alex@zabbix.com
