ADVANCED PROBLEM DETECTION (POKROČILÉ NASTAVENÍ … · Zabbix Certified Specialist Update Zabbix...

Preview:

Citation preview

ADVANCED PROBLEM DETECTION

(POKROČILÉ NASTAVENÍ TRIGGERŮ)

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

How does Zabbix work

Basic and additional ways of detecting a problem

Possible reaction scenarios

AGENDA

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

ZABBIX - enterprise-level and open-source

monitoring system.

Competitive advantages of the solution:

Free of charge

«All-in-one»

Easy-to-manage

Mature, high-quality and reliable

Flexible (also applicable to problem detection)

WHAT IS ZABBIX?

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

ZABBIX DATA FLOW

Visualization

DATABASEDATABASE

History

ZABBIX SERVERZABBIX SERVER

Analysis

Notifications

Data collection

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

Every N seconds

Zabbix will evenly distribute checks

Different frequency in different time periods

Every X seconds in working time

Every Y second in weekend

At a specific time (Zabbix 3.0)

Ready for business checks

Every hour starting from 9:00 at working hours (9:00, 10:00,…, 18:00)

HOW OFTEN TO EXECUTE CHECKS?

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

RHow to detect problems in the

incoming data flow?

TRIGGER

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

Triggers!

TRIGGER

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

RTrigger –

problem definition

TRIGGER

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

Example

{server:system.cpu.load.last()} > 5

Operators

- + / * < > = <> >= <= not or and

Functions

min max avg last count date time diff regexp and much more!

Analyze everything: any metric and any host

{node1:system.cpu.load.last()} > 5 and {node2:system.cpu.load.last()} > 5 and

{nodes:tps.last()} < 5000

TRIGGER

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

Performance

{server:system.cpu.load.last()} > 5

JUNIOR LEVEL

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

FALSE POSITIVES

«Flapping»

{server:system.cpu.load.last()} > 5

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

Availability

{server:net.tcp.service[http].last()} = 0

JUNIOR LEVEL

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

TOO SENSITIVE

{server:net.tcp.service[http].last()} = 0

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

RToo sensitive leads to

False positives

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

How to avoid false positives?

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

HOW TO AVOID FALSE POSITIVES?

Be careful and define problems wisely!

What does it really meansystem is overloaded

application does not work

service is not available

?

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

EXAMPLES

Problem: CPU load > 5No problem: CPU load = 4.99 Resolved?

Problem: free disk space < 10%No problem: free disk space = 10.001% Resolved?

Problem: SSH check failedNo problem: SSH is up Resolved?

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

ANALYZE HISTORY

Performance

{server:system.cpu.load.min(10m)} > 5

Availability

{server:net.tcp.service[http].max(5m)} = 0

{server:net.tcp.service[http].max(#3)} = 0

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

ANALYZE HISTORY

{server:system.cpu.load.min(10m)} > 5

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

ANALYZE HISTORY

{server:net.tcp.service[http].max(#3)} = 0

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

DIFFERENT CONDITIONS FOR PROBLEM AND

RECOVERY

Before

{server:system.cpu.load.last()} > 5

Now

Problem definition: {server:system.cpu.load.last()}>5

Recovery expression:

{server:system.cpu.load.last()}<=1

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

DIFFERENT CONDITIONS FOR DIFFERENT TRIGGER STATES

{server:system.cpu.load.last()} > 5 … {server:system.cpu.load.last()} <= 1

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

No false positives!

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

System is overloaded

Problem definition: {server:system.cpu.load.min(5m)}>3

Recovery expression: {server:system.cpu.load.max(2m)}<=1

No free disk space /

Problem definition: {server:vfs.fs.size[/,pfree].last()}<10

Recovery expression: {server:vfs.fs.size[/,pfree].min(15m)}>30

SSH is not available

Problem definition: {server:net.tcp.service[ssh].max(#3)}=0

Recovery expression: {server:net.tcp.service[ssh].min(#10)}=1

EXAMPLES

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

How to detect?

By comparing with the data from the same period, the period is taken from the

past.

Average CPU load for the last hour is 2x higher than

CPU load for the same period week ago

{server:system.cpu.load.avg(1h)} > 2 * {server:system.cpu.load.avg(1h,7d)}

ANOMALIES

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

ANOMALIES

Comparison with the data 7 days ago

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

FORECAST

Trigger function timeleft

When we reach a certain threshold value

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

FORECAST

Trigger function forecast

When we reach a certain threshold value

4 hours

5.2

%

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

DOES HISTORY ANALYSIS AFFECT PERFORMANCE OF ZABBIX?

Yes, but not significantly.

Especially as of Zabbix 2.2.0.

DATA BASEDATA BASEZABBIX SERVERZABBIX SERVERCACHECACHE

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

DEPENDENCIES

CRM is not workingCRM is not working

DB is inavailableDB is inavailable

No free diskspaceNo free diskspace

Hide dependent problems

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

How to display problems?

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

SECTION „PROBLEMS“

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

TAGS

Tag word: meaning

Customer: Alza

Customer: Globus

Datacenter: NY2

Datacenter: San Francisco

Area: Performance

Area: Availability

Area: Security

Environment: Staging

Environment: Test

User impact: None

User impact: Critical

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

USE OF OBTAINED VALUES

Use of useful information in tags or names

Example:

“RAID Controller Module wide port has gone to failed state. Enclosure 0, Slot 0 (Critical)”

instead

“New SNMP trap received (Critical)”

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

How to react to problems?

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

POSSIBLE REACTIONS

Event correlation

Automatized problem solving

Manual problem closing

Sending notifications to a user or a group of users

Registration of tasks in the Helpdesk system

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

EVENT CORRELATION ON TRIGGER LEVEL

Correlation of events at the trigger level allows you to compare individual problems reported by a single trigger.

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

EVENT CORRELATION ON TRIGGER LEVEL

10/Aug/2016:06:25:30 service Jira stopped “Service Jira stopped” PROBLEM

How does it work?

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

EVENT CORRELATION ON TRIGGER LEVEL

How does it work?

10/Aug/2016:06:25:30 service Jira stopped “Service Jira stopped” PROBLEM

10/Aug/2016:06:27:32 service MySQL stopped “Service MySQL stopped” PROBLEM

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

EVENT CORRELATION ON TRIGGER LEVEL

How does it work?

10/Aug/2016:06:25:30 service Jira stopped “Service Jira stopped” PROBLEM

10/Aug/2016:06:27:32 service MySQL stopped “Service MySQL stopped” RESOLVED

10/Aug/2016:06:28:11 service MySQL started

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

EVENT CORRELATION ON TRIGGER LEVEL

How does it work?

10/Aug/2016:06:25:30 service Jira stopped “Service Jira stopped” PROBLEM

10/Aug/2016:06:27:32 service MySQL stopped “Service MySQL stopped”RESOLVED

10/Aug/2016:06:28:11 service MySQL started

10/Aug/2016:06:34:22 service Redis stopped “Service Redis stopped” PROBLEM

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

EVENT CORRELATION ON TRIGGER LEVEL

How does it work?

10/Aug/2016:06:25:30 service Jira stopped “Service Jira stopped” PROBLEM

10/Aug/2016:06:27:32 service MySQL stopped “Service MySQL stopped” RESOLVED

10/Aug/2016:06:28:11 service MySQL started

10/Aug/2016:06:34:22 service Redis stopped “Service Redis stopped” RESOLVED

10/Aug/2016:06:37:58 service Redis started

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

EVENT CORRELATION ON TRIGGER LEVEL

How does it work?

10/Aug/2016:06:25:30 service Jira stopped “Service Jira stopped” RESOLVED

10/Aug/2016:06:27:32 service MySQL stopped “Service MySQL stopped” RESOLVED

10/Aug/2016:06:28:11 service MySQL started

10/Aug/2016:06:34:22 service Redis stopped “Service Redis stopped” RESOLVED

10/Aug/2016:06:37:58 service Redis started

10/Aug/2016:06:55:31 service Jira started

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

EVENT CORRELATIONA new problem appears

Existing problems

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

EVENT CORRELATION

Existing problems No correlation rules

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

EVENT CORRELATION

Existing problems No correlation rules

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

EVENT CORRELATION

Existing problems No correlation rules (close old)

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

ESCALATE!

Immediate reaction

Delayed reaction

Notiofication if automatic action

failed

Repeated notifications

Escalation to a new level

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

ESCALATE!

Critical

problem

Critical

problem Repeated Email

SMS and ticket

Service restart

SMS to manager

0 min

5 min

10 min

15 min

20 min

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

IN SUMMARY

Analyze history

No problem!= Solution

Use different conditions for problem definition and recovery

Pay attention to anomaly detection

Use correlation

Resolve common problems automatically

Do not hesitate to escalate!

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

ZABBIX SERVICES

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

ZABBIX DEMO

Video najdete zde: https://www.coreit.cz/monitoring-it-prostredi

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

ZABBIX WEBINÁŘE

Připravili jsme pro Vás další Zabbix webináře. Všechny informace najdete u

nás webu: https://www.coreit.cz/zabbix-webinare-do-konce-roku-2020/

26.11. Rozšíření funkcí Zabbixu

1.12. Poznejte Zabbix

3.12. Komunikace se Zabbixem pomocí API

8.12. Vizualizace dat v Zabbixu

ZAB

BIX

AD

VA

NC

ED P

RO

BLEM

DETEC

TION

WEB

INA

R

PROFESIONÁLNÍ ŠKOLENÍ

Become Zabbix certified without attending the training. If you are certain of your knowledge, ZCU, ZCS and ZCP exams can be purchased separately.https://www.zabbix.com/training?language=czech#training_schedule

Aktuální termíny

připravovaných

školení

najdete zde

ZA

BB

IX A

DV

AN

CED

PR

OB

LEM

DETEC

TIO

N W

EB

INA

R

Pokud byste potřebovali jakékoliv informace, neváhejte se na nás obrátit:

KONTAKTUJTE NÁS

Telefon: +420 840 771 177

Web: https://www.coreit.cz

Email:info@coreit.cz

tomas.hermanek@coreit.cz

LinkedIn:https://www.linkedin.com/company/coreitcz/

https://www.linkedin.com/in/hermanekt/

Twitter:https://twitter.com/CoreITcz

https://twitter.com/hermanekt

Mobil Tomáš Heřmánek: +420 732 447 184

OTÁZKY?

DĚKUJI ZA POZORNOST!

Recommended