ADVANCED PROBLEM DETECTION
(ADVANCED TRIGGER CONFIGURATION)
ZABBIX ADVANCED PROBLEM DETECTION WEBINAR
How does Zabbix work
Basic and additional ways of detecting a problem
Possible reaction scenarios
AGENDA
ZABBIX: an enterprise-level, open-source monitoring system.
Competitive advantages of the solution:
Free of charge
«All-in-one»
Easy-to-manage
Mature, high-quality and reliable
Flexible (also applicable to problem detection)
WHAT IS ZABBIX?
ZABBIX DATA FLOW
Visualization
DATABASE
History
ZABBIX SERVER
Analysis
Notifications
Data collection
Every N seconds
Zabbix will evenly distribute checks
Different frequency in different time periods
Every X seconds during working hours
Every Y seconds on weekends
At a specific time (Zabbix 3.0)
Ready for business checks
Every hour during working hours, starting at 9:00 (9:00, 10:00, …, 18:00)
HOW OFTEN TO EXECUTE CHECKS?
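The "every hour during working hours" case can be sketched in a few lines. This is only an illustration of which check times such a schedule produces, not how Zabbix computes them; the function name `business_hour_checks` and the Monday-to-Friday assumption are ours. In Zabbix itself this kind of schedule is normally expressed as a scheduling interval such as `wd1-5h9-18` on the item.

```python
from datetime import datetime, timedelta


def business_hour_checks(first_day, days, first_hour=9, last_hour=18):
    """Return check times at every full hour from first_hour to last_hour
    (inclusive), Monday to Friday, for `days` consecutive days."""
    midnight = first_day.replace(hour=0, minute=0, second=0, microsecond=0)
    times = []
    for offset in range(days):
        day = midnight + timedelta(days=offset)
        if day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday: no checks
            continue
        for hour in range(first_hour, last_hour + 1):
            times.append(day.replace(hour=hour))
    return times


# Thursday 26 Nov 2020 plus the following two days (Friday and Saturday).
schedule = business_hour_checks(datetime(2020, 11, 26), days=3)
print(schedule[0], schedule[-1], len(schedule))
```

Saturday contributes no checks, so the three-day range yields ten checks on Thursday and ten on Friday.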
How to detect problems in the incoming data flow?
TRIGGER
Triggers!
TRIGGER
Trigger: a problem definition
TRIGGER
Example
{server:system.cpu.load.last()} > 5
Operators
- + / * < > = <> >= <= not or and
Functions
min max avg last count date time diff regexp and much more!
Analyze everything: any metric and any host
{node1:system.cpu.load.last()} > 5 and {node2:system.cpu.load.last()} > 5 and {nodes:tps.last()} < 5000
TRIGGER
Performance
{server:system.cpu.load.last()} > 5
JUNIOR LEVEL
FALSE POSITIVES
«Flapping»
{server:system.cpu.load.last()} > 5
Availability
{server:net.tcp.service[http].last()} = 0
JUNIOR LEVEL
TOO SENSITIVE
{server:net.tcp.service[http].last()} = 0
Too sensitive leads to false positives
How to avoid false positives?
HOW TO AVOID FALSE POSITIVES?
Be careful and define problems wisely!
What does it really mean?
"system is overloaded"
"application does not work"
"service is not available"
EXAMPLES
Problem: CPU load > 5. No problem: CPU load = 4.99. Resolved?
Problem: free disk space < 10%. No problem: free disk space = 10.001%. Resolved?
Problem: SSH check failed. No problem: SSH is up. Resolved?
ANALYZE HISTORY
Performance
{server:system.cpu.load.min(10m)} > 5
Availability
{server:net.tcp.service[http].max(5m)} = 0
{server:net.tcp.service[http].max(#3)} = 0
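The history-based expressions above differ from last() in that a single outlier no longer fires the trigger. The sketch below imitates min(10m) and max(#3) over a list of (timestamp, value) samples. The helper names `min_over` and `max_of_last` and the sample data are illustrative assumptions, not Zabbix code.

```python
def min_over(history, now, window_s):
    """Smallest value among samples with now - window_s < timestamp <= now."""
    values = [v for t, v in history if now - window_s < t <= now]
    return min(values) if values else None


def max_of_last(history, count):
    """Largest value among the most recent `count` samples."""
    values = [v for _, v in history[-count:]]
    return max(values) if values else None


# (timestamp in seconds, CPU load): one short spike at t=120.
cpu = [(0, 1.0), (60, 1.2), (120, 7.5), (180, 1.1)]
spike_by_last = cpu[2][1] > 5                    # last() reacts to the spike
spike_by_min = min_over(cpu[:3], 120, 600) > 5   # min over 10 minutes does not

# (timestamp, HTTP check result): 1 = up, 0 = down.
http = [(0, 1), (60, 0), (120, 1), (180, 0), (240, 0), (300, 0)]
one_failure = max_of_last(http[:2], 3) == 0      # a single failed check
three_failures = max_of_last(http, 3) == 0       # three failed checks in a row
print(spike_by_last, spike_by_min, one_failure, three_failures)
```

A minimum above the threshold means every sample in the window was above it. A maximum of 0 means every one of the last checks failed.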
ANALYZE HISTORY
{server:system.cpu.load.min(10m)} > 5
ANALYZE HISTORY
{server:net.tcp.service[http].max(#3)} = 0
DIFFERENT CONDITIONS FOR PROBLEM AND
RECOVERY
Before
{server:system.cpu.load.last()} > 5
Now
Problem definition: {server:system.cpu.load.last()}>5
Recovery expression: {server:system.cpu.load.last()}<=1
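Separate problem and recovery conditions act as hysteresis: the trigger enters PROBLEM above 5 and leaves it only at or below 1. A minimal sketch of that state machine follows, assuming the thresholds from the slide. The function names and the sample load values are hypothetical.

```python
def next_state(state, value, problem_above=5.0, recover_at_or_below=1.0):
    """Advance the trigger state by one new value."""
    if state == "OK" and value > problem_above:
        return "PROBLEM"
    if state == "PROBLEM" and value <= recover_at_or_below:
        return "OK"
    return state  # between the two thresholds the state is kept


def count_state_changes(values, **thresholds):
    """Count how many times the trigger flips state over a series of values."""
    state, changes = "OK", 0
    for value in values:
        new_state = next_state(state, value, **thresholds)
        if new_state != state:
            changes += 1
        state = new_state
    return changes


# CPU load hovering around the threshold, then a real recovery.
load = [4.9, 5.1, 4.9, 5.2, 4.8, 0.9]
single_threshold = count_state_changes(load, recover_at_or_below=5.0)
with_recovery = count_state_changes(load)
print(single_threshold, with_recovery)
```

With one threshold the trigger flaps on every crossing. With a separate recovery condition the same data produces one problem and one recovery.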
DIFFERENT CONDITIONS FOR DIFFERENT TRIGGER STATES
{server:system.cpu.load.last()} > 5 … {server:system.cpu.load.last()} <= 1
No false positives!
System is overloaded
Problem definition: {server:system.cpu.load.min(5m)}>3
Recovery expression: {server:system.cpu.load.max(2m)}<=1
No free disk space /
Problem definition: {server:vfs.fs.size[/,pfree].last()}<10
Recovery expression: {server:vfs.fs.size[/,pfree].min(15m)}>30
SSH is not available
Problem definition: {server:net.tcp.service[ssh].max(#3)}=0
Recovery expression: {server:net.tcp.service[ssh].min(#10)}=1
EXAMPLES
How to detect?
By comparing current data with data from the same period in the past.
Average CPU load over the last hour is more than twice the
average CPU load for the same hour a week ago
{server:system.cpu.load.avg(1h)} > 2 * {server:system.cpu.load.avg(1h,7d)}
ANOMALIES
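The comparison in the expression above can be sketched as two window averages, one ending now and one shifted back by seven days. This is an illustration only; `avg_window`, `is_anomaly` and the generated sample data are assumptions, not part of Zabbix.

```python
HOUR = 3600
WEEK = 7 * 24 * HOUR


def avg_window(history, now, window_s, shift_s=0):
    """Average of samples in the window of length window_s ending at now - shift_s."""
    end = now - shift_s
    values = [v for t, v in history if end - window_s < t <= end]
    return sum(values) / len(values) if values else None


def is_anomaly(history, now, factor=2.0):
    """True if the last hour's average exceeds factor times the average
    of the same hour one week earlier."""
    current = avg_window(history, now, HOUR)
    baseline = avg_window(history, now, HOUR, shift_s=WEEK)
    if current is None or baseline is None:
        return False  # not enough data to compare
    return current > factor * baseline


now = 8 * 24 * HOUR
last_week = [(now - WEEK - 60 * i, 1.0) for i in range(60)]  # load 1.0 a week ago
this_hour = [(now - 60 * i, 2.5) for i in range(60)]         # load 2.5 now
anomaly = is_anomaly(last_week + this_hour, now)
print(anomaly)
```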
ANOMALIES
Comparison with the data from 7 days ago
FORECAST
Trigger function timeleft
When will we reach a certain threshold value?
FORECAST
Trigger function forecast
When will we reach a certain threshold value?
Figure labels: 4 hours, 5.2 %
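Both functions rest on fitting a trend to recent history and extrapolating it. The sketch below uses a plain least-squares line, which is an assumption made for illustration: it is not Zabbix's implementation, and the names `forecast_value` and `time_left` as well as the disk-space samples are hypothetical.

```python
def linear_fit(points):
    """Least-squares line through (t, v) points; returns (slope, intercept)."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / var_t
    return slope, mean_v - slope * mean_t


def forecast_value(points, now, horizon_s):
    """Predicted value horizon_s seconds after now."""
    slope, intercept = linear_fit(points)
    return slope * (now + horizon_s) + intercept


def time_left(points, now, threshold):
    """Seconds until the fitted line reaches threshold, or None if it never does."""
    slope, intercept = linear_fit(points)
    if slope == 0:
        return None
    t_hit = (threshold - intercept) / slope
    return t_hit - now if t_hit >= now else None


# Free disk space in percent, dropping one point per hour.
free_space = [(0, 20.0), (3600, 19.0), (7200, 18.0), (10800, 17.0)]
in_four_hours = forecast_value(free_space, now=10800, horizon_s=4 * 3600)
until_ten_percent = time_left(free_space, now=10800, threshold=10.0)
print(round(in_four_hours, 2), round(until_ten_percent / 3600, 2))
```

The first result answers "what will the value be in four hours", the second "how long until the threshold is reached".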
DOES HISTORY ANALYSIS AFFECT PERFORMANCE OF ZABBIX?
Yes, but not significantly, especially since Zabbix 2.2.0.
Diagram: DATABASE, ZABBIX SERVER, CACHE
DEPENDENCIES
CRM is not working
DB is unavailable
No free disk space
Hide dependent problems
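Hiding dependent problems can be pictured as a walk up the dependency chain: a problem is shown only if nothing it depends on is also in a problem state. The sketch assumes the chain runs CRM, then DB, then disk space (inferred from the slide order) and treats dependencies transitively. This is a simplification and may not match Zabbix's exact rules, and all names are illustrative.

```python
def has_failed_dependency(trigger, active, depends_on, seen=None):
    """True if any trigger upstream of `trigger` is currently in a problem state."""
    seen = set() if seen is None else seen
    for parent in depends_on.get(trigger, []):
        if parent in seen:  # guard against dependency loops
            continue
        seen.add(parent)
        if parent in active or has_failed_dependency(parent, active, depends_on, seen):
            return True
    return False


def visible_problems(active, depends_on):
    """Problems that are not explained by another active problem."""
    return sorted(t for t in active if not has_failed_dependency(t, active, depends_on))


depends_on = {
    "CRM is not working": ["DB is unavailable"],
    "DB is unavailable": ["No free disk space"],
}
all_three = {"CRM is not working", "DB is unavailable", "No free disk space"}
shown = visible_problems(all_three, depends_on)
print(shown)
```

Only the root cause stays visible. If the CRM fails on its own, its problem is shown as usual.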
How to display problems?
SECTION "PROBLEMS"
TAGS
Tag word: meaning
Customer: Alza
Customer: Globus
Datacenter: NY2
Datacenter: San Francisco
Area: Performance
Area: Availability
Area: Security
Environment: Staging
Environment: Test
User impact: None
User impact: Critical
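Tags like these make it possible to slice the problem list by customer, area or impact. A small sketch of such a filter follows, using the tag names from the list above. The problem names, the tag combinations and the helper `matches` are made up for illustration.

```python
def matches(problem_tags, wanted):
    """True if every wanted tag/value pair is present on the problem.
    problem_tags is a list of (tag, value) pairs; wanted maps tag -> value."""
    return all((tag, value) in problem_tags for tag, value in wanted.items())


problems = [
    {"name": "CPU load too high",
     "tags": [("Customer", "Alza"), ("Area", "Performance"), ("Environment", "Staging")]},
    {"name": "SSH is not available",
     "tags": [("Customer", "Globus"), ("Area", "Availability"), ("User impact", "Critical")]},
    {"name": "No free disk space",
     "tags": [("Customer", "Alza"), ("Area", "Availability"), ("User impact", "Critical")]},
]

wanted = {"Customer": "Alza", "User impact": "Critical"}
critical_alza = [p["name"] for p in problems if matches(p["tags"], wanted)]
print(critical_alza)
```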
USE OF OBTAINED VALUES
Use extracted values in tags or problem names
Example:
“RAID Controller Module wide port has gone to failed state. Enclosure 0, Slot 0 (Critical)”
instead of
“New SNMP trap received (Critical)”
How to react to problems?
POSSIBLE REACTIONS
Event correlation
Automated problem resolution
Manual problem closing
Sending notifications to a user or a group of users
Creating tickets in the helpdesk system
EVENT CORRELATION ON TRIGGER LEVEL
Event correlation at the trigger level lets you correlate separate problems reported by a single trigger, so that each one is opened and resolved individually.
EVENT CORRELATION ON TRIGGER LEVEL
10/Aug/2016:06:25:30 service Jira stopped “Service Jira stopped” PROBLEM
How does it work?
EVENT CORRELATION ON TRIGGER LEVEL
How does it work?
10/Aug/2016:06:25:30 service Jira stopped “Service Jira stopped” PROBLEM
10/Aug/2016:06:27:32 service MySQL stopped “Service MySQL stopped” PROBLEM
EVENT CORRELATION ON TRIGGER LEVEL
How does it work?
10/Aug/2016:06:25:30 service Jira stopped “Service Jira stopped” PROBLEM
10/Aug/2016:06:27:32 service MySQL stopped “Service MySQL stopped” RESOLVED
10/Aug/2016:06:28:11 service MySQL started
EVENT CORRELATION ON TRIGGER LEVEL
How does it work?
10/Aug/2016:06:25:30 service Jira stopped “Service Jira stopped” PROBLEM
10/Aug/2016:06:27:32 service MySQL stopped “Service MySQL stopped” RESOLVED
10/Aug/2016:06:28:11 service MySQL started
10/Aug/2016:06:34:22 service Redis stopped “Service Redis stopped” PROBLEM
EVENT CORRELATION ON TRIGGER LEVEL
How does it work?
10/Aug/2016:06:25:30 service Jira stopped “Service Jira stopped” PROBLEM
10/Aug/2016:06:27:32 service MySQL stopped “Service MySQL stopped” RESOLVED
10/Aug/2016:06:28:11 service MySQL started
10/Aug/2016:06:34:22 service Redis stopped “Service Redis stopped” RESOLVED
10/Aug/2016:06:37:58 service Redis started
EVENT CORRELATION ON TRIGGER LEVEL
How does it work?
10/Aug/2016:06:25:30 service Jira stopped “Service Jira stopped” RESOLVED
10/Aug/2016:06:27:32 service MySQL stopped “Service MySQL stopped” RESOLVED
10/Aug/2016:06:28:11 service MySQL started
10/Aug/2016:06:34:22 service Redis stopped “Service Redis stopped” RESOLVED
10/Aug/2016:06:37:58 service Redis started
10/Aug/2016:06:55:31 service Jira started
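The sequence above can be replayed in a short sketch: one log item feeds one trigger, the service name acts as the correlating tag, a "stopped" line opens a problem and a "started" line closes only the problem with the same tag value. The regular expression and the function `open_problems_after` are illustrative assumptions and do not show Zabbix configuration.

```python
import re

# Matches lines such as "10/Aug/2016:06:25:30 service Jira stopped".
LINE = re.compile(r"^(?P<ts>\S+) service (?P<service>\S+) (?P<event>stopped|started)$")


def open_problems_after(lines):
    """Replay log lines and return {service: timestamp} for still-open problems."""
    open_problems = {}
    for line in lines:
        match = LINE.match(line)
        if match is None:
            continue  # not a service start/stop message
        service = match["service"]
        if match["event"] == "stopped":
            open_problems[service] = match["ts"]  # new problem tagged by service
        else:
            open_problems.pop(service, None)      # close only the matching problem
    return open_problems


log = [
    "10/Aug/2016:06:25:30 service Jira stopped",
    "10/Aug/2016:06:27:32 service MySQL stopped",
    "10/Aug/2016:06:28:11 service MySQL started",
    "10/Aug/2016:06:34:22 service Redis stopped",
    "10/Aug/2016:06:37:58 service Redis started",
    "10/Aug/2016:06:55:31 service Jira started",
]
still_open_midway = sorted(open_problems_after(log[:4]))
still_open_at_end = open_problems_after(log)
print(still_open_midway, still_open_at_end)
```

After the fourth line Jira and Redis are still open while MySQL is already resolved, which matches the intermediate slide.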
EVENT CORRELATION
A new problem appears
Existing problems
EVENT CORRELATION
Existing problems No correlation rules
EVENT CORRELATION
Existing problems No correlation rules
EVENT CORRELATION
Existing problems No correlation rules (close old)
ESCALATE!
Immediate reaction
Delayed reaction
Notification if an automatic action fails
Repeated notifications
Escalation to a new level
ESCALATE!
Critical problem
Repeated Email
SMS and ticket
Service restart
SMS to manager
Timeline: 0 min, 5 min, 10 min, 15 min, 20 min
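An escalation is a list of steps that keep firing while the problem stays unresolved. The sketch below assumes one action per 5-minute step in the order listed on the slide, with a plain "Email" assumed as the first step. That pairing is our reading of the timeline, not something the slide states, and the names are hypothetical.

```python
# (minutes since the problem started, action): assumed pairing, see above.
ESCALATION_STEPS = [
    (0, "Email"),
    (5, "Repeated Email"),
    (10, "SMS and ticket"),
    (15, "Service restart"),
    (20, "SMS to manager"),
]


def actions_taken(problem_duration_min):
    """Actions executed for a problem that lasted problem_duration_min minutes."""
    return [action for delay, action in ESCALATION_STEPS if delay <= problem_duration_min]


resolved_after_12_min = actions_taken(12)
print(resolved_after_12_min)
```

A problem resolved after 12 minutes never reaches the service restart or the manager.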
IN SUMMARY
Analyze history
No problem != solution
Use different conditions for problem definition and recovery
Pay attention to anomaly detection
Use correlation
Resolve common problems automatically
Do not hesitate to escalate!
ZABBIX SERVICES
ZABBIX DEMO
The video can be found here: https://www.coreit.cz/monitoring-it-prostredi
ZABBIX WEBINARS
We have prepared more Zabbix webinars for you. All the information is available on our website: https://www.coreit.cz/zabbix-webinare-do-konce-roku-2020/
26.11. Extending Zabbix functionality
1.12. Get to know Zabbix
3.12. Communicating with Zabbix via the API
8.12. Data visualization in Zabbix
PROFESSIONAL TRAINING
Become Zabbix certified without attending the training. If you are certain of your knowledge, ZCU, ZCS and ZCP exams can be purchased separately.
Current dates of upcoming training courses can be found here: https://www.zabbix.com/training?language=czech#training_schedule
If you need any information, do not hesitate to contact us:
CONTACT US
Phone: +420 840 771 177
Web: https://www.coreit.cz
Email: [email protected]
LinkedIn: https://www.linkedin.com/company/coreitcz/
https://www.linkedin.com/in/hermanekt/
Twitter: https://twitter.com/CoreITcz
https://twitter.com/hermanekt
Mobile (Tomáš Heřmánek): +420 732 447 184
QUESTIONS?
THANK YOU FOR YOUR ATTENTION!