Failover and High Availability Solutions for Nagios XI
Andy Brist
Introduction
Who am I?
Nagios Support Team Manager
Team Lead for Nagios-Plugins (github.com/nagios-plugins)
Disclaimer
Every environment is different
Failover/HA is, by nature, a customized solution
My case studies are not your production environments
I know Nagios/XI, not your SLA
Test in a lab. First.
Agenda
Short overview of the different failback/failover solutions
Nagios XI Data Locations and other files/services relevant to failover scenarios.
Snapback
Failback
Failover
HA? Failover
Observations, Considerations
Backup (snapback)
Restore VM snapshot or spin up a new instance and restore a backup
Most common implementation
Easiest of all options
Most potential downtime of all scenarios
Maximum historical and configuration data lost = the interval between snapshots
Requires manual intervention
Automated XI Backups
XI provides a method for scheduled backups through the "Scheduled Backups Component":
ssh
ftp
local fs
Useful for remote backups or manual failback
Failback
Failback
Secondary is periodically updated from an XI backup.
The nagios process is started by hand when the master has an issue.
Cronjob on the secondary restores newest backup once a day.
If unconcerned with historical data and mrtg performance data, just push/restore the object configs and sql dumps (if not offloaded)
Not to be confused with snapback as this is a separate, different instance/image, not just a previous state of the failed instance.
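The daily cron job on the secondary needs a way to pick the newest backup before restoring it. A minimal sketch of that selection step, assuming the backups arrive as files in a single directory. The `.tar.gz` suffix and the function name `newest_backup` are hypothetical, and the restore step itself is deliberately left out:

```python
from pathlib import Path
from typing import Optional


def newest_backup(backup_dir: str, suffix: str = ".tar.gz") -> Optional[Path]:
    """Return the most recently modified backup file, or None if none exist."""
    directory = Path(backup_dir)
    if not directory.is_dir():
        # A missing directory is treated the same as "no backups yet".
        return None
    candidates = [
        p for p in directory.iterdir()
        if p.is_file() and p.name.endswith(suffix)
    ]
    if not candidates:
        return None
    # Modification time decides which backup is newest.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Returning None instead of raising lets the cron job skip the restore quietly when nothing has been pushed yet.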
Additional Considerations
Easy to implement with the Scheduled Backups XI component.
Agents must maintain 2+ allowed hosts
SNMP traps must be configured to push to 2+ hosts
May experience substantial downtime if the backup is large and the primary fails during a data restore on the secondary.
Failover
Difficult to get right
Demanding on i/o resources and network speed
Very little to no loss of historical data
Minimal downtime
Fully automated
Can provide minimal clustering for XI services through High Availability
Failover
Nagios XI
Object Configuration
Check Status
Object State
Program State
Historical State Data
Performance Data
Nagios XI - Services
nagios      Monitoring engine
mysql       Object configuration and ndo historical data
ndo2db      Writes historical data to mysql database
postgresql  Nagios XI settings/user database
npcd        Performance data daemon
crond       Task scheduler
httpd       Web server
XI Data and Redundancy
Absolute minimum redundant data required for any failover scenario:
(Working) Object configuration
Mysql 'nagiosql' database
Postgresql 'nagiosxi' database
Full Check Redundancy
Additional requirements for full check redundancy:
mrtg config and RRDs (for bandwidth checks)
nagios libexec folder (plugins)
Any additional dependencies for plugins. For example:
VMware SDK
Oracle Perl Library
Java JRE
Runtime State Redundancy
Additional requirements for runtime state redundancy:
retention.dat (state, runtime options, acknowledgments, notification depth)
NDO mysql database "nagios"
Historical Redundancy
Additional data required for complete historical redundancy:
nagios.log and archives directory
perfdata RRDs
mrtg config and RRDs
NDO mysql database "nagios"
XI Data Summary
Logs/archives
Perfdata
Mrtg/configs
Databases
Object configs
Plugins
XI Data Summary
/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/archives/
/usr/local/nagios/share/perfdata/
/var/lib/mrtg/
/etc/mrtg/
/var/lib/pgsql/
/var/lib/mysql/
/usr/local/nagios/etc/
/usr/local/nagios/libexec/
/usr/local/nagiosxi/
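The two summary slides pair up naturally: each data category lives in one or more of the listed paths. The sketch below records that pairing as a Python mapping and flattens it into a list that a sync or backup job could iterate over. The pairing is an inference from the path names, not something stated on the slides, and the names `XI_DATA` and `paths_for` are hypothetical.

```python
# Category-to-path pairing inferred from the path names (an assumption).
XI_DATA = {
    "logs/archives": ["/usr/local/nagios/var/nagios.log",
                      "/usr/local/nagios/var/archives/"],
    "perfdata": ["/usr/local/nagios/share/perfdata/"],
    "mrtg/configs": ["/var/lib/mrtg/", "/etc/mrtg/"],
    "databases": ["/var/lib/pgsql/", "/var/lib/mysql/"],
    "object configs": ["/usr/local/nagios/etc/"],
    "plugins": ["/usr/local/nagios/libexec/"],
    "nagios xi": ["/usr/local/nagiosxi/"],
}


def paths_for(categories):
    """Return the paths for the given categories, in order, without duplicates."""
    seen = []
    for category in categories:
        for path in XI_DATA[category]:
            if path not in seen:
                seen.append(path)
    return seen
```

For example, `paths_for(["object configs", "databases"])` yields the three paths needed for the minimum redundancy level described earlier, provided the databases are not offloaded.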
High Availability?
1. Elimination of single points of failure.
2. Reliable crossover/failover.
3. Detection of failures as they occur.
High Availability?
Why would you need it?
Least amount of downtime
(limited) Service clustering
Shared volumes solve the issues with syncing historical data in redundant configurations
High Availability/Failover
Major components:
Shared storage
Virtual IP
Management applications/scripts
Shared Storage
DRBD: Block-level replication, part of the Linux kernel, well supported and understood. Works well for all XI data types (including RRDs/DBs).
NFS: Fine option, just make sure the NFS share does not have an i/o latency issue or your checks WILL get behind. Do not mount the volume on more than one server at a time, to avoid writing multiple checks in the case of a partial failover.
Replicated DBs: Fine solution, clusters well. Use DNS or virtual IPs to control access to the databases.
rsync: Not immediate replication, but close. Easy to implement.
GlusterFS: More problematic to set up, but good for offloaded mrtg/RRDs.
DRBD
Active/passive suggested
Low latency storage
Active mount should move with the vip
Refer to Jeremy Rust's presentation notes for more information
Virtual IP
pacemaker vip script
Custom ifconfig/ip shell scripts
uCarp Scripts
keepalived
HA Failover Management
Pacemaker/Heartbeat (the HA stack)
uCarp scripts
keepalived scripts
Custom Scripts:
nagios itself: Event handler driven
cron: Job that checks the master for connectivity. Reuse the check_icmp or check_http plugins for this purpose.
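A cron-driven check of the master can be sketched as follows. This is a minimal illustration, not a replacement for check_icmp or check_http: it uses a plain TCP connect as the probe, and the default port and the threshold of three consecutive failures are assumptions chosen for the example. Requiring several failures in a row is one way to avoid failing over on a single dropped probe.

```python
import socket


def master_reachable(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the master succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def should_fail_over(recent_results, threshold: int = 3) -> bool:
    """Fail over only if the last `threshold` probes all failed.

    recent_results is a list of booleans, oldest first (True = reachable).
    """
    if len(recent_results) < threshold:
        return False
    return not any(recent_results[-threshold:])
```

The cron job would append each `master_reachable` result to a small state file and call `should_fail_over` on the stored history. How that history is persisted is left open here.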
Extra Considerations
STONITH
Clustering?
DRBD/Shared Storage
High Latency HA
NDO/Databases
Recovery
STONITH
(shoot the other node in the head)
Mechanism by which a failing server is guaranteed to be removed from the cluster
Not required, but advised
Hardware (including ups) and software (vmware stonith device and shell scripts)
Only failing over when the primary is unreachable is safest
Beware of overzealous failover conditions as they can lead to a . .
Deathmatch!
No, really. Stonith gives your servers the ability to KILL THEMSELVES and FRIENDS
Beware of services whose init actions/failures should not cause failover/stonith
Any actions requiring a shared volume in active/passive mode should not immediately cause failover due to potential latency during volume mounts
Test, test, test the disaster scenarios in a LAB first or the fragfest may include your job!
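The STONITH guidance amounts to a conservative decision rule: fail over when the primary is unreachable, and treat service start failures or a slow volume mount as reasons to wait or alert rather than to fence. A minimal sketch of such a rule, assuming the cluster scripts can report these three conditions. The `NodeStatus` fields, the return values, and the 60-second grace period are hypothetical names and values for illustration only:

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    reachable: bool             # did the primary answer the connectivity probe?
    services_ok: bool           # are the monitored services reporting healthy?
    seconds_since_mount: float  # time since the shared volume was (re)mounted


def decide(status: NodeStatus, mount_grace: float = 60.0) -> str:
    """Return 'failover', 'wait', 'alert', or 'ok' for the given primary status."""
    if not status.reachable:
        # Only an unreachable primary triggers failover.
        return "failover"
    if status.seconds_since_mount < mount_grace:
        # The volume may still be settling; do not judge services yet.
        return "wait"
    if not status.services_ok:
        # Service failures are escalated to a human, not to STONITH.
        return "alert"
    return "ok"
```

Keeping "alert" separate from "failover" means a crashed service on an otherwise reachable node never starts a fencing cycle on its own.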
Clustering/Fencing
A number of portions of Nagios Core and Nagios XI are clusterable. Processes that can potentially be clustered:
offloaded postgresql
offloaded mysql/ndo2db
offloaded mrtg
Services that are dependent on the core monitoring engine and filesystem and should not be clustered:
nagios, npcd, cronjobs
httpd
snmptrapd, snmptt
DUAL DRBD Primary
Disconnecting from the master before mounting of the shared volume during failover is no longer needed.
Careful implementation allows multiple servers to concurrently access the shared volume. Potentially useful for ambitious clusters and shared historical records.
Slower, as the secondary can lock blocks.
More prone to split-brains
Usually requires clustered file systems
High Latency HA
Problematic if the HA solution was not designed for potential high latency
Will potentially cause i/o wait issues
It may be better to push checks to a central server(s) with NRDP/outbound checks/etc, keeping HA solutions local, or to pay for a faster pipe.
DRBD Proxy: A good solution if high latency HA is a must. Uses an asynchronous buffer for block writes to the secondary volumes (does not support dual primary).
NDO Considerations
Enforce single ndo instance access to mysql
If multiple ndo processes connecting to a single ndo db is required, consider using ndo db instances
You can control ndo's access to the mysql server through iptables and the vip.
Offload ndo2db to the offloaded mysql server
Configure ndomod to connect through a TCP socket. This can potentially decrease load on the nagios server.
Database Considerations
Initiating failover due to crashed DBs may cause a deathmatch as all nodes will fail (due to their shared nature)
Offload both postgresql and mysql databases. Requires a virtual ip or careful management of DNS.
XI has scripts to repair the databases, use them!
Recovering from Failover
Degraded ex-primaries should not be added back to the cluster automatically. Doing so may cause split brains.
Split brains REQUIRE manual intervention if preservation of historical data is desired.
Stonith Deathmatches: Have a primary image/instance without stonith enabled for recovery.
Maintain an ultimate disaster recovery server instance/image outside of the cluster pool for when all else has failed.
A Plea from Nagios Support
Failover/HA != backups
Test, test, TEST! Use your lab please.
Document. Everything. The biggest hurdles for support are unknown, undocumented, non-standard configurations. Failover/HA deployments definitely qualify.
Final Comparisons
Snapback: Easy. Slow recovery. Requires manual intervention. Highest potential historical loss.
Failback: Intermediate. Moderate recovery. Can be automated. Less historical loss.
Failover: Difficult. Fast recovery. Fully automated. Nearly no historical loss.
High Availability: Difficult. Fast recovery. Automated. Redundancy across WAN links. Limited clustering. Least potential downtime. Multiple potential issues with split-brain, stonith/deathmatches and latency, so care should be given, and scenarios tested.
Food for thought . . . .
HA in a federated model . . . . . . . .
Final Questions For You
How much of Nagios XI, or Core, can truly be set up to be "HA"? Do you care? :P
Do you need HA/failover, or will failback/snapback suffice?
Is the time trade off in your environment worth it?
Questions for Me?
Any questions?
(common/critical answers noted below for the sake of efficiency)
11 meters/sec (unladen European swallow)
42
The Prime Directive
3 Times
The Categorical Imperative/Pragmatism (choose 1)
No.*
Evasive Subjunctive
. . . Yes?
The End
Andy Brist