Monitoring at/with SUSE 2015

1. Monitoring at/with SUSE: How SUSE R&D checks network and system resources
Lars Vogdt, Team Lead SUSE IT

2. Agenda
- Official history
- What are you using at SUSE?
- Where are the Nagios Plugins?
- SUSE R&D internal usage
- Tips & Tricks
- Highly available, load-balanced monitoring
- Demo (if possible)

3. The short history of Monitoring in SUSE, Part 1
- Saturday, October 13, 2001: SuSE Linux 7.3
  * The first monitoring tool: NetSaint version 0.0.7b6
- Monday, September 30, 2002: SuSE Linux 8.1
  * Welcome Nagios in SuSE (version 1.0b4)
- Saturday, April 16, 2005: SuSE Linux 9.3
  * Nagios (version 1.2 in SuSE) was Project of the Month on SourceForge.net

4. The short history of Monitoring in SUSE, Part 2
- Monday, June 18, 2007: SUSE Linux Enterprise Server 10 SP1
  * Nagios version 2.6 was with us until 2013
- Tuesday, March 24, 2009: SUSE Linux Enterprise Server 11
  * Nagios version 3.0.6: as a monitoring tool, stronger than ever...
- Wednesday, May 6, 2009: Icinga forked from Nagios
  * Icinga will be part of the next SUSE Manager release
  => Migrating from Nagios to Icinga 1.x is easy

5. What are you using at SUSE?
- SUSE Linux Enterprise Server with High Availability Extension
- Additional packages from obs://server:monitoring, obs://devel:languages:perl and obs://network:telephony
- Internal packages (for no-source or legally problematic packages, mostly license problems)
- We release all internal tools in obs://server:monitoring as soon as possible, unless there are legal reasons not to (which affects only a handful of packages/scripts).

6. Where are the Nagios Plugins?
- https://bugzilla.suse.com/show_bug.cgi?id=859105 and especially https://bugzilla.redhat.com/show_bug.cgi?id=1054340 have all the details
- 2014-07-15: nagios-plugins* got renamed to monitoring-plugins*. Since that day, server:monitoring has contained unmaintained nagios-core-plugins* and maintained monitoring-plugins* packages.

7. SUSE R&D internal

8. SUSE R&D specials
- Crazy customers (developers)
- No need for monitoring or statistics until something breaks or becomes relevant for business ???
- Multiple dual-stacked networks (IPv4 + IPv6), separated via firewalls
- Production vs. Development => high number of moving targets
- Multiple hardware vendors, even NDA hardware without any further details or manuals
- Luckily mostly uniform operating systems :-) but many different services

9. The Past

Location    Core System  Addons  Hosts  Services
Nuremberg   Nagios               43     2000
Prague      Nagios               15     65
Provo       Nagios               12     120
Summary                          70     2185

Latency in Nuremberg   Maximal      Average
Host Check Latency     ~7 seconds   ~1.5 seconds
Service Check Latency  ~8 seconds   ~1 second

10. Current Situation

Location     Core System  Hosts  Services
Nuremberg 1  Icinga       430    4700
Nuremberg 2  Nagios       96     1150
Prague       Icinga       140    1700
Provo        Nagios       30     170
Summary                   696    7720

Latency in Nuremberg   Maximal      Average
Host Check Latency     ~5 seconds   ~1.5 seconds
Service Check Latency  ~3 seconds   ~1 second

[Chart: services monitored per location (Nuremberg, Prague, Provo), past vs. current]

11. Some small tips

12. Why you should always define dependencies

13. What should be monitored?

Administrator View                               Business View
Hardware health                                  Service health
Service availability, host based                 Service availability, business based
Overview of the services and incidents           Overview of the final business impact,
  of single hosts                                  not the service components
Only important for administrators                Important for managers and customers

14. What can be checked?
Nearly everything is possible! Minimal requirements:
- Your script returns one of the following exit codes:
  3: Unknown: something outside the normal control range (of your script?) happened
  2: Critical: something critical happened! Help needed!
  1: Warning: well, it works currently, but be warned
  0: OK: everything is ok
- Some (human-readable) output on STDOUT would be nice, but is not necessary for Nagios or Icinga itself.
- Print performance data on STDOUT, separated from the normal output via '|'
- https://nagios-plugins.org/doc/guidelines.html
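
The exit-code and performance-data conventions above can be sketched as a small shell function. This is only an illustration under stated assumptions: the name check_threshold and its arguments are hypothetical, and it handles plain non-negative integers only.

```shell
#!/bin/bash
# Sketch of the plugin conventions: exit code 0/1/2/3 plus
# "text | perfdata" on STDOUT.
# check_threshold LABEL VALUE WARN CRIT   (hypothetical helper)
check_threshold() {
    label=${1:-} value=${2:-} warn=${3:-} crit=${4:-}

    # Missing or non-numeric input is outside the script's control range: UNKNOWN.
    for n in "$value" "$warn" "$crit"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "UNKNOWN: usage: check_threshold LABEL VALUE WARN CRIT"
                return 3 ;;
        esac
    done

    # Performance data format: label=value;warn;crit
    perf="$label=$value;$warn;$crit"

    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL: $label is $value | $perf"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING: $label is $value | $perf"
        return 1
    fi
    echo "OK: $label is $value | $perf"
    return 0
}
```

A standalone plugin would end with something like `check_threshold users 5 10 20; exit $?` so that the function's return value becomes the exit code seen by Nagios or Icinga.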

15. Example check: check_file_exists
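
The code shown on this slide is not part of the transcript. The following is a minimal sketch of what such a check might look like, not the original plugin: it assumes that a missing file should be reported as CRITICAL and that the file path is passed as the only argument.

```shell
#!/bin/bash
# Sketch of a file-existence check (assumed behaviour, see lead-in).
# check_file_exists FILE
check_file_exists() {
    file=${1:-}

    # No argument: the check cannot decide anything, so report UNKNOWN.
    if [ -z "$file" ]; then
        echo "UNKNOWN: usage: check_file_exists FILE"
        return 3
    fi

    if [ -e "$file" ]; then
        echo "OK: $file exists"
        return 0
    fi

    echo "CRITICAL: $file does not exist"
    return 2
}
```

Used as a plugin, the script would finish with `check_file_exists "$1"; exit $?`.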

16. Eventhandlers
If a service or host is in a defined, unwanted state, trigger external scripts to solve the problem automatically (restart Apache if it crashes, send an SMS if nobody acknowledges a problem, shut down all OBS workers if Lars is not available, ...).
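
As a rough sketch of the "restart Apache if it crashes" case: an event handler is usually called with the service state and state type, which Nagios and Icinga 1.x provide as the $SERVICESTATE$ and $SERVICESTATETYPE$ macros. The function name, the RESTART_CMD variable and the init script path below are assumptions for illustration, not taken from the SUSE setup.

```shell
#!/bin/bash
# Sketch of an event handler: restart a service once a CRITICAL state
# is confirmed (HARD). Names and the restart command are hypothetical.
# handle_service_event STATE STATE_TYPE
handle_service_event() {
    state=${1:-} state_type=${2:-}

    case "$state" in
        CRITICAL)
            # SOFT states are still being re-checked; act only on HARD.
            if [ "$state_type" = "HARD" ]; then
                # RESTART_CMD can be overridden, e.g. for a dry run.
                ${RESTART_CMD:-/etc/init.d/apache2 restart}
            fi
            ;;
        *)
            # OK, WARNING and UNKNOWN need no automatic action here.
            ;;
    esac
}
```

Setting `RESTART_CMD="echo would restart"` before calling the function allows the logic to be tried without touching a running service.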

17. Monitoring SANBoxes with MRTG
For QLogic, run the following command on your MRTG machine:
    /usr/bin/cfgmaker --global "WorkDir: /srv/www/htdocs/mrtg" \
      --global "Options[_]: growright, bits, unknaszero" \
      --ifdesc=alias,name --ifref=name --noreversedns --no-down --show-op-down \
      --subdirs=sanbox-1 --output=/etc/mrtg/sanbox-1.conf \
      --snmp-options=:::::2 192.168.0.1
...or for Cisco MD:
    /usr/bin/cfgmaker --global "WorkDir: /srv/www/htdocs/mrtg" \
      --global "Options[_]: growright, bits, unknaszero" \
      --ifdesc=alias --noreversedns --no-down --show-op-down \
      --subdirs=sanbox-2 --output=sanbox-2.conf \
      --snmp-options=:::::2 192.168.0.2

18. Monitoring IO on your machines
On the machine you want to monitor:
- Install monitoring-plugins-sar-perf
- Prepare a command like (NRPE example):
    command[check_iostat_home]=/usr/lib/nagios/plugins/check_iostat -d root-fs_home -w 120000,120000,120000 -c 150000,150000,150000 -W 30 -C 50
- Maybe also enable sysstat (chkconfig boot.sysstat on) to have the data available on the host directly

19. MRTG graphs for network interfaces of virtual machines
On the server running the virtual machines, edit /etc/snmp/snmpd.conf:
    [...]
    rocommunity public 10.0.0.0/16
    [...]
On your MRTG machine, run:
    /usr/bin/cfgmaker --global "WorkDir: /srv/www/htdocs/mrtg" \
      --global "Options[_]: growright, bits, unknaszero" \
      --ifdesc=alias,name --ifref=name --noreversedns --no-down --show-op-down \
      --subdirs=vmserv1 --output=vmserv1.conf \
      --snmp-options=:::::2 10.0.0.101
...and edit the XML definition of your virtual machine:
    [...] [...]
Now (re-)start snmpd and your virtual machine.

20. Monitoring of MySQL servers
We are currently using two different checks:
- check_mysql (monitoring-plugins-mysql package)
- check_mysql_health (monitoring-plugins-mysql_health package)
You need a database user with "SELECT" access for both options. Usually, this means that you create a user named "nagios" in MySQL:
    mysql> GRANT SELECT ON nagios.* TO 'nagios'@'localhost' IDENTIFIED BY 'nag1os';
    mysql> FLUSH PRIVILEGES;
    mysql> quit
Afterward you should be able to check the database via:
    /usr/lib/nagios/plugins/check_mysql -H $HOST -u $USER -p $PASS
or:
    /usr/lib/nagios/plugins/check_mysql_health --units MB --mode threads-connected --username $USER --password $PASS --warning 40 --critical 50

21. Monitoring of PostgreSQL
- Check that the file pg_hba.conf on the database server contains the correct IP addresses of the monitoring cluster
- Create the monitor user via the createuser command as user postgres:
    postgres@pg1:~> createuser --pwprompt --interactive monitor
    Enter password for new role:
    Enter it again:
    Shall the new role be a superuser? (y/n) y
    Shall the new role be allowed to create databases? (y/n) n
    Shall the new role be allowed to create more new roles? (y/n) n
  Note: the SUPERUSER privilege is needed for some special checks like "archive_ready".
- Restart the database
- Try on the monitoring cluster:
    ~> ./check_postgres.pl --dbpass=$PASSWORD --dbuser=$USERNAME --action=archive_ready -H pg1
    POSTGRES_ARCHIVE_READY OK: DB "postgres" (host:pg1) WAL ".ready" files found: 0 | time=0.02s files=0;10;15
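
The part after the '|' in the check_postgres.pl output is the performance data. As a small sketch of how that part is structured, the hypothetical helper below splits it into label, value, warning and critical threshold. It assumes simple labels without spaces or quotes and is not part of any plugin package.

```shell
#!/bin/bash
# Sketch: split the performance data of a plugin output line into fields.
# print_perfdata PLUGIN_OUTPUT   (hypothetical helper)
print_perfdata() {
    output=${1:-}

    # Everything after the first '|' is performance data.
    case "$output" in
        *'|'*) perf=${output#*'|'} ;;
        *)     return 0 ;;   # no performance data, nothing to print
    esac

    # Items are separated by whitespace: label=value;warn;crit
    for item in $perf; do
        label=${item%%=*}
        rest=${item#*=}

        # Split "value;warn;crit" on ';' into $1, $2, $3.
        old_ifs=$IFS
        IFS=';'
        set -- $rest
        IFS=$old_ifs

        echo "$label value=${1:-} warn=${2:-} crit=${3:-}"
    done
}
```

Called with the output line shown above, this prints one line per metric, which makes it easy to see which thresholds a check actually reports.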

22. ...and there is more...
- More and more monitoring-plugins* packages come with enabled AppArmor profiles: check /var/log/audit/audit.log if something seems to be crazy
- Re-enable notifications automatically via cron so you do not forget it:
    #!/bin/bash
    CFG=/etc/icinga/icinga.cfg
    # Read the path of the external command pipe from the Icinga configuration
    commandfile=$(grep ^command_file "$CFG" | awk -F'=' '{ print $2 }')
    if [ -p "$commandfile" ]; then
        now=$(date +%s)
        printf "[%lu] ENABLE_NOTIFICATIONS\n" "$now" > "$commandfile"
    fi
- Monitor your NSCA daemon via monitoring-plugins-nsca and a dummy test (see README)
- Create performance data for your monitoring:
    #!/bin/bash
    if /etc/init.d/icinga status >/dev/null 2>/dev/null ; then
        if [ -p /var/run/icinga/icinga.cmd ]; then
            su icinga -c "/usr/lib/nagios/plugins/check_nagiostats --EXEC /usr/sbin/icingastats --passive $HOSTicingastats >> /var/run/icinga/icinga.cmd"
        fi
    fi
- Monitor your monitoring setup!

23. Highly available, load-balanced monitoring

24. Basic overview
- Corosync/Pacemaker cluster (two main machines + one VM just for quorum) using IPMI for STONITH
- DRBD to provide storage (PNP, logs) on both main machines
- Services like MySQL (cluster), snmptrapd or NSCA run unmanaged on all nodes
- mod_gearman for load balancing/failover of normal checks
- check_mk for automatic checks and load reduction
- MRTG for statistics from network and SAN (for historical reasons)

25. Load-Balanced / HA Monitoring in project pictures
[Diagram; recoverable labels: Livestatus, snmptt]

26. Demo time

27. Questions?

28. Thank you. Join the conversation, contribute & have a lot of fun! www.opensuse.org

29. Have a Lot of Fun, and Join Us At: www.opensuse.org

30. General Disclaimer
This document is not to be construed as a promise by any participating organisation to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. openSUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for openSUSE products remains at the sole discretion of openSUSE. Further, openSUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All openSUSE marks referenced in this presentation are trademarks or registered trademarks of SUSE LLC, in the United States and other countries. All third-party trademarks are the property of their respective owners.

License
This slide deck is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license. It can be shared and adapted for any purpose (even commercially) as long as Attribution is given and any derivative work is distributed under the same license. Details can be found at https://creativecommons.org/licenses/by-sa/4.0/

Credits
Template: Richard Brown
Design & Inspiration: openSUSE Design Team, http://opensuse.github.io/branding-guidelines/
