Staying Sane with Nagios


From an invited talk I gave at PICC-10 (now known as LOPSA-East) on how to manage a Nagios installation without pulling your hair out. In the years since, I've automated more, but I still have the same mindset about inheritance and so on.

Text of Staying Sane with Nagios

  • 1. Staying Sane with Nagios
       Matt Simmons, @standaloneSA

  • 2. Introduction & Outline
       Confessions:
         • I am not actually a Nagios expert
         • I do actually LIKE Nagios
       Outline:
         • Global Sanity
         • Small & Medium Shops
         • Large Scale Shops
         • Add-Ons
         • Warnings
         • Additional Resources

  • 3. I know what you're thinking... Nagios? Sane??? Unlikely!!! Serenity Now!!!

  • 4. Nagios? SANE?!? Serenity Now!!!

  • 5. Global Sanity
       Universal advice: affects installations of all sizes
         • Documentation
         • Centralized Authentication
         • Plugin Development

  • 6. Global Sanity: Documentation
         • Read the documentation
         • Object Definitions
         • Use 3_0 when searching
         • Bookmark the good will be soon coming out with 3.x docs

  • 7. Global Sanity: Central Auth
       Centralized Authentication
         • LDAP / AD with Apache (I use Likewise Open)
         • Domain users -> Nagios contacts: msimmons@EXAMPLE.COM
         • Access to CGI interface

  • 8. Global Sanity: Do Not Reinvent the Wheel...
       Nagios Exchange
         • Nearly 2000 listings
         • >1600 plugins
       Cons:
         • Varying quality and reliability
         • Old, unmaintained, code rot, etc.

  • 9. Global Sanity: ...unless you have to
       Writing your own Nagios plugins
         • Great guide
         • Output
         • Huge community
         • Any language you want

  • 10. Small & Medium Shops
       Not exclusively small or medium, just a non-automatic way of doing things
       For people who:
         • Manually edit / create entries in config files
         • Don't use extensive 3rd-party management software
         • Have a small team of responsible admins
         • Don't require large distributed monitoring networks

  • 11. Configuration Sanity
       When:
         • Creating new configs
         • Working with existing configs
         • Testing
         • Responding to events

  • 12. Syntax Highlighting
       This?

  • 13. Syntax Highlighting
       Or this?

  • 14. Config File Hierarchy
         • Default config is stupid.
         • cfg_dir directive is key: reads *.cfg recursively
         • Hierarchy should resemble real life
         • Allows for additional group security
         • Use what makes sense to you and document it

  • 15. Config File Hierarchy: Example
       Output of tree -d on my Nagios objects directory:

           |-- commands
           |-- computers
           |   |-- groups
           |   |-- linux
           |   |   `-- services
           |   `-- windows
           |-- misc
           `-- network
               |-- firewalls
               |-- links
               |-- routers
               `-- switches

  • 16. Regular Expressions
       Not all regexes are created equal
         • use_regexp_matching
             • Only when object names contain * or ?
         • use_true_regexp_matching
             • 'man regex'
             • All object names
         • Caution: unintended consequences

  • 17. Better Object Formatting
       This?

  • 18. Better Object Formatting
       Or this?

  • 19. Revision Control
         • CVS/SVN/git(?)
         • Simple, maintainable, recoverable
         • Self-documenting (if done correctly)

  • 20. (ab)Use Inheritance
         • Templates: register = 0
         • Multiple inheritance
         • Beware the spaghetti code

  • 21. Use Hostgroups

           define service{
               service_description   SSHServiceCheck
               check_command         check_ssh
               host_name             linux01,linux02,linux03,...linux50
           }

  • 22. Use Hostgroups

           define hostgroup{
               hostgroup_name        linuxservers
           }

           define host{
               use                   generichost
               host_name             linux01
               address               192.168.0.10
               hostgroups            linuxservers
           }

           define service{
               service_description   SSHservicecheck
               check_command         check_ssh
               hostgroup_name        linuxservers
           }

  • 23. Script / Automate
       Automate as much as possible
         • New services
         • New hosts as a template

  • 24. Use alternate contacts file when testing new features
         • Coworkers are under enough stress as it is
         • No messy explanations
         • Use symlinks to point to real contacts file

  • 25. Plugin Sanity
       Thoughts about writing, configuring, and using Nagios plugins

  • 26. SNMP
       Use it whenever possible. Really.

  • 27. NRPE vs check_by_ssh
         • Nagios Remote Plugin Executor
         • Skip it when possible
         • Use SNMP
         • NRPE

  • 28. When checking disk usage
         • Do not specify the partitions to check
         • Instead, specify the partitions to NOT check
         • Too easy to forget to add new partitions.
         • If possible, use a plugin that produces statistics for graphing usage trends

  • 29. Notification Sanity
       Notifications suck. Here are some ways to make them not suck as much.

  • 30. Alternate Communication Method
         • When the network is down, email is down too
         • Have a non-email contact method
             • SMS, cell modem, smoke signals
         • Test it occasionally

  • 31. Use parents
         • Establish a path FROM THE NAGIOS SERVER
         • Failure will trigger unreachable states
             • u notification flag
         • Only useful for non-local-subnet hosts, typically
             • If the local switch dies, alerts don't go out anyway
             • Typically

  • 32. Use Dependencies
         • Available for both hosts and services
             • The disks didn't blow up, SNMP crashed
             • What do you mean, the website is unavailable when the database crashes?
         • Dependencies != parents
             • Parents establish a line between the host and Nagios
             • Dependencies establish logical object relationships

  • 33. Notifications are Commands
         • Use them
         • Execute what you need, when you need, where you need through extra-Nagios scripts
         • Your imagination is the limit
             • Electrical relays?
             • Flashing lights?
             • HALON release? Please don't.

  • 34. Use Passive Checks (when necessary / appropriate)
         • For normal passive checks, specify freshness checks
         • Useful for SNMP traps
             • Combine with snmptrapd
         • Distributed monitoring
             • Use for capacity reasons
             • Physical separation calls for separate Nagios installs (in my opinion)

  • 35. Macros
         • GOOD
         • 60 bajillion available
         • On-demand macros
             • Specify remote macros from other hosts
             • $HOSTMACRO:SOMEHOST$
         • Custom variable macros
             • _MACADDRESS 00:01:02:03:04:05
             • $_HOSTMACADDRESS$
         • Available as environment variables in scripts
             • $NAGIOS_MACRONAME

  • 36. Use Flap Detection
         • Or not. Who wants a charged cellphone battery?
         • Measures state changes:
             • Weighted measure of the last 21 checks
             • More recent counts higher

  • 37. Large Shops
       Too many nodes to easily configure by hand, or too many nodes to deal with using one server
         • Scaling Nagios
         • Centralized Management
         • Web Configurators

  • 38. Scaling Nagios
         • large_installation_tweaks
             • No summary macros, memory handling is different, and processes fork() less
         • Distributed monitoring
             • Assign groups of hosts to one Nagios server (reporting via NSCA / passive checks)
         • Check tuning docs:

  • 39. Centralized Management
         • Puppet / chef / cfengine / whatever
             • Distribute nagios user's key if necessary
             • Install Nagios agents (NSCA / NRPE)
         • Automate configuration build
             • Puppet's built-in Nagios types sound convenient... sort of

  • 40. Nagios Web Configuration
         • Dozens, if not hundreds
         • I don't know of a great one.
         • May be worth building or finding one that matches your inventory system
             • Don't double up on data if you don't have to

  • 41. Malproductive Practices
         • Overreliance on event handlers
             • Please don't do anything terribly important. Edge cases are scary.
         • Overabuse of inheritance
             • Spaghetti code
             • Hard to trace
         • Overcomplication
             • Simple is nearly always better

  • 42. Learn More
         • Mailing list: Nagios Users
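
The plugin advice in slides 9 and 28 (write your own when you have to, and list the partitions to skip rather than the partitions to check) can be sketched in a few lines of Python. This is an illustration, not code from the talk. The names `parse_mounts`, `evaluate`, and `run`, the 80/90 percent thresholds, the `SKIP_FSTYPES` set, and the reliance on a Linux-style `/proc/mounts` are all assumptions made for the example. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the `label=value;warn;crit;min;max` performance data after the `|` follow the usual Nagios plugin conventions, which is what makes the output graphable.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Assumption: pseudo-filesystems that are never worth checking.
SKIP_FSTYPES = frozenset(
    {"proc", "sysfs", "tmpfs", "devtmpfs", "devpts", "cgroup", "cgroup2"}
)


def parse_mounts(text, excluded_mounts=(), skip_fstypes=SKIP_FSTYPES):
    """Return every mount point in /proc/mounts-style text except the excluded ones."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # not a "device mountpoint fstype ..." line
        mount_point, fstype = fields[1], fields[2]
        if fstype in skip_fstypes or mount_point in excluded_mounts:
            continue
        if mount_point not in mounts:
            mounts.append(mount_point)
    return mounts


def evaluate(usage, warn=80.0, crit=90.0):
    """Map {mount_point: percent_used} to (exit_code, one-line plugin output)."""
    if not usage:
        return UNKNOWN, "DISK UNKNOWN - no partitions left to check"
    worst = OK
    for pct in usage.values():
        if pct >= crit:
            worst = CRITICAL
        elif pct >= warn:
            worst = max(worst, WARNING)
    items = sorted(usage.items())
    summary = ", ".join(f"{m} {p:.0f}%" for m, p in items)
    # Performance data: 'label'=value%;warn;crit;min;max
    perfdata = " ".join(f"'{m}'={p:.1f}%;{warn:g};{crit:g};0;100" for m, p in items)
    return worst, f"DISK {LABELS[worst]} - {summary} | {perfdata}"


def run(excluded_mounts=(), mounts_file="/proc/mounts"):
    """Check every mounted partition except excluded_mounts; print output, return exit code."""
    try:
        with open(mounts_file) as fh:
            mounts = parse_mounts(fh.read(), set(excluded_mounts))
        usage = {}
        for mount_point in mounts:
            total, used, _free = shutil.disk_usage(mount_point)
            if total > 0:
                usage[mount_point] = 100.0 * used / total
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, line = evaluate(usage)
    print(line)
    return code
```

To use this as a real plugin, the script would also `import sys` and end with `sys.exit(run(sys.argv[1:]))`, so the command definition passes only the mount points to ignore. A partition added later is then picked up without touching the Nagios config.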