Using OMD/Nagios to Monitor Complex Hardware/Software Systems
Joe VanAndel, NCAR/EOL
2012/3/29
Why is Monitoring Important?
• Software systems can be very complex:
• networked data sources
• multiple computers
• long running daemons
• Hardware (including computers) can fail
Why is Monitoring Important? (2)
• Someone is relying on your system to produce or process data.
• Computers are better than people at monitoring: manual procedures are error-prone and cannot provide 24x7 coverage.
• Your staff may need to be notified out-of-hours if failures occur.
Why is Monitoring Important to S-Pol?
• S-Pol is a complex system of hardware and software - need to detect problems so they can be quickly corrected.
• Notifications allow unattended operation, so staff don’t have to stay on site 24x7.
• We cannot afford to staff 3 shifts during field projects.
What is OMD?
• Open Monitoring Distribution (http://omdistro.org)
• runs on Linux
• Bundles Nagios with 16 useful utilities, including
• check_mk - creates Nagios configurations for you!
• rrdtool/rrdcached - store and retrieve time series data, supports graphing of performance data.
Why use OMD?
• complete package of monitoring tools
• avoid the effort of compiling and integrating Nagios add-ons
• Web-based monitoring - from anywhere!
Why use check_mk?
• Automatically generates Nagios rules for each machine you monitor.
• Lower overhead allows running more checks on more hosts.
• easy to create both hardware and software checks.
• The S-Pol radar had 700 checks running on 14 hosts - we didn’t want to generate the Nagios configuration manually.
check_mk architecture
RRD is “Round Robin Database”, which efficiently stores the output from check_mk.
(Figure from http://mathias-kettner.de)
check_mk_agent
Getting Started with OMD
• install the RPM
• $ omd create mysite # the monitoring instance
• create scripts in /usr/lib/check_mk_agent/local
• $ check_mk -I # run inventory
• $ omd start mysite # start daemons.
• open the check_mk URL in a browser.
Writing a check is simple
• write a C program, shell script, or Python script
• query hardware or software status
• output string(s) to stdout: "0 PgenTritonRaidStatus - OK"
• run a check_mk inventory to
• find your script
• generate the Nagios configuration
#!/bin/bash
# Local check: count the files in each directory, one service per directory.
DIRS="/var/log /tmp"
for dir in $DIRS
do
    count=$(ls "$dir" | wc --lines)
    if [ "$count" -lt 50 ] ; then
        status=0
        statustxt=OK
    elif [ "$count" -lt 100 ] ; then
        status=1
        statustxt=WARNING
    else
        status=2
        statustxt=CRITICAL
    fi
    # Output: <status> <service name> <performance data> <text>
    echo "$status Filecount_$dir count=$count;50;100;0; $statustxt - $count files in $dir"
done
/usr/lib/check_mk_agent/local/filecount
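Each line printed by a local check, as in the filecount script above, carries a status code, a service name, performance data (or "-" when there is none), and free text. A minimal sketch of two reusable helpers follows. The function names threshold_state and emit_local_check are hypothetical and are not part of check_mk. The thresholds follow the filecount convention, where a value equal to the warning threshold is already WARNING.

```shell
#!/bin/bash
# Hypothetical helpers for check_mk local checks (names are illustrative only).

# Map an integer value to a Nagios state: 0=OK, 1=WARNING, 2=CRITICAL.
# Usage: threshold_state VALUE WARN CRIT
threshold_state() {
    local value warn crit
    value=$1; warn=$2; crit=$3
    if [ "$value" -ge "$crit" ]; then
        echo 2
    elif [ "$value" -ge "$warn" ]; then
        echo 1
    else
        echo 0
    fi
}

# Print one local-check line: "<status> <service> <perfdata> <label> - <text>".
# An empty PERFDATA argument is replaced by "-".
# Usage: emit_local_check STATUS SERVICE PERFDATA TEXT
emit_local_check() {
    local status service perfdata text label
    status=$1; service=$2; perfdata=${3:--}; text=$4
    case "$status" in
        0) label=OK ;;
        1) label=WARNING ;;
        2) label=CRITICAL ;;
        *) label=UNKNOWN ;;
    esac
    echo "$status $service $perfdata $label - $text"
}
```

For example, emit_local_check "$(threshold_state 72 50 100)" Filecount_tmp "count=72;50;100" "72 files in /tmp" prints a WARNING line in the same shape as the filecount output.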
S-Pol monitoring
• Radar hardware for S-Band & Ka-band:
• antenna
• transmitter
• receiver
• Klystron temperature
• Container temperatures
Hardware Monitoring Architecture
Sixnet Controller
Hardware monitoring
• Sixnet controller communicates to measurement modules using RS-485
• monitors transmitter status
• monitors antenna status
• monitors transmitter temperature
• Sixnet controller runs Linux, so adding a check_mk_agent was easy!
What else?
• Computer status:
• CPU load
• disk space
• memory usage
• radar software - tasks running, products being produced
• fetching data: satellite images, soundings, forecast model output
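A check that "products are being produced" can be approximated by looking at the age of the newest file in an output directory. The sketch below is an illustration, not the S-Pol script. The function names, the service name, and the directory in the usage comment are all hypothetical. It assumes GNU stat (stat -c %Y), which is available on typical Linux systems.

```shell
#!/bin/bash
# Hypothetical local check: alert when no new product file has appeared recently.

# Print the newest modification time (epoch seconds) among regular files
# directly inside DIR; print 0 if there are none.
newest_mtime() {
    local dir newest f m
    dir=$1; newest=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        m=$(stat -c %Y "$f")        # GNU stat: mtime in epoch seconds
        if [ "$m" -gt "$newest" ]; then
            newest=$m
        fi
    done
    echo "$newest"
}

# Usage: check_product_age SERVICE DIR WARN_SECONDS CRIT_SECONDS
check_product_age() {
    local service dir warn crit now newest age status text
    service=$1; dir=$2; warn=$3; crit=$4
    now=$(date +%s)
    newest=$(newest_mtime "$dir")
    if [ "$newest" -eq 0 ]; then
        echo "2 $service - CRITICAL - no files found in $dir"
        return
    fi
    age=$((now - newest))
    if [ "$age" -ge "$crit" ]; then
        status=2; text=CRITICAL
    elif [ "$age" -ge "$warn" ]; then
        status=1; text=WARNING
    else
        status=0; text=OK
    fi
    echo "$status $service age=$age;$warn;$crit $text - newest file in $dir is $age seconds old"
}

# Example call (hypothetical path and thresholds):
#   check_product_age RadarProducts /data/products 600 1800
```

Reporting the age as performance data lets the graphing tools bundled with OMD show how regularly products arrive.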
Implementation
• installed OMD on a rack-mount Linux server
• installed check_mk_agent on all monitored computers
• wrote scripts, installed in /usr/lib/check_mk_agent/local
Implementation(2)
• Configured digital IO modules (controlled by an embedded Sixnet computer) to monitor S-Pol hardware
• Wrote a program on the Sixnet that reported hardware status to check_mk_agent
• Sent Ka-band status over the network and wrote software to create status files readable by check_mk scripts
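The status-file approach can be sketched as a local check that reads a file written by another program. The S-Pol file format is not described here, so the sketch assumes one line per service in the form "<service> <state> <text>", with "#" for comments. The function name and the service name StatusFile are hypothetical. It also treats a file that has not been updated recently as a failure, since stale data would otherwise look healthy. GNU stat is assumed.

```shell
#!/bin/bash
# Hypothetical local check that republishes a status file as check_mk services.
# Assumed file format (not from the original system):
#   <service> <state 0-3> <free text>

# Usage: report_status_file FILE MAX_AGE_SECONDS
report_status_file() {
    local file max_age now mtime age service state text
    file=$1; max_age=$2
    if [ ! -r "$file" ]; then
        echo "3 StatusFile - UNKNOWN - cannot read $file"
        return
    fi
    now=$(date +%s)
    mtime=$(stat -c %Y "$file")     # GNU stat: mtime in epoch seconds
    age=$((now - mtime))
    if [ "$age" -gt "$max_age" ]; then
        echo "2 StatusFile - CRITICAL - $file not updated for $age seconds"
        return
    fi
    echo "0 StatusFile - OK - $file updated $age seconds ago"
    while read -r service state text; do
        # Skip blank lines and comment lines.
        case "$service" in
            ''|'#'*) continue ;;
        esac
        echo "$state $service - $text"
    done < "$file"
}
```

The writer and the check only share a file, so the program that receives status over the network does not need to know anything about check_mk.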
Types of S-Pol checks
• scripts/programs directly monitor hardware or software
• hybrid scripts - process the output of an existing program, output check_mk status reports.
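A hybrid script can be as small as a filter between an existing program and the agent. The sketch below assumes the existing program prints "name: value" lines with temperatures in degrees C. That input format, the function name temps_to_checks, and the Temp_ service prefix are assumptions made for illustration. awk does the comparison because shell arithmetic is integer-only.

```shell
#!/bin/bash
# Hypothetical hybrid check: convert "name: value" lines on stdin into
# check_mk local-check lines, one service per input line.
# Usage: some_existing_program | temps_to_checks WARN CRIT
temps_to_checks() {
    awk -F': *' -v warn="$1" -v crit="$2" '
        NF == 2 {
            name = $1
            value = $2 + 0
            # Service names must not contain spaces or other odd characters.
            gsub(/[^A-Za-z0-9_]/, "_", name)
            if (value >= crit)      { status = 2; text = "CRITICAL" }
            else if (value >= warn) { status = 1; text = "WARNING" }
            else                    { status = 0; text = "OK" }
            printf "%d Temp_%s temp=%s;%s;%s %s - %s degrees C\n",
                   status, name, value, warn, crit, text, value
        }'
}
```

Lines that do not match the assumed "name: value" shape are ignored, so banners or blank lines in the program's output do not create services.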
Implementation (3)
• configured GSM cell phone to send SMS messages
• software from gnokii.org
• bought local SIM
• wrote script to limit frequency of SMS messages
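One simple way to limit SMS frequency is to remember when the last message was sent and suppress anything that arrives too soon afterwards. The sketch below is not the S-Pol script. The function names, the stamp-file path, and the 300-second default are hypothetical. The sending command is a placeholder that assumes gnokii reads the message text from standard input when called as "gnokii --sendsms NUMBER", so check that against your gnokii version before use.

```shell
#!/bin/bash
# Hypothetical SMS rate limiter: at most one message per MIN_INTERVAL seconds.

# Succeed (return 0) and record the send time if at least MIN_INTERVAL seconds
# have passed since the last permitted message; otherwise return 1.
# Usage: sms_allowed STAMP_FILE MIN_INTERVAL [NOW_EPOCH_SECONDS]
sms_allowed() {
    local stamp min_interval now last
    stamp=$1; min_interval=$2
    now=${3:-$(date +%s)}
    last=0
    if [ -r "$stamp" ]; then
        last=$(cat "$stamp")
    fi
    # Treat an empty or corrupt stamp file as "never sent".
    case "$last" in
        ''|*[!0-9]*) last=0 ;;
    esac
    if [ $((now - last)) -lt "$min_interval" ]; then
        return 1
    fi
    echo "$now" > "$stamp"
    return 0
}

# Send MESSAGE to NUMBER unless the rate limit suppresses it.
# SMS_SEND_CMD defaults to an assumed gnokii invocation; override as needed.
send_sms_limited() {
    local number message
    number=$1; message=$2
    if sms_allowed "${SMS_STAMP_FILE:-/tmp/sms_last_sent}" "${SMS_MIN_INTERVAL:-300}"; then
        echo "$message" | ${SMS_SEND_CMD:-gnokii --sendsms} "$number"
    fi
}
```

A single global limit is the bluntest option. A stamp file per host or per service would keep one noisy check from suppressing an unrelated alert.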
Sample Web Screens
Challenges
• learning how to create advanced checks with graphs
• Avoiding false alarms (particularly after hours!)
• limiting frequency of notifications - getting 20 text messages on your cell phone in 5 minutes is not helpful!
How well did OMD/Nagios work?
• The second shift only had to be on-site from 3:00PM to 8:00PM, rather than until 11:00PM
• Daytime: OMD/Nagios warned staff of problems on multiple occasions.
• Off-hours: OMD/Nagios notified S-Pol staff of critical hardware/software failures on multiple occasions
24x7 Operations: w/o working 24x7
• Added SMS (text message) notifications to Nagios
• Technicians and Engineers carried cell phones
• Nagios sent SMS when hardware or software problems occurred.
• Technicians and Engineers would access Nagios web pages via 3G modems on laptops
FUTURE
• Monitoring of diesel generators
• Add remote control:
• generator & transfer switch
• reset of transmitter faults
• reset of antenna faults
Conclusion
• Monitoring is important for any system, critical for complex or unattended operation
• OMD/Nagios makes it easy to deploy monitoring
• OMD/Nagios helped EOL maintain high data quality from S-Pol without requiring staff 24x7 on site.
• Notifications via SMS and remote access to OMD’s web pages are very helpful.
Acknowledgments
• Ethan Galstad - Nagios chief developer
• Mathias Kettner - check_mk
• Fatima Dembele (summer intern) - prototyping
• Paloma Gutierrez - hardware monitoring
• Chris Burghart - Ka-band monitoring
• Mike Dixon - Ka-band & HAWK monitoring
Questions?