R. Lange, M. Giacchini: Monitoring a Control System Using Nagios
Monitoring a Control System Using Nagios
Ralph Lange, BESSY – Mauro Giacchini, LNL
What is the Situation?
Machine Status vs. Controls Infrastructure Status
• Machine status:
– usually handled in the Control Room by an operator
– uses the Alarm Handler or other EPICS tools
– based on Channel Access connections
• Control System infrastructure can be comparably complex, its status:
– needs to be handled outside the Control Room
– with tools that allow remote access
– using different types of connections/checks: ping, SNMP, HTTP, Channel Access, disk usage, ...
• BESSY was starting to have an increasing number of failures due to ageing hardware
• One summer day Mauro (preparing an EPICS training in the hot Italian summer) asked me if I knew Nagios ...
What is Nagios?
Nagios (“nah-ghee-ose”)
• Open source monitoring framework
– widely used & actively developed: www.nagios.org
• Host and service problem detection and recovery
• Provides a wide set of basic plugins (checks)
– easy to develop custom plugins
• Active vs. passive checks
• Centralized vs. distributed deployment
– also allows redundant Nagios daemons
• High configurability
– service dependencies, fine-grained notification options
• Web interface
– status view, administration (e.g. analysis, downtime scheduling)
The Plugin (Check) Interface
Plugins (Checks)
• Checks are command line programs that follow a convention for arguments, stdout output, and return code: nagiosplugins.org
– Output: one line of status info
– Return code: OK / WARNING / CRITICAL / UNKNOWN
• Can be written in any (i.e. your favourite) compiled or interpreted language
• Are configured into Nagios for local or remote execution
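The plugin convention above can be sketched in a few lines of Python. This is only an illustration, not one of the checks running at BESSY: the function names, the "DISK ..." wording of the status line and the 80 % / 90 % thresholds are assumptions made for the example. The numeric return codes 0 to 3 are the ones the plugin convention assigns to OK, WARNING, CRITICAL and UNKNOWN.

```python
import shutil
import sys

# Return codes of the plugin convention: OK / WARNING / CRITICAL / UNKNOWN.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(used_percent, warn, crit):
    """Map a usage percentage to a state; thresholds are inclusive."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path, warn=80.0, crit=90.0):
    """Return (state, one-line status text) for the filesystem holding path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    used_percent = 100.0 * usage.used / usage.total
    state = classify(used_percent, warn, crit)
    return state, f"DISK {STATE_NAMES[state]} - {path} is {used_percent:.1f}% full"


def main(argv):
    """Print one status line and return the state to use as exit code."""
    path = argv[1] if len(argv) > 1 else "/"
    state, text = check_disk(path)
    print(text)
    return state

# When installed as a plugin script, finish with: sys.exit(main(sys.argv))
```

Nagios only looks at the first line of output and the exit code, so the same skeleton works for any quantity that can be compared against two thresholds.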
Passive Checks
• An external application can write check results (following a certain format) into a file (or a pipe)
• Nagios reads from this and accepts the results (if configured)
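As a rough sketch of the passive route: Nagios accepts a passive service result as a PROCESS_SERVICE_CHECK_RESULT line written to its external command file. The helper names, the host name "ioc-example" and the service name "CA heartbeat" below are made up for the example, and the location of the command file depends on the installation, so it is passed in as a parameter.

```python
import time


def format_passive_result(host, service, return_code, output, timestamp=None):
    """Build one external-command line carrying a passive service check result."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    if timestamp is None:
        timestamp = int(time.time())
    # Keep the status text on a single line; a newline would end the command.
    text = " ".join(output.splitlines())
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{text}\n"


def submit(command_file, line):
    """Write the line to the command file (a named pipe on a typical setup)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


# Hypothetical host and service names, fixed timestamp for reproducibility.
example = format_passive_result("ioc-example", "CA heartbeat", 2,
                                "no update for 30 s", timestamp=1234567890)
```

The host and service must already be defined in Nagios with passive checks enabled, otherwise the submitted result is ignored.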
Nagios + CA Plugin = NAL
Nagios Channel Access Plugins
• caget type plugin (active check) by Mauro Giacchini (LNL)
• camonitor type daemon (passive check) by Debby Quock (APS)
• Integrate data available through CA into the Nagios monitoring framework
• Can check the health of EPICS integrated VME crates, VME IOCs, soft IOCs, PLCs, CA gateways, CA archivers, ... as well as OPI machine and server health, disk status, network device status, NTP, DNS, web services etc.
• Allows NAL (Nagios Alarm Handler) to be the central monitoring system for all control system infrastructure, whereas the ALH in the control room provides similar functionality for the controlled facility
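To give an idea of what a caget type active check might look like, here is a minimal sketch. It is not Mauro Giacchini's plugin: it simply shells out to the EPICS `caget` command line tool (`-t` for terse output, `-w` for the wait time), parses a numeric value and compares it against two thresholds. The function names, the "CA ..." wording and the choice to report an unreadable PV as CRITICAL are assumptions made for this example, as is the expectation that `caget` exits non-zero when the channel does not connect.

```python
import subprocess

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Higher is worse: compare a numeric PV value against two thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_pv(pv_name, warn, crit, wait=2.0, caget="caget"):
    """Read one PV with the caget tool and return (state, one-line text)."""
    command = [caget, "-t", "-w", str(wait), pv_name]
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=wait + 5.0)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return UNKNOWN, f"CA UNKNOWN - could not run {caget}: {exc}"
    if result.returncode != 0:
        # Assumption: a failed connection shows up as a non-zero exit code.
        return CRITICAL, f"CA CRITICAL - {pv_name} could not be read"
    try:
        value = float(result.stdout.split()[0])
    except (IndexError, ValueError):
        return UNKNOWN, f"CA UNKNOWN - {pv_name} returned no numeric value"
    state = classify(value, warn, crit)
    return state, f"CA {STATE_NAMES[state]} - {pv_name} = {value:g}"
```

A call such as `check_pv("IOC1:LOAD", warn=70, crit=90)` (with a hypothetical PV name) would then feed the usual print-and-exit step of a plugin.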
Current Configuration at BESSY
Servers
• All machines: ping, disk usage, load, processes, users, SSH
• Some: DNS (foreign and internal addresses), NTP
vxWorks IOCs
• Ping, CPU load, memory usage, FD usage
Services
• Wikis, web server, help pages, issue trackers (Trac/Redmine), elog
• Oracle servers: Ping, ODB Telnet, ODB TNS for important DBs
=> 296 checks on 111 hosts
Screen Shots: Tactical Overview
Screen Shots: Service Detail
Screen Shots: Service Detail
Screen Shots: Availability Report
Screen Shots: Service Trends
Firefox/Thunderbird Plugin
• Highly configurable, many filtering options
• New alarm starts blinking and may play sound
• Mouse-over opens a pop-up showing the current alarms
• Clicking an alarm opens the related Nagios page in a tab
Experiences
Nagios is a very stable and reliable framework; configuration is flexible, and options and plugins are plentiful
The off-control-room, web-based, email-notification approach fits our controls group better than ALH
Manual configuration can be tedious, some parts could (should!) be generated from our RDB
Found some network problems, one running system clock, two disks filling up, IOC load and memory saturation on a number of mv162s (which were replaced by mv2100s)
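Generating the repetitive parts of the configuration could look roughly like the sketch below. Nothing here reflects the actual BESSY database: the `rows` list stands in for the result of an RDB query, the host names and addresses are placeholders, and the `generic-host` / `generic-service` templates and the `check_ping` command line are assumed to exist as in the Nagios sample configuration.

```python
def host_block(name, address, template="generic-host"):
    """Render one Nagios host object definition."""
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def service_block(host, description, command, template="generic-service"):
    """Render one Nagios service object definition for a host."""
    return (
        "define service {\n"
        f"    use                  {template}\n"
        f"    host_name            {host}\n"
        f"    service_description  {description}\n"
        f"    check_command        {command}\n"
        "}\n"
    )


def render(rows, checks):
    """Emit a host block plus one service block per check for every row."""
    blocks = []
    for name, address in rows:
        blocks.append(host_block(name, address))
        for description, command in checks:
            blocks.append(service_block(name, description, command))
    return "\n".join(blocks)


# Placeholder data standing in for an RDB query result (hypothetical names).
rows = [("ioc-example-1", "192.0.2.11"), ("ioc-example-2", "192.0.2.12")]
checks = [("PING", "check_ping!100.0,20%!500.0,60%")]
config_text = render(rows, checks)
```

Writing `config_text` to a file that the main Nagios configuration includes would keep the generated and the hand-maintained parts separate.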
Next Steps
To be configured:
• Soft IOCs, CA Gateways, VME crates (Wiener), Embedded Controllers
• NFS shares usage, switches/routers, printers
Checks to be written:
• Conserver (IOC console access)
• CA Archiver (through ArchiveManager web interface)
• CA access rights (based on cainfo)
Collaborate:
• Integrate CA check plugin development
• Agree on a common place for our plugins (APS? Sourceforge? Nagios?)
LivEPICS Example
Live Example:
Mauro Giacchini's LivEPICS distribution includes Nagios 3.0 (configured to look at the EPICS Base example app channels)
Go check it out – now!