Supermicro Server
Monitoring with SuperDoctor 5 and
Nagios Using SNMP Protocol
Version 1.1b
Release: v 1.1b
Document release date: 11/15/2013
Copyright © 2013 Super Micro Computer, Inc.
All Rights Reserved.
Legal Notices
This software and documentation is the property of Super Micro Computer, Inc., and supplied only under
a license. Any use or reproduction of this software is not allowed, except as expressly permitted by the
terms of said license.
Information in this document is subject to change without notice.
Trademark Notice
All trademarks and copyrights referred to are the property of their respective owners.
Revision History
Date Rev Description
Jul-4-2011 1.0 1. Initial Document.
Sep-20-2012 1.1 1. Reorganize MIB structures.
Jul-12-2013 1.1a 1. Change product name to SuperDoctor 5 (SD5).
Nov-15-2013 1.1b 1. Changed default install folder of SD5.
Contents
1. Introduction
2. Prerequisites
2.1 Installing Java Runtime Environment (JRE)
2.2 Installing the check_snmp_health Plug-in
2.3 SuperDoctor 5 (SD5)
2.4 Setting Up SNMP Service in Linux
2.5 Installing Smartctl Utility
3. Getting Started
3.1 Defining the Hosts
3.2 Defining a Command
3.3 Defining the Services
3.4 Validating the Nagios Configurations
3.5 Restarting Nagios Service
3.6 Connecting to the Nagios Web UI
4. Using check_snmp_health
4.1 -h or --help
4.2 -bc
4.3 -cn
4.4 -co
4.5 -d
4.6 -i
4.7 -t
4.8 -to
5. Appendix
5.1 SD5 FAQ
5.2 How to Reset Memory Error Status?
5.3 Can I Disable the SD5 Web?
5.4 Can I Disable the NRPE Protocol?
5.5 No Health Information from SNMP Was Fetched
Contacting Supermicro
1. Introduction
This Nagios plug-in, named check_snmp_health, uses SNMP to talk to SuperDoctor 5
and check the health of the following hardware components:
Fan
Processor temperature
System temperature
DDR3 temperature
Power supply failure
Voltage
Chassis intrusion
Physical disk failure
Memory failure (Linux platform only) 1
Processor failure (Linux platform only) 2
RAID health (LSI MegaRAID 2108 and 2208 controllers only)
The results of executing the check_snmp_health plug-in are shown on the Nagios Web
UI.
1 The memory health check covers correctable (CECC) and uncorrectable (UECC) ECC errors. Both kinds must be supported by the BIOS, and this function is currently available only on Linux platforms.
2 The processor failure check must be supported by the BIOS and is currently available only on Linux platforms.
2. Prerequisites
2.1 Installing Java Runtime Environment (JRE)
The check_snmp_health plug-in is written in Java. To run the plug-in, install JRE 1.6 or
above on your Nagios server, and set the JAVA_HOME environment variable to the
JRE installation path.
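As a minimal sketch, the variable can be set in the shell profile of the account that runs Nagios. The path /usr/java/latest is only an assumed example; substitute the directory where your JRE is actually installed.

```shell
# Assumed JRE location; replace with your actual installation path.
JAVA_HOME="${JAVA_HOME:-/usr/java/latest}"
export JAVA_HOME

# Report whether a java executable is present under JAVA_HOME.
if [ -x "$JAVA_HOME/bin/java" ]; then
    "$JAVA_HOME/bin/java" -version
else
    echo "No java executable found under $JAVA_HOME" >&2
fi
```

If the second message appears, JAVA_HOME does not point to a JRE and should be corrected before running the plug-in.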
2.2 Installing the check_snmp_health Plug-in
1. Unzip the package file SSMServerPlugin-1.0-build.[xyz].zip to the
/usr/local/nagios/libexec/SSMServerPlugin folder, assuming Nagios is
installed in /usr/local/nagios.
2. Run the command "chmod +x check_snmp_health.sh" to make the
check_snmp_health plug-in executable.
3. Execute the check_snmp_health.sh program without any arguments. If
the JRE and the plug-in are installed correctly, the following error message appears:
"Invalid options. Three options must be provided for -i (--ip)."
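The three steps above can be sketched as one shell function. The function name install_plugin is hypothetical, and the sketch assumes that the archive places check_snmp_health.sh at its top level; adjust the paths if your package is laid out differently.

```shell
# install_plugin ZIP_FILE [NAGIOS_HOME]
# Unpacks the plug-in package and makes the launcher script executable.
install_plugin() {
    zip_file=$1
    nagios_home=${2:-/usr/local/nagios}
    plugin_dir="$nagios_home/libexec/SSMServerPlugin"

    if [ ! -f "$zip_file" ]; then
        echo "Package not found: $zip_file" >&2
        return 1
    fi
    mkdir -p "$plugin_dir" || return 1
    unzip -o "$zip_file" -d "$plugin_dir" || return 1
    # Assumption: check_snmp_health.sh sits at the top level of the archive.
    chmod +x "$plugin_dir/check_snmp_health.sh" || return 1
    # With no arguments, a working install prints the "Invalid options" message.
    "$plugin_dir/check_snmp_health.sh"
}
```

Call the function with the path of the downloaded package file, using the real build number in place of [xyz].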
2.3 SuperDoctor 5 (SD5)
The check_snmp_health plug-in is designed to work with the SuperDoctor 5, which
implements an SNMP extension to support Supermicro MIBs (see “5.3 Supermicro
MIB” in SuperDoctor 5 User's Guide for details).
For the installation of the SuperDoctor 5, please refer to “Chapter 2 Setting Up SD5” in
SuperDoctor 5 User's Guide. For the quick installation of multiple SD5s, see “2.1.4
Tips for Deploying a Large Number of SD5s” in SuperDoctor 5 User's Guide.
2.4 Setting Up SNMP Service in Linux
To support SNMP, the NET-SNMP service needs to be installed and configured on
your Linux system. For the installation and configuration of the NET-SNMP service, please
refer to “5.2 Setup SNMP Service in Linux” in SuperDoctor 5 User's Guide.
2.5 Installing Smartctl Utility
SD5 uses an open source program named smartctl to check the health of physical
disks. To enable this function, you need to manually install the smartctl program.
Download the program from:
http://sourceforge.net/apps/trac/smartmontools/wiki/Download
Many Linux distributions provide pre-compiled packages to simplify the installation of
smartctl. For example, on CentOS 5.x, you can install smartctl with the yum
command.
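On yum-based distributions, smartctl is provided by the smartmontools package. The following sketch installs it only when it is missing; the function name ensure_smartctl is hypothetical, and the command must be run as root.

```shell
# ensure_smartctl
# Installs smartmontools (the package that provides smartctl) if it is missing.
# Assumes a yum-based distribution such as CentOS; run as root.
ensure_smartctl() {
    if command -v smartctl >/dev/null 2>&1; then
        echo "smartctl already installed: $(command -v smartctl)"
        return 0
    fi
    yum install -y smartmontools
}
```

After installation, running "smartctl --version" confirms that the program is on the PATH.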
3. Getting Started
3.1 Defining the Hosts
Define a host for each server that runs SD5. In this example, the host is defined in the
host1.cfg file.
define host {
    host_name   10.134.12.36
    alias       10.134.12.36
    address     10.134.12.36
    use         linux-server
}
Edit the $NAGIOS_HOME$/etc/nagios.cfg file to include the host1.cfg file.
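Nagios includes an object file through a cfg_file directive in nagios.cfg. The sketch below appends that directive only if it is not already present. The function name add_cfg_file is hypothetical, and the location /usr/local/nagios/etc/host1.cfg in the usage note is an assumption about where host1.cfg was saved.

```shell
# add_cfg_file NAGIOS_CFG OBJECT_FILE
# Appends "cfg_file=OBJECT_FILE" to NAGIOS_CFG unless that exact line exists.
add_cfg_file() {
    nagios_cfg=$1
    object_file=$2
    if grep -qxF "cfg_file=$object_file" "$nagios_cfg"; then
        echo "already included: $object_file"
    else
        echo "cfg_file=$object_file" >> "$nagios_cfg"
        echo "added: $object_file"
    fi
}
```

For example, "add_cfg_file /usr/local/nagios/etc/nagios.cfg /usr/local/nagios/etc/host1.cfg" adds the line once and leaves the file unchanged on later runs.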
3.2 Defining a Command
Define a command for check_snmp_health. Suppose that the command is written in
the commands.cfg file.
define command {
    command_name   check_snmp_health_all
    command_line   /usr/local/nagios/libexec/SSMServerPlugin/check_snmp_health.sh -i $HOSTADDRESS$ -t $ARG1$
}
3.3 Defining the Services
Define the service to be checked by Nagios. Suppose that the service is written in the
host1.cfg file.
define service {
    use                   local-service
    service_description   check_snmp_health
    host_name             10.134.12.36
    check_command         check_snmp_health_all!a
}
You can also define a service to check a particular type of monitored item, e.g., fan,
disk, or memory, by specifying the type argument:
a: all (checks every health item)
w: power
f: fan
c: current
d: disk
m: memory
t: temperature
v: voltage
s: switch
p: processor
r: raid
For example, the service used to check fan status is shown below:
define service {
    use                   local-service
    host_name             10.134.12.36
    service_description   check fan status
    check_command         check_snmp_health_all!f
}
You can also check multiple types of monitored items by concatenating their type letters.
A service used to check disk, fan, and voltage status is shown below.
define service {
    use                   local-service
    host_name             10.134.12.36
    service_description   check disk, fan, and voltage status
    check_command         check_snmp_health_all!dfv
}
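When many per-type services are needed, the definitions can be generated instead of typed by hand. The following sketch prints one service definition per type letter for a single host. The function name service_for_type and the service descriptions it produces are hypothetical; the check_command value follows the command defined in section 3.2.

```shell
# service_for_type HOST TYPE_LETTERS
# Prints a Nagios service definition that checks the given type letters.
service_for_type() {
    host=$1
    types=$2
    cat <<EOF
define service {
    use                   local-service
    host_name             $host
    service_description   check_snmp_health $types
    check_command         check_snmp_health_all!$types
}
EOF
}

# Emit one service per monitored type for a single host.
for t in f d m t v p; do
    service_for_type 10.134.12.36 "$t"
done
```

Redirect the output to the host's configuration file, and review it before validating the Nagios configuration.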
3.4 Validating the Nagios Configurations
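Before restarting Nagios, the configuration can be verified with the -v option of the Nagios binary, which parses the configuration files and reports any errors. The following is a minimal sketch; the function name validate_nagios is hypothetical, and the default paths assume that Nagios is installed in /usr/local/nagios.

```shell
# validate_nagios [NAGIOS_HOME]
# Runs the Nagios configuration check and reports the outcome.
validate_nagios() {
    nagios_home=${1:-/usr/local/nagios}
    if "$nagios_home/bin/nagios" -v "$nagios_home/etc/nagios.cfg"; then
        echo "Configuration OK"
    else
        echo "Configuration check failed; fix the reported errors before restarting" >&2
        return 1
    fi
}
```

Run "validate_nagios" after each change to the host, command, or service definitions.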
3.5 Restarting Nagios Service
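A minimal sketch of the restart step follows. It assumes that Nagios was installed with an init script registered under the service name nagios, which depends on how Nagios was set up on your system. The function name restart_nagios is hypothetical, and the command must be run as root.

```shell
# restart_nagios
# Restarts the Nagios daemon so that the new definitions take effect.
# Assumption: an init script named "nagios" is installed.
restart_nagios() {
    if command -v service >/dev/null 2>&1; then
        service nagios restart
    else
        /etc/init.d/nagios restart
    fi
}
```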
3.6 Connecting to the Nagios Web UI
The check results appear on the Nagios Web UI.
4. Using check_snmp_health
4.1 -h or --help
The -h or --help option shows the help menu.
4.2 -bc
Use the -bc option to specify user-defined thresholds for memory and processor
checking. The argument format is as follows:
[type][duration][fail count],...
[type]:
  m: correctable single-bit ECC errors.
  M: uncorrectable ECC errors.
  p: correctable processor failures.
  P: uncorrectable processor failures.
[duration]: A number followed by one of these units:
  d: day
  h: hour
  m: minute
  s: second
[fail count]: The acceptable number of failures. A critical status is triggered
when the failure count exceeds this value.
Example: The following memory threshold allows four single-bit ECC errors
per 1 GB of RAM within one day (24 hours), i.e., m1d4, and no uncorrectable
ECC errors within one hour, i.e., M1h0.
-bc m1d4,M1h0
Note: To reset the memory error status, refer to “5.2 How to Reset Memory Error Status?”.
Example: The following processor threshold allows two correctable processor
failures within 30 days, i.e., p30d2, and no uncorrectable processor failures
within one hour, i.e., P1h0.
-bc p30d2,P1h0
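Because a mistyped threshold is easy to overlook, the argument can be checked for shape before it is placed in a service definition. The sketch below is a format check inferred from the examples above, not the plug-in's own parser; the function name is_valid_bc is hypothetical.

```shell
# is_valid_bc SPEC
# Succeeds when SPEC is a comma-separated list of [type][duration][fail count]
# items such as m1d4,M1h0. Types: m, M, p, P. Duration units: d, h, m, s.
is_valid_bc() {
    item='[mMpP][0-9]+[dhms][0-9]+'
    printf '%s\n' "$1" | grep -Eq "^$item(,$item)*\$"
}

# Check the two documented examples and one deliberately malformed value.
for spec in m1d4,M1h0 p30d2,P1h0 m1x4; do
    if is_valid_bc "$spec"; then
        echo "$spec: well-formed"
    else
        echo "$spec: malformed"
    fi
done
```

The last value is rejected because x is not one of the documented duration units.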
4.3 -cn
Use the -cn option to specify the expected number of processors, memory modules,
and hard disks. The argument format is as follows:
[type][number],...
[type]:
  p: processor.
  m: memory.
  d: hard disk drives.
[number]: The expected number of processors, memory modules, or hard disks. The
check returns an OK status only when this number equals the number of
processors, memory modules, or hard disks installed on the monitored system.
Example: The following argument indicates that the monitored system has one
processor, four memory DIMMs, and one hard disk drive.
-cn p1,m4,d1
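As a small illustration, the argument can be assembled from the three expected counts. The helper name cn_arg is hypothetical, and combining -cn with -i on one command line is an assumption based on the options described in this chapter. The sketch only prints the resulting command line.

```shell
# cn_arg PROCESSORS DIMMS DISKS
# Prints the -cn argument for the expected component counts.
cn_arg() {
    printf 'p%d,m%d,d%d\n' "$1" "$2" "$3"
}

PLUGIN=/usr/local/nagios/libexec/SSMServerPlugin/check_snmp_health.sh
# Print the command line; remove "echo" to execute the check.
echo "$PLUGIN" -i 10.134.12.36 -cn "$(cn_arg 1 4 1)"
```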
4.4 -co
Use the -co option to specify an SNMP community string.
4.5 -d
Use the -d option to show detailed information about the monitoring logic. This
option is for debugging purposes only and should not be used in Nagios.
4.6 -i
Use the -i option to specify the host name or IP address to be checked.
4.7 -t
Use the -t option to specify the type of monitored items to be checked. The default
value is “all”.
Use the "-t r" option to check the health status of a RAID controller,
including the states of its components such as battery backup units (BBUs), virtual
drives, and hard disks.
For example, when one virtual drive and one hard disk are in an alert state, the
health status of the RAID controller is reported as critical. The RAID controller is
also reported as critical when the BBU is absent.
4.8 -to
Use the -to option to specify the SNMP timeout value. The default value is 15 seconds.
You may need to increase the timeout value if the check_snmp_health plug-in cannot
retrieve all MIBs. For example, if a host has several hard disks, checking for
physical disk failures may take longer than 15 seconds, causing the
check_snmp_health plug-in to time out. To avoid this situation, specify a larger
timeout value with the -to option.
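The following sketch wraps a disk check with a longer timeout. The function name check_disks and the 60-second default are illustrative choices, not values recommended by the plug-in.

```shell
PLUGIN=/usr/local/nagios/libexec/SSMServerPlugin/check_snmp_health.sh

# check_disks HOST [TIMEOUT_SECONDS]
# Runs the disk check with a timeout larger than the 15-second default.
check_disks() {
    "$PLUGIN" -i "$1" -t d -to "${2:-60}"
}
```

For example, "check_disks 10.134.12.36 90" allows the check up to 90 seconds. Nagios enforces its own limit through the service_check_timeout setting in nagios.cfg, so a plug-in timeout above that limit has no effect when the check runs under Nagios.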
5. Appendix
5.1 SD5 FAQ
Q: I see some error messages in the [SD5 install folder]/wrapper.log file. Do you have
a list of all error messages and solutions?
A: Here are the known error messages.
NO: A0001
Message: HealthInfo initialization error. com.supermicro.ssm.tmhealth.model.MotherboardModelNotExistException:
Root Cause: The SD5 is run on a non-Supermicro server. Health information is only available on Supermicro servers.
Solution: Install SD5 on Supermicro servers.

NO: A0002
Message: Unable to start JVM: No such file or directory
Root Cause: The SD5 cannot find the required Java Virtual Machine (JVM) located in the [SD5 install folder]/jre folder.
Solution: Reinstall the SD5.
5.2 How to Reset Memory Error Status?
Q: An uncorrectable ECC error has been raised on a server and I have manually
replaced the problematic memory module. However, the check_snmp_health plug-in still
shows a critical status.
A: The memory error check is based on two conditions:
1. There are memory error logs in the BIOS event log.
2. The time at which the log was generated falls within the check period.
For example, suppose that you use the -bc M1d0 option (i.e., any uncorrectable ECC
error occurring within one day causes a critical state) to check for memory errors. Once an
uncorrectable ECC error has been found, the status remains critical for one day,
even after the problematic memory is replaced. To get an OK status
immediately after repairing the memory, follow these steps:
1. Clear BIOS event logs from the BIOS setup menu.
2. Delete the file [SD5 install folder]/config/bioslogs.txt
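Step 2 can be sketched as a small function that takes the SD5 install folder as its argument, because that folder differs between installations. The function name delete_bioslogs is hypothetical. Step 1 still has to be performed in the BIOS setup menu.

```shell
# delete_bioslogs SD5_HOME
# Deletes [SD5 install folder]/config/bioslogs.txt if it exists.
delete_bioslogs() {
    sd5_home=$1
    if [ -z "$sd5_home" ]; then
        echo "usage: delete_bioslogs SD5_HOME" >&2
        return 2
    fi
    log_file="$sd5_home/config/bioslogs.txt"
    if [ -f "$log_file" ]; then
        rm -f "$log_file" && echo "removed $log_file"
    else
        echo "nothing to remove: $log_file does not exist"
    fi
}
```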
5.3 Can I Disable the SD5 Web?
Q: I only use the check_snmp_health plug-in to check the health of a host and do not
use a browser to view the sensor readings via the SD5 Web. Can I disable it?
A: Yes, the SD5 Web can be disabled during installation. At the Setup SuperDoctor 5
Web step, select 2- No to disable the SD5 Web.
You can also manually disable the SD5 Web after installation. Use a text editor to open
the [SD5 install folder]/plugins/builtin/web/plugin.cfg file.
Change the enabled attribute from 1 to 0 and save the file. Exit the text editor
and restart the SD5 to apply the setting. When the SD5 Web is disabled, TCP ports
8181 and 8444 are not used.
5.4 Can I Disable the NRPE Protocol?
Q: I only use the check_snmp_health plug-in to check the health of a host and do not
use the NRPE protocol to talk with the SD5. Can I disable the support of the NRPE
protocol?
A: The SD5 supports three NRPE connection modes:
Mode A: Plain text with allowed IP (port 5333)
Mode B: Anonymous SSL connection with allowed IP (port 5666)
Mode C: SSL encryption with a public key infrastructure (port 5999)
Because the NRPE protocol is the default connection protocol provided by the SD5, it
cannot be completely turned off. At least one connection mode must be specified. For
modifying the connection mode settings, refer to “3.2 SuperDoctor 5 Connection
Modes” in SuperDoctor 5 User's Guide.
5.5 No Health Information from SNMP Was Fetched
Q: I execute the command check_snmp_health.sh -i [host_ip] and the result shows
“No health information from SNMP was fetched.” What is the problem?
A: Usually this message indicates that the host to be checked does not support
Supermicro MIB. Possible reasons include:
The operating system's built-in SNMP service (i.e., Net-SNMP) is not running.
The SD5 is not running.
The SD5 SNMP extension is not correctly installed.
The SNMP port is blocked by a firewall.
The default timeout value is not long enough for a health check.
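The causes listed above can be narrowed down from the Nagios server with the Net-SNMP snmpwalk tool. The following is a sketch under stated assumptions: the agent answers SNMP v2c, the community string defaults to public, and the SD5 MIB data sits under Supermicro's enterprise subtree 1.3.6.1.4.1.10876. The function name diagnose_snmp is hypothetical.

```shell
# diagnose_snmp HOST [COMMUNITY] [TIMEOUT_SECONDS]
# Narrows down why no Supermicro health data is returned over SNMP.
diagnose_snmp() {
    host=$1
    community=${2:-public}
    timeout=${3:-30}
    # Assumption: SD5 publishes its data under Supermicro's enterprise subtree.
    supermicro_oid=1.3.6.1.4.1.10876

    if ! command -v snmpwalk >/dev/null 2>&1; then
        echo "snmpwalk not found; install the Net-SNMP command-line tools" >&2
        return 2
    fi

    # Step 1: query the standard "system" subtree to confirm the agent answers.
    if ! snmpwalk -v 2c -c "$community" -t "$timeout" -r 1 "$host" 1.3.6.1.2.1.1 >/dev/null 2>&1; then
        echo "No SNMP reply: check the SNMP service, the community string, and the firewall"
        return 1
    fi

    # Step 2: look for data under the Supermicro subtree.
    out=$(snmpwalk -v 2c -c "$community" -t "$timeout" -r 1 "$host" "$supermicro_oid" 2>&1)
    case $out in
        ""|*"No Such"*|*"No more variables"*)
            echo "Agent replies but returns no Supermicro data: check that SD5 is running and its SNMP extension is installed"
            return 1 ;;
    esac
    echo "Supermicro MIB data is available; if the plug-in still fails, try a larger -to value"
}
```

For example, "diagnose_snmp 10.134.12.36" reports which of the two stages fails.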
Contacting Supermicro
Headquarters
Address: Super Micro Computer, Inc.
980 Rock Ave.
San Jose, CA 95131 U.S.A.
Tel: +1 (408) 503-8000
Fax: +1 (408) 503-8008
Email: [email protected] (General Information)
[email protected] (Technical Support)
Web Site: www.supermicro.com
Europe
Address: Super Micro Computer B.V.
Het Sterrenbeeld 28, 5215 ML
's-Hertogenbosch, The Netherlands
Tel: +31 (0) 73-6400390
Fax: +31 (0) 73-6416525
Email: [email protected] (General Information)
[email protected] (Technical Support)
[email protected] (Customer Support)
Asia-Pacific
Address: Super Micro Computer, Inc.
3F, No. 150, Jian 1st Rd.
Zhonghe Dist., New Taipei City 23511
Taiwan (R.O.C)
Tel: +886-(2) 8226-3990
Fax: +886-(2) 8226-3992
Web Site: www.supermicro.com.tw
Technical Support:
Email: [email protected]
Tel: +886-(2)-8226-3990