Performance Monitoring of SLAC Blackbox Nodes Using Perl, Nagios, and Ganglia Roxanne Martinez Mentor: Yemi Adesanya United States Department of Energy

Performance Monitoring of SLAC Blackbox Nodes Using Perl, Nagios, and Ganglia

Roxanne Martinez

Mentor: Yemi Adesanya

United States Department of Energy

Stanford, CA 94305

SCCS

The Scientific Computing and Computing Services at SLAC:

• Provides computing power, technical support, and communications capabilities.

• Core services include Unix systems, Windows, networking, network operations, and telecommunications.

• Supplies departmental support, science applications, and network security.

• Houses thousands of servers.

The High Performance Computing Group of SCCS

• To ensure optimal computing performance of all of these servers, they must be monitored. This is the responsibility of the HPC group.

• The group watches data storage, electrical service to the servers, and cooling system capacity.

• This is made possible through the use of monitoring software: Nagios and Ganglia.

SCCS Task

• Until last year, all computing capacity at SLAC was located within the SCCS computing building.

• By then the datacenter had reached its maximum electrical service and cooling system capacities.

• New experiments meant the need for more computing power.

• A new datacenter would take years and a lot of funding to complete.

The Solution: Blackboxes

• This is a Sun Modular Datacenter produced by Sun Microsystems.

• It is a portable computing center built into a standard 8 foot by 20 foot shipping container.

• It is painted white for energy efficiency and is tightly sealed, insulated, and cooled.

• Today, SLAC maintains 2 blackboxes.

Blackbox Contents

• Blackbox 1
– 252 bali machines (Sun X2200 servers)

• Blackbox 2
– 156 yili machines (Sun X4100 servers)
– 139 boer machines (Sun X2200 servers)

The operating system on these machines is Red Hat Enterprise Linux (RHEL) version 4.

Current Monitoring of the Blackboxes

The High Performance Computing Group currently uses Nagios and Ganglia to monitor:

• Percentage of CPU in use,

• Amount of memory in use, and

• Input/output rates.

The software periodically calls on utilities to extract monitoring data for the machines, displaying the info in graphs, storing the info in databases, and – in the case of Nagios – alerting administrators if machines reach warning or critical states.

Nagios

• User specifies items to be monitored by providing external plugins that return the status of machines to Nagios.

• If a warning or critical status is returned, Nagios can alert via email, IM, text, etc.

• Admins and users can view current status and history using a web browser.
– MySQL runs as a server to provide multi-user access to multiple databases. Interface: PerfParse.
– Round robin database (RRD) provides useful graphs of broad historical data. Popular because the database files do not increase in size over time.
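The way a plugin reports status to Nagios can be sketched in a few lines of Perl. This is a minimal illustration, not the project's script: the exit codes 0 to 3 are the standard Nagios plugin convention, while the sub name `plugin_result` and the sample values are hypothetical.

```perl
use strict;
use warnings;

# Standard Nagios plugin exit codes.
my %EXIT_CODE = (OK => 0, WARNING => 1, CRITICAL => 2, UNKNOWN => 3);

# Build the one-line message and the exit code a plugin hands back to Nagios.
# plugin_result is a hypothetical helper name used only in this sketch.
sub plugin_result {
    my ($service, $status, $detail) = @_;
    # Anything that is not a known status is reported as UNKNOWN.
    $status = 'UNKNOWN' unless exists $EXIT_CODE{$status};
    return ("$service $status - $detail", $EXIT_CODE{$status});
}

# Sample values are made up for illustration.
my ($line, $code) = plugin_result('Temperature', 'WARNING', 'CPU_0_Temp=71.000');
print "$line\n";   # prints "Temperature WARNING - CPU_0_Temp=71.000"
# A real plugin would finish with: exit $code;
```

Nagios reads the first line of output as the status text and uses the exit code to decide whether to raise an alert.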

Ganglia

• Robust scalable distributed monitoring system designed for clusters and grids.

• Based on a hierarchical design: uses a tree of connections to representative nodes for each cluster, reducing overheads.

• Updates the RRD.

• Has a web frontend like Nagios but does not have an alerting feature.

Additional Monitoring Needed

• Temperature

• Fan speed

• Power supply voltage

“Materials”

• Baseboard management controller (BMC)
– Service processor that monitors the physical state of the machine.
– Located on the motherboard.
– Performs monitoring through the machine's sensors.
– Part of the Intelligent Platform Management Interface (IPMI), which provides a set of interfaces to manage and monitor a system.

• IPMI tool
– Open source utility.
– Can be used to extract physical parameters and parameter thresholds. These are important in determining the status.
  • Lower Non-Recoverable, Lower Critical, Lower Non-Critical, Upper Non-Critical, Upper Critical, and Upper Non-Recoverable
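One plausible way to turn the six thresholds into a status is sketched below. The mapping is an assumption made for illustration, not something the slides specify: crossing a non-critical threshold gives WARNING, crossing a critical or non-recoverable threshold gives CRITICAL, and a missing reading gives UNKNOWN. The key names (`lnr`, `lcr`, `lnc`, `unc`, `ucr`, `unr`) and the sample limits are hypothetical.

```perl
use strict;
use warnings;

# Classify one reading against the six IPMI thresholds.
# $t is a hash ref keyed by lnr lcr lnc unc ucr unr (names chosen for this
# sketch); a threshold the BMC does not define is absent or undef.
sub classify {
    my ($value, $t) = @_;
    return 'UNKNOWN' unless defined $value;
    # Critical and non-recoverable limits are checked first, on both sides.
    for my $k (qw(lnr lcr)) {
        return 'CRITICAL' if defined $t->{$k} && $value <= $t->{$k};
    }
    for my $k (qw(unr ucr)) {
        return 'CRITICAL' if defined $t->{$k} && $value >= $t->{$k};
    }
    # Non-critical limits map to WARNING (an assumption of this sketch).
    return 'WARNING' if defined $t->{lnc} && $value <= $t->{lnc};
    return 'WARNING' if defined $t->{unc} && $value >= $t->{unc};
    return 'OK';
}

my %cpu_limits = (unc => 70, ucr => 80, unr => 90);   # made-up thresholds
print classify(49, \%cpu_limits), "\n";   # prints "OK"
print classify(75, \%cpu_limits), "\n";   # prints "WARNING"
print classify(85, \%cpu_limits), "\n";   # prints "CRITICAL"
```

Treating undefined thresholds as "no limit" lets the same function serve temperature sensors, which typically have only upper limits, and fan sensors, which typically have only lower ones.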

“Materials” continued

“sudo ipmitool -c sdr”

“sudo ipmitool sensor list”

The output of both commands was captured while connected to the Sun X2200 server boer0113.
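As an illustration of how such output might be consumed, the sketch below parses one line in the pipe-separated layout that "ipmitool sensor list" is assumed to use here: name, reading, unit, status, then the six thresholds from Lower Non-Recoverable to Upper Non-Recoverable. The sample line and the sub name `parse_sensor_line` are made up for this example and are not captured output from boer0113.

```perl
use strict;
use warnings;

# Column order assumed for "ipmitool sensor list":
# name | reading | unit | status | LNR | LCR | LNC | UNC | UCR | UNR
my @COLUMNS = qw(name reading unit status lnr lcr lnc unc ucr unr);

# Parse one line into a hash ref, or return nothing if the line
# does not have the expected number of fields.
sub parse_sensor_line {
    my ($line) = @_;
    (my $trimmed = $line) =~ s/^\s+|\s+$//g;
    # A limit of -1 keeps trailing empty fields instead of dropping them.
    my @fields = split /\s*\|\s*/, $trimmed, -1;
    return unless @fields == @COLUMNS;
    my %sensor;
    @sensor{@COLUMNS} = @fields;
    # "na" marks a reading or threshold the BMC does not provide.
    for my $k (qw(reading lnr lcr lnc unc ucr unr)) {
        $sensor{$k} = undef if $sensor{$k} eq 'na';
    }
    return \%sensor;
}

# Illustrative line with made-up values.
my $s = parse_sensor_line(
    'CPU 0 Temp | 49.000 | degrees C | ok | na | na | na | 70.000 | 80.000 | 90.000');
printf "%s = %s %s (upper critical %s)\n", @{$s}{qw(name reading unit ucr)};
# prints "CPU 0 Temp = 49.000 degrees C (upper critical 80.000)"
```

Keying on column position rather than sensor name is one way to stay independent of the specific BMC, since sensor names differ between server models.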

“Materials” continued

• Cron (Chronograph)
– Time-based scheduling service in Unix.
– Used for security reasons since root user is needed to collect data.

• Perl
– Ideal Unix scripting language for the task.
– Interpreted language; no compiler.
– Efficient programming language that is powerful for file input and output because of its text manipulation capabilities and fast development cycle.

Task

Create three Perl scripts (temperature, fan speed, voltage) that can be used on any machine regardless of the specific BMC.
– Work first with yili0113, bali0113, and boer0113.
– Cron will run as the root user to call the IPMI tool and will store the data in a readable file every 15 minutes.
– The scripts will read the data from the file every 15 minutes to produce the current machine parameters and interpret the current status of the machine (OK, WARNING, CRITICAL, UNKNOWN).
– For Nagios, the scripts will return the current status and parameters.
– For Ganglia, the scripts will call the Ganglia command, passing in the parameters.
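The Ganglia step might look roughly like the sketch below. It assumes that "the Ganglia command" is Ganglia's `gmetric` utility, which publishes a single metric through its `--name`, `--value`, `--type`, and `--units` options. The sub name `gmetric_args`, the metric name, and the units are hypothetical choices for this example.

```perl
use strict;
use warnings;

# Build the argument list for gmetric (assumed here to be the Ganglia
# command the task list refers to). Returning a list rather than a single
# string lets system() run it without going through a shell.
sub gmetric_args {
    my ($name, $value, $units) = @_;
    return ('gmetric',
            '--name',  $name,
            '--value', $value,
            '--type',  'float',
            '--units', $units);
}

# Sample metric; the name and units are made up for illustration.
my @cmd = gmetric_args('CPU_0_Temp', '49.000', 'Celsius');
print join(' ', @cmd), "\n";
# prints "gmetric --name CPU_0_Temp --value 49.000 --type float --units Celsius"

# On a node running Ganglia's gmond, the metric would be published with:
#   system(@cmd) == 0 or warn "gmetric failed: $?\n";
```

The `system` call is left commented out so the sketch runs on a machine without Ganglia installed.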

Results

• In a test of the check_cpu_temp.pl script on the bali0113 machine, the following results were produced using the Perl interpreter:

“Temperature OK - CPU_0_Temp=49.000, CPU_1_Temp=51.000 | CPU_0_Temp=49.000 CPU_1_Temp=51.000”
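The line above follows the usual Nagios plugin layout: human-readable text, a "|" separator, then performance data as space-separated label=value pairs. A minimal sketch of how a line in that shape could be assembled is shown here; `format_output` is a hypothetical helper, not code from check_cpu_temp.pl.

```perl
use strict;
use warnings;

# Format a Nagios status line: readable text, then "|", then
# space-separated performance data in label=value form.
# Each reading is passed as a [label, value] array ref.
sub format_output {
    my ($service, $status, @readings) = @_;
    my @pairs = map { sprintf('%s=%.3f', @$_) } @readings;
    return sprintf('%s %s - %s | %s',
                   $service, $status, join(', ', @pairs), join(' ', @pairs));
}

print format_output('Temperature', 'OK',
                    ['CPU_0_Temp', 49], ['CPU_1_Temp', 51]), "\n";
# prints "Temperature OK - CPU_0_Temp=49.000, CPU_1_Temp=51.000 | CPU_0_Temp=49.000 CPU_1_Temp=51.000"
```

The part after the "|" is what graphing and database tools pick up, which is why the readings appear twice in the line.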

The Scripts as Nagios Plugins

Ganglia work is still underway!

Conclusions

• Perl scripts, Nagios monitoring, and graphics tools work successfully.

• All three test machines are running with acceptable temperatures, fan speeds, and power supply voltages. This suggests that current cooling systems and electrical supplies in the blackboxes are effective. The monitoring must be done on all servers, however, for a complete evaluation to be possible.

• The HPC group is much closer to ensuring optimal computing performance for the lab.

Future Work

• The scripts are portable.
– 3 test machines
– KIPAC machines
– All blackbox machines upon approval
– Possibly more to come

• The scripts can also be edited to monitor different parameters.

Acknowledgements

Thank you to the U.S. Department of Energy Office of Science and the Stanford Linear Accelerator Center for the opportunity to participate in the Science Undergraduate Laboratory Internships program. Thank you to Steve, Susan, and Farah. Thank you to my mentor, Yemi Adesanya, for his mentorship throughout the project.
