A monitoring tool for a GRID operation center
Sergio Andreozzi (INFN CNAF), Sergio Fantinel (INFN Padova),
David Rebatto (INFN Milano), Gennaro Tortone (INFN Napoli), Luca Vaccarossa (INFN Milano)
CHEP2003 - March 24-28, 2003 - La Jolla, California
Summary
- introduction to DataTAG project
- monitoring of grid elements
- first implementation: WorldGRID resources monitoring
- the evolution: DataTAG WP4 resources monitoring
- future activities
Introduction
DataTAG project
DataTAG is an EU-funded project that will create a large-scale intercontinental Grid testbed, focusing on advanced networking issues and on interoperability between intercontinental Grid domains.
The project will address the issues that arise in high performance inter-Grid networking, including sustained and reliable high performance data replication, end-to-end advanced network services, and novel monitoring techniques. It will also directly address the issues that arise in interoperability between Grid middleware layers, such as information and security services. The advances made will be disseminated into each of the associated Grid projects.
detailed information on: http://www.datatag.org
DataTAG Work Package 4
The task of DataTAG WP4 (Interoperability between Grid domains) is to address issues of middleware interoperability between the European and US Grid domains and to enable a selected set of applications to run on the Transatlantic Grid Testbed
main activities include:
- Grid Resource Model for Computing and Storage resources (GLUE schema)
- Virtual Organisation Membership Service implementation
- Grid Monitoring
- Resource Discovery
- LHC experiment applications integration
Monitoring of grid elements (1/2)
LOW LEVEL measurements
- CPU load
- memory usage
- disk usage (per partition)
- network activity
- number of processes
- number of users (UI)
- …
Grid elements
- Computing Element
- Storage Element
- Worker Node
- Resource Broker
- Information Index
- Replica Manager
- Replica Catalog
- […]
Grid services checks
- gatekeeper
- gsiftp
- gris
- gdmp
- RB/LB
- …
“GRID” measurements
- number of total CPUs
- number of free CPUs
- number of running jobs
- number of waiting jobs
- SE free disk space
- …
Monitoring of grid elements (2/2)
sources of information
- LOW LEVEL measurements: plugins/sensors installed on each machine
- SERVICE checks: sensors installed on monitoring server
- GRID measurements: sensors installed on monitoring server
aggregate information (monitoring server side)
- per Virtual Organisation
- per site
- …
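A minimal sketch of the server-side aggregation, assuming each sample is a record with hypothetical `site`, `vo`, `total_cpus` and `free_cpus` fields (the field names and values are illustrative and not taken from the tool):

```python
from collections import defaultdict

def aggregate(samples, key):
    """Sum the CPU counters of all samples sharing the same value of `key`."""
    totals = defaultdict(lambda: {"total_cpus": 0, "free_cpus": 0})
    for s in samples:
        group = totals[s[key]]
        group["total_cpus"] += s["total_cpus"]
        group["free_cpus"] += s["free_cpus"]
    return dict(totals)

# Hypothetical samples, one per computing element.
samples = [
    {"site": "site-a", "vo": "cms", "total_cpus": 20, "free_cpus": 5},
    {"site": "site-a", "vo": "atlas", "total_cpus": 10, "free_cpus": 10},
    {"site": "site-b", "vo": "cms", "total_cpus": 8, "free_cpus": 2},
]

per_site = aggregate(samples, "site")  # aggregate view per site
per_vo = aggregate(samples, "vo")      # aggregate view per Virtual Organisation
```

The same function serves both views because only the grouping key changes.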
first implementation: WorldGRID resources monitoring
WorldGRID testbed
WorldGRID is a “transatlantic grid” based on the existing European and American Grids, with the goal of offering transparent access to the distributed computing infrastructure required by modern “data-intensive” applications
The WorldGRID testbed was successfully demonstrated during the WorldGRID demos at SuperComputing 2002 (Baltimore) and IST 2002 (Copenhagen), where real HEP application jobs were transparently submitted from the US and Europe and run wherever resources were available, independently of their location
WorldGRID monitoring based on Nagios (a host and service monitoring engine)
[detailed information on: http://www.nagios.org]
host local plug-ins: collect info from OS
- CPU load
- RAM
- disk
- jobs
MDS plug-ins: collect aggregate info from GRIS
- number of running/waiting jobs
- number of total/free CPUs
history graphs for all monitoring metrics
aggregate info/graphs per Site and Virtual Organisation
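Nagios plug-ins report their result through an exit status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and a one-line status message. The sketch below shows that convention for a waiting-jobs check; the function name, thresholds and message wording are assumptions for illustration, not the plug-ins used in WorldGRID:

```python
# Nagios plug-in exit status convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_waiting_jobs(waiting, warn, crit):
    """Map a waiting-jobs count to a (status code, status line) pair."""
    if waiting is None:
        return UNKNOWN, "JOBS UNKNOWN - no data from GRIS"
    if waiting >= crit:
        return CRITICAL, f"JOBS CRITICAL - {waiting} waiting jobs"
    if waiting >= warn:
        return WARNING, f"JOBS WARNING - {waiting} waiting jobs"
    return OK, f"JOBS OK - {waiting} waiting jobs"

# A real plug-in would obtain the count from the GRIS; it is hard-coded here.
code, line = check_waiting_jobs(waiting=12, warn=10, crit=50)
```

A plug-in script would then print `line` and terminate with `code` as its exit status.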
the evolution: DataTAG-WP4 implementation for resources monitoring
Description
GOAL
The objective of the task is to develop software for use in the Grid Operations Centres in order to monitor the overall functioning of the grid. The software should enable the grid administrators to quickly identify problems in the operation of the grid and take appropriate action to rectify them
People involved
- Sergio Andreozzi (INFN CNAF)
- Vincenzo Ciaschini (INFN CNAF)
- Sergio Fantinel (INFN Padova)
- Antonia Ghiselli (INFN CNAF)
- Flavia Donno (CERN-LCG)
- Gennaro Tortone (INFN Napoli)
- Cristina Vistoli (INFN CNAF)
Requirements (1/2)
Features required:
- scalability
- very low intrusiveness
- automatic resource discovery
- fault detection and notification
- metrics graphs
The GOC administrator should be presented with an integrated view of the grid showing the overall functional status of the grid and the various sites with various levels of detail
Requirements (2/2)
The system should provide a facility for defining alarms on specific conditions, expressed as functions of the various parameters monitored. When such alarms are triggered, the administrator should be alerted appropriately
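One way to read "alarms as functions of the monitored parameters" is sketched below: each alarm is a predicate over a dictionary of metrics. The rule names, metric names and thresholds are all hypothetical and only illustrate the idea:

```python
def triggered_alarms(metrics, rules):
    """Return the names of all rules whose condition holds for `metrics`."""
    return [name for name, condition in rules.items() if condition(metrics)]

# Hypothetical rules: each alarm is a function of the monitored parameters.
rules = {
    "no_free_cpus": lambda m: m["free_cpus"] == 0 and m["waiting_jobs"] > 0,
    "disk_almost_full": lambda m: m["free_disk_gb"] / m["total_disk_gb"] < 0.05,
    "high_load": lambda m: m["load_avg"] > 2 * m["total_cpus"],
}

# Hypothetical snapshot of the metrics gathered for one site.
metrics = {
    "free_cpus": 0, "total_cpus": 16, "waiting_jobs": 7,
    "free_disk_gb": 400, "total_disk_gb": 1000, "load_avg": 12.5,
}

alarms = triggered_alarms(metrics, rules)
```

Keeping rules as data makes it possible to add or change alarm conditions without touching the polling code.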
The system should poll all the sites and gather the static and dynamic information about the resources in the site.
static information covers parameters like number of computing elements, total storage capacities, total memory etc.
dynamic information covers parameters like number of running jobs, number of jobs in queue, free memory, free storage space, load average etc.
All interfaces should be web based
Features provided by current implementation
The new Grid Monitoring Tool is based on the Grid Information System, implemented over the LDAP protocol with the GLUE schema
it provides various monitoring levels:
- host level (by GLUE schema monitoring extension)
- fabric level (by DataGRID WP4 monitoring framework)
- Virtual Organisation level (by automatic resources discovery and checks scheduling)
it also provides a historical database in order to generate graphs or reports of some measurements
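As an illustration of a historical store that can feed graphs or reports, the sketch below keeps timestamped samples in SQLite. The table layout, host name and metric names are assumptions; the source does not say which database or schema the tool uses:

```python
import sqlite3

# In-memory database for the example; a deployment would use persistent storage.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples (host TEXT, metric TEXT, ts INTEGER, value REAL)"
)

def store(host, metric, ts, value):
    """Append one measurement to the history."""
    conn.execute("INSERT INTO samples VALUES (?, ?, ?, ?)", (host, metric, ts, value))

def history(host, metric, since):
    """Return (timestamp, value) pairs ordered by time, ready for plotting."""
    cur = conn.execute(
        "SELECT ts, value FROM samples "
        "WHERE host = ? AND metric = ? AND ts >= ? ORDER BY ts",
        (host, metric, since),
    )
    return cur.fetchall()

store("ce01.example.org", "cpu_load", 1000, 0.5)
store("ce01.example.org", "cpu_load", 1060, 1.5)
store("ce01.example.org", "free_mem", 1060, 512.0)

series = history("ce01.example.org", "cpu_load", since=1000)
```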
GLUE schema (host level monitoring)
Conceptual model of grid resources to be used as a base schema of the GIS (Grid Information Service) for discovery and monitoring purposes
- model of computing resources (CE)
- model of storage resources (SE)
- model of relationships among them (close CE/SE)
Implementation status (v. 1.0) (for Globus MDS)
- LDAP schema (DataTAG WP4.1)
- information providers (CE/SE)
we implemented an extension to include all monitoring metrics (“host level” added to GLUE schema)
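To show what consuming GLUE data from an LDAP-based information service can look like, the sketch below parses a single LDIF entry into a dictionary. The attribute names follow the GLUE LDAP rendering as commonly published (for example `GlueCEStateFreeCPUs`) and should be treated as assumptions for schema v. 1.0; the host name and values are made up, and the parser ignores LDIF line folding and base64 values:

```python
def parse_ldif_entry(text):
    """Parse one simple LDIF entry into a dict mapping attribute -> list of values."""
    entry = {}
    for line in text.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        attr, _, value = line.partition(": ")
        entry.setdefault(attr, []).append(value)
    return entry

# Hypothetical entry for one computing element queue.
sample = """
dn: GlueCEUniqueID=ce01.example.org:2119/jobmanager-pbs-short,mds-vo-name=local,o=grid
objectClass: GlueCE
GlueCEUniqueID: ce01.example.org:2119/jobmanager-pbs-short
GlueCEInfoTotalCPUs: 20
GlueCEStateFreeCPUs: 5
GlueCEStateRunningJobs: 15
GlueCEStateWaitingJobs: 3
"""

entry = parse_ldif_entry(sample)
free_cpus = int(entry["GlueCEStateFreeCPUs"][0])
```

Attribute values are kept as lists because LDAP attributes may be multi-valued.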
DataGrid-WP4 monitoring framework (fabric level monitoring)
It provides a client (Monitoring Sensor Agent - MSA) running sensors (Monitoring Sensors - MS) on each node to monitor, and a central server (Fabric Monitoring Server - fmonServer) to collect data.
The server receives samples as they are measured by MSA, and stores them in a flat file / Oracle database
The client is provided with a sensor (sensorLinuxProc) which uses the /proc file system to measure various basic quantities on Linux (CPU load, network, etc.)
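For readers unfamiliar with /proc-based measurement, the sketch below reads the load averages from `/proc/loadavg`. It illustrates the general technique only and is not the sensorLinuxProc implementation; the function and key names are assumptions:

```python
def parse_loadavg(text):
    """Parse the content of /proc/loadavg into named fields."""
    fields = text.split()
    # Fourth field has the form "running/total" scheduling entities.
    running, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "running_procs": int(running),
        "total_procs": int(total),
    }

def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live value on a Linux host."""
    with open(path) as f:
        return parse_loadavg(f.read())

# Parsing a captured line keeps the example runnable on any platform.
sample = parse_loadavg("0.20 0.18 0.12 1/80 11206\n")
```

Separating parsing from reading lets the parser be tested without a Linux host.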
EDG-WP4 monitoring framework
[Diagram labels: local farm element, computing element]
Discovery process (Virtual Organisation level monitoring)
Through the GIIS, via LDAP, we can obtain the CE/SE available at a specific time.
Using a DB, we compare the info from the GIIS with the previous status of resource availability (an object can be new, disappeared, or re-available)
Through the GRIS of the CE/SE we can obtain SITE/HOSTS info (we repeat the discovery process at site level to get site resources/info: queues, worker nodes, network adapters, disk partitions, supported transfer protocols, …)
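The comparison between the GIIS answer and the stored status can be sketched as follows. The data structures are assumptions: the stored status is modelled as a dictionary from resource ID to an availability flag, and the resource IDs are made up:

```python
def classify(current, known):
    """Compare the CE/SE seen now with the stored status.

    `current` is the set of resource IDs returned by the GIIS;
    `known` maps every previously seen ID to True (was available)
    or False (had disappeared).
    """
    changes = {"new": set(), "disappeared": set(), "re-available": set()}
    for rid in current:
        if rid not in known:
            changes["new"].add(rid)
        elif not known[rid]:
            changes["re-available"].add(rid)
    for rid, available in known.items():
        if available and rid not in current:
            changes["disappeared"].add(rid)
    return changes

def update(known, current):
    """Return the status table to store for the next discovery run."""
    updated = {rid: False for rid in known}
    updated.update({rid: True for rid in current})
    return updated

known = {"ce-a": True, "ce-b": True, "se-c": False}
current = {"ce-a", "se-c", "ce-d"}

changes = classify(current, known)
known = update(known, current)
```

Here `ce-d` is new, `ce-b` has disappeared and `se-c` is re-available. The same comparison can be repeated at site level for queues, worker nodes and the other objects returned by each GRIS.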
Discovery process: base schema
[Diagram: Monitoring Server, GIIS Server (GIIS), Computing Element/Storage Element (GRIS), Monitoring DB; links labelled LDAP (to GIIS and GRIS) and SQL (to Monitoring DB)]
1: LDAP Query
2: available CE/SE
3: LDAP Query
4: CEIDs, WNs,
Steps 3,4 repeated for every CE/SE
[Diagram components]
- worker node: /proc filesystem, WP4 sensor, EDG-WP4 monitoring agent (run, read, metric output)
- computing element: EDG-WP4 fmonserver, farm monitoring archive (write, read), information providers (run, ldif output), GRIS (GLUE schema)
- information index: GIIS (GLUE schema)
- monitoring server: discovery service, monitoring service (ldap query), web interface, Central Monitoring Database
Future activities
- job monitoring
- evaluation of OGSA monitoring service(s)
- evaluation/usage of SOAP interface provided by DataGrid WP4 monitoring framework in order to implement a distributed archive for monitoring metrics