EGEE-II INFSO-RI-031688
Enabling Grids for E-sciencE
www.eu-egee.org
EGEE and gLite are registered trademarks
Monitoring, accounting and automated decision support for the ALICE experiment based on the MonALISA framework
Catalin Cirstoiu (IT/PSS/ED)
Costin Grigoras, Latchezar Betev (PH/AIP)
EGEE User Forum, May 9th-11th 2007, Manchester
Contents
• Monitoring requirements
• MonALISA overview
• Application monitoring
• Monitoring architecture in AliEn
  – Jobs monitoring
  – Traffic monitoring
  – Services monitoring
  – Nodes monitoring
• Actions framework
• Feature snapshots
Monitoring Requirements
• Global view of the entire distributed system
  – Non-intrusive
  – Accurate
• Providing
  – Near real-time information
  – Long-term history of aggregated data
• On key parameters like
  – System status
  – Resource usage
• Helping with
  – Correlating events
  – System debugging
  – Generating reports
• Taking automated actions based on the monitored data
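The "long-term history of aggregated data" requirement can be illustrated with a minimal sketch that averages raw samples into fixed-width time bins. The function name, the bin width and the sample values are assumptions made for illustration; this is not how the MonALISA repository itself stores history.

```python
from collections import defaultdict


def aggregate_history(samples, bin_seconds):
    """Average (timestamp, value) samples into fixed-width time bins."""
    # Each bin keeps a running [sum, count] pair.
    sums = defaultdict(lambda: [0.0, 0])
    for timestamp, value in samples:
        bin_start = int(timestamp // bin_seconds) * bin_seconds
        acc = sums[bin_start]
        acc[0] += value
        acc[1] += 1
    # Return the mean of each bin, ordered by bin start time.
    return {start: total / count for start, (total, count) in sorted(sums.items())}


# Hypothetical raw samples: (seconds since start, measured value).
raw = [(0, 1.0), (30, 3.0), (60, 5.0), (119, 7.0), (120, 9.0)]
history = aggregate_history(raw, bin_seconds=60)
```

Near real-time views would read the raw samples directly, while long-term plots would read only the much smaller binned series.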
MonALISA Overview
• MonALISA is a dynamic, distributed service architecture capable of collecting any type of information from different systems, analyzing it in near real time, and supporting automated control decisions and global optimization of workflows in complex grid systems.
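[Figure: MonALISA service architecture. Data modules, filters and agents feed a data cache service and DB, backed by a data store (Postgres, MySQL). Services register with lookup services, through which clients discover them. Java client applications (or other services) receive data via the ML Proxy and submit predicates and agents; WS clients (or other services) use the web service interface (WSDL/SOAP). Configuration control is done over SSL.]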
ApMon – Application Monitoring
• Lightweight library of APIs (C, C++, Java, Perl, Python) that can be used to send any information to MonALISA Services
• High communication performance
• Flexible
• Accounting
• System monitoring
[Figure: applications linked with ApMon send monitoring data as UDP/XDR datagrams to one or more MonALISA services. Each datagram carries time, IP and process ID along with the parameters. Example application monitoring values: Mbps_out: 0.52, Status: reading, MB_inout: 562.4, parameter1: value, parameter2: value. Example system monitoring values: load1: 0.24, processes: 97, pages_in: 83. The ApMon configuration (list of MonALISA hosts) is generated automatically by a servlet / CGI script and reloaded dynamically.]
[Plot: MonALISA CPU usage (%) versus messages per second, for 0 to 6000 messages per second. Annotation: no lost packets.]
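To make the "UDP/XDR" transport concrete, here is a minimal sketch that packs a set of named values using XDR encoding rules (big-endian, 4-byte alignment) and sends them in one UDP datagram. The datagram layout, the function names and the cluster/node names are assumptions for illustration only; this is not the actual ApMon wire format or API.

```python
import socket
import struct


def xdr_string(text):
    """XDR string: 4-byte big-endian length, bytes, zero padding to a multiple of 4."""
    data = text.encode("utf-8")
    padding = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding


def xdr_double(value):
    """XDR double: 8-byte big-endian IEEE 754."""
    return struct.pack(">d", value)


def encode_datagram(cluster, node, params):
    """Hypothetical layout: cluster, node, parameter count, then (name, double) pairs."""
    payload = xdr_string(cluster) + xdr_string(node) + struct.pack(">i", len(params))
    for name, value in params.items():
        payload += xdr_string(name) + xdr_double(float(value))
    return payload


def send_datagram(payload, host, port):
    """Fire-and-forget UDP send; the sender never waits for a reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Values taken from the slide; the cluster and node names are made up.
packet = encode_datagram(
    "App_Monitoring", "worker01", {"Mbps_out": 0.52, "MB_inout": 562.4}
)
# send_datagram(packet, "monalisa.example.org", 12345)  # placeholder host and port
```

Because UDP is connectionless, a slow or unreachable monitoring service cannot block the application, which fits the non-intrusive requirement stated earlier.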
Monitoring Architecture in AliEn
[Figure: monitoring architecture. Site-side components (AliEn Job Agents, AliEn CE, AliEn SE, ClusterMonitor) and central services (AliEn TQ, AliEn IS, AliEn Optimizers, AliEn Brokers, MySQL servers, Castor grid scripts, API services) are instrumented with ApMon and report to MonALISA services at the LCG sites and at CERN. Aggregated data flows to the MonALISA Repository with its long history DB (http://pcalimonitor.cern.ch:8889/), alongside the LCG tools, and drives alerts and actions.]
[Monitored parameters shown in the figure: rss, vsz, cpu time, run time, job slots, free space, nr. of files, open files, queued JobAgents, cpu ksi2k, job status, disk used, processes, load, net in/out, jobs status, sockets, migrated mbytes, active sessions, MyProxy status.]
Job Status Monitoring
• Global summaries
  – For each/all conditions
  – For each/all sites
  – For each/all users
  – Running & cumulative
• Error status
• From job agents
• From central services
• Multiple views
  – Real-time map
  – Integrated pie charts
  – Long history plots
History Plots, Annotations
Job Resource Usage Monitoring
• Cumulative parameters
  – CPU Time & CPU KSI2K
  – Wall time & Wall KSI2K
  – Read & written files
  – Input & output traffic (xrootd)
• Running parameters
  – Resident memory
  – Virtual memory
  – Open files
  – Workdir size
  – Disk usage
  – CPU usage
• Aggregated per site
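The per-site aggregation of job parameters can be sketched as a simple group-and-sum. The report field names, the site name "SiteB" and the rule "CPU KSI2K = CPU time multiplied by a per-node KSI2K rating" are assumptions for illustration; the slides do not spell out the exact normalization.

```python
from collections import defaultdict


def aggregate_per_site(job_reports):
    """Sum per-job values into per-site totals."""
    totals = defaultdict(lambda: defaultdict(float))
    for report in job_reports:
        site = totals[report["site"]]
        site["jobs"] += 1
        # Cumulative parameter: raw CPU seconds.
        site["cpu_time"] += report["cpu_time"]
        # Assumed normalization: CPU seconds scaled by the node's KSI2K rating.
        site["cpu_ksi2k"] += report["cpu_time"] * report["ksi2k_factor"]
        # Running parameter: resident memory currently in use.
        site["rss_mb"] += report["rss_mb"]
    return {site: dict(values) for site, values in totals.items()}


# Hypothetical job reports.
reports = [
    {"site": "CERN", "cpu_time": 100.0, "ksi2k_factor": 1.5, "rss_mb": 200.0},
    {"site": "CERN", "cpu_time": 50.0, "ksi2k_factor": 2.0, "rss_mb": 300.0},
    {"site": "SiteB", "cpu_time": 10.0, "ksi2k_factor": 1.0, "rss_mb": 100.0},
]
site_totals = aggregate_per_site(reports)
```

Normalizing by a rating of this kind would let CPU time from nodes of different speeds be added up on a common scale.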
Job Network Traffic Monitoring
• Based on the xrootd transfer from every job
• Aggregated statistics for
  – Sites (incoming, outgoing, site to site, internal)
  – Storage Elements (incoming, outgoing)
• Of
  – Read and written files
  – Transferred MB/s
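The four site-level traffic categories (incoming, outgoing, site to site, internal) can be derived from individual transfer records as in the sketch below. The record layout of (source site, destination site, megabytes) and the site names "A" and "B" are assumptions for illustration.

```python
from collections import defaultdict


def aggregate_traffic(transfers):
    """Classify per-job transfers into per-site and site-to-site totals."""
    per_site = defaultdict(lambda: {"incoming": 0.0, "outgoing": 0.0, "internal": 0.0})
    site_to_site = defaultdict(float)
    for source, destination, megabytes in transfers:
        if source == destination:
            # Traffic that never leaves the site.
            per_site[source]["internal"] += megabytes
        else:
            per_site[source]["outgoing"] += megabytes
            per_site[destination]["incoming"] += megabytes
            site_to_site[(source, destination)] += megabytes
    return dict(per_site), dict(site_to_site)


# Hypothetical transfer records: (source site, destination site, MB).
transfers = [("A", "B", 10.0), ("A", "A", 5.0), ("B", "A", 2.0), ("A", "B", 1.0)]
per_site, site_to_site = aggregate_traffic(transfers)
```

Dividing each total by the length of the reporting interval would give the transferred MB/s figures.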
Individual Job Tracking
• Based on AliEn shell commands
  – top, ps, spy, jobinfo, masterjob
• Using the GUI ML Client
  – Status, resource usage, per job
AliEn & LCG Services Monitoring
• AliEn services
  – Periodically checked
  – PID check + SOAP call
  – Simple functional tests
  – SE space usage
  – Efficiency
• LCG environment and tools
  – Integrating the VoBOX tests previously run by ML within the SAM framework: proxy lifetime, gsiscp, LCG CE/SE, job submission, BDII, local catalog, software area etc.
  – Error messages in case of failure
  – Efficiency
  – ML Alerts are used for problem notification
FTD/FTS Monitoring
• Status of the transfers
• Transfer rates
• Success/failures
• Efficiency via ARDA Experiment Dashboard
VOBox/Head Node Monitoring
• Machine parameters, real-time & history
  – Load, memory & swap usage, processes, sockets
Actions Framework
• Based on monitoring information, actions can be taken in
  – ML Service
  – ML Repository
• Actions can be triggered by
  – Values above/below given thresholds
  – Absence/presence of values
  – Correlation between multiple values
• Possible action types
  – Alerts (e-mail, instant messaging, RSS feeds)
  – External commands
  – Event logging
[Figure: ML Services take actions based on local information (local decisions), using sensors such as traffic, jobs, hosts, apps, temperature, humidity, A/C power, …; the ML Repository takes actions based on global information (global decisions).]
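The threshold trigger type listed above can be sketched as a rule object that compares one monitored value against a limit and calls an action when it fires. The class name, parameter names, limits and the logging action are all assumptions for illustration; they are not part of the MonALISA actions framework API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ThresholdRule:
    """Fire an action when a monitored value crosses a limit."""
    parameter: str
    limit: float
    above: bool  # True: fire when value > limit; False: fire when value < limit
    action: Callable[[str, float], None]

    def evaluate(self, values):
        # A missing value never fires a threshold rule.
        if self.parameter not in values:
            return False
        value = values[self.parameter]
        fired = value > self.limit if self.above else value < self.limit
        if fired:
            self.action(self.parameter, value)
        return fired


log = []


def record_event(parameter, value):
    """Stand-in action: event logging instead of an e-mail or external command."""
    log.append(f"{parameter}={value}")


# Hypothetical rules and monitored snapshot.
rules = [
    ThresholdRule("vsz_mb", 2048.0, True, record_event),
    ThresholdRule("waiting_jobs", 100.0, False, record_event),
]
snapshot = {"vsz_mb": 3000.0, "waiting_jobs": 250.0}
fired = [rule.evaluate(snapshot) for rule in rules]
```

Keeping the action as a plain callable means the same rule can send an alert, run an external command or log an event, matching the three action types on the slide.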
Alerts and Actions
• MySQL daemon is automatically restarted when it runs out of memory
  – Trigger: threshold on VSZ memory usage
• ALICE Production jobs queue is automatically kept full by the automatic resubmission
  – Trigger: threshold on the number of aliprod waiting jobs
• Administrators are kept up-to-date on the services’ status
  – Trigger: presence/absence of monitored information
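A presence/absence trigger of the kind used for service status notifications can be sketched by tracking when each service last reported and comparing against a timeout. The function names, service names, timestamps and the timeout value are assumptions for illustration.

```python
def find_silent_services(last_seen, now, timeout_seconds):
    """Return services whose last report is older than the timeout."""
    return sorted(
        name for name, timestamp in last_seen.items()
        if now - timestamp > timeout_seconds
    )


def status_changes(previous_silent, current_silent):
    """Report only transitions, so administrators are notified once per change."""
    went_down = sorted(set(current_silent) - set(previous_silent))
    came_back = sorted(set(previous_silent) - set(current_silent))
    return went_down, came_back


# Hypothetical last-report times, in seconds.
last_seen = {"ServiceA": 1000, "ServiceB": 880, "ServiceC": 400}
silent = find_silent_services(last_seen, now=1000, timeout_seconds=120)
went_down, came_back = status_changes(previous_silent=["ServiceB"], current_silent=silent)
```

Notifying on transitions rather than on every check avoids repeating the same alert while a service stays down.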
Summary
• The MonALISA framework has been used as the primary monitoring tool for the ALICE Grid since 2004
• Presently the system is used to monitor all (identified) services, jobs and network parameters necessary for Grid operation and debugging
• The number of concurrently monitored and stored parameters today is ~300,000 in 75 ML Services
• The add-on tools for automatic event notification allow a more efficient reaction to problems
• The framework's design and flexibility meet all requirements for a monitoring system
• The accumulated information makes it possible to construct and implement automated decision-making algorithms, thus further increasing the efficiency of Grid operations
Thank You
Questions?
http://alien.cern.ch http://monalisa.caltech.edu