Monitoring, accounting and automated decision support for the ALICE experiment based on the MonALISA framework

Catalin Cirstoiu (IT/PSS/ED)
Costin Grigoras, Latchezar Betev (PH/AIP)

EGEE User Forum, May 9th-11th 2007, Manchester

EGEE-II INFSO-RI-031688
Enabling Grids for E-sciencE
www.eu-egee.org
EGEE and gLite are registered trademarks

Contents

• Monitoring requirements
• MonALISA overview
• Application monitoring
• Monitoring architecture in AliEn
  – Jobs monitoring
  – Traffic monitoring
  – Services monitoring
  – Nodes monitoring
• Actions framework
• Feature snapshots

Monitoring Requirements

• Global view of the entire distributed system
  – Non-intrusive
  – Accurate
• Providing
  – Near real-time information
  – Long-term history of aggregated data
• On key parameters like
  – System status
  – Resource usage
• Helping with
  – Correlating events
  – System debugging
  – Generating reports
• Taking automated actions based on the monitored data

MonALISA Overview

• MonALISA is a Dynamic, Distributed Service Architecture capable of collecting any type of information from different systems, analyzing it in near real time, and supporting automated control decisions and global optimization of workflows in complex grid systems.

[Diagram: MonALISA service architecture. Labels: Data Modules, Filters, Agents; Data Cache Service & DB; Data Store (Postgres, MySQL); Configuration Control (SSL); Predicates & Agents; Data (via ML Proxy); Applications; Java Client (other service); WS Client (other service); WebService (WSDL, SOAP); Lookup Services (Registration, Discovery).]

ApMon – Application Monitoring

• Lightweight library of APIs (C, C++, Java, Perl, Python) that can be used to send any information to MonALISA Services

• High communication performance
• Flexible
• Accounting
• System monitoring

[Diagram: applications instrumented with ApMon send monitoring data to MonALISA services over UDP/XDR. Labels include: Time; IP; procID. Example application monitoring values: Mbps_out: 0.52, Status: reading, MB_inout: 562.4. Example system monitoring values: load1: 0.24, processes: 97, pages_in: 83. ApMon Config: MonALISA hosts plus parameter: value pairs. The ApMon configuration is generated automatically by a servlet / CGI script (Config Servlet) and reloaded dynamically.]

[Plot: MonALISA CPU usage (%), axis 0 to 70, versus messages per second, axis 0 to 6000. Annotation: No Lost Packages.]
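
The UDP/XDR transport named on this slide can be sketched with the Python standard library alone. This is a minimal illustration, not the ApMon library and not its real wire format: the datagram layout in `encode_datagram`, the cluster and node names, and the function names are all assumptions made for the example. Only the XDR encoding rules themselves (big-endian values, strings padded to a multiple of 4 bytes) are standard.

```python
import socket
import struct


def xdr_string(text: str) -> bytes:
    """XDR string: 4-byte big-endian length, the bytes, zero padding to a multiple of 4."""
    raw = text.encode("utf-8")
    padding = (4 - len(raw) % 4) % 4
    return struct.pack(">I", len(raw)) + raw + b"\x00" * padding


def xdr_double(value: float) -> bytes:
    """XDR double: 8-byte big-endian IEEE 754."""
    return struct.pack(">d", value)


def encode_datagram(cluster: str, node: str, params: dict) -> bytes:
    """Hypothetical layout (an assumption): cluster, node, count, then name/value pairs."""
    parts = [xdr_string(cluster), xdr_string(node), struct.pack(">i", len(params))]
    for name, value in params.items():
        parts.append(xdr_string(name))
        parts.append(xdr_double(float(value)))
    return b"".join(parts)


def send_datagram(payload: bytes, host: str, port: int) -> None:
    """Fire-and-forget UDP send; the sender does not wait for any reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Values taken from the system monitoring example on the slide.
    payload = encode_datagram(
        "SystemMonitoring", "worker01",
        {"load1": 0.24, "processes": 97, "pages_in": 83},
    )
    print(len(payload), "bytes ready to send")
```

Calling `send_datagram(payload, host, port)` with the address of a collecting service would transmit the packet. Because UDP needs no connection or acknowledgement, the sending application is never blocked by the monitoring path, which fits the lightweight, high-throughput design the slide describes.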

Monitoring Architecture in AliEn

[Diagram: monitoring architecture in AliEn. Site-level components instrumented with ApMon: AliEn Job Agents, AliEn CE, AliEn SE, ClusterMonitor (shown for an LCG Site with MonALISA @Site). Central components instrumented with ApMon: AliEn TQ, AliEn IS, AliEn Optimizers, AliEn Brokers, MySQL Servers, Castor Grid Scripts, API Services (MonALISA @CERN). Aggregated data flows to the MonALISA Repository with its Long History DB at http://pcalimonitor.cern.ch:8889/. Other labels: LCG Tools, Alerts, Actions.]

[Monitored parameters shown in the diagram: rss, vsz, cpu time, run time, job slots, free space, nr. of files, open files, queued JobAgents, cpu ksi2k, job status, disk used, processes, load, net In/out, jobs status, sockets, migrated mbytes, active sessions, MyProxy status.]

Job Status Monitoring

• Global summaries
  – For each/all conditions
  – For each/all sites
  – For each/all users
  – Running & cumulative
• Error status
• From job agents
• From central services
• Multiple views
  – Real-time map
  – Integrated pie charts
  – Long history plots
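
The difference between running and cumulative summaries can be illustrated with a small sketch. It assumes a time-ordered stream of status-change events; the event shape, site names and status names are hypothetical and are not taken from AliEn.

```python
from collections import Counter


def summarize(events):
    """events: iterable of (job_id, site, status) tuples in time order.

    Returns (running, cumulative):
      running    counts each job once, under its latest status
      cumulative counts every status a job has ever entered
    """
    latest = {}
    cumulative = Counter()
    for job_id, site, status in events:
        latest[job_id] = (site, status)
        cumulative[(site, status)] += 1
    running = Counter(latest.values())
    return running, cumulative


def all_sites(counter):
    """Roll a per-(site, status) counter up into a per-status total."""
    totals = Counter()
    for (_site, status), count in counter.items():
        totals[status] += count
    return totals


if __name__ == "__main__":
    events = [
        (1, "SiteA", "WAITING"),
        (1, "SiteA", "RUNNING"),
        (2, "SiteB", "RUNNING"),
        (1, "SiteA", "DONE"),
    ]
    running, cumulative = summarize(events)
    print(dict(running))
    print(dict(all_sites(cumulative)))
```

The same roll-up works for users or any other key by replacing the site field in the event tuple.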

Real-time map

Integrated Pie Charts

History Plots, Annotations

Job Resource Usage Monitoring

• Cumulative parameters
  – CPU Time & CPU KSI2K
  – Wall time & Wall KSI2K
  – Read & written files
  – Input & output traffic (xrootd)
• Running parameters
  – Resident memory
  – Virtual memory
  – Open files
  – Workdir size
  – Disk usage
  – CPU usage
• Aggregated per site
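
One plausible way a running value relates to a cumulative one, and how per-job samples roll up per site, is sketched below. The derivation of CPU usage from two cumulative samples and the field names (`site`, `rss_kb`, `cpu_time_s`) are assumptions for illustration; the slides do not say how these values are actually computed.

```python
from collections import defaultdict


def cpu_usage_percent(prev, curr):
    """prev, curr: (wall_seconds, cpu_seconds) cumulative samples for one job.

    CPU usage over the interval is the CPU time consumed divided by the
    wall time elapsed between the two samples.
    """
    d_wall = curr[0] - prev[0]
    d_cpu = curr[1] - prev[1]
    if d_wall <= 0:
        return 0.0
    return 100.0 * d_cpu / d_wall


def aggregate_per_site(samples):
    """samples: iterable of dicts with 'site', 'rss_kb' and 'cpu_time_s' keys."""
    totals = defaultdict(lambda: {"jobs": 0, "rss_kb": 0, "cpu_time_s": 0.0})
    for sample in samples:
        entry = totals[sample["site"]]
        entry["jobs"] += 1
        entry["rss_kb"] += sample["rss_kb"]
        entry["cpu_time_s"] += sample["cpu_time_s"]
    return dict(totals)


if __name__ == "__main__":
    print(cpu_usage_percent((0, 0), (60, 45)))
    print(aggregate_per_site([
        {"site": "SiteA", "rss_kb": 1000, "cpu_time_s": 10.0},
        {"site": "SiteA", "rss_kb": 500, "cpu_time_s": 5.0},
    ]))
```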

Job Network Traffic Monitoring

• Based on the xrootd transfer from every job
• Aggregated statistics for
  – Sites (incoming, outgoing, site to site, internal)
  – Storage Elements (incoming, outgoing)
• Of
  – Read and written files
  – Transferred MB/s
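
The four site-level views listed above (incoming, outgoing, site to site, internal) can all be derived from the same per-transfer records. The sketch below assumes each record is a `(source_site, destination_site, megabytes)` tuple; that record shape and the site names are hypothetical.

```python
from collections import defaultdict


def aggregate_traffic(transfers):
    """transfers: iterable of (src_site, dst_site, megabytes).

    Returns (per_site, site_to_site):
      per_site      maps site -> incoming / outgoing / internal totals
      site_to_site  maps (src, dst) -> total for each distinct pair
    """
    per_site = defaultdict(lambda: {"incoming": 0.0, "outgoing": 0.0, "internal": 0.0})
    site_to_site = defaultdict(float)
    for src, dst, megabytes in transfers:
        if src == dst:
            # Traffic that never leaves the site counts as internal only.
            per_site[src]["internal"] += megabytes
        else:
            per_site[src]["outgoing"] += megabytes
            per_site[dst]["incoming"] += megabytes
            site_to_site[(src, dst)] += megabytes
    return dict(per_site), dict(site_to_site)


if __name__ == "__main__":
    per_site, pairs = aggregate_traffic([
        ("SiteA", "SiteB", 10.0),
        ("SiteA", "SiteA", 5.0),
        ("SiteB", "SiteA", 2.5),
    ])
    print(per_site)
    print(pairs)
```

Dividing each total by the length of the reporting interval would give the transferred MB/s figure.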

Individual Job Tracking

• Based on AliEn shell commands
  – top, ps, spy, jobinfo, masterjob
• Using the GUI ML Client
  – Status, resource usage, per job

AliEn & LCG Services Monitoring

• AliEn services
  – Periodically checked
  – PID check + SOAP call
  – Simple functional tests
  – SE space usage
  – Efficiency
• LCG environment and tools
  – Integrating the VOBox tests previously run by ML within the SAM framework
    Proxy lifetime, gsiscp, LCG CE/SE, job submission, BDII, local catalog, software area, etc.
  – Error messages in case of failure
  – Efficiency
  – ML Alerts are used for problem notification
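
A periodic service check of the kind listed above can be sketched as a process-existence test followed by a functional test, with efficiency taken as the fraction of checks that passed. This is an illustration under assumptions: the SOAP call is replaced by an arbitrary callable, the function names are hypothetical, and the PID check relies on POSIX signal semantics (do not run it on Windows).

```python
import os


def pid_alive(pid: int) -> bool:
    """POSIX only: signal 0 tests for existence without delivering a signal."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def check_service(pid, functional_test):
    """Return (ok, message). functional_test stands in for the SOAP call."""
    if not pid_alive(pid):
        return False, "process not running"
    try:
        functional_test()
    except Exception as exc:
        return False, f"functional test failed: {exc}"
    return True, "ok"


def efficiency(results):
    """Fraction of successful checks in a history of booleans."""
    results = list(results)
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    print(check_service(os.getpid(), lambda: None))
    print(efficiency([True, True, False, True]))
```

Returning the error message alongside the boolean matches the slide's point that failures are reported with an explanation rather than as a bare status.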

FTD/FTS Monitoring

• Status of the transfers
• Transfer rates
• Success/failures
• Efficiency via ARDA Experiment Dashboard

VOBox/Head Node Monitoring

• Machine parameters, real-time & history
  – Load, memory & swap usage, processes, sockets

Actions Framework

• Based on monitoring information, actions can be taken in
  – ML Service
  – ML Repository
• Actions can be triggered by
  – Values above/below given thresholds
  – Absence/presence of values
  – Correlation between multiple values
• Possible action types
  – Alerts
    e-mail, instant messaging, RSS feeds
  – External commands
  – Event logging

[Diagram: ML Services take actions based on local information (local decisions), using sensor data such as traffic, jobs, hosts, apps, temperature, humidity, A/C power, …; the ML Repository takes actions based on global information (global decisions).]
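
The three trigger kinds listed on this slide (thresholds, absence/presence, correlation of several values) can be expressed as small composable predicates. The sketch below is not the MonALISA actions API; the function names, the `(timestamp, value)` storage convention and the parameter names are assumptions chosen for the example.

```python
def above(name, limit):
    """Trigger when the latest value of `name` exceeds `limit`."""
    return lambda values, now: name in values and values[name][1] > limit


def absent(name, max_age):
    """Trigger when `name` was never seen or is older than `max_age` seconds."""
    return lambda values, now: name not in values or now - values[name][0] > max_age


def all_of(*conditions):
    """Correlation: trigger only when every condition holds at once."""
    return lambda values, now: all(cond(values, now) for cond in conditions)


def evaluate(rules, values, now):
    """rules: list of (condition, action_name). Returns the actions to take."""
    return [action for condition, action in rules if condition(values, now)]


if __name__ == "__main__":
    # values maps parameter name -> (timestamp, latest value)
    values = {"vsz_mb": (100.0, 2100.0), "load1": (100.0, 0.2)}
    rules = [
        (above("vsz_mb", 2048), "run external command"),
        (absent("heartbeat", 60), "send alert"),
        (all_of(above("vsz_mb", 2048), above("load1", 5)), "log event"),
    ]
    print(evaluate(rules, values, now=120.0))
```

Keeping conditions separate from actions means the same rule list could be evaluated against local values in a service or against aggregated values in a repository.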

Alerts and Actions

MySQL daemon is automatically restarted when it runs out of memory
Trigger: threshold on VSZ memory usage

ALICE production jobs queue is automatically kept full by automatic resubmission
Trigger: threshold on the number of aliprod waiting jobs

Administrators are kept up to date on the services' status
Trigger: presence/absence of monitored information
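
The queue-refill example can be read as a simple low-watermark rule: when the number of waiting jobs drops below a threshold, submit enough to reach a target level. The sketch below is one way to compute that number; the function name, the threshold and the target are hypothetical, and the slides do not describe how resubmission is actually sized.

```python
def jobs_to_submit(waiting: int, threshold: int, target: int) -> int:
    """Return how many jobs to submit to refill the queue.

    waiting    current number of waiting jobs (the monitored value)
    threshold  refill only when waiting falls below this level
    target     queue level to restore after a refill
    """
    if threshold > target:
        raise ValueError("threshold must not exceed target")
    if waiting >= threshold:
        return 0
    return target - waiting


if __name__ == "__main__":
    print(jobs_to_submit(waiting=120, threshold=200, target=500))
    print(jobs_to_submit(waiting=250, threshold=200, target=500))
```

Separating the threshold from the target avoids resubmitting a handful of jobs every time a single job leaves the queue.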

Summary

• The MonALISA framework has been used as a primary monitoring tool for the ALICE Grid since 2004

• Presently the system is used to monitor all (identified) services, jobs and network parameters necessary for Grid operation and debugging

• The number of concurrently monitored and stored parameters today is ~300,000 in 75 ML Services

• The add-on tools for automatic event notification allow a more efficient reaction to problems

• The framework's design and flexibility meet all the requirements for a monitoring system

• The accumulated information makes it possible to construct and implement automated decision-making algorithms, further increasing the efficiency of Grid operations

Thank You

Questions?

http://alien.cern.ch http://monalisa.caltech.edu