Grid Monitoring and Operations SAM Development Team CERN IT/GD Tier2 Admin Workshop 03 Dec. 2006, Mumbai


Grid Monitoring and Operations

SAM Development Team

CERN IT/GD

Tier2 Admin Workshop

03 Dec. 2006, Mumbai


Outline

• Monitoring and Operational tools
  – SAM
    • framework
    • sensors
    • availability metrics
  – FCR
  – gstat, GOCDB, SAM Admin Portal, COD Dashboard

• Grid Operations (COD)


Monitoring tools

Service Availability Monitoring (SAM)


SAM -- Overview

• Grid service-level monitoring framework

• successor of SFT
• used in Grid Operations
• basis for Availability Metrics
• VO-based submissions
  – VO-specific tests
• services tested currently:
  • CE, gCE
  • SE
  • RB
  • sBDII
  • BDII
  • FTS
  • LFC
  • JobWrapper tests


Central SAM submissions

• Official CERN submissions
  – Production and Certified sites
  – ops (+ dteam) VO
  – jobs submitted every hour
  – basis of COD alarms
  – https://lcg-sam.cern.ch:8443/sam/sam.py
• PPS
  – ops VO
  – hourly
  – https://lcg-sam.cern.ch:8443/sam-pps/sam.py
• SAM Admin Portal
  – ops VO
  – on-demand
  – Certified + Uncertified sites


VO-specific test submission

• LHCb
  – successfully migrated to SAM (only CE, gCE)
  – VO-specific test (Dirac installation)
• Atlas
  – all sensors
  – submitted from SAM UI
• CMS
  – set up, but no regular submission yet


SAM Portal


SAM Internals

• framework structure
  – client
    • submission framework (developed by CERN team)
    • sensors
      – developed by different contributors + CERN team
      – tests: plug-in modules
  – server
    • web services
    • portal
    • Oracle DB accessed by web services
    • static (GOCDB) + dynamic (BDIIs) info
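
To make the "tests: plug-in modules" idea concrete, here is a minimal Python sketch of a sensor that runs registered test plug-ins against a node. The `Sensor` class, the `register` decorator and the two dummy tests are hypothetical names chosen for illustration; they are not SAM's actual client API.

```python
from typing import Callable, Dict

# Hypothetical convention: a test takes a node name and returns True (ok) or False (failed).
TestFn = Callable[[str], bool]


class Sensor:
    """Groups the tests that apply to one service type (for example 'CE')."""

    def __init__(self, service_type: str) -> None:
        self.service_type = service_type
        self._tests: Dict[str, TestFn] = {}

    def register(self, name: str) -> Callable[[TestFn], TestFn]:
        """Decorator that plugs a test function into this sensor."""
        def decorator(fn: TestFn) -> TestFn:
            self._tests[name] = fn
            return fn
        return decorator

    def run(self, node: str) -> Dict[str, bool]:
        """Run every registered test against a node; a crashing test counts as failed."""
        results: Dict[str, bool] = {}
        for name, fn in self._tests.items():
            try:
                results[name] = bool(fn(node))
            except Exception:
                results[name] = False
        return results


ce_sensor = Sensor("CE")


@ce_sensor.register("dummy-hostname-check")
def hostname_is_qualified(node: str) -> bool:
    # Placeholder check; a real test would submit a job or query the service.
    return "." in node


@ce_sensor.register("dummy-always-broken")
def always_broken(node: str) -> bool:
    raise RuntimeError("simulated test crash")
```

Calling `ce_sensor.run("ce01.example.org")` returns one boolean per registered test, which is the shape of input the availability calculation needs.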


Sensors – CE, gCE, SE, SRM

• CE, gCE
  – job submission
    • UI → RB → CE → WN chain
  – CA certificates (on WN)
  – software middleware version (WN)
  – replica management
    • lcg-utils
    • default SE + 3rd-party replication
  – RGMA, Apel, etc.
• SE, SRM
  – UI ↔ SE/SRM
    • lcg-utils (LFC)


Sensors – LFC, FTS

• LFC
  – lfc-ls + create file in /grid/<VO>
• FTS
  – BDII entry check
  – listing channels
    • glite-transfer-channel-list (ChannelManagement service)
  – transfer test (in development):
    • submitting transfer jobs between SRMs in all Tier0 and Tier1 sites (N-N testing)
    • checking the status of jobs
    • Note! The test relies on the availability of SRMs at the sites


Standalone sensors – BDII, RB

• sBDII (Gstat)
  – accessibility
  – sanity checks
• top-level BDIIs (Gstat)
  – accessibility
  – reliability of data (number of entries)
• RB
  – job submission
    • UI → important RBs → “reliable” CEs
  – time of matchmaking


JobWrapper tests

• JobWrapper
  – requested by experiments, also useful in operations
  – testing all WNs
    • SAM always tests just an arbitrary one
  – tests executed by CE wrapper script
    • executed with every production job
  – test results
    • passed to the job
    • published to the SAM DB
  – test code
    • core scripts in the release
    • tests on software area (signed tarball)
  – soon in production


Availability metrics - algorithm

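Status of node N:

$$\mathrm{Status}(N) = \bigwedge_{t \in \mathrm{CriticalTests}} \mathrm{TestResult}(N, t)$$

Status of service C:

$$\mathrm{Status}(C) = \bigvee_{N \in \mathrm{instances}(C)} \mathrm{Status}(N)$$

Status of site S:

$$\mathrm{Status}(S) = (\mathrm{CE}_1 \lor \mathrm{CE}_2 \lor \dots \lor \mathrm{CE}_n) \land (\mathrm{SRM}_1 \lor \mathrm{SRM}_2 \lor \dots \lor \mathrm{SRM}_n) \land \mathrm{site\ BDII}$$

where ∧ = boolean AND and ∨ = boolean OR.

• Everything is calculated for each VO that defined critical tests in FCR
• Results make sense only if the VO submits tests!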
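
The status rules on this slide (AND over a VO's critical tests per node, OR over the instances of a service, AND across CE, SRM and site BDII per site) can be sketched in a few lines of Python. The function names and the example test names are illustrative only. Treating a missing test result as a failure is an assumption, not something the slide states.

```python
from typing import Dict, Iterable, List


def node_status(test_results: Dict[str, bool], critical_tests: Iterable[str]) -> bool:
    """AND over the VO's critical tests.

    Assumption: a critical test with no result counts as failed.
    """
    return all(test_results.get(t, False) for t in critical_tests)


def service_status(node_statuses: Iterable[bool]) -> bool:
    """OR over all instances of one service type at the site."""
    return any(node_statuses)


def site_status(ce_nodes: List[bool], srm_nodes: List[bool], site_bdii: bool) -> bool:
    """AND of the CE service, the SRM service and the site BDII."""
    return service_status(ce_nodes) and service_status(srm_nodes) and site_bdii


# Hypothetical critical tests for one VO.
critical = ["job-submit", "ca-certs"]
ce1 = node_status({"job-submit": True, "ca-certs": False}, critical)
ce2 = node_status({"job-submit": True, "ca-certs": True}, critical)
site_ok = site_status([ce1, ce2], [True], True)
```

Because services are combined with OR, one healthy CE is enough to keep the CE service up, while a site with no working SRM at all is down regardless of its CEs.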


Availability metrics - algorithm II

• service and site status every hour
• daily, weekly, monthly availability
• scheduled downtime information from GOCDB
• details of the algorithm on GOC: http://goc.grid.sinica.edu.tw/gocwiki/SAME_Metrics_calculation
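
As a rough illustration of how hourly statuses roll up into a daily, weekly or monthly figure, the sketch below counts the fraction of hours in each state. This is only one plausible reading: the authoritative algorithm is the one documented on the GOC wiki. The three-state model and the way scheduled downtime is reported separately are assumptions made for this example.

```python
from enum import Enum
from typing import Dict, Sequence


class HourStatus(Enum):
    UP = "up"
    DOWN = "down"
    SCHEDULED_DOWN = "scheduled_down"  # assumed to come from GOCDB downtime entries


def summarize(hours: Sequence[HourStatus]) -> Dict[str, float]:
    """Return the fraction of hours spent in each state over the period."""
    total = len(hours)
    if total == 0:
        raise ValueError("no hourly samples")
    up = sum(1 for h in hours if h is HourStatus.UP)
    scheduled = sum(1 for h in hours if h is HourStatus.SCHEDULED_DOWN)
    return {
        "availability": up / total,
        "scheduled_downtime": scheduled / total,
        "unscheduled_downtime": (total - up - scheduled) / total,
    }


# One day: 18 hours up, 4 hours of scheduled maintenance, 2 hours of failure.
day = (
    [HourStatus.UP] * 18
    + [HourStatus.SCHEDULED_DOWN] * 4
    + [HourStatus.DOWN] * 2
)
daily = summarize(day)
```

The same function works for a week or a month by passing a longer list of hourly samples.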


Availability metrics - GridView


Availability metrics - data export


VO tools

Freedom of Choice for Resources (FCR)


FCR -- Overview

• Freedom of Choice for Resources
• https://lcg-fcr.cern.ch:8443/fcr/fcr.cgi
• VO policy enforcement tool
• critical test and resource selection for VOs by manipulating top-level BDII information
• goal is to be able to
  – select which aspects of site functionality are important for the VO
  – blacklist unreliable sites
  – always use stable, "important" sites
  – include less reliable sites based on SAM results
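
The selection goals above can be sketched as a small policy function. This is an illustration under stated assumptions, not FCR's implementation: the `VoPolicy` fields, the site names and the rule that a blacklist entry overrides everything else are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class VoPolicy:
    """Hypothetical per-VO settings."""
    critical_tests: Set[str]
    always_use: Set[str] = field(default_factory=set)  # stable, "important" sites
    blacklist: Set[str] = field(default_factory=set)   # unreliable sites


def select_sites(policy: VoPolicy, sam_results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return the sites the VO would keep, in alphabetical order.

    Assumed precedence: blacklist first, then always-use, then SAM results.
    """
    selected: List[str] = []
    for site in sorted(sam_results):
        if site in policy.blacklist:
            continue
        if site in policy.always_use:
            selected.append(site)
            continue
        results = sam_results[site]
        # Other sites are kept only if every critical test passed.
        if all(results.get(t, False) for t in policy.critical_tests):
            selected.append(site)
    return selected


policy = VoPolicy(
    critical_tests={"job-submit", "replica-mgmt"},
    always_use={"SITE-A"},
    blacklist={"SITE-D"},
)
sam_results = {
    "SITE-A": {"job-submit": False, "replica-mgmt": True},
    "SITE-B": {"job-submit": True, "replica-mgmt": True},
    "SITE-C": {"job-submit": True, "replica-mgmt": False},
    "SITE-D": {"job-submit": True, "replica-mgmt": True},
}
chosen = select_sites(policy, sam_results)
```

Here SITE-A is kept despite a failed test because it is marked always-use, SITE-B is kept on its SAM results, SITE-C is dropped for failing a critical test, and SITE-D is dropped because it is blacklisted.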


FCR -- Overview

• integrated with SAM
  – sharing the same DB
• optional usage
  – BDII configuration parameter
  – FCR output: ldif file
• information from GOCDB + BDII
• DN-based authentication (2 levels)


FCR Admin Portal


FCR User Pages

• read-only view of VO settings
• tells if the resource is available at the moment
• grouping selection


FCR User Portal


Monitoring tools

gstat, SAM Admin Portal, COD dashboard


gstat (Sinica)

– http://goc.grid.sinica.edu.tw/gstat/
– Information System (BDII) monitoring
– response time, consistency (sanity), completeness
– site-BDII + top-level BDII
– aggregated and detailed views
– plots (history)
– refreshed every 5 mins (non-intrusive)
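
A BDII is an LDAP server, so a response-time and entry-count probe of the kind listed above can be approximated with the standard `ldapsearch` client. The sketch below is not gstat's code. The port 2170 and the `o=grid` search base are the conventional BDII defaults and are assumed here, and the helper names are hypothetical.

```python
import subprocess
import time
from typing import List, Tuple


def build_command(host: str, port: int = 2170, base: str = "o=grid") -> List[str]:
    """Build an anonymous ldapsearch that returns entry DNs only ("1.1" = no attributes)."""
    return ["ldapsearch", "-x", "-LLL", "-H", f"ldap://{host}:{port}", "-b", base, "1.1"]


def count_entries(ldif_text: str) -> int:
    """Count LDIF entries; each entry starts with a 'dn:' (or base64 'dn::') line."""
    return sum(1 for line in ldif_text.splitlines() if line.startswith("dn:"))


def probe(host: str, timeout_s: float = 30.0) -> Tuple[float, int]:
    """Return (response time in seconds, number of entries) for one BDII.

    Raises subprocess.CalledProcessError or subprocess.TimeoutExpired if the
    BDII cannot be queried, which a caller would record as 'not accessible'.
    """
    start = time.monotonic()
    proc = subprocess.run(
        build_command(host),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return time.monotonic() - start, count_entries(proc.stdout)
```

A sudden drop in the entry count between two probes is the kind of signal a completeness check would look for.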


gstat


SAM Admin Portal

– https://monitoring.egee.man.poznan.pl/admin2

– on-demand SAM submission
– easy to use
– target site selection
– used by:
  • ROCs: certification of a site
  • ROCs, site admins, CODs: speed up debugging


SAM Admin Portal


GOCDB

– https://goc.grid-support.ac.uk/gridsite/gocdb2/index.php

– central database to store static site information

– all EGEE sites have to register
– contact, security contact, certification status, site type
– scheduled maintenance
– used by
  • script that generates top-level BDII config file
  • monitoring tools
  • SAM DB → SAM, FCR, Availability calc.
  • operations management tools


GOCDB


CIC Operations Portal

• COD management
  – schedule for rotations
  – COD dashboard
  – COD handover notes
• ROC management
  – ROC contacts
  – weekly reports
• VO management
  – VO ID cards (VO contacts, etc.)
• EGEE broadcast


CIC Operations Portal


GGUS (FZK)

• Global GRID User Support
• http://ggus.org
• ticketing system for the EGEE GRID
• based on Remedy
• tickets created by
  – individual users (manually)
  – Grid Operators (via COD Dashboard)
• news, documentation


GGUS Portal


Operations

Grid Operations


EGEE Operations Structure

• Regional Operations Centres (ROC)
  – One in each region (incl. Asia-Pacific)
  – Front-line support for user and operations issues
    • point of contact for sites in the region
  – Provide local knowledge and adaptations
  – Manage daily Grid operations: oversight, troubleshooting
  – Run infrastructure services
• for Asia-Pacific region
  – Asia-Pacific
    • [email protected]
    • Jason Shih, Min-Hong Tsai, Shu-Ting Liao
  – CERN (catch-all ROC)
    • [email protected]
    • Nicholas Thackray


COD

• COD is Operator on Duty
  – was: CIC-on-Duty
• global LCG/EGEE GRID monitoring
• 1 (2) ROCs responsible for the whole GRID operations at a time
  – 12 ROCs involved
  – weekly rotation
• weekly WLCG-OSG-EGEE Operations meeting
  – ROCs, Tier1, VOs
  – all sites invited


COD Procedures

• https://twiki.cern.ch/twiki/bin/view/EGEE/EGEEROperationalProcedures
• Looking at monitoring tools
  – SAM, Certificate Monitoring pages
• Open tickets using COD Dashboard
• Escalate expired tickets
• Process site responses (update tickets accordingly)
• End of duty: hand-over notes
• Update the GOC wiki pages


COD Dashboard

• summary of necessary monitoring information + tools for ticket processing

• tickets linked to GGUS tickets
• GOCDB information
  – site downtime information!
• SAM alarms
• ticket creation and management tool
• tools for related e-mail


COD Dashboard


Connection between the used tools

[Diagram. Boxes: Grid Operators (COD); COD dashboard; Monitoring tools (SAM); GGUS. Arrow labels: problem tracking and reporting; ticket follow-up; modifications on the tickets.]


Escalation Procedure

• defines the steps to be taken during the lifetime of a ticket
  – tickets don't get forgotten!
• available on CIC Portal
  – (https://edms.cern.ch/document/701575)
• prioritization of alarms depending on the amount of resources at the site


Escalation Steps

1. ticket creation
2. first mail (to: site + ROC)
3. second mail (to: site + ROC)
4. suspension from the GRID

• before 4.:
  a) mail to ROC
  b) mail to OCC for validation
  c) site is invited to the weekly operations meeting


Escalation Procedure -- Quarantine

• site categories
  – low: CPU < 20
  – normal: 20 < CPU < 100
  – high: 100 < CPU
• between 2.-3. and 3.-4.
  – low + normal: 3 days
  – high: 1 day


COD Escalation Procedure

[Flowchart. Nodes: Create ticket; When deadline reached; Problem solved?; Close ticket; Site responds; Extend deadline; Last escalation?; Escalate; Suspend site. Edge labels: yes, no, mail.]
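
One way to read the escalation flow is as a small state machine that is evaluated each time a ticket's deadline is reached. The sketch below is an interpretation, not the official procedure. The order of the checks (solved, then site response, then last escalation) is assumed, and the pre-suspension mails to the ROC and OCC are not modelled. All names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Step(Enum):
    CREATED = 1
    FIRST_MAIL = 2
    SECOND_MAIL = 3
    SUSPENDED = 4
    CLOSED = 5


@dataclass
class Ticket:
    site: str
    step: Step = Step.CREATED
    deadline_extensions: int = 0


def on_deadline(ticket: Ticket, problem_solved: bool, site_responded: bool) -> Ticket:
    """Advance a ticket when its deadline is reached (assumed reading of the flow)."""
    if ticket.step in (Step.SUSPENDED, Step.CLOSED):
        return ticket  # terminal states, nothing to do
    if problem_solved:
        ticket.step = Step.CLOSED
    elif site_responded:
        ticket.deadline_extensions += 1  # extend deadline, stay on the same step
    elif ticket.step is Step.SECOND_MAIL:
        ticket.step = Step.SUSPENDED  # last escalation already sent
    else:
        ticket.step = Step(ticket.step.value + 1)  # escalate: send the next mail
    return ticket


# A site that never reacts ends up suspended after three missed deadlines.
silent = Ticket(site="SITE-X")
for _ in range(3):
    on_deadline(silent, problem_solved=False, site_responded=False)
```

A site that answers the ticket keeps its current step and gets more time, and a solved problem closes the ticket at any step.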


What a site is expected to do

• Look at the monitoring tools (SAM)
  – try to notice & fix failures before the CODs
• COD notification about a failure
  – fix it ASAP
  – contact the ROC for help if needed
• Scheduled downtime
  – enter it in GOCDB
  – broadcast it in advance
  – broadcast when it's finished
• weekly site reports (at COD portal)
  – input to weekly Operations meeting


What a site could do

• problems → contact the ROC
  – best way: GGUS ticket
• question → ask the ROC
• open a ticket if there is a failure in Central Services
  – LFC, SAM, etc.


Happy End

Thanks for your attention :)