Monitoring and Accounting. Dave Kant (D.Kant@rl.ac.uk), CCLRC e-Science Centre, UK. GridPP 12, Jan 31st - Feb 1st 2005


Monitoring and Accounting

Dave Kant (D.Kant@rl.ac.uk)

CCLRC e-Science Centre, UK

GridPP 12, Jan 31st - Feb 1st 2005


Overview

1. GOC Database

2. Monitoring Tools

3. Accounting

4. Issues

5. Future Plans


GOC Database

– What features?
  • Configuration of monitoring tools
  • Security
  • Organisations
  • Administrative Roles
  • Replication

– What role will it play in the future?
  • New site registration procedure
  • BDII generation


GRID Configuration Database

[Diagram labels: GOCDB server (GridSite, MySQL); Resource Centre (RC), Resources & Site Information; access via https and SQL; resources ce, se, bdii, rb; environments EDG, LCG-1, LCG-2, …]

Monitoring Services

• Operations Maps
• Configure other Tools
• Resource Provider
• Organisation Structures
• Secure services
  - Site News
  - Self Certification
  - Accounting

Secure Database Management via HTTPS / X.509

Store a Subset of the Grid Information System: People, Contact Information, Resources, Maintenance Bit

GOC DB can also contain information that is not present in the IS, such as: scheduled maintenance; news; organisational structures; geographic coordinates for maps.


EGEE ROC Structure

• EGEE is made up of regions.
• Each region contains many computing centres.
• Regional Operations Centres (ROCs) are a focus for operational activities.

USA


Organisational Structures

• Developed a tool to manage organisational structures, modelled on the GridPP Tier1/2 structure.
• Materialised Path Encoding
• Provide ROCs with a package to monitor the resources in the region
  – Tailored Monitoring
  – Administrative roles to the coordinators in GOCDB

[Tree diagram, with path codes where shown on the slide:]

EGEE (1)

France (1.1), UK/I (1.2), S.E.E (1.3)

GridPP (1.2.1)

LondonT2, ScotGrid

IMPERIAL, QMUL, Edinburgh
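A minimal sketch of materialised path encoding, assuming each node stores its full path from the root as a dotted string. Only the codes 1, 1.1, 1.2, 1.3 and 1.2.1 appear on the slide; the deeper codes, the placement of the sites under the two Tier-2s, and the function names are assumptions made for illustration, not the GOCDB schema.

```python
# Path -> node name. Codes below 1.2.1 are invented for this example.
NODES = {
    "1": "EGEE",
    "1.1": "France",
    "1.2": "UK/I",
    "1.3": "S.E.E",
    "1.2.1": "GridPP",
    "1.2.1.1": "LondonT2",
    "1.2.1.2": "ScotGrid",
    "1.2.1.1.1": "IMPERIAL",
    "1.2.1.1.2": "QMUL",
    "1.2.1.2.1": "Edinburgh",
}


def descendants(path):
    """Names of all nodes below `path`, found with a single prefix match."""
    prefix = path + "."  # trailing dot so that "1.2" does not match "1.20"
    return sorted(name for p, name in NODES.items() if p.startswith(prefix))


def ancestors(path):
    """Names of all nodes above `path`, from the root downwards."""
    parts = path.split(".")
    return [NODES[".".join(parts[:i])] for i in range(1, len(parts))]
```

The appeal of this encoding is that "everything in a region" becomes a prefix match on one column (in SQL, a `LIKE '1.2.%'` style query), so no recursive traversal is needed to tailor monitoring to a ROC or a Tier-2.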


Adaptive Job Brokering Based on the Monitoring System

• Total List of all sites is derived from GOCDB (via RGMA)
• GOC bit: sites which have opted out, e.g. scheduled maintenance
• White List: sites that failed one or more core tests but are well supported are put back in, e.g. a Tier1 site
• Core tests are a subset of the site functional tests run by CERN every day
• Black List: sites that are not trusted

Generation of BDII configuration file via feedback into IS

[Diagram labels: 100’s of Sites; Total List of all sites (RGMA, GOC Bit); Sites pass core tests; Trusted Sites (Black List, White List); BDII. Monitoring Services: GOC DB Site info, Gstat Data, Site Functional Tests, GOC Hourly Tests. Environments: Production, VO, GridPP, …]
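The list handling described on this slide can be sketched as plain set operations. This is an illustration only: the function name, the argument names and the order in which the lists are applied are assumptions (in particular, whether the white list may override the GOC bit is not stated on the slide; here it may not).

```python
def trusted_sites(all_sites, goc_bit, passed_core, white_list, black_list):
    """Return the sorted list of sites that would go into the BDII config.

    all_sites   -- total list of sites (from GOCDB)
    goc_bit     -- sites that have opted out, e.g. scheduled maintenance
    passed_core -- sites that passed the core tests
    white_list  -- well supported sites put back in despite a failed test
    black_list  -- sites that are not trusted
    """
    candidates = set(all_sites) - set(goc_bit)
    passing = candidates & set(passed_core)
    reinstated = candidates & set(white_list)
    return sorted((passing | reinstated) - set(black_list))
```

For example, with five hypothetical sites where `site-b` has opted out, `site-c` is white-listed and `site-e` is black-listed, only the remaining passing site and the white-listed one survive.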


How Are New Sites Added?

1. JSPG have written a “Site Registration Policy & Procedure” document.
2. https://edms.cern.ch/document/503198/
3. New GOCDB portal to streamline the site registration process.

[Diagram: registration steps involving Site, ROC, GOCDB and EGEE]

[1] Site and ROC liaise

[2] “candidate” site

[3] Site installs middleware

[4] “uncertified” site

[5] Certification testing

[6] “certified” site
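The numbered steps can be read as a small state machine in which a site moves from "candidate" to "uncertified" to "certified". The sketch below is an illustration under that reading; the event names and the `advance` function are hypothetical and do not come from the registration policy document or the GOCDB portal.

```python
# (current state, event) -> next state. Event names are invented here.
TRANSITIONS = {
    ("candidate", "middleware_installed"): "uncertified",
    ("uncertified", "certification_passed"): "certified",
}


def advance(state, event):
    """Return the next registration state, or raise if the step is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}")
```

Keeping the allowed transitions in one table makes it explicit that a site cannot be marked certified without first passing through the uncertified state.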


Replication

Two replicas, each with different security considerations:

• “Services” replica managed by Taipei
  – Direct connections to the database by the monitoring tools from known hosts

• “Users” replica to be set up at IN2P3
  – Web portal based on X.509 certificates
  – CIC on duty


Monitoring Tools

• What are the main tools that are used in the day-to-day operations of the LCG Grid?
  – GPPMON
  – GSTAT
  – Site Functional Tests

• Other monitoring tools exist, but I won’t discuss them here
  – GridICE


Operations Map – Job Submission Tests

GPPMON

Displays the results of tests against sites.

Test: Job Submission

The job is a simple test of the grid middleware components, e.g. the Gatekeeper service, the RB service, and the Information System, via JDL requirements.

This kind of test deals with the functional behaviour of core grid services: do simple jobs run? They are lightweight tests which run hourly. However, they have certain limitations, e.g. they use the Dteam VO, and their reach onto WNs is limited (specialised monitoring queues).


Operations Map – Certificate Lifetime

GPPMON

Displays the results of tests against sites.

Test: Certificate Lifetime

Many grid services require a valid certificate for security.

By probing the host certificates on CEs and SEs at sites with a simple SSL client, we can identify certificates which are due to expire and send an early warning to the site. A predictive tool!
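A minimal sketch of the expiry check, assuming the certificate's "notAfter" field has already been fetched by an SSL client and is in the text form that Python's `ssl` module reports (for example in the dictionary returned by `SSLSocket.getpeercert()`). The function names and the 30-day threshold are assumptions for illustration; the actual GPPMON probe is not shown in the slides.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days left before a certificate expires.

    not_after -- expiry time as text, e.g. "Feb 15 12:00:00 2005 GMT"
    now       -- timezone-aware datetime; defaults to the current UTC time
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    if now is None:
        now = datetime.now(timezone.utc)
    return (expires - now).days


def needs_warning(not_after, warn_days=30, now=None):
    """True when the certificate expires within `warn_days` days (or already has)."""
    return days_until_expiry(not_after, now) <= warn_days
```

Separating the date arithmetic from the network probe keeps the check testable without contacting any host; the probe only has to supply the expiry string for each CE and SE.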


GIIS Monitor

• Developed by Min Tsai (GOC Taipei)
• Tool to display and check information published by the site GIIS (sanity checks, fault detection)

• http://goc.grid.sinica.edu.tw/gstat/

Regional Plot:

http://map.gridpp.ac.uk


Site Certification Service

• In terms of middleware, the installation and configuration of a site is quite a complicated procedure.
  – When there is a new release, sites don’t upgrade at the same time
  – Some upgrades don’t always go smoothly
  – Unexpected things happen (who turned off the power?)
  – Day-to-day problems; robustness of service under load?

• It’s necessary to actively hunt for problems.

• Site certification testing is run by the CERN deployment team on a daily basis. The first step toward providing this service involves running a series of replica manager tests which register files onto the grid, move them around and delete them, and make 3rd party copies from a remote SE.

• Unlike the simple job submission tests implemented in GPPMON, these tests are more heavyweight and attempt to simulate the life cycle of real applications.


Certification Test Results

http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/listreports.cgi


Aggregator RSSReader (Windows Client)

GOC generates RSS feeds which clients can pull using an RSS aggregator.

How can we integrate feeds and ticketing systems?

Syndication of Monitoring Information
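As an illustration of the syndication idea, the sketch below builds a minimal RSS 2.0 document from a list of monitoring messages using only the Python standard library. The function name, the item fields and the example URL in the usage note are hypothetical; the slides do not show how the GOC feeds are actually produced.

```python
import xml.etree.ElementTree as ET


def build_feed(title, link, description, items):
    """Return an RSS 2.0 document as a string.

    items -- iterable of dicts with "title" and "description" keys
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link and description are the required channel elements in RSS 2.0
    for tag, text in (("title", title), ("link", link), ("description", description)):
        ET.SubElement(channel, tag).text = text
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "description").text = item["description"]
    return ET.tostring(rss, encoding="unicode")
```

A call such as `build_feed("GOC monitoring", "http://example.org/goc", "Hypothetical feed", [...])` yields text that any RSS aggregator can poll; a ticketing system could consume the same items, which is one possible answer to the integration question on the slide.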


Real Time Grid Monitor

http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html

A Visualisation tool to track jobs currently running on the grid.

Applet queries the logging and bookkeeping service to get information about grid jobs.

Why are jobs failing?

Why are jobs queued at sites while others are empty?


Problems with Existing Tools

• Lots of monitoring tools around which have things in common:
  - all the information which they generate is hidden away or difficult to access
  - limited interfaces: the data can only be accessed in specific ways

• Therefore, it’s difficult to build “on-demand” services to allow communities (“Players”) to interact with the data.

• The idea is for the services to collect information and put it into a common repository such as an RGMA Archiver. In this way, the information can be shared and accessible to all.

• Services (EGEE parlance: ROC and CIC services) munch the data and present it to the community.

• How much CPU in UKI ROC?
  – How much in GridPP?
    • How much in each Tier2?

=> Integrate data from different sources to provide this information


Monitoring Paradigm

A Better way to unify monitoring information.

GOC Services collect information and publish into an archiver.

ROC/CIC Services provide a means for the community to interact with this information on-demand. GOC provides services tailored to the requirements of the community.

[Diagram labels: Information Repository (RGMA); GOC Services: Accounting, Monitoring, GSTAT, Testing; ROC Services; CIC Services; Self Certification; Communities: VOs, ROCs, EGEE, Sites, Organisations]


Use Cases

• Monitoring services which use RGMA as the backbone for data transport and data location via the registry service.
  – Grid Event Monitoring System
  – “Site Functional Test” Reporting Tool
  – Accounting


Use Cases - GEMS

• Grid Event Monitoring System
• List of resources to monitor is provided by GOCDB

Alert system that uses RGMA

Looks for changes of state in the monitoring data tables

Generates an alert and displays on the GEMS console.

Notification features

Event filtering
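The core of such an alert system, looking for changes of state between two snapshots of the monitoring data, can be sketched as follows. This is an illustration only: it uses plain dictionaries in place of RGMA tables, and the function name, the state values and the optional filter callback are assumptions rather than the GEMS implementation.

```python
def detect_changes(previous, current, wanted=None):
    """Return (resource, old_state, new_state) for every change of state.

    previous, current -- dicts mapping resource name to its last known state
    wanted            -- optional filter called as wanted(resource, old, new);
                         only changes for which it returns True are reported
    """
    alerts = []
    for resource in sorted(current):
        old, new = previous.get(resource), current[resource]
        if old is None or old == new:
            continue  # newly seen resource, or no change of state: no alert
        if wanted is None or wanted(resource, old, new):
            alerts.append((resource, old, new))
    return alerts
```

Passing a filter such as `lambda r, old, new: new == "FAIL"` restricts the console to transitions into a failed state, which is one simple form of event filtering.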


Reporting Tool Prototype

Organisational Identities taken from GOCDB


Accounting

• Information collected at each site from batch logs, gatekeeper logs etc.
• Information joined at site level to select grid jobs and stored in a database on the R-GMA MON box at the site.
• Information published through R-GMA and collected centrally in an R-GMA archive at GOC
• Web site presents various views of this data
• Information schema based on GGF Usage Group
• Structure of Grid taken from GOC DB, the grid configuration database.
• Only normalised CPU time collected (at the moment)
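The site-level join can be sketched as matching batch records against gatekeeper records on a shared local job identifier, so that only jobs that entered through the grid are kept. The record layout and field names (`local_job_id`, `user_dn`, `cpu_seconds`) and the function name are assumptions for illustration; the real log formats are not described in the slides.

```python
def select_grid_jobs(batch_records, gatekeeper_records):
    """Keep batch records that have a matching gatekeeper record.

    Each returned record is the batch record extended with the grid user DN
    taken from the gatekeeper side. Purely local jobs are dropped.
    """
    dn_by_job = {g["local_job_id"]: g["user_dn"] for g in gatekeeper_records}
    return [
        dict(b, user_dn=dn_by_job[b["local_job_id"]])
        for b in batch_records
        if b["local_job_id"] in dn_by_job
    ]
```

The batch log knows how much CPU a job used but not which grid user submitted it, while the gatekeeper log knows the reverse, which is why both sources are needed before a usage record can be published.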


GOC Accounting Services

http://goc.grid-support.ac.uk/gridsite/accounting/index.html

BaseCpuSeconds Aggregated across EGEE

Each Site, per VO, per Month

Simple interface to customise views of data: VO, time frame and Region (default = EGEE)

Each Region, per VO, per Month

On Demand Services to EGEE Community

Other Distributions

Normalised CPU

# Jobs


Web form to apply selection criteria on the data

Aggregate data across an organisation structure (Default = All ROCs)

Select VOs (Default = All)

Select date range


VO Index

Summed CPU (Seconds) consumed by resources in selected Region

Selected Date Range


List of Sites Belonging to the Selected ROC

A breakdown of the resource usage per Site, per VO, per Month
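A per-site, per-VO, per-month breakdown of this kind amounts to a grouped sum over the accounting records. The sketch below assumes each record carries a site name, a VO name, an ISO-format end date and a CPU time in seconds; these field names and the function name are hypothetical and are not the accounting schema.

```python
from collections import defaultdict


def summarise(records):
    """Sum CPU seconds per (site, VO, month).

    records -- iterable of dicts with "site", "vo", "end_date" (YYYY-MM-DD)
               and "cpu_seconds" keys
    """
    totals = defaultdict(int)
    for r in records:
        month = r["end_date"][:7]  # "2005-01-17" -> "2005-01"
        totals[(r["site"], r["vo"], month)] += r["cpu_seconds"]
    return dict(totals)
```

Summing the same records by region instead of by site only requires replacing the site name in the key with the region the site belongs to, which is where the organisational structure held in GOCDB comes in.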


Deployment

• Package was released to LCG in August 2004 and certified soon afterwards.

• There was no LCG release after that until LCG2_3_0 on 18th December 2004

• Today there are still very few 2_3_0 sites. There are 28 sites producing accounting records today.

• The 2_3_0 release has some bugs which are fixed in a new release that is available on the accounting home page

• Recommend that sites upgrade accounting to version APEL 3.4.40 available on the accounting homepage

http://goc.grid-support.ac.uk/gridsite/accounting/index.html


Future Plans

• Support for the LSF batch system.
• Understand normalisation issues; do we have faith in the numbers we present?
• Extend accounting schema to include information about the worker node, job efficiency and globalJobID.
• Integrate the LCG schema with de-facto grid accounting standards, namely GGF
  – Share data with other Grid communities
    • NorduGrid, Grid03


Summary

• GOCDB to take a more important role in the operations environment

• A shift in the monitoring paradigm which relies on sharing data through RGMA

• Accounting Information gathering infrastructure and reporting web site

• Development towards on-demand services to provide the community with up-to-date information, aggregated at different levels.

• Development of Visualisation tools to enhance our understanding of the grid.

• Adaptive Job brokering based on the monitoring system