Alberto Aimar, LCG Planning Officer: Planning and Communication, LHCC Comprehensive Review, 19-20 November 2007


Page 1: LHCC Comprehensive Review 19-20 November 2007

Alberto Aimar, LCG Planning Officer

Planning and Communication

LHCC Comprehensive Review, 19-20 November 2007


Planning and Reporting Tools (until mid-2007)

Milestones Plans for Sites, Areas, Projects and Experiments, including the Tier-1 regional centers

Level 1 Milestones Reports

Quarterly Reports prepared by each site/project/experiment every quarter

All milestones due or late are commented on in the report

Projects need to “fill” the Quarterly Report: provide a summary of progress, highlight problems (and issues with other projects), add future milestones

Meetings and Communication: LCG/EGEE/OSG Operations Meeting, Experiment Coord. and Service Coord. Meetings, WLCG Bulletin


High Level Milestones (until 2007)


Quarterly Reports in 2007: High Level Milestones + LCG Services + GDB + 12 Sites + 6 Projects/Areas + 4 Experiments

Now we are in a different phase of the project and can focus on

Common Milestones for all Sites

Common Metrics: Transfers, Availability/Reliability, Job Success

Automation and Monitoring


Planning and Communication (recent changes)

Planning: Milestones Dashboard; specific plans for Areas and Projects

Metrics: Sites Reliability; Job Efficiency

Monitoring: Gridview; monitoring tools

Communication: Meetings; Bulletin

Reporting: (simplified) Quarterly Reports


High Level Milestones Dashboard

We are now in a different phase compared to 2005-2007, when each site had different preparations to implement and therefore different milestones (e.g. installations, infrastructure, networking, buildings, etc.)

Each site had its own Milestones Plan and a Quarterly Report focusing on the specific milestones and progress of that site.

On several occasions the Referees had expressed interest in a higher-level overview of the milestones across all sites

Now the services are installed, and common milestones can be expressed and should be met by all sites (e.g. DB Services, gLite Services or the equivalent in other middleware, SRM Services, 24x7 Support, VO Box Support, etc.)

A new High Level Milestone Dashboard has been introduced, with milestones across all sites: Green=“Done”, Orange=“Late<1 Month”, Red=“Late>1 Month”

This new representation is very clear and is reviewed monthly at the MB Meetings.
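The colour coding of the dashboard can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `milestone_colour`, the 30-day reading of "1 Month", and the `"none"` value for milestones that are not yet due are all hypothetical and are not taken from the slides.

```python
from datetime import date

# Assumption: "1 Month" is read as 30 days; the slides do not define it.
ONE_MONTH_DAYS = 30

def milestone_colour(due, done, today):
    """Return the dashboard colour for one milestone at one site.

    due   -- date on which the milestone was due
    done  -- True if the site has completed the milestone
    today -- date on which the dashboard is evaluated
    """
    if done:
        return "green"           # Green = "Done"
    days_late = (today - due).days
    if days_late <= 0:
        return "none"            # hypothetical: not yet due, no colour
    if days_late <= ONE_MONTH_DAYS:
        return "orange"          # Orange = "Late < 1 Month"
    return "red"                 # Red = "Late > 1 Month"

review_day = date(2007, 11, 19)
print(milestone_colour(date(2007, 11, 1), False, review_day))
print(milestone_colour(date(2007, 9, 1), False, review_day))
```

With a rule of this kind, one call per site and per milestone would be enough to fill a grid of sites against milestones.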


[Figure: Sites and Milestones table]


Sites Availability and Reliability Metrics

The SAM system has been developed to provide Site Availability Monitoring

It tests the Services at the Tier-0 and Tier-1 Sites, e.g. CE, SE, SRM, Data Transfers, Certificates, etc.

It is extensible to more tests and also to VO-specific tests

It can check different implementations depending on the site and VO (e.g. EGEE, OSG, NDGF services, etc.)

Critical and non-critical tests have been developed for the general tests (OPS VO) and for the Experiments (ALICE, ATLAS, CMS, LHCb VOs).

Downtimes are commented on weekly in the Operations Meeting reports

Since the beginning of 2007 we use the SAM data to review the reliability of the sites

Targets have been set: 88% (Jan 07), 91% (Jun 07), 93% (Dec 07)
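As an illustration, the stepped targets can be looked up per month as follows. The sketch assumes that each target applies from its stated month until the next step, which is consistent with the Reliability Target row of the monthly table (88 for January to May, 91 from June). The names `TARGET_STEPS` and `reliability_target` are hypothetical.

```python
# (year, month) from which each reliability target applies, in percent.
# Assumption: a target stays in force until the next step starts.
TARGET_STEPS = [
    ((2007, 1), 88),
    ((2007, 6), 91),
    ((2007, 12), 93),
]

def reliability_target(year, month):
    """Return the target (percent) for a given month, or None before Jan 07."""
    current = None
    for start, target in TARGET_STEPS:
        if (year, month) >= start:   # tuple comparison: year first, then month
            current = target
    return current

for month in (1, 5, 6, 10, 12):
    print(month, reliability_target(2007, month))
```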


[Figure: site reliability over the last 6 months, one panel per site: CA-TRIUMF, TW-ASGC, US-FNAL-CMS, CERN, DE-KIT (GridKa/FZK), FR-CCIN2P3 (IN2P3), IT-INFN-CNAF, UK-T1-RAL, NL-T1 (SARA-NIKHEF), ES-PIC, US-T1-BNL, NDGF]

http://cern.ch/LCG/MB/availability/site_reliability.pdf


ES-PIC 96%

CA-TRIUMF 91% TW-ASGC 51%

UK-T1-RAL 95%

IT-INFN-CNAF 97%

CERN 99% DE-KIT (GridKa/FZK) 76%

NL-T1 (SARA-NIKHEF) 89%

US-T1-BNL 89%

US-FNAL-CMS 75%

FR-CCIN2P3 90%

NDGF 89%

Every Month


Monthly Reliability of Tier-0, Tier-1 Sites January - October 2007

Site                 Jan 07   Feb 07   Mar 07   Apr 07   May 07   Jun 07   Jul 07   Aug 07   Sept 07  Oct 07
CERN                 99       91       97       96       90       96       95       99       100      99
DE-KIT (FZK)         85       90       75       79       79       48       75       67       91       76
FR-CCIN2P3           96       74       58       95       94       88       94       95       70       90
IT-INFN-CNAF         75       93       76       93       87       67       82       70       80       97
UK-T1-RAL            80       82       80       87       87       87       98       99       90       95
NL-T1 (NIKHEF)       93       83       47       92       99       75       92       86       92       89
CA-TRIUMF            79       88       70       73       95       95       97       97       95       91
TW-ASGC              96       97       95       92       98       80       83       83       93       51
US-FNAL-CMS          84       67       90       85       77       77       92       99       89       75
ES-PIC               86       86       96       95       77       79       96       94       93       96
US-T1-BNL            90       57*      6*       89       98       94       75       71       91       89
NDGF                 n/a      n/a      n/a      n/a      n/a      n/a      n/a      n/a      n/a      89
Reliability Target   88       88       88       88       88       91       91       91       91       91
Target + 90% target  5 + 5    6 + 3    4 + 1    7 + 3    6 + 3    3 + 2    7 + 2    6 + 2    7 + 2    5 + 4

Avg. 8 best sites: Apr 92%, May 94%, Jun 87%, Jul 93%, Aug 94%, Sept 93%, Oct 93%

Avg. all sites: Apr 89%, May 89%, Jun 80%, Jul 89%, Aug 88%, Sept 89%, Oct 86%

* BNL: LCG/gLite CE probed by SAM but not installed with the SL4 upgrade
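The summary rows can be recomputed from a single column. The sketch below uses the Oct 07 values and assumes one reading of the "Target + 90% target" row: the first number counts sites at or above the target, the second counts sites below the target but at or above 90% of it. That reading is an interpretation, not something the slides state, and because the published percentages are rounded it may not reproduce every month exactly. The names `OCT_2007` and `summarise` are hypothetical.

```python
# Oct 07 reliability per site, in percent, copied from the monthly table.
OCT_2007 = {
    "CERN": 99, "DE-KIT (FZK)": 76, "FR-CCIN2P3": 90, "IT-INFN-CNAF": 97,
    "UK-T1-RAL": 95, "NL-T1 (NIKHEF)": 89, "CA-TRIUMF": 91, "TW-ASGC": 51,
    "US-FNAL-CMS": 75, "ES-PIC": 96, "US-T1-BNL": 89, "NDGF": 89,
}
TARGET = 91  # reliability target in force in Oct 07

def summarise(values, target, best_n=8):
    """Return (sites meeting target, sites within 90% of target,
    rounded average of all sites, rounded average of the best_n sites)."""
    met = sum(1 for v in values if v >= target)
    near = sum(1 for v in values if 0.9 * target <= v < target)
    ranked = sorted(values, reverse=True)
    avg_all = sum(values) / len(values)
    avg_best = sum(ranked[:best_n]) / best_n
    return met, near, round(avg_all), round(avg_best)

print(summarise(list(OCT_2007.values()), TARGET))  # (5, 4, 86, 93)
```

For October this gives 5 + 4 sites, an all-site average of 86% and an average of 93% for the 8 best sites, in agreement with the figures on the slide.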


Sites Availability and Reliability Reports

Every week the Sites report on unavailability at the Operations Meeting, explaining the problem, the solution found and the severity of the downtime

The SAM tests are executed automatically and provide an objective (although not perfect) view of which services work at the sites

Critical and non-critical tests are added to improve the verifications

The tests are executed on all sites, but can be adapted to the specific Services of the site under test (e.g. ARC at NDGF instead of gLite)

VOs can add their own tests to check what interests them, or add verifications of their own systems (e.g. PhEDEx, DIRAC, etc.)

The VOs can also choose which sites to check

Note: The VO-specific SAM results are not yet published; Experiments and Sites are still finding out the problems with the tests


Comparison with VO-Specific SAM Tests September 2007

Site           OPS    ALICE  ATLAS  CMS    LHCb   GOCDB id
CERN           100%   97%    100%   100%   96%    CERN-PROD
DE-KIT         91%    95%    62%    99%    91%    FZK-LCG2
FR-CCIN2P3     70%    45%    26%    8%     97%    IN2P3-CC
IT-INFN-CNAF   80%    97%    85%    100%   66%    INFN-T1
NDGF           97%    -      76%    -      -      NDGF-T1
UK-T1-RAL      90%    96%    100%   100%   97%    RAL-LCG2
NL-T1          92%    96%    92%    53%    90%    SARA-MATRIX
CA-TRIUMF      95%    -      98%    -      -      TRIUMF-LCG2
TW-ASGC        93%    -      98%    95%    -      Taiwan-LCG2
US-FNAL-CMS    89%    -      -      38%    -      USCMS-FNAL-WC1
ES-PIC         93%    -      100%   100%   93%    pic
US-T1-BNL      91%    -      72%    -      -      BNL-LCG2

Legend: >=91%   >=82%   <82%
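The three bands of the legend can be applied to the cells of this comparison with a small helper. This is a sketch only: the function name `band` is hypothetical, and returning `None` for a "-" cell (no result for that VO at that site) is an assumption about how missing entries should be treated.

```python
def band(cell):
    """Map one table cell such as '97%' or '-' to a legend band.

    Returns '>=91%', '>=82%' or '<82%', or None for a '-' cell
    (assumption: '-' means there is no result to classify).
    """
    text = cell.strip()
    if text == "-":
        return None
    value = int(text.rstrip("%"))
    if value >= 91:
        return ">=91%"
    if value >= 82:
        return ">=82%"
    return "<82%"

# September 2007 rows for two sites: OPS, ALICE, ATLAS, CMS, LHCb.
fr_ccin2p3 = ["70%", "45%", "26%", "8%", "97%"]
ndgf = ["97%", "-", "76%", "-", "-"]
print([band(c) for c in fr_ccin2p3])
print([band(c) for c in ndgf])
```

Applied to FR-CCIN2P3, only the LHCb result falls in the top band, while the OPS and the other VO results fall below 82%, which shows how far the general and VO-specific views of one site can differ.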


Monitoring and Reporting Tools

GridView

Gridview is a monitoring and visualization tool being developed to provide a high-level view of various functional aspects of the Worldwide LHC Computing Grid (WLCG).

Currently it shows statistics of data transfers and running jobs, and service availability information for the WLCG

It shows the SAM results by accessing the SAM database, so one can find out exactly which test has failed on which host

Its GUI allows selecting T1s, T2s and VOs, with many options for the display

Grid Monitoring Working Group (ongoing)

Common definitions for sensors and metrics

Interface between a site and the grid monitoring fabric

Allow sites within different grid infrastructures to publish and consume the monitoring data

Provide views of the system (“dashboards”) adapted to each of the stakeholder communities


GridView

http://gridview.cern.ch


New Quarterly Reports

The new Quarterly Reports will be simplified to report only High Level Milestones and Metrics for each of the Sites

Projects and Areas will still have dedicated milestone plans because they have no commonalities

Experiments’ progress is presented at the MB and summarized

Sites will be asked to comment on late milestones or performance below targets

i.e. a site that meets its targets and milestones and is all “green” will have nothing else to report

Proposed by the MB and accepted by the Overview Board in October 2007


Next Steps: Job Efficiency

Sites Reliability tests show only whether the Services are running

They are a necessary condition for the Experiments’ applications to run

But one needs to verify what the success rates of REAL Experiments’ jobs are at the Sites

Experiments monitor and display the execution of their jobs at the sites (e.g. ARDA Dashboard), and they have specific job submission and control systems: ALICE Agent, ATLAS Ganga, CMS CRAB, LHCb Pilot

These include specific verifications that check the exit status and verify the success/failure of the jobs

This data is used to calculate the Site Job Efficiency
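The slides do not give the formula behind the Site Job Efficiency. A minimal sketch, assuming it is the share of finished jobs that ended successfully and that an exit status of 0 marks success, could look like this. Both assumptions, and the name `job_efficiency`, are hypothetical; each experiment's submission system may define success differently.

```python
def job_efficiency(exit_statuses):
    """Return the percentage of successful jobs at one site for one VO.

    exit_statuses -- list of integer exit codes, one per finished job
                     (assumption: 0 means the job succeeded).
    Returns None when no jobs finished, since no rate can be computed.
    """
    finished = len(exit_statuses)
    if finished == 0:
        return None
    successful = sum(1 for status in exit_statuses if status == 0)
    return 100.0 * successful / finished

print(job_efficiency([0, 0, 0, 1]))   # three of four jobs succeeded
print(job_efficiency([]))             # no jobs ran at this site
```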

https://cern.ch/twiki/bin/view/LCG/LcgBulletins

LCG Bulletins


Summary

Status

Services are in place and equipment is installed, therefore Monitoring and Metrics are more appropriate

Added Metrics for Reliability, Accounting, and (soon) Job Efficiency

Dedicated projects have specific milestones (DB, SRM, CCRC, etc.)

Reporting

Milestones Dashboard and (simplified) Quarterly Reports

Monitoring

Information is displayed in a better way (dashboards, targets, colors, etc.)

Site reliability is available online, with weekly reporting and MB review

Communication

Communication tools are unchanged: Meetings (Operations, Services, Experiments) and Bulletin

Next Steps

Success rates and Job Efficiency for the Experiments’ applications

WEB: http://cern.ch/LCG/planning

WIKI: https://cern.ch/twiki/bin/view/LCG/Planning


Backup Slides


Job Efficiency Table (data below is preliminary)

September 2007   ALICE    ATLAS    ATLAS    CMS      LHCb
                 AGENT    GANGA    PROD     CRAB     PILOT
ASGC             -        22%      82%      90%      -
BNL              -        0%       0%       -        -
CERN             99%      50%      92%      76%      99%
CNAF             53%      52%      74%      97%      95%
FNAL             -        -        -        99%      -
FZK              96%      73%      93%      96%      93%
IN2P3            89%      77%      79%      99%      96%
NDGF             0%       -        84%      -        -
NIKHEF           100%     45%      84%      -        19%
PIC              -        7%       61%      100%     88%
RAL              99%      15%      93%      90%      90%
TRIUMF           -        4%       94%      0%       0%

Legend: >=91%   >=82%   <82%