Alberto Aimar, LCG Planning Officer
Planning and Communication
LHCC Comprehensive Review, 19-20 November 2007
[email protected] CERN – LCG 2
Planning and Reporting Tools (until mid-2007)
Milestones Plans
- For Sites, Areas, Projects and Experiments, including the Tier-1 regional centers
- Level 1 Milestones Reports
Quarterly Reports
- Prepared by each site/project/experiment every quarter
- All milestones due or late are commented in the report
- Projects need to “fill” the Quarterly Report:
  - provide a summary of progress
  - highlight problems (and issues with other projects)
  - add future milestones
Meetings and Communication
- LCG/EGEE/OSG Operations Meeting
- Experiment Coord. and Service Coord. Meetings
- WLCG Bulletin
High Level Milestones (until 2007)
Quarterly Reports in 2007: High Level Milestones + LCG Services + GDB + 12 Sites + 6 Projects/Areas + 4 Experiments
Now we are in a different phase of the project and can focus on:
- Common Milestones for all Sites
- Common Metrics: Transfers, Availability/Reliability, Job Success
- Automation and Monitoring
Planning and Communication (recent changes)
- Planning: Milestones Dashboard; specific plans for Areas and Projects
- Metrics: Sites Reliability; Job Efficiency
- Monitoring: Gridview; monitoring tools
- Communication: Meetings, Bulletin
- Reporting: (simplified) Quarterly Reports
High Level Milestones Dashboard
We are now in a different phase compared to 2005-2007, when each site had different preparations to implement and therefore different milestones
- E.g. installations, infrastructure, networking, buildings, etc.
Each site had its own Milestones Plan and Quarterly Report, focusing on the specific milestones and progress of that site.
On several occasions the Referees had expressed interest in a higher-level overview of the milestones across all sites
Now the services are installed, and common milestones can be expressed and should be met by all sites
- E.g. DB Services, gLite Services (or equivalent from other MW), SRM Services, 24x7 Support, VO Box Support, etc.
A new High Level Milestones Dashboard has been introduced, with milestones across all sites
- Green = “Done”, Orange = “Late < 1 Month”, Red = “Late > 1 Month”
This new representation is very clear and is reviewed monthly at the MB Meetings.
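The colour rule of the dashboard can be sketched in a few lines of Python. This is only an illustration, not the actual dashboard implementation: the function name `milestone_colour`, the 30-day approximation of "1 month", the treatment of a delay of exactly one month, and the "none" value for milestones that are not yet due are all assumptions, since the slide defines only the three colours.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: "1 month" is approximated as 30 days; the slide does not define it.
ONE_MONTH = timedelta(days=30)


def milestone_colour(due: date, done: Optional[date], today: date) -> str:
    """Return the dashboard colour for one milestone at one site."""
    if done is not None:
        return "green"    # Done
    if today <= due:
        return "none"     # Assumption: not yet due, so no colour is shown
    if today - due < ONE_MONTH:
        return "orange"   # Late by less than 1 month
    return "red"          # Late by 1 month or more (boundary is an assumption)


# Hypothetical milestones, evaluated on the first day of the review.
review_day = date(2007, 11, 19)
print(milestone_colour(date(2007, 11, 1), None, review_day))              # orange
print(milestone_colour(date(2007, 9, 1), None, review_day))               # red
print(milestone_colour(date(2007, 9, 1), date(2007, 10, 1), review_day))  # green
```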
Sites Availability and Reliability Metrics
The SAM system has been developed to provide Site Availability Monitoring
- Tests the Services at the Tier-0 and Tier-1 Sites, e.g. CE, SE, SRM, Data Transfers, Certificates, etc.
- Is extensible to more tests and also to VO-specific tests
- Can check different implementations depending on the site and VO (e.g. EGEE, OSG, NDGF services, etc.)
- Critical and non-critical tests have been developed for the general tests (OPS VO) and for the Experiments (ALICE, ATLAS, CMS, LHCb VOs)
Downtimes are commented weekly in the Operations Meeting reports
Since the beginning of 2007 the SAM data has been used to review the reliability of the sites
- Targets have been set: 88% (Jan 07), 91% (Jun 07), 93% (Dec 07)
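As a small illustration, the stepped targets can be held in a lookup table. This sketch assumes that each target stays in force until the next one starts, which the slide does not state explicitly; the names `TARGETS` and `reliability_target` are hypothetical.

```python
from datetime import date

# Targets quoted on the slide: 88% from Jan 07, 91% from Jun 07, 93% from Dec 07.
# Assumption: each target remains valid until the next one takes effect.
TARGETS = [
    (date(2007, 1, 1), 88),
    (date(2007, 6, 1), 91),
    (date(2007, 12, 1), 93),
]


def reliability_target(month: date) -> int:
    """Return the reliability target (%) in force for the given month."""
    current = None
    for start, target in TARGETS:
        if month >= start:
            current = target
    if current is None:
        raise ValueError("no target defined before January 2007")
    return current


print(reliability_target(date(2007, 5, 1)))   # 88
print(reliability_target(date(2007, 10, 1)))  # 91
print(reliability_target(date(2007, 12, 1)))  # 93
```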
[Figure: site reliability over the last 6 months, one panel per site: CA-TRIUMF, TW-ASGC, US-FNAL-CMS, CERN, DE-KIT (GridKa/FZK), FR-CCIN2P3 (IN2P3), IT-INFN-CNAF, UK-T1-RAL, NL-T1 (SARA-NIKHEF), ES-PIC, US-T1-BNL, NDGF]
http://cern.ch/LCG/MB/availability/site_reliability.pdf
[Figure: reliability per site, published every month. Values shown:]
CERN 99%
DE-KIT (GridKa/FZK) 76%
FR-CCIN2P3 90%
IT-INFN-CNAF 97%
UK-T1-RAL 95%
NL-T1 (SARA-NIKHEF) 89%
CA-TRIUMF 91%
TW-ASGC 51%
US-FNAL-CMS 75%
ES-PIC 96%
US-T1-BNL 89%
NDGF 89%
Monthly Reliability of Tier-0, Tier-1 Sites January - October 2007
(all values in %)
Site                  Jan 07  Feb 07  Mar 07  Apr 07  May 07  Jun 07  Jul 07  Aug 07 Sept 07  Oct 07
CERN                      99      91      97      96      90      96      95      99     100      99
DE-KIT (FZK)              85      90      75      79      79      48      75      67      91      76
FR-CCIN2P3                96      74      58      95      94      88      94      95      70      90
IT-INFN-CNAF              75      93      76      93      87      67      82      70      80      97
UK-T1-RAL                 80      82      80      87      87      87      98      99      90      95
NL-T1 (NIKHEF)            93      83      47      92      99      75      92      86      92      89
CA-TRIUMF                 79      88      70      73      95      95      97      97      95      91
TW-ASGC                   96      97      95      92      98      80      83      83      93      51
US-FNAL-CMS               84      67      90      85      77      77      92      99      89      75
ES-PIC                    86      86      96      95      77      79      96      94      93      96
US-T1-BNL                 90     57*      6*      89      98      94      75      71      91      89
NDGF                     n/a     n/a     n/a     n/a     n/a     n/a     n/a     n/a     n/a      89
Reliability Target        88      88      88      88      88      91      91      91      91      91
Target + 90% target    5 + 5   6 + 3   4 + 1   7 + 3   6 + 3   3 + 2   7 + 2   6 + 2   7 + 2   5 + 4
Avg. 8 best sites: Apr 92%, May 94%, Jun 87%, Jul 93%, Aug 94%, Sept 93%, Oct 93%
Avg. all sites:    Apr 89%, May 89%, Jun 80%, Jul 89%, Aug 88%, Sept 89%, Oct 86%
* BNL: LCG/gLite CE probed by SAM but not installed with the SL4 upgrade
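The summary rows of the table can be recomputed from the per-site values. The sketch below does this for October 2007 only, using the values printed in the table. It assumes that "Target + 90% target" means the number of sites at or above the target plus the number of sites between 90% of the target and the target; that reading is an interpretation, not something the slide states. The names `OCT_07` and `summarise` are hypothetical.

```python
# October 2007 reliability (%) per site, copied from the table above.
OCT_07 = {
    "CERN": 99, "DE-KIT (FZK)": 76, "FR-CCIN2P3": 90, "IT-INFN-CNAF": 97,
    "UK-T1-RAL": 95, "NL-T1 (NIKHEF)": 89, "CA-TRIUMF": 91, "TW-ASGC": 51,
    "US-FNAL-CMS": 75, "ES-PIC": 96, "US-T1-BNL": 89, "NDGF": 89,
}
TARGET = 91  # reliability target (%) for October 2007, from the table


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def summarise(reliability, target):
    """Recompute the summary rows for one month of per-site reliability."""
    scores = sorted(reliability.values(), reverse=True)
    near = 0.9 * target  # assumption: "90% target" means 90% of the target value
    return {
        "avg_all": round(mean(scores)),
        "avg_best_8": round(mean(scores[:8])),
        "at_target": sum(1 for s in scores if s >= target),
        "near_target": sum(1 for s in scores if near <= s < target),
    }


print(summarise(OCT_07, TARGET))
# {'avg_all': 86, 'avg_best_8': 93, 'at_target': 5, 'near_target': 4}
```

For October this agrees with the slide (86%, 93%, and "5 + 4"). The table holds rounded values, so other months are not guaranteed to match in the same way.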
Sites Availability and Reliability Reports
Every week the Sites report on unavailability at the Operations Meeting
- Explaining the problem, the solution found and the severity of the downtime
The SAM tests are executed automatically and provide an objective (although not perfect) view of which services work at the sites
- Critical and non-critical tests are added to improve the verifications
- They are executed on all sites, but can be adapted to the specific Services of the site being tested (e.g. ARC at NDGF instead of gLite)
- VOs can add their own tests to check what interests them, or add verifications of their own systems (e.g. PhEDEx, DIRAC, etc.)
- The VOs can also choose which sites to check
Note: The VO-specific SAM results are not yet published; Experiments and Sites are still finding out the problems with the tests
Comparison with VO-Specific SAM Tests September 2007
Site              OPS  ALICE  ATLAS    CMS   LHCb  GOCDB id
CERN             100%    97%   100%   100%    96%  CERN-PROD
DE-KIT            91%    95%    62%    99%    91%  FZK-LCG2
FR-CCIN2P3        70%    45%    26%     8%    97%  IN2P3-CC
IT-INFN-CNAF      80%    97%    85%   100%    66%  INFN-T1
NDGF              97%      -    76%      -      -  NDGF-T1
UK-T1-RAL         90%    96%   100%   100%    97%  RAL-LCG2
NL-T1             92%    96%    92%    53%    90%  SARA-MATRIX
CA-TRIUMF         95%      -    98%      -      -  TRIUMF-LCG2
TW-ASGC           93%      -    98%    95%      -  Taiwan-LCG2
US-FNAL-CMS       89%      -      -    38%      -  USCMS-FNAL-WC1
ES-PIC            93%      -   100%   100%    93%  pic
US-T1-BNL         91%      -    72%      -      -  BNL-LCG2
Colour bands: >=91%, >=82%, <82%
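The three bands of the legend amount to a simple threshold check, sketched below. The function name `band` and the "no data" label for cells shown as "-" are assumptions; the slide gives only the thresholds. The lower threshold of 82% is close to 90% of the 91% target (81.9%), which may be where it comes from, but the slide does not say so.

```python
from typing import Optional


def band(value: Optional[float]) -> str:
    """Map a SAM percentage to one of the three legend bands."""
    if value is None:
        return "no data"   # assumption: "-" in the table means no value was reported
    if value >= 91:
        return ">=91%"
    if value >= 82:
        return ">=82%"
    return "<82%"


# FR-CCIN2P3 row, September 2007: OPS, ALICE, ATLAS, CMS, LHCb.
row = [70, 45, 26, 8, 97]
print([band(v) for v in row])
# ['<82%', '<82%', '<82%', '<82%', '>=91%']
```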
Monitoring and Reporting Tools
GridView
- Gridview is a monitoring and visualization tool being developed to provide a high-level view of various functional aspects of the Worldwide LHC Computing Grid (WLCG)
- Currently it shows statistics on data transfers, running jobs and service availability for the WLCG
- It shows the SAM results by accessing the SAM database, so one can find out exactly which test has failed on which host
- A GUI makes it possible to select T1s, T2s, VOs, and many display options
Grid Monitoring Working Group (ongoing)
- Common definitions for sensors and metrics
- Interface between a site and the grid monitoring fabric
- Allow sites within different grid infrastructures to publish and consume the monitoring data
- Provide views of the system (“dashboards”) adapted to each of the stakeholder communities
GridView
http://gridview.cern.ch
New Quarterly Reports
The new Quarterly Reports will be simplified to report only the High Level Milestones and Metrics for each of the Sites
- Projects and Areas will still have dedicated milestone plans because they have no commonalities
- Experiments’ progress is presented at the MB and summarized
Sites will be asked to comment on late milestones or performance below targets
- i.e. a site that meets its targets and milestones and is all “green” will have nothing else to report
Proposed by the MB and accepted by the Overview Board in October 2007
Next Steps: Job Efficiency
Sites Reliability tests show only whether the Services are running
- This is a necessary condition for the Experiments' applications to run
- But one needs to verify what the success rates of REAL Experiments' jobs are at the Sites
Experiments monitor and display the execution of their jobs at the sites (e.g. ARDA Dashboard) and they have specific job submission and control systems
- ALICE Agent, ATLAS Ganga, CMS CRAB, LHCb Pilot
- With specific verification to check the exit status and verify the success/failure of the jobs
This data is used to calculate the Site Job Efficiency
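The slides do not give the formula for Site Job Efficiency. A minimal sketch, assuming it is the fraction of jobs at a site that finished successfully, and assuming an exit status of 0 means success, could look like this. The function name, the record layout and the site names are hypothetical.

```python
from collections import Counter


def site_job_efficiency(jobs):
    """jobs: iterable of (site, exit_status) pairs; returns {site: efficiency in %}."""
    total, ok = Counter(), Counter()
    for site, exit_status in jobs:
        total[site] += 1
        if exit_status == 0:   # assumption: exit status 0 means the job succeeded
            ok[site] += 1
    return {site: 100.0 * ok[site] / total[site] for site in total}


# Hypothetical job records, not real experiment data.
jobs = [("SITE-A", 0), ("SITE-A", 1), ("SITE-A", 0), ("SITE-A", 0), ("SITE-B", 0)]
print(site_job_efficiency(jobs))
# {'SITE-A': 75.0, 'SITE-B': 100.0}
```

In practice each experiment applies its own success criteria through its job submission system, so the exit-status check here stands in for that experiment-specific verification.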
LCG Bulletins
https://cern.ch/twiki/bin/view/LCG/LcgBulletins
Summary
Status
- Services are in place and equipment is installed, therefore Monitoring and Metrics are more appropriate
- Added Metrics for Reliability, Accounting, and (soon) Job Efficiency
- Dedicated projects have specific milestones (DB, SRM, CCRC, etc.)
Reporting
- Milestones Dashboard and Quarterly Reports (simplified)
Monitoring
- Information is displayed in a better way (dashboards, targets, colors, etc.)
- Site reliability available online, weekly reporting and MB reviewing
Communication
- Unchanged communication tools: Meetings (Operations, Services, Experiments) and Bulletin
Next Steps
- Success rates and Job Efficiency for the Experiments' applications
WEB: http://cern.ch/LCG/planning
WIKI: https://cern.ch/twiki/bin/view/LCG/Planning
Backup Slides
Job Efficiency Table, September 2007 (data below is preliminary)
Experiments: ALICE, ATLAS, CMS, LHCb. Columns are the job systems: AGENT, GANGA, PROD, CRAB, PILOT
Site      AGENT  GANGA   PROD   CRAB  PILOT
ASGC          -    22%    82%    90%      -
BNL           -     0%     0%      -      -
CERN        99%    50%    92%    76%    99%
CNAF        53%    52%    74%    97%    95%
FNAL          -      -      -    99%      -
FZK         96%    73%    93%    96%    93%
IN2P3       89%    77%    79%    99%    96%
NDGF         0%      -    84%      -      -
NIKHEF     100%    45%    84%      -    19%
PIC           -     7%    61%   100%    88%
RAL         99%    15%    93%    90%    90%
TRIUMF        -     4%    94%     0%     0%
Colour bands: >=91%, >=82%, <82%