INFSO-RI-508833
Enabling Grids for E-sciencE
www.eu-egee.org
Status of EGEE Operations
Ian Bird, CERN
SA1 Activity Leader
EGEE 3rd Conference
Athens, 18th April, 2005
Overview
• Overall activity status
  - Service & Operations
• Planning for remainder of project
  - Main focus of activities
  - gLite migration
• Summary
Tomorrow’s plenary session for technical details
Computing Resources: April 2005
[Map legend: country providing resources; country anticipating joining]
In LCG-2:
• 131 sites, 30 countries
• >12,000 cpu
• ~5 PB storage
Includes non-EGEE sites:
• 9 countries
• 20 sites
Infrastructure metrics
Countries, sites, and CPU available in EGEE production service

Region                countries  sites  cpu M6 (TA)  cpu M15 (TA)  cpu actual
EGEE partner regions
CERN                          0      1          900          1800        1841
UK/Ireland                    2     19          100          2200        2398
France                        1      8          400           895        1172
Italy                         1     21          553           679        2164
South East                    5     16          146           322         159
South West                    2     13          250           250         498
Central Europe                5     10          385           730         629
Northern Europe               2      4          200          2000         427
Germany/Switzerland           2     10          100           400        1733
Russia                        1      9           50           152         276
EGEE-total                   21    111         3084          9428       11297
Other collaborating sites
USA                           1      3            -             -         555
Canada                        1      6            -             -         316
Asia-Pacific                  6      8            -             -         394
Hewlett-Packard               1      3            -             -         172
Total other                   9     20            -             -        1437
Grand Total                  30    131            -             -       12734
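The totals in the table above can be cross-checked with a short script. This is only a sketch: the rows are transcribed from the slide, and the names `EGEE_REGIONS`, `OTHER_SITES` and `column_sum` are illustrative, not part of any EGEE tool.

```python
# Rows transcribed from the table above:
# (region, countries, sites, cpu_m6_ta, cpu_m15_ta, cpu_actual)
EGEE_REGIONS = [
    ("CERN", 0, 1, 900, 1800, 1841),
    ("UK/Ireland", 2, 19, 100, 2200, 2398),
    ("France", 1, 8, 400, 895, 1172),
    ("Italy", 1, 21, 553, 679, 2164),
    ("South East", 5, 16, 146, 322, 159),
    ("South West", 2, 13, 250, 250, 498),
    ("Central Europe", 5, 10, 385, 730, 629),
    ("Northern Europe", 2, 4, 200, 2000, 427),
    ("Germany/Switzerland", 2, 10, 100, 400, 1733),
    ("Russia", 1, 9, 50, 152, 276),
]

# Other collaborating sites have no TA targets: (region, countries, sites, cpu_actual)
OTHER_SITES = [
    ("USA", 1, 3, 555),
    ("Canada", 1, 6, 316),
    ("Asia-Pacific", 6, 8, 394),
    ("Hewlett-Packard", 1, 3, 172),
]


def column_sum(rows, index):
    """Sum one numeric column over a list of row tuples."""
    return sum(row[index] for row in rows)


egee_actual = column_sum(EGEE_REGIONS, 5)
egee_target_m15 = column_sum(EGEE_REGIONS, 4)
other_actual = column_sum(OTHER_SITES, 3)
grand_total = egee_actual + other_actual

# Regions whose actual CPU count is still below their M15 (TA) figure
below_target = [row[0] for row in EGEE_REGIONS if row[5] < row[4]]

print("EGEE actual / M15 target:", egee_actual, "/", egee_target_m15)
print("Grand total cpu:", grand_total)
print("Below M15 target:", below_target)
```

On these figures the EGEE regions together already exceed the M15 figure, while three regions individually remain below theirs.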
Service Usage
• VOs and users on the production service
  - Active HEP experiments: 4 LHC, D0, CDF, Zeus, Babar
  - Active other VOs: Biomed, ESR (Earth Sciences), Compchem, Magic (Astronomy), EGEODE (Geo-Physics)
  - 6 disciplines
  - Registered users in these VOs: 600
  - In addition to these there are many VOs that are local to a region, supported by their ROCs, but not yet visible across EGEE
• Scale of work performed:
  - LHC Data challenges 2004:
    - >1 M SI2K years of cpu time (~1000 cpu years)
    - 400 TB of data generated, moved and stored
    - 1 VO achieved ~4000 simultaneous jobs (~4 times CERN grid capacity)
[Chart: Number of jobs processed/month]
SA1 – Operations Structure
• Operations Management Centre (OMC)
• Core Infrastructure Centres (CIC)
  - Manage daily grid operations – oversight, troubleshooting
  - Run essential infrastructure services
  - Provide 2nd level support to ROCs
  - UK/I, Fr, It, CERN, + Russia (M12)
    - Weekly rotation in place since October
  - Taipei also runs a CIC
• Regional Operations Centres (ROC)
  - Act as front-line support for user and operations issues
  - Provide local knowledge and adaptations
  - One in each region – many distributed
• User Support Centre (GGUS)
  - In FZK – manage PTS – provide single point of contact (service desk)
  - Not foreseen as such in TA, but need is clear
Operations Procedures
• Driven by experience during 2004 Data Challenges, &
• Reflecting the outcome of the November Operations Workshop
• Operations Procedures
  - roles of CICs - ROCs - RCs
  - weekly rotation of operations centre duties (CIC-on-duty)
    - Process in place since October
  - daily tasks of the operations shift
    - monitoring (tools, frequency)
    - problem reporting
      • problem tracking system
      • communication with ROCs & RCs
    - escalation of unresolved problems
    - handing over the service to the next CIC
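The weekly CIC-on-duty rotation can be pictured as a simple round-robin over the participating centres. The sketch below is an illustration only: the order of the CICs and the start date are assumptions (the slides say only that the rotation has run since October), and `cic_on_duty` is a hypothetical helper, not an EGEE tool.

```python
from datetime import date

# CICs named on the operations-structure slide. The order is an assumption,
# and Russia (joining at M12) could be appended to this list later.
CICS = ["CERN", "UK/I", "Fr", "It"]

# Hypothetical start of the rotation (a Monday); the slides only say "October".
ROTATION_START = date(2004, 10, 4)


def cic_on_duty(day, rotation_start=ROTATION_START, cics=CICS):
    """Return the CIC responsible for the week that contains `day`."""
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - rotation_start).days // 7
    # Round-robin: after the last CIC the duty returns to the first one.
    return cics[weeks_elapsed % len(cics)]


print(cic_on_duty(date(2004, 10, 13)))
```

Handing over the service to the next CIC then corresponds to the weekly boundary where `weeks_elapsed` increases by one.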
New Release Process (simplified)
[Diagram: release workflow with numbered steps 1 to 7. Recoverable labels:]
  - Groups shown: C&T, EIS, GIS, GDB, Applications, RC, CICs, Developers
  - Bugs/Patches/Task (Savannah)
  - Head of Deployment: prioritization & selection
  - List for next release (can be empty)
  - integration & first tests (C&T)
  - Internal Releases
  - User Level install of client tools (EIS)
  - full deployment on test clusters (6); functional/stress tests, ~1 week (C&T)
  - assign and update cost (Bugs/Patches/Task, Savannah)
  - components ready at cutoff
  - Internal Client Release
  - Client Release, Service Release, Updates Release, Core Service Release (C&T)
Deployment process
[Diagram: deployment workflow, numbered step 11. Recoverable labels:]
  - Release(s); Certification is run daily
  - Update User Guides (EIS)
  - Update Release Notes (GIS)
  - Documents: Release Notes, Installation Guides, User Guides
  - Re-Certify (CIC)
  - Release; Client Release
  - Deploy Client Releases (User Space) (GIS)
  - Deploy Service Releases (Optional) (CICs, RCs)
  - Deploy Major Releases (Mandatory) (ROCs, RCs)
  - YAIM
  - Timing labels: Every Month (shown twice); Every 3 months, on fixed dates!; at own pace
Future work – comments from review
• Testing and software packaging will be critical to success. Reinforce these also intellectually very demanding activities even further.
  - Yes – this is agreed!
• Work hard on event-based monitoring techniques, triggering preventive maintenance actions, to improve the stability of the Grid infrastructure.
• Implement a strong mechanism to quickly isolate unstable sites in the production Grid.
  - These are both part of ongoing program of work
  - Use R-GMA as monitoring framework; build triggers and alarms on top
  - Better mechanism to remove sites – web interface to allow VO to select
• Improve the middleware deployment process (technical, organisational) even further to increase the stability of the infrastructure and consequently improve the job success rate and reduce the load on the support team.
  - Already updated and streamlined deployment and release process and improved configuration mechanisms
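As a rough illustration of the "triggers and alarms" idea for isolating unstable sites, the sketch below flags sites whose recent test success rate falls below a threshold. Everything here is an assumption made for illustration: the input format, the thresholds and the function name `unstable_sites` are hypothetical, and the code does not use R-GMA or any other EGEE interface.

```python
from collections import defaultdict


def unstable_sites(test_results, min_success_rate=0.8, min_tests=5):
    """Return the sorted names of sites that should raise an alarm.

    test_results: iterable of (site_name, passed) pairs, where passed is a bool.
    A site is flagged only if it has at least `min_tests` results and its
    success rate is below `min_success_rate` (both thresholds are assumptions).
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for site, ok in test_results:
        total[site] += 1
        if ok:
            passed[site] += 1
    return sorted(
        site
        for site in total
        if total[site] >= min_tests and passed[site] / total[site] < min_success_rate
    )


# Hypothetical monitoring results for three made-up sites
results = (
    [("site-a", True)] * 9 + [("site-a", False)]
    + [("site-b", True)] * 2 + [("site-b", False)] * 4
    + [("site-c", False)] * 2
)
print(unstable_sites(results))
```

Requiring a minimum number of tests keeps a site with only a couple of results (here `site-c`) from being removed on thin evidence; whether that is desirable is a policy choice the slides leave open.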
15 month plan
• No major changes to goals or work
• Areas of work focus:
  - Migration to gLite
    - See next slides
  - Improving operational and grid reliability
    - Follow recommendations of review discussed above
    - Improve monitoring systems – build reactive alarms
    - Site isolation – need simple mechanism (CIC tool) to remove sites
      • Bad sites, security problems, etc.
  - Improving user support
    - In progress – need recognised usable service by mid-year
  - 24x7 service availability
    - Availability of service rather than components
    - Identify critical services
    - Issues: on-call support; hot stand-by machines; etc. (might need work on middleware to support this!)
Review recommendations to SA1
• The migration path to gLite needs to be better planned, as it is inherently difficult to support two different grid software stacks indefinitely. More specifically, establishing a fixed time-line for migration as well as deprecation deadlines for LCG-2 services, plus possibly identifying who would be the earliest adopters from the application side and the time-line for their possible early committal, would be essential; otherwise, existing users may not be motivated to migrate.
  - Migration plan is being worked out in detail – but will be driven by experience in the certification and pre-production deployment
  - Must be a migration plan and not a switch from old to new
  - Early adopters include LCG, others should be identified via NA4
Migration to gLite
• Migration strategy
  - Needs to be incremental rather than big-bang – as has been stated for a year
• 2 Activities in parallel:
  - Deploy components into LCG-2 certification test-bed and then to pre-production
  - Deploy pre-production sites in parallel
• PPS and Production
  - Are evolutionary LCG-2 gLite components
• Cannot provide LCG-2 end-of-life estimate/deadlines
  - LCG-2 is the fallback solution
• Applications must test services and decide which ones they need
[Timeline diagram, 2004 to 2005. Labels: LCG-2 (=EGEE-0); prototyping; product; LCG-3 (=EGEE-x?); product]
Review recommendations to SA1
• Consider the current gLite as a stepping stone towards a more robust standards-based infrastructure, rather than a final deployment solution. Select additional components for integration and deployment through collaborations with other international middleware R&D initiatives.
  - Work with Globus, VDT, OSG, etc on common solutions/interfaces – but has to be driven by the applications and experience from operations
  - Should be in situation to be able to deploy components needed by the applications
  - Integration and certification process mechanism from selecting other components
Review recommendations to SA1
• Continue to conduct application-driven investigation that may result in complex usage scenarios and consider how the advanced middleware and infrastructure would support them in a viable manner. As such, keep a keen eye on new generations of production-level Grid middleware from various international groups that go beyond gLite features.
  - For HEP – Data challenges and service challenges bring specific goals and targets (and timescales) – this will continue
  - Other applications might consider similar exercises – define some goals
Milestones for rest of project
• M14: full production grid in production
  - 9 ROCs, 5 CICs (include Russia at M12), 20 sites
  - Should be based on EGEE re-engineered middleware.
    - This is dependent on the quality and robustness of gLite components
    - Experience: takes 6 months to put new software into production
    - Will not deploy new components unless they improve upon existing components or add new required functionality
• M21: expanded production infrastructure in place
  - As above, but expanded to 50 sites
  - Now decoupled from specific gLite release
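The "M" labels count project months. A small helper can translate them into calendar months; this is a sketch that assumes project month M1 is April 2004, which the slides do not state, and it returns only the first day of the month because the slides do not say whether a milestone falls at the start or the end of its month. The names `PROJECT_START` and `project_month_to_date` are illustrative.

```python
from datetime import date

# Assumption: project month M1 is April 2004 (not stated in the slides).
PROJECT_START = date(2004, 4, 1)


def project_month_to_date(month_number, start=PROJECT_START):
    """Return the first day of project month M<month_number>, with M1 = start month."""
    if month_number < 1:
        raise ValueError("project months are numbered from 1")
    # Count months from January of the start year, then split into year and month.
    offset = (start.month - 1) + (month_number - 1)
    return date(start.year + offset // 12, offset % 12 + 1, 1)


for label in (12, 14, 21, 24):
    print(f"M{label}:", project_month_to_date(label).strftime("%B %Y"))
```

Under that assumption M14 falls in May 2005 and M21 in December 2005, about a month and eight months after this talk respectively.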
Deliverables for rest of project
• Release notes corresponding to milestones
  - Updated relative to first set of release notes; snapshots corresponding to milestones
  - NB. ALL releases are accompanied by full set of release notes
• EGEE “Cookbook”
  - Foreseen as planning guides to assist new participants to join or build components of the infrastructure.
    - Resource centres and their administrators
    - ROCs, CICs, and VOs
  - Templates and checklists to assist administrators to: design a facility, determine what resources to acquire, how to configure them, etc.
  - Detailed enough to allow admins to understand what the limitations of the system are and how to address them (e.g. what services can run on 1 machine, how to configure, etc.)
  - Make use of expertise of CICs, ROCs and staff in RCs (“and use technical writers in NA3”)
• M24: Assessment of infrastructure operation throughout the project
  - Remove suggestions on long-term sustainability put into EGEE-2 planning
Summary
• Production grid is operational and in use
  - Larger scale than foreseen, use in 2004 probably the first time such a set of large scale grid productions has been done
  - Modest growth in resources foreseen over next year
• Operational infrastructure in place and working
  - Need to continue to improve reliability of service
  - Need to continue to improve user support
• Support for applications and VOs
  - VO deployment should become still simpler and more routine
  - Application support needs more resources than foreseen
• Deployment and migration to gLite is now a major focus