Ian Bird
LCG Deployment Area Manager & EGEE Operations Manager
IT Department, CERN
Presentation to HEPiX, 22nd October 2004
LCG Operations
22 October 2004 2
Grid Operations: Scope of Responsibilities
• Certification activities
  – Certification of middleware as a coherent set of services
  – Preparing that package for deployment
• Operational and support activities
  – Coordinating and supporting the deployment to collaborating computer centres
  – Coordinating Grid Operations activities
  – Providing operational support
  – Providing operational security support
  – Providing user support
  – CA management
  – VO registration and management
• Policy
  – CA and user registration policies
  – Operational policy
  – Security policies
  – Resource usage and access policies
LHC Computing Model (simplified!!)
• Tier-0 – the accelerator centre
  – Filter → raw data; reconstruction → event summary data (ESD)
  – Record the master copy of raw and ESD
• Tier-1 –
  – Managed Mass Storage – permanent storage of raw, ESD, calibration data, meta-data, analysis data and databases → grid-enabled data service
  – Data-heavy (ESD-based) analysis
  – Re-processing of raw data
  – National, regional support
  – “online” to the data acquisition process: high availability, long-term commitment
• Tier-2 –
  – Well-managed, grid-enabled disk storage
  – End-user analysis – batch and interactive
  – Simulation
[Diagram: Tier-1 centres, Tier-2 centres, small centres, desktops and portables. Sites shown: RAL, IN2P3, FNAL, USC, Krakow, CIEMAT, Rome, Taipei, LIP, CSCS, Legnaro, UB, IFCA, IC, MSU, Prague, Budapest, Cambridge, IFIC, NIKHEF, TRIUMF, CNAF, FZK, BNL, PIC, ICEPP, Nordic, ….]
LCG-2
25 universities, 4 national labs, 2800 CPUs
Grid3
30 sites, 3200 CPUs
Total: 78 sites, ~9000 CPUs, 6.5 PByte
Operations services for LCG
• Operational support
  – Hierarchical model
    • CERN acts as 1st level support for the Tier 1 centres
    • Tier 1 centres provide 1st level support for associated Tier 2s
      – Tier 1 “Primary sites”
  – Grid Operations Centres (GOC)
    • Provide operational monitoring, troubleshooting, coordination of incident response, etc.
    • RAL (UK) led sub-project to prototype a GOC
    • 2nd GOC in Taipei now in prototype
• User support
  – Central model
    • FZK provides user support portal
      – Problem tracking system, web-based and available to all LCG participants
    • Experiments provide triage of problems
  – CERN team provides in-depth support and support for integration of experiment sw with grid middleware
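The hierarchical model above can be read as a chain of first-level support: a Tier 2 turns to its associated Tier 1, and a Tier 1 turns to CERN. A minimal sketch of that chain follows. The site names (`tier2-a`, `tier1-x`, etc.) and the helper `escalation_path` are hypothetical and used only for illustration; the slides do not say which Tier 2 is associated with which Tier 1.

```python
# Hypothetical site-to-support associations. Only the shape of the chain
# (Tier 2 -> Tier 1 -> CERN) comes from the slide; the names are invented.
FIRST_LEVEL_SUPPORT = {
    "tier2-a": "tier1-x",
    "tier2-b": "tier1-x",
    "tier2-c": "tier1-y",
    "tier1-x": "CERN",
    "tier1-y": "CERN",
}


def escalation_path(site):
    """Return the support levels above `site`, nearest first."""
    path = []
    while site in FIRST_LEVEL_SUPPORT:
        site = FIRST_LEVEL_SUPPORT[site]
        path.append(site)
    return path


if __name__ == "__main__":
    for name in ("tier2-a", "tier1-y"):
        print(name, "->", " -> ".join(escalation_path(name)))
```

A flat dictionary is enough here because each site has exactly one first-level contact in this model.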
Support Teams within LCG
• Global Grid User Support (GGUS): single point of contact, coordination of user support
• CERN Deployment Support (CDS): middleware problems
• Grid Operations Center (GOC): operations problems
• Resource Centers (RC): hardware problems
• Experiment Specific User Support (ESUS): software problems
• User communities
  – 4 LHC experiments (ALICE, ATLAS, CMS, LHCb)
  – 4 non-LHC experiments (BaBar, CDF, Compass, D0)
  – Other communities (VOs)
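The labels above pair each support team with one class of problem, with GGUS as the single point of contact. A small sketch of that routing is given below. The function name `route` and the fallback of unclassified problems to GGUS are assumptions made for illustration, not something the slide specifies.

```python
# Problem class -> support team, following the pairing of labels on the slide.
SUPPORT_TEAMS = {
    "middleware": "CERN Deployment Support (CDS)",
    "operations": "Grid Operations Center (GOC)",
    "hardware": "Resource Centers (RC)",
    "software": "Experiment Specific User Support (ESUS)",
}

SINGLE_POINT_OF_CONTACT = "Global Grid User Support (GGUS)"


def route(problem_class):
    """Return the team for a problem class.

    Assumption: anything that is not classified stays with GGUS.
    """
    return SUPPORT_TEAMS.get(problem_class.lower(), SINGLE_POINT_OF_CONTACT)


if __name__ == "__main__":
    for kind in ("Middleware", "hardware", "unknown"):
        print(kind, "->", route(kind))
```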
Experiences in deployment
• LCG covers many sites (>70) now – both large and small
  – Large sites – existing infrastructures – need to add on grid interfaces etc.
  – Small sites want a completely packaged, push-button, out-of-the-box installation (including batch system, etc)
  – Satisfying both simultaneously is hard – requires very flexible packaging, installation, and configuration tools and procedures
    • A lot of effort had to be invested in this area
• There are many problems – but in the end we are quite successful
  – System is stable and reliable
  – System is used in production
  – System is reasonably easy to install now – 60 sites
  – Now have a basis on which to incrementally build essential functionality
• This infrastructure forms the basis of the initial EGEE production service
• LCG Operations → EGEE Operations
What is EGEE? (I)
• EGEE (Enabling Grids for E-science in Europe) is a seamless Grid infrastructure for the support of scientific research, which:
  – Integrates current national, regional and thematic Grid efforts
  – Provides researchers in academia and industry with round-the-clock access to major computing resources, independent of geographic location
[Diagram: layers labelled Applications, Grid infrastructure, Geant network]
What is EGEE? (II)
• 70 leading institutions in 28 countries, federated in regional Grids
• 32 M Euros EU funding (2004-5), O(100 M) total budget
• Aiming for a combined capacity of over 8000 CPUs (the largest international Grid infrastructure ever assembled)
• ~ 300 persons
EGEE Activities
• Emphasis on operating a production grid and supporting the end-users
• 48 % service activities (Grid Operations, Support and Management, Network Resource Provision)
• 24 % middleware re-engineering (Quality Assurance, Security, Network Services Development)
• 28 % networking (Management, Dissemination and Outreach, User Training and Education, Application Identification and Support, Policy and International Cooperation)
LCG and EGEE Operations
• EGEE is funded to operate and support a research grid infrastructure in Europe
• The core infrastructure of the LCG and EGEE grids is now operated as a single service, growing out of the LCG service
  – LCG includes US and Asia-Pacific, EGEE includes other sciences
  – Substantial part of infrastructure common to both
• LCG Deployment Manager is the EGEE Operations Manager
  – CERN team (Operations Management Centre) provides coordination, management, and 2nd level support
• Support activities are expanded with the provision of
  – Core Infrastructure Centres (CIC) (4)
  – Regional Operations Centres (ROC) (9)
  – ROCs are coordinated by Italy, outside of CERN (which has no ROC)
LCG → EGEE in Europe
• User support:
  – Becomes hierarchical
  – Through the Regional Operations Centres (ROC)
    • Act as front-line support for user and operations issues
    • Provide local knowledge and adaptations
• Coordination:
  – At CERN (Operations Management Centre) and CIC for HEP
• Operational support:
  – The LCG GOC is the model for the EGEE CICs
    • CICs replace the European GOC at RAL
    • Also run essential infrastructure services
    • Provide support for other (non-LHC) applications
    • Provide 2nd level support to ROCs
Summary
• Data challenges – demonstrated:
  – Many m/w functional and performance issues (documented)
  – Main problem is service stability
    • Site fabric management, configuration, change control
    • Etc
  – Grid3 report similar problems …
  – User support process needs improvement
• Now moving into continuous production + service & data challenges
How to move forward – 1
• Build an agreed operations model for the next year
  – Should be able to evolve
• Operations/Fabric workshop Nov 2 – 4
  – HEPiX ½ day – input from some sites and Grid3/OSG on their plans
  – Documenting use-cases (based on experience), propose support mechanisms for each
  – EGEE SA1 infrastructure
  – 5 working groups:
    • Operations support
    • User support
    • Operational security
    • Fabric management issues
    • SW needs and tools requirements from operations
• Need fabric management training for many sites
Some issues
• Resource Centres:
  – Large sites – have operations staff and/or on-call support
  – Small sites – have no on-call and often little support at all
• Regional Operations Centres:
  – Probably do not provide after-hours or on-call support. If they did, the support model could rely more on the ROCs. However, it is clear that most ROCs will not have this level of support.
• Core Infrastructure Centres:
  – Must have on-call support after-hours
    • To be rotated through the 4 or 5 active CICs
Thus, a basic question to answer is how much power or control the CICs can have in order to deal with problems when staff at RCs and ROCs are not available.
  – Either CICs have rights to manage critical services on sites where there is no support, or
  – They have the right to remove “broken” sites and services from the infrastructure.
    • Likely that we have all combinations of these …
Immediate actions
• Weekly operations meeting (Monday afternoon)
  – Weekly reports from ROCs, CICs, other Tier 1s etc
• Operations Manager –
  – Role rotates through the 4 EGEE CICs – manage problem reporting and follow up
  – Hand over responsibility in weekly meeting
• Operational security team
  – Being set up – led by Ian Neilson, strong collaboration between US and Europe on these issues.
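The weekly hand-over of the Operations Manager role among the four CICs amounts to a round-robin schedule. The sketch below shows one way to compute who is on duty in a given week. The CIC names, the function `on_duty` and the start date of the rotation are assumptions chosen for illustration (the start date is simply a Monday, matching the Monday meeting); none of them come from the slides.

```python
from datetime import date

# Hypothetical names: the slides say only that there are four EGEE CICs.
CICS = ["CIC-A", "CIC-B", "CIC-C", "CIC-D"]

# Assumed first hand-over date (a Monday); not stated in the slides.
ROTATION_START = date(2004, 11, 1)


def on_duty(day, cics=CICS, start=ROTATION_START):
    """Return the CIC holding the Operations Manager role in the week of `day`."""
    weeks_elapsed = (day - start).days // 7
    return cics[weeks_elapsed % len(cics)]


if __name__ == "__main__":
    for d in (date(2004, 11, 1), date(2004, 11, 8), date(2004, 11, 29)):
        print(d.isoformat(), on_duty(d))
```

If a fifth CIC becomes active, as the "4 or 5 active CICs" remark allows, appending it to the list extends the cycle without other changes.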