LCG Status and Plans
GridPP13, Durham, UK
4th July 2005
Ian Bird, IT/GD, CERN
Overview
• Introduction: project goals and overview
• Status: Applications Area, Fabric, Deployment and Operations
• Baseline Services
• Service Challenges
• Summary
LCG – Goals
• The goal of the LCG project is to prototype and deploy the computing environment for the LHC experiments
• Two phases:
  – Phase 1 (2002–2005): build a service prototype based on existing grid middleware; gain experience in running a production grid service; produce the TDR for the final system
  – Phase 2 (2006–2008): build and commission the initial LHC computing environment
• We are here
• LCG and Experiment Computing TDRs completed and presented to the LHCC last week
Project Areas & Management (joint with EGEE)

• Applications Area (Pere Mato): development environment; joint projects; data management; distributed analysis
• Middleware Area (Frédéric Hemmer): provision of a base set of grid middleware (acquisition, development, integration); testing, maintenance, support
• CERN Fabric Area (Bernd Panzer): large cluster management; data recording; cluster technology; networking; computing service at CERN
• Grid Deployment Area (Ian Bird): establishing and managing the Grid Service – middleware, certification, security, operations, registration, authorisation, accounting
• Distributed Analysis – ARDA (Massimo Lamanna): prototyping of distributed end-user analysis using grid technology

Project Leader: Les Robertson
Resource Manager: Chris Eck
Planning Officer: Jürgen Knobloch
Administration: Fabienne Baud-Lavigne
Applications Area
Application Area Focus
• Deliver the common physics applications software
• Organized to ensure focus on real experiment needs: experiment-driven requirements and monitoring; architects in management and execution; open information flow and decision making; participation of experiment developers; frequent releases enabling iterative feedback
• Success defined by experiment validation: integration, evaluation, successful deployment
Validation Highlights
• POOL successfully used in large-scale production in the ATLAS, CMS and LHCb data challenges in 2004: ~400 TB of POOL data produced; the objective of a quickly developed persistency hybrid leveraging ROOT I/O and RDBMSes has been fulfilled (see the toy sketch after this list)
• Geant4 firmly established as the baseline simulation in successful ATLAS, CMS and LHCb production: EM & hadronic physics validated; highly stable – about 1 G4-related crash per O(1M) events
• SEAL components underpin POOL’s success, in particular the dictionary system; now entering a second generation with Reflex
• SPI’s Savannah project portal and external software service are accepted standards inside and outside the project
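A purely illustrative toy sketch of that hybrid idea – bulk event payloads kept in ordinary files (ROOT I/O in the real system), with a relational catalogue used to locate them. The table schema, function names and SQLite stand-in are invented for illustration; this is not POOL’s actual interface:

    # Toy illustration of a hybrid persistency scheme: payloads live in files,
    # while a relational catalogue maps logical names to the physical files.
    # Names and schema are invented, not POOL's API.
    import sqlite3

    db = sqlite3.connect(":memory:")    # stand-in for the RDBMS back-end
    db.execute("CREATE TABLE catalogue (lfn TEXT PRIMARY KEY, pfn TEXT)")

    def register_file(lfn, pfn):
        """Record a logical -> physical file name mapping in the catalogue."""
        db.execute("INSERT INTO catalogue (lfn, pfn) VALUES (?, ?)", (lfn, pfn))
        db.commit()

    def lookup(lfn):
        """Resolve a logical file name to the file holding the event payload."""
        row = db.execute("SELECT pfn FROM catalogue WHERE lfn = ?", (lfn,)).fetchone()
        return row[0] if row else None

    register_file("lfn:/dc2/events_0001", "/data/events_0001.root")
    print(lookup("lfn:/dc2/events_0001"))   # -> /data/events_0001.root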
Current AA Projects
• SPI – Software Process Infrastructure (A. Aimar): software and development services – external libraries, Savannah, software distribution, support for build, test, QA, etc.
• ROOT – Core Libraries and Services (R. Brun): foundation class libraries, math libraries, framework services, dictionaries, scripting, GUI, graphics, etc.
• POOL – Persistency Framework (D. Duellmann): storage manager, file catalogs, event collections, relational access layer, conditions database, etc.
• SIMU – Simulation Project (G. Cosmo): simulation framework, physics validation studies, MC event generators, participation in Geant4 and Fluka
SEAL and ROOT Merge
• The major change in the AA has been the merge of the SEAL project with the ROOT project
• Details of the merge are being discussed following a process defined by the AF: breakdown into a number of topics; proposals discussed with the experiments; public presentations; final decisions by the AF
• Current status: dictionary plans approved; MathCore and Vector library proposals approved; first development release of ROOT including these new libraries
Ongoing work

• SPI: porting LCG-AA software to amd64 (gcc 3.4.4); finalizing software distribution based on Pacman; QA tools – test coverage and Savannah reports
• ROOT: development version v5.02 released last week, including new libraries – mathcore, reflex, cintex, roofit
• POOL: version 2.1 released, including new file catalog implementations – LFCCatalog (LFC), GliteCatalog (gLite Fireman), GTCatalog (Globus Toolkit); new version of the conditions database (COOL) 1.2; adapting POOL to the new dictionaries (Reflex)
• SIMU: new Geant4 public minor release 7.1 being prepared; public release of Fluka expected by end of July; intense activity in the combined calorimeter physics validation with ATLAS, report in September; new MC generators (CASCADE, CHARYBDIS, etc.) being added to the already long list of generators provided; prototyping persistency of Geant4 geometry with ROOT
Fabric Area
CERN Fabric

• Fabric automation has seen very good progress: the new systems for managing large farms are in production at CERN
• ELFms (the Extremely Large Fabric management system) covers configuration, installation and management of nodes, LEMON (LHC Era Monitoring) for system & service monitoring, and LEAF (LHC Era Automated Fabric) for hardware / state management
• Includes technology developed by the European DataGrid
CERN Fabric
• Fabric automation has seen very good progress: the new systems for managing large farms are in production at CERN
• New CASTOR mass storage system: deployed first on the high-throughput cluster for the recent ALICE data recording computing challenge
• Agreement on collaboration with Fermilab on a Linux distribution: Scientific Linux, based on Red Hat Enterprise 3; improves uniformity between the HEP sites serving the LHC and Run 2 experiments
• CERN computer centre preparations: power upgrade to 2.5 MW; computer centre refurbishment well under way; acquisition process started
Preparing for 7,000 boxes in 2008
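A back-of-envelope check on these two figures: dividing the 2.5 MW power upgrade mentioned on the previous slide by 7,000 boxes gives roughly 2.5 × 10^6 W / 7,000 ≈ 360 W per box – assuming, purely for illustration, that the whole budget goes to the compute nodes with no allowance for cooling or other overheads.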
High Throughput Prototype (openlab/LCG)

• Experience with likely ingredients in LCG: 64-bit programming; next-generation I/O (10 Gb Ethernet, InfiniBand, etc.)
• High-performance cluster used for evaluations, and for data challenges with experiments
• Flexible configuration: components moved in and out of the production environment
• Co-funded by industry and CERN
ALICE Data Recording Challenge

• Target: one week sustained at 450 MB/sec
• Used the new version of the CASTOR mass storage system
• Note the smooth degradation and recovery after equipment failure
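Simple arithmetic on the stated target gives a sense of the volume involved: 450 MB/s × 7 days × 86,400 s/day ≈ 2.7 × 10^8 MB, i.e. roughly 270 TB written through CASTOR over the week.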
Deployment and Operations
Computing Resources: June 2005

• In LCG-2: 139 sites in 32 countries; ~14,000 CPUs; ~5 PB storage
• Includes non-EGEE sites: 9 countries, 18 sites
• [Map: countries providing resources and countries anticipating joining]
• The number of sites is already at the scale expected for LHC – it demonstrates the full complexity of operations
Operations Structure

• Operations Management Centre (OMC): at CERN – coordination etc.
• Core Infrastructure Centres (CIC): manage daily grid operations – oversight, troubleshooting; run essential infrastructure services; provide 2nd-level support to the ROCs; UK/I, Fr, It, CERN, plus Russia (M12); hope to get non-European centres
• Regional Operations Centres (ROC): act as front-line support for user and operations issues; provide local knowledge and adaptations; one in each region – many are distributed
• User Support Centre (GGUS): at FZK – support portal providing a single point of contact (service desk)
Grid Operations

• The grid is flat, but there is a hierarchy of responsibility – essential to scale the operation
• The CICs act as a single Operations Centre: operational oversight (grid operator) responsibility rotates weekly between CICs; problems are reported to the ROC/RC
• The ROC is responsible for ensuring a problem is resolved, and oversees the regional RCs
• ROCs are responsible for organising operations in a region: coordinating deployment of middleware, etc.
• CERN coordinates sites not associated with a ROC
[Diagram: operational hierarchy – OMC at the centre, CICs, ROCs and their RCs (RC = Resource Centre)]
It is in setting up this operational infrastructure that we have really benefited from EGEE funding
Grid monitoring
A selection of monitoring tools:
• GIIS Monitor + monitor graphs
• Site Functional Tests
• GOC database
• Scheduled downtimes
• Live job monitor
• GridIce – VO + fabric view
• Certificate lifetime monitor
• Operation of Production Service: real-time display of grid operations
• Accounting information
Operations focus

Main focus of activities now:
• Improving operational reliability and application efficiency: automating monitoring alarms (a toy sketch follows below); ensuring a 24x7 service; removing sites that fail functional tests; operations interoperability with OSG and others
• Improving user support: demonstrate to users a reliable and trusted support infrastructure
• Deployment of gLite components: testing, certification, pre-production service; migration planning and deployment, while maintaining/growing interoperability
• Further developments now have to be driven by experience in real use
[Diagram: LCG-2 (=EGEE-0) and LCG-3 (=EGEE-x?) – prototyping to product across 2004–2005]
Recent ATLAS work
• ATLAS jobs in EGEE/LCG-2 in 2005: in the latest period, up to 8K jobs/day
• Several times the current capacity for ATLAS at CERN alone – shows the reality of the grid solution
• ~10,000 concurrent jobs in the system
• [Plot: number of jobs/day]
Baseline Services & Service Challenges
Baseline Services: Goals
• Experiments and regional centres agree on baseline services
• These support the computing models for the initial period of LHC, and thus must be in operation by September 2006
• Expose experiment plans and ideas
• Timescales: for the TDR – now; for SC3 – testing and verification, not all components; for SC4 – must have the complete set
• Define services with targets for functionality & scalability/performance metrics
• Very much driven by the experiments’ needs – but try to understand site and other constraints
• Not done yet
Baseline services
Storage management services
Based on SRM as the interface
Basic transfer services gridFTP, srmCopy
Reliable file transfer service
Grid catalogue services Catalogue and data
management tools Database services
Required at Tier1,2 Compute Resource
Services Workload management
VO management services
Clear need for VOMS: roles, groups, subgroups
POSIX-like I/O service local files, and include
links to catalogues Grid monitoring tools
and services Focussed on job
monitoring VO agent framework Applications software
installation service Reliable messaging
service Information system
Nothing really surprising here – but a lot was clarified in terms of requirements, implementations, deployment, security, etc
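A minimal sketch of two of these services in use – a VOMS proxy (VO management) followed by a gridFTP transfer (basic transfer service). It assumes the standard voms-proxy-init and globus-url-copy command-line clients are installed; the VO name, endpoints and exact options shown are illustrative assumptions, not a reference:

    # Sketch only: obtain a VOMS proxy, then copy a file between gridFTP
    # endpoints. Requires the voms-proxy-init and globus-url-copy clients.
    import subprocess

    # VO management service: proxy carrying VO / group / role attributes.
    subprocess.run(["voms-proxy-init", "--voms", "atlas"], check=True)

    # Basic transfer service: copy between two storage elements
    # (hypothetical endpoints).
    src = "gsiftp://se1.example.org/data/file1"
    dst = "gsiftp://se2.example.org/data/file1"
    subprocess.run(["globus-url-copy", src, dst], check=True)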
Preliminary: Priorities
Service                               ALICE   ATLAS   CMS   LHCb
Storage Element                         A       A      A     A
Basic transfer tools                    A       A      A     A
Reliable file transfer service          A       A     A/B    A
Catalogue services                      B       B      B     B
Catalogue and data management tools     C       C      C     C
Compute Element                         A       A      A     A
Workload Management                     B       A      A     C
VO agents                               A       A      A     A
VOMS                                    A       A      A     A
Database services                       A       A      A     A
POSIX-like I/O                          C       C      C     C
Application software installation       C       C      C     C
Job monitoring tools                    C       C      C     C
Reliable messaging service              C       C      C     C
Information system                      A       A      A     A

A: high priority, mandatory service
B: standard solutions required, but experiments could select different implementations
C: common solutions desirable, but not essential
Service Challenges – ramp up to LHC start-up service
Timeline:
• Jun 05 – Technical Design Report
• Sep 05 – SC3 service phase
• May 06 – SC4 service phase
• Sep 06 – initial LHC Service in stable operation
• Apr 07 – LHC Service commissioned; followed by cosmics, first beams, first physics and the full physics run (2007–2008)

• SC2 – reliable data transfer (disk–network–disk): 5 Tier-1s, aggregate 500 MB/sec sustained at CERN
• SC3 – reliable base service: most Tier-1s, some Tier-2s; basic experiment software chain; grid data throughput of 500 MB/sec including mass storage (~25% of the nominal final throughput for the proton period)
• SC4 – all Tier-1s, major Tier-2s: capable of supporting the full experiment software chain, including analysis; sustain the nominal final grid data throughput
• LHC Service in operation – September 2006: ramp up to full operational capacity by April 2007; capable of handling twice the nominal data throughput
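Taking the quoted percentage at face value, the SC3 target of 500 MB/sec at ~25% of the nominal final throughput implies a nominal figure of roughly 500 / 0.25 = 2,000 MB/sec (~2 GB/sec) of grid data throughput – the level SC4 must sustain, with the operational service expected to handle twice that.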
See Jamie’s talk for more details.

Baseline Services, Service Challenges, Production Service, Pre-production service, gLite deployment, …
… confused?
Services …
• Baseline services: the set of essential services that the experiments need to be in production by September 2006; components verified in SC3 and SC4
• Service challenges: the ramp-up of the LHC computing environment – building up the production service, based on the results and lessons of the service challenges
• Production service: the evolving service, putting in place new components prototyped in SC3 and SC4; no big-bang changes, but many releases!
• gLite deployment: as new components are certified, they will be added to the production service releases, either in parallel with or replacing existing services
• Pre-production service: should literally be a preview of the production service, but is a demonstration of gLite services at the moment – this has been forced on us by many other constraints (urgency to “deploy” gLite, need for reasonable-scale testing, …)
Releases and Distributions

• We intend to maintain a single line of production middleware distributions: middleware releases from [JRA1, VDT, LCG, …]; middleware distributions for deployment from GDA/SA1
• Remember: the announcement of a release is months away from a deployable distribution (based on the last two years’ experience)
• Distributions are still labelled “LCG-2.x.x”; we would like to change to something less specific, to avoid LCG/EGEE confusion
• Frequent updates for Service Challenge sites – but only needed for SC sites
• Frequent updates as gLite is deployed – not clear if all sites will deploy all gLite components immediately
• This is unavoidable – a strong request from the LHC experiment spokesmen to the LCG POB: “early, gradual and frequent releases of the [baseline] services is essential rather than waiting for a complete sets”
• Throughout all this, we must maintain a reliable production service, which gradually improves in reliability and performance
Summary
• We are at the end of LCG Phase 1: a good time to step back and look at achievements and issues
• LCG Phase 2 has really started: consolidation of AA projects; baseline services; service challenges and experiment data challenges; acquisition process starting
• No new developments – make what we have work absolutely reliably, and be scalable and performant
• The timescale is extremely tight
• We must ensure that we have appropriate levels of effort committed