Project Status Report
Ian Bird
WLCG Overview Board, 3rd September 2010, CERN
• Experience with data
• Service quality
• Resource installations
• Milestones
• Budget issues
• RRB issues
• Tier 3s
• Service evolution
Topics
5 months with LHC data
LHCb: 70 TB raw data since June
ALICE: 550 TB
ATLAS: 1.7 PB raw
CMS:
– 220 TB of raw data at 7 TeV
– 70 TB of cosmics during this period
– 110 TB of tests and exercises with the trigger
• Large numbers of analysis users: CMS ~500, ATLAS ~1000, LHCb/ALICE ~200
• Use remains consistently high – 1 M jobs/day; 100k CPU-days/day
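The two headline figures imply an average job length, which is worth a quick back-of-envelope check (pure arithmetic on the numbers quoted above, no WLCG specifics assumed):

```python
# Back-of-envelope: average CPU time per grid job from the headline figures.
jobs_per_day = 1_000_000        # ~1 M jobs/day across WLCG
cpu_days_per_day = 100_000      # ~100k CPU-days/day delivered

avg_cpu_days_per_job = cpu_days_per_day / jobs_per_day
avg_cpu_hours_per_job = avg_cpu_days_per_job * 24

print(f"Average CPU time per job: {avg_cpu_hours_per_job:.1f} hours")
# -> Average CPU time per job: 2.4 hours
```

So the typical grid job at this time ran for a couple of CPU-hours, consistent with a mix of short analysis jobs and longer production jobs.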
WLCG usage
[Charts: 1 M jobs/day; 100k CPU-days/day – per-experiment breakdown, including LHCb and CMS]
ALICE: ~200 users, 5-10% of Grid resources
Job workloads
• CMS: 100k jobs per day (red: analysis)
• LHCb
• ATLAS: analysis jobs
• ALICE: 60 sites, ~20k jobs running in parallel
Data transfers
• ATLAS throughput: 500 MB/s with 24x24 bunches; 800 MB/s with 48x48 bunches
– CMS saw the effect of going from 24x24 to 48x48 bunches
• Traffic on the OPN up to 70 Gb/s! (ATLAS reprocessing campaigns)
• Data transfers (mostly!) at rates anticipated
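The bunch-configuration step can be put next to the observed rates. This is a small illustrative comparison using only the figures on the slide; the linear-scaling expectation is an assumption for illustration, not an ATLAS statement:

```python
# Sketch: compare the observed ATLAS transfer-rate growth with the growth in
# the colliding-bunch configuration. Figures are those quoted on the slide.
rates_mb_s = {"24x24": 500, "48x48": 800}   # observed throughput, MB/s
bunches = {"24x24": 24, "48x48": 48}        # bunches per beam

rate_growth = rates_mb_s["48x48"] / rates_mb_s["24x24"]   # 1.6x
bunch_growth = bunches["48x48"] / bunches["24x24"]        # 2.0x

print(f"Throughput grew {rate_growth:.1f}x while bunch count grew {bunch_growth:.1f}x")
```

Throughput grew somewhat more slowly than the bunch count, i.e. the data rate did not simply scale linearly with the machine configuration.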
Data distribution
ATLAS: total throughput T0-T1; T1-T1; T1-T2
CMS: T0 – T1
LHCb: T0 – T1
Data distribution for analysis
• For all experiments: early data has been available for analysis within hours of data taking
Resource status
• CNAF: remainder of disk and CPU to be available 1st Sep
• NL-T1: had serious issues with disks; data corruption meant 2/3 was unusable for some time
• ASGC: remaining 1 PB of disk not available before end Oct
Issues with resources
• Significant use of Tier 2s for analysis – the frequently expressed concern that too much analysis would be done at CERN is not borne out
CPU – July
• Tier 0 capacity underused in general – but this is expected to change as luminosity increases
Service incidents – Q2 2010
Site | Date | Duration | Service | Impact
RAL | 30 June | - | SE | 1083 CMS files were lost
CERN | 29 June | 4 h | CASTOR | CASTOR outage due to AFS
CERN | 29 June | 5 h | AFS | Complete FC disk array failure – affected CASTOR and also LHC!
CERN | 28 June | 4+ h | CASTOR | Log volume slowed down the Castor instances
ASGC | 29 June | ~15 hours | 3D DB | ASGC didn't apply stream LCRs from the central 3D DB for 15 hours
CERN | 26 June | ~50 min | ATLAS offline DB (ATLR) | 9 Oracle services did not fail over properly after a node eviction
CERN | 24-25 June | ~10 h | LHCb streaming | Streaming of LHCb data to PIC not working for 10 hours; streaming to the other Tier 1 sites not working for 40 minutes
CERN | 22 June | - | CASTOR | High LDAP load caused CASTOR to become unresponsive
KIT/GridKa | 12 June | ~3:15 h | CMS dCache | Service down
CERN | 7 June | ~3 h | CREAM CE | Job submission failure
CERN | 2 June | - | ATLAS and LHCb online and offline databases | Database access and quality of DB services compromised
CERN | 31 May, 1 June | - | CMS online, LCGR and ATLAS offline databases | Database services unavailable during patching
CERN | 26 May | - | CMS offline database | HW failure affecting one node
PIC | 21 May | 19 hours | Whole site | Site power cut; outage
CERN | 14 May | - | CASTOR | Data loss from incorrect recycling
GGUS | 12 May | <= 4.5 hours | .de domain | Domain did not exist
CERN-ASGC | 12-15 May | - | LHCOPN | Reduced bandwidth
CNAF | 28-29 April | 9 hours & 12 hours | StoRM | SRM blockage (hardware) followed by MCDISK full and a StoRM bug
IN2P3 | 26 Apr | 17.5 hours | AFS | Distributed file system (AFS) crashed after server overload; batch also affected
IN2P3 | 24 Apr | 17 hours | Batch services | Location service stopped responding to requests, blocking most batch system commands
IN2P3 | 20 Apr | 9 hours & 5 days | Grid downtime notification | Grid downtime notifications were impossible after two consecutive incidents
Site | Date | Duration | Service | Impact
NL-T1 | Aug | > 3 weeks | DB | ATLAS NL-T1 cloud down; LHCb T1 site down
CERN | 9 Aug | 16 h | LHCb online | LHCBONR database unavailable, preventing data taking
PIC | 25 Jul | 30 h | CE | Service degradation: cooling problem caused about 50% of WNs to be shut down (running jobs killed)
PIC | 22 Jul | 10 h | SE | SRM service not available for ATLAS due to a problem with the dCache pool cost configuration
PIC | 20 Jul | 3 h | CE | Computing service not available after a scheduled downtime, due to a wrong gridmapdir migration
CERN | 19 Jul | 2 h | several | Cooling failure in the vault
OSG/GOC | 15 Jul | 4 h | GOC | GOC service outage
KIT | 10 Jul | 4+ h | site | Outage of central and local services due to a cooling failure
NL-T1 | 5 Jul | 1 week | SE | Reduced availability caused by data corruption
NDGF | 14 Jul | 16 h | SE | srm.ndgf.org downtime followed by degradation
NDGF | 8 Jul | 3 h | LFC | LFC downtime on lfc1.ndgf.org
KIT | 5 Jul | 18 h | SE | CMS dCache SE down because of hardware failure
Incidents – Q3 2010
NL-T1 DB issue: illustrates the criticality of the LFC for ATLAS. We must set up a live stand-by for all ATLAS Tier 1s – already done at CNAF; it could be hosted centrally.
• A configuration error in Castor resulted in data being directed across all available tape pools instead of to the dedicated raw-data pools
– For ALICE, ATLAS and CMS this included a pool where the tapes were recycled after a certain time
• As a result, a number of files were lost on tapes that were recycled
• For ATLAS and CMS the tapes had not been overwritten and could be fully recovered (the fall-back would have been to re-copy files back from the Tier 1s)
• For ALICE, 10k files were on tapes that were recycled, including 1700 files of 900 GeV data
• Actions taken:
– Underlying problem addressed; all recycle pools removed
• Software change procedures being reviewed now
– Action to improve user-facing monitoring in Castor
– Tapes sent to IBM and Sun for recovery – ~97% of critical (900 GeV sample) files recovered, ~50% of all ALICE files
– Work with ALICE to ensure that two copies of the data are always available
• In heavy-ion running there is a risk for several weeks until all data is copied to the Tier 1s; several options to mitigate this risk are under discussion
– Since this was essentially a procedural problem, we will organise a review of Castor operations procedures (sw development, deployment, operation, etc.) together with the experiments and outside experts – timescale of September
ALICE data loss (May)
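A mis-routing of this kind can be caught by a pre-deployment configuration check. The sketch below is hypothetical: the pool names, the `raw_` naming convention, and the flat routing dictionary are invented for illustration and do not reflect Castor's real configuration format. It simply rejects any routing that maps a raw-data service class to a recyclable pool.

```python
# Hypothetical sanity check: refuse a tape-routing configuration that sends
# raw data to a pool whose tapes can be recycled. All names are illustrative.
RECYCLABLE_POOLS = {"scratch_pool", "recycle_pool"}

def validate_routing(routing: dict) -> list:
    """Return violations: raw-data service classes mapped to recyclable pools."""
    return [
        (svc_class, pool)
        for svc_class, pool in routing.items()
        if svc_class.startswith("raw_") and pool in RECYCLABLE_POOLS
    ]

# Example: a mis-routed raw-data class, analogous to the May incident.
routing = {"raw_alice": "recycle_pool", "raw_atlas": "atlas_raw_pool"}
violations = validate_routing(routing)
print("Violations:", violations)
```

The point of the sketch is the process, not the code: making such an invariant explicit and machine-checked is exactly the kind of step a review of change procedures would consider.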
• Pilot jobs, glexec, SCAS, etc. (1st July)
– No pressure at the moment as data taking takes priority
• CREAM deployment:
– >100 sites have a CREAM CE
– Still missing job submission from Condor-G (needed by ATLAS)
– Stability and reliability of CREAM still not good enough to replace the LCG-CE; many open tickets
• Gathering of installed capacity
Milestones
• Storage accounting reports:
– For Tier 1 sites, proto-report for beginning of Sep
– For Tier 2 sites, for end of Oct
– Before these are regularly published, significant effort has to be devoted to checking the published data (see slide on installed capacity) – lots missing!
• From DM and virtualisation activities:
– Create a small architectural team
– Address several strategic questions (Nov):
• Xrootd as a strategy for data access by all experiments?
• Need for the ability to reserve a whole WN (to better use cores) – is this a clear requirement for all experiments?
• CERNVM and/or CVMFS as a strategy for all experiments?
• Proposal for better managing the Information System and its data (end Sep)
Milestones – 2
• This action has been pending for a long time – the information was requested by the RRB/RSG
• Significant effort was put into defining and agreeing how to publish and collect the data – via the information system
– The document has been available for some time
• The tool (gstat2) now has the ability to present this information
– Significant effort is still required to validate this data before it is publishable
– We propose to task the Tier 1s with validation for the Tier 2s, but the Tier 1 data is not good yet!
– Sites need to publish good data urgently
Installed capacity
Published capacities ...
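The validation task described above amounts to comparing what sites publish against what they pledged. A minimal sketch of such a check, with invented site names, figures, and tolerance (real validation runs against the information system via tools like gstat2, not against hand-written dictionaries):

```python
# Illustrative validation pass over published capacities: flag sites that
# publish nothing, or whose published figure deviates from the pledge by
# more than a tolerance. All numbers and site names are invented.
pledged = {"T1-A": 5000, "T1-B": 3200, "T2-X": 800}   # e.g. CPU capacity units
published = {"T1-A": 4900, "T1-B": 0}                 # T2-X publishes nothing

def check_capacities(pledged, published, tolerance=0.10):
    problems = []
    for site, pledge in pledged.items():
        value = published.get(site)
        if not value:
            problems.append((site, "nothing published"))
        elif abs(value - pledge) / pledge > tolerance:
            problems.append((site, f"published {value} vs pledge {pledge}"))
    return problems

for site, issue in check_capacities(pledged, published):
    print(site, "->", issue)
```

Even this trivial pass shows why "sites need to publish good data urgently": absent or zero values make any downstream accounting report meaningless.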
• IT asked to reduce 15 MCHF over the MTP (2011-15)
– Proposal of a 12 MCHF reduction accepted – this is what was approved by the FC last week
• Part (8.25 MCHF over 5 years) comes from the LCG budgets:
1. Move to 4-year equipment replacement cycles
• Save 2 MCHF in 2011, very little afterwards
2. Stop the CERN contribution to USLHCNet
• Save 350 kCHF/year
• NB: no CERN contribution to the costs of other Tier 1s
3. Reduce the slope of the Tier 0 computing resource increase
• Save ~1 MCHF/year on average
• This is the main mechanism we have to reduce costs; the current assumption was ~30%/year growth
Budget Issues at CERN
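The three components above can be summed as a rough cross-check against the quoted 8.25 MCHF. Since the Tier 0 slope saving is only "~1 MCHF/year on average", the check is necessarily approximate; the per-year values below are taken straight from the slide, with the slope treated as exactly 1 MCHF/year for illustration:

```python
# Rough cross-check of the stated LCG savings components, in MCHF over 2011-2015.
years = 5
replacement_cycle = 2.0        # 4-year cycles: 2 MCHF in 2011, "very little afterwards"
uslhcnet = 0.35 * years        # 350 kCHF/year CERN contribution stopped
tier0_slope = 1.0 * years      # "~1 MCHF/year on average" from the reduced slope

total = replacement_cycle + uslhcnet + tier0_slope
print(f"Approximate total: {total:.2f} MCHF (slide quotes 8.25 MCHF over 5 years)")
```

The components sum to roughly the quoted figure, with the difference absorbed by the "~" on the Tier 0 slope saving and the post-2011 replacement-cycle residue.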
• Note that these proposals have not been discussed with the experiments, nor with the committees overseeing LCG (LHCC, RRB/C-RSG, OB, etc.)
• Reducing the Tier 0 computing resources for the LHC experiments does not seem wise now that the detectors are ramping up
– Not even taking into account the lack of experience with heavy ions
• Slowing down the replacement cycles and reducing the slope of the computing resource increase:
– Delays the need for additional computing infrastructure for the Tier 0 (e.g. containers)
– Will strictly limit the overall experiment computing growth rate at the Tier 0
– Assumes no additional requests from non-LHC experiments
– Detailed planning requires further studies
• Extending the lifetime of hardware from 3 to 4 years:
– Gains nothing if the maintenance is extended
– Has a negative impact on the power budget
– Requires additional spares, and therefore additional budget
– May impact service quality
– Implies additional effort to cover for repairs
• USLHCNet:
– May have secondary effects ...
Budget implications
• The plan for a new Tier 0 building at CERN is cancelled (noted at the SPC last week)
• The need for containers (or another alternative solution) is delayed by ~1 year or more:
– The budget reduction will have implications for total power needs
– Efforts invested in recent years have benefitted total power needs
– Ongoing plans for upgrading B513 from 2.9 to 3.5 MW, including up to 600 kW of diesel-backed power, together with use of local hosting (100 kW today, potentially more)
– Remote Tier 0-bis: F. Hemmer's talk
Tier 0 planning – status
Schedule
• Two points to be addressed:
• Scrutiny report on the use of resources in light of experience with data
– Plan to have a report with the same level of detail from each experiment
– Discuss with the chair of the C-RSG at the MB on Sep 7
• What should be said about the requirements for 2013
– According to the MoU process, we should have said something already
– Very difficult to be precise without details of the running conditions, and without a better analysis of how experience with data correlates with planning (TDR etc.)
October RRB
• Formally Tier 3s are outside the scope of the WLCG MoU; but ...
• They may be co-located at a Tier 1 or 2
• Many Tier 3 sites are coming:
– Some are "Tier 2s" that are just not party to the MoU (yet...) – these are not a problem and are treated the same way as a formal Tier 2
– Some do only analysis but need to access data, so they need to be grid-aware (customers of grid services) and known to the infrastructure(s)
• Strong statements of support for them from experiment managements – they need these Tier 3s to be able to do analysis
– Who supports them operationally?
• Operational support:
– In countries with a ROC (NGI ops, or equivalent) – not a problem; the ROC should support them
– What about non-ROC countries? Can we rely on the infrastructures? What is the process?
• Come back to this in the point on EGI-WLCG interactions
Tier 3 resources
• We want Tier 3s to be known and supported as clients...
• They should not interfere with Tier 1/2 daily operations:
– either by overloading the Tier 1/2
– or through the support load
• No central support if the Tier 3 is really only a national analysis facility (i.e. not for use by the wider experiment collaboration)
• Issue: shared facilities (e.g. part of a cluster is Tier 2, part is for national use)
– The part for national-only use should not be part of the pledge
– How should this be accounted? (In principle it should not be, but in a shared cluster care needs to be taken)
Tier 3 issues – proposal
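For the shared-facility case, the accounting split the proposal calls for can be stated very simply. This is a hypothetical sketch (the function, its name, and the fixed-fraction model are invented; real WLCG accounting is far more involved): only the pledged fraction of the delivered capacity is reported against the pledge.

```python
# Sketch: split delivered wall-clock hours on a shared cluster between the
# pledged (Tier 2) share and the national-only share. Numbers are invented.
def split_usage(total_wallclock_hours: float, pledged_fraction: float):
    """Split delivered hours between pledged and national-only use."""
    assert 0.0 <= pledged_fraction <= 1.0
    pledged = total_wallclock_hours * pledged_fraction
    national = total_wallclock_hours - pledged
    return pledged, national

pledged_h, national_h = split_usage(10_000, pledged_fraction=0.7)
print(f"Report against pledge: {pledged_h:.0f} h; national-only: {national_h:.0f} h")
# -> Report against pledge: 7000 h; national-only: 3000 h
```

The care the slide warns about lies in establishing and auditing that fraction on a genuinely shared cluster, not in the arithmetic itself.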
• Need to adapt to changing technologies:
– Major re-think of storage and data access
– Use of many-core CPUs (and other processor types?)
– Virtualisation as a solution for job management
• Brings us in line with industrial technology
• Integration with public and commercial clouds
• Network infrastructure:
– This is the most reliable service we have
– Invest in networks and make full use of the distributed system
• Grid middleware:
– Complexity of today's middleware compared to the actual use cases
– Evolve by using more "standard" technologies: e.g. message brokers and monitoring systems are first steps
• But: retain the WLCG infrastructure
– Global collaboration, service management, operational procedures, support processes, etc.
– Security infrastructure – this is a significant achievement:
• both the global A&A service and trust network (X.509), and
• the operational security & policy frameworks
Evolution and sustainability
• Have started discussions on several aspects:
– Data management: Kors Bos's talk
• Discussion in March, workshop in June, follow-ups in GDBs (the first next week)
– Multi-core and virtualisation: Predrag Buncic's talk
• 2nd workshop held in June (the 1st was last year)
• Long term:
– Clarify the distinction between the WLCG distributed computing infrastructure and the software we use to implement it
– Understand how to make the grid middleware supportable and sustainable
• Use of other components (e.g. industrial messaging, tools like Nagios, etc.); what does cloud technology bring us (and how do we integrate it while maintaining our advantages – global trust, VOs, etc.)?
Service evolution strategy