Operational Experience with CMS Tier-2 Sites I. González Caballero (Universidad de Oviedo) for the CMS Collaboration


Operational Experience with CMS Tier-2 Sites

I. González Caballero (Universidad de Oviedo)

for the CMS Collaboration


Operational Experience with CMS Tier-2 Sites - CHEP 2009 - 2 -

Some relevant aspects of the CMS Computing Model

Data driven:
• Big blocks of data are moved in a more or less controlled way
• Jobs are sent to the data and not vice versa
• Tools to handle the data and find where it is located become very important

Distributed:
• Extensive use of GRID technology
• Profits from the two most widespread GRID infrastructures: OSG and EGEE

Hierarchical:
• Tier-0 serves data to Tier-1s, which serve data to Tier-2s, which serve data to Tier-3s
• Different workflows occur in different tiers
• Different degrees of service and commitment are expected from different tiers

Some figures for CMS

Event size (MB): RAW: 1, RECO: 0.5, AOD: 0.1
CPU required (SI2k/event): Sim.: 90, Rec.: 25, Analysis: 0.25

T0 + T1 + T2    2009   2010
CPU (MSI2k)       60     90
Disk (PB)         15     25
Tape (PB)         25     40
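The per-event sizes above translate directly into storage needs. The following is a minimal sketch of that arithmetic; the function name and the sample size of 100 million events are illustrative assumptions, not figures from the talk.

```python
# Per-event sizes in MB, taken from the figures quoted on this slide.
EVENT_SIZE_MB = {"RAW": 1.0, "RECO": 0.5, "AOD": 0.1}


def storage_tb(n_events, data_tier):
    """Return the space in TB (1 TB = 10**6 MB) needed to hold n_events of a data tier."""
    return n_events * EVENT_SIZE_MB[data_tier] / 1e6


if __name__ == "__main__":
    n_events = 100_000_000  # hypothetical sample size, chosen only for illustration
    for tier in ("RAW", "RECO", "AOD"):
        print(f"{tier}: {storage_tb(n_events, tier):.1f} TB")
```

The factor of ten between RAW and AOD is what makes it practical to host analysis datasets on Tier-2 disk.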


CMS Computing Model

[Diagram (les.robertson@cern.ch) with the boxes: detector, event filter (selection & reconstruction), raw data, event reprocessing, event simulation, processed data, event summary data, reconstruction, simulation, analysis, analysis objects (extracted by physics topic), batch physics analysis, interactive physics analysis. Two regions are labelled Tier-2.]


CMS Computing Model: Tier-2 tasks

Tier-2s account for 1/3 of the total CMS resources
• More than 40 sites in 22 countries

They are expected to provide resources for:
• Production of all the simulation the collaboration needs
• User data analysis

MC Production (centrally controlled activity) requires…
• GRID environment
• Working Storage Element that understands SRM
• Ability to transfer data to Tier-1s
• CMS software (CMSSW) installed at the site

Data Analysis (user-driven, bursty activity) requires…
• GRID environment
• CMS software (CMSSW)
• Working Storage Element:
  • That understands SRM
  • With enough space to host the datasets needed
• Ability to transfer data from Tier-1s


CMS Tier-2 requirements

A CMS Tier-2 needs the following GRID infrastructure:
• A GRID computing cluster: OSG or EGEE
• A storage cluster: CASTOR, dCache, DPM, GPFS…
  • With an SRMv2 frontend: StoRM
• GRID interfaces to both clusters
• Local monitoring tools: batch, storage, accounting, …

Plus the following CMS services:
• PhEDEx: to manage data transfers
  • Connects sites through SRMv2
  • The FTS service at Tier-1s is used to schedule transfers
  • Needs a dedicated mid-size machine
• FroNTier: Squid to locally cache alignment and calibration constants
  • Needs a small machine for every 800 slots

In addition, it may operate some other services:
• A login facility for local users: User Interfaces, interactive access to locally stored data, …
• Local non-mandatory GRID and CMS services to improve the local users' experience: local WMS, CRAB Server, local Data Bookkeeping Service (DBS), ...

Related talk by R. Egeland: PhEDEx Data Service (Thur, 16:30)


CMS Data Handling: Transfers at CMS Tier-2s

A full metric to commission links (up and down) has been developed:
• Based on expected data bandwidths and data transfer quality
• To prevent sites with problems from overloading well-performing sites

Only commissioned links may be used to transfer CMS data

The CMS Model is very dependent on an efficient data transfer system

CMS has a very flexible transfer topology:
• Any Tier-2 can download data from and upload data to any Tier-1
• Tier-2 to Tier-2 transfers are also allowed (though not encouraged)
  • Interesting for Tier-2s associated with the same physics groups

This adds complexity to the operation of the Tier-2 network:
• Multiple SRM connections must be managed by the sites
• The different latencies make optimization difficult
• Operators are geographically spread over different time zones, which makes communication difficult

[Diagram: transfer topology connecting T0 (CERN), T1 (ASGC), T1 (FZK), T1 (CNAF), T1 (FNAL), T1 (PIC), T2 (ES) and further T2 (…) sites]


CMS Data Handling: Commissioning links at Tier-2

CMS Facility Operations has put a big effort into increasing the number of active links:
• The downlink mesh is almost full
• Around 50% of the uplinks have been commissioned
  • At least two uplinks are mandatory for every Tier-2

The Debugging Data Transfers effort, still ongoing, is helping Tier-2s to fill the mesh
• Also working on reducing dataset transfer latencies so data can be used sooner at sites

For more details see the poster from J. Letts: Debugging Data Transfers in CMS (Thur - 024)


CMS Data Handling: Transfers to and from Tier-2

PhEDEx takes care of the transfers using a subscription mechanism
• Transfers use SRMv2 scheduled with FTS
• A set of agents takes care of the different activities needed: download, upload, data consistency checks, etc.
• PhEDEx also provides data validation and monitoring tools

The Tier-2s need to set up a UI machine
• PhEDEx software is centrally distributed through apt-get
• Local operators need to configure the agents
  • Tuning them is not always trivial
  • Lots of documentation and examples are publicly available
• An XML file (the Trivial File Catalog) takes care of converting LFNs to PFNs
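To illustrate how a rule-based catalog of this kind can turn a logical file name (LFN) into a physical file name (PFN), here is a minimal sketch. The XML layout (`lfn-to-pfn` elements with `protocol`, `chain`, `path-match` and `result` attributes) is assumed to follow the usual storage.xml style; the host name, storage paths and function name are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical catalog: the attribute layout is an assumption modeled on the
# storage.xml style, and the host name and paths are invented for the example.
TFC_XML = """
<storage-mapping>
  <lfn-to-pfn protocol="direct" path-match="/+store/(.*)"
              result="/pnfs/example.org/data/cms/store/$1"/>
  <lfn-to-pfn protocol="srmv2" chain="direct" path-match="(.*)"
              result="srm://se.example.org:8443/srm/managerv2?SFN=$1"/>
</storage-mapping>
"""


def lfn_to_pfn(lfn, protocol, rules):
    """Apply the first matching rule for `protocol`; return None if none matches."""
    for rule in rules:
        if rule.get("protocol") != protocol:
            continue
        path = lfn
        chain = rule.get("chain")
        if chain:
            # A chained rule is applied to the output of another protocol's rule.
            path = lfn_to_pfn(lfn, chain, rules)
            if path is None:
                continue
        match = re.match(rule.get("path-match"), path)
        if match:
            # Substitute $1, $2, ... in the result with the captured groups.
            return re.sub(r"\$(\d)", lambda g: match.group(int(g.group(1))),
                          rule.get("result"))
    return None


RULES = ET.fromstring(TFC_XML).findall("lfn-to-pfn")

if __name__ == "__main__":
    lfn = "/store/mc/sample/file.root"  # hypothetical LFN
    print(lfn_to_pfn(lfn, "direct", RULES))
    print(lfn_to_pfn(lfn, "srmv2", RULES))
```

Keeping the mapping in one declarative file lets the same catalog serve both the transfer agents and the jobs running on the worker nodes.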

Transfers to CMS Tier-2s, last year: 14,035 TB

Transfers from CMS Tier-2s, last year: 4,787 TB


CMS Data Handling: Tier-2 Storage distribution

MC Space: 20 TB
• For MC produced samples before they are transferred to the Tier-1s

Central Space: 30 TB
• Intended for RECO samples of Primary Datasets

Physics Group Space: 60-90 TB
• Assigned to 1-3 physics groups
• Space allocated by the physics data manager

Local Storage Space: 30-60 TB
• Intended to benefit the geographically associated community

User Space: 0.5-1 TB per user
• Each CMS user is associated with a CMS Tier-2 site
• Big outputs from user jobs can be staged out to this area

Temporary Space: < 1 TB

For more details see the poster from T. Kress: CMS Tier-2 Resource Management (Mon - 089)
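The space allocations listed above can be added up to get a rough disk budget for a site. The sketch below does only that arithmetic; the function name and the figure of 40 local users are illustrative assumptions, and the "< 1 TB" temporary area is treated as a range from 0 to 1 TB.

```python
# (minimum, maximum) allocation in TB for each area, from the slide above.
FIXED_AREAS_TB = {
    "MC Space": (20.0, 20.0),
    "Central Space": (30.0, 30.0),
    "Physics Group Space": (60.0, 90.0),
    "Local Storage Space": (30.0, 60.0),
    "Temporary Space": (0.0, 1.0),  # "< 1 TB" treated as a 0-1 TB range
}
USER_SPACE_TB = (0.5, 1.0)  # per user


def tier2_disk_range_tb(n_users):
    """Return the (minimum, maximum) total disk in TB for a site with n_users."""
    low = sum(lo for lo, _ in FIXED_AREAS_TB.values()) + n_users * USER_SPACE_TB[0]
    high = sum(hi for _, hi in FIXED_AREAS_TB.values()) + n_users * USER_SPACE_TB[1]
    return low, high


if __name__ == "__main__":
    low, high = tier2_disk_range_tb(40)  # hypothetical number of local users
    print(f"Disk needed: between {low:.0f} and {high:.0f} TB")
```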


CMS Data Handling: Selecting the data at the Tier-2

Datasets for the Central Space are managed by the Data Operations team by subscribing the assigned samples

PAGs and DPGs usually appoint one or two persons responsible for subscribing data to the Physics Group Space at their “associated” Tier-2s

PhEDEx keeps track of the “ownership” of each dataset for these two disk areas:
• Easy to follow the correct use of data at the sites

MC Space is filled by the production jobs
• Data is requested for deletion as soon as it is transferred to Tier-1s

Every CMS user can request that a dataset be placed on the Local Storage Space at any Tier-2

Sites are free to manage the use of the User Space the way they prefer: quotas, mail, etc.
• Users are usually close to the Tier-2

CMS created the role of the Data Manager at each site with special rights:
• Reviews every single transfer or deletion request…
• …and approves or denies it

The Data Manager makes sure that:
• The data is in accordance with the site commitments to the Physics and Detector Groups
• There is enough space at the local Storage Element to store the data

At big sites this can be quite a time-consuming activity
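The two checks a Data Manager makes on each request can be written down as a small decision function. This is only a sketch of the reasoning described above, not a CMS or PhEDEx tool: the class, field and function names, the group names and the dataset path are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TransferRequest:
    dataset: str    # dataset name (hypothetical example values below)
    size_tb: float  # size of the requested data in TB
    group: str      # physics or detector group the data is requested for


def review_request(request, supported_groups, free_space_tb):
    """Return (approved, reason), applying the two checks from the slide."""
    # Check 1: the data must match the site's commitments to its groups.
    if request.group not in supported_groups:
        return False, "group not among site commitments"
    # Check 2: the local Storage Element must have room for the data.
    if request.size_tb > free_space_tb:
        return False, "not enough free space"
    return True, "approved"


if __name__ == "__main__":
    req = TransferRequest("/Cosmics/Example/RECO", 5.0, "muon")
    print(review_request(req, {"muon", "top"}, free_space_tb=12.0))
```

Automating the mechanical part of the review in this way would leave the Data Manager free to concentrate on the requests that need a judgment call.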


Computing at CMS Tier-2s

Software installation is centrally managed by CMS
• The VO sgm role is used and is expected to have the highest priority on the queues
• Due to some limitations in rpm under SLC4, CMSSW installation needs a 64-bit node
• The installation of old CMSSW releases needs large amounts of memory on the installation node
  • Improvements in newer releases reduce this requirement to O(100 MB)
• The CMSSW procedure needs write access for all software managers
  • Map all sgm grid logins to a single account
• The installation area has to be shared among all Worker Nodes

Data access from the WNs
• CMSSW understands the Trivial File Catalog, so it is used to convert LFNs to PFNs
• POSIX/dCache/RFIO protocols are supported

Production workflow:
• A nominal Tier-2 is expected to reserve half of its CPUs for MC Production
• Managed through the VO production role

GRID access for local users:
• A User Interface needs to be set up
• With CRAB manually installed on it
  • Really easy to install using a tar file and an automatic configuration script

Related talk by D. Spiga: Automatization of User Analysis Workflow in CMS (Thur, 17:10)


Operating CMS Tier-2s: Central Aspects

Operating the more than 40 CMS Tier-2 sites is a complex task:
• Geographically spread around the globe… in different time zones
• With a wide variety of sizes, technologies, bandwidths…

Good means of communicating important news, configuration changes, requirements and problems are important:
• Special Hypernews forum dedicated to Tier-2s
  • At least one local operator at every site needs to follow it
• A Savannah squad per site has been created
  • Each problem found at a site is assigned to the squad

A new metric has been developed to establish the site's capability to contribute efficiently to CMS: Site Readiness
• Based on the number of commissioned links, fake analysis jobs (JobRobot) and Site Availability Monitoring (SAM) tests
• Sites are then classified as READY, NOT-READY or WARNING (in danger of becoming NOT-READY)

See the poster by J. Flix (Thur 040): The commissioning of CMS sites: improving the site reliability
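A classification of this kind can be sketched as a daily check on the three inputs followed by a look at the recent history. The code below is an illustration only: the 80% thresholds, the 7-day window and the limit of two bad days are assumptions made for the example, not the values used by CMS. Only the minimum of two uplinks is taken from this talk.

```python
MIN_UPLINKS = 2              # "at least two uplinks are mandatory" (from the talk)
MIN_SAM_AVAILABILITY = 0.80  # assumed threshold, for illustration only
MIN_JOBROBOT_SUCCESS = 0.80  # assumed threshold, for illustration only


def day_is_good(uplinks, sam_availability, jobrobot_success):
    """Return True if a site passes all three daily checks."""
    return (uplinks >= MIN_UPLINKS
            and sam_availability >= MIN_SAM_AVAILABILITY
            and jobrobot_success >= MIN_JOBROBOT_SUCCESS)


def site_status(daily_results, window=7, max_bad_days=2):
    """Classify a site from its most recent daily results (True = good day).

    The window length and the tolerated number of bad days are hypothetical.
    """
    recent = daily_results[-window:]
    bad_days = recent.count(False)
    if bad_days == 0:
        return "READY"
    if bad_days <= max_bad_days:
        return "WARNING"
    return "NOT-READY"


if __name__ == "__main__":
    history = [True, True, True, False, True, True, True]
    print(site_status(history))
```

Looking at a window of days rather than a single day keeps one short outage from flipping a site straight to NOT-READY, which matches the role of the intermediate WARNING state.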


Monitoring CMS Tier-2s

Workflows can be monitored through the CMS Dashboard
• Almost any aspect of analysis and production jobs can be checked:
  • Successful/cancelled/aborted jobs
  • By user, by site, by application, by dataset, by CE, …
  • By GRID or application error code

All aspects of data handling can be monitored through the wide variety of PhEDEx Web Server plots and tables:
• Transfer rates and volumes
• Quality of the transfers
• Errors detected and reasons for those errors
• Latencies, routing details, …

SAM tests and Site Readiness offer their own sets of tools integrated in the Dashboard

Many tools have been developed to monitor the different aspects of a Tier-2 from the point of view of CMS
• For both local and central operators


CMS Tier-2 Workflows: Production

Production uses a special tool developed by CMS: ProdAgent
• Completely centralized
• No local operator intervention in the operation

Data is produced at Tier-2s and automatically uploaded to Tier-1

See the poster by F. Van Lingen (Tue 014): CMS production and processing system - Design and experiences

More than 2 billion events produced during the last 12 months


CMS Tier-2 Workflows: User Analysis

More than 7.5 million user analysis jobs executed at Tier-2s
• On produced samples
• And on real data: cosmics recorded with full and no magnetic field

[Pie chart: 60.4% of the jobs OK, 39.6% with errors]


Future plans…

The main goal in the near future is to completely integrate all the CMS Tier-2s into CMS computing operations
• Using dedicated task forces to help sites meet the Site Readiness metrics

Improve the availability and reliability of the sites to further increase the efficiency of both analysis and production activities

Complete the data transfer mesh by commissioning the missing links
• Especially Tier-2 to Tier-1 links
• And continuously checking the

Improve the deployment of CMS software by loosening the requirements at the sites

Install CRAB Servers at more sites:
• CRAB Server takes care of some routine user interactions with the GRID, improving the user experience
• Improves the accounting and helps to spot problems and bugs in CMS software
• A new powerful machine and special software need to be installed by local operators

CMS is building the tools to allow users to share their data with other users or groups
• This will have an impact on the way data is handled at the sites


Conclusions

Tier-2 sites play a very important role in the CMS Computing Model: they are expected to provide one third of the CMS computing resources

CMS Tier-2 sites handle a mix of centrally controlled activity (MC production) and chaotic workflows (user analysis)
• CPU allocation needs to be set appropriately to ensure enough resources are given to each workflow

CMS has built the tools to facilitate the day-to-day handling of data at the sites
• The PhEDEx servers located at every site help to transfer data in an unattended way
• A Data Manager appointed at every site links CMS central data operations with the local management

CMS has established metrics to validate the availability and readiness of the Tier-2s to contribute efficiently to the collaboration's computing needs
• By verifying the ability to transfer and analyze data

A large number of monitoring tools have been developed by CMS to monitor every aspect of a Tier-2 in order to better identify and correct the problems that may appear

CMS Tier-2s have proved to be already well prepared for massive MC production, dynamic data transfer, and efficient data serving to local GRID clusters

CMS Tier-2s have proved able to provide our physicists with the infrastructure and the computing power to perform their analyses efficiently

CMS Tier-2s have a crucial role to play in the experiment in the coming years, and are already well prepared for the LHC collisions and the CMS data taking


The End

Thank you very much!


DRAFT

Abstract: In the CMS computing model, about one third of the computing resources are located at Tier-2 sites, which are distributed across the countries in the collaboration. These sites are the primary platform for user analyses; they host datasets that are created at Tier-1 sites, and users from all CMS institutes submit analysis jobs that run on those data through grid interfaces. They are also the primary resource for the production of large simulation samples for general use in the experiment. As a result, Tier-2 sites have an interesting mix of organized experiment-controlled activities and chaotic user-controlled activities. CMS currently operates about 40 Tier-2 sites in 22 countries, making the sites a far-flung computational and social network. We describe our operational experience with the sites, touching on our achievements, the lessons learned, and the challenges for the future.


CMS Tier-2 Workflows: User Analysis

Part of the analysis jobs run on real data: cosmics at full and no magnetic field

Jobs run at each Tier-2 during the last 12 months
• Total jobs: 7,589,336
• 60.4% OK, 39.6% ERR