CMS Monte Carlo Production in LCG
J. Caballero, J.M. Hernández, P. García-Abia (CIEMAT)
CMS Collaboration
Computing in High Energy and Nuclear Physics,
T.I.F.R. Mumbai, India, 13-17 February 2006
CHEP06, Mumbai. CMS Monte Carlo Production in LCG. P. Garcia-Abia / CIEMAT
Outline
Introduction
Monte Carlo production framework: Data tiers, metadata attachment, publication of data
Production workflow
First experiences
Improvements to production: Output ZIP archives, treatment of pile-up, local software installation
Production operations: efficiency, problems
Migration to LFC
The new MC production system
Conclusions
Introduction
Monte Carlo (MC) production is crucial for detector studies and physics analysis
Event simulation and reconstruction typically done in computer farms of CMS institutions
Porting production to LCG allows using a large amount of computing, storage and network resources
MC simulation was previously run in a dedicated LCG0/LCG1 testbed: small-scale production
Low efficiency: RLS, site configuration
We have introduced novel concepts that make it possible to run the full production chain on LCG, from the generation of events to the publication of data for analysis
We have coupled production to the CMS data transfer system (PhEDEx) and made the tools more robust
Important implications for the design of the new production framework
Introduction II
The CMS event data model (EDM) and the MC production framework are somewhat monolithic, not suitable for a Grid environment: Lack of modularity
Grid provides basic services: Reliability, stability and flexibility are important issues
We have identified the main weak points of LCG and made the production framework more robust
Efficient running of production in LCG is manpower intensive
Availability of the resources and responsiveness of the local administrators are crucial
Code development, testing and running of production in LCG done by ~1.5 FTE at CIEMAT
Production framework
The basic unit in MC production is the dataset: a given physics process with a well defined set of parameters
The production chain is: generation of events, simulation (hits), digitization (digis) and reconstruction (DST); these are called data tiers or steps
owner: data tier with defined geometry, SW version and pile-up (PU) sample
Detector and physics groups request events of a specific dataset/owner pair
For practical reasons, requests are split in small assignments composed of a number of runs (~1000 events)
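The splitting of a request into short runs can be sketched as follows (a minimal illustration; `split_request` and the run dictionaries are invented for the example and are not the RefDB or McRunjob API):

```python
def split_request(total_events, first_run=1, events_per_run=1000):
    """Split a dataset request into runs of ~1000 events each."""
    runs = []
    remaining = total_events
    run_number = first_run
    while remaining > 0:
        n = min(events_per_run, remaining)
        runs.append({"run": run_number, "events": n})
        remaining -= n
        run_number += 1
    return runs
```

For instance, a 2500-event request would become three runs of 1000, 1000 and 500 events.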
The data relevant to production (requests, owner/dataset, assignments, runs, data attributes) are kept in a global database (RefDB)
The MC production framework is McRunjob, a Python application developed at FNAL, long used for local farm production
Data tiers
Generation: no input, small output (10 to 50 MB ntuples); pure CPU: a few minutes, up to a few hours if hard filtering is present
Simulation (hits): GEANT4, small input; CPU and memory intensive: 24 to 48 hours; large output: ~500 MB in three files (EVD files), the smallest only ~100 KB!
Digitization: lower CPU/memory requirements: 5 to 10 hours; I/O intensive: persistent reading of PU through the LAN; large output: similar to simulation
Reconstruction: even less CPU: ~5 hours; smaller output: ~200 MB in two files
Event metadata attachment
In order to run the digitization step, event metadata have to be generated for the whole collection of simulated events
When running reconstruction the metadata of both the simulated and the digitized events are required
The generation of metadata (metadata attachment) needs direct access to the event files, not suitable for distributed systems: the output of the jobs is potentially distributed among several Storage Elements with no POSIX I/O-like access
Metadata attachment was the main show-stopper for porting the MC production system to LCG: lack of modularity (atomicity) in the old EDM
We introduced the concept of atomic attachment: Metadata attachment done on the Worker Node for the run to be processed
Negligible overhead: EVD files already in the working area
Publication of data
We have coupled production in LCG and the CMS data transfer system: PhEDEx is used to collect event files in the T1s/T2s that host data for analysis
However, data handling for intermediate steps not done by PhEDEx...
(this is one of the main problems in production)
For each owner/dataset, a global metadata attachment is performed: metadata and local XML POOL file catalogs are produced and made public in the data location system (global RefDB and local data location DB, PubDB)
Analysis tools inspect RefDB/PubDB for data discovery: Analysis jobs are submitted to the appropriate T1/T2
Production workflow
Job preparation: McRunjob downloads assignment information from RefDB:
• List of runs, job templates, application data-cards, input file specification, input/output virgin metadata
Jobs are created for each run using the templates:
• Application scripts
• JDL file with grid requirements (CPU, memory, SW tags, site...)
• Wrapper script: the site-specific setup needed to run the job on an LCG WN
Jobs are submitted to a LCG CE using the JDL
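The JDL carried by each job might look like the sketch below. The attribute values (wrapper name, software tag, memory cut) are invented for the example; the Glue attribute names follow common LCG JDL usage, but this is an illustration, not the actual production JDL:

```python
def make_jdl(wrapper, sw_tag, min_memory_mb=512):
    """Build a minimal, illustrative JDL string for one production job."""
    return (
        f'Executable = "{wrapper}";\n'
        f'StdOutput = "job.out";\n'
        f'StdError = "job.err";\n'
        f'OutputSandbox = {{"job.out", "job.err", "summary.xml"}};\n'
        # Require the CMS software tag and a minimum amount of RAM at the site
        f'Requirements = Member("{sw_tag}", '
        f'other.GlueHostApplicationSoftwareRunTimeEnvironment) '
        f'&& other.GlueHostMainMemoryRAMSize >= {min_memory_mb};'
    )
```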
At runtime on the WN: Input files are downloaded from SE
Metadata are generated for the input event files
After the application runs the output EVDs are copied to the SE
The summary file is returned in the output sandbox and sent to RefDB from the UI for validation of the job
Originally, also the application output/error was returned
First experiences
The first experiences were disappointing: Extremely high submission time (very low rate)
Very low job efficiency
Job retrieval time too high: huge output
Failure causes: local configuration problems: unavailability of CMS software (installation problems), NFS
Instability of the RLS global catalog
Problems staging in/out files: weak staging procedure, copy from/to the SE unreliable
Poor error reporting from the application: hard to automate job resubmission, typically done after visual inspection of the logs
Real time monitoring unavailable
Improvements to production
We introduced new ideas in the production system in order to make it more robust
Output/error files of the application removed from the output sandbox: size greatly reduced, significantly improving the job retrieval rate
Virgin metadata and XML POOL catalog of the job removed from the input sandbox (size reduced to 10 KB): stored in several SEs at job submission time (atomic operation) to improve their availability
Significant improvement of the job submission rate
More robust stage in/out procedure: failing input/output operations to/from the SE are retried several times (with a delay) to avoid temporary access problems to the SE/RLS
The copy of the job output is tried on several SEs if one fails
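The retry-with-fallback logic can be sketched like this (assumed names; `try_copy` stands in for whatever replica-manager command performs the actual transfer):

```python
import time

def copy_with_retries(try_copy, storage_elements, retries=3, delay=60):
    """Try the copy against each SE in turn, retrying each one a few
    times with a delay to ride out transient SE/catalog failures."""
    for se in storage_elements:
        for _ in range(retries):
            if try_copy(se):      # try_copy returns True on success
                return se
            time.sleep(delay)
    raise RuntimeError("copy failed on all storage elements")
```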
Output ZIP archives
At job completion time, the output EVD files are packed in a ZIP archive (without compression) together with other important files: checksum of the EVD files, XML POOL catalog fragment of the output files, summary file, output and error files of the application
Just one big file is copied to the SE, instead of several EVD files
One of the EVDs is only 100 KB in size (very bad for MSS performance)
CMS applications can read files inside uncompressed ZIP archives (without unpacking them)
Zipping had implications for the job preparation of subsequent steps: we instrumented the job wrapper to deal properly with ZIPs
... and in the publication of data: the publication tool, CMSGLIDE (M.A. Afaq, FNAL), was modified to create XML POOL catalogs and attached metadata for production ZIPs
Global metadata attachment done using ZIPs (without unzipping)
Zipping of EVDs widely adopted in CMS: EVDs produced in local farms are merged into 2 GB files
Great benefits for PhEDEx and MSS: far fewer, much larger files
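The packing of a run's output into an uncompressed archive can be illustrated with Python's standard `zipfile` module (`pack_run_output` is an invented name; ZIP_STORED is what keeps members directly readable without unpacking):

```python
import os
import zipfile

def pack_run_output(archive_path, files):
    """Pack the EVD and bookkeeping files of one run into a single
    uncompressed archive; ZIP_STORED members can be read in place."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as zf:
        for path in files:
            zf.write(path, arcname=os.path.basename(path))
    return archive_path
```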
Treatment of pile-up
Proper simulation of events requires the superposition of events from inelastic pp interactions on the events of simulated physics processes
Large pile-up (PU) samples prepared at CERN: about 100 GB
Local farm: PU installed locally and made accessible to the jobs at runtime via POSIX I/O-like protocols (rfio, dcap)
LCG: PU sample, EVD (zipped) and metadata transferred with PhEDEx to T1/T2 sites that will run digitization/reconstruction: the XML POOL catalog of the PU, with site-dependent PFNs/protocol, is placed in a standard location
An LCG software tag for the PU is published in the grid information system and used as a requirement in the JDL of the jobs
At runtime, the job wrapper merges the PU catalog with that of the job
This simple (novel) implementation has been crucial for running digitization and reconstruction jobs in LCG
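The catalog merge performed by the wrapper can be sketched as an XML operation (a simplified illustration; real POOL catalogs carry physical/logical sub-elements omitted here, and `merge_pu_catalog` is an invented name):

```python
import xml.etree.ElementTree as ET

def merge_pu_catalog(job_catalog, pu_catalog, merged_catalog):
    """Append the <File> entries of the site's pile-up POOL catalog
    to the job's own XML POOL file catalog."""
    job = ET.parse(job_catalog)
    root = job.getroot()
    for entry in ET.parse(pu_catalog).getroot().findall("File"):
        root.append(entry)
    job.write(merged_catalog)
```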
Other interesting ideas
Local CMS software installation at runtime: we instrumented the job wrapper to install the CMS software in the working area of the job
Little overhead: software downloaded from the SE
Avoids NFS problems, software installation problems and black-hole nodes
Suitable for running at sites with little or no CMS support
Local pile-up installation at runtime (a la ATLAS): Store and replicate the PU sample in several SEs
Download a (random) fraction of the PU sample
Generate metadata for the PU runs downloaded
An experimental version exists, not used for physics: it is important to determine the number of events required to have minimal or no impact on physics
Need to study the tradeoff between local access and downloading of files (LAN)
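The random-fraction download could be sketched as follows (hypothetical names; the right fraction is exactly the open question noted above):

```python
import random

def pick_pu_subset(pu_files, fraction=0.2, seed=None):
    """Choose a random fraction of the PU runs to download locally.
    The fraction needed for unbiased physics is still to be studied."""
    rng = random.Random(seed)
    k = max(1, int(round(len(pu_files) * fraction)))
    return rng.sample(pu_files, k)
```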
Production operations
Production in LCG started slowly one year ago with reduced manpower: development/testing of the McRunjob-LCG software and production operations done for a long time by ~1.5 FTE
Other production operators joined the effort a few months later
The number of events (in millions) produced in LCG per data tier: 13.1 generated, 11.7 simulated, 11.4 digitized, 5.1 reconstructed
[Plots: cumulative event production. Simulation: 12 M events, 11/04 to 02/06. Digitization: 14 M events, 06/05 to 02/06.]
Production operations II
We decided to use white lists due to grid/site unreliability:
Sites selected for their size and robustness
Big sites still running production in local farm mode (FZK, IN2P3)
Local administrators providing fast response: fixing problems
Availability of PU
This represents a fraction (~30%) of the production in LCG
No proper bookkeeping in the initial phase of production
[Plots: 4275 simulation jobs; 9800 digitization and reconstruction jobs]
Efficiency and failures
Rather low efficiency
Stage in/out and catalog (RLS) related problems
LCG and site problems: RB, CE
Input file stage in: 25%
Output file stage out: 16%
Local data access: 19%
LCG catalog lookup (LFC ~0%): 6%
Other LCG and site problems: 25%
Application failure: 8%
Unclassified: 1%
Examples of problems
Lack of automatic monitoring/resubmission
Lack of coupling to the CMS data management system (pre-staging)
Temporary grid and site problems: CE, SE, RLS
Lack of manpower
Organization: lack of dedicated resources (PU)
Lack of priorities: competition with CMS analysis and other experiments’ jobs
Migration to LFC
Recently, CMS has migrated from RLS to LFC as the global file catalog for LCG (thanks to S. Lemaitre, A. Sciabà, J. Casey)
We adapted McRunjob to use LFC instead of RLS
So far, a small fraction of production in LCG done using LFC
Very satisfactory results as compared to RLS
Performance of production (LFC)
Significant improvement in performance when using LFC as a global catalog
A bunch of jobs died due to an unscheduled power cut
New MC production system
Expert system (prodagent)
Automatic data merging step
Job chaining
Coupled to the Data Management System
New EDM (no metadata attachment)
Improved monitoring
Better error handling
Conclusions
End-to-end production system in LCG
Invaluable experience for the next generation Monte Carlo production system
Robustness is a very important issue given the current unreliability and instability of the grid/sites