Patricia Méndez Lorenzo (IT/GS) ALICE Offline Week (18th March 2009)


Patricia Méndez Lorenzo (IT/GS)

ALICE Offline Week (18th March 2009)

Introduction

ALICE is interested in the deployment of the CREAM-CE service at all sites that support the experiment
    GOAL: deprecate the use of the WMS in favour of direct CREAM-CE submission
    The WMS submission mode to CREAM-CE is not required
ALICE began testing the CREAM-CE in the real production environment at the beginning of summer 2008
For the time being, ALICE is the only LHC experiment performing stress tests and real production tests on the CREAM-CE
This talk focuses on the ALICE experience with the CREAM-CE, the expectations, the future plans and the requirements for all sites

18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE

The CREAM-CE

CREAM (Computing Resource Execution And Management): a lightweight service for job management operations at the CE level
    Intended as the replacement for the current LCG-CE
Submission procedures allowed by CREAM:
    Submission to CREAM via the WMS
    Direct submission via generic clients
The submission method depends mainly on the experiment's computing model
    Pilot-based models normally follow the direct submission approach (4 LHC experiments)
    Bulk submission of real jobs follows the WMS submission approach (CMS)


Direct Submission to CREAM-CE

Extra elements required for direct submission
    Proxy renewal mechanism (required by CMS and ATLAS)
        Responsible for automatically renewing the user proxy when it is about to expire
        Already (recently) available
The lack of this element is not a showstopper for ALICE
    48h VOMS extensions are ensured by the security team at CERN
    Enough to run production/analysis jobs without any additional extension


The 1st test phase

Performed in summer 2008 at FZK (T1 site, Germany)
Tests operated through a second VOBOX, parallel to the existing service at the T1 (operating in WMS submission mode)
Access to the local CREAM-CE was provided through the PPS infrastructure
    Initially 30 CPUs
    Moved to the ALICE production queue within a few weeks (production setup)
Intensive functionality and stability tests from July to September 2008
    Production was stopped to create an ALICE CREAM module in AliEn and to allow the site to upgrade its system
Excellent support from the CREAM-CE developers and the site admins
    Especially Massimo Sgaravatto (INFN-Padova) and Angela Poschlad (GridKa T1 site)


Results of the 1st test phase

More than 55000 jobs successfully executed through the CREAM-CE in the period mentioned
No interventions in the VOBOX required during the testing phase
CREAM-CE used to distribute real (standard) ALICE jobs


[Figure annotations: "Running on PPS nodes"; "Running on the production queue"]

Implementation into AliEn (I)

Creation of a new CREAM module
    Specific to CREAM-CE submissions
    Available since AliEn v2-16
    Runs in parallel with the usual LCG module (restricted to WMS submissions only)
Change in the JDL construction
    The current ALICE JDL contains the outputsandbox field, which specifies the standard outputs of the job agents
    CREAM-CE requires a new JDL field that declares the gridftp server from which to retrieve the standard outputs
ALICE PROCEDURE: remove the outputsandbox field from the JDL files created by the CREAM module
    Only available when submitting in debug mode
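The JDL change described on this slide could be sketched roughly as follows. This is an illustration only, written in Python rather than in the language of the actual AliEn module: the function name `build_agent_jdl`, the script name and the gridftp URL are hypothetical, and `OutputSandboxBaseDestURI` is an assumption about which CREAM JDL field the slide refers to. The sketch also assumes that the output sandbox fields are emitted only in debug mode.

```python
def build_agent_jdl(executable, debug=False, gridftp_uri=None):
    """Return JDL text for one job agent (illustrative sketch only).

    Outside debug mode no output sandbox is declared at all. In debug
    mode the standard outputs are requested and directed to a gridftp
    server.
    """
    attributes = [("Executable", executable)]
    if debug:
        if not gridftp_uri:
            raise ValueError("debug mode needs a gridftp server URI")
        attributes += [
            ("StdOutput", "std.out"),
            ("StdError", "std.err"),
            ("OutputSandbox", ["std.out", "std.err"]),
            # Assumed name of the CREAM field declaring the gridftp server.
            ("OutputSandboxBaseDestURI", gridftp_uri),
        ]

    lines = []
    for name, value in attributes:
        if isinstance(value, list):
            # JDL lists are written as {"a", "b"}.
            rendered = "{" + ", ".join('"%s"' % v for v in value) + "}"
        else:
            rendered = '"%s"' % value
        lines.append("  %s = %s;" % (name, rendered))
    return "[\n" + "\n".join(lines) + "\n]"
```

Calling `build_agent_jdl("agent.sh")` yields a JDL with only the `Executable` attribute, while passing `debug=True` together with a gridftp URI adds the sandbox-related fields.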


Implementation into AliEn (II)

A gridftp server is required
    Needed to retrieve the standard outputs of the job agents
    Sites are free to decide on its implementation (proposal: VOBOX)
    200 GB of space required
    It will be used ONLY if the submission has been done in debug mode
Change in the proxy renewal mechanism
    Purpose: submission optimization
    The user proxy is renewed only once per hour
        In previous AliEn versions this procedure was executed BEFORE each agent submission
    The procedure has ALSO been implemented in LCG.pm
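The "renew at most once per hour" behaviour can be illustrated with a small throttling wrapper. This is a sketch under stated assumptions, not the AliEn code: the class `ProxyRenewer`, the `renew` callable and the injectable `clock` are all hypothetical names introduced for illustration, and the actual renewal command is deliberately left to the caller.

```python
import time

RENEWAL_INTERVAL = 3600  # seconds, i.e. once per hour as stated on the slide


class ProxyRenewer:
    """Call `renew` at most once per interval (illustrative sketch)."""

    def __init__(self, renew, interval=RENEWAL_INTERVAL, clock=time.monotonic):
        self._renew = renew        # hypothetical callable doing the real renewal
        self._interval = interval
        self._clock = clock        # injectable so the logic can be tested
        self._last = None          # time of the last successful renewal

    def renew_if_due(self):
        """Renew the proxy if the interval has elapsed; report whether it ran."""
        now = self._clock()
        if self._last is None or now - self._last >= self._interval:
            self._renew()
            self._last = now
            return True
        return False
```

A submission loop would call `renew_if_due()` before each agent submission. Most calls return immediately, which is the optimization compared with renewing before every submission.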


The 2nd test phase

After a debugging phase of the CREAM module in January 2009, the new CREAM module entered production on the 19th of February (start of the 2nd testing phase)
Stability and performance are currently the most important test issues at the sites providing a CREAM-CE
The deployment of a 2nd VOBOX ensures that production continues in parallel through the WMS
    A single VOBOX would require dedicated babysitting of the system (not realistic)
Feedback on all issues is provided directly to the CREAM developers
As of today, 11 sites are providing a CREAM-CE


Site      CREAM-CEs            CREAM Status   2nd VOBOX        Clients in VOBOX   General Status
FZK       1 (4 queues)         OK             YES              YES                OK
Kolkata   2                    OK             YES              YES                OK
Athens    1                    OK             NO               NO                 NOT OK
KISTI     1                    OK             YES              YES                OK
GSI       1                    OK             NO               YES                NOT OK*
IHEP
RAL       1                    OK             NO               YES                OK*
CNAF      1                    OK             YES              YES                OK
CERN      2 (3 queues each)    OK             YES              YES                OK
Torino    1                    OK             YES              YES                OK
SARA      1                    OK             In preparation   YES                In testing

Status of the sites (I)

FZK
    Minor actions required during the 2nd test phase
        Some sandbox directories had to be deleted (hitting the limit of 32K subdirectories again)
        This procedure will not be necessary in the next CREAM versions
    46530 jobs since the 19th of February through the FZK CREAM-CE
RAL
    No special service maintenance actions reported by the site
    2678 jobs executed using the local CREAM-CE
Kolkata
    Debugging phase performed directly with the developer (Massimo Sgaravatto)
    In production since the 9th of March

Status of the sites (II)

CERN
    Two CEs were provided to ALICE for testing on the 9th of March
    In production since the 10th of March (voalice03 used for this production)
    SLC5 WNs behind the CREAM-CE
    17247 jobs since the 10th of March
GSI
    Setup of a 2nd VOBOX still pending
    The CREAM-CE is performing well
CNAF
    CREAM-CE ready to enter production at the end of February
    Back in production on the 13th of March, after some instabilities observed last week (lack of automatic purge)
    The info provider of the CREAM-CE shows some instabilities

Status of the sites (III)

KISTI
    Instabilities at the VOBOX level prevent the full setup of the local CREAM-CE in production
    CREAM-CE system performing well
Athens
    The CREAM-CE is working but the site cannot be put in production
        No CREAM clients on the VOBOX
IHEP
    CREAM-CE is not working yet (the site admin is working on it)
    Missing infrastructure: no 2nd VOBOX (it will be provided next week)

Status of the sites (IV)

SARA
    System tested yesterday evening with a few jobs
    Still in testing phase
Torino
    System in production since last week
    Already 744 jobs executed through the local CREAM system
Subatech
    2nd VOBOX already provided; the setup of the CREAM-CE is ongoing

Reminder: How to provide CREAM-CE services for ALICE

During the pre-GDB meeting last October it was explicitly mentioned:
    Unlikely to be deployable as an lcg-CE replacement on this timescale (downtime period), but we can continue with rollout in parallel.
In addition, during the November pre-GDB meeting it was concluded:
    The lcg-CE replacement will require the WMS submission to be in place and the proxy renewal issue to be resolved (among other points related to the service performance)
However, the deployment of the system in parallel to the LCG-CE was encouraged

Reminder: How to provide CREAM-CE services for ALICE (II)

In terms of the ALICE computing model, the parallel LCG-CE and CREAM-CE setup means the deployment of a 2nd VOBOX
    Each VOBOX is able to submit to one specific backend
    One VOBOX: LCG-CE OR CREAM-CE submission (replacement approach)
    Two VOBOXes: LCG-CE AND CREAM-CE submission (parallel approach)
This is a temporary solution during the parallel running phase
    As soon as the replacement is ensured and the LCG-CE is deprecated, ALICE will not require a 2nd VOBOX
Remarks on the 2nd VOBOX deployment
    Its setup is not signed in blood
    Each case can be studied individually
    BUT! Sites with important storage capability for ALICE should be included in the list of sites providing a 2nd VOBOX

Reminder: How to provide CREAM-CE services for ALICE (III)

Setup of the ALICE production queue behind the CREAM-CE
    This procedure puts the CREAM-CE directly in production
GridFTP server
    Required to retrieve the job (agent) outputs
    Removed from the VOBOX in January 2008 with the deployment of the gLite 3.1 VOBOX
        It was no longer required by the 4 LHC experiments at that time
    No specific wish for the placement of this service
        It can be provided in the VOBOX, but this is a site decision

Future Plans

Small changes in the CREAM module are still needed
    The current implementation of CREAM-CE submission via the CLI allows the declaration of a single queue only
    Sites can provide several queues per site (especially T0/T1 sites)
    Submission to several queues must be implemented at the application level
PROPOSAL for ALICE (in 3 lines of code):
    Definition of a range per queue at the LDAP level
    Calculation of a random number before each agent submission
    Assignment of a queue by matching the random number against the ranges
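The three-step proposal above (a range per queue, a random draw, a range match) might look like the following minimal sketch. The queue names and range values are hypothetical, and the ranges are hard-coded here, whereas the slide proposes defining them at the LDAP level. The function name `pick_queue` is likewise introduced only for illustration.

```python
import random

# Hypothetical ranges: each queue owns a half-open interval [low, high)
# of the unit interval, so a wider interval receives more agents.
QUEUE_RANGES = [
    ("queue_a", 0.0, 0.5),
    ("queue_b", 0.5, 0.8),
    ("queue_c", 0.8, 1.0),
]


def pick_queue(ranges, draw=None):
    """Return the queue whose range contains the random draw."""
    if draw is None:
        draw = random.random()  # uniform in [0.0, 1.0)
    for name, low, high in ranges:
        if low <= draw < high:
            return name
    raise ValueError("no queue range covers %r" % draw)
```

Calling `pick_queue(QUEUE_RANGES)` before each agent submission spreads the agents over the queues in proportion to the width of each range, without the submission client itself needing to know about more than one queue at a time.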

Conclusions

The ALICE experience with the current CREAM-CE service is very positive
    Stable (and maintenance-free) operation is achieved quickly after the initial debugging period
    High performance and scalability: 2000+ parallel jobs at FZK served by a single CREAM-CE
    Excellent support provided by the developers
        Special thanks to Massimo Sgaravatto (INFN Padova)
ALICE is working with all sites to install a CREAM-CE
    In full production before the start of data taking