Status of the use of LCG. Stefano Bagnasco, INFN Torino. ALICE Offline Week, CERN, June 1, 2004


Page 1: Status of the use of LCG

Status of the use of LCG

Stefano Bagnasco, INFN Torino

ALICE Offline Week

CERN June 1, 2004

Page 2: Status of the use of LCG


Being bold: from AliEn to a Meta-Grid

● LCG-2 core sites
  • CERN, CNAF, FZK, NIKHEF, RAL, Taiwan, Cyfronet, IC, Cambridge (more than 1000 CPUs)

● GRID.IT sites
  • LNL.INFN, PD.INFN and several smaller ones (about 400 CPUs not including CNAF)

● The pull model is well suited for implementing higher-level submission systems, since it does not require knowledge about the periphery, which may be very complex:

“A Grid is a system that […] coordinates resources that are not subject to centralized control […] using standard, open, general-purpose protocols and interfaces […] to deliver nontrivial qualities of service.”

I. Foster, “What is the Grid? A Three Point Checklist”, GRIDtoday (2002)

Page 3: Status of the use of LCG


From AliEn to a Meta-Grid – cont’d

Design strategy:

● Use AliEn as a general front-end
  • Owned and shared resources are exploited transparently

● Minimize points of contact between the systems
  • No need to reimplement services etc.
  • No special services required to run on remote CE/WNs

● Make full use of provided services: Data Catalogues, scheduling, monitoring…
  • Let the Grids do their jobs (they should know how)

● Use high-level tools and APIs to access Grid resources
  • Developers put a lot of abstraction effort into hiding the complexity and shielding the user from implementation changes

Page 4: Status of the use of LCG


Implementation

Implementation:

● Manage LCG (and Grid.it) resources through a “gateway”: an AliEn client (CE+SE) sitting on top of an LCG User Interface.
  • The whole of LCG computing is seen as a single, large AliEn CE associated with a single, large SE.

● Job management interface
  • JDL translation (incl. InputData statements)
  • JobID bookkeeping
  • Command proxying (Submit, Cancel, Status query…)

● Data management interface
  • Put() and Get() implementation
  • AliEn PFN is LCG GUID (plus SE host to allow for optimisation)
  • AliEn knows nothing about LCG replicas (but RLS/ROS does!)
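The job management side of the gateway (JobID bookkeeping plus command proxying) can be sketched roughly as follows. This is a minimal illustration, not the actual interface code: the `Gateway` class, its method names and the three injected callables standing in for the LCG User Interface commands are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Gateway:
    """Hypothetical gateway that proxies AliEn commands to an LCG backend."""

    # Callables standing in for the LCG User Interface commands (assumptions).
    lcg_submit: Callable[[str], str]    # LCG JDL text -> LCG job ID
    lcg_status: Callable[[str], str]    # LCG job ID -> LCG status string
    lcg_cancel: Callable[[str], None]   # LCG job ID -> nothing
    # JobID bookkeeping: AliEn job ID -> LCG job ID.
    job_ids: Dict[int, str] = field(default_factory=dict)

    def submit(self, alien_job_id: int, lcg_jdl: str) -> str:
        """Submit the already translated JDL and remember the ID pair."""
        lcg_id = self.lcg_submit(lcg_jdl)
        self.job_ids[alien_job_id] = lcg_id
        return lcg_id

    def status(self, alien_job_id: int) -> str:
        """Proxy a status query using the bookkeeping table."""
        return self.lcg_status(self.job_ids[alien_job_id])

    def cancel(self, alien_job_id: int) -> None:
        """Proxy a cancel request and forget the job."""
        self.lcg_cancel(self.job_ids.pop(alien_job_id))
```

Injecting the backend as plain callables keeps the points of contact between the two systems small, in the spirit of the design strategy: the gateway only needs to know how to submit, query and cancel.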

Page 5: Status of the use of LCG


Interfacing AliEn and LCG

[Figure: architecture diagram. Components: AliEn Server; Interface Site (AliEn CE, LCG UI, AliEn SE); LCG RB; LCG Site (EDG CE, WN, LCG SE); Data Catalogue; Replica Catalogue. Arrow labels: Job submission; Status report; Data Registration (LFN); Data Registration (PFN = LFN); LFN; PFN.]

Page 6: Status of the use of LCG


Implementation details

● Job management interface
  • JDL translation (incl. InputData statements)
    • Remote catalogue query for each job, which may slow down submission significantly
  • JobID bookkeeping
    • AliEn job states do not map onto LCG ones
    • Job states from the two systems are often out of sync (LCG bookkeeping slower)
  • Command proxying (Submit, Cancel, Status query…)
    • Have to take care of LCG LB delays

● Data management interface
  • Put() and Get() implementation
  • AliEn PFN is LCG GUID (plus SE host to allow for optimisation)
    • lcg://tbed0115.cern.ch/8f6064e7-b580-11d8-ba2a-f8adc5715cd0
  • AliEn knows nothing about LCG replicas (but RLS/ROS does!)
    • Again, let the RB do its job!
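The PFN convention above (SE host plus LCG GUID behind an `lcg://` scheme) can be built and taken apart with a few lines of code. The sketch below only assumes the layout visible in the example PFN; the helper names `make_pfn` and `parse_pfn` are hypothetical and not part of AliEn or LCG.

```python
import uuid
from typing import NamedTuple
from urllib.parse import urlsplit


class LcgPfn(NamedTuple):
    se_host: str  # storage element host, kept to allow for optimisation
    guid: str     # LCG GUID identifying the file


def make_pfn(se_host: str, guid: str) -> str:
    """Build an AliEn PFN of the form lcg://<SE host>/<GUID>."""
    return f"lcg://{se_host}/{guid}"


def parse_pfn(pfn: str) -> LcgPfn:
    """Split an lcg:// PFN into SE host and GUID, validating both."""
    parts = urlsplit(pfn)
    if parts.scheme != "lcg" or not parts.hostname:
        raise ValueError(f"not an lcg:// PFN: {pfn!r}")
    guid = parts.path.lstrip("/")
    uuid.UUID(guid)  # raises ValueError if the GUID is malformed
    return LcgPfn(parts.hostname, guid)
```

For example, `parse_pfn("lcg://tbed0115.cern.ch/8f6064e7-b580-11d8-ba2a-f8adc5715cd0")` yields the SE host and the GUID as separate fields, and `make_pfn(*result)` reproduces the original string.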

Page 7: Status of the use of LCG


Cheating: two grids, same resources!

[Figure: a user submits jobs to the AliEn Server. Components shown: AliEn CE on an LCG UI; LCG RB; a site with both an AliEn CE/SE and an LCG CE/SE in front of the same worker nodes (WN).]

● “Double access” for selected sites (CNAF and CT.INFN)

Page 8: Status of the use of LCG


Software installation

● Both AliEn and AliRoot installed via “plain” LCG jobs
  • Do some checks, download tarballs, uncompress, build environment script and publish relevant tags
  • Single command available to get the list of available sites, send the jobs everywhere and wait for completion
  • Full update on LCG-2 + GRID.IT (16 sites) takes ~30 minutes (if queues are not full…)
  • Manual intervention still needed in a few sites (e.g. CERN/LSF)
  • Ready for integration into AliEn automatic installation system

● Experiment software shared area misconfiguration caused most of the trouble in the beginning

[Figure: the LCG-UI sends installAliEn.sh / installAliEn.jdl and installAlice.sh / installAlice.jdl to the sites (NIKHEF, Taiwan, RAL, CNAF, TO.INFN).]
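The "send the jobs everywhere and wait for completion" step can be pictured as a simple fan-out over the site list. The following is only a sketch under stated assumptions: `install_everywhere` and the `run_install_job` callable are hypothetical names, and the real command submits LCG jobs rather than calling a local function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def install_everywhere(
    sites: Iterable[str],
    run_install_job: Callable[[str], bool],
    max_parallel: int = 8,
) -> Dict[str, bool]:
    """Run one installation job per site and wait for all of them.

    run_install_job is a stand-in (assumption) for "submit the install
    job to this site and block until it finishes"; it returns True on
    success.
    """
    site_list = list(sites)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves the input order, so results line up with sites.
        results = list(pool.map(run_install_job, site_list))
    return dict(zip(site_list, results))
```

Sites that come back with `False` are the candidates for the manual intervention mentioned above.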

Page 9: Status of the use of LCG


Monitoring

● AliEn job monitoring
  • Through interface “keyhole”
  • State misalignment

● GridICE (LCG native monitoring)
  • Only CE/SE load, integrated info
  • No job monitoring (but coming soon)

● MonALISA sensors on server
  • Checking storage capacity etc.
  • Looking directly at LCG Info Providers via LDAP

● MonALISA sensors on interface
  • Job distribution
  • Not much automatism (but then, they do the job)
  • AliEn knows nothing about LCG replicas (but RLS and ROS do!)

Page 10: Status of the use of LCG


The debugging loop

[Figure: job state diagram comparing the two systems.
AliEn status values shown: Assigned, Queued, Started, Running, Saving, Done, Error_*
LCG status values shown: Ready, Waiting, Scheduled, Running, Done, Aborted
Annotations: Logs from LCG OutputSandbox; Logs from AliEn Catalogue; (Some) logs from LCG L&BK; Couldn’t save & ran out of WallClockTime? No logs at all!]
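Because the two sets of job states do not map one-to-one, a consistency check between them has to work with sets of plausible combinations rather than a simple lookup. The sketch below illustrates the idea; the `COMPATIBLE` table is a hypothetical guess made for illustration only and is not the mapping used by the real interface.

```python
from typing import Dict, Optional, Set

# Hypothetical table (assumption): for each LCG state, the AliEn states
# that would not be surprising to see at the same time.
COMPATIBLE: Dict[str, Set[str]] = {
    "Ready": {"Queued"},
    "Waiting": {"Queued"},
    "Scheduled": {"Queued"},
    "Running": {"Started", "Running", "Saving"},
    "Done": {"Saving", "Done"},
    "Aborted": {"Error"},
}


def normalise_alien(state: str) -> str:
    """Collapse the Error_* family into a single 'Error' label."""
    return "Error" if state.startswith("Error_") else state


def in_sync(alien_state: str, lcg_state: str) -> Optional[bool]:
    """Return True/False for known LCG states, None for unknown ones."""
    allowed = COMPATIBLE.get(lcg_state)
    if allowed is None:
        return None
    return normalise_alien(alien_state) in allowed
```

A `False` result does not necessarily mean something is broken: since LCG bookkeeping is slower, a transient mismatch is expected, and only a mismatch that persists is worth debugging.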

Page 11: Status of the use of LCG


PDC2004 - Status

● Statistics after round 1 (ended April 4): job distribution
  • Alice::CERN::LCG is the interface to LCG-2
  • Alice::Torino::LCG is the interface to GRID.IT

Page 12: Status of the use of LCG


AliEn vs. AliEn+LCG

● LCG-2 jobs seen through AliEn MonALISA monitoring
  • Ramp-up slope shows no major performance degradation

[Figure: monitoring plot with series for an AliEn native site, LCG-2 and GRID.IT.]

Page 13: Status of the use of LCG


Grid.it starting up

Larger sites filled first

Page 14: Status of the use of LCG


Phase II: Reconstruction

● Will need local storage for all sites
  • So will need use of native LCG storage
  • SRM in front of CASTOR, but ugly “Classic SE” in front of disk pools
  • SRM-to-dCache not working yet

● Interface system available, installed everywhere and tested on the EIS testbed.
  • SRM everywhere would simplify things a lot
  • Use of GUIDs for files also simplified things
  • Eagerly waiting to see if large-scale production works…

● Many sites have ludicrously small storage
  • LCG management is informed…

Page 15: Status of the use of LCG


Phase II

[aliendb3.cern.ch:3307] /proc/296915/ > ls -l
-rwxr-xr-x sbagnasc z2  40190 Jun 16 19:06 ConfigPPR.C
-rwxr-xr-x sbagnasc z2   9416 Jun 16 19:10 EMCAL.Hits.root
-rwxr-xr-x sbagnasc z2   8538 Jun 16 19:11 FMD.Hits.root
-rwxr-xr-x sbagnasc z2 568379 Jun 16 19:12 galice.root
-rwxr-xr-x sbagnasc z2    405 Jun 16 19:06 grunPPR.C
-rwxr-xr-x sbagnasc z2  13703 Jun 16 19:12 ITS.Hits.root
-rwxr-xr-x sbagnasc z2  19652 Jun 16 19:13 Kinematics.root
-rwxr-xr-x sbagnasc z2   9552 Jun 16 19:14 MUON.Hits.root
-rwxr-xr-x sbagnasc z2   7195 Jun 16 19:14 PHOS.Hits.root
-rwxr-xr-x sbagnasc z2   6822 Jun 16 19:15 PMD.Hits.root
-rwxr-xr-x sbagnasc z2   1431 Jun 16 19:20 resources
-rwxr-xr-x sbagnasc z2   7222 Jun 16 19:15 RICH.Hits.root
-rwxr-xr-x sbagnasc z2  10608 Jun 16 19:16 START.Hits.root
-rwxr-xr-x sbagnasc z2   2105 Jun 16 19:20 stderr
-rwxr-xr-x sbagnasc z2 277336 Jun 16 19:20 stdout
-rwxr-xr-x sbagnasc z2   9799 Jun 16 19:17 TOF.Hits.root
-rwxr-xr-x sbagnasc z2  35321 Jun 16 19:17 TPC.Hits.root
-rwxr-xr-x sbagnasc z2 100688 Jun 16 19:18 TrackRefs.root
-rwxr-xr-x sbagnasc z2  32475 Jun 16 19:19 TRD.Hits.root
-rwxr-xr-x sbagnasc z2  10992 Jun 16 19:19 VZERO.Hits.root
-rwxr-xr-x sbagnasc z2   8069 Jun 16 19:20 ZDC.Hits.root
[aliendb3.cern.ch:3307] /proc/296915/ > whereis -o EMCAL.Hits.root
And the file is in Alice::CERN::LCGtest lcg://tbed0115.cern.ch/139fff74-bfb8-11d8-91c1-be96c4d6b0ee

● The full system was tested in the EIS testbed
  • With a somewhat simpler JDL than for production
  • Awkward to do tests on the real production infrastructure

Page 16: Status of the use of LCG


Conclusions & lessons learned - 1

● In the first half of Phase I, LCG (+ grid.it) was able to provide ~50% of computing cycles
  • Fraction was reduced afterwards, but we should be able to have it back

● LCG Workload Management proved itself reasonably robust
  • We never lost a job because of RB failures

● The remote site configuration is still the major source of problems, LCG-side.
  • Things are unlikely to get better with the use of LCG storage…
  • Software management tools are still rudimentary
  • Large sites often have tighter security restrictions & other idiosyncrasies
  • Investigating and fixing problems is hard and time-consuming

Page 17: Status of the use of LCG


Conclusions & lessons learned - 2

● The most difficult part of the management is monitoring LCG through a “keyhole”.
  • Only integrated information available natively
  • MonALISA for AliEn, GridICE for LCG
  • Some safety mechanisms are too coarse for this approach (queue blocking)

● Anti-blackhole mechanisms being studied in grid.it
  • Based on run time statistics, CPUTime/WallClockTime ratio…

● For short jobs, submission time (and thus the interface system performance) can limit the number of jobs
  • But the system is trivially scalable
  • InputData translation may slow down things significantly
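An anti-blackhole check based on run time statistics and the CPUTime/WallClockTime ratio could look roughly like the sketch below. It is only an illustration of the idea: the `JobRecord` type, the function name and every threshold are assumptions chosen for the example, not the mechanism being studied in grid.it.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class JobRecord:
    cpu_seconds: float   # CPU time consumed by the job
    wall_seconds: float  # wall-clock time the job spent on the node


def looks_like_black_hole(
    jobs: Sequence[JobRecord],
    min_jobs: int = 5,                # assumption: need a few jobs to judge
    max_wall_seconds: float = 120.0,  # assumption: "suspiciously short" run
    min_cpu_ratio: float = 0.1,       # assumption: "suspiciously idle" run
) -> bool:
    """Flag a site whose recent jobs mostly ended too quickly or did no work."""
    if len(jobs) < min_jobs:
        return False  # not enough statistics yet
    suspicious = 0
    for job in jobs:
        ratio = job.cpu_seconds / job.wall_seconds if job.wall_seconds > 0 else 0.0
        if job.wall_seconds < max_wall_seconds or ratio < min_cpu_ratio:
            suspicious += 1
    # Flag the site only when more than half of the sample looks wrong.
    return suspicious / len(jobs) > 0.5
```

Requiring a minimum sample and a majority of suspicious jobs is a deliberately coarse choice here; it avoids blocking a queue on the evidence of a single failed job.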