The SAM-Grid / LCG Interoperability Test Bed

Gabriele Garzoglio (garzogli@fnal.gov)
Speaker: Pierre Girard (pierre.girard@in2p3.fr)

Sep 28, 2005 Gabriele Garzoglio

Overview

• The Interoperability Test Bed
  • Motivations
  • Architecture
• Status Report
  • Lessons learned / Problems encountered
  • Still discussing…
• Conclusions

Motivations for the interoperability project

The SAM-Grid is a convenient meta-computing system for the Run II experiments because it offers:
• transparent access to the experiment data through SAM
• integrated application management (job environment preparation, application-sensitive policies, job aggregation)

But deployment is expensive.

The idea: DZero will increase its resource pool within the framework of LCG (EGEE), while relying on the SAM-Grid data and application management.

Basic Architecture

[Diagram: SAM-Grid, SAM-Grid / LCG Forwarding Node, LCG, and SAM-Grid VO-Specific Services. Legend: Flow of Job Submission; Offers services to …]

Main issues to track down:
• Accessibility of the services
• Usability of the resources
• Scalability

Service/Resource Multiplicity

[Diagram: the SAM-Grid with multiple Forwarding Nodes (FW), LCG Clusters (C), and VO-Services (SAM) (S), separated by network boundaries. Legend: Job Flow; Offers Service]

Current Test Bed Configuration

[Diagram: the SAM-Grid with Forwarding Nodes (FW), LCG Clusters (C), and VO-Services (SAM) (S), separated by network boundaries. Sites shown: Wuppertal, CCIN2P3, Clermont-Ferrand, Imperial College, RAL, Lancaster. Legend: Job Flow; Offers Service; Integration in Progress]

Job Scheduling System Adaptation I

• The SAM-Grid sees the FW node as another gateway.
• The SAM-Grid has developed a grid-to-fabric interface (job-manager) that interacts with multiple fabric services (SAM, Monitoring, Environment Preparation): the Batch System is one of them.
• Batch system adaptation is done through a layer of abstraction and implemented via robust local scheduler handlers.
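
As an illustration of the abstraction layer described on this slide, here is a minimal sketch in Python. It is not the SAM-Grid code: the class names, method names, and state labels are all assumptions made for the example. The point is only that the job-manager side talks to one interface, and each batch system plugs in its own handler.

```python
from abc import ABC, abstractmethod


class SchedulerHandler(ABC):
    """Hypothetical interface the job-manager would call for any batch system."""

    @abstractmethod
    def submit(self, job_description: str) -> str:
        """Submit a job and return a local job identifier."""

    @abstractmethod
    def status(self, job_id: str) -> str:
        """Return the current state of a job."""

    @abstractmethod
    def kill(self, job_id: str) -> None:
        """Cancel a job."""


class InMemoryHandler(SchedulerHandler):
    """Toy handler standing in for a real batch system; keeps jobs in a dict."""

    def __init__(self) -> None:
        self._jobs: dict[str, str] = {}

    def submit(self, job_description: str) -> str:
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = "queued"
        return job_id

    def status(self, job_id: str) -> str:
        return self._jobs[job_id]

    def kill(self, job_id: str) -> None:
        self._jobs[job_id] = "killed"


def run_and_cancel(handler: SchedulerHandler, job_description: str) -> list[str]:
    """Job-manager-side code: it only sees the abstract interface."""
    job_id = handler.submit(job_description)
    states = [handler.status(job_id)]
    handler.kill(job_id)
    states.append(handler.status(job_id))
    return states
```

Supporting another batch system would then mean writing one more subclass, leaving `run_and_cancel` and the rest of the caller untouched.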

Job Scheduling System Adaptation II

• This mechanism is flexible enough that it allowed the adaptation of SAM-Grid to LCG.
• Job management (submit, status poll, kill, output gathering, …) is implemented via an LCG “scheduler” handler.
• The handler uses the LCG UI to submit jobs to an LCG broker (logically part of the FW node; in practice it can be anywhere).

Status Report

• We can submit real DZero data reprocessing and Monte Carlo jobs to LCG via SAM-Grid.
• Jobs land on the available LCG clusters.
• Jobs rely on the SAM station at CCIN2P3 to handle input (binaries and data) and output.
• …see the SAM-Grid monitoring

Problems/Lessons Learned I

Scratch management is the responsibility of the site OR the application.

DZero requirements on local scratch space:
• Cannot run on NFS because of intensive I/O
• Need 4 GB of local space

SAM-Grid scratch management:
• SAM-Grid uses job wrappers to do “smart” scratch management (find the best scratch area to use).
• These wrappers rely on the job managers to set up scratch variables ($TMP_DIR, …).
• Under discussion: one criterion for considering a cluster DZero-certified should be that the scratch variables are defined.
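
A rough sketch of what choosing the best scratch area could look like, in Python. This is not the SAM-Grid wrapper. The function names are hypothetical, `TMPDIR` is added as an assumed second candidate next to the `$TMP_DIR` variable named above, and "best" is taken to mean "most free space", which is also an assumption. The sketch does not try to detect NFS mounts.

```python
import os
import shutil

REQUIRED_BYTES = 4 * 1024**3  # the 4 GB local-space requirement quoted above


def pick_scratch(candidates, required=REQUIRED_BYTES, free_bytes=None):
    """Return the existing candidate directory with the most free space,
    or None if no candidate has at least `required` bytes free."""
    if free_bytes is None:
        # Default probe: ask the filesystem how much space is free.
        free_bytes = lambda path: shutil.disk_usage(path).free
    best, best_free = None, -1
    for path in candidates:
        if not path or not os.path.isdir(path):
            continue  # variable unset or directory missing
        free = free_bytes(path)
        if free >= required and free > best_free:
            best, best_free = path, free
    return best


def candidates_from_env(names=("TMP_DIR", "TMPDIR")):
    """Collect candidate directories from scratch variables (names assumed)."""
    return [os.environ.get(name) for name in names]
```

A wrapper could call `pick_scratch(candidates_from_env())` and refuse to start the job when the result is `None`, which is one way a missing scratch variable would show up early.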

Problems/Lessons Learned II

Use of the LCG brokers
• Experienced problems with disk space for the input sandbox (input sandbox 4 MB, all the rest via SAM)
• Needed administrative action to resolve the problem
• Possibly mitigated since we can use multiple brokers (tested with Wuppertal and CCIN2P3 brokers)

Problems/Lessons Learned III

Job Failure Analysis
• In general, for a single SAM-Grid job, the forwarding node submits multiple LCG jobs (aggregation management). The output of all the jobs is bundled together in an output sandbox.
• We observed problems retrieving the output of “aborted” LCG jobs.
• “Maradona” fails in handling the output.
• In this case, it is tough to understand what went wrong with the job.
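
To make the aggregation concrete, here is a small Python sketch of a per-SAM-Grid-job summary that flags LCG jobs whose output could not be retrieved. The record layout, field names, and state labels are assumptions for illustration, not the forwarding node's actual bookkeeping.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class LcgJobResult:
    lcg_job_id: str
    state: str                  # assumed label, e.g. "done" or "aborted"
    output_path: Optional[str]  # None when the output could not be retrieved


def summarize(results):
    """Summarize the LCG jobs spawned for one SAM-Grid job."""
    counts = Counter(r.state for r in results)
    missing = [r.lcg_job_id for r in results if r.output_path is None]
    return {"states": dict(counts), "missing_output": missing}
```

Listing the jobs with missing output separately from the state counts at least tells the operator which jobs need manual follow-up, even when the reason for the failure is not available.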

Problems/Lessons Learned IV

Resubmission of non-reentrant jobs

Some jobs should not be resubmitted in case of failure. They will be recovered as a separate activity.
• Problems overriding retries of job submission from the JDL and the UI configuration
• Is this a known bug? A configuration problem on our part?

Problems/Lessons Learned V

Network configuration
• Sites hosting SAM must allow incoming network traffic from the FW node and from all LCG clusters (worker nodes) to allow data handling control and transport.
• SAM should be modified to provide port range control.
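
Port range control, in general terms, means a service listens only on ports from a range the site firewall has opened. The Python sketch below shows that generic idea with plain TCP sockets. It is not SAM code, and the function name and the example range are assumptions.

```python
import socket


def bind_in_range(host, first_port, last_port):
    """Return a listening TCP socket bound to the first free port in the
    inclusive range, or raise OSError if every port is taken."""
    for port in range(first_port, last_port + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen()
            return sock
        except OSError:
            sock.close()  # port busy or not permitted: try the next one
    raise OSError(f"no free port in {first_port}-{last_port}")
```

For example, `bind_in_range("0.0.0.0", 47000, 47049)` would keep the service inside a 50-port window that a site administrator can open explicitly.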

Problems/Lessons Learned VI

SAM configuration
• SAM can only use TCP-based communication (as expected, UDP does not work in practice on the WAN).
• SAM had to be modified to allow service accessibility for jobs within private networks (pull-based vs call-back interfaces).

Still discussing... I

What does it mean to certify LCG for a certain DZero activity?

• For reprocessing, all the SAM-Grid clusters have undergone an initial certification phase.
• The cluster processes a well-known dataset, then the results are compared with a reference result.
• What do we do for LCG? Should every individual cluster be certified? Should the LCG as a whole be certified?
• The answer probably depends on the type of activity (Reprocessing, Monte Carlo, Analysis, …).
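
The comparison step of such a certification can be sketched in a few lines of Python. The quantity names and the tolerance here are made up for the example. What DZero actually compares, and how strictly, is not stated on the slide.

```python
import math


def certify(results, reference, rel_tol=1e-6):
    """Return the names of quantities that are missing or differ from the
    reference by more than rel_tol; an empty list means the check passed."""
    failures = []
    for name, expected in reference.items():
        actual = results.get(name)
        if actual is None or not math.isclose(actual, expected, rel_tol=rel_tol):
            failures.append(name)
    return failures
```

With a hypothetical reference such as `{"events": 1000, "mean_pt": 25.3}`, a cluster would pass when `certify(cluster_results, reference)` returns an empty list. The same function works whether it is run once per cluster or once for LCG as a whole.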

Still discussing... II

Who operates the SAM-Grid / LCG interoperability system?

• For the SAM-Grid DZero reprocessing, people at the facilities had an interest in having their resources utilized: people at each facility have run operations submitting jobs to their own facilities.
• Running “operations” means being responsible for the production of the data (routine job submission/monitoring, troubleshooting, facility maintenance/upgrade, …).
• How do we organize the people that operate the LCG interoperability system? Is one responsible person enough?

Still discussing... III

Support on LCG
• In case something goes wrong on the LCG, DZero has to learn the best channels to request support.
• What response can DZero expect now and in 2 years?
• As the system becomes more complex, it becomes difficult for the operators to pinpoint the reasons for job failures. LCG will get reports for failures on the SAM-Grid side… and vice versa.

Conclusions / SAM

We are moving the test bed to “production” by
• expanding the system
• ramping up usage

We are discussing open issues in operating the interoperability system
• LCG certification
• Organizing the operations
• Obtaining support for LCG problems

Our principal target production application is Monte Carlo for DZero

Conclusions / LCG

Grid batch job environment variables
• Proposal for standardization made at the last HEPIX and the last Operations Workshop (Bologna)
  • http://edms.cern.ch/document/630962
• What is the next step? How to proceed with implementation?

Make MW error handling easier
• By using a well-defined set of MW error codes?
• Suitable for automatic handling
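
To show why a fixed set of error codes helps automatic handling, here is a toy Python sketch. The codes, the actions, and the policy table are all invented for the example. No such list is defined in the source or, as far as this sketch assumes, in the middleware itself.

```python
from enum import Enum


class MwError(Enum):
    """Hypothetical middleware error codes, for illustration only."""
    SANDBOX_DISK_FULL = 1
    OUTPUT_RETRIEVAL_FAILED = 2
    SCRATCH_UNDEFINED = 3


class Action(Enum):
    RETRY_OTHER_BROKER = "retry on another broker"
    NOTIFY_OPERATOR = "notify operator"
    SKIP_CLUSTER = "skip cluster"


# Assumed policy: which automatic reaction goes with which code.
POLICY = {
    MwError.SANDBOX_DISK_FULL: Action.RETRY_OTHER_BROKER,
    MwError.OUTPUT_RETRIEVAL_FAILED: Action.NOTIFY_OPERATOR,
    MwError.SCRATCH_UNDEFINED: Action.SKIP_CLUSTER,
}


def decide(code):
    """Map an error code to an action; unknown codes go to a human."""
    return POLICY.get(code, Action.NOTIFY_OPERATOR)
```

With free-text error messages, a table like `POLICY` cannot be written reliably, which is the practical argument for agreeing on codes first.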

More info at…

http://www-d0.fnal.gov/computing/grid/doc/SAMGrid-LCG-integration.pdf

http://www-d0.fnal.gov/computing/grid/doc/SAMGrid-LCG-integration-Lyon-report.pdf

http://samgrid.fnal.gov:8080/

http://www-d0.fnal.gov/computing/grid/

http://d0db.fnal.gov/sam/