The SAM-Grid / LCG Interoperability Test Bed
Gabriele Garzoglio ([email protected])
Speaker: Pierre Girard ([email protected])
Sep 28, 2005 Gabriele Garzoglio
Overview
The Interoperability Test Bed
• Motivations
• Architecture
Status Report
• Lessons learned / Problems encountered
• Still discussing…
Conclusions
Motivations for the interoperability project
The SAM-Grid is a convenient meta-computing system for the RunII experiments because it offers…
• …transparent access to the experiment data through SAM
• …integrated application management (job environment preparation, application-sensitive policies, job aggregation)
But deployment is expensive…
The idea: DZero will increase its resource pool within the framework of LCG (EGEE), while relying on the SAM-Grid data and application management
Basic Architecture
[Diagram components: SAM-Grid; LCG; SAM-Grid / LCG Forwarding Node; SAM-Grid VO-Specific Services. Arrow legend: Flow of Job Submission; Offers services to …]
Main issues to track down:
• Accessibility of the services
• Usability of the resources
• Scalability
Service/Resource Multiplicity
[Diagram: the SAM-Grid with several Forwarding Nodes (FW), LCG Clusters (C) and VO-Services (S). Legend: Network Boundaries; Forwarding Node (FW); LCG Cluster (C); VO-Service (SAM) (S); Job Flow; Offers Service]
Current Test Bed Configuration
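[Diagram: the current test bed, showing the SAM-Grid, Forwarding Nodes (FW), LCG Clusters (C) and VO-Services (S). Sites labeled: Wuppertal, CCIN2P3, Clermont-Ferrand, Imperial College, RAL, Lancaster. Legend: Network Boundaries; Forwarding Node; LCG Cluster; Integration in Progress; VO-Service (SAM); Job Flow; Offers Service]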
Job Scheduling System Adaptation I
The SAM-Grid sees the FW node as another gateway.
The SAM-Grid has developed a grid-to-fabric interface (job-manager) that interacts with multiple fabric services (SAM, Monitoring, Environment Preparation): the Batch System is one of them.
Batch system adaptation is done through a layer of abstraction and implemented via robust local scheduler handlers.
Job Scheduling System Adaptation II
This mechanism is flexible enough that it allowed the adaptation of SAM-Grid to LCG.
Job management (submit, status poll, kill, output gathering, …) is implemented via an LCG “scheduler” handler.
The handler uses the LCG UI to submit jobs to an LCG broker (logically part of the FW node; in practice it can be anywhere).
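The handler idea on these two slides can be pictured with a minimal Python sketch. Everything here is illustrative: the class names, the method names and the injected `run_command` callable are assumptions, and the action strings are placeholders rather than real LCG UI commands. The point is only that the job-manager talks to one abstract interface, and the LCG case is one more implementation of it.

```python
from abc import ABC, abstractmethod


class SchedulerHandler(ABC):
    """Hypothetical interface: what the job-manager needs from any batch system."""

    @abstractmethod
    def submit(self, job_description): ...

    @abstractmethod
    def status(self, job_id): ...

    @abstractmethod
    def kill(self, job_id): ...

    @abstractmethod
    def gather_output(self, job_id): ...


class LCGHandler(SchedulerHandler):
    """Treats LCG as one more 'batch system'.

    run_command is an injected callable standing in for the LCG UI;
    the action names passed to it are placeholders, not real commands.
    """

    def __init__(self, run_command):
        self._run = run_command

    def submit(self, job_description):
        return self._run("submit", job_description)

    def status(self, job_id):
        return self._run("status", job_id)

    def kill(self, job_id):
        return self._run("kill", job_id)

    def gather_output(self, job_id):
        return self._run("get-output", job_id)
```

Because the UI access is injected, the same handler can be exercised against a fake runner in tests and against a real command wrapper in operation.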
Status Report
We can submit real DZero data reprocessing and Monte Carlo jobs to LCG via SAM-Grid
• Jobs land on the available LCG clusters
• Jobs rely on the SAM station at CCIN2P3 to handle input (binaries and data) and output
…see the SAM-Grid monitoring
Problems/Lessons Learned I
Scratch management is the responsibility of the site OR the application.
DZero requirements on local scratch space
• Cannot run on NFS because of intensive I/O
• Need 4 GB of local space
SAM-Grid uses job wrappers to do “smart” scratch management (find the best scratch area to use)
• These wrappers rely on the job managers to set up scratch variables ($TMP_DIR, …)
• Under discussion: one criterion for considering a cluster DZero-certified should be that the scratch variables are defined
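As a rough illustration of what such a wrapper might do, the Python sketch below picks a scratch area from a list of environment variables. It is not the SAM-Grid wrapper: the function name, the candidate list and the "most free space" policy are assumptions. Only `$TMP_DIR` and the 4 GB requirement come from the slide, and the NFS check is deliberately left out because detecting it portably is site-specific.

```python
import os
import shutil

REQUIRED_BYTES = 4 * 1024**3  # the 4 GB local-space requirement from the slide


def pick_scratch(env, candidates=("TMP_DIR",), required=REQUIRED_BYTES,
                 free_bytes=None):
    """Return (variable, path) of the candidate with the most free space.

    env        -- mapping of environment variables (e.g. os.environ)
    candidates -- variable names to try; only TMP_DIR is from the slide
    free_bytes -- optional callable(path) -> free bytes, for testing
    Returns None when no candidate is defined, exists and has enough room.
    """
    if free_bytes is None:
        free_bytes = lambda path: shutil.disk_usage(path).free
    best = None
    for var in candidates:
        path = env.get(var)
        if not path or not os.path.isdir(path):
            continue  # variable undefined, or it points nowhere useful
        free = free_bytes(path)
        if free >= required and (best is None or free > best[2]):
            best = (var, path, free)
    return None if best is None else best[:2]
```

A wrapper would call `pick_scratch(os.environ, ...)` and fail early with a clear message when it returns `None`, which is the situation the certification discussion is meant to avoid.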
Problems/Lessons Learned II
Use of the LCG brokers
• Experienced problems with disk space for the input sandbox (input sandbox 4 MB, all the rest via SAM)
• Needed administrative action to resolve the problem
• Possibly mitigated since we can use multiple brokers (tested with Wuppertal and CCIN2P3 brokers)
Problems/Lessons Learned III
Job Failure Analysis
• In general, for a single SAM-Grid job, the forwarding node submits multiple LCG jobs (aggregation management). The output of all the jobs is bundled together in an output sandbox.
• We observed problems retrieving the output of “aborted” LCG jobs
• “Maradona” fails in handling the output
• In this case, it is tough to understand what went wrong with the job
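The bundling step, and why a missing output hurts, can be sketched in Python as follows. This is a toy model under stated assumptions: the function name, the archive layout and the `MISSING` marker file are invented for illustration and do not describe the actual forwarding-node code.

```python
import io
import tarfile


def bundle_outputs(job_outputs):
    """Bundle per-job outputs into one gzip-compressed tar archive.

    job_outputs maps an LCG job id to its output bytes, or to None when
    the output could not be retrieved (for example for an aborted job).
    Returns (archive_bytes, missing_ids).
    """
    buf = io.BytesIO()
    missing = []
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for job_id in sorted(job_outputs):
            data = job_outputs[job_id]
            if data is None:
                # Keep a marker so the gap is visible in the bundle
                missing.append(job_id)
                data = b"output not retrieved\n"
                name = f"{job_id}/MISSING"
            else:
                name = f"{job_id}/stdout"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue(), missing
```

Recording the missing job ids explicitly at least tells the operator which LCG jobs need manual follow-up, even when the cause of the failure cannot be recovered from the sandbox.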
Problems/Lessons Learned IV
Resubmission of non-reentrant jobs
• Some jobs should not be resubmitted in case of failure. They will be recovered as a separate activity
• Problems overriding the job submission retry settings from the JDL and the UI configuration
• Is this a known bug? A configuration problem on our part?
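For reference, a job description that asks for no automatic resubmission might be generated as in the Python sketch below. It assumes that the JDL `RetryCount` attribute is the relevant setting; the slide does not say which attribute or UI option was actually used, and the function name and the executable name in the usage line are hypothetical.

```python
def render_jdl(executable, arguments="", retry_count=0):
    """Render a minimal JDL job description as text.

    retry_count=0 requests no automatic resubmission, assuming the
    RetryCount attribute is honored by the broker (which is exactly
    what the slide reports trouble with).
    """
    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        f"RetryCount = {int(retry_count)};",
    ]
    return "\n".join(lines) + "\n"
```

For example, `render_jdl("d0_wrapper.sh", "--request 42")` yields a three-line description ending in `RetryCount = 0;`.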
Problems/Lessons Learned V
Network configuration
• Sites hosting SAM must allow incoming network traffic from the FW node and from all LCG clusters (worker nodes) to allow data handling control and transport
• SAM should be modified to provide port range control
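Port range control usually means that a service only listens on ports inside a range that the site firewall has opened. A minimal Python sketch of that idea, not taken from SAM and with a hypothetical function name, could look like this:

```python
import socket


def bind_in_range(low, high, host="127.0.0.1"):
    """Bind a TCP listening socket to the first free port in [low, high].

    Returns (socket, port); raises OSError when the whole range is busy.
    """
    for port in range(low, high + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            sock.close()  # port taken or not permitted, try the next one
            continue
        sock.listen()
        return sock, port
    raise OSError(f"no free port in {low}-{high}")
```

With this kind of control, a site only needs to open the agreed range to the FW node and the worker nodes, rather than allowing arbitrary incoming ports.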
Problems/Lessons Learned VI
SAM configuration
• SAM can only use TCP-based communication (as expected, UDP does not work in practice on the WAN)
• SAM had to be modified to allow service accessibility for jobs within private networks (pull-based vs call-back interfaces)
Still discussing... I
What does it mean to certify LCG for a certain DZero activity?
For reprocessing, all the SAM-Grid clusters have undergone an initial certification phase
• The cluster processes a well-known dataset, then the results are compared with a reference result
What do we do for LCG? Should every individual cluster be certified? Should the LCG as a whole be certified?
• The answer probably depends on the type of activity (Reprocessing, Monte Carlo, Analysis, …)
Still discussing... II
Who operates the SAM-Grid / LCG interoperability system?
For the SAM-Grid DZero reprocessing, people at the facilities had an interest in having their resources utilized: people at each facility have run operations, submitting jobs to their own facilities
• Running “operations” means being responsible for the production of the data (routine job submission/monitoring, troubleshooting, facility maintenance/upgrade, …)
How do we organize the people that operate the LCG interoperability system? Is one responsible person enough?
Still discussing... III
Support on LCG
• In case something goes wrong on the LCG, DZero has to learn the best channels to request support
• What response can DZero expect now and in 2 years?
• As the system becomes more complex, it becomes difficult for the operators to pinpoint the reasons for job failures. LCG will get reports for failures on the SAM-Grid side… and vice versa.
Conclusions / SAM
We are moving the test bed to “production” by
• expanding the system
• ramping up usage
We are discussing open issues in operating the interoperability system
• LCG certification
• Organizing the operations
• Obtaining support for LCG problems
Our principal target production application is Monte Carlo for DZero
Conclusions / LCG
Grid batch job environment variables
• Proposal for standardization made at the last HEPIX and the last Operations Workshop (Bologna): http://edms.cern.ch/document/630962
• What is the next step? How to proceed with implementation?
Make MW error handling easier
• By using a well-defined set of MW error codes?
• Suitable for automatic handling
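To make the "automatic handling" point concrete, here is a small Python sketch of what a client could do if middleware failures came with a fixed set of codes. The codes, their numbers and the actions are all invented for illustration; no such standard set is defined in these slides.

```python
from enum import Enum


class MWError(Enum):
    # Hypothetical codes, loosely named after problems reported in this talk
    SANDBOX_DISK_FULL = 1
    OUTPUT_RETRIEVAL_FAILED = 2
    JOB_ABORTED = 3


# Hypothetical policy: one automatic action per known code
ACTIONS = {
    MWError.SANDBOX_DISK_FULL: "try another broker",
    MWError.OUTPUT_RETRIEVAL_FAILED: "flag for manual inspection",
    MWError.JOB_ABORTED: "schedule recovery",
}


def handle(code):
    """Map a numeric middleware error code to an action string."""
    try:
        return ACTIONS[MWError(code)]
    except ValueError:
        # Unknown code: nothing can be decided automatically
        return "escalate to operator"
```

With free-form error messages, the equivalent logic would have to parse text that differs between middleware versions, which is what makes automation fragile today.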
More info at…
http://www-d0.fnal.gov/computing/grid/doc/SAMGrid-LCG-integration.pdf
http://www-d0.fnal.gov/computing/grid/doc/SAMGrid-LCG-integration-Lyon-report.pdf
http://samgrid.fnal.gov:8080/
http://www-d0.fnal.gov/computing/grid/
http://d0db.fnal.gov/sam/