The SAM-Grid and the use of Condor-G as a grid job management middleware

Gabriele Garzoglio for the SAM-Grid Team
Fermilab, Computing Division

Apr 16, 2004 Gabriele Garzoglio

Overview

Computation in High Energy Physics
The SAM-Grid computing infrastructure
The Job Management and Condor-G
Real life experience
Future work

High Energy Physics Challenges

High Energy Physics studies the fundamental interactions of Nature.
A few laboratories around the world each provide unique facilities (accelerators) to study particular aspects of the field: the collaborations are geographically distributed.
Experiments become more challenging and expensive every decade: the collaborations are large groups of people.
The phenomena studied are statistical in nature and the events of interest are very rare: a lot of data/statistics is needed.

A HEP laboratory: Fermilab

FNAL Run II detectors

DZero

FNAL Run II detectors

The Size of the D0 Collaboration

~500 physicists
72 institutions
18 countries

DZero and CDF Institutions

Data size for the D0 Experiment

Detector data
  1,000,000 channels
  Event size 250 KB
  Event rate ~50 Hz
  On-line data rate 12 MBps
  100 TB/year

Total data (detector, reconstructed, simulated)
  400 TB/year
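The detector figures above can be cross-checked with a little arithmetic. The sketch below multiplies event size by event rate to recover the on-line data rate, then compares a full year at that rate with the quoted 100 TB/year. The resulting "live fraction" is derived here and is not stated in the slides; decimal units (1 MB = 1000 KB) are an assumption.

```python
# Back-of-the-envelope check of the detector data figures.
# Assumptions (not from the slides): decimal units, 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 3600


def online_rate_mb_per_s(event_size_kb, event_rate_hz):
    """Data rate in MB/s for a given event size (KB) and event rate (Hz)."""
    return event_size_kb * event_rate_hz / 1000.0


def implied_live_fraction(yearly_tb, rate_mb_per_s):
    """Fraction of the year the detector would need to record at full rate."""
    full_year_tb = rate_mb_per_s * SECONDS_PER_YEAR / 1e6
    return yearly_tb / full_year_tb


rate = online_rate_mb_per_s(250, 50)          # 12.5 MB/s, quoted as 12 MBps
fraction = implied_live_fraction(100, rate)   # roughly one quarter
print(rate, round(fraction, 2))
```

Under these assumptions, 100 TB/year corresponds to recording at full rate for about a quarter of the year.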

Typical DZero activities

Activity        Description      Community  Load       Time/job
Reconstruction  data filtering   small      CPU & I/O  10 hours
Montecarlo      data simulation  small      CPU        10 hours
Analysis        data mining      large      CPU & I/O  hours

Activity        Input/Job  Output/Job  Input/Year  Output/Year
Reconstruction  GB         GB          100s TB     100s TB
Montecarlo      None       10 GB       None        TB
Analysis        100 GB     GB          varies      varies

Overview

Computation in High Energy Physics
The SAM-Grid computing infrastructure
The Job Management and Condor-G
Real life experience
Future work

The SAM-Grid Project

Mission: enable fully distributed computing for DZero and CDF
Strategy: enhance the distributed data handling system of the experiments (SAM), incorporating standard Grid tools and protocols, and developing new solutions for Grid computing (JIM)
History: SAM from 1997, JIM from end of 2001
Funds: the Particle Physics Data Grid (US) and GridPP (UK)
People: computer scientists and physicists from Fermilab and the collaborating institutions

Overview

Computation in High Energy Physics
The SAM-Grid computing infrastructure
The Job Management and Condor-G
Real life experience
Future work

Job Management: Requirements

Foster site autonomy
Operate in batch mode: submit and disconnect
Reliability: handle the job request persistently; execute it and retrieve output and/or errors
Flexible automatic resource selection: optimization of various metrics/policies
Fault tolerance: transient service disruption; automatic rematching and resubmitting capabilities
Automatic execution of complex interdependent job structures

Service Architecture

[Diagram: the SAM-Grid service architecture, showing the flow of jobs, data, and meta-data between the following components.
Job handling: User Interface, Submission, Grid Client, Global Job Queue, Resource Selector, Match Making, Info Collector, Info Gatherer.
Data handling: Global DH Services (SAM Naming Server, SAM Log Server, Resource Optimizer, SAM DB Server, RC MetaData Catalog, Bookkeeping Service), SAM Station (+ other services), SAM Stager(s), MSS, Cache.
Site: Grid Gateway, Local Job Handling, Local Job Handler (CAF, D0MC, BS, ...), JIM Advertise, Cluster, Worker Nodes, AAA, Dist. FS.
Information and monitoring: Info Manager, XML DB server (site configuration, global/local JID map, ...), Info Providers, MDS, Web Server, Grid Monitoring, User Tools.]

Technological choices (2001)

Low level resource management: Globus GRAM. Clearly not enough...
Condor-G: right components and functionalities, but not enough in 2001...
DZero and the Condor Team have been collaborating since then, under the auspices of PPDG, to address the requirements of a large distributed system with distributively owned and shared resources.

Condor-G: added functionalities I

Use of the Condor Match Making Service (MMS) as Grid Resource Selector
  Advertisement of grid site capabilities to the MMS
  Dynamic $$(gatekeeper) selection for jobs specifying requirements on grid sites

Concurrent submission of multiple jobs to the same grid resource
  At any given moment, a grid site is capable of accepting up to N jobs
  The MMS was modified to push up to N jobs to the same site in the same negotiation cycle
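The two ideas on this slide can be illustrated with a toy model. The sketch below is not Condor code: it assumes jobs and sites are plain dictionaries, treats requirements as simple equality checks, and uses hypothetical site and host names. It shows a `$$(attr)` placeholder being filled from the matched site's advertisement, and a single negotiation cycle handing up to N jobs to the same site.

```python
import re


def matches(job, site):
    """A job matches a site when every required attribute has the required value."""
    return all(site.get(key) == value for key, value in job["requirements"].items())


def substitute(template, site):
    """Replace each $$(attr) in the template with the matched site's value for attr."""
    return re.sub(r"\$\$\((\w+)\)", lambda m: str(site[m.group(1)]), template)


def negotiate(jobs, sites, max_per_site):
    """One toy negotiation cycle: each site may receive up to max_per_site jobs."""
    accepted = {site["name"]: 0 for site in sites}
    assignments = []
    for job in jobs:
        for site in sites:
            if accepted[site["name"]] < max_per_site and matches(job, site):
                accepted[site["name"]] += 1
                resource = substitute(job["globus_resource"], site)
                assignments.append((job["id"], resource))
                break
    return assignments


# Hypothetical advertisement and jobs, loosely modelled on the attribute names
# shown later in the talk.
sites = [{"name": "site_a", "station_name_": "a-analysis",
          "gatekeeper_url_": "gk.a.example:2119/jobmanager"}]
jobs = [{"id": i, "requirements": {"station_name_": "a-analysis"},
         "globus_resource": "$$(gatekeeper_url_)"} for i in range(3)]
print(negotiate(jobs, sites, max_per_site=2))
```

With `max_per_site=2`, the first two jobs are matched in this cycle and the third waits for a later one.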

Condor-G: added functionalities II

Flexible Match Making logic
  The job/resource match criteria should be arbitrarily complex (based on more info than what fits in the classad), stateful (remembering match history), and “pluggable” (by administrators and users)
  Example: send the job where most of the data are. The MMS contacts the site data handling service to rank a job/site match
  This leads to a very thin and flexible “grid broker”
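A "pluggable" rank of this kind can be pictured as a function supplied to the broker from outside. The sketch below is an illustration only: the in-memory `cached_files` dictionary stands in for a call to a site's data handling service, and all function and file names are hypothetical.

```python
from typing import Callable, Dict, List, Set

# A rank plug-in takes (job, site) and returns a score; higher is better.
RankFn = Callable[[dict, dict], float]


def make_data_locality_rank(cached_files: Dict[str, Set[str]]) -> RankFn:
    """Build a rank function scoring a site by the fraction of the job's input it holds.

    cached_files is a stand-in for querying each site's data handling service.
    """
    def rank(job: dict, site: dict) -> float:
        wanted = set(job["input_files"])
        if not wanted:
            return 0.0
        held = cached_files.get(site["name"], set())
        return len(wanted & held) / len(wanted)
    return rank


def best_site(job: dict, sites: List[dict], rank_fns: List[RankFn]) -> dict:
    """Pick the site with the highest combined score over all plugged-in ranks."""
    return max(sites, key=lambda site: sum(fn(job, site) for fn in rank_fns))


cached = {"lyon": {"f1", "f2", "f3"}, "manchester": {"f1"}}
job = {"input_files": ["f1", "f2", "f3", "f4"]}
sites = [{"name": "manchester"}, {"name": "lyon"}]
print(best_site(job, sites, [make_data_locality_rank(cached)])["name"])
```

Because the broker only sums scores from the functions it is given, administrators or users could add further criteria without changing the broker itself, which is the sense in which it stays thin.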

Condor-G: added functionalities III

Light clients
  A user should be able to submit a job from a laptop and turn it off
  Client software (condor_submit, etc.) and queuing service (condor_schedd) should be on different machines
  This leads to a 3-tier architecture for Condor-G: client, queuing, execution sites. Security was implemented via X509.

Condor-G: added functionalities IV

Resubmission/Rematching logic
  If the MMS matched a job to a site that cannot accept it after N submission attempts, the job should be rematched to a different site
  Flexible penalization of already failed matches
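The resubmit-then-rematch behaviour can be sketched as a small loop. This is a toy model under stated assumptions, not the Condor-G implementation: `rank` and `submit` are caller-supplied callables, and the linear penalty per failed match is just one possible penalization policy.

```python
def run_with_rematch(job, sites, rank, submit,
                     tries_per_site=3, max_matches=5, penalty=10.0):
    """Match a job, retry submission, and rematch with a penalty on failed sites.

    rank(job, site) returns a score (higher is better); submit(job, site)
    returns True on success. Both are hypothetical hooks supplied by the caller.
    Returns (name of the accepting site or None, failed-match counts per site).
    """
    failures = {site["name"]: 0 for site in sites}
    for _ in range(max_matches):
        # Penalize sites in proportion to the number of matches that already failed.
        site = max(sites, key=lambda s: rank(job, s) - penalty * failures[s["name"]])
        for _ in range(tries_per_site):
            if submit(job, site):
                return site["name"], failures
        failures[site["name"]] += 1
    return None, failures


# Example: site A ranks higher but never accepts the job; site B accepts it.
attempts = []


def fake_submit(job, site):
    attempts.append(site["name"])
    return site["name"] == "B"


sites = [{"name": "A", "score": 2.0}, {"name": "B", "score": 1.0}]
result = run_with_rematch({}, sites, lambda job, s: s["score"], fake_submit)
print(result, attempts)
```

In the example, the job is tried three times at A, A is penalized, and the next match sends the job to B.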

Overview

Computation in High Energy Physics
The SAM-Grid computing infrastructure
The Job Management and Condor-G
Real life experience
Future work

[Diagram: the path of a JOB through the job management components. Labels: Job Management, User Interface, Submission Client, Queuing System, Broker, Match Making Service (with ext. logic), Information Collector, Grid Sensors, Execution Site #1 ... Execution Site #n, Computing Element, Storage Element, Data Handling System.]

MyType                 "Machine"
TargetType             "Job"
Name                   "ccin2p3-analysis.d0.prd.jobmanager-runjob"
gatekeeper_url_        "ccd0.in2p3.fr:2119/jobmanager-runjob"
DbURL                  "http://ccd0.in2p3.fr:7080/Xindice"
sam_nameservice_       "IOR:000000000000002a49444c3........."
station_name_          "ccin2p3-analysis"
station_experiment_    "d0"
station_universe_      "prd"
cluster_architecture_  "Linux+2.4"
cluster_name_          "LyonsGrid"
local_storage_path_    "/samgrid/disk"
local_storage_node_    "ccd0.in2p3.fr"
schema_version_        "1_1"
site_name_             "ccin2p3"
...

MyType          "Job"
TargetType      "Machine"
ClusterId       304
JobType         "montecarlo"
GlobusResource  "$$(gatekeeper_url_)"
Requirements    (TARGET.station_name_ == "ccin2p3-analysis" && ...)
Rank            0.000000
station_univ    "prd"
station_ex      "d0"
RequestId       "11866"
ProjectId       "sam_ccd0_012457_25321_0"
DbURL           "$$(DbURL)"
cert_subject    "/DC=org/DC=doegrids/OU=People/CN=Aditya Nishandar ..."
Env             "MATCH_RESOURCE_NAME=$$(name);\
                 SAM_STATION=$$(station_name_);\
                 SAM_USER_NAME=aditya;..."
Args            "--requestId=11866" "--gridId=sam_ccd0_012457" ...
...

job_type = montecarlo
station_name = ccin2p3-analysis
runjob_requestid = 11866
runjob_numevts = 10000
d0_release_version = p14.05.01
jobfiles_dataset = san_jobset2
minbias_dataset = ccin2p3_minbias_dataset
sam_experiment = d0
sam_universe = prd
group = test
instances = 1
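The slide places a user-level job description next to the job classad that is eventually queued. A minimal sketch of how one could be turned into the other follows. The parsing of "key = value" lines is straightforward, but the attribute mapping in `to_job_ad` is only inferred from the values that appear in both listings; it is an assumption, not the actual SAM-Grid client logic.

```python
def parse_job_description(text):
    """Parse 'key = value' lines into a dict, skipping blank or malformed lines."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def to_job_ad(params):
    """Build a few job classad attributes (hypothetical mapping, see lead-in)."""
    return {
        "MyType": "Job",
        "TargetType": "Machine",
        "JobType": params["job_type"],
        # Left as a placeholder: resolved from the matched site's advertisement.
        "GlobusResource": "$$(gatekeeper_url_)",
        "Requirements": 'TARGET.station_name_ == "%s"' % params["station_name"],
        "RequestId": params["runjob_requestid"],
        "station_univ": params["sam_universe"],
        "station_ex": params["sam_experiment"],
    }


description = """
job_type = montecarlo
station_name = ccin2p3-analysis
runjob_requestid = 11866
sam_experiment = d0
sam_universe = prd
"""
job_ad = to_job_ad(parse_job_description(description))
for name, value in job_ad.items():
    print(name, value)
```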

Montecarlo Production Statistics

Started at the beginning of 2004.
Ramped up in March.
3 sites: Wisconsin (...via Miron), Manchester, Lyon. New sites are joining (UTA, LU, OU, LTU, ...)
Inefficiency due to the Grid infrastructure « 5%
30 GB/week = 80,000 events/week (about 1/4 of total production)

Overview

Computation in High Energy Physics
The SAM-Grid computing infrastructure
The Job Management and Condor-G
Real life experience
Future work

Future work of DZero with Condor

Use of DAGMan to automate the management of interdependent grid job structures.
Address potential scalability limits.
Investigate non-central brokering service via grid flocking.
Integrate/Implement a proxy management infrastructure (e.g. MyProxy).
All the rest (...fix bugs, improve error reporting, hand holding, sailing...)
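To make "interdependent grid job structures" concrete, the sketch below orders a set of jobs so that each one runs only after the jobs it depends on. It is a plain topological sort written for illustration, not DAGMan itself, and the job names are hypothetical.

```python
from collections import deque


def execution_order(parents):
    """Return an order in which every job comes after all of its parents.

    parents maps each job name to the set of jobs that must finish first.
    Raises ValueError if the dependencies contain a cycle.
    """
    pending = {job: set(deps) for job, deps in parents.items()}
    for deps in parents.values():
        for dep in deps:
            pending.setdefault(dep, set())  # jobs that only appear as parents
    ready = deque(sorted(job for job, deps in pending.items() if not deps))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for other in sorted(pending):
            if job in pending[other]:
                pending[other].remove(job)
                if not pending[other]:
                    ready.append(other)
    if len(order) != len(pending):
        raise ValueError("dependency cycle")
    return order


# Hypothetical structure: two simulation jobs feed one merge job.
print(execution_order({
    "simulate_a": {"generate"},
    "simulate_b": {"generate"},
    "merge": {"simulate_a", "simulate_b"},
}))
```

A workflow manager additionally has to submit, monitor, and retry each job; the ordering shown here is only the dependency-resolution part.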

Conclusions

The collaboration between DZero and the Condor team has been very fruitful since 2001.
DZero has worked together with Condor to enhance the Condor-G framework, in order to address the requirements on distributed computing of a large HEP experiment.
DZero is running “production” jobs on the Grid.

Acknowledgments

Condor Team
PPDG
DZero
CDF

More info at…

http://www-d0.fnal.gov/computing/grid/

http://samgrid.fnal.gov:8080/

http://d0db.fnal.gov/sam/