JSS Job Submission Service Massimo Sgaravatto INFN Padova

Page 1: JSS Job Submission Service Massimo Sgaravatto INFN Padova

JSS Job Submission Service

Massimo Sgaravatto
INFN Padova

Page 2: JSS Job Submission Service Massimo Sgaravatto INFN Padova

JSS

Wrapper of Condor-G identified as JSS for Testbed 1
Condor-G is a Personal Condor enhanced with Globus services
  Used to submit jobs from the user workstation to remote Globus resources
  Condor-G keeps track of the progress of these jobs

Page 3: JSS Job Submission Service Massimo Sgaravatto INFN Padova

Condor-G Architecture

[Diagram: the user commands condor_submit, condor_q and condor_rm; the Condor Master, Condor Schedd and Condor GridManager; three Globus resources. One GridManager per user.]

Page 4: JSS Job Submission Service Massimo Sgaravatto INFN Padova

Condor-G commands

condor_submit CondorSubmitFile
  To submit jobs to a Globus resource
condor_q {id}
  To monitor the status of the job(s)
condor_rm id
  To remove the job from the queue
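The three commands above can be driven from a script. The following is a minimal sketch, not part of JSS: the function name `condor_g_command` and the action names are hypothetical, and the function only builds the argument list, so it can be inspected without a Condor-G installation.

```python
def condor_g_command(action, target=None):
    """Build the argument list for one of the three commands on the slide.

    `action` is a hypothetical label ("submit", "query", "remove");
    `target` is the submit file for "submit" and the job id otherwise.
    """
    if action == "submit":
        if target is None:
            raise ValueError("submit needs a submit file")
        return ["condor_submit", str(target)]
    if action == "query":
        # The id is optional on the slide ({id}): without it the whole
        # queue is listed.
        return ["condor_q"] + ([str(target)] if target is not None else [])
    if action == "remove":
        if target is None:
            raise ValueError("remove needs a job id")
        return ["condor_rm", str(target)]
    raise ValueError(f"unknown action: {action}")


submit_argv = condor_g_command("submit", "myfile")
query_argv = condor_g_command("query")
remove_argv = condor_g_command("remove", 12)
```

On a machine where Condor-G is installed, each list could be passed to `subprocess.run(argv, check=True)`.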

Page 5: JSS Job Submission Service Massimo Sgaravatto INFN Padova

Example

condor_submit myfile

myfile:

    Universe = globus
    TransferExecutable = True
    Executable = /home/userx/startsim.sh
    TransferInput = True
    Input = /home/userx/inp.$(Process)
    TransferOutput = False
    Output = /data/out.$(Process)
    TransferError = True
    Error = /home/userx/error.$(Process)
    Environment = CMSVER=118
    Log = /home/userx/log.$(Process)
    Arguments = 123
    GlobusRSL = (queue=cmsprod)
    GlobusScheduler = pcmsfarm01.pi.infn.it/jobmanager-lsf
    Queue 10
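As an illustration of what this submit file describes, the sketch below renders a few of the slide's settings from a Python dictionary and expands the `$(Process)` macro. It relies on the Condor convention that `Queue N` creates processes numbered 0 to N-1; the helper names `render_submit_file` and `expand_process` are hypothetical, and only a subset of the slide's settings is shown.

```python
def render_submit_file(settings, queue_count):
    """Render 'Name = value' lines followed by the Queue statement."""
    lines = [f"{key} = {value}" for key, value in settings.items()]
    lines.append(f"Queue {queue_count}")
    return "\n".join(lines) + "\n"


def expand_process(template, queue_count):
    """List the values a $(Process) template takes for 'Queue N'."""
    return [template.replace("$(Process)", str(n)) for n in range(queue_count)]


# A subset of the settings shown on the slide.
settings = {
    "Universe": "globus",
    "Executable": "/home/userx/startsim.sh",
    "Input": "/home/userx/inp.$(Process)",
    "Output": "/data/out.$(Process)",
    "GlobusRSL": "(queue=cmsprod)",
    "GlobusScheduler": "pcmsfarm01.pi.infn.it/jobmanager-lsf",
}

submit_text = render_submit_file(settings, 10)
input_files = expand_process(settings["Input"], 10)
```

With `Queue 10`, the ten jobs read `/home/userx/inp.0` through `/home/userx/inp.9`, so those input files need to exist before submission.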

Page 6: JSS Job Submission Service Massimo Sgaravatto INFN Padova

Condor-G job log file

Info reported
  When the job has been inserted in the Condor-G queue
  The IP address of the submitting machine (Condor-G machine)
  When the job has started its execution
  The host name of the gatekeeper machine where the job has been submitted (could be different from the actual executing machine)
  When the job has completed its execution
Condor-G relies on both callbacks and polling to create this log file
Library already available to “parse” this job log file
  Not tested yet
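To make the log contents concrete, here is a minimal parsing sketch. It is not the library mentioned on the slide. It assumes the classic Condor user-log layout, in which each event starts with a three-digit code, a `(cluster.proc.subproc)` job id, a date and a time, and events are separated by `...` lines; the sample log, its host names, and the `EVENT_NAMES` table are illustrative assumptions.

```python
import re

# Assumed header layout: "000 (012.000.000) 03/15 10:02:11 free text".
EVENT_HEADER = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.\d+\) "
    r"(?P<date>\S+) (?P<time>\S+) (?P<text>.*)$"
)

# Only the three events named on the slide; other codes are kept by number.
EVENT_NAMES = {"000": "submitted", "001": "executing", "005": "terminated"}


def parse_job_log(text):
    """Return one dict per event header found in a job log."""
    events = []
    for line in text.splitlines():
        match = EVENT_HEADER.match(line)
        if match is None:
            continue  # detail lines and "..." separators are skipped
        events.append({
            "event": EVENT_NAMES.get(match["code"], "code " + match["code"]),
            "job": f"{int(match['cluster'])}.{int(match['proc'])}",
            "when": f"{match['date']} {match['time']}",
            "text": match["text"],
        })
    return events


# Hypothetical sample, written for this sketch only.
SAMPLE_LOG = """\
000 (012.000.000) 03/15 10:02:11 Job submitted from host: <192.0.2.10:40001>
...
001 (012.000.000) 03/15 10:05:42 Job executing on host: gatekeeper.example.org
...
005 (012.000.000) 03/15 10:31:07 Job terminated.
...
"""

events = parse_job_log(SAMPLE_LOG)
```

Because the log is only a file, a consumer built this way has to poll it for new events.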

Page 7: JSS Job Submission Service Massimo Sgaravatto INFN Padova

“Abnormal” events

The submission to Globus fails
  Condor-G tries again after 5 minutes
  This event is reported in the GridManager log file (not in the job log file)
The gatekeeper can’t be contacted (for an already submitted job)
  The job remains in the Condor-G queue, and Condor-G tries again later
The gatekeeper can be contacted, but the job manager can’t be contacted
  Now: job completed with exit status 1
    Exit status 0 for the “normal” jobs
  To be improved when the new persistent job manager is released (see next slides)
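The first case (failed submission, retry after 5 minutes) can be sketched as a simple retry loop. This is an illustration of the behaviour described on the slide, not Condor-G's actual code: the function `submit_with_retry`, the `max_attempts` limit, and the use of `ConnectionError` to signal a failed submission are all assumptions. The `log` callback stands in for the GridManager log file.

```python
import time

RETRY_DELAY_SECONDS = 5 * 60  # slide: "Condor-G tries again after 5 minutes"


def submit_with_retry(submit_once, max_attempts=3, sleep=time.sleep, log=print):
    """Call submit_once() until it succeeds or max_attempts is reached.

    max_attempts is an assumption; the slide gives no upper limit.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return submit_once()
        except ConnectionError as error:
            # Failures go to a separate log, not to the job log.
            log(f"attempt {attempt} failed: {error}")
            if attempt == max_attempts:
                raise
            sleep(RETRY_DELAY_SECONDS)


# Demo with a fake submission that fails twice; the fake sleep only
# records the requested delay, so nothing actually waits.
attempts = {"count": 0}


def flaky_submit():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("gatekeeper unreachable")
    return "job-1"


waits = []
messages = []
result = submit_with_retry(flaky_submit, sleep=waits.append, log=messages.append)
```

Passing `sleep` and `log` as parameters keeps the sketch testable without a five-minute wait.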

Page 8: JSS Job Submission Service Massimo Sgaravatto INFN Padova

Condor-G problems

Failures in submitting jobs to Globus resources, and the reasons for these failures, are reported in the GridManager log file instead of the job log file
The log file doesn’t report when the job “arrives” at the Globus resource (i.e. when the job manager is created)
  It reports only when the job is inserted in the Condor-G queue and when it starts its execution on the Globus resource
API missing
  Not possible to be asynchronously notified about job status transitions (i.e. callbacks)

Page 9: JSS Job Submission Service Massimo Sgaravatto INFN Padova

Issues not addressed by Condor-G

Condor-G is not able to discover if a job “disappears” without any exit status, and the underlying LRMS is not able to manage the problem
  In this case Globus reports a “done” callback
  Do we really have to manage this problem?
Exit status of jobs
  Globus doesn’t report the exit status of jobs
The job status transitions running -> suspended (job transition #5 wrt Cesnet doc) -> running can’t be detected
  Globus doesn’t detect these transitions
Expiration of proxy
  Just a parameter in the Condor-G conf file defining the minimum lifetime of the proxy
Not possible to move files to/from the executing machines other than executable/standard input/output/error

Page 10: JSS Job Submission Service Massimo Sgaravatto INFN Padova

Other issues

Proxy

Page 11: JSS Job Submission Service Massimo Sgaravatto INFN Padova

Future developments

Near future (1 month?)
  Two phase commit submission protocol
  Persistent Globus job manager
    (save_state=yes) when submitting a job
    (recover=ContactStringOfJobManager) to restart a job manager and “reattach” it to a running job
  Condor GridManager able to automatically exploit the new job manager
    Used when Condor-G loses track of a job
Long term
  GRAM-2