JSS Job Submission Service Massimo Sgaravatto INFN Padova


JSS

Wrapper of Condor-G, identified as the JSS for Testbed 1
Condor-G is a Personal Condor enhanced with Globus services
  Used to submit jobs from the user workstation to remote Globus resources
  Condor-G keeps track of the progress of these jobs

Condor-G Architecture

[Figure: Condor-G architecture. Components shown: the commands condor_submit, condor_q and condor_rm; Condor Master; Condor Schedd; Condor GridManager; three Globus resources.]
One GridManager per user

Condor-G commands

condor_submit CondorSubmitFile
  To submit jobs to a Globus resource
condor_q {id}
  To monitor the status of the job(s)
condor_rm id
  To remove the job from the queue
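Since the JSS wraps Condor-G, the three commands above could be driven from a small helper. The following is a minimal sketch, not the JSS implementation: the function names are hypothetical, and only the command names and arguments shown on this slide are used.

```python
import subprocess


def submit_cmd(submit_file):
    """Command line that submits the jobs described in submit_file."""
    return ["condor_submit", submit_file]


def query_cmd(job_id=None):
    """Command line that lists the whole queue, or one job if job_id is given."""
    return ["condor_q"] + ([str(job_id)] if job_id is not None else [])


def remove_cmd(job_id):
    """Command line that removes job_id from the Condor-G queue."""
    return ["condor_rm", str(job_id)]


def run(cmd):
    """Run a command and return its standard output.

    Assumes the Condor-G commands are installed and on the PATH.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Building the argument list separately from running it keeps the helper testable on a machine without Condor-G; for example, `run(submit_cmd("myfile"))` would perform the actual submission.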

Example

condor_submit myfile

myfile:

  Universe = globus
  TransferExecutable = True
  Executable = /home/userx/startsim.sh
  TransferInput = True
  Input = /home/userx/inp.$(Process)
  TransferOutput = False
  Output = /data/out.$(Process)
  TransferError = True
  Error = /home/userx/error.$(Process)
  Environment = CMSVER=118
  Log = /home/userx/log.$(Process)
  Arguments = 123
  GlobusRSL = (queue=cmsprod)
  GlobusScheduler = pcmsfarm01.pi.infn.it/jobmanager-lsf
  Queue 10
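In this submit file, Queue 10 creates ten jobs, and $(Process) is replaced in each job by its index within the submission, starting from 0. The sketch below only illustrates that substitution so the per-job file names can be seen; the helper name is hypothetical and it is not how Condor-G itself expands the macro.

```python
def expand_process(template, count):
    """Return the per-job values of a submit-file entry using $(Process).

    template: a value such as "/home/userx/inp.$(Process)"
    count:    the number given to the Queue statement
    """
    return [template.replace("$(Process)", str(index)) for index in range(count)]


# Input files read by the ten jobs of the example above.
input_files = expand_process("/home/userx/inp.$(Process)", 10)
print(input_files[0])   # /home/userx/inp.0
print(input_files[-1])  # /home/userx/inp.9
```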

Condor-G job log file

Info reported
  When the job has been inserted in the Condor-G queue
    The IP address of the submitting machine (Condor-G machine)
  When the job has started its execution
    The IP name of the gatekeeper machine where the job has been submitted (could be different from the actual executing machine)
  When the job has completed its execution
Condor-G relies on both callbacks and polling to create this log file
Library already available to “parse” this job log file
  Not tested yet

“Abnormal” events

The submission to Globus fails
  Condor-G tries again after 5 minutes
  This event is reported in the GridManager log file (not in the job log file)
The gatekeeper can’t be contacted (for an already submitted job)
  The job remains in the Condor-G queue, and Condor-G tries again later
The gatekeeper can be contacted, but the job manager can’t be contacted
  Now: the job is reported as completed with exit status 1
    Exit status 0 for the “normal” jobs
  To be improved once the new persistent job manager is released (see next slides)
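The retry behaviour described for a failed submission can be modelled as follows. This is a minimal sketch under stated assumptions, not Condor-G code: the submit callable, the max_attempts limit and the failures list (standing in for the GridManager log file) are all hypothetical; only the 5-minute delay comes from the slide.

```python
import time

RETRY_DELAY_SECONDS = 5 * 60  # "tries again after 5 minutes" (from the slide)


def submit_with_retry(submit, max_attempts=3, delay=RETRY_DELAY_SECONDS,
                      sleep=time.sleep):
    """Call submit() until it succeeds, waiting `delay` seconds between tries.

    Returns (job_id, failures). job_id is None if every attempt failed.
    failures collects one message per failed attempt, the way the
    GridManager log file (not the job log file) records them.
    """
    failures = []
    for attempt in range(1, max_attempts + 1):
        try:
            return submit(), failures
        except RuntimeError as exc:
            failures.append(f"attempt {attempt}: {exc}")
            if attempt < max_attempts:
                sleep(delay)
    return None, failures
```

Passing sleep as a parameter lets the policy be exercised without actually waiting five minutes between attempts.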

Condor-G problems

Failures in submitting jobs to Globus resources, and the reasons for these failures, are reported in the GridManager log file instead of the job log file
The log file doesn’t report when the job “arrives” at the Globus resource (i.e. when the job manager is created)
  It reports only when the job is inserted in the Condor-G queue and when it starts its execution on the Globus resource
API missing
  Not possible to be asynchronously notified about job status transitions (i.e. callbacks)
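Without a callback API, a client that wants to react to status transitions has to poll. The sketch below shows one way this could be emulated; it is an assumption, not something the slides propose. The get_status callable, the state names and the polling interval are hypothetical.

```python
import time


def watch_job(get_status, on_transition, final_states=("completed",),
              interval=30, sleep=time.sleep):
    """Poll get_status() and call on_transition(old, new) on every change.

    Returns the final state once one of final_states is observed.
    Transitions that happen and revert between two polls are missed,
    which is the usual limitation of polling compared with callbacks.
    """
    previous = None
    while True:
        current = get_status()
        if current != previous:
            on_transition(previous, current)
            previous = current
        if current in final_states:
            return current
        sleep(interval)
```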

Issues not addressed by Condor-G

Condor-G is not able to discover if a job “disappears” without any exit status and the underlying LRMS is not able to manage the problem
  In this case Globus reports a “done” callback
  Do we really have to manage this problem?
Exit status of jobs
  Globus doesn’t report the exit status of jobs
The job status transitions running -> suspended (job transition #5 wrt Cesnet doc) -> running can’t be detected
  Globus doesn’t detect these transitions
Expiration of proxy
  Just a parameter in the Condor-G conf file defining the minimum lifetime of the proxy
Not possible to move from/to the executing machines other files besides executable/standard input/output/error
Other issues
  Proxy

Future developments

Near future (1 month?)
  Two-phase commit submission protocol
  Persistent Globus job manager
    (save_state=yes) when submitting a job
    (recover=ContactStringOfJobManager) to restart a job manager and “reattach” it to a running job
  Condor GridManager able to automatically exploit the new job manager
    Used when Condor-G loses track of a job
Long term
  GRAM-2
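The two attributes named for the persistent job manager can be written in the same parenthesised form as the GlobusRSL entry of the earlier example. The sketch below only assembles such strings; the helper name is hypothetical, values are not quoted or escaped, and ContactStringOfJobManager is the placeholder used on the slide, not a real contact string.

```python
def rsl_relations(**attributes):
    """Join name=value pairs into a string of parenthesised relations.

    Values are inserted as given; quoting of special characters is
    deliberately not handled in this sketch.
    """
    return "".join(f"({name}={value})" for name, value in attributes.items())


# At submission: ask the job manager to save its state.
submit_rsl = rsl_relations(queue="cmsprod", save_state="yes")

# Later: restart a job manager and reattach it to the running job.
recover_rsl = rsl_relations(recover="ContactStringOfJobManager")

print(submit_rsl)   # (queue=cmsprod)(save_state=yes)
print(recover_rsl)  # (recover=ContactStringOfJobManager)
```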

Documents
