JSS Job Submission Service
Massimo Sgaravatto, INFN Padova
JSS
  Wrapper of Condor-G identified as JSS for Testbed 1
  Condor-G is a Personal Condor enhanced with Globus services
    Used to submit jobs from the user workstation to remote Globus resources
    Condor-G keeps track of the progress of these jobs
Condor-G Architecture
  [Diagram. Components shown: Condor Master, Condor Schedd, Condor GridManager, three Globus resources. User commands shown: condor_submit, condor_q, condor_rm.]
  One GridManager per user
Condor-G commands
  condor_submit CondorSubmitFile
    To submit jobs to a Globus resource
  condor_q {id}
    To monitor the status of the job(s)
  condor_rm id
    To remove the job from the queue
Example
  condor_submit myfile
  myfile:
    Universe = globus
    TransferExecutable = True
    Executable = /home/userx/startsim.sh
    TransferInput = True
    Input = /home/userx/inp.$(Process)
    TransferOutput = False
    Output = /data/out.$(Process)
    TransferError = True
    Error = /home/userx/error.$(Process)
    Environment = CMSVER=118
    Log = /home/userx/log.$(Process)
    Arguments = 123
    GlobusRSL = (queue=cmsprod)
    GlobusScheduler = pcmsfarm01.pi.infn.it/jobmanager-lsf
    Queue 10
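The submit file is a list of "Key = value" lines closed by a Queue statement, and Queue 10 creates ten jobs whose $(Process) macro takes the values 0 to 9. A minimal Python sketch of that structure follows; the helper names render_submit_file and expand_process are hypothetical and are not part of Condor-G, and only a subset of the settings from the example is reproduced.

```python
def render_submit_file(settings, queue_count):
    """Render 'Key = value' lines followed by a Queue statement."""
    lines = [f"{key} = {value}" for key, value in settings.items()]
    lines.append(f"Queue {queue_count}")
    return "\n".join(lines) + "\n"


def expand_process(template, queue_count):
    """Substitute $(Process) with 0 .. queue_count-1, one value per queued job."""
    return [template.replace("$(Process)", str(n)) for n in range(queue_count)]


# A subset of the values from the example submit file (illustration only).
SETTINGS = {
    "Universe": "globus",
    "Executable": "/home/userx/startsim.sh",
    "Input": "/home/userx/inp.$(Process)",
    "Output": "/data/out.$(Process)",
    "GlobusScheduler": "pcmsfarm01.pi.infn.it/jobmanager-lsf",
}

if __name__ == "__main__":
    print(render_submit_file(SETTINGS, 10))
    # Each of the ten jobs reads its own input file: inp.0 ... inp.9
    print(expand_process(SETTINGS["Input"], 10))
```

The text produced by render_submit_file could be written to a file and passed to condor_submit as in the example.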
Condor-G job log file
  Info reported
    When the job has been inserted in the Condor-G queue
    The IP address of the submitting machine (Condor-G machine)
    When the job has started its execution
    The host name of the gatekeeper machine where the job has been submitted (could be different from the actual executing machine)
    When the job has completed its execution
  Condor-G relies on both callbacks and polling to create this log file
  Library already available to “parse” this job log file
    Not tested yet
“Abnormal” events
  The submission to Globus fails
    Condor-G tries again after 5 minutes
    This event is reported in the GridManager log file (not in the job log file)
  The gatekeeper can’t be contacted (for an already submitted job)
    The job remains in the Condor-G queue, and Condor-G tries again later
  The gatekeeper can be contacted, but the job manager can’t be contacted
    Now: job completed with exit status 1
      Exit status 0 for the “normal” jobs
    To be enhanced when the new persistent job manager is released (see next slides)
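The retry behaviour for a failed submission can be pictured as a loop with a fixed delay. The following is a minimal sketch, not the GridManager's actual code: the function submit_with_retry, its parameters, and the use of ConnectionError to signal a failed submission are all assumptions made for illustration. Only the 5 minute delay comes from the slide.

```python
import time

RETRY_DELAY_SECONDS = 5 * 60  # "tries again after 5 minutes"


def submit_with_retry(submit, max_attempts, delay=RETRY_DELAY_SECONDS, sleep=time.sleep):
    """Call submit() until it succeeds or max_attempts is reached.

    submit is a hypothetical callable that raises ConnectionError when the
    submission to the Globus resource fails. sleep is injectable for testing.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts:
                sleep(delay)  # wait before the next attempt
    raise last_error
```

A real implementation would also have to record each failure somewhere; as noted above, Condor-G writes it to the GridManager log file rather than the job log file.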
Condor-G problems
  Failures in submitting jobs to Globus resources, and the reasons for these failures, are reported in the GridManager log file instead of the job log file
  The log file doesn’t report when the job “arrives” at the Globus resource (i.e. when the job manager is created)
    It reports only when the job is inserted in the Condor-G queue and when it starts its execution on the Globus resource
  API missing
  Not possible to be asynchronously notified about job status transitions (i.e. callbacks)
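Without asynchronous notification, a client that needs to react to status transitions has to poll. A minimal sketch of such a polling loop follows; watch_job, get_status and on_transition are hypothetical names, and get_status stands for whatever the client uses to read the job status (for example a wrapper around condor_q), which is not specified here.

```python
import time


def watch_job(get_status, on_transition, polls, interval=30, sleep=time.sleep):
    """Poll get_status() and call on_transition(old, new) on every change.

    get_status and on_transition are caller-supplied callables (hypothetical).
    Returns the last status seen after the given number of polls.
    """
    previous = None
    for i in range(polls):
        current = get_status()
        if current != previous:
            on_transition(previous, current)
            previous = current
        if i < polls - 1:
            sleep(interval)  # wait before polling again
    return previous
```

Polling can miss a transition that starts and ends between two polls, which is one reason a callback API would be preferable.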
Issues not addressed by Condor-G
  Condor-G is not able to discover if a job “disappears” without any exit status, and the underlying LRMS is not able to manage the problem
    In this case Globus reports a “done” callback
    Do we really have to manage this problem?
  Exit status of jobs
    Globus doesn’t report the exit status of jobs
  The job status transitions running -> suspended (job transition #5 wrt Cesnet doc) -> running can’t be detected
    Globus doesn’t detect these transitions
  Expiration of proxy
    Just a parameter in the Condor-G configuration file defining the minimum lifetime of the proxy
  Not possible to move files other than executable/standard input/output/error to or from the executing machines
  Other issues
    Proxy
Future developments
  Near future (1 month?)
    Two-phase commit submission protocol
    Persistent Globus job manager
      (save_state=yes) when submitting a job
      (recover=ContactStringOfJobManager) to restart a job manager and “reattach” it to a running job
    Condor GridManager able to automatically exploit the new job manager
      Used when Condor-G loses track of a job
  Long term
    GRAM-2
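The idea behind a two-phase commit submission can be illustrated with a toy in-memory model: the job is first registered without being started, and it starts only once the submitter confirms. Repeating either step is harmless, so a lost reply does not produce a duplicate job. This is a sketch of the general technique only; the class ToyJobManager and its methods are hypothetical and do not reproduce the Globus protocol.

```python
class ToyJobManager:
    """Toy model of a two-phase submission (illustration, not the Globus API)."""

    def __init__(self):
        self.jobs = {}

    def prepare(self, request_id, description):
        # Phase 1: register the job but do not start it.
        # A repeated prepare with the same request_id is a no-op.
        if request_id not in self.jobs:
            self.jobs[request_id] = {"description": description, "state": "pending"}
        return request_id

    def commit(self, request_id):
        # Phase 2: the submitter confirms, and only now does the job start.
        # A repeated commit leaves an already started job untouched.
        job = self.jobs[request_id]
        if job["state"] == "pending":
            job["state"] = "running"
        return job["state"]
```

If the submitter never receives the answer to prepare, it can send the same request again and then commit, and exactly one job runs.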