Basic Grid Projects – Condor Part II

  • Basic Grid Projects – Condor Part II
      Sathish Vadhiyar
      Sources/Credits: Condor Project web pages

  • Checkpointing
      - Checkpointing is used to vacate a job from one workstation and resume it on another idle workstation.
      - A Condor checkpoint library is linked with the program's code.
      - The checkpoint library installs a signal handler for the SIGTSTP signal.
      - Checkpoints are stored either on the local disk of the submitting machine or on checkpoint servers.
      - A checkpoint stores the Unix process state, including the text, stack, and data segments, files, pointers, etc.
      - Condor also provides periodic checkpointing.

  • Checkpointing Overview
      - When the startd daemon detects a policy violation, it sends a signal to the process.
      - The signal handler in the process is invoked, and the process state is checkpointed.
      - The checkpoint is sent to the shadow process, which stores it.
      - When a new machine is chosen, the executable and the checkpoint are sent to the remote machine.
      - When the job starts on the remote machine, it detects that it is a restart and reads the checkpoint; some manipulations are done so that the process state at the time of the checkpoint is restored.
      - To the user code, it appears that the process has just returned from the signal handler.

  • Checkpointing Details (refer to the PostScript file)
      - Preserving and restoring the text area (same executable), the data area (using sbrk(0)), and the stack.
      - Preserving the stack state consists of storing and restoring two parts: stack context and stack space.
      - Stack context is stored by setjmp and restored by longjmp.
      - Stack space replacement is tricky; it is performed by using a secure data region for the stack.
      - Open files: state is saved by augmenting open calls; lseek is performed during checkpointing to obtain offset information.
      - Signals: sigaction, sigpending.
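
The open-file handling described above can be illustrated with a minimal Python sketch. It is not Condor's checkpoint library (which works at the C level inside the process); the helper names `record_open_file` and `reopen_file` and the file `input.dat` are hypothetical. The sketch only shows the idea of remembering how a file was opened, querying its offset with `lseek` at checkpoint time, and seeking back to that offset at restart.

```python
import os
import tempfile


def record_open_file(path, flags, fd):
    """Capture what is needed to reopen the file at the same position."""
    offset = os.lseek(fd, 0, os.SEEK_CUR)  # report the offset without moving it
    return {"path": path, "flags": flags, "offset": offset}


def reopen_file(entry):
    """Reopen the file and seek back to the recorded offset."""
    fd = os.open(entry["path"], entry["flags"])
    os.lseek(fd, entry["offset"], os.SEEK_SET)
    return fd


def demo():
    path = os.path.join(tempfile.mkdtemp(), "input.dat")  # hypothetical file
    with open(path, "wb") as f:
        f.write(b"0123456789")

    fd = os.open(path, os.O_RDONLY)
    os.read(fd, 4)                                   # the job consumes "0123"
    entry = record_open_file(path, os.O_RDONLY, fd)  # at checkpoint time
    os.close(fd)                                     # the job is vacated

    fd = reopen_file(entry)                          # at restart
    data = os.read(fd, 3)                            # reading continues at "456"
    os.close(fd)
    return entry["offset"], data


print(demo())  # (4, b'456')
```

A real implementation would also have to handle flags such as O_TRUNC or O_CREAT, which must not be replayed blindly when the file is reopened.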

  • Checkpoint Summary
      - The checkpoint library installs a signal handler called checkpoint(), then calls main().
      - At the time of the checkpoint, the SIGTSTP signal is sent and checkpoint() is invoked.
      - checkpoint():
          - Writes open files, signals, and stack context to the data area.
          - Stores the data and stack segments.

  • Restart Summary
      - restore():
          - Overwrites the data segment with the one in the checkpoint.
          - Restores file and signal information.
          - Switches to a temporary location in the data segment and replaces its stack space.
          - Performs longjmp() pointing to the checkpoint() signal handler.
      - The checkpoint routine returns and restores the CPU registers.
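
The checkpoint and restart cycle summarized above can be mimicked at the application level with a short Python sketch. This is an analogy under stated assumptions, not Condor's mechanism: Condor saves the data and stack segments of the process, whereas this sketch pickles a small state dictionary. The handler here only records the request and the state is saved at the next loop iteration. The function names, the checkpoint file name, and the self-sent signal that stands in for the startd are all hypothetical. It assumes a POSIX system and must run in the main thread.

```python
import os
import pickle
import signal
import tempfile

checkpoint_requested = False


def on_sigtstp(signum, frame):
    """Signal handler: note that a checkpoint was requested."""
    global checkpoint_requested
    checkpoint_requested = True


def save_checkpoint(state, path):
    """Write the state to a temporary file, then rename it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if this is a fresh start."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(ckpt_path, limit, vacate_at=None):
    """Sum 0..limit-1. Returns the sum, or None if the job was vacated."""
    global checkpoint_requested
    checkpoint_requested = False
    signal.signal(signal.SIGTSTP, on_sigtstp)
    # On restart, the job detects the checkpoint and resumes from it.
    state = load_checkpoint(ckpt_path) or {"i": 0, "total": 0}
    while state["i"] < limit:
        if vacate_at is not None and state["i"] == vacate_at:
            # Stand-in for the startd signalling the process.
            os.kill(os.getpid(), signal.SIGTSTP)
        if checkpoint_requested:
            save_checkpoint(state, ckpt_path)
            return None
        state["total"] += state["i"]
        state["i"] += 1
    return state["total"]


ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")  # hypothetical location
first = run_job(ckpt, 10, vacate_at=4)   # vacated part-way through
second = run_job(ckpt, 10)               # restarted from the checkpoint
print(first, second)                     # None 45
```

The second call does not repeat the work done before the signal: it reads the checkpoint and continues the loop, which is the behaviour the restart summary describes for the real process image.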

  • Limitations
      - Cannot checkpoint programs that use fork()/exec() or multiple processes.
      - Can checkpoint only on homogeneous systems.
      - Cannot checkpoint communicating multi-process jobs.

  • Condor Universes
      - The universe is specified during job submission.
      - Types:
          - Standard: system calls are transferred to the submit machine; provides checkpointing and migration; the program is relinked with condor_compile.
          - Vanilla: for programs that cannot be relinked; does not provide checkpointing and migration (why?); for access to files, use the Condor File Transfer mechanism.
          - Scheduler: for a job that should act as a metascheduler.
          - MPI, PVM, Java, Globus.

  • Condor Commands
      - condor_compile: relinks source or object files with the Condor libraries. The Condor library provides checkpointing, migration, and remote system calls.
      - condor_submit: takes a submit description file as input and produces a job ClassAd for further processing by the central manager.
      - condor_status: shows information about the machines in the Condor pool.
      - condor_q: shows job status.

  • DAGMan
      - Meta-scheduler for Condor.
      - Manages dependencies between jobs at a higher level.
      - Sits on top of Condor.
      - Used when the input of one program depends on another program.
      - Invoked as: condor_submit_dag DAGInputFileName
      - A DAG within a DAG is supported.

  • Example input file for DAGMan

    # Filename: diamond.dag
    #
    Job A A.condor
    Job B B.condor
    Job C C.condor
    Job D D.condor


    Retry C 3
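
To make the role of such an input file concrete, here is a small Python sketch that reads a DAG description and derives an order in which the jobs could be submitted. It is not DAGMan's implementation. The two `PARENT ... CHILD` lines are an assumption: they use DAGMan's dependency syntax, but they do not appear in the listing above and are only suggested by the file name diamond.dag. The function `parse_dag` is hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

DAG_TEXT = """\
# Filename: diamond.dag
#
Job A A.condor
Job B B.condor
Job C C.condor
Job D D.condor
# Assumed dependency lines (not in the listing above):
PARENT A CHILD B C
PARENT B C CHILD D
Retry C 3
"""


def parse_dag(text):
    jobs = {}      # job name -> submit description file
    parents = {}   # job name -> set of jobs that must finish first
    retries = {}   # job name -> maximum number of retries
    for raw in text.splitlines():
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue
        keyword = words[0].upper()
        if keyword == "JOB":
            jobs[words[1]] = words[2]
            parents.setdefault(words[1], set())
        elif keyword == "PARENT":
            split = [w.upper() for w in words].index("CHILD")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
        elif keyword == "RETRY":
            retries[words[1]] = int(words[2])
    return jobs, parents, retries


jobs, parents, retries = parse_dag(DAG_TEXT)
order = list(TopologicalSorter(parents).static_order())
print(order[0], order[-1], retries)  # A D {'C': 3}
```

Under the assumed dependencies, A must run first and D last, while B and C may run in either order or concurrently; `Retry C 3` allows job C to be retried up to three times if it fails.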

  • Condor File System and File Transfer Mechanism
      - Applicable only to vanilla jobs.
      - By default, a shared file system is assumed between the submitting machine and the executing machine.
      - The relevant machine ClassAd attributes are FileSystemDomain and UidDomain.
      - To bypass the default, say something like:
            Requirements = UidDomain == "cs.wisc.edu" && \
                           FileSystemDomain == "cs.wisc.edu"

  • Condor File System and File Transfer Mechanism
      - If the machines do not share a file system, or the file systems are not explicitly specified, enable the Condor File Transfer Mechanism:
            should_transfer_files = YES
            when_to_transfer_output = ON_EXIT
      - Any files that are generated or modified in the remote working directory are transferred back to the submit machine.

  • References / Sources / Credits
      - Condor manual.
      - Condor web pages.
      - Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny, "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System", University of Wisconsin-Madison Computer Sciences Technical Report #1346, April 1997.
      - James Frey, Todd Tannenbaum, Ian Foster, Miron Livny, and Steven Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids", Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10), San Francisco, California, August 7-9, 2001.
      - Rajesh Raman, Miron Livny, and Marvin Solomon, "Matchmaking: Distributed Resource Management for High Throughput Computing", Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, July 28-31, 1998, Chicago, IL.
      - Michael Litzkow, Miron Livny, and Matt Mutka, "Condor - A Hunter of Idle Workstations", Proceedings of the 8th International Conference of Distributed Computing Systems, pages 104-111, June 1988.

  • Submit description files
      - Direct the queuing of jobs.
      - Contain:
          - Executable location
          - Command-line arguments to the job
          - stdin, stderr, stdout
          - Initial working directory
          - should_transfer_files = . NO disables the Condor file transfer mechanism.
          - when_to_transfer_output = < ON_EXIT | ON_EXIT_OR_EVICT >
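
As an illustration of what such a file might contain, the following Python sketch renders a submit description from a dictionary. The command names (universe, executable, arguments, input, output, error, initialdir, should_transfer_files, when_to_transfer_output, queue) are standard condor_submit commands; the helper `render_submit` and all values (program name, file names, directory) are hypothetical placeholders.

```python
def render_submit(settings, queue_count=1):
    """Render 'command = value' lines followed by a queue statement."""
    lines = [f"{key} = {value}" for key, value in settings.items()]
    lines.append("queue" if queue_count == 1 else f"queue {queue_count}")
    return "\n".join(lines) + "\n"


# All values below are placeholders chosen for illustration.
job = {
    "universe": "vanilla",
    "executable": "analyze",
    "arguments": "-n 100",
    "input": "analyze.in",       # stdin
    "output": "analyze.out",     # stdout
    "error": "analyze.err",      # stderr
    "initialdir": "run_1",       # initial working directory
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",
}

print(render_submit(job))
```

The resulting text could be saved to a file and passed to condor_submit; the final `queue` line is what actually places the job in the queue.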

  • Submit description file
      - requirements =
      - By default, Arch, OpSys, Disk, VirtualMemory, and FileSystemDomain (for vanilla) are set.
      - requirements = + =

  • Machine ClassAd Attributes
      - Activity, Arch, CondorLoadAvg, ConsoleIdle, Disk, Cpus, KeyboardIdle, LoadAvg, KFlops, Mips, Memory, OpSys, FileSystemDomain, Requirements, StartdIpAddr
      - ClientMachine, CurrentRank, RemoteOwner, LastPeriodicCheckpoint

  • Job ClassAd Attributes
      - CompletionDate, RemoteIwd

  • Heterogeneous job submission
      - Works well with the vanilla universe, since no checkpoint is taken.
      - For the standard universe:
            # Added by Condor
            CkptRequirements = ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && \
                               ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))
            Requirements = () && $(CkptRequirements)
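
The effect of the CkptRequirements expression can be traced with a small Python sketch. It is a simplified stand-in for ClassAd evaluation, not Condor's evaluator: UNDEFINED is modelled as `None`, `==` yields UNDEFINED when either operand is UNDEFINED, `=?=` is modelled as an identity test against UNDEFINED, and string comparison is exact. The helper names and the sample architecture and operating system values are assumptions for illustration.

```python
UNDEFINED = None


def eq(a, b):
    """ClassAd-style ==: UNDEFINED if either operand is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b


def or_(a, b):
    """Three-valued OR: TRUE wins, otherwise UNDEFINED propagates."""
    if a is True or b is True:
        return True
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return False


def and_(a, b):
    """Three-valued AND: FALSE wins, otherwise UNDEFINED propagates."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True


def ckpt_requirements(job, machine):
    """Evaluate the CkptRequirements expression for one job/machine pair."""
    ckpt_arch = job.get("CkptArch", UNDEFINED)
    ckpt_opsys = job.get("CkptOpSys", UNDEFINED)
    arch_ok = or_(eq(ckpt_arch, machine["Arch"]), ckpt_arch is UNDEFINED)
    opsys_ok = or_(eq(ckpt_opsys, machine["OpSys"]), ckpt_opsys is UNDEFINED)
    return and_(arch_ok, opsys_ok)


# Sample values are placeholders.
linux_box = {"Arch": "INTEL", "OpSys": "LINUX"}
sun_box = {"Arch": "SUN4u", "OpSys": "SOLARIS28"}

fresh_job = {}                                             # no checkpoint yet
resumed_job = {"CkptArch": "INTEL", "CkptOpSys": "LINUX"}  # checkpointed once

print(ckpt_requirements(fresh_job, sun_box))      # True
print(ckpt_requirements(resumed_job, linux_box))  # True
print(ckpt_requirements(resumed_job, sun_box))    # False
```

Before the first checkpoint the job can match a machine of any architecture; once a checkpoint exists, only machines with the same Arch and OpSys as the checkpoint satisfy the expression.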

  • Submission steps
      - Job preparation
      - Choosing a universe
      - Submit description file
      - condor_submit

  • Job Migration
      - SIGTSTP and a signal handler in the standard universe.
      - SIGTERM in the vanilla universe.

  • Condor Security
      - The schedd starts the shadow with the effective UID of the job owner.
      - Different authentication methods such as Kerberos and GSI, different encryption mechanisms, and authorization are supported between clients and daemons.
      - Sockets and ports: the Condor collector and negotiator start on well-known ports; other daemons start on ephemeral ports.

  • Checkpointing
      - The CkptArch, CkptOpSys, LastCkptServer, LastCkptTime, and NumCkpts ClassAd attributes are generated automatically for the job.