
Basic Grid Projects – Condor Part II Sathish Vadhiyar Sources/Credits: Condor Project web pages


Checkpointing

- Checkpointing is used when a job is vacated from one workstation so that it can resume on another idle workstation.
- A Condor checkpoint library is linked with the program's code.
- The checkpoint library installs a signal handler for the SIGTSTP signal.
- Checkpoints are stored either on the local disk of the submitting machine or on checkpoint servers.
- A checkpoint stores the UNIX process's state, including text, stack, and data segments, files, pointers, etc.
- Condor also provides periodic checkpointing.

Checkpointing Overview

- When the startd daemon detects a policy violation, it sends a signal to the process.
- The signal handler in the process is invoked, and the process state is checkpointed.
- The checkpoint is sent to the shadow process, which stores it.
- When a new machine is chosen, the executable and the checkpoint are sent to the remote machine.
- When the job is started on the remote machine, it detects that it is a restart and reads the checkpoint; some manipulations are done so that the process state at the time of the checkpoint is restored.
- It appears to the user code that the process has just returned from the signal handler.

Checkpointing Details (refer to postscript file)

- Preserving and restoring the text area (same executable), the data area (using sbrk(0)), and the stack
- Preserving stack state consists of storing and restoring two parts: stack context and stack space
- Stack context is stored by setjmp and restored by longjmp
- Stack space replacement is tricky; it is performed by using a secure data region for the stack
- Open files
  - State is saved by augmenting open calls
  - lseek is performed during checkpointing to obtain offset information
- Signals: sigaction, sigpending
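The open-file bookkeeping described above can be illustrated with a small Python sketch. It is not Condor's checkpoint library: the `open_files` table and the helper names are hypothetical, and a real restart would also have to give each file its original descriptor number. It only shows the idea of wrapping open calls to record how each file was opened, and of using lseek at checkpoint time to capture the current offset.

```python
import os
import tempfile

# Hypothetical table kept by a wrapper around open(): fd -> (path, flags).
open_files = {}


def tracked_open(path, flags):
    """Open a file and remember how it was opened, so it can be reopened later."""
    fd = os.open(path, flags)
    open_files[fd] = (path, flags)
    return fd


def save_file_state():
    """At checkpoint time, record each file's current offset using lseek."""
    return [(path, flags, os.lseek(fd, 0, os.SEEK_CUR))
            for fd, (path, flags) in open_files.items()]


def restore_file_state(saved):
    """At restart, reopen each file and seek back to the recorded offset."""
    restored = []
    for path, flags, offset in saved:
        fd = os.open(path, flags)
        os.lseek(fd, offset, os.SEEK_SET)
        restored.append(fd)
    return restored


# Usage: read 5 bytes, checkpoint, "restart", and continue from the same place.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world")
fd = tracked_open(tmp.name, os.O_RDONLY)
first = os.read(fd, 5)
saved = save_file_state()
os.close(fd)                       # the original process goes away
new_fd = restore_file_state(saved)[0]
rest = os.read(new_fd, 6)          # continues right after the first 5 bytes
os.close(new_fd)
os.unlink(tmp.name)
```

Calling `os.lseek(fd, 0, os.SEEK_CUR)` moves nothing and simply returns the current offset, which is why it is a convenient way to query file position at checkpoint time.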

Checkpoint summary

- The checkpoint library installs a signal handler called checkpoint()
- It then calls main()
- At the time of a checkpoint, the SIGTSTP signal is sent and checkpoint() is invoked
- checkpoint()
  - Writes open files, signals, and stack context to the data area
  - Stores the data and stack segments
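The control flow in this summary (install a handler, run main(), checkpoint when the signal arrives) can be sketched in Python. This is a toy under stated assumptions, not the Condor library: it saves one hypothetical `state` dictionary with pickle instead of the process's data and stack segments, the checkpoint path is made up, and the signal is raised by the program itself rather than by the startd. It requires a Unix-like system and Python 3.8 or later.

```python
import os
import pickle
import signal
import tempfile

# Hypothetical application state; a real checkpoint captures the process image.
state = {"iteration": 0, "partial_sum": 0}
CKPT_PATH = os.path.join(tempfile.gettempdir(), f"toy_ckpt_{os.getpid()}.pkl")


def checkpoint(signum, frame):
    """Signal handler: write the current state to the checkpoint file."""
    with open(CKPT_PATH, "wb") as f:
        pickle.dump(state, f)


def restore():
    """Return the state saved in the checkpoint file, or None if there is none."""
    if not os.path.exists(CKPT_PATH):
        return None
    with open(CKPT_PATH, "rb") as f:
        return pickle.load(f)


def main(steps):
    """Stand-in for the user's computation."""
    for _ in range(steps):
        state["iteration"] += 1
        state["partial_sum"] += state["iteration"]


# Install the handler first, then run the user's main(), mirroring the summary.
signal.signal(signal.SIGTSTP, checkpoint)
main(3)
signal.raise_signal(signal.SIGTSTP)  # stands in for the signal sent at vacate time
main(2)                              # work done after the checkpoint is not saved
saved = restore()
os.remove(CKPT_PATH)
```

Because a handler is installed for SIGTSTP, the process is not stopped when the signal arrives; the handler runs and the program continues, which is the behavior the checkpoint library relies on.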

Restart Summary

- restore()
  - Overwrites the data segment with the one in the checkpoint
  - Restores file and signal information
  - Switches to a temporary location in the data segment and replaces its stack space
  - Performs longjmp() back into the checkpoint() signal handler
- The checkpoint routine returns and restores the CPU registers

Limitations

- Cannot checkpoint programs that use fork()/exec() or multiple processes
- Can checkpoint only on homogeneous systems
- Cannot checkpoint communicating multi-process jobs

Condor Universes

- The universe is specified during job submission
- Types:
  - Standard
    - System calls are transferred to the submit machine
    - Provides checkpointing and migration
    - Relink the program with condor_compile
  - Vanilla
    - For programs that cannot be relinked
    - Does not provide checkpointing and migration. Why?
    - For file access, use the Condor File Transfer mechanism
  - Scheduler
    - For a job that should act as a metascheduler
  - MPI, PVM, Java, Globus

Condor Commands

- condor_compile
  - Relinks source or object files with the Condor libraries
  - The Condor library provides checkpointing, migration, and remote system calls
- condor_submit: takes a submit description file as input and produces a job ClassAd for further processing by the central manager
- condor_status: shows information about the machines in the Condor pool
- condor_q: shows job status

DAGMan

- Metascheduler for Condor
- Manages dependencies between jobs at a higher level
- Sits on top of Condor
- Used when the input of one program depends on the output of another
- condor_submit_dag DAGInputFileName
- A DAG within a DAG is supported

Example input file for DAGMan

    # Filename: diamond.dag
    #
    Job A A.condor
    Job B B.condor
    Job C C.condor
    Job D D.condor
    PARENT A CHILD B C
    PARENT B C CHILD D
    Retry C 3
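To make the ordering implied by diamond.dag concrete, here is a minimal Python sketch that runs the four jobs in dependency order and retries job C. It is only an illustration of the dependency semantics, not how DAGMan is implemented: the `run_job` callback and the flaky runner are hypothetical, and jobs run one at a time, whereas B and C could run concurrently once A finishes. It uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Dependencies from diamond.dag, written as child -> set of parents.
parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
retries = {"C": 3}  # "Retry C 3": job C may be retried up to 3 times


def run_dag(parents, retries, run_job):
    """Run jobs in dependency order; run_job(name) returns True on success."""
    sorter = TopologicalSorter(parents)
    sorter.prepare()
    completed = []
    while sorter.is_active():
        for job in sorted(sorter.get_ready()):  # jobs whose parents are all done
            attempts = 1 + retries.get(job, 0)
            if not any(run_job(job) for _ in range(attempts)):
                raise RuntimeError(f"job {job} failed after {attempts} attempts")
            completed.append(job)
            sorter.done(job)
    return completed


# Usage with a hypothetical runner in which job C fails on its first attempt.
calls = []


def flaky_runner(job):
    calls.append(job)
    return not (job == "C" and calls.count("C") == 1)


order = run_dag(parents, retries, flaky_runner)
```

Job D becomes ready only after both B and C have been marked done, which is exactly what the line `PARENT B C CHILD D` expresses.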

Condor File System and File Transfer Mechanism

- Applicable only to vanilla jobs
- By default, a shared file system is assumed between the submitting machine and the executing machine
- Machine ClassAd attributes: FileSystemDomain and UidDomain
- To bypass the default, specify something like:

    Requirements = UidDomain == "cs.wisc.edu" && \
                   FileSystemDomain == "cs.wisc.edu"

Condor File System and File Transfer Mechanism

- If the machines do not share a file system, or the file system domains are not explicitly specified, enable the Condor File Transfer Mechanism:

    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT

- Any files that are generated or modified in the remote working directory are transferred back to the submit machine
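The rule "generated or modified files are transferred back" can be pictured as comparing the working directory before and after the job runs. The Python sketch below is an assumption about how such a comparison could look, not Condor's actual implementation; the helper names and the file names are hypothetical.

```python
import os
import tempfile


def snapshot(directory):
    """Map each file name in directory to its (mtime_ns, size)."""
    result = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            st = entry.stat()
            result[entry.name] = (st.st_mtime_ns, st.st_size)
    return result


def files_to_transfer_back(before, after):
    """Names of files that are new or whose (mtime, size) changed."""
    return sorted(name for name, meta in after.items() if before.get(name) != meta)


# Usage: a hypothetical job appends to one input file and creates one output file.
with tempfile.TemporaryDirectory() as workdir:
    for name in ("input.dat", "config.txt"):
        with open(os.path.join(workdir, name), "w") as f:
            f.write("initial\n")
    before = snapshot(workdir)
    with open(os.path.join(workdir, "input.dat"), "a") as f:
        f.write("more\n")                  # modified by the job
    with open(os.path.join(workdir, "result.out"), "w") as f:
        f.write("42\n")                    # generated by the job
    changed = files_to_transfer_back(before, snapshot(workdir))
```

Here `config.txt` is untouched, so only `input.dat` and `result.out` would be candidates for transfer back to the submit machine.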

References / Sources / Credits

- Condor manual
- Condor web pages
- Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny, "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System", University of Wisconsin-Madison Computer Sciences Technical Report #1346, April 1997.
- James Frey, Todd Tannenbaum, Ian Foster, Miron Livny, and Steven Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids", Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10), San Francisco, California, August 7-9, 2001.
- Rajesh Raman, Miron Livny, and Marvin Solomon, "Matchmaking: Distributed Resource Management for High Throughput Computing", Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, July 28-31, 1998, Chicago, IL.
- Michael Litzkow, Miron Livny, and Matt Mutka, "Condor - A Hunter of Idle Workstations", Proceedings of the 8th International Conference of Distributed Computing Systems, pages 104-111, June, 1988.

Submit description files

- Direct the queuing of jobs
- Contain
  - Executable location
  - Command-line arguments to the job
  - stdin, stderr, stdout
  - Initial working directory
  - should_transfer_files = <YES | NO | IF_NEEDED>. NO disables the Condor file transfer mechanism
  - when_to_transfer_output = <ON_EXIT | ON_EXIT_OR_EVICT>
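As an illustration of how these entries fit together, the following Python sketch renders a minimal submit description file as text. The helper `render_submit_file` and the file names (`job.in`, `job.out`, `job.err`) are hypothetical; the submit commands themselves (`universe`, `executable`, `arguments`, `input`, `output`, `error`, `initialdir`, `should_transfer_files`, `when_to_transfer_output`, `queue`) are the ones the list above refers to.

```python
def render_submit_file(executable, arguments="", universe="vanilla",
                       initialdir=".", transfer="YES", when="ON_EXIT"):
    """Return the text of a minimal submit description file."""
    if transfer not in ("YES", "NO", "IF_NEEDED"):
        raise ValueError("should_transfer_files must be YES, NO, or IF_NEEDED")
    if when not in ("ON_EXIT", "ON_EXIT_OR_EVICT"):
        raise ValueError("when_to_transfer_output must be ON_EXIT or ON_EXIT_OR_EVICT")
    lines = [
        f"universe = {universe}",
        f"executable = {executable}",
        f"arguments = {arguments}",
        "input = job.in",     # stdin  (hypothetical file name)
        "output = job.out",   # stdout (hypothetical file name)
        "error = job.err",    # stderr (hypothetical file name)
        f"initialdir = {initialdir}",
        f"should_transfer_files = {transfer}",
    ]
    if transfer != "NO":      # NO disables the file transfer mechanism
        lines.append(f"when_to_transfer_output = {when}")
    lines.append("queue")     # queue one instance of the job
    return "\n".join(lines) + "\n"


# Usage: a hypothetical executable named "analyze" with two arguments.
text = render_submit_file("analyze", "-n 10")
print(text)
```

The resulting text would be saved to a file and passed to condor_submit.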

Submit description file

- requirements = <ClassAd Boolean Expression>
  - By default, Arch, OpSys, Disk, VirtualMemory, and (for vanilla) FileSystemDomain are set
- +<attribute> = <value>

Machine ClassAd Attributes

- Activity
- Arch
- CondorLoadAvg, ConsoleIdle, Disk, Cpus, KeyboardIdle, LoadAvg, KFlops, Mips, Memory, OpSys, FileSystemDomain, Requirements, StartdIpAddr
- ClientMachine, CurrentRank, RemoteOwner, LastPeriodicCheckpoint

Job ClassAd Attributes

- CompletionDate, RemoteIwd

Heterogeneous job submission

- Works well with the vanilla universe, since no checkpoint is taken.
- For the standard universe:

    # Added by Condor
    CkptRequirements = ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && \
                       ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))

    Requirements = (<user specified policy>) && $(CkptRequirements)
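The effect of the CkptRequirements expression can be traced with a simplified Python reading. This is a sketch, not a ClassAd evaluator: `None` stands in for UNDEFINED, string comparison is Python's case-sensitive `==`, and the Arch and OpSys values are illustrative. Before the first checkpoint, CkptArch and CkptOpSys are undefined, so any machine passes; once a checkpoint exists, only machines with the same architecture and operating system pass.

```python
UNDEFINED = None  # stands in for the ClassAd UNDEFINED value


def ckpt_requirements(job, machine):
    """Simplified Python reading of the CkptRequirements expression."""
    def same_or_unset(ckpt_attr, machine_attr):
        ckpt_value = job.get(ckpt_attr, UNDEFINED)
        # (CkptX =?= UNDEFINED) || (CkptX == X)
        return ckpt_value is UNDEFINED or ckpt_value == machine.get(machine_attr)

    return same_or_unset("CkptArch", "Arch") and same_or_unset("CkptOpSys", "OpSys")


# Usage with illustrative attribute values.
linux_box = {"Arch": "INTEL", "OpSys": "LINUX"}
sun_box = {"Arch": "SUN4u", "OpSys": "SOLARIS28"}

fresh_job = {}                                            # no checkpoint yet
resumed_job = {"CkptArch": "INTEL", "CkptOpSys": "LINUX"}  # checkpointed on Linux

fresh_on_sun = ckpt_requirements(fresh_job, sun_box)       # True
resumed_on_linux = ckpt_requirements(resumed_job, linux_box)  # True
resumed_on_sun = ckpt_requirements(resumed_job, sun_box)   # False
```

This is why a standard-universe job can start on any matching platform but, after its first checkpoint, is pinned to the platform on which the checkpoint was taken.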

Submission steps

- Job preparation
- Choosing a universe
- Submit description file
- condor_submit

Job Migration

- SIGTSTP and the signal handler in the standard universe
- SIGTERM in the vanilla universe

Condor Security

- The schedd starts the shadow with the effective UID of the job owner.
- Different authentication methods such as Kerberos and GSI, different encryption mechanisms, and authorization are supported between clients and daemons.
- Sockets and ports: the Condor collector and negotiator start on well-known ports. Other daemons start on ephemeral ports.

Checkpointing

- The ClassAd attributes CkptArch, CkptOpSys, LastCkptServer, LastCkptTime, and NumCkpts are generated automatically for the job