Condor Tutorial for Users
INFN-Bologna, 6/29/99

Derek Wright
Computer Sciences Department
University of Wisconsin-Madison
wright@cs.wisc.edu

Conventions Used In This Presentation

A slide with an all-yellow background is the beginning of a new “chapter”
• The slides after it will describe each entry on the yellow slide in great detail

A Condor tool that users would use will be in red italics

A ClassAd attribute name will be in blue

A UNIX shell command or file name will be in courier font

What is Condor?

A system for “High-Throughput Computing”

Lots of jobs over a long period of time, not a short burst of “high-performance”

Condor manages both resources (machines) and resource requests (jobs)

Supports additional features for jobs that are re-linked with Condor libraries:
• checkpointing
• remote system calls

What’s Condor Good For?

Managing a large number of jobs
• You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete

• Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc.

• Condor can handle inter-job dependencies (DAGMan)

What’s Condor Good For? (cont’d)

Robustness
• Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion

• If an execute machine crashes, you only lose work done since the last checkpoint

• Condor maintains a persistent job queue - if the submit machine crashes, Condor will recover

What’s Condor Good For? (cont’d)

Giving you access to more computing resources
• Checkpointing allows your job to run on “opportunistic resources” (not dedicated)
• Checkpointing also provides “migration” - if a machine is no longer available, move!
• With remote system calls, you don’t even need an account on a machine where your job executes

What is a Condor Pool?

“Pool” can be a single machine, or a group of machines

Determined by a “central manager” - the matchmaker and centralized information repository

Each machine runs various daemons to provide different services, either to the users who submit jobs, the machine owners, or the pool itself

What Kind of Job Do You Have?

You must know some things about your job to decide if and how it will work with Condor:
• What kind of I/O does it do?
• Does it use TCP/IP? (network sockets)
• Can the job be resumed?
• Is the job multi-process (fork(), pvm_addhost(), etc.)?

What Kind of I/O Does Your Job Do?

Interactive TTY
“Batch” TTY (just reads from STDIN and writes to STDOUT or STDERR, but you can redirect to/from files)
X Windows
NFS, AFS, or another network file system
Local file system
TCP/IP

What Does Condor Support?

Condor can support various combinations of these features in different “Universes”

Different Universes provide different functionality for your job:
• Vanilla
• Standard
• Scheduler
• PVM

What Does Condor Support?

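[Table: features supported in each Universe]
Columns: Interactive TTY | X windows | NFS/AFS | Local files | TCP/IP | Resume | Multi-process
Vanilla: X X X X
Standard: X X X
Scheduler: X X X X X
PVM: X X X X X
(the column position of each X is not recoverable from this copy)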

Condor Universes

A Universe specifies a Condor runtime environment:
• STANDARD
– Supports Checkpointing
– Supports Remote System Calls
– Has some limitations (no fork(), socket(), etc.)
• VANILLA
– Any Unix executable (shell scripts, etc.)
– No Condor Checkpointing or Remote I/O

Condor Universes (cont’d)

• PVM (Parallel Virtual Machine)
– Allows you to run parallel jobs in Condor (more on this later)
• SCHEDULER
– Special kind of Condor job: the job is run on the submit machine, not a remote execute machine
– Job is automatically restarted if the condor_schedd is shut down
– Used to schedule jobs (e.g. DAGMan)

Submitting Jobs to Condor

Choosing a “Universe” for your job (already covered this)

Preparing your job
• Making it “batch-ready”
• Re-linking if checkpointing and remote system calls are desired (condor_compile)

Creating a submit description file

Running condor_submit
• Sends your request to the User Agent (condor_schedd)

Preparing Your Job

Making your job “batch-ready”
• Must be able to run in the background: no interactive input, windows, GUI, etc.

• Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices

• If your job expects input from the keyboard, you have to put the input you want into a file

Preparing Your Job (cont’d)

If you are going to use the standard universe with checkpointing and remote system calls, you must re-link your job with Condor’s special libraries

To do this, you use condor_compile
• Place “condor_compile” in front of the command you normally use to link your job:

condor_compile gcc -o myjob myjob.c

Creating a Submit Description File

A plain ASCII text file

Tells Condor about your job:
• Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later)

Can describe many jobs at once (a “cluster”) each with different input, arguments, output, etc.

Example Submit Description File

# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe = standard
Executable = /home/wright/condor/my_job.condor
Input = my_job.stdin
Output = my_job.stdout
Error = my_job.stderr
Log = my_job.log
Arguments = -arg1 -arg2
InitialDir = /home/wright/condor/run_1
Queue

Example Submit Description File Described

Submits a single job to the standard universe, specifies files for STDIN, STDOUT and STDERR, creates a UserLog, defines command line arguments, and specifies the directory the job should be run in

Equivalent to (for outside of Condor):
% cd /home/wright/condor/run_1
% /home/wright/condor/my_job.condor -arg1 -arg2 \
    > my_job.stdout 2> my_job.stderr \
    < my_job.stdin

“Clusters” and “Processes”

If your submit file describes multiple jobs, we call this a “cluster”

Each job within a cluster is called a “process” or “proc”

If you only specify one job, you still get a cluster, but it has only one process

A Condor “Job ID” is the cluster number, a period, and the process number (“23.5”)

Process numbers always start at 0
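The Job ID convention above can be sketched in a few lines of Python. This is only an illustration of the "cluster.proc" naming; the helper names `parse_job_id` and `job_ids_for_cluster` are made up for this example and are not part of Condor.

```python
from typing import List, NamedTuple


class JobId(NamedTuple):
    cluster: int
    proc: int


def parse_job_id(text: str) -> JobId:
    """Split a "cluster.proc" string such as "23.5" into its two numbers."""
    cluster, _, proc = text.partition(".")
    return JobId(int(cluster), int(proc))


def job_ids_for_cluster(cluster: int, count: int) -> List[str]:
    """List the Job IDs of a cluster with `count` processes (procs start at 0)."""
    return [f"{cluster}.{proc}" for proc in range(count)]


if __name__ == "__main__":
    print(parse_job_id("23.5"))
    print(job_ids_for_cluster(23, 3))
```

A cluster submitted with a single job would simply yield one ID ending in ".0".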

Example Submit Description File for a Cluster

# Example condor_submit input file that defines
# a whole cluster of jobs at once
Universe = standard
Executable = /home/wright/condor/my_job.condor
Input = my_job.stdin
Output = my_job.stdout
Error = my_job.stderr
Log = my_job.log
Arguments = -arg1 -arg2
InitialDir = /home/wright/condor/run_$(Process)
Queue 500

Example Submit Description File for a Cluster - Described

Now, the initial directory for each job is specified with the $(Process) macro, and instead of submitting a single job, we use “Queue 500” to submit 500 jobs at once

$(Process) will be expanded to the process number for each job in the cluster (from 0 up to 499 in this case), so we’ll have “run_0”, “run_1”, … “run_499” directories

All the input/output files will be in different directories!
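One way to prepare the per-process directories before submitting is a small script like the sketch below. It assumes the run_N directories and their input files have to exist before condor_submit is run; the helper name `make_run_dirs` is hypothetical, and the demo uses a temporary directory rather than the real /home/wright/condor path.

```python
import tempfile
from pathlib import Path


def make_run_dirs(base: Path, count: int, input_name: str = "my_job.stdin") -> list:
    """Create base/run_0 ... base/run_<count-1>, each holding an empty input file."""
    created = []
    for proc in range(count):
        run_dir = base / f"run_{proc}"
        run_dir.mkdir(parents=True, exist_ok=True)
        # Placeholder input; a real job would get per-process input data here.
        (run_dir / input_name).touch()
        created.append(run_dir)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dirs = make_run_dirs(Path(tmp), 5)
        print([d.name for d in dirs])
```

For the 500-job example, `count` would be 500 and `base` the directory named in InitialDir.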

Running condor_submit

You give condor_submit the name of the submit file you have created

condor_submit parses the file and creates a “ClassAd” that describes your job(s)

Creates the files you specified for STDOUT and STDERR

Sends your job’s ClassAd(s) and executable to the condor_schedd, which stores the job in its queue

Monitoring Your Jobs

Using condor_q
Using a “User Log” file
Using condor_status
Using condor_rm
Getting email from Condor
Once they complete, you can use condor_history to examine them

Using condor_q

To view the jobs you have submitted, you use condor_q

Displays the status of your job, how much compute time it has accumulated, etc.

Many different options:
• A single job, a single cluster, all jobs that match a certain constraint, or all jobs
• Can view remote job queues (either individual queues, or “-global”)

Using a “User Log” file

A UserLog must be specified in your submit file:
• Log = filename

You get a log entry for everything that happens to your job:
• When it was submitted, when it starts executing, if it is checkpointed or vacated, if there are any problems, etc.

Very useful! Highly recommended!

Using condor_status

To view the status of the whole Condor pool, you use condor_status

Can use the “-run” option to see which machines are running jobs, as well as:
• The user who submitted each job
• The machine they submitted from

Can also view the status of various submitters with “-submitter <name>”

Using condor_rm

If you want to remove a job from the Condor queue, you use condor_rm

You can only remove jobs that you own (you can’t run condor_rm on someone else’s jobs unless you are root)

You can give specific job IDs (cluster or cluster.proc), or you can remove all of your jobs with the “-a” option.

Getting Email from Condor

By default, Condor will send you email when your job completes

If you don’t want this email, put this in your submit file:
notification = never

If you want email every time something happens to your job (checkpoint, exit, etc.), use this:
notification = always

Getting Email from Condor (cont’d)

If you only want email if your job exits with an error, use this:
notification = error

By default, the email is sent to your account on the host you submitted from. If you want the email to go to a different address, use this:
notify_user = [email protected]

Using condor_history

Once your job completes, it will no longer show up in condor_q

Now, you must use condor_history to view the job’s ClassAd

The status field (“ST”) will have either a “C” for “completed”, or an “X” if the job was removed with condor_rm

Any questions?

Nothing is too basic

If I was unclear, you probably are not the only person who doesn’t understand, and the rest of the day will be even more confusing

Hands-On Exercise #1

Submitting and Monitoring a Simple Test Job

Hands-On Exercise #1

Log in to your machine as user “condor”

You will see two windows:
• Netscape, with instructions
• An xterm, where you execute commands

To begin, click on Simple Test Job

Please follow the directions carefully

Any lines beginning with % are commands that you should execute in your xterm

If you accidentally exit Netscape, click on “Tutorial” in the Start menu

Lunch break

Please be back by 13:30

Welcome Back

Classified Advertisements

ClassAds
• Language for expressing attributes
• Semantics for evaluating them

Intuitively, a ClassAd is a set of named expressions
• Each named expression is an attribute

Expressions are similar to C …
• Constants, attribute references, operators

Classified Advertisements: Example

MyType = "Machine"

TargetType = "Job"

Name = "froth.cs.wisc.edu"

StartdIpAddr="<128.105.73.44:33846>"

Arch = "INTEL"

OpSys = "SOLARIS26"

VirtualMemory = 225312

Disk = 35957

KFlops = 21058

Mips = 103

LoadAvg = 0.011719

KeyboardIdle = 12

Cpus = 1

Memory = 128

Requirements = LoadAvg <= 0.300000 && KeyboardIdle > 15 * 60

Rank = 0

Classified Advertisements: Matching

ClassAds are always considered in pairs:
• Does ClassAd A match ClassAd B (and vice versa)?
• This is called “2-way matching”

If the same attribute appears in both ClassAds, you can specify which attribute you mean by putting “MY.” or “TARGET.” in front of the attribute name

Classified Advertisements: Examples

ClassAd A
MyType = "Apartment"
TargetType = "ApartmentRenter"
SquareArea = 3500
RentOffer = 1000
HeatIncluded = False
OnBusLine = True
Rank = UnderGrad==False + TARGET.RentOffer
Requirements = MY.RentOffer - TARGET.RentOffer < 150

ClassAd B
MyType = "ApartmentRenter"
TargetType = "Apartment"
UnderGrad = False
RentOffer = 900
Rank = 1/(TARGET.RentOffer + 100.0) + 50*HeatIncluded
Requirements = OnBusLine && SquareArea > 2700
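To see how the two example ads match each other, here is a rough Python sketch. It does not use the real ClassAd evaluator: each Requirements and Rank expression is hand-translated into a Python function, and attributes that an ad does not define itself (such as OnBusLine in ClassAd B) are assumed to resolve to the other ad. All function names are invented for this illustration.

```python
ad_a = {
    "MyType": "Apartment", "TargetType": "ApartmentRenter",
    "SquareArea": 3500, "RentOffer": 1000,
    "HeatIncluded": False, "OnBusLine": True,
}
ad_b = {
    "MyType": "ApartmentRenter", "TargetType": "Apartment",
    "UnderGrad": False, "RentOffer": 900,
}


def requirements_a(my, target):
    # Requirements = MY.RentOffer - TARGET.RentOffer < 150
    return my["RentOffer"] - target["RentOffer"] < 150


def requirements_b(my, target):
    # Requirements = OnBusLine && SquareArea > 2700 (both taken from the target ad)
    return target["OnBusLine"] and target["SquareArea"] > 2700


def rank_a(my, target):
    # Rank = UnderGrad==False + TARGET.RentOffer (a true comparison counts as 1)
    return int(target["UnderGrad"] == False) + target["RentOffer"]


def rank_b(my, target):
    # Rank = 1/(TARGET.RentOffer + 100.0) + 50*HeatIncluded
    return 1 / (target["RentOffer"] + 100.0) + 50 * int(target["HeatIncluded"])


def two_way_match(a, b):
    """Both ads must name each other's type and both Requirements must hold."""
    types_ok = a["TargetType"] == b["MyType"] and b["TargetType"] == a["MyType"]
    return bool(types_ok and requirements_a(a, b) and requirements_b(b, a))


if __name__ == "__main__":
    print("match:", two_way_match(ad_a, ad_b))
    print("rank A gives B:", rank_a(ad_a, ad_b))
    print("rank B gives A:", rank_b(ad_b, ad_a))
```

The point of the sketch is that the match is symmetric: neither side is accepted unless the other side's Requirements are also satisfied.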

ClassAds in the Condor System

ClassAds allow Condor to be a general system
• Constraints and ranks on matches expressed by the entities themselves
• Only priority logic integrated into the Match-Maker

All principal entities in the Condor system are represented by ClassAds
• Machines, Jobs, Submitters

ClassAds in Condor: Requirements and Rank (Example for Machines)

Friend = Owner == "tannenba" || Owner == "wright"

ResearchGroup = Owner == "jbasney" || Owner == "raman"

Trusted = Owner != "rival" && Owner != "riffraff"

Requirements = Trusted && ( ResearchGroup || (LoadAvg < 0.3 && KeyboardIdle > 15*60) )

Rank = Friend + ResearchGroup*10

Requirements for Machine Example Described

Machine will never start a job submitted by “rival” or “riffraff”

If someone from ResearchGroup (“jbasney” or “raman”) submits a job, it will always run, regardless of keyboard activity or load average

If anyone else submits a job, it will only run here if the keyboard has been idle for more than 15 minutes and the load average is less than 0.3

Machine Rank Example Described

If the machine is running a job submitted by owner “foo”, it will give this a Rank of 0, since foo is neither a friend nor in the same research group

If “wright” or “tannenba” submits a job, it will be ranked at 1 (since Friend will evaluate to 1 and ResearchGroup is 0)

If “raman” or “jbasney” submits a job, it will have a rank of 10

While a machine is running a job, it will be preempted for a higher ranked job
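The machine policy described on the last few slides can be checked with a small Python sketch. It mirrors the Friend, ResearchGroup, Trusted, Requirements and Rank expressions from the machine example by hand; `machine_policy` is a made-up helper, not a Condor function, and booleans are converted to 0/1 for the Rank arithmetic.

```python
def machine_policy(owner: str, load_avg: float, keyboard_idle: int):
    """Return (requirements, rank) for a job submitted by `owner`.

    keyboard_idle is in seconds, as in the KeyboardIdle attribute.
    """
    friend = owner in ("tannenba", "wright")
    research_group = owner in ("jbasney", "raman")
    trusted = owner not in ("rival", "riffraff")

    # Requirements = Trusted && (ResearchGroup || (LoadAvg < 0.3 && KeyboardIdle > 15*60))
    requirements = trusted and (
        research_group or (load_avg < 0.3 and keyboard_idle > 15 * 60)
    )
    # Rank = Friend + ResearchGroup*10
    rank = int(friend) + int(research_group) * 10
    return requirements, rank


if __name__ == "__main__":
    for owner in ("rival", "raman", "wright", "foo"):
        print(owner, machine_policy(owner, load_avg=0.1, keyboard_idle=20 * 60))
```

Trying a busy machine (for example `load_avg=2.0, keyboard_idle=0`) shows that only the research group members still satisfy Requirements.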

ClassAds in Condor: Requirements and Rank

(Example for Jobs)

Requirements = Arch == "INTEL" && OpSys == "LINUX" && Memory > 20

Rank = (Memory > 32) * ( (Memory * 100) + (IsDedicated * 10000) + Mips )

Job Example Described

The job must run on an Intel CPU, running Linux, with more than 20 megs of RAM

All machines with 32 megs of RAM or less are Ranked at 0

Machines with more than 32 megs of RAM are ranked according to how much RAM they have, if the machine is dedicated (which counts a lot to this job!), and how fast the machine is, as measured in Million Instructions Per Second
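As a rough illustration of how the job's Rank orders candidate machines, the sketch below hand-translates the job's Requirements and Rank expressions into Python and sorts a few invented machine ads. The machine values are made up for the example, and IsDedicated is assumed to be defined on every machine (it is a custom attribute, so in a real pool it might not be).

```python
def job_requirements(machine: dict) -> bool:
    # Requirements = Arch == "INTEL" && OpSys == "LINUX" && Memory > 20
    return (
        machine["Arch"] == "INTEL"
        and machine["OpSys"] == "LINUX"
        and machine["Memory"] > 20
    )


def job_rank(machine: dict) -> int:
    # Rank = (Memory > 32) * ((Memory * 100) + (IsDedicated * 10000) + Mips)
    return int(machine["Memory"] > 32) * (
        machine["Memory"] * 100 + int(machine["IsDedicated"]) * 10000 + machine["Mips"]
    )


# Hypothetical machine ads, for illustration only.
machines = [
    {"Name": "small", "Arch": "INTEL", "OpSys": "LINUX", "Memory": 32, "IsDedicated": False, "Mips": 103},
    {"Name": "desktop", "Arch": "INTEL", "OpSys": "LINUX", "Memory": 64, "IsDedicated": False, "Mips": 103},
    {"Name": "dedicated", "Arch": "INTEL", "OpSys": "LINUX", "Memory": 64, "IsDedicated": True, "Mips": 103},
    {"Name": "solaris", "Arch": "INTEL", "OpSys": "SOLARIS26", "Memory": 128, "IsDedicated": True, "Mips": 103},
]

if __name__ == "__main__":
    candidates = [m for m in machines if job_requirements(m)]
    for m in sorted(candidates, key=job_rank, reverse=True):
        print(m["Name"], job_rank(m))
```

The dedicated machine wins by a wide margin because the 10000 weight dominates the memory and Mips terms.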

Finding and Using the ClassAd Attributes in your Pool

Condor defines a number of attributes by default, which are listed in the User Manual (“About Requirements and Rank”)

To see if machines in your pool have other attributes defined, use:
• condor_status -long <hostname>

A custom-defined attribute might not be defined on all machines in your pool, so you’ll probably want to use “meta-operators”

ClassAd “Meta-Operators”

Meta operators allow you to compare against “UNDEFINED” as if it were a real value:
• =?= is “meta-equal-to”
• =!= is “meta-not-equal-to”
• Color != "Red" (non-meta) would evaluate to UNDEFINED if Color is not defined
• Color =!= "Red" would evaluate to True if Color is not defined, since UNDEFINED is not "Red"
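The difference between the ordinary and meta operators can be modelled in Python with a sentinel value. This is a simplified sketch of the behaviour described above, not the actual ClassAd implementation: `UNDEFINED`, `not_equal`, `meta_equal` and `meta_not_equal` are names chosen for this example, and details such as type checking are left out.

```python
class _Undefined:
    """Sentinel standing in for the ClassAd UNDEFINED value."""

    def __repr__(self):
        return "UNDEFINED"


UNDEFINED = _Undefined()


def not_equal(a, b):
    """Ordinary != : any UNDEFINED operand makes the result UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a != b


def meta_equal(a, b):
    """=?= : UNDEFINED is treated like a real value, so the result is always a boolean."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return a == b


def meta_not_equal(a, b):
    """=!= : the negation of =?=."""
    return not meta_equal(a, b)


if __name__ == "__main__":
    machine_ad = {}  # a machine that does not define Color
    color = machine_ad.get("Color", UNDEFINED)
    print('Color != "Red"  ->', not_equal(color, "Red"))
    print('Color =!= "Red" ->', meta_not_equal(color, "Red"))
```

With the meta operator, a Requirements expression keeps a definite True/False value even on machines where the custom attribute is missing.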

Hands-On Exercise #2

Submitting Jobs with Requirements and Rank

Hands-On Exercise #2

Please point your browser to the new instructions:
• Go back to the tutorial homepage
• Click on Requirements and Rank
• Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm

If you exited Netscape, just click on “Tutorial” from your Start menu

Priorities In Condor

Two kinds of priorities:
• User Priorities
– Priorities between users in the pool to ensure fairness
– The lower the value, the better the priority
• Job Priorities
– Priorities that users give to their own jobs to determine the order in which they will run
– The higher the value, the better the priority
– Only matters within a given user’s jobs

User Priorities in Condor

Each active user in the pool has a user priority

Viewed or changed with condor_userprio

The lower the number, the better

A given user’s share of available machines is inversely related to the ratio between user priorities.
• Example: Fred’s priority is 10, Joe’s is 20. Fred will be allocated twice as many machines as Joe.
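The Fred and Joe example can be worked through numerically. The sketch below assumes each user's share is simply proportional to 1/priority, which is one reading of "inversely related"; Condor's actual negotiation may well be more involved, and `machine_shares` is a hypothetical helper written for this illustration.

```python
def machine_shares(priorities: dict, machines: int) -> dict:
    """Split `machines` among users in proportion to 1/priority (lower value = better)."""
    weights = {user: 1.0 / prio for user, prio in priorities.items()}
    total = sum(weights.values())
    return {user: machines * weight / total for user, weight in weights.items()}


if __name__ == "__main__":
    shares = machine_shares({"fred": 10, "joe": 20}, machines=30)
    for user, share in shares.items():
        print(f"{user}: about {share:.1f} machines")
```

With 30 machines this gives Fred roughly 20 and Joe roughly 10, matching the two-to-one ratio on the slide.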

User Priorities in Condor, cont.

Condor continuously adjusts user priorities over time
• machines allocated > priority, priority worsens
• machines allocated < priority, priority improves

Priority Preemption
• Higher priority users will grab machines away from lower priority users (thanks to Checkpointing…)
• Starvation is prevented
• Priority “thrashing” is prevented

Job Priorities in Condor

Can be set at submit-time in your description file with:
prio = <number>

Can be viewed with condor_q

Can be changed at any time with condor_prio

The higher the number, the more likely the job will run (only among the jobs of an individual user)

Managing a Large Cluster of Jobs

Condor can manage huge numbers of jobs

Special features of the submit description file make this easier

Condor can also manage inter-job dependencies with condor_dagman
• For example: job A should run first, then run jobs B and C; when those finish, submit D, etc.

• We’ll discuss DAGMan later

Submitting a Large Cluster

Anywhere in your submit file, if you use $(Process), that will expand to the process number of each job in the cluster:
input = my_input.$(process)
arguments = $(process)

It is common to use $(Process) to specify InitialDir, so that each process runs in its own directory:
InitialDir = dir.$(process)
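To make the macro expansion concrete, here is a small Python sketch that imitates it on a dictionary of submit-file settings. It only models the $(Process) substitution and assumes the macro name is case-insensitive (the slides write both $(Process) and $(process)); `expand_process` and `expand_cluster` are illustrative names, not Condor tools.

```python
import re

# Matches $(Process) in any letter case.
_PROCESS_MACRO = re.compile(r"\$\(process\)", re.IGNORECASE)


def expand_process(value: str, proc: int) -> str:
    """Replace every $(Process) in `value` with the process number."""
    return _PROCESS_MACRO.sub(str(proc), value)


def expand_cluster(template: dict, count: int) -> list:
    """Return one expanded settings dict per process, numbered from 0."""
    return [
        {key: expand_process(val, proc) for key, val in template.items()}
        for proc in range(count)
    ]


if __name__ == "__main__":
    template = {
        "input": "my_input.$(process)",
        "arguments": "$(process)",
        "InitialDir": "dir.$(process)",
    }
    for job in expand_cluster(template, 3):
        print(job)
```

Each job in the cluster therefore sees its own input file, argument and working directory, even though the submit file names them only once.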

Submitting a Large Cluster (cont’d)

Can either have multiple Queue entries, or put a number after Queue to tell Condor how many to submit: Queue 1000

A cluster is more efficient: Your jobs will run faster, and they’ll use less space

Can only have one executable per cluster: Different executables must be different clusters!

Hands-On Exercise #3

Submitting a Large Cluster of Jobs

Hands-On Exercise #3

Please point your browser to the new instructions:
• Go back to the tutorial homepage
• Click on Large Clusters
• Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm

If you exited Netscape, just click on “Tutorial” from your Start menu

10 Minute Break

Questions are welcome….

Inter-Job Dependencies with DAGMan

DAGMan can be used to handle a set of jobs that must be run in a certain order

Also provides “pre” and “post” operations, so you can have a program or script run before each job is submitted and after it completes

Robust: handles errors and submit-machine crashes

Using DAGMan

You define a DAG description file, which is similar in function to the submit file you give to condor_submit

DAGMan restrictions:
• Each job in the DAG must be in its own cluster (this is a limitation we will remove in future versions)

• All jobs in the DAG must have a User Log and must share the same file

Format of the DAGMan Description File

# is a comment

First section names the jobs in your DAG and associates a submit description file with each job

Second (optional) section defines PRE and POST scripts to run

Final section defines the job dependencies

Example DAGMan Description File

# Example DAGMan input file
Job A A.submit
Job B B.submit
Job C C.submit
Job D D.submit
Script PRE D d_input_checker
Script POST A a_output_processor A.out
PARENT A CHILD B C
PARENT B C CHILD D
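To show what the dependency section of this file implies, the Python sketch below reads the Job and PARENT ... CHILD lines and works out an order in which the jobs could run. It is an illustration of the idea only, not DAGMan's parser or scheduler: Script lines are ignored, and the function names are invented for this example.

```python
from collections import defaultdict, deque

DAG_TEXT = """\
# Example DAGMan input file
Job A A.submit
Job B B.submit
Job C C.submit
Job D D.submit
Script PRE D d_input_checker
Script POST A a_output_processor A.out
PARENT A CHILD B C
PARENT B C CHILD D
"""


def parse_dag(text: str):
    """Return ({job name: submit file}, {parent: set of children})."""
    jobs = {}
    children = defaultdict(set)
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue
        keyword = words[0].upper()
        if keyword == "JOB":
            jobs[words[1]] = words[2]
        elif keyword == "PARENT":
            child_at = [w.upper() for w in words].index("CHILD")
            for parent in words[1:child_at]:
                children[parent].update(words[child_at + 1:])
    return jobs, children


def run_order(jobs, children):
    """Order the jobs so that every parent comes before its children."""
    waiting_on = {name: 0 for name in jobs}
    for kids in children.values():
        for kid in kids:
            waiting_on[kid] += 1
    ready = deque(sorted(name for name, n in waiting_on.items() if n == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for kid in sorted(children.get(job, ())):
            waiting_on[kid] -= 1
            if waiting_on[kid] == 0:
                ready.append(kid)
    if len(order) != len(jobs):
        raise ValueError("dependency cycle: this is not a DAG")
    return order


if __name__ == "__main__":
    jobs, children = parse_dag(DAG_TEXT)
    print(run_order(jobs, children))
```

In this example B and C have no dependency on each other, so once A completes they could be in the queue at the same time; D waits for both.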

Setting up a DAG for Condor

Must create the DAG description file

Must create all the submit description files for the individual jobs

Must prepare any executables you plan to use

If you want, you can have a mix of Vanilla and Standard jobs

Must set up any PRE/POST commands or scripts you wish to use

Submitting a DAG to Condor

Once you have everything in place, to submit a DAG, you use condor_submit_dag and give it the name of your DAG description file

This will check your input file for errors and submit a copy of condor_dagman as a scheduler universe job with all the necessary command-line arguments

Removing a DAG

Removing a DAG is easy:
• Just use condor_rm on the scheduler universe job (condor_dagman)
• On shutdown, DAGMan will remove any jobs that are currently in the queue that are associated with its DAG

• Once all jobs are gone, DAGMan itself will exit, and the scheduler universe job will be removed from the queue

Hands-On Exercise #4

Using DAGMan

Hands-On Exercise #4

Please point your browser to the new instructions:
• Go back to the tutorial homepage
• Click on Using_DAGMan
• Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm

If you exited Netscape, just click on “Tutorial” from your Start menu

What’s Wrong with my Vanilla Job?

Special requirements expressions for vanilla jobs

You didn’t submit it from a directory that is shared

Condor isn’t running as root (more on this later)

You don’t have your file permissions set up correctly (more on this later)


Special Requirements Expressions for Vanilla Jobs

When you submit a vanilla job, Condor automatically appends two extra Requirements:
• UID_DOMAIN == <submit_uid_domain>
• FILESYSTEM_DOMAIN == <submit_fs>

Since there are no remote system calls with Vanilla jobs, they depend on a shared file system and a common UID space to run as you and access your files


Special Requirements Expressions for Vanilla Jobs (cont’d)

By default, each machine in your pool is in its own UID_DOMAIN and FILESYSTEM_DOMAIN, so your pool administrator has to configure your pool specially if there really is a common UID space and a network file system

If you don’t have an account on the remote system, Vanilla jobs won’t work
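One way to check how a machine is configured is to ask condor_config_val for the two settings. This is a sketch: the function name show_domains is made up for illustration, and it only reports values; it does not change anything.

```shell
# Print the two domain settings of the machine this runs on.
show_domains() {
    if command -v condor_config_val >/dev/null 2>&1; then
        echo "UID_DOMAIN:        $(condor_config_val UID_DOMAIN)"
        echo "FILESYSTEM_DOMAIN: $(condor_config_val FILESYSTEM_DOMAIN)"
    else
        echo "condor_config_val not found on this machine"
    fi
}

show_domains
```

Run it on the submit machine and on an execute machine. A vanilla job can only match machines where both values agree with those of the submit machine.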


Shared File Systems for Vanilla Jobs

Just because you have AFS or NFS doesn’t mean ALL files are shared
• Initialdir = /tmp will probably cause trouble for Vanilla jobs!

You must be sure to set Initialdir to a shared directory (or cd into it to run condor_submit) for Vanilla jobs
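A sketch of a vanilla submit description file that sets Initialdir explicitly. Every name and path in it (vanilla.submit, my_program, /home/me/run1) is a hypothetical placeholder; the point is only that Initialdir names a directory the execute machines can also see.

```shell
# Write an illustrative submit description file into the current directory.
cat > vanilla.submit <<'EOF'
# Hypothetical vanilla job; all names and paths are placeholders.
universe   = vanilla
executable = my_program
# Must be visible on every execute machine (for example on NFS or AFS),
# not a local disk such as /tmp.
initialdir = /home/me/run1
output     = my_program.out
error      = my_program.err
log        = my_program.log
queue
EOF

echo "Wrote vanilla.submit; submit it with: condor_submit vanilla.submit"
```

The output, error, and log files are relative paths, so they end up in the Initialdir directory.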


Why Don’t My Jobs Run?

Try using condor_q -analyze

Try specifying a User Log for your job

Look at condor_userprio: maybe you have a bad priority and higher priority users are being served

Problems with file permissions or network file systems

Look at the SchedLog
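The checklist above can be sketched as a short script. The optional job ID argument is an assumption for illustration; the commands are the ones named on this slide, plus condor_config_val to locate the SchedLog.

```shell
# Optional first argument: the ID of the idle job, as shown by condor_q.
JOB_ID="${1:-}"

if ! command -v condor_q >/dev/null 2>&1; then
    echo "Condor tools not found on this machine"
else
    # 1. Ask Condor why the job does not match any machine.
    if [ -n "$JOB_ID" ]; then
        condor_q -analyze "$JOB_ID"
    else
        condor_q -analyze
    fi

    # 2. Compare your priority with that of the other users.
    condor_userprio

    # 3. Find the condor_schedd log file so you can read it.
    echo "SchedLog: $(condor_config_val schedd_log)"
fi
```

Reading the SchedLog itself may need permissions that only your pool administrator has.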


Using condor_q -analyze

condor_q -analyze will analyze your job’s ClassAd, get all the ClassAds of the machines in the pool, and tell you what’s going on:
• Will report errors in your Requirements expression (impossible to match, etc.)
• Will tell you about user priorities in the pool (other people have better priority)


Looking at condor_userprio

You can look at condor_userprio yourself

If your priority value is a really high number (because you’ve been running a lot of Condor jobs), other users will have priority to run jobs in your pool


File Permissions in Condor

If Condor isn’t running as root, the condor_shadow process runs as the user the condor_schedd is running as (usually “condor”)

You must grant this user write access to your output files, and read access to your input files (both STDOUT, STDIN from your submit file, as well as files your job explicitly opens)


File Permissions in Condor (cont’d)

Often, there will be a “condor” group and you can make your files owned and writable by this group

For vanilla jobs, even if the UID_DOMAIN setting is correct and matches on your submit and execute machines, if Condor isn’t running as root, your job will be started as user “condor”, not as you!
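A sketch of the group-based approach, assuming your site really has a “condor” group and that you are a member of it (ask your administrator). The directory and file names are placeholders; a temporary directory is used so the example is safe to run anywhere.

```shell
# Placeholder run directory with one input and one pre-created output file.
RUN_DIR="$(mktemp -d)"
touch "$RUN_DIR/input.dat" "$RUN_DIR/output.dat"

# Hand the files to the "condor" group if such a group exists here.
# chgrp only succeeds if you belong to that group.
if getent group condor >/dev/null 2>&1; then
    chgrp condor "$RUN_DIR" "$RUN_DIR"/* \
        || echo "chgrp failed: are you a member of the condor group?"
fi

# The group must be able to enter the directory, read the input,
# and write the output.
chmod g+rx "$RUN_DIR"
chmod g+r  "$RUN_DIR/input.dat"
chmod g+rw "$RUN_DIR/output.dat"

ls -l "$RUN_DIR"
```

If the job creates new files rather than writing to existing ones, the directory itself also needs group write permission.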


Problems with NFS in Condor

For NFS, sometimes the administrators will set up read-only mounts, or have UIDs remapped for certain partitions (the classic example is root = nobody, but modern NFS can do arbitrary remappings)


Problems with NFS in Condor (cont’d)

If your pool uses NFS automounting, the directory that Condor thinks is your InitialDir (the directory you were in when you ran condor_submit) might not exist on a remote machine
• E.g. you’re in /mnt/tmp/home/me/...

With automounting, you always need to specify InitialDir explicitly
• InitialDir = /home/me/...


Problems with AFS in Condor

If your pool uses AFS, the condor_shadow, even if it’s running with your UID, will not have your AFS token
• You must grant an unauthenticated AFS user the appropriate access to your files
• Some sites provide a better alternative than world-writable files
– Host ACLs
– Network-specific ACLs


Looking at the SchedLog

The log file of the condor_schedd, the “SchedLog” file, can possibly give you a clue if there are problems
• Find it with: condor_config_val schedd_log
• You might need your pool administrator to turn on a higher “debugging level” to see more verbose output


Other User Features

Submit-Only installation

Heterogeneous Submit

PVM jobs


Submit-Only Installation

Can install just a condor_master and condor_schedd on your machine

Can submit jobs into a remote pool

Special option to condor_install


Heterogeneous Submit

The job you submit doesn’t have to be the same platform as the machine you submit from
• Maybe you have access to a pool that’s full of Alphas, but you have a Sparc on your desk, and moving all your data is a pain

You can take an Alpha binary, copy it to your Sparc, and submit it with a requirements expression that says you need to run on ALPHA/OSF1
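A sketch of what such a submit description file might look like. The file and program names are hypothetical, and the universe line assumes the Alpha binary was re-linked with the Condor libraries; Arch and OpSys are the machine ClassAd attributes the requirements expression tests.

```shell
# Write an illustrative submit file on the Sparc for a job that runs on Alphas.
cat > alpha_job.submit <<'EOF'
# Hypothetical job: the binary was built for Alpha/OSF1, not for this Sparc.
# Assumes it was re-linked with the Condor libraries (Standard universe).
universe     = standard
executable   = my_program.alpha
requirements = (Arch == "ALPHA") && (OpSys == "OSF1")
output       = my_program.out
error        = my_program.err
log          = my_program.log
queue
EOF

echo "Wrote alpha_job.submit; submit it with: condor_submit alpha_job.submit"
```

Without the explicit requirements expression, Condor would try to match machines of the same platform as the submit machine.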


Parallel Jobs in Condor

Condor can run parallel applications
• Written to the popular PVM message passing library
• Future work includes support for MPI

Master-Worker Paradigm

What does Condor-PVM do?

How to compile and submit Condor-PVM jobs


Master-Worker Paradigm

Condor-PVM is designed to run PVM applications which follow the master-worker paradigm.

Master
• has a pool of work, sends pieces of work to the workers, manages the work and the workers

Worker
• gets a piece of work, does the computation, sends the result back


What does Condor-PVM do?

Condor acts as the PVM resource manager.

All pvm_addhosts requests get re-mapped to Condor.
• Condor dynamically constructs PVM virtual machines out of non-dedicated desktop machines.

When a machine leaves the pool, the user gets notified via the normal PVM notification mechanisms.


How to compile and submit Condor-PVM jobs

Binary Compatible
• Compile and link with PVM library just as normal PVM applications. No need to link with Condor.

Submit
• In the submit description file, set:
universe = PVM
machine_count = <min>..<max>
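A sketch of a complete Condor-PVM submit description file built around the two settings above. The file name, the executable name master_program, and the range 1..4 are illustrative assumptions.

```shell
# Write an illustrative Condor-PVM submit file into the current directory.
cat > pvm_job.submit <<'EOF'
# Hypothetical Condor-PVM job; names and numbers are placeholders.
universe      = PVM
# The master program of a master-worker PVM application.
executable    = master_program
# Start with at least 1 machine and ask for at most 4.
machine_count = 1..4
output        = master_program.out
error         = master_program.err
queue
EOF

echo "Wrote pvm_job.submit; submit it with: condor_submit pvm_job.submit"
```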


Obtaining Condor

Condor can be downloaded from the Condor web site at: http://www.cs.wisc.edu/condor

Complete Users and Administrators manual available: http://www.cs.wisc.edu/condor/manual

Contracted Support is available

Questions? Email: [email protected]