
http://www.cs.wisc.edu/condor

Using Condor An Introduction

ICE 2008


The Condor Project (Established ‘85)

Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students.


Definitions

› Job
  The Condor representation of your work
› Machine
  The Condor representation of computers that can perform the work
› Matchmaking
  Matching a job with a machine “Resource”


Job

Jobs state their requirements and preferences:
  I need a Linux/x86 platform
  I need the machine to have at least 500 MB
  I prefer a machine with more memory


Machine

Machines state their requirements and preferences:
  Run jobs only when there is no keyboard activity
  I prefer to run Frieda’s jobs
  I am a machine in the econ department
  Never run jobs belonging to Dr. Smith


The Magic of Matchmaking

› Jobs and machines state their requirements and preferences

› Condor matches jobs with machines based on requirements and preferences
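The matching idea can be sketched in a few lines of Python. This is only an illustration, not Condor's ClassAd language or its matchmaker: the `Machine` and `Job` classes, their fields, and the machine names are all made up for the example. Requirements are treated as hard filters on both sides, and the job's preference ("more memory") is used to rank whatever survives.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Machine:
    name: str
    os: str
    arch: str
    memory_mb: int
    keyboard_idle: bool
    banned_owners: frozenset = frozenset()


@dataclass
class Job:
    owner: str
    requirements: Callable[[Machine], bool]  # must be true for a match
    rank: Callable[[Machine], float]         # higher value is preferred


def machine_accepts(machine, job):
    """The machine's own requirements: idle keyboard, owner not banned."""
    return machine.keyboard_idle and job.owner not in machine.banned_owners


def match(job, machines):
    """Return the acceptable machine that the job ranks highest, or None."""
    candidates = [m for m in machines
                  if job.requirements(m) and machine_accepts(m, job)]
    return max(candidates, key=job.rank, default=None)


job = Job(
    owner="frieda",
    requirements=lambda m: (m.os == "LINUX" and m.arch == "X86"
                            and m.memory_mb >= 500),
    rank=lambda m: m.memory_mb,  # "I prefer a machine with more memory"
)
machines = [
    Machine("econ1", "LINUX", "X86", 512, keyboard_idle=True),
    Machine("econ2", "LINUX", "X86", 2048, keyboard_idle=False),
    Machine("econ3", "LINUX", "X86", 1024, keyboard_idle=True),
    Machine("econ4", "WINDOWS", "X86", 4096, keyboard_idle=True),
]
best = match(job, machines)
print(best.name if best else "no match")
```

Here econ2 is skipped because someone is at the keyboard and econ4 because of its platform, so the job lands on the remaining machine with the most memory.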


Getting Started: Submitting Jobs to Condor

› Overview:
  Choose a “Universe” for your job
  Make your job “batch-ready”
  Create a submit description file
  Run condor_submit to put your job in the queue


1. Choose the “Universe”

› Controls how Condor handles jobs
› Choices include:
  Vanilla
  Standard
  Grid
  Java
  Parallel
  VM


Using the Vanilla Universe

• The Vanilla Universe:
  – Allows running almost any “serial” job
  – Provides automatic file transfer, etc.
  – Like vanilla ice cream
    • Can be used in just about any situation


2. Make your job batch-ready

Must be able to run in the background

• No interactive input

• No GUI/window clicks

• No music ;^)


Make your job batch-ready (continued)…

Job can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices

Similar to UNIX shell:
  • $ ./myprogram <input.txt >output.txt
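To see what "files instead of devices" means in practice, here is a small Python sketch that runs a program with its standard input and output attached to files, the same way the shell redirection above does. The upper-casing child program is a hypothetical stand-in for ./myprogram, and the file names simply mirror the shell example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for ./myprogram: upper-cases whatever arrives on stdin.
PROGRAM = [sys.executable, "-c",
           "import sys; sys.stdout.write(sys.stdin.read().upper())"]


def run_batch(workdir: Path) -> str:
    """Run PROGRAM with stdin/stdout bound to files, like `<input.txt >output.txt`."""
    input_path = workdir / "input.txt"
    output_path = workdir / "output.txt"
    input_path.write_text("hello condor\n")
    with input_path.open("rb") as stdin, output_path.open("wb") as stdout:
        subprocess.run(PROGRAM, stdin=stdin, stdout=stdout, check=True)
    return output_path.read_text()


with tempfile.TemporaryDirectory() as tmp:
    result = run_batch(Path(tmp))
print(result, end="")
```

The program never touches a keyboard or a screen, which is what makes it batch-ready.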


3. Create a Submit Description File

› A plain ASCII text file

› Condor does not care about file extensions

› Tells Condor about your job:
  Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later)

› Can describe many jobs at once (a “cluster”), each with different input, arguments, output, etc.


Simple Submit Description File

# Simple condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe = vanilla
Executable = my_job
Output = output.txt
Queue
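The structure of this file (comments, "name = value" settings, a Queue statement) is simple enough to read with a toy parser. The Python sketch below is not how condor_submit parses its input; `parse_submit` is a made-up helper that only handles the features shown here, and it assumes a single block of settings followed by Queue lines.

```python
def parse_submit(text):
    """Toy reader for a submit description: returns (settings, queue_count)."""
    settings = {}
    queue_count = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        words = line.split()
        if words[0].lower() == "queue":
            # "Queue" alone queues one job; "Queue N" queues N jobs
            queue_count += int(words[1]) if len(words) > 1 else 1
            continue
        name, _, value = line.partition("=")
        # Names on the left are folded to lower case; values keep their case
        settings[name.strip().lower()] = value.strip()
    return settings, queue_count


SUBMIT = """\
# Simple condor_submit input file
UNIVERSE = vanilla
executable = my_job
Output = output.txt
Queue
"""
settings, queue_count = parse_submit(SUBMIT)
print(settings, queue_count)
```

Folding only the left-hand side to lower case mirrors the note in the file: the setting names are not case sensitive, but filenames are.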


4. Run condor_submit

› You give condor_submit the name of the submit file you have created:
  condor_submit my_job.submit
› condor_submit:
  Parses the submit file, checks for errors
  Creates a “ClassAd” that describes your job(s)
  Puts job(s) in the Job Queue


The Job Queue

› condor_submit sends your job’s ClassAd(s) to the schedd

› The schedd (more details later):
  Manages the local job queue
  Stores the job in the job queue
    • Atomic operation, two-phase commit
    • “Like money in the bank”

› View the queue with condor_q


Example condor_submit and condor_q

% condor_submit my_job.submit
Submitting job(s).
1 job(s) submitted to cluster 1.

% condor_q

-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     frieda   6/16 06:52  0+00:00:00 I  0   0.0  my_job

1 jobs; 1 idle, 0 running, 0 held

%
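If you drive condor_submit from a script, the cluster number is usually the piece of its output you want to keep. A minimal Python sketch, assuming the summary line looks exactly like the one in the example above; `parse_submit_output` is a made-up helper, and the job IDs it builds rely on process numbers starting at zero.

```python
import re

# Matches the summary line printed in the example above.
SUBMITTED = re.compile(r"(\d+) job\(s\) submitted to cluster (\d+)\.")


def parse_submit_output(text):
    """Return (cluster_number, list_of_job_ids) from condor_submit's output text."""
    m = SUBMITTED.search(text)
    if m is None:
        raise ValueError("no submission summary found")
    count, cluster = int(m.group(1)), int(m.group(2))
    return cluster, [f"{cluster}.{proc}" for proc in range(count)]


output = "Submitting job(s).\n1 job(s) submitted to cluster 1.\n"
cluster, job_ids = parse_submit_output(output)
print(cluster, job_ids)
```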


Input, output & error files

› Controlled by submit file settings
› You can define the job’s standard input, standard output and standard error:
  Read job’s standard input from “input_file”:
    • Input = input_file
    • Shell equivalent: program <input_file
  Write job’s standard output to “output_file”:
    • Output = output_file
    • Shell equivalent: program >output_file
  Write job’s standard error to “error_file”:
    • Error = error_file
    • Shell equivalent: program 2>error_file


Email about your job

• Condor sends email about job events to the submitting user

• Specify “notification” in your submit file to control which events:

Notification = complete ·Default
Notification = never
Notification = error
Notification = always


Feedback on your job

› Create a log of job events

› Add to submit description file:
  log = sim.log
› Becomes the Life Story of a Job
  Shows all events in the life of a job
  Always have a log file


Sample Condor User Log

000 (0001.000.000) 05/25 19:10:03 Job submitted from host: <128.105.146.14:1816>

...

001 (0001.000.000) 05/25 19:12:17 Job executing on host: <128.105.146.14:1026>

...

005 (0001.000.000) 05/25 19:13:06 Job terminated.

(1) Normal termination (return value 0)

...
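Because every event line in the sample starts with an event code, a job ID in parentheses, and a timestamp, the log is easy to scan from a script. The Python sketch below is an assumption-laden reader for exactly the layout shown above (`LogEvent` and `parse_user_log` are made-up names); it keeps the first two numbers of the ID as cluster and process and ignores the third.

```python
import re
from typing import NamedTuple


class LogEvent(NamedTuple):
    code: str
    cluster: int
    proc: int
    date: str
    time: str
    message: str


# Event header: code, (cluster.proc.<third number>), MM/DD, HH:MM:SS, message
EVENT = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\d\d/\d\d) (\d\d:\d\d:\d\d) (.*)$")


def parse_user_log(text):
    """Collect the event header lines; detail lines and '...' separators are skipped."""
    events = []
    for line in text.splitlines():
        m = EVENT.match(line.strip())
        if m:
            code, cluster, proc, date, time, message = m.groups()
            events.append(
                LogEvent(code, int(cluster), int(proc), date, time, message))
    return events


SAMPLE = """\
000 (0001.000.000) 05/25 19:10:03 Job submitted from host: <128.105.146.14:1816>
...
001 (0001.000.000) 05/25 19:12:17 Job executing on host: <128.105.146.14:1026>
...
005 (0001.000.000) 05/25 19:13:06 Job terminated.
        (1) Normal termination (return value 0)
...
"""
events = parse_user_log(SAMPLE)
for event in events:
    print(event.code, f"{event.cluster}.{event.proc}", event.time, event.message)
```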


Example Submit Description File With Logging

# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe = vanilla
Executable = /home/frieda/condor/my_job.condor
Log = my_job.log ·Job log (from Condor)
Input = my_job.in ·Program’s standard input
Output = my_job.out ·Program’s standard output
Error = my_job.err ·Program’s standard error
Arguments = -a1 -a2 ·Command line arguments
InitialDir = /home/frieda/condor/run
Queue


Let’s run a job

› First, you need a terminal emulator
  http://www.putty.org
  • (or similar)

› Log in to chopin.cs.wisc.edu as cguserXX, with the given password

› source /scratch/ice08


Logged In?

› condor_q

› condor_status


Create submit file

› nano submit
  • universe = vanilla
  • executable = /bin/echo
  • Arguments = hello world
  • Should_transfer_files = always
  • When_to_transfer_output = on_exit
  • Output = out
  • Log = log
  • queue


And submit it…

› condor_submit submit

› (wait… remember the HTC bit?)

› condor_q xx

› cat out


“Clusters” and “Processes”

› If your submit file describes multiple jobs, we call this a “cluster”
› Each cluster has a unique “cluster number”
› Each job in a cluster is called a “process”
  Process numbers always start at zero
› A Condor “Job ID” is the cluster number, a period, and the process number (e.g., 2.1)
  A cluster can have a single process
    • Job ID = 20.0 ·Cluster 20, process 0
  Or, a cluster can have more than one process
    • Job ID: 21.0, 21.1, 21.2 ·Cluster 21, process 0, 1, 2


Submit File for a Cluster

# Example submit file for a cluster of 2 jobs
# with separate input, output, error and log files
Universe = vanilla
Executable = my_job

Arguments = -x 0
log = my_job_0.log
Input = my_job_0.in
Output = my_job_0.out
Error = my_job_0.err
Queue ·Job 2.0 (cluster 2, process 0)

Arguments = -x 1
log = my_job_1.log
Input = my_job_1.in
Output = my_job_1.out
Error = my_job_1.err
Queue ·Job 2.1 (cluster 2, process 1)


Submitting The Job

% condor_submit my_job.submit-file
Submitting job(s).
2 job(s) submitted to cluster 2.

% condor_q

-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     frieda   4/15 06:52  0+00:02:11 R  0   0.0  my_job -a1 -a2
 2.0     frieda   4/15 06:56  0+00:00:00 I  0   0.0  my_job -x 0
 2.1     frieda   4/15 06:56  0+00:00:00 I  0   0.0  my_job -x 1

3 jobs; 2 idle, 1 running, 0 held

%


Organize your files and directories for big runs

› Create subdirectories for each “run”
  run_0, run_1, … run_599
› Create input files in each of these
  run_0/simulation.in
  run_1/simulation.in
  …
  run_599/simulation.in
› The output, error & log files for each job will be created by Condor from your job’s output
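Nobody wants to create 600 directories by hand. A minimal Python sketch of the setup step, assuming the directory and file names from the slide; the `make_run_dirs` helper and the "seed = N" file contents are invented for the example, since the real input format depends on your simulation.

```python
import tempfile
from pathlib import Path


def make_run_dirs(base: Path, count: int, make_input) -> list:
    """Create base/run_0 ... base/run_<count-1>, each holding a simulation.in file."""
    dirs = []
    for i in range(count):
        run_dir = base / f"run_{i}"
        run_dir.mkdir(parents=True, exist_ok=True)
        (run_dir / "simulation.in").write_text(make_input(i))
        dirs.append(run_dir)
    return dirs


# A temporary directory keeps the example self-contained; a real run
# would use the directory the jobs are submitted from.
with tempfile.TemporaryDirectory() as tmp:
    runs = make_run_dirs(Path(tmp), 600, lambda i: f"seed = {i}\n")
    n_created = len(runs)
    last_input = (runs[-1] / "simulation.in").read_text()
print(n_created, runs[-1].name)
```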


Submit Description File for 600 Jobs

# Cluster of 600 jobs with different directories
Universe = vanilla
Executable = sim
Log = simulation.log
...
Arguments = -x 0
InitialDir = run_0 ·Log, input, output & error files -> run_0
Queue ·Job 3.0 (Cluster 3, Process 0)

Arguments = -x 1
InitialDir = run_1 ·Log, input, output & error files -> run_1
Queue ·Job 3.1 (Cluster 3, Process 1)

·Do this 598 more times…………


Submit File for a Big Cluster of Jobs

› We just submitted 1 cluster with 600 processes

› All the input/output files will be in different directories

› The submit file is pretty unwieldy (over 1200 lines)

› Isn’t there a better way?


Submit File for a Big Cluster of Jobs (the better way) #1

› We can queue all 600 in 1 “Queue” command
  Queue 600
› Condor provides $(Process) and $(Cluster)
  $(Process) will be expanded to the process number for each job in the cluster
    • 0, 1, … 599
  $(Cluster) will be expanded to the cluster number
    • Will be 4 for all jobs in this cluster


Submit File for a Big Cluster of Jobs (the better way) #2

› The initial directory for each job can be specified using $(Process)
  InitialDir = run_$(Process)
  Condor will expand these to “run_0”, “run_1”, … “run_599” directories
› Similarly, arguments can be variable
  Arguments = -x $(Process)
  Condor will expand these to “-x 0”, “-x 1”, … “-x 599”
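The expansion itself is plain text substitution, which a few lines of Python can imitate. This is only a sketch of the idea, not Condor's macro processor: `expand` and `expand_cluster` are made-up helpers, only $(Process) and $(Cluster) are handled, and macro names are matched without regard to case as an assumption of the sketch.

```python
import re

MACRO = re.compile(r"\$\((\w+)\)")


def expand(template, cluster, process):
    """Replace $(Cluster) and $(Process) in one setting value."""
    values = {"cluster": str(cluster), "process": str(process)}

    def substitute(m):
        # Macros this sketch does not know about are left untouched
        return values.get(m.group(1).lower(), m.group(0))

    return MACRO.sub(substitute, template)


def expand_cluster(settings, cluster, count):
    """Return one expanded settings dict per process, numbered from zero."""
    return [{name: expand(value, cluster, proc)
             for name, value in settings.items()}
            for proc in range(count)]


template = {"InitialDir": "run_$(Process)", "Arguments": "-x $(Process)"}
jobs = expand_cluster(template, cluster=4, count=600)
print(jobs[0], jobs[599])
```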


Better Submit File for 600 Jobs

# Example condor_submit input file that defines
# a cluster of 600 jobs with different directories
Universe = vanilla
Executable = my_job
Log = my_job.log
Input = my_job.in
Output = my_job.out
Error = my_job.err
Arguments = -x $(Process) ·-x 0, -x 1, … -x 599
InitialDir = run_$(Process) ·run_0 … run_599
Queue 600 ·Jobs 4.0 … 4.599


Now, we submit it…

$ condor_submit my_job.submit
Submitting job(s) ....................................
Logging submit event(s) ....................................
600 job(s) submitted to cluster 4.


And, Check the queue

$ condor_q

-- Submitter: x.cs.wisc.edu : <128.105.121.53:510> : x.cs.wisc.edu
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 4.0     frieda   4/20 12:08  0+00:00:05 R  0   9.8  my_job -arg1 -x 0
 4.1     frieda   4/20 12:08  0+00:00:03 I  0   9.8  my_job -arg1 -x 1
 4.2     frieda   4/20 12:08  0+00:00:01 I  0   9.8  my_job -arg1 -x 2
 4.3     frieda   4/20 12:08  0+00:00:00 I  0   9.8  my_job -arg1 -x 3
 ...
 4.598   frieda   4/20 12:08  0+00:00:00 I  0   9.8  my_job -arg1 -x 598
 4.599   frieda   4/20 12:08  0+00:00:00 I  0   9.8  my_job -arg1 -x 599

600 jobs; 599 idle, 1 running, 0 held


Removing jobs

› If you want to remove a job from the Condor queue, you use condor_rm
› You can only remove jobs that you own
› Privileged user can remove any jobs
  “root” on UNIX
  “administrator” on Windows


Removing jobs (continued)

› Remove an entire cluster:
  condor_rm 4 ·Removes the whole cluster
› Remove a specific job from a cluster:
  condor_rm 4.0 ·Removes a single job
› Or, remove all of your jobs with “-a”
  condor_rm -a ·Removes all jobs / clusters


Submit cluster of 10 jobs

› nano submit
  • universe = vanilla
  • executable = /bin/echo
  • Arguments = hello world $(PROCESS)
  • Should_transfer_files = always
  • When_to_transfer_output = on_exit
  • Output = out.$(PROCESS)
  • Log = log
  • Queue 10


And submit it…

› condor_submit submit

› (wait…)

› condor_q xx

› cat log

› cat out.yy


My new jobs run for 20 days…

› What happens when a job is forced off its CPU?
  Preempted by higher priority user or job
  Vacated because of user activity

› How can I add fault tolerance to my jobs?


Condor’s Standard Universe to the rescue!

› Support for transparent process checkpoint and restart
› Remote system calls (remote I/O)
  Your job can read / write files as if they were local


Remote System Calls in the Standard Universe

› I/O system calls are trapped and sent back to the submit machine
  Examples: open a file, write to a file

› No source code changes typically required

› Programming language independent


Process Checkpointing in the Standard Universe

› Condor’s process checkpointing provides a mechanism to automatically save the state of a job
› The process can then be restarted from right where it was checkpointed
  After preemption, crash, etc.


Checkpointing: Process Starts

checkpoint: the entire state of a program, saved in a file
  CPU registers, memory image, I/O

[Figure: process timeline along a “time” axis]


Checkpointing: Process Checkpointed

[Figure: process timeline along a “time” axis, with checkpoints 1, 2, 3]


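Checkpointing: Process Killed

[Figure: process timeline along a “time” axis, showing checkpoint 3 and the point where the process is killed (“Killed!”)]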


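Checkpointing: Process Resumed

[Figure: process timeline along a “time” axis; the process resumes from checkpoint 3, with segments labeled goodput, badput, goodput]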
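The goodput/badput picture can be reproduced with a toy program. Note the difference in technique: Condor's Standard Universe checkpoints the whole process transparently, while the Python sketch below saves its own state explicitly with pickle. The `run` function, the step counts, and the simulated kill are all invented for the illustration.

```python
import pickle
import tempfile
from pathlib import Path


def run(total_steps, checkpoint_every, ckpt: Path, kill_at=None):
    """Sum 0..total_steps-1, saving state periodically; optionally 'die' at a step.

    Returns (result_or_None, steps_executed_in_this_run).
    """
    if ckpt.exists():
        step, acc = pickle.loads(ckpt.read_bytes())  # resume from last checkpoint
    else:
        step, acc = 0, 0                             # fresh start
    executed = 0
    while step < total_steps:
        if kill_at is not None and step == kill_at:
            return None, executed   # simulated preemption: unsaved work is lost
        acc += step
        step += 1
        executed += 1
        if step % checkpoint_every == 0:
            ckpt.write_bytes(pickle.dumps((step, acc)))
    return acc, executed


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "job.ckpt"
    first, ran1 = run(100, 30, ckpt, kill_at=75)  # checkpoints at 30 and 60, killed at 75
    final, ran2 = run(100, 30, ckpt)              # picks up again at step 60
badput = ran1 + ran2 - 100                        # steps that had to be repeated
print(final, ran1, ran2, badput)
```

The work done between the last checkpoint and the kill is the badput: it has to be redone after the restart, while everything before the checkpoint is kept.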


When will Condor checkpoint your job?

› Periodically, if desired
  For fault tolerance

› When your job is preempted by a higher priority job

› When your job is vacated because the execution machine becomes busy

› When you explicitly run the condor_checkpoint, condor_vacate, condor_off or condor_restart command


Making the Standard Universe Work

› The job must be relinked with Condor’s standard universe support library

› To relink, place condor_compile in front of the command used to link the job:

% condor_compile gcc -o myjob myjob.c

- OR -

% condor_compile f77 -o myjob filea.f fileb.f

- OR -

% condor_compile make -f MyMakefile


Limitations of the Standard Universe

› Condor’s checkpointing is not at the kernel level. In the Standard Universe, the job may not:
  • fork()
  • Use kernel threads
  • Use some forms of IPC, such as pipes and shared memory

› Must have access to source code to relink

› Many typical scientific jobs are OK


Submitting a Standard Universe job

#include <stdio.h>

int main(int argc, char **argv) {
    int i;
    /* Empty busy loop: stands in for real computation */
    for (i = 0; i < 10000000; i++) {}
    return 0;
}


And submit…

› condor_compile gcc -o foo foo.c

› condor_submit


My jobs have dependencies…

Can Condor help solve my dependency problems?


Condor Universes: Scheduler and Local

› Scheduler Universe
  Plug in a meta-scheduler
  Developed for DAGMan (more later)
  Similar to Globus’s fork job manager
› Local
  Very similar to vanilla, but jobs run on the local host
  Has more control over jobs than scheduler universe


Frieda learns DAGMan

› Directed Acyclic Graph Manager

› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.

› (e.g., “Don’t run job “B” until job “A” has completed successfully.”)


What is a DAG?

› A DAG is the data structure used by DAGMan to represent these dependencies.

› Each job is a “node” in the DAG.

› Each node can have any number of “parent” or “children” nodes – as long as there are no loops!

[Figure: example DAG with nodes Job A, Job B, Job C, Job D]


Defining a DAG

› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

  # diamond.dag
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D

› Each node will run the Condor job specified by its accompanying Condor submit file

[Figure: diamond-shaped DAG with Job A at the top, Job B and Job C in the middle, and Job D at the bottom]
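To make the "parents before children" rule concrete, the Python sketch below reads the Job and Parent/Child lines of diamond.dag and groups the nodes into waves that could run at the same time. It is a toy, not DAGMan's parser: `parse_dag` is a made-up helper, it only understands the two line types shown above, and it treats the keywords as case-insensitive as an assumption. The ordering comes from the standard library's graphlib (Python 3.9+), which also raises an error if the file contains a loop.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def parse_dag(text):
    """Toy reader for 'Job' and 'Parent ... Child ...' lines.

    Returns (submit_files, parents_of).
    """
    submit_files, parents_of = {}, {}
    for raw in text.splitlines():
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue  # blank line or comment
        keyword = words[0].lower()
        if keyword == "job":
            submit_files[words[1]] = words[2]
            parents_of.setdefault(words[1], set())
        elif keyword == "parent":
            split = [w.lower() for w in words].index("child")
            parents, children = words[1:split], words[split + 1:]
            for child in children:
                parents_of.setdefault(child, set()).update(parents)
    return submit_files, parents_of


DIAMOND = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""
submit_files, parents_of = parse_dag(DIAMOND)

sorter = TopologicalSorter(parents_of)  # maps each node to its predecessors
sorter.prepare()                        # raises CycleError if there is a loop
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # nodes whose parents have all finished
    waves.append(ready)
    sorter.done(*ready)
print(waves)
```

For the diamond, A runs alone, then B and C become ready together, and D runs last.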


Submitting a DAG

› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:

  % condor_submit_dag diamond.dag

› The DAGMan daemon started by condor_submit_dag is run by the schedd
  The DAGMan daemon itself is “watched” by Condor, so you don’t have to


Running a DAG

› DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.

[Figure: DAGMan, the .dag file, the Condor job queue, and DAG nodes A, B, C, D]


Running a DAG (cont’d)

› DAGMan holds & submits jobs to the Condor queue at the appropriate times.

[Figure: DAGMan, the Condor job queue, and DAG nodes A, B, C, D]


Running a DAG (cont’d)

› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.

[Figure: DAGMan, the Condor job queue, the rescue file, and DAG nodes A, B, D, plus a node marked X]
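The "continue until no more progress, then save state" behaviour can be imitated in memory. The Python sketch below is not DAGMan and does not write a real rescue file: `run_dag` is a made-up function, the `succeeds` callback stands in for submitting a node's job and waiting for it, and the "rescue" state is simply the list of nodes that finished.

```python
def run_dag(parents_of, succeeds, already_done=()):
    """Run every node whose parents are done; stop when nothing more can run.

    Returns (done, failed, not_run).
    """
    done, failed = set(already_done), set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents_of):
            if node in done or node in failed:
                continue
            if parents_of[node] <= done:  # all parents finished successfully
                (done if succeeds(node) else failed).add(node)
                progress = True
    not_run = set(parents_of) - done - failed
    return done, failed, not_run


DIAMOND = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

# First attempt: node C fails. B still runs, but D never becomes ready.
done, failed, not_run = run_dag(DIAMOND, succeeds=lambda node: node != "C")
rescue = sorted(done)  # in-memory analogue of the rescue state

# Second attempt: C has been fixed; nodes recorded as done are not run again.
rerun = []


def second_try(node):
    rerun.append(node)
    return True


done2, failed2, not_run2 = run_dag(DIAMOND, second_try, already_done=rescue)
print(rescue, sorted(failed), sorted(not_run), rerun)
```

Only C and D run on the second attempt, which is the point of keeping the rescue state: finished work is not repeated.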


Recovering a DAG

› Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.

[Figure: DAGMan, the rescue file, the Condor job queue, and DAG nodes A, B, C, D]


Recovering a DAG (cont’d)

› Once that job completes, DAGMan will continue the DAG as if the failure never happened.

[Figure: DAGMan, the Condor job queue, and DAG nodes A, B, C, D]


Finishing a DAG

› Once the DAG is complete, the DAGMan job itself is finished, and exits.

[Figure: DAGMan, the Condor job queue, and DAG nodes A, B, C, D]


Additional DAGMan Features

› Provides other handy features for job management…
  nodes can have PRE & POST scripts
  failed nodes can be automatically retried a configurable number of times
  job submission can be “throttled”


General User Commands

› condor_status        View Pool Status
› condor_q             View Job Queue
› condor_submit        Submit new Jobs
› condor_rm            Remove Jobs
› condor_prio          Intra-User Prios
› condor_history       Completed Job Info
› condor_submit_dag    Submit new DAG
› condor_checkpoint    Force a checkpoint
› condor_compile       Link Condor library


Thank you!

Check us out on the Web: http://www.condorproject.org

Email: [email protected]