Condor: Overview and User Guide to the Condor Biostatistics Environment



Page 1: Condor: Overview and User Guide to the Condor Biostatistics Environment

Condor: Overview and User Guide to the

Condor Biostatistics Environment

Page 2

2

Authorship

• Authors
– Patrícia Kayser Vargas
– September 2002
– Talk at Biostat, Wisconsin, USA

• Revisions
– V1
• C. Geyer
• PDP/2005-2, PPGC, UFRGS
• December 2005

Page 3

3

Topics

• Introduction
– What is Condor?
– Why and when use Condor?
– What are Condor Universes?

• Running Jobs on Condor
– C programs
• YAP
– Java Programs

• Final Remarks

Page 4

4

Introduction

Page 5

5

What is Condor?

• Condor
– is a distributed batch scheduling system

• “The goal of Condor is to provide the highest feasible throughput by executing the most jobs over extended periods of time.” [1]

• What is a job?
– Several possibilities

Page 6

6

What is Condor?

• Condor
– is composed of a collection of different daemons that provide various services, such as
• a job queueing mechanism,
• scheduling policies,
• a priority scheme,
• monitoring,
• resource management,
• job management,
• matchmaking...

Page 7

7

What is Condor? Architecture

[1]

Page 8

What is Condor? Architecture

• Machine types
– Central Manager
• Manager of a Condor network (grid)
• One per “pool”
• Central point of failure

– Submit Machines
• Users’ machines
• The user submits, monitors, and controls the execution of a job

– Execution Machine (worker)
• Runs jobs

– A single machine can play several roles

Page 9

What is Condor? Architecture

• Machine types (cont.)
– Checkpoint Server
• Optional
• Stores checkpoint files

Page 10

10

What is Condor? Architecture

• Condor has four daemons
• On the Central Manager and on Submit Machines:
– startd:
• monitors the conditions of the resource where it runs,
• publishes resource-offer ClassAds, and
• is responsible for enforcing the resource owner’s policy for starting, suspending, and evicting jobs.
– schedd:
• maintains a persistent job queue,
• publishes resource-request ClassAds, and
• negotiates for available resources

Page 11

11

What is Condor? Architecture

• Only on the Central Manager:
– collector:
• is the central repository of information
• startd and schedd send periodic updates to the collector
– negotiator:
• periodically performs a negotiation cycle
– the process of matchmaking
– the negotiator tries to find matches between the various ClassAds of resource offers and requests, and
– once a match is made, both parties are notified and are responsible for acting on that match

Page 12

12

What is Condor? Architecture

[1]

Page 13

13

What is Condor? Architecture

[1]

Submitter Executing

Page 14

14

What is Condor? Architecture

• Resource and job ClassAds are published and sent to the collector
– startd sends resource ClassAds
– schedd sends job ClassAds

• The collector sends everything to the negotiator, which performs the matchmaking

Page 15

15

What is Condor? Architecture

• Matchmaking algorithm
– the negotiator can discover resources on which a job can run
– it tells the schedd daemon of the submitting machine whom it should contact to export the job
– it tells the startd daemon of the machine chosen to execute (an idle resource that meets the requirements) that it will receive a task

Page 16

16

What is Condor? Architecture

• At this point the central manager no longer acts; the two machines carry out the job
– the submit machine creates a shadow process
• to send the task and receive the results
– the executing machine
• creates a starter process that receives the task, and
• a “user job” that in turn executes the task,
• and at the end the results are sent back to the submit machine

Page 17

17

Why and when use Condor?

• Condor is useful when
– there are several jobs to be submitted
– there is one executable and several different input data

Page 18

18

Why and when use Condor?

• Condor is useful because it
– can use different available machines
• opportunistic scheduling
– controls file transfers
• the job must be able to access the data files from any machine on which it can potentially run
– sends email notifying when a job has completed
• except if jobs are submitted from a Linux machine

Page 19

19

What are Condor Universes?

• Types of universes
– standard
– vanilla
– java
– parallel

• The Universe attribute is specified in the submit description file
– the default is standard
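The universe is selected with a single line in the submit description file; a minimal sketch for a vanilla job (the script name is hypothetical, not from the slides):

```
# vanilla universe: no relinking needed, suitable for shell scripts
universe   = vanilla
executable = myscript.sh
queue
```

Without a universe line, Condor assumes the standard universe by default.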

Page 20

20

What are Condor Universes?

• standard
– provides
• checkpointing and
• remote system calls
– makes the job more reliable and gives uniform access to resources from anywhere in the pool
– to prepare a program as a standard universe job, it must be relinked with condor_compile

Page 21

21

What are Condor Universes?

• standard
– there are a few restrictions
– complete list in the manual: http://www.cs.wisc.edu/condor/manual/v6.4/2_4Road_map_running.html
– examples
• no multi-process jobs (no fork(), exec(), and system())
• no inter-process communication (including pipes, semaphores, and shared memory)
• no sending or receiving SIGUSR2 or SIGTSTP
• all files must be opened read-only or write-only

Page 22

22

What are Condor Universes?

• vanilla
– used for programs which cannot be successfully re-linked
– useful for shell scripts
– cannot checkpoint or use remote system calls
– sometimes a job must restart from the beginning on another machine in the pool
• no checkpointing

Page 23

23

What are Condor Universes?

• java
– can execute on any machine in the pool that will run the Java Virtual Machine
– at the moment it does not work at Biostat
• the Wisconsin department
– compiled Java programs can be submitted
– creating a jar file for programs with several classes is recommended

Page 24

24

What are Condor Universes?

• parallel
– MPI and PVM
• used for parallel programs using message passing
– Globus
• must have Condor-G installed
– I did not check if they work at Biostat

Page 25

25

Running Jobs on Condor

Page 26

26

Running Jobs on Condor

• You can submit your jobs from any Biostat machine, since all run schedd and startd

• You must
– set the PATH environment variable
– prepare a submission file
– compile your job with condor_compile if using the standard universe
– submit your job(s) with the condor_submit command

Page 27

27

Running Jobs on Condor

• Submission file
– the submit description file is the file that says
• which executable to run
• the directory where the output files will be placed
• how many jobs will be instantiated, etc.

Page 28

28

Running Jobs on Condor

• Submission file
– this file is turned into one ClassAd for each job that needs to be instantiated
• e.g., if the file contains the command 'queue 50', 50 jobs of that program will have to be executed
• therefore 50 ClassAds will be published on the central manager

Page 29

29

Running Jobs on Condor: Setting the PATH environment variable

• Change PATH to find the Condor commands (according to your shell)

bash:
source /s/pkg/condor/condor.sh
PATH=$PATH:/s/pkg/`/s/share/ostoken`/condor/bin; export PATH

csh:
source /s/pkg/condor/condor.csh
set path = ( $path /s/pkg/`/s/share/ostoken`/condor/bin )
rehash

Page 30

30

Running Jobs on Condor: Preparing a submission file

• ClassAds (Classified Advertisements)
– pairs of attributes and values
– syntax similar to C/Java

• The commands are case insensitive, i.e., these are equivalent:
executable = fact
Executable = fact

Page 31

31

Running Jobs on Condor: Preparing a submission file

• At a minimum, the file must have the “executable” attribute: your program/binary

Executable = fact

• Another useful attribute: input file – your data

input = test.data
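Combining the attributes above, a complete minimal submit description file might look like this (the output/error/log file names are illustrative additions, not from the slides):

```
# minimal submit file for the fact example (standard universe is the default)
executable = fact
input      = test.data
output     = fact.out
error      = fact.err
log        = fact.log
queue
```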

Page 32

32

Running Jobs on Condor: Compiling your job with condor_compile

• If using the standard universe:
– use condor_compile
• it is necessary to relink the program with the Condor library

condor_compile gcc fact.c -o fact

Page 33

33

Running Jobs on Condor: Submitting your job(s) with condor_submit

• In any Condor Universe
– jobs are submitted using the condor_submit command with the submission file as parameter

condor_submit condor1.sub

– the -v option shows information about the submission (the full generated ClassAd)
• it only prints a listing and exits (non-interactive)

condor_submit -v condor1.sub

Page 34

Example of C Program

Page 35

35

Running Jobs on Condor: C programs

• options:
– gcc (the GNU C compiler)
– cc (the system C compiler)
– acc (ANSI C compiler, on Sun systems)
– CC (the system C++ compiler)
– … (http://www.cs.wisc.edu/condor/manual/v6.4/condor_compile.html)

bash-2.03$ condor_compile gcc fact.c -o fact

Page 36

36

Running Jobs on Condor: C programs – example submission file

####################
# C Example: demonstrate use of multiple directories
# "Arguments = 5" to pass integer 5 as parameter
####################
Executable = fact
Universe   = standard
output     = loop.out
error      = loop.error
Log        = loop.log
Arguments  = 5

Initialdir = run_1
Queue
Initialdir = run_2
Queue

Page 37

37

Running Jobs on Condor: C programs

• Log
– contains information that is important for evaluating the execution/performance of the application
– for an ordinary user it may not be so relevant
– describes every event that happens to the job, with date/time/machine information
• when it: was submitted, started executing, was suspended, was migrated, finished (with error or with success)

Page 38

38

Running Jobs on Condor: C programs

• Arguments
– parameters for the executable
– in the example:
• arguments = 5
• would be equivalent to running 'fact 5' in the terminal

• Initialdir
– where the output/error/log files will be stored
– initialdir = run_1
• the “run_1” directory

Page 39

39

Running Jobs on Condor: C programs

• Queue
– runs a single job instance, using run_1 as initialdir
– the directory must be created before running condor_submit, otherwise an error occurs

• “Initialdir = run_2” and “Queue”
– one more instance of the job, now in another directory
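Since the Initialdir directories must exist before submission, they can be created up front; a minimal shell sketch (directory names taken from the slides' example):

```shell
#!/bin/sh
# create the run directories expected by "Initialdir = run_1" / "run_2";
# mkdir -p does not fail if they already exist
mkdir -p run_1 run_2
```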

Page 40

40

Running Jobs on Condor: C programs

another example submission file:

####################
# C Example:
# each job runs with a different argument and
# stores results in different files
####################
Executable  = fact
notify_user = [email protected]

Input  = in.$(Process)
Output = out.$(Process)
Error  = err.$(Process)
Log    = fact.log

Queue 2

Page 41

41

Running Jobs on Condor: C programs

• notify_user = [email protected]
– tells Condor to send a message announcing the end of the job

• Input = in.$(Process)
– $(Process): Condor's Process variable
• which is instantiated with a sequential integer for each job created
• thus it will create in.0, in.1, in.2, and so on
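Each job's input file must exist before submitting; a shell sketch that prepares the numbered files matching in.$(Process) (the count matches "Queue 2" and the value 5 is just an illustrative input for the fact example):

```shell
#!/bin/sh
# prepare one input file per queued job: "Queue 2" expects in.0 and in.1
i=0
while [ "$i" -lt 2 ]; do
    echo "5" > "in.$i"    # illustrative input value
    i=$((i + 1))
done
```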

Page 42

42

Running Jobs on Condor: C programs

• Log = fact.log
– a single log file even though there are several jobs
– events are annotated with the job number

• Queue 2
– creates two jobs
– any integer can be used
– Queue 100
• creates 100 tasks

Page 43

43

Running Jobs on Condor: C programs – YAP

• To configure YAP with Condor:

configure --enable-depth-limit --enable-condor

make

Page 44

44

Running Jobs on Condor: C programs – YAP

• condor.sub
Universe = standard
Executable = /u/dutra/Yap-4.3.20/condor/yap.$$(Arch).$$(OpSys)
Initialdir = /u/dutra/App/f1/train_best
Log = /u/dutra/App/f1/train_best/log
Requirements = ((Arch == "INTEL" && OpSys == "LINUX") && (Mips >= 500) || (IsDedicated && UidDomain == "cs.wisc.edu"))

Arguments = -b /u/dutra/Yap-4.3.20/condor/../pl/boot.yap
Input = condor.in.$(Process)
Output = /dev/null
Error = /dev/null

Queue 300

Page 45

45

Running Jobs on Condor: C programs – YAP

• condor.in.0
['~/Yap-4.3.20/condor/../pl/init.yap'].
module(user).
['~/Aleph/aleph.pl'].
read_all('~/App/f1/train_best/train').
set(i,5).
set(minacc,0.7).
set(clauselength,5).
set(recordfile,'~/App/f1/train_best/trace-0.7-5.0').
set(test_pos,'~/App/f1/train_best/test.f').
set(test_neg,'~/App/f1/train_best/test.n').
set(evalfn,coverage).
induce.
write_rules('~/App/f1/train_best/theory-0.7-5.0').
halt.

Page 46

Example of Java Program

Page 47

47

Running Jobs on Condor: Java programs

• Using the Java Universe
• Does not need to be compiled with Condor
• Use a jar file for programs with several classes:
http://java.sun.com/docs/books/tutorial/jar/

• If using the Computer Science environment, you must grant access to the files to be used on AFS
http://www.cs.wisc.edu/condor/uwcs/

Page 48

48

Running Jobs on Condor: Java programs

####################
# Example in Java Universe
# executable must have the .class file and
# arguments must have the main class as first argument
####################
universe    = java
executable  = Fact.class
arguments   = Fact
notify_user = [email protected]
output      = loop.out
error       = loop.error
log         = loop.log
Queue

Page 49

49

Running Jobs on Condor: Java programs

####################
# Example in Java Universe using a jar file
####################
universe       = java
executable     = jgfSection2.jar
arguments      = JGFAllSizeA 4
jar_files      = jgfSection2.jar
transfer_files = ALWAYS
output         = logAllSection2f.out
error          = logAllSection2f.error
log            = logAllSection2f.log
Queue

Page 50

50

Running Jobs on Condor: Java programs

• executable = jgfSection2.jar
– it is a jar
– not a .class as in the previous example

• arguments = JGFAllSizeA 4
– two arguments
– example generated from the Java Grande benchmark suite

• jar_files = jgfSection2.jar
– looks redundant
– but without this attribute the file is not transferred

Page 51

51

Running Jobs on Condor: Java programs

• transfer_files = ALWAYS
– likewise: needed to transfer the .jar
– perhaps a bug that has since been fixed

Page 52

52

Running Jobs on Condor: Inspecting Condor Jobs

• Some useful commands:
– condor_q
• shows the queue of locally submitted jobs

– condor_q -analyze
• more information
• lets you understand whether a job is not executing because of a problem in its requirements or because no resource is available

– condor_q -submitter <user>

Page 53

53

Running Jobs on Condor: Inspecting Condor Jobs

• condor_q -run
– shows only the jobs that are currently executing

• condor_q -submitter <user>
– filters to show information only for the jobs submitted by “user”

Page 54

54

Running Jobs on Condor: Inspecting Condor Jobs

• condor_status
– shows each of the machines in the Condor pool
– with information that is
• static (e.g., which OS)
• dynamic (e.g., whether it is idle or busy)

Page 55

55

Running Jobs on Condor: Inspecting Condor Jobs

• condor_rm
– to remove a job or a set of jobs from the queue
– similar to kill
– requires the job number

• condor_q -global
– shows information on all the queues
– on every machine from which jobs were submitted

Page 56

56

Final Remarks

Page 57

57

Final Remarks

• So, Condor...
– controls the execution of several jobs
– can really improve your runtime
• Yap+Aleph: over three months: 53,000 CPU hours (peak of 400 machines)

• But, Condor...
– does not automatically parallelize your job

Page 58

58

Final Remarks

• Running Jobs on Condor – Observations:
– the input data file and the directory used for output/log/error must be created beforehand,
• otherwise an error will be reported and no job will be executed
– for each execution,
• the outputs are appended to the log files
• the results overwrite the out files
– error, log, and out files must have different names
• to avoid race conditions

Page 59

59

Final Remarks

• Work on data management
– though I do not know to what extent it is integrated with Condor
– Stork (Data Placement Scheduler): http://www.cs.wisc.edu/condor/stork
– Kangaroo (apparently abandoned): http://www.cs.wisc.edu/condor/kangaroo
– NeST: Network Storage: http://www.cs.wisc.edu/condor/nest/

Page 60

60

Final Remarks

• Work on monitoring
– Hawkeye System Monitoring Tool: http://www.cs.wisc.edu/condor/hawkeye/

Page 61

61

Final Remarks

• More information about Condor:
http://www.cs.wisc.edu/condor/

• Tutorials
– http://www.cs.wisc.edu/condor/CondorWeek2006/
– http://www.cs.wisc.edu/condor/CondorWeek2005/presentations.html

• More information about running Condor:
http://www.cs.wisc.edu/condor/manual/v6.4/

Page 62

62

Final Remarks

• References:
– [1] WRIGHT, Derek. Cheap cycles from the desktop to the dedicated cluster: combining opportunistic and dedicated scheduling with Condor. In: Conference on Linux Clusters: The HPC Revolution, June 2001, Champaign-Urbana, IL, USA. http://www.cs.wisc.edu/condor/doc/cheap-cycles.pdf

Page 63

Page 64

NMR-Star file to ClassAd

Patrícia Kayser Vargas Mangan

[email protected]

September, 2002

Page 65

65

NMR-Star to ClassAd

• BioMagResBank (http://www.bmrb.wisc.edu)
– an international repository for biological NMR (nuclear magnetic resonance) data
– uses the NMR Self-defining Text Archival and Retrieval (NMR-STAR) format to store its data

• NMR-STAR is characterized by a set of information organized as a hierarchical tree
– stored as a plain text file
– some files may have inconsistencies that are verified manually

Page 66

66

NMR-Star to ClassAd

• ClassAds
– a simple representation language first used in the Condor context

• Steps:
– conversion of NMR-STAR data to ClassAd format using starlibj (a Java package)
– use it to detect inconsistencies in NMR-STAR files

Page 67

67

NMR-Star to ClassAd

• Future work:
– Matchmaking as a consistency checker
– try to “learn” similarities among NMR data

• Working with R. Kent Wenger from the Condor team at UW-Madison

Page 68

68

Page 69

TALK 1: Condor: Managing Resources in the Biostatistics Department Environment

TALK 2: Using ClassAds to Represent NMR Data

Page 70

70

What is Condor? Architecture

• After the schedd receives a match for a given job, it enters into a claiming protocol directly with the startd

• Through this protocol, the schedd presents the job ClassAd to the startd and requests temporary control over the resource