Grid Computing, B. Wilkinson, 2004 6d.1 Schedulers and Resource Brokers

Schedulers and Resource Brokers



Scheduler

• Job manager submits jobs to scheduler.

• Scheduler assigns work to resources to achieve specified time requirements.


Scheduling

[Figure not reproduced. From "Introduction to Grid Computing with Globus," IBM Redbooks]


Advance Reservation

• Requesting actions at times in the future. (“A service level agreement in which the conditions of the agreement start at some agreed-upon time in the future” [2])

[2] “The Grid 2, Blueprint for a New Computing Infrastructure,” I. Foster and C. Kesselman editors, Morgan Kaufmann, 2004.


Resource Broker

• “A scheduler that optimizes the performance of a particular resource. Performance may be measured by such criteria as fairness (to ensure that all requests for the resources are satisfied) or utilization (to measure the amount of the resource used).” [2]


Globus

• Fully-fledged scheduler/resource broker not in Globus.

• For example, Globus does not currently have advance reservation.

• A scheduler/resource broker needs to be provided separately on top of Globus, using basic services provided in Globus.


Resource Broker Examples

• Condor-G, Nimrod/G, Cactus


Condor

• System first developed at University of Wisconsin-Madison in the mid-1980s to convert a collection of distributed workstations and clusters into a high-throughput computing facility.

• Key concept: using the wasted computing power of idle workstations.


Condor

• Converts collections of distributed workstations and dedicated clusters into a distributed high-throughput computing facility.


Features

• Include:

– Resource finder

– Batch queue manager

– Scheduler

– Checkpoint/restart

– Process migration


Intended to run job even if:

• Machines crash

• Disk space exhausted

• Software not installed

• Machines are needed by others

• Machines are managed by others

• Machines are far away


Uses

• Consider the following scenario:

– I have a simulation that takes two hours to run on my high-end computer.

– I need to run it 1000 times with slightly different parameters each time.

– If I do this on one computer, it will take at least 2000 hours (or about 3 months).

From: “Condor: What it is and why you should worry about it,” by B. Beckles, University of Cambridge, Seminar, June 23, 2004


– Suppose my department has 100 PCs like mine that are mostly sitting idle overnight (say 8 hours a day)

– If I could use them when their legitimate users are not using them, so that I do not inconvenience them, I could get about 800 CPU hours/day.

– This is an ideal situation for Condor.

• I could do my simulations in 2.5 days.

From: “Condor: What it is and why you should worry about it,” by B. Beckles, University of Cambridge, Seminar, June 23, 2004
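The arithmetic behind this scenario can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the variable names are illustrative, and it assumes every idle PC is as fast as the original machine and that scheduling and file-transfer overheads are negligible.

```python
# Figures taken from the scenario above.
RUN_HOURS = 2              # one simulation run on one PC
NUM_RUNS = 1000            # runs with slightly different parameters
NUM_PCS = 100              # idle departmental PCs
IDLE_HOURS_PER_DAY = 8     # each PC is idle overnight

total_cpu_hours = RUN_HOURS * NUM_RUNS                  # work to be done
serial_days = total_cpu_hours / 24                      # one PC running non-stop
pool_cpu_hours_per_day = NUM_PCS * IDLE_HOURS_PER_DAY   # capacity of the idle pool
pool_days = total_cpu_hours / pool_cpu_hours_per_day    # time using the pool

print(f"Total work:       {total_cpu_hours} CPU hours")
print(f"One computer:     {serial_days:.1f} days")
print(f"Pool capacity:    {pool_cpu_hours_per_day} CPU hours/day")
print(f"Using idle PCs:   {pool_days} days")
```

Since each run takes two hours, four runs fit into one eight-hour idle window per PC, so the estimate does not depend on splitting a run across nights.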


How does Condor work?

• A collection of machines running Condor is called a pool.

• Individual pools can be joined together in a process called flocking.

From: “Condor: What it is and why you should worry about it,” by B. Beckles, University of Cambridge, Seminar, June 23, 2004


Machine Roles

• Machines have one or more of four roles:

– Central manager

– Submit machine (Submit host)

– Execution machine (Execute host)

– Checkpoint server


Central Manager

• Resource broker for a pool. Keeps track of which machines are available and what jobs are running, negotiates which machine will run which job, etc.

• Only one central manager per pool.


Submit Machine

• Machine which submits jobs to pool.

• Must be at least one submit machine in a pool, and usually more than one.


Execute Machine

• Machine on which jobs can be run.

• Must be at least one execute machine in a pool, and usually more than one.


Checkpoint Server

• Machine which stores all checkpoint files produced by jobs which checkpoint.

• Can only be one checkpoint machine in a pool.

• Optional to have a checkpoint machine.


Possible Configuration

• A central manager.

• Some machines that can only be submit hosts.

• Some machines that can only be execute hosts.

• Some machines that can be both submit and execute hosts.


Submitting a job

• Job submitted to submit host

• Submit host tells the central manager about the job using Condor’s “ClassAd” mechanism, which may include:

– What it requires

– What it desires

– What it prefers, and

– What it will accept


1. Central manager monitors execute hosts, so it knows what is available, what type of machine each execute host is, and what software it has.

2. Execute hosts periodically send a ClassAd describing themselves to the central manager.


3. At times, the central manager enters a negotiation cycle where it matches waiting jobs with available execute hosts.

4. Eventually the job is matched with a suitable execute host (hopefully).


5. Central manager informs chosen execute host that it has been claimed and gives it a ticket.

6. Central manager informs submit host which execute host to use and gives it a matching ticket.


7. Submit host contacts execute host, presenting its matching ticket, and transfers the job’s executable and data files to the execute host if necessary. (A shared file system is also possible.)

8. When the job finishes, results are returned to the submit host (unless a shared file system is in use between submit and execute hosts).
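The claim-and-ticket part of this sequence (steps 3 to 7) can be illustrated with a small Python sketch. It is not Condor code: the class names, the single memory requirement, and the first-fit matching policy are all assumptions made for illustration.

```python
import secrets
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExecuteHost:           # hypothetical stand-in for an execute host's ad
    name: str
    memory_mb: int
    claimed_ticket: Optional[str] = None

@dataclass
class Job:                   # hypothetical stand-in for a job's ad
    job_id: str
    min_memory_mb: int
    ticket: Optional[str] = None
    host: Optional[str] = None

def negotiation_cycle(waiting_jobs, host_ads):
    """Steps 3-6: match each waiting job to the first unclaimed, suitable host."""
    matches = []
    for job in waiting_jobs:
        for host in host_ads:
            if host.claimed_ticket is None and host.memory_mb >= job.min_memory_mb:
                ticket = secrets.token_hex(8)
                host.claimed_ticket = ticket              # step 5: host is claimed
                job.ticket, job.host = ticket, host.name  # step 6: submit side gets the match
                matches.append((job, host))
                break
    return matches

def activate_claim(job, host):
    """Step 7: the execute host accepts the job only if the tickets match."""
    return host.claimed_ticket is not None and host.claimed_ticket == job.ticket

hosts = [ExecuteHost("pc01", 256), ExecuteHost("pc02", 1024)]
jobs = [Job("26.0", 512), Job("26.1", 512)]
matched = negotiation_cycle(jobs, hosts)
for job, host in matched:
    print(job.job_id, "->", host.name, "accepted:", activate_claim(job, host))
```

Here only one host has enough memory, so the second job stays unmatched and would wait for a later negotiation cycle.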


Connections

• Communication between submit and execute hosts is usually done over a TCP connection.

• If the connection dies, the job is resubmitted to the Condor pool.

• Some jobs might access files and resources on submit host via remote procedure calls.


Checkpointing

• Certain jobs can checkpoint, both periodically for safety and when interrupted.

• If a checkpointed job is interrupted, it will resume at the last checkpointed state when it starts again.

• Generally no change to source code is needed, but the job must be linked with Condor’s Standard Universe support library (see later).
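The idea of resuming from the last checkpointed state can be shown with a toy Python example. Note the difference from Condor: the sketch below saves its state explicitly at application level, whereas the slides describe checkpointing that needs no source changes. The file format and function name are assumptions for illustration only.

```python
import json
import os
import tempfile

def run_with_checkpoints(total_steps, ckpt_path, interrupt_at=None):
    """Sum 0..total_steps-1, saving state after every step.

    If a checkpoint file exists, resume from it. `interrupt_at` simulates
    the job being interrupted just before that step; None is returned then.
    """
    state = {"next_step": 0, "acc": 0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)          # restart from the last checkpoint
    for step in range(state["next_step"], total_steps):
        if interrupt_at is not None and step == interrupt_at:
            return None                   # simulated interruption
        state["acc"] += step
        state["next_step"] = step + 1
        with open(ckpt_path, "w") as f:
            json.dump(state, f)           # checkpoint after each step
    return state["acc"]

ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first = run_with_checkpoints(10, ckpt, interrupt_at=6)   # interrupted part-way
second = run_with_checkpoints(10, ckpt)                  # resumes at step 6
print(first, second)
```

The second call does not redo steps 0 to 5; it picks up the saved running total and finishes the remaining steps.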


Types of Jobs

• Jobs are classified according to the environment (universe) Condor provides for them. Currently seven environments:

– Standard

– Vanilla

– PVM

– MPI

– Globus

– Java

– Scheduler


Standard

• For jobs compiled with Condor libraries

• Allows for checkpointing and remote system calls.

• Must be single threaded.

• Not available under Windows.


Vanilla

• For jobs that cannot be compiled with Condor libraries, and for shell scripts and Windows batch files.

• No checkpointing or remote system calls.


Job Universes continued

PVM

For PVM programs.

MPI

For MPI programs (MPICH).

Globus

For submitting jobs to resources managed by Globus (version 2.2 and higher).


Java

For Java programs (written for the Java Virtual Machine).

Scheduler

A universe not normally used by end users. Ignores any requirements and runs the job on the submit host. Never preempted.


Directed Acyclic Graph Manager (DAGMan)

• Allows one to specify dependencies between Condor jobs.

Example

“Do not run Job B until Job A has completed successfully”

Especially important to jobs working together (as in Grid computing).


Directed Acyclic Graph (DAG)

• A data structure used to represent dependencies.

• Each job is a node in the DAG.

• Each node can have any number of parents and children as long as there are no loops (acyclic graph).


Defining a DAG

• DAG defined by a .dag file, listing each of the nodes and their dependencies

Example

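# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D

[Figure: diamond-shaped graph in which Job A is the parent of Job B and Job C, and Job B and Job C are both parents of Job D]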
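To make the diamond example concrete, the following Python sketch reads the same file contents and derives one order in which the jobs could be submitted. It is a hypothetical mini-parser that understands only the Job and Parent/Child lines shown in the example, not DAGMan itself, and the function names are invented for illustration.

```python
from collections import defaultdict, deque

DIAMOND_DAG = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

def parse_dag(text):
    """Return ({job name: submit file}, {parent: set of children})."""
    jobs = {}
    children = defaultdict(set)
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue                      # skip blank lines and comments
        keyword = words[0].lower()
        if keyword == "job":
            jobs[words[1]] = words[2]
        elif keyword == "parent":
            split = [w.lower() for w in words].index("child")
            for parent in words[1:split]:
                children[parent].update(words[split + 1:])
    return jobs, children

def submission_order(jobs, children):
    """Topological sort (Kahn's algorithm); raises ValueError on a loop."""
    pending_parents = {name: 0 for name in jobs}
    for kids in children.values():
        for kid in kids:
            pending_parents[kid] += 1
    ready = deque(sorted(n for n, c in pending_parents.items() if c == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for kid in sorted(children.get(node, ())):
            pending_parents[kid] -= 1
            if pending_parents[kid] == 0:
                ready.append(kid)         # all parents finished
    if len(order) != len(jobs):
        raise ValueError("dependency loop: not a DAG")
    return order

jobs, children = parse_dag(DIAMOND_DAG)
order = submission_order(jobs, children)
print(order)
```

B and C have no dependency on each other, so in practice they could run at the same time; the list is just one valid ordering.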


Running a DAG

• DAGMan acts as a scheduler managing the submission of jobs to Condor based upon DAG dependencies.

• DAGMan holds and submits jobs to Condor queue at appropriate times.


Job Failures

• DAGMan continues until it cannot make progress and then creates a rescue file holding current state of DAG.

• When the failed job is ready to be re-run, the rescue file is used to restore the prior state of the DAG.
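The failure behaviour can be sketched in Python using the diamond DAG from earlier. This is an illustration under stated assumptions, not DAGMan's implementation: the "rescue file" is modelled simply as the set of nodes that completed, and `run_dag` and `flaky` are hypothetical names.

```python
def run_dag(parents, run_job, done=None):
    """Run every node whose parents are all done; stop when stuck.

    `parents` maps node -> set of parent nodes. `done` is the restored
    state from an earlier attempt. Returns (done, failed).
    """
    done = set(done or ())
    failed = set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if parents[node] <= done:      # all parents completed successfully
                progress = True
                if run_job(node):
                    done.add(node)
                else:
                    failed.add(node)
    return done, failed

DIAMOND = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

attempts = []
def flaky(node):
    """Pretend to run a job; job C fails on its first attempt only."""
    attempts.append(node)
    return not (node == "C" and attempts.count("C") == 1)

rescue, failed = run_dag(DIAMOND, flaky)                    # stops: D is blocked by C
final, failed_again = run_dag(DIAMOND, flaky, done=rescue)  # restart from saved state
print(sorted(rescue), sorted(failed), sorted(final), attempts)
```

On the second attempt A and B are not run again, because the restored state already records them as complete.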


ClassAd Matchmaking

Used to ensure a job is run according to the constraints of users and machine owners.

Example of user constraints

“I need a Pentium IV with at least 512 Mbytes of RAM and a speed of at least 3.5 GHz”

Example of machine owner constraints

“Never run jobs owned by Fred”
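The two example constraints can be expressed as a two-sided match in Python. This is only a sketch of the idea: real ClassAds are written in Condor's own expression language, and the dictionary keys used here (`cpu`, `memory_mb`, `clock_ghz`, `owner`) are invented for illustration.

```python
# Job ad: carries the user's constraint on the machine.
job_ad = {
    "owner": "alice",
    "requirements": lambda m: (m["cpu"] == "Pentium IV"
                               and m["memory_mb"] >= 512
                               and m["clock_ghz"] >= 3.5),
}
fred_job_ad = dict(job_ad, owner="fred")

# Machine ad: carries the owner's constraint on the job.
machine_ad = {
    "cpu": "Pentium IV",
    "memory_mb": 1024,
    "clock_ghz": 3.6,
    "requirements": lambda j: j["owner"] != "fred",
}
small_machine_ad = dict(machine_ad, memory_mb=256)

def ads_match(job, machine):
    """A match needs BOTH sides' requirements to be satisfied."""
    return job["requirements"](machine) and machine["requirements"](job)

print(ads_match(job_ad, machine_ad))         # both sides satisfied
print(ads_match(fred_job_ad, machine_ad))    # owner refuses Fred's jobs
print(ads_match(job_ad, small_machine_ad))   # too little RAM for the user
```

The point of the design is symmetry: neither the user nor the machine owner can be overruled, because a job is placed only where both constraints hold.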


Condor Submit Description File

Describes the job to Condor. Used with the condor_submit command.

Description File Example

# This is a comment, condor submit file
Universe = vanilla
Executable = /home/abw/condor/myProg
Input = myProg.stdin
Output = myProg.stdout
Error = myProg.stderr
Arguments = -arg1 -arg2
InitialDir = /home/abw/condor/assignment4
Queue


Submitting Multiple Jobs

• Submit file can specify multiple jobs

– Queue 500 will submit 500 jobs at once

• Condor calls a group of jobs a cluster

• Each job within a cluster is called a process

• Condor job ID is the cluster number, a period, and the process number, for example 26.2

• A single job is also a cluster, but with a single process (process 0)
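The job ID scheme is easy to reproduce in Python. The helper name `job_ids` is hypothetical, and the cluster number 26 is just the one used in the example above; process numbers start at 0.

```python
def job_ids(cluster, count):
    """IDs for `count` jobs submitted together as one cluster."""
    return [f"{cluster}.{process}" for process in range(count)]

ids = job_ids(26, 500)        # what "Queue 500" would produce for cluster 26
single = job_ids(27, 1)       # a single job is a cluster with only process 0

print(ids[0], ids[2], ids[-1], len(ids))
print(single)
```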


Specifying Requirements

• A C/Java-like Boolean expression that evaluates to TRUE for a match.

# This is a comment, condor submit file
Universe = vanilla
Executable = /home/abw/condor/myProg
InitialDir = /home/abw/condor/assignment4
Requirements = Memory >= 512 && Disk > 10000
Queue 500


Summary of Key Condor Features

• High-throughput computing using an opportunistic environment.

• Matchmaking

• Checkpointing

• DAG scheduling


Condor-G

• Grid-enabled version of Condor.

• Uses Globus Toolkit for:

– Security (GSI)

– Managing remote jobs on grid (GRAM)

– File handling and remote I/O (GSI-FTP)


Remote execution by Condor-G on Globus-managed resources

From: “Condor-G: A Computation Management Agent for Multi-Institutional Grids” by J. Frey, T. Tannenbaum, M. Livny, I. Foster and S. Tuecke. Figure probably refers to Globus version 2.


More!

• Assignment 4 will ask you to submit a job through Condor-G.

• Check out assignment write-up.


More Information

• www.cs.wisc.edu/condor