Installing and Managing a Large Condor Pool

Derek Wright
Computer Sciences Department
University of Wisconsin-Madison
wright@cs.wisc.edu
www.cs.wisc.edu/condor

Talk Outline

• What is Condor and why is it good for large clusters?
• The Condor Daemons (the sys admin view)
• A look at the UW-Madison Computer Science Condor Pool and Cluster
• Some other features of Condor that help for big pools
• Future work

What is Condor?

A system of daemons and tools that harness desktop machines and commodity computing resources for High Throughput Computing
• Large numbers of jobs over long periods of time
• Not High Performance Computing, which is short bursts of lots of compute power

What is Condor? (Cont’d)

Condor matches jobs with available machines using “ClassAds”
• “Available machines” can be:
– Idle desktop workstations
– Dedicated clusters
– SMP machines

Can also provide checkpointing and process migration (if you re-link your application against our library)
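
The matchmaking idea on this slide can be sketched in a few lines of Python. This is only an illustration of two-sided matching, not the real ClassAd language: the attribute names, values, and the use of Python functions as "requirements" are all assumptions made for the example.

```python
# Toy matchmaker: NOT the real ClassAd language, only the two-sided idea.
# All attribute names and values below are illustrative assumptions.

def matches(job, machine):
    """A match requires each side's requirements to accept the other ad."""
    return job["requirements"](machine) and machine["requirements"](job)

machines = [
    # A desktop whose owner only accepts small jobs.
    {"name": "desktop1", "arch": "INTEL", "memory_mb": 256,
     "requirements": lambda job: job["image_mb"] <= 128},
    # A dedicated cluster node that accepts any job.
    {"name": "cluster7", "arch": "INTEL", "memory_mb": 1024,
     "requirements": lambda job: True},
]

job = {"owner": "alice", "image_mb": 512,
       "requirements": lambda m: m["arch"] == "INTEL" and m["memory_mb"] >= 512}

matched = [m["name"] for m in machines if matches(job, m)]
print(matched)
```

The point of the sketch is that both the job and the machine get a say: the desktop is rejected by the job (too little memory) and would also have rejected the job (image too large).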

What’s Condor Good For?

Managing a large number of jobs
• You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete

• Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc.

• Condor can handle inter-job dependencies (DAGMan)
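
To make "inter-job dependencies" concrete, here is a small Python sketch of ordering jobs so that each one runs only after the jobs it depends on. It is not DAGMan's input format or its implementation; the job names are hypothetical and the ordering uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each key lists the jobs it must wait for.
dependencies = {
    "prepare": set(),
    "simulate_a": {"prepare"},
    "simulate_b": {"prepare"},
    "collect": {"simulate_a", "simulate_b"},
}

def run_order(deps):
    """Return one valid order in which the jobs could be run."""
    return list(TopologicalSorter(deps).static_order())

order = run_order(dependencies)
print(order)
```

Here "prepare" always comes first and "collect" always comes last; the two simulation jobs have no dependency on each other, so they could run at the same time.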

What’s Condor Good For? (cont’d)

Managing a large number of machines
• Condor daemons run on all the machines in your pool and are constantly monitoring machine state

• You can query Condor for information about your machines

• Condor handles all background jobs in your pool with minimal impact on your machine owners

Why is Condor Good for Large Clusters?

Fault-Tolerance at all levels of Condor
• Even “dedicated” resources should be treated like they might disappear at any minute (Condor has been doing this since 1985… we’ve got a lot of experience)

• Checkpointing jobs (when possible) makes scheduling a lot easier, and ensures forward progress

Eases monitoring

Condor on Large Clusters (cont’d)

Manages ALL your resources and jobs under one system
• Easier for users and administrators

Easy to install and use
• No queues to configure or choose from

It’s developed by former system administrators (all the full-time staff)

It’s free (that scales really well)

What is a Condor Pool?

“Pool” can be a single machine or a group of machines

Determined by a “central manager” - the matchmaker and centralized information repository

Each machine runs various daemons to provide different services, either to the users who submit jobs, the machine owners, or the pool itself

Talk Outline

• What is Condor and why is it good for large clusters?
• The Condor Daemons (the sys admin view)
• A look at the UW-Madison Computer Science Condor Pool and Cluster
• Some other features of Condor that help for big pools
• Future work

The Condor Daemons

condor_master      Administrator Agent
condor_collector   Centralized Repository of ClassAds
condor_negotiator  Performs Matchmaking
condor_startd      Resource Agent (Machine)
condor_schedd      User Agent (Jobs)
condor_starter     Monitors/Manages a Job Process
condor_shadow      Handles Remote System Calls, Intra-Job Resource Management
condor_dagman      Manages Inter-Job Dependencies
condor_eventd      Pool-Wide Events

Layout of a Personal Condor Pool

[Diagram: a single Central Manager machine running the master, collector, negotiator, schedd, and startd. Legend: ClassAd communication pathway; process spawned.]

Layout of a General Condor Pool

[Diagram: a Central Manager (master, collector, negotiator, schedd, startd), one Submit-Only machine (master, schedd), two Execute-Only machines (master, startd), and two Regular Nodes (master, schedd, startd). Legend: ClassAd communication pathway; process spawned.]

condor_master daemon

Starts up all other Condor daemons

If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator

Checks the time stamps on the binaries it is configured to spawn, and if new binaries appear, the master will gracefully shut down the currently running version and start the new version
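
The time-stamp check described above can be illustrated with a short Python sketch. It is not the condor_master's actual code; the function name and the `last_seen` bookkeeping are assumptions made for the example.

```python
import os

def changed_binaries(paths, last_seen):
    """Return the paths whose modification time is newer than recorded.

    `last_seen` maps path -> mtime recorded when the daemon was started.
    A path with no recorded time is treated as unchanged.
    """
    changed = []
    for path in paths:
        mtime = os.stat(path).st_mtime
        if mtime > last_seen.get(path, mtime):
            changed.append(path)
    return changed
```

A supervisor built this way would call the function periodically, restart each daemon whose binary appears in the result, and then record the new time stamp.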

condor_master (cont’d)

Provides access to many remote administration commands:
• condor_reconfig
• condor_restart, condor_off, condor_on

Default server for many other commands:
• condor_config_val, etc.

Periodically runs condor_preen to clean up any files Condor might have left on the machine (the rest of the daemons clean up after themselves, as well)

condor_collector

Collects information from all other Condor daemons in the pool

Each daemon sends a periodic update called a “ClassAd” to the collector

Services queries for information:
• Queries from other Condor daemons
• Queries from users (condor_status)

Can store historical pool data

condor_eventd

Administrators specify events in a config file (similar to a crontab, but not exactly):
• Date and time
• What kind of event (currently, only “shutdown” is supported)
• What machines the event affects (ClassAd constraint)

condor_eventd (cont’d)

When an event is approaching, EventD will wake up and query the condor_collector for all machines that match the constraint

EventD then knows how big all the jobs are that are currently running on the affected nodes, network bandwidth to the nearest checkpoint servers, etc.

EventD plans evictions to allow the most computation w/o flooding the net
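
The slides do not say how the EventD actually computes its plan, so the following Python sketch is only one plausible way to do it, under stated assumptions: every job checkpoints over a single shared link of known bandwidth, checkpoints are written one at a time, and all of them must finish by the shutdown time. The function and parameter names are hypothetical.

```python
def plan_evictions(jobs, deadline_s, bandwidth_mb_per_s):
    """Return {job name: checkpoint start time in seconds}.

    `jobs` maps job name -> checkpoint image size in MB. Checkpoints are
    written one at a time so the network carries only one transfer.
    """
    plan = {}
    t = deadline_s
    # Walk backwards from the deadline: the smallest image is written last,
    # so the fewest jobs sit idle while each transfer is in progress.
    for name, size_mb in sorted(jobs.items(), key=lambda kv: kv[1]):
        t -= size_mb / bandwidth_mb_per_s
        plan[name] = t
    return plan

plan = plan_evictions({"a": 100, "b": 400, "c": 200}, deadline_s=3600,
                      bandwidth_mb_per_s=10)
print(plan)
```

With these numbers the largest job ("b") is evicted first, 70 seconds before the deadline, and the smallest ("a") keeps computing until 10 seconds before it.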

Talk Outline

• What is Condor and why is it good for large clusters?
• The Condor Daemons (the sys admin view)
• A look at the UW-Madison Computer Science Condor Pool and Cluster
• Some other features of Condor that help for big pools
• Future work

Large Condor Pools in HEP and Government Research

UW-Madison CS (~750 nodes)
INFN (~270 nodes)
CERN/Chorus (~100 nodes)
NASA Ames (~330 nodes)
NCSA (~200 nodes)

Layout of the UW-Madison Pool

[Diagram showing: Central Manager; EventD; Dedicated Scheduler; two Checkpoint Servers; Dedicated Linux Cluster (~200 cpus); Instructional Computer Labs (~225 cpus); Desktop Workstations (~325 cpus); submit-only machines at other sites; flocking to other pools.]

Composition of the UW/CS Cluster

Current cluster: 100 Dual XEON 550MHz with 1 gig of RAM (tower cases)

New nodes being installed: 150 Dual 933MHz Pentium III, 36 nodes w/ 2 gigs of RAM, the rest w/ 1 gig (2U racks)

100 Mbit Switched Ethernet to nodes

Gigabit Ethernet to the file servers and checkpoint server

Composition of the rest of the UW/CS Pool

Instructional Labs
• 60 Intel/Linux
• 60 Sparc/Solaris
• 105 Intel/NT

“Desktop Workstations”
• Includes 12 and 8-way Ultra E6000s, other SMPs, and real desktops, etc.

Central Manager - 600MHz Pentium III running Solaris, 512 Megs RAM

Talk Outline

• What is Condor and why is it good for large clusters?
• The Condor Daemons (the sys admin view)
• A look at the UW-Madison Computer Science Condor Pool and Cluster
• Some other features of Condor that help for big pools
• Future work

Condor’s Configuration

Condor’s configuration is a concatenation of multiple files, in order - definitions in later files overwrite previous definitions

Layout and purpose of the different files:
• Global config file
• Other shared files
• Local config file
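
The "later files win" rule can be sketched in Python as follows. This is a simplified model, not Condor's real parser: it handles only plain `NAME = value` lines and ignores macro expansion, and the host name and setting values are made up for the example.

```python
def parse_config(text):
    """Parse simple `NAME = value` lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings

def merge_configs(texts):
    """Apply files in order; later definitions replace earlier ones."""
    merged = {}
    for text in texts:
        merged.update(parse_config(text))
    return merged

# Illustrative contents of a global file and one machine's local file.
global_cfg = "CONDOR_HOST = cm.example.edu\nDAEMON_LIST = MASTER, STARTD, SCHEDD\n"
local_cfg = "# central manager only\nDAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR\n"

merged = merge_configs([global_cfg, local_cfg])
print(merged["DAEMON_LIST"])
```

The local file's `DAEMON_LIST` replaces the global one, while `CONDOR_HOST` is inherited unchanged, which is why pool-wide defaults belong in the global file and exceptions in the local one.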

Global Config File

All shared settings across your entire pool

Found either in file pointed to with the CONDOR_CONFIG environment variable, /etc/condor/condor_config, or the home directory of the “condor” user

Most settings can be in this file

Only works as a “global” file if it is on a shared file system (HIGHLY recommended for large sites!)

Other shared files

You can configure a number of other shared config files:
• files to hold common settings to make it easier to maintain (for example, all policy expressions, which we’ll see later)

• platform-specific config files

Local config file

Any machine-specific settings
• local policy settings for a given owner
• different daemons to run (for example, on the Central Manager)

Can either be on the local disk of each machine, or have separate files in a shared directory, each named by hostname

For large sites: keep them all on AFS or NFS, and in CVS, if possible

Daemon-specific configuration

You can also change daemon-specific settings with condor_config_val

Use the “-set” option for persistent changes, or “-rset” for memory-resident only

Used by the EventD

Can be used by other entities for various remote-administration tasks

Advertising Your Own Attributes in the Machine ClassAd

Add new macro(s) to the config file
• This is usually done in the local config file
• Can name the macros anything, so long as the names don’t conflict with existing ones

Tell the condor_startd to include these other macros in the ClassAd it sends out
• Edit the STARTD_EXPRS macro to include the names of the macros you want to advertise (comma separated)
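
The two steps above amount to a lookup: STARTD_EXPRS names the macros, and only those macros end up in the machine ad. The Python sketch below models that selection. It is not how the condor_startd is implemented, and the macro names `HAS_MATLAB` and `BUILDING` are hypothetical examples of site-defined attributes.

```python
def advertised_attributes(config):
    """Pick out the macros named in STARTD_EXPRS (comma separated)."""
    names = [n.strip() for n in config.get("STARTD_EXPRS", "").split(",")]
    return {n: config[n] for n in names if n and n in config}

# Hypothetical local config contents; the custom macro names are made up.
local_config = {
    "HAS_MATLAB": "True",
    "BUILDING": '"CS"',
    "STARTD_EXPRS": "HAS_MATLAB, BUILDING",
    "UNRELATED_SETTING": "42",
}

ad = advertised_attributes(local_config)
print(ad)
```

Settings that are defined but not listed in STARTD_EXPRS (here `UNRELATED_SETTING`) stay private to the config file and are not advertised.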

Host/IP Security in Condor

You can configure each machine in your pool to allow or deny certain actions from different groups of machines:
• “read” access - querying information
– condor_status, condor_q, etc.
• “write” access - updating information
– condor_submit, adding a node to the pool, etc.
• “administrator” access
– condor_on, off, reconfig, restart...
• “owner” access
– Things a machine owner can do (vacate)
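
A host-based allow/deny check of this kind can be sketched in Python. The policy table, host names, and wildcard patterns are hypothetical, and the rule that a deny match takes precedence over an allow match is an assumption of this sketch, not something stated on the slide.

```python
from fnmatch import fnmatch

# Hypothetical policy: access level -> allow and deny host patterns.
policy = {
    "read": {"allow": ["*"], "deny": []},
    "write": {"allow": ["*.cs.example.edu"], "deny": ["lab13.cs.example.edu"]},
    "administrator": {"allow": ["cm.cs.example.edu"], "deny": []},
}

def is_allowed(level, host):
    """Assumed rule for this sketch: a deny match wins, then allow must match."""
    rules = policy.get(level, {"allow": [], "deny": []})
    if any(fnmatch(host, pat) for pat in rules["deny"]):
        return False
    return any(fnmatch(host, pat) for pat in rules["allow"])

print(is_allowed("write", "node1.cs.example.edu"))
print(is_allowed("write", "lab13.cs.example.edu"))
```

In this example any host may query the pool, only department machines may submit jobs or join, one lab machine is explicitly shut out, and only the central manager may issue administrative commands.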

The Different Versions of Condor

We distribute two versions of Condor:
• Stable Series
– Heavily tested, recommended for use
– 2nd number of version string is even (6.2.0)
• Development Series
– Latest features, not necessarily well-tested
– 2nd number of version string is odd (6.3.0)
– Not recommended unless you know what you are doing and/or need a new feature
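
The even/odd rule is easy to check mechanically. The helper below is a hypothetical convenience function, not part of Condor, and it assumes the version is given as a bare dotted string such as "6.2.0".

```python
def release_series(version):
    """Classify a version string like '6.2.0' by its second number."""
    parts = version.split(".")
    minor = int(parts[1])
    return "stable" if minor % 2 == 0 else "development"

print(release_series("6.2.0"))
print(release_series("6.3.0"))
```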

Condor Versions (cont’d)

All daemons advertise a CondorVersion attribute in the ClassAd they publish

You can also view the version string by running ident on any Condor binary

In general, all parts of Condor on a single machine should run the same version

Machines in a pool can usually run different versions and communicate with each other

It will be made very clear when a version is incompatible with older versions

Talk Outline

• What is Condor and why is it good for large clusters?

• The Condor Daemons (the sys admin view)

• A look at the UW-Madison Computer Science Condor Pool and Cluster

• Some other features of Condor that help for big pools

• Future work

Future Work

User Authentication and Authorization
• Have Kerberos and X.509 authentication in beta mode already
• Will integrate w/ Condor tools to get rid of Host/IP authorization and move to user-based authorization

• Will enable encrypted channels to securely move data (including AFS tokens)

Future Work (cont’d)

Digitally Signed Binaries
• Condor Team will digitally sign binaries we release
• condor_master will only spawn new daemons if they are properly signed

More interesting dedicated scheduling

Condor RPMs

Addressing scalability

Obtaining Condor

Condor can be downloaded from the Condor web site at: http://www.cs.wisc.edu/condor

Complete Users and Administrators manual available: http://www.cs.wisc.edu/condor/manual

Contracted Support is available

Questions? Email: [email protected]