Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison [email protected] http://www.cs.wisc.edu/condor What’s new in Condor? Condor Week 2006


Page 1: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

Todd Tannenbaum
Computer Sciences Department
University of Wisconsin-Madison

[email protected]
http://www.cs.wisc.edu/condor

What’s new in Condor?
Condor Week 2006

Page 2: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

2

So Todd… where is v6.8?

Well, v6.7 has been a challenge…

Page 3: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

3


Page 4: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

4

Changes Per Condor Version

[Bar chart: Bugs Fixed and New Features per release, vertical axis 0 to 60. Releases shown: 6.2.0, 6.3.0, 6.3.3, 6.4.2, 6.4.7, 6.5.1, 6.5.4, 6.6.1, 6.6.4, 6.6.7, 6.6.10, 6.7.0, 6.7.3, 6.7.7, 6.7.10, 6.7.13, 6.7.16, 6.7.19.]

Page 5: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

5

Around since the 80’s

Page 6: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

6

Around since the 80’s

80’s Mullet Boy

Page 7: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

7

100 people surveyed! Favorite “ility” ?

Page 8: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

8

100 people surveyed! Favorite “ility” ?

Deployability!

Page 9: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

9

Existing Ports
• Digital UNIX 4.0 Alpha
• AIX 5.2 (clipped) PowerPC
• Tru64 5.1 (clipped) Alpha
• HP UNIX 10.20 PA RISC
• HP UNIX 11.00 (clipped using hpux10.20 32 bit) PA RISC
• Irix 6.5 (clipped) SGI
• Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 (clipped) Alpha
• Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 Intel x86
• Linux 2.4.x (glibc 2.2) - Red Hat 8 Intel x86
• Linux 2.4.x (glibc 2.3) - Red Hat 9 Intel x86
• Enterprise Server 8.1 Intel Itanium
• Solaris 8 Sparc
• Solaris 9 Sparc
• Microsoft Windows 2000 or XP (clipped) Intel x86

Page 10: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

10

New Ports
› Introduced in v6.6.x
  MacOSX (“clipped”) PowerPC
  Debian Linux 3.1 Intel x86
  Fedora Core 1 Intel x86
  Red Hat Enterprise Linux 3 Intel x86
  SuSE Linux Enterprise Server 8.1 Intel Itanium
› Introduced in v6.7.x
  AIX 5.1 (“clipped”) PowerPC
  Fedora Core 2 on x86
  Fedora Core 3 on x86
  SuSE 8.0 (“clipped”) on AMD64
  Solaris 10 (“clipped”) on Sparc
  Scientific Linux (Release 303) on x86
› Still to be introduced in v6.7.x (before v6.8.0)
  HPUX 11i 64-bit pa-risc
  RHEL 4 on x86
  “native” 64 bit AMD Linux

Sigh…

“Psilord” – The Condor porting doctor. Talk to him in person tomorrow.

Page 11: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

11

Porting Table
› See http://www.cs.wisc.edu/condor/porting/port_table.html
› Highlights
  Almost every 32-bit Linux flavor as “full”
  Every other Unix, MacOS and Windows available as “clipped”
  Solaris 10 and HP-UX 11.x now “clipped”
  FreeBSD 4 contribution from Yahoo!, added 5 and 6
  X86_64 Linux: “full” running in the lab

Page 12: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

12

Backfill Jobs
› Execute machines will run a locally staged executable when otherwise idle.
› Currently designed for BOINC.

# Turn on backfill functionality, and use BOINC
ENABLE_BACKFILL = TRUE
BACKFILL_SYSTEM = BOINC
# Spawn a backfill job if we've been Unclaimed for more than 5 minutes
START_BACKFILL = $(StateTimer) > (5 * $(MINUTE))
# Evict a backfill job if the machine is busy (based on keyboard
# activity or cpu load)
EVICT_BACKFILL = $(MachineBusy)
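The START_BACKFILL and EVICT_BACKFILL expressions above boil down to a small decision per policy evaluation. A minimal Python sketch of that decision follows. The function and parameter names (`backfill_action`, `unclaimed_seconds`, `machine_busy`) are hypothetical, and this only models the policy; it is not how the condor_startd evaluates ClassAd expressions.

```python
MINUTE = 60  # seconds, mirroring $(MINUTE) in the config above


def backfill_action(running_backfill, unclaimed_seconds, machine_busy):
    """Return 'start', 'evict', or 'none' for one policy evaluation.

    Hypothetical model of the two config expressions, not Condor code.
    """
    if running_backfill:
        # EVICT_BACKFILL = $(MachineBusy)
        return "evict" if machine_busy else "none"
    # START_BACKFILL = $(StateTimer) > (5 * $(MINUTE))
    if unclaimed_seconds > 5 * MINUTE:
        return "start"
    return "none"
```

For example, a machine that has been Unclaimed for 301 seconds would start a backfill job, while one at exactly 300 seconds would not, because the comparison is strict.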

Page 13: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

13

Joining Condor’s Einstein@Home Compute Team
› If you’re running BOINC backfill jobs in Condor and want to use your cycles to help another UW project, please join the Einstein@Home computation
› Join the “Condor Backfill” team:
  http://einstein.phys.uwm.edu/team_display.php?teamid=5994
  http://einstein.phys.uwm.edu/create_account_form.php?teamid=5994

Page 14: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

14

More “deployability”
› “Personal” Condor Support on Win32
  LocalSystem not required
› MSI installer on Win32 (thanks Micron!)
› New tools
  condor_cold_start and condor_cold_stop
  Safe, dynamic Condor service deployment. More info @ Research BOF 9am Rm219

Page 15: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

15

100 people surveyed! Favorite “ility” ?

Page 16: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

16

100 people surveyed! Favorite “ility” ?

Availability!

Page 17: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

17

Condor with Firewalls and NATs: GCB in v6.8.0!

[Diagram labels: Server app, Client app, a GCB layer above TCP/IP on each side, and a Relay point. Calls shown: listen, accept, connect, translate.]

Page 18: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

18

Job Progress continues if connection is interrupted

› Now for Vanilla, Java, and Grid universe jobs, Condor supports reestablishment of the connection between the submitting and executing machines.
  If there is a network outage between the execute and submit machines
  If the submit machine restarts
  Grid Universe was tricky…
› To take advantage of this feature, put the following line into your job’s submit description file:
  JobLeaseDuration = <N seconds>
  For example:
  job_lease_duration = 1200
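The lease idea can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `lease_still_valid` and the notion of a single "last renewal" timestamp are hypothetical, and the exact renewal rules Condor applies are not described on the slide.

```python
def lease_still_valid(last_renewal, now, job_lease_duration):
    """True while the job may keep running and wait for a reconnect.

    All arguments are in seconds. Hypothetical model, not Condor code.
    """
    return (now - last_renewal) <= job_lease_duration
```

With the 1200-second lease from the example above, a 600-second outage would leave the lease valid, while a 1500-second outage would let it expire.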

Page 19: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

19

Job Progress continues if submit machine fails

› Condor can now support a submit machine “hot spare” (schedd failover)
  If your submit machine A is down for longer than N minutes, a second machine B can take over
  Requires shared filesystem between machines A and B

Page 20: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

20

Central Manager Failover

› Condor Central Manager has two services
› condor_collector
  Now a list of collectors is supported
› condor_negotiator (matchmaker)
  If it fails, an election process runs and another takes over
  Accounting state is periodically replicated
  Contributed technology from Technion

Page 21: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

21

Reliability, cont.
› Time shifts
› Quill
› Closing windows of vulnerability

Page 22: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

22

100 people surveyed! Favorite “ility” ?

Page 23: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

23

100 people surveyed! Favorite “ility” ?

Lightweight?

Page 24: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

24

100 people surveyed! Favorite “ility” ?

Lightweight?

Page 25: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

25

100 people surveyed! Favorite “ility” ?

Page 26: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

26

100 people surveyed! Favorite “ility” ?

Functionality!

Page 27: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

27

Security
› Common Authentication Methods between Condor on Unix and Win32
  Kerberos 1.4
    • Additional hopeful benefit: Authentication against MS Active Directory!
  SSL
  Password (shared secret)
› Starter only runs known executables
› More powerful, unified map file(s)
› GSI credentials delegated

Page 28: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

28

With Condor on Win32, it would be nice if …

› My jobs could access my files just like the condor_shadow can

› I didn’t have to tie my execute machines to a single account

› I didn’t have to run condor_store_cred from every machine where my credential is needed

(thank you Optena)

Page 29: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

29

The Windows CredD

› A centralized repository for user passwords

C:\>condor_store_cred add
Account: gquinn@CROW
Enter password:
Operation succeeded.

[Diagram: condor_store_cred sends “store password” with <password> to the credd, which holds the stored password (shown as “y0ursmyp4sswd”).]

Page 30: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

30

The Windows CredD

Submit machines can use the CredD to impersonate the user in the shadow

[Diagram: the schedd and shadow send “fetch password” to the credd (holding “y0ursmyp4sswd”) and receive <password>.]

Page 31: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

31

The Windows CredD

Execute machines can use the CredD to run jobs as the submitting user!

[Diagram: the starter, which runs condor_exec.exe, sends “fetch password” to the credd (holding “y0ursmyp4sswd”) and receives <password>.]

Page 32: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

32

Running Jobs as Submitting User

› In submit file:
  Run_job_as_owner = true
› In config file on submit and execute nodes:
  CREDD_HOST = vault.cs.wisc.edu
  STARTER_ALLOW_RUNAS_OWNER = True
  CREDD_CACHE_LOCALLY = True

Page 33: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

33

Some Condor APIs
› Command Line tools
  condor_submit, condor_q, etc
  -format, -constraint, -xml
› Condor Perl Module
› Chirp
› Checkpoint Library API
› MW --- improved!
› DRMAA (Works w/ Win32, on SourceForge)
› Condor Grid ASCII Protocol (GAHP)
› Web Service Interface
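Scripts often drive the command-line tools listed above by assembling an argument list. A minimal Python sketch follows. The helper name `condor_q_args` is hypothetical; the `-constraint`, `-format`, and `-xml` options are the ones named on the slide, and the attribute names in the usage note are only examples.

```python
def condor_q_args(constraint=None, formats=(), xml=False):
    """Build an argv list for condor_q (hypothetical helper).

    formats is a sequence of (printf_style_format, attribute_name) pairs,
    one pair per -format option.
    """
    args = ["condor_q"]
    if constraint:
        args += ["-constraint", constraint]
    for fmt, attr in formats:
        args += ["-format", fmt, attr]
    if xml:
        args.append("-xml")
    return args
```

For instance, `condor_q_args("JobStatus == 2", [("%d.", "ClusterId"), ("%d\n", "ProcId")])` yields a list that could be passed to `subprocess.run` on a machine where Condor is installed.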

Page 34: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

34

DRMAA
› Distributed Resource Management Application API (DRMAA)
  GGF Working Group
  An API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems
› An API with C and Java bindings, not a protocol
› Scope
  Does: job submission, monitoring, control, final status
  Does not: file staging, reservations, security, …

Page 35: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

35

Condor GAHP

› The Condor GAHP is a relatively low-level protocol based on simple ASCII messages through stdin and stdout

› Supports a rich feature set including two-phase commits, transactions, and optional asynchronous notification of events

Page 36: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

36

GAHP, cont.

Example:

R: $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
S: GRAM_PING 100 vulture.cs.wisc.edu/fork
R: E
S: RESULTS
R: E
S: COMMANDS
R: S COMMANDS GRAM_JOB_CANCEL GRAM_JOB_REQUEST GRAM_JOB_SIGNAL GRAM_JOB_STATUS GRAM_PING INITIALIZE_FROM_FILE QUIT RESULTS VERSION
S: VERSION
R: S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
S: INITIALIZE_FROM_FILE /tmp/grid_proxy_554523.txt
R: S
S: GRAM_PING 100 vulture.cs.wisc.edu/fork
R: S
S: RESULTS
R: S 0
S: RESULTS
R: S 1
R: 100 0
S: QUIT
R: S
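Each line in the exchange above is a sequence of space-separated fields, and the version string suggests that a backslash escapes a literal space. A minimal Python sketch of a field splitter built on that assumption follows. The function name `split_gahp_line` is hypothetical, and the escaping rule is inferred from the example rather than taken from the protocol specification.

```python
def split_gahp_line(line):
    """Split one GAHP line into fields.

    Assumes fields are separated by single spaces and that a backslash
    escapes the next character (for example, a space inside a field).
    """
    fields, current, escaped = [], [], False
    for ch in line.rstrip("\r\n"):
        if escaped:
            current.append(ch)      # keep the escaped character literally
            escaped = False
        elif ch == "\\":
            escaped = True          # next character is taken literally
        elif ch == " ":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields
```

Under these assumptions the result line `100 0` splits into the request ID `100` and the status `0`, and `NCSA\ CoG\ Gahpd` stays together as a single field.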

Page 37: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

37

Web Service Interfaces
› SOAP over http or https to the Condor daemons
› Use any language or platform (where you can find a decent SOAP library)
› Functionality Exposed in current release
  Submit jobs
  Retrieve job output
  Remove/hold/release jobs
  Query machine status (fetch ads from collector)
  Query job status (fetch ads from the schedd)

Page 38: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

38

Getting machine status via SOAP (in Java with Axis)

locator = new CondorCollectorLocator();
collector = locator.getcondorCollector(new URL("http://machine:port"));
ads = collector.queryStartdAds("Memory>512");

Because we give you WSDL information you don’t have to write any of these functions.

Page 39: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

39

More Functionality changes..

› FINALLY, clean/consistent cross-platform quoting rules for arguments and environment variables (see condor_submit man page)

› Schedd can run HawkEye modules, just like the Startd
  Enables monitoring on the submit machine
› condor_history : now faster than a snail, and cleans up droppings.
› DeferralTime, DeferralWindow
  Coordinated starts
› BIND_ALL_INTERFACES in config file
› WANT_REMOTE_IO in job ClassAd
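The DeferralTime and DeferralWindow attributes listed above support coordinated starts. A minimal Python sketch of one plausible reading follows. The semantics here are an assumption, not something the slide states: DeferralTime is treated as the intended start time in seconds, and DeferralWindow as how many seconds late a start is still acceptable. The function name `deferral_decision` is hypothetical.

```python
def deferral_decision(now, deferral_time, deferral_window=0):
    """Return 'wait', 'run', or 'missed' (assumed semantics, see lead-in)."""
    if now < deferral_time:
        return "wait"       # too early: hold the job until the start time
    if now <= deferral_time + deferral_window:
        return "run"        # on time, or late but still inside the window
    return "missed"         # the window has passed
```

Under this reading, several jobs given the same DeferralTime would all begin at that moment, which is what makes the start coordinated.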

Page 40: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

40

ClassAd Functions in Condor!

› Conditionals
  IfThenElse(condition,then,else)
› String functions
  Strcat(), strcmp(), toUpper(), etc.
› StringList functions
  Example of a “string list” (CSV style)
    • Mylist = “Joe, Jon, Jeff, Jim, Jake”
  StrListContains(), StrListAppend(), StrListRemove(), etc.
› Others
  Regular expressions, arithmetic, etc…
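To make the "string list" idea concrete, here is a minimal Python sketch of list membership and append over a comma-separated string. These are Python stand-ins with hypothetical names (`str_list_contains`, `str_list_append`), not the ClassAd implementation. Details such as whitespace handling and case sensitivity are assumptions.

```python
def _members(string_list):
    """Split a CSV-style string list and trim whitespace around each item."""
    return [m.strip() for m in string_list.split(",") if m.strip()]


def str_list_contains(string_list, item):
    """True if item is an exact (case-sensitive) member of the list."""
    return item in _members(string_list)


def str_list_append(string_list, item):
    """Return a new string list with item added at the end."""
    return ", ".join(_members(string_list) + [item])
```

With the list from the slide, `str_list_contains("Joe, Jon, Jeff, Jim, Jake", "Jeff")` is true, while a partial name such as `"Je"` is not a member.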

Page 41: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

41

Accounting Groups and Group Quota Support

› Account Group (w/ CORE Feature Animation)
› Account Group Quota (inspiration CDF @ Fermi)
  Sample Problem: Cluster w/ 500 nodes, Chemistry Dept purchased 100 of them, Chemistry users must always be able to use them
  Could use Machine Rank…
    • but this ties to specific machines
  Or could use new group support
    • Each group can be given a quota in config file
    • Job ads can specify group membership
    • Group quotas are satisfied first
    • Accounting by user and by group
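The "group quotas are satisfied first" rule can be illustrated with the sample problem above. The Python sketch below is a deliberate simplification: the function name `allocate` is hypothetical, and the second pass hands leftover nodes out in plain demand order, which stands in for Condor's real priority-based negotiation.

```python
def allocate(total_slots, quotas, demand):
    """Grant slots, satisfying group quotas before anything else.

    quotas: group -> guaranteed slots; demand: group -> slots requested.
    Hypothetical model of the policy, not the negotiator's algorithm.
    """
    grants = {}
    remaining = total_slots
    # Pass 1: each group gets up to its quota, never more than it asks for.
    for group, quota in quotas.items():
        grants[group] = min(quota, demand.get(group, 0), remaining)
        remaining -= grants[group]
    # Pass 2 (simplifying assumption): leftovers go out in demand order.
    for group, wanted in demand.items():
        extra = min(wanted - grants.get(group, 0), remaining)
        grants[group] = grants.get(group, 0) + extra
        remaining -= extra
    return grants
```

With 500 nodes, a Chemistry quota of 100, and heavy demand from another group, Chemistry still receives its 100 nodes and the other 400 go elsewhere.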

Page 42: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

42

100 people surveyed! Favorite “ility” ?

Page 43: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

43

100 people surveyed! Favorite “ility” ?

Universability!

Page 44: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

44

Grid Universe

› With new Grid Universe, always specify a ‘gridtype’. So the old “globus” Universe is now declared as:
  universe = grid
  gridtype = gt2
› Other gridtypes?
  GT2 (Globus Toolkit 2)
  GT3 (Globus Toolkit 3.2)
  GT4 (Globus Toolkit 3.9.5+)
  UNICORE
  Nordugrid
  PBS (OpenPBS, PBSPro – technology from INFN)
  LSF (Platform LSF – technology from INFN)
  CONDOR (thanks gLite!)

[Slide callouts: ‘Condor-C’ and ‘Condor-G’]

Page 45: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

45

Other Grid Universe improvements

› Condor-G has support for credential refresh via the MyProxy Online Credential Management in NMI
  http://grid.ncsa.uiuc.edu/myproxy
  (both GT2 and GT4)
› GT4: we start a GridFTP server behind the scenes
  GridFTP server bundled w/ Condor nowadays
› Some functionality present in Condor-G added to Condor-C
  Forwarding of refreshed credentials (EGEE)
  GSI authentication support
  Cleaner ClassAd representation (URL)

Page 46: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

46

Parallel Universe
› Replaces the “MPI” universe
› Allows running arbitrary programs that need to gang-schedule multiple machines
  MPICH, LAM, …
  FT-MPICH (Seoul National Univ)
  Great for testing environments

Page 47: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

47

Hey Jobs! We’re watching you!

› Local Universe
  Just like Scheduler Universe, but there is a condor_starter
  All advantages of the starter

[Diagram: on the Submit machine, the schedd, a starter, and the job; on the Execute machine, the startd, a starter, and the job. Caption: “Hey, job, behave or else!”]

Page 48: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

48

100 people surveyed! Favorite “ility” ?

Page 49: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

49

100 people surveyed! Favorite “ility” ?

Scalability!

Page 50: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

50

Faster Negotiation
› SIGNIFICANT_ATTRIBUTES determined automatically
  Job attributes AutoClusterId and AutoClusterAttributes
  Rounding of Attributes
› Schedd uses non-blocking TCP connects to the startd
› Negotiator caching
› Collector Forks for queries
› More coming…
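The idea behind automatic clustering and attribute rounding is that jobs whose significant attributes match can be negotiated once as a group. A minimal Python sketch follows. The names `round_up` and `auto_cluster` and the job dictionary layout are hypothetical, and rounding up to a multiple of a fixed step is an assumption chosen for illustration; the slide does not give Condor's actual rounding rule.

```python
from collections import defaultdict


def round_up(value, step):
    """Round an integer up to the next multiple of step."""
    return -(-value // step) * step


def auto_cluster(jobs, significant_attributes, step=1):
    """Group job IDs by their (rounded) significant attribute values.

    Assumes every significant attribute is an integer. Hypothetical model.
    """
    clusters = defaultdict(list)
    for job in jobs:
        key = tuple(round_up(job[attr], step) for attr in significant_attributes)
        clusters[key].append(job["Id"])
    return dict(clusters)
```

Without rounding, jobs asking for 500 and 510 units of memory land in separate clusters; rounding both up to 512 lets them share one, so there are fewer distinct requests to match.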

Page 51: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

51

Scalability, cont.
› Knobs
  GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE,
  GRIDMANAGER_MAX_PENDING_SUBMIT_PER_RESOURCE,
  GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE
› One instance of gridmanager handles multiple jobs (all from a given user)
› One instance of condor_dagman can run multiple dags
  Is the Shadow next?
› Buffered I/O read on schedd restart (thanks Yahoo!)

Page 52: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

52

Quill
› Job ClassAds information mirrored into an RDBMS
› Both active jobs and historical jobs
› Benefits BOTH scalability and accessibility

[Diagram labels: Master, Startd, Schedd, Quill, Job Queue log, RDBMS, Queue + History Tables.]
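Mirroring job information into a database means ordinary SQL can answer questions about the queue without contacting the schedd. The Python sketch below illustrates the idea with the standard-library `sqlite3` module standing in for the RDBMS. The table layout, column names, and function names are hypothetical and do not reflect Quill's actual schema.

```python
import sqlite3


def mirror_jobs(conn, job_ads):
    """Copy a few fields of each job ad into a 'jobs' table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs "
        "(cluster_id INTEGER, proc_id INTEGER, owner TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?, ?)",
        [(ad["ClusterId"], ad["ProcId"], ad["Owner"], ad["Status"]) for ad in job_ads],
    )
    conn.commit()


def jobs_per_owner(conn):
    """Return (owner, job_count) rows, sorted by owner."""
    rows = conn.execute(
        "SELECT owner, COUNT(*) FROM jobs GROUP BY owner ORDER BY owner"
    )
    return rows.fetchall()
```

A query such as `jobs_per_owner` runs entirely against the database, which hints at why mirroring helps both scalability and accessibility.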

Page 53: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

53

Version 6.9.x

Page 54: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

54

What’s brewing for after v6.8.0?
› More data, data, data
  Stork now distributed in v6.7.x, incl. DAGMan support – next it is NeST’s turn.
  NeST to manage Condor spool files, ckpt servers
    • GridFTP used to move the bits
  Quill++ and CondorDB goodness
› Virtual Machines (and the future of Standard Universe)
  Research BOF w/ Jaeyoung Moon, rm219 9am

Page 55: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

55

SOAP API
› First focus will be to finish interfaces used by all command-line tools
  condor_userprio, condor_cod, …
› Explore message-based security
  Ian Alderman’s work w/ signed ClassAd attributes

Page 56: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

56

Privilege Separation
› No more root in the Condor daemons!
› Instead, a small component will be responsible for privileged operations
› Initial exploratory work w/ GNU userv (Cambridge)
› Now focusing on integration w/ glexec (gLite / nikhef)

Page 57: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

57

“The Year of the Schedd”
› Schedd is juggling too many tasks
  Break it down into smaller pieces, more modular
› Scalability
  All non-blocking I/O
  Hierarchy of schedds
› Schedd-on-the-side
  “Scheduler booster”
  Transform & delegate job classads to different grids
  A “job router” for a grid

Page 58: Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison  What’s new in Condor?

58

Thank you!