Page 1

Condor Week Summary

March 14-16, 2005

Madison, Wisconsin

Page 2

Overview

• Annual meeting at UW-Madison.

• About 80 participants at this year’s meeting.

• Participants come from universities, research labs and industry.

• A single track of plenary sessions with talks from users and developers.

Page 3

Overview

• Topics ranged from basic to advanced.

• Selected highlights in today’s talk.

• Slides from this year’s talks can be found at http://www.cs.wisc.edu/condor/CondorWeek2005

Page 4

Condor Week Topics

• distributed computing and Condor

• data handling and Condor

• 3rd party contributions to Condor

• reports from the field

• Condor roadmap

Page 5

Condor Grids (by Alan De Smet)

• Various alternatives for accessing remote computing resources (distributed computing, flocking, Globus/Condor-G, Condor-C, etc).

• Discussed pros and cons of each approach (ACF uses Globus/Condor-G).

Page 6

Condor-G Status and News

• Globus Toolkit 2 is stable.

• Globus Toolkit 3 is supported.
  – But we think most people are moving to…

• Globus Toolkit 4 is in progress.
  – GT4 beta works now in Condor 6.7.6.
  – Condor will officially support GT4 soon after the official GT4 release.
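
To ground this, a minimal sketch of a Condor-G submit file targeting a GT2 gatekeeper, in the globus-universe syntax of the 6.6/6.7 series; the gatekeeper contact string and file names are placeholders:

    # Condor-G: route the job through Globus to a remote gatekeeper
    universe        = globus
    # placeholder contact string; jobmanager-pbs, jobmanager-condor, etc. also work
    globusscheduler = gatekeeper.example.edu/jobmanager-fork
    executable      = my_app
    output          = my_app.out
    error           = my_app.err
    log             = my_app.log
    queue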

Page 7

Glidein (by Dan Bradley)

• You have access to a cluster running some other batch system.

• You want Condor features, such as:
  – queue management
  – matchmaking
  – checkpoint migration

Page 8

What Does Glidein Do?

• Installation and setup of Condor.
  – May be done remotely.

• Launching Condor.
  – Through Condor-G submission to Globus.
  – Or you run the startup script however you like.
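
As one concrete route, a hedged sketch using the condor_glidein tool; the contact string and count are placeholders, and the exact flags should be checked against the manual for your Condor version:

    # request 10 glidein slots on a cluster reachable through Globus
    # (flag spelling is an assumption from 6.7-era documentation)
    condor_glidein -count 10 gatekeeper.example.edu/jobmanager-pbs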

Page 9

Condor and DBMS (by Jeff Naughton)

• Premise: A running Condor system is awash in data:
  – Operational data
  – Historical data
  – User data

• DBMS technology can help capture, organize, manage, archive, and query this data.

Page 10

Three potential levels of involvement

1. Passively collect and organize data, expose it through DB query interfaces.

2. Move/extend some data-related portions of Condor to a DBMS (Condor writes to and reads from the DBMS).

3. Provide services to help users manage their data.

Page 11

Why do this?

• For Condor administrators:
  – Easier to analyze and troubleshoot;
  – Easier to audit;
  – Easier to explore current and past system status and behavior.

Page 12

Our projects and plans

• Quill: Transparently provide a DBMS query interface to job_queue and history data. [ready to deploy!]

• CondorDB: Transparently captures and provides interface to critical data from all Condor daemons. [status: partial prototype working in our own “sandbox”]

Page 13

Quill

• Job ClassAds information mirrored into an RDBMS

• Both active jobs and historical jobs

• Benefits BOTH scalability and accessibility

[Diagram: Quill sits beside the Schedd, mirroring the Job Queue log into an RDBMS that holds the Queue and History tables; the Startd, Master, and other daemons appear alongside.]
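
To illustrate the kind of query this enables, a hedged SQL sketch; the table and column names are hypothetical, since the talk did not show Quill's actual schema:

    -- hypothetical history table of completed job ClassAds
    SELECT owner,
           COUNT(*)                    AS jobs,
           AVG(remote_wall_clock_time) AS avg_wall_secs
    FROM   history
    GROUP  BY owner
    ORDER  BY jobs DESC;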

Page 14

Longer-term plans

• Tight integration of DBMS technology and Condor [status: thinking hard!].

• DBMS-inspired data management services to help Condor users manage their own data. [status: thinking really hard!]

Page 15

Stork (by Tevfik Kosar)

• Condor tool for data movement.

• First available in v6.7.6; will be included in the next stable release (6.8.0).

• Prototypes deployed at various sites.

Page 16

[Figure: yearly data volumes across fields. Bioinformatics: BLAST; High Energy Physics: LHC; Astronomy: LSST, 2MASS, SDSS, DPOSS, GSC-II, WFCAM, VISTA, NVSS, FIRST, GALEX, ROSAT, OGLE, …; Educational Technology: WCER EVP. Quoted rates: 500 TB/year, 2-3 PB/year, 11 PB/year, and 20 TB-1 PB/year.]

Page 17

Stork: Data Placement Scheduler

• First scheduler specialized for data movement/placement.

• De-couples data placement from computation.

• Understands the characteristics and semantics of data placement jobs.

• Can make smart scheduling decisions for reliable and efficient data placement.

http://www.cs.wisc.edu/condor/stork
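
For flavor, a sketch of a Stork data-placement job in the ClassAd-style syntax of the Stork documentation of this period, submitted with stork_submit; the URLs are placeholders and the attribute names should be treated as assumptions:

    [
      dap_type = "transfer";
      src_url  = "file:/data/run01/input.dat";
      dest_url = "gsiftp://storage.example.edu/data/run01/input.dat";
    ]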

Page 18

Stork can also:

• Allocate/de-allocate (optical) network links
• Allocate/de-allocate storage space
• Register/un-register files to Meta Data Catalog
• Locate physical location of a logical file name
• Control concurrency levels on storage servers

Page 19

Storage Management (by Jeff Weber)

• NeST (Network Storage Technology) is another project at UW-Madison.

• To be coupled to Condor and Stork.

• No stable release available yet.

Page 20

Overview of NeST

• NeST: Network Storage Technology
• Lightweight: Configuration and installation can be performed in minutes.
• Multi-protocol: Supports Chirp, GridFTP, NFS, HTTP
  – Chirp is NeST’s internal protocol
• Secure: GSI authentication
• Allocation: NeST negotiates “mini storage contracts” between users and server.

Page 21

Why storage allocations ?

• Users need both temporary storage and long-term guaranteed storage.

• Administrators need a storage solution with configurable limits and policy.

• Administrators will benefit from NeST’s autonomous reclamation of expired storage allocations.

Page 22

Storage allocations in NeST

• Lot – abstraction for a storage allocation with an associated handle
  – Handle is used for all subsequent operations on this lot

• Client requests a lot of a specified size and duration. Server accepts or rejects the client request.

Page 23

Condor and SRM (by Derek Wright)

• Coordinate computation and data movement with Condor.

• A Condor ClassAd hook (STARTD_CRON_JOBS) queries the DRM for files in its cache and publishes them in the ClassAd of each node.

• FSM keeps track of all files required by jobs in the system and contacts HRM if required files are missing.

• Regular Condor matchmaking schedules jobs where files exist.
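
To make the matchmaking step concrete, a hedged sketch: suppose the hook publishes a hypothetical CachedFiles string-list attribute into each machine ad; a job could then be steered to machines that already hold its input, using the StrListContains() function from the ClassAd improvements described later:

    # hypothetical machine-ad attribute published by the ClassAd hook:
    #   CachedFiles = "run01.dat, run02.dat"
    # job submit file: match only machines that already cache the input
    # (StrListContains(list, member) argument order is an assumption)
    requirements = StrListContains(CachedFiles, "run01.dat")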

Page 24

3rd party contributions to Condor

• High availability features (Technion Institute).

• Privilege separation in Condor (Univ. of Cambridge).

• Optimizing Condor throughput (CORE Feature Animation).

• Web interface to Condor (University College London).

Page 25

Current Condor Pool

[Diagram: a single Central Manager running the Collector and Negotiator serves a pool of machines, each running a Startd and Schedd.]

Page 26

Highly Available Condor Pool

[Diagram: the same pool of Startd/Schedd machines, now backed by a Highly Available Central Manager: one Active Central Manager and two Idle (backup) Central Managers.]

Page 27

Highly Available Central Manager

• Our solution - Highly Available Central Manager
  – Automatic failure detection
  – Transparent failover to backup matchmaker (no global configuration change for the pool entities)
  – “Split brain” reconciliation after network partitions
  – State replication between active and backups
  – No changes to Negotiator/Collector code
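
A hedged sketch of the kind of configuration involved, using condor_had knob names as they appear in later Condor manuals; host names and ports are placeholders:

    # condor_config on each machine that can act as central manager
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION
    # failover candidates, in priority order
    HAD_LIST         = cm1.example.edu:51450, cm2.example.edu:51450
    # replicate state between active and backup managers
    REPLICATION_LIST = cm1.example.edu:41450, cm2.example.edu:41450
    HAD_USE_REPLICATION = True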

Page 28

What is privilege separation?

• Isolation of those parts of the code that run at different privilege levels

[Diagram: without privilege separation, root, the Condor daemons, and the Condor job all fall in one privilege domain; with privilege separation, root, the Condor daemons, and the Condor job each run in separate domains.]

Page 29

Throughput Optimization (CORE Feature Animation)

Performance before => after:
• Removed Groups: 6 => 5.5 min
• Significant Attributes: 5.5 => 3 min
• Schedd Algorithm: 3 => 1.5 min
• Separate Servers: 1.5 => 0.6 min
• Cycle delay: 0.6 => 0.33 min
• Server loads: <1 Middleware, <2 Central Manager

Page 30

Web Service Interface to Condor

• Facilitate the development of third-party applications capable of interacting with Condor (remotely).

– E.g. build a higher-level, application-specific scheduler that submits jobs to multiple Condor pools based on application semantics

– These can be built using a wide range of languages/SOAP packages

– BirdBath has been tested on:

• Java (Apache Axis, XSUL)

• Python (ZSI)

• C# (.Net)

• C/C++ (gSOAP)

• Condor accessible from platforms where its command-line tools are not supported/installed

Page 31

Condor Plans (by Todd Tannenbaum)

• Condor 6.8.0 (stable series) available in May 05.

• Fail-over, persistence and other features.

• Improved scalability and accessibility (APIs, Grid middleware, Web-based interfaces, etc.).

• Grid universe and security improvements.

Page 32

• Condor can now transfer job data files larger than 2 GB in size.
  – On all platforms that support 64-bit file offsets.

• Real-time spooling of stdout/err/in in any universe, incl. VANILLA.
  – Real-time monitoring of job progress.

• Condor Installer on Win32 uses MSI (thanks Micron!)

• condor_transfer_data (DZero)
• STARTD_VM_EXPRS (INFN)
• condor_vacate_job tool
• condor_status -negotiator

BAM! More tasty Condor goodness!
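
As a small illustration of the real-time spooling item above, a sketch of a vanilla-universe submit file using the stream_output/stream_error submit commands; file names are placeholders:

    # vanilla-universe job whose stdout/stderr stream back in real time
    universe      = vanilla
    executable    = my_app
    output        = my_app.out
    error         = my_app.err
    log           = my_app.log
    stream_output = True
    stream_error  = True
    queue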

Page 33

And More…

• New startd policy expression MaxJobRetirementTime.
  – Specifies the maximum amount of time (in seconds) that the startd is willing to wait for a job to finish on its own when the startd needs to preempt the job.

• -peaceful option to condor_off, condor_restart
• noop_job = True
• Preliminary support for the Tool Daemon Protocol (TDP)
  – TDP’s goal is to provide a generic way for scheduling systems (daemons) to interact with monitoring tools.
  – Users can specify a “tool” that should be spawned alongside their regular Condor job.
  – On Linux, ability to allow a monitoring tool to attach with ptrace() before the job’s main() function is called.
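
Two of these in use, as a minimal sketch with example values:

    # startd config: when preempting, give the job up to an hour to finish
    MaxJobRetirementTime = 3600

    # shell: turn Condor off gracefully, letting running jobs complete
    condor_off -peaceful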

Page 34

Hey Jobs! We’re watching you!

• condor_starter enforces limits
  – Starter is already monitoring many job characteristics (image size, CPU usage, etc.)
  – Threshold expressions
    • Use more resources than you said you would, and BAM!

• Local Universe
  – Just like Scheduler Universe, but there is a condor_starter
  – All advantages of the starter

[Diagram: on the Submit side, the schedd spawns a starter, which runs the job (Local Universe); on the Execute side, the startd spawns a starter, which runs the job. Caption: “Hey, job, behave or else!”]
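
A minimal sketch of a local-universe submit file; the executable and file names are placeholders:

    # run the job on the submit machine itself, under a condor_starter
    universe   = local
    executable = my_script
    output     = local.out
    error      = local.err
    log        = local.log
    queue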

Page 35

ClassAd Improvements in Condor!

• Conditionals
  – IfThenElse(condition, then, else)

• String functions
  – strcat(), strcmp(), toUpper(), etc.

• StringList functions
  – Example of a “string list” (CSV style):
    • Mylist = “Joe, Jon, Jeff, Jim, Jake”
  – StrListContains(), StrListAppend(), StrListRemove(), etc.

• Others
  – Type tests, some math functions
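
A couple of illustrative expressions built from these functions, as they might appear in a submit file; the StringList argument order is an assumption:

    # prefer machines with more than 1 GB of memory, ranked by memory size
    Rank = IfThenElse(Memory > 1024, Memory, 0)
    # match only if "Jim" appears in the CSV-style string list
    # (StrListContains(list, member) argument order is an assumption)
    Requirements = StrListContains("Joe, Jon, Jeff, Jim, Jake", "Jim")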

Page 36

Accounting Groups and Group Quota Support

• Account Group (w/ CORE Feature Animation)
• Account Group Quota (inspiration CDF @ Fermi)
  – Sample Problem: Cluster w/ 500 nodes; the Chemistry Dept purchased 100 of them; Chemistry users must always be able to use them.
  – Could use Machine Rank…
    • but this ties to specific machines
  – Or could use the new group support (see the sketch after this list):
    • Each group can be given a quota in the config file
    • Job ads can specify group membership
    • Group quotas are satisfied first
    • Accounting by user and by group
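
A hedged sketch of the sample problem in configuration, using the group-quota knob names from Condor manuals of this era; group and user names are placeholders:

    # central manager config: define the group and reserve 100 machines
    GROUP_NAMES                 = group_chemistry
    GROUP_QUOTA_group_chemistry = 100

    # job submit file: claim membership as user alice
    +AccountingGroup = "group_chemistry.alice"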

Page 37

Improved Scalability

• Much faster negotiation
  – SIGNIFICANT_ATTRIBUTES determined automatically
  – Schedd uses non-blocking TCP connects to the startd
  – Negotiator caching
  – Collector forks for queries
  – More…

Page 38

What’s brewing for after v6.8.0?

• More data, data, data
  – Stork distributed w/ v6.8.0, incl. DAGMan support
  – NeST manages Condor spool files, ckpt servers
  – Stork used for Condor job data transfers

• Virtual Machines (and the future of Standard Universe)
• Condor and Shibboleth (with Georgetown Univ)
• Least Privilege Security Access (with U of Cambridge)
• Dynamic Temporary Accounts (with EGEE, Argonne)
• Leverage Database Technology (with UW DB group)
• ‘Automatic’ Glideins (NMI Nanohub – Purdue, U of Florida)
• Easier Updates
• New ClassAds (integration with Optena)
• Hierarchical Matchmaking

Can I commit this to CVS??

