
Condor Week Summary March 14-16, 2005 Madison, Wisconsin


  • Condor Week Summary

    March 14-16, 2005

    Madison, Wisconsin

  • Overview

    Annual meeting at UW-Madison.

    About 80 participants at this year's meeting.

    Participants come from universities, research labs and industry.

    A single plenary session with talks from users and developers.

  • Overview

    Topics ranged from basic to advanced.

    Selected highlights in today's talk.

    Slides from this year's talks can be found at http://www.cs.wisc.edu/condor/CondorWeek2005

  • CondorWeek Topics

    distributed computing and Condor

    data handling and Condor

    3rd party contributions to Condor

    reports from the field

    Condor roadmap

  • Condor Grids (by Alan De Smet)

    Various alternatives for accessing remote computing resources (distributed computing, flocking, Globus/Condor-G, Condor-C, etc).

    Discussed pros and cons of each approach (ACF uses Globus/Condor-G).

  • Condor-G Status and News

    Globus Toolkit 2 is stable.

    Globus Toolkit 3 is supported.

    But we think most people are moving to Globus Toolkit 4, for which support is in progress.

    GT4 beta works now in Condor 6.7.6.

    Condor will officially support GT4 soon after the official GT4 release.

  • Glidein (by Dan Bradley)

    You have access to a cluster running some other batch system.

    You want Condor features, such as queue management, matchmaking, and checkpoint migration.

  • What Does Glidein Do?

    Installation and setup of Condor. May be done remotely.

    Launching Condor, either through Condor-G submission to Globus, or you run the startup script however you like.

  • Condor and DBMS (by Jeff Naughton)

    Premise: A running Condor system is awash in data: operational data, historical data, and user data.

    DBMS technology can help capture, organize, manage, archive, and query this data.

  • Three potential levels of involvement

    Passively collect and organize data, expose it through DB query interfaces.

    Move/extend some data-related portions of Condor to DBMS (Condor writes to and reads from DBMS).

    Provide services to help users manage their data.

  • Why do this?

    For Condor administrators:

    Easier to analyze and troubleshoot.

    Easier to audit.

    Easier to explore current and past system status and behavior.

  • Our projects and plans

    Quill: Transparently provides a DBMS query interface to job_queue and history data. [ready to deploy!]

    CondorDB: Transparently captures and provides an interface to critical data from all Condor daemons. [status: partial prototype working in our own sandbox]

  • Quill

    Job ClassAds information mirrored into an RDBMS.

    Both active jobs and historical jobs.

    Benefits BOTH scalability and accessibility.

    [Diagram labels: Quill, Schedd, Job Queue log, RDBMS, Startd, Master, Queue + History Tables]
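A minimal sketch of the kind of query that a mirrored job queue makes possible. It uses Python's built-in sqlite3 module as a stand-in RDBMS. The table name `jobs`, its columns, and the sample rows are hypothetical and are not Quill's actual schema.

```python
import sqlite3

# In-memory database standing in for the RDBMS that mirrors job ClassAds.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (cluster_id INTEGER, owner TEXT, status TEXT, is_history INTEGER)"
)
# Hypothetical sample rows: active jobs (is_history=0) and a finished job (is_history=1).
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?)",
    [
        (101, "alice", "running", 0),
        (102, "alice", "idle", 0),
        (103, "bob", "running", 0),
        (90, "bob", "completed", 1),
    ],
)

def jobs_per_owner(connection, history):
    """Count jobs per owner in either the active queue or the history."""
    rows = connection.execute(
        "SELECT owner, COUNT(*) FROM jobs WHERE is_history = ? "
        "GROUP BY owner ORDER BY owner",
        (1 if history else 0,),
    )
    return dict(rows.fetchall())

active_counts = jobs_per_owner(conn, history=False)
history_counts = jobs_per_owner(conn, history=True)
print(active_counts, history_counts)
```

Because active and historical jobs sit behind the same query interface, one SQL statement can answer questions that would otherwise need separate tools for the queue and the history file.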

  • Longer-term plans

    Tight integration of DBMS technology and Condor. [status: thinking hard!]

    DBMS-inspired data management services to help Condor users manage their own data. [status: thinking really hard!]

  • Stork (by Tevfik Kosar)

    Condor tool for data movement.

    First available in v. 6.7.6. Will be included in next stable release (6.8.0).

    Prototypes deployed at various sites.

  • [Figure labels: 500 TB/year; 2-3 PB/year; 11 PB/year; 20 TB - 1 PB/year]

  • Stork: Data Placement Scheduler

    First scheduler specialized for data movement/placement.

    De-couples data placement from computation.

    Understands the characteristics and semantics of data placement jobs.

    Can make smart scheduling decisions for reliable and efficient data placement.

    http://www.cs.wisc.edu/condor/stork

  • Stork can also:

    Allocate/de-allocate (optical) network links.

    Allocate/de-allocate storage space.

    Register/un-register files to Meta Data Catalog.

    Locate physical location of a logical file name.

    Control concurrency levels on storage servers.
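As an illustration of controlling concurrency levels on storage servers, the sketch below queues transfer requests and starts one only while its destination server is under a per-server limit. The class and method names are hypothetical and this is not Stork's interface. It is only a toy model of the idea.

```python
from collections import defaultdict, deque

class PlacementScheduler:
    """Toy data placement queue with a per-server concurrency limit."""

    def __init__(self, max_per_server):
        self.max_per_server = max_per_server
        self.pending = deque()            # transfers waiting to start
        self.active = defaultdict(int)    # server -> running transfer count

    def submit(self, src, dest_server):
        self.pending.append((src, dest_server))

    def dispatch(self):
        """Start every pending transfer whose destination has a free slot."""
        started, still_waiting = [], deque()
        while self.pending:
            job = self.pending.popleft()
            server = job[1]
            if self.active[server] < self.max_per_server:
                self.active[server] += 1
                started.append(job)
            else:
                still_waiting.append(job)
        self.pending = still_waiting
        return started

    def finish(self, job):
        """Release the slot held by a completed transfer."""
        self.active[job[1]] -= 1

sched = PlacementScheduler(max_per_server=2)
for name in ["a.dat", "b.dat", "c.dat"]:
    sched.submit(name, "storage1")
sched.submit("d.dat", "storage2")
first_batch = sched.dispatch()    # c.dat must wait: storage1 is full
sched.finish(first_batch[0])      # a.dat completes and frees a slot
second_batch = sched.dispatch()   # c.dat can now start
```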

  • Storage Management (by Jeff Weber)

    NeST (Network Storage Technology) is another project at UW-Madison.

    To be coupled to Condor and Stork.

    No stable release available yet.

  • Overview of NeST

    NeST: Network Storage Technology.

    Lightweight: Configuration and installation can be performed in minutes.

    Multi-protocol: Supports Chirp, GridFTP, NFS, HTTP. Chirp is NeST's internal protocol.

    Secure: GSI authentication.

    Allocation: NeST negotiates mini storage contracts between users and server.

  • Why storage allocations?

    Users need both temporary storage and long-term guaranteed storage.

    Administrators need a storage solution with configurable limits and policy.

    Administrators will benefit from NeST's autonomous reclamation of expired storage allocations.

  • Storage allocations in NeST

    Lot abstraction for storage allocation with an associated handle.

    Handle is used for all subsequent operations on this lot.

    Client requests lot of a specified size and duration. Server accepts or rejects client request.
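The lot request/accept/reject cycle and the reclamation of expired allocations can be sketched as follows. This is a toy model under stated assumptions: the `LotServer` class, its method names, and the use of integer handles and megabyte sizes are all hypothetical and do not reflect NeST's real protocol.

```python
import itertools

class LotServer:
    """Toy storage server that grants size- and time-limited lots."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.lots = {}                      # handle -> (size_mb, expires_at)
        self._handles = itertools.count(1)

    def used_mb(self):
        return sum(size for size, _ in self.lots.values())

    def reclaim_expired(self, now):
        """Drop lots whose duration has elapsed, freeing their space."""
        expired = [h for h, (_, end) in self.lots.items() if end <= now]
        for handle in expired:
            del self.lots[handle]
        return expired

    def request_lot(self, size_mb, duration_s, now):
        """Return a handle if the request fits, or None if it is rejected."""
        self.reclaim_expired(now)
        if self.used_mb() + size_mb > self.capacity_mb:
            return None
        handle = next(self._handles)
        self.lots[handle] = (size_mb, now + duration_s)
        return handle

server = LotServer(capacity_mb=100)
h1 = server.request_lot(60, duration_s=10, now=0)        # accepted
rejected = server.request_lot(60, duration_s=10, now=5)  # no room yet
h2 = server.request_lot(60, duration_s=10, now=10)       # first lot has expired
```

The second request is rejected only because the first lot is still valid. Once that lot expires, the server reclaims its space without administrator action and the same request succeeds.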

  • Condor and SRM (by Derek Wright)

    Coordinate computation and data movement with Condor.

    Condor ClassAd hook (STARTD_CRON_JOBS) queries DRM for files in cache and publishes them in the ClassAd for each node.

    FSM keeps track of all files required by jobs in the system and contacts HRM if required files are missing.

    Regular Condor matchmaking schedules jobs where files exist.
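A minimal sketch of scheduling jobs where their files already exist, using plain dictionaries in place of ClassAds. The attribute name `CachedFiles` and both helper functions are hypothetical. Real matchmaking evaluates ClassAd expressions rather than Python sets.

```python
def nodes_with_files(node_ads, required_files):
    """Return names of nodes whose advertised cache holds every required file."""
    needed = set(required_files)
    return sorted(
        name for name, ad in node_ads.items()
        if needed <= set(ad.get("CachedFiles", []))
    )

def missing_files(node_ads, required_files):
    """Files cached on no node at all; these would have to be staged first."""
    cached_anywhere = set()
    for ad in node_ads.values():
        cached_anywhere |= set(ad.get("CachedFiles", []))
    return sorted(set(required_files) - cached_anywhere)

# Hypothetical per-node ads, as if published by the cache-query hook.
node_ads = {
    "node1": {"CachedFiles": ["run1.dat", "run2.dat"]},
    "node2": {"CachedFiles": ["run2.dat"]},
}
matches = nodes_with_files(node_ads, ["run2.dat"])
to_fetch = missing_files(node_ads, ["run2.dat", "run3.dat"])
```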

  • 3rd party contributions to Condor

    High availability features (Technion Institute).

    Privilege separation in Condor (Univ. of Cambridge).

    Optimizing Condor throughput (CORE Feature Animation).

    Web interface to Condor (University College London).

  • Current Condor Pool

    [Diagram label: Central Manager]

  • Highly Available Condor Pool

    [Diagram label: Highly Available Central Manager]

  • Highly Available Central Manager

    Our solution - Highly Available Central Manager:

    Automatic failure detection.

    Transparent failover to backup matchmaker (no global configuration change for the pool entities).

    Split brain reconciliation after network partitions.

    State replication between active and backups.

    No changes to Negotiator/Collector code.

  • What is privilege separation?

    Isolation of those parts of the code that run at different privilege levels.

  • Throughput Optimization (CORE Feature Animation)

    Performance Before => After:

    Removed Groups: 6 => 5.5 min

    Significant Attributes: 5.5 => 3 min

    Schedd Algorithm: 3 => 1.5 min

    Separate Servers: 1.5 => 0.6 min

    Cycle delay: 0.6 => 0.33 min

    Server Loads:
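The before/after figures on this slide chain together, so the gain from each step and the cumulative gain can be worked out directly. The short calculation below uses only the numbers quoted on the slide. The label "baseline" for the starting value is added here for illustration.

```python
# Times in minutes, copied from the slide: the starting point and
# the time after each successive optimization.
steps = [
    ("baseline", 6.0),
    ("Removed Groups", 5.5),
    ("Significant Attributes", 3.0),
    ("Schedd Algorithm", 1.5),
    ("Separate Servers", 0.6),
    ("Cycle delay", 0.33),
]

def step_speedups(stages):
    """Speedup factor of each optimization relative to the previous stage."""
    return {
        name: round(prev / curr, 2)
        for (_, prev), (name, curr) in zip(stages, stages[1:])
    }

speedups = step_speedups(steps)
overall = round(steps[0][1] / steps[-1][1], 1)
print(speedups, overall)
```

Going from 6 minutes to 0.33 minutes is roughly an 18-fold improvement overall, with the "Separate Servers" step giving the largest single factor (2.5x).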
  • Web Service Interface to Condor

    Facilitate the development of third-party applications capable of interacting with Condor (remotely).

    E.g. build a higher-level, application-specific scheduler that submits jobs to multiple Condor pools based on application semantics.

    These can be built using a wide range of languages/SOAP packages.

    BirdBath has been tested on: Java (Apache Axis, XSUL), Python (ZSI), C# (.Net), C/C++ (gSOAP).

    Condor accessible from platforms where its command-line tools are not supported/installed.

  • Condor Plans (by Todd Tannenbaum)

    Condor 6.8.0 (stable series) available in May 05.

    Fail-over, persistence and other features.

    Improved scalability and accessibility (APIs, Grid middleware, Web-based interfaces, etc).

    Grid universe and security improvements.

  • Condor can now transfer job data files larger than 2 GB in size, on all platforms that support 64-bit file offsets.

    Real-time spooling of stdout/err/in in any universe, incl VANILLA.

    Real-time monitoring of job progress.

    Condor Installer on Win32 uses MSI (thanks Micron!).

    condor_transfer_data (DZero)

    STARTD_VM_EXPRS (INFN)

    condor_vacate_job tool

    condor_status -negotiator

    BAM! More tasty Condor goodness!

  • And More

    New startd policy expression MaxJobRetirementTime: specifies the maximum amount of time (in seconds) that the startd is willing to wait for a job to finish on its own when the startd needs to preempt the job.

    -peaceful option to condor_off, condor_restart

    noop_job = True

    Preliminary support for the Tool Daemon Protocol (TDP). The TDP goal is to provide a generic way for scheduling systems (daemons) to interact with monitoring tools. Users can specify a ``tool'' that should be spawned alongside their regular Condor job. On Linux, a monitoring tool can attach with ptrace() before the job's main() function is called.

  • Hey Jobs! We're watching you!

    condor_starter enforces limits: the starter is already monitoring many job characteristics (image size, cpu usage, etc). Threshold expressions: use more resources than you said you would, and BAM!

    Local Universe: just like Scheduler Universe, but there is a condor_starter. All advantages of the starter.

    [Diagram: schedd, starter, job (Submit); startd, starter, job (Execute)]

    Hey, job, behave or else!
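The limit enforcement described on this slide amounts to comparing what a job declared with what the starter observes. A minimal sketch follows, in which the resource names (`image_size_mb`, `cpu_seconds`) and the function are hypothetical. In Condor itself the thresholds would be written as expressions, not Python dictionaries.

```python
def exceeded_limits(declared, observed):
    """Return the sorted names of resources whose observed use exceeds the declared limit."""
    return sorted(
        name for name, limit in declared.items()
        if observed.get(name, 0) > limit
    )

# What the job said it would use (hypothetical units and names).
declared = {"image_size_mb": 512, "cpu_seconds": 3600}

# Two monitoring samples: one within limits, one over the image size limit.
ok_sample = {"image_size_mb": 300, "cpu_seconds": 1200}
bad_sample = {"image_size_mb": 700, "cpu_seconds": 1200}

violations_ok = exceeded_limits(declared, ok_sample)
violations_bad = exceeded_limits(declared, bad_sample)
```

A non-empty result is the point at which a policy would act on the job, for example by holding or removing it.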

  • ClassAd Improvements in Condor!

    Conditionals: IfThenElse(condition,then,else)

    String functions: Strcat(), strcmp(), toUpper(), etc.

    StringList functions: StrListContains(), StrListAppend(), StrListRemove(), etc.

    Example of a string list (CSV style): Mylist = Joe, Jon, Jeff, Jim, Jake

    Others: Type test, some math functions.
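To make the intent of these functions concrete, here are rough Python analogues applied to the string list from the slide. They are emulations under assumptions, not Condor's implementation: the exact ClassAd semantics (case sensitivity, whitespace handling, return types) are not given on the slide, so this sketch assumes case-sensitive matching and trims whitespace around items.

```python
def if_then_else(condition, then_value, else_value):
    """Analogue of IfThenElse(condition, then, else)."""
    return then_value if condition else else_value

def _items(string_list):
    """Split a comma-separated string list, trimming surrounding whitespace."""
    return [item.strip() for item in string_list.split(",") if item.strip()]

def str_list_contains(string_list, member):
    """Analogue of StrListContains(): is member one of the list items?"""
    return member in _items(string_list)

def str_list_append(string_list, member):
    """Analogue of StrListAppend(): return the list with member added at the end."""
    return ", ".join(_items(string_list) + [member])

def str_list_remove(string_list, member):
    """Analogue of StrListRemove(): return the list without member."""
    return ", ".join(item for item in _items(string_list) if item != member)

mylist = "Joe, Jon, Jeff, Jim, Jake"
has_jeff = str_list_contains(mylist, "Jeff")
label = if_then_else(str_list_contains(mylist, "Bob"), "member", "outsider")
```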

  • Accounting Groups and Group Quota Support

    Account Group (w/ CORE Feature Animation)

    Account Group Quota (inspiration CDF @ Fermi)

    Sample Problem: Cluster w/ 500 nodes, Chemistry Dept purchased 100 of them, Chemistry users must always be able to use them.

    Could use Machine Rank, but this ties to specific machines.

    Or could use new group support: Each group can be given a quota in config file. Job ads can specify group membership. Group quotas are satisfied first.

    Accounting by user and by group.
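The sample problem can be worked through with a small two-pass allocation: quotas are satisfied first, then leftover nodes go to whoever still has unmet demand. This is a simplified sketch, not the negotiator's algorithm. The function name, the demand figures, and the order in which leftovers are handed out are assumptions made for illustration.

```python
def allocate(total_slots, quotas, demand):
    """Give each group min(quota, demand) first, then hand out what is left."""
    allocation = {}
    remaining = total_slots
    # Pass 1: group quotas are satisfied first.
    for group, quota in quotas.items():
        granted = min(quota, demand.get(group, 0), remaining)
        allocation[group] = granted
        remaining -= granted
    # Pass 2: leftover slots go to any submitter with unmet demand,
    # in the (arbitrary) order the demands are listed.
    for submitter, wanted in demand.items():
        unmet = wanted - allocation.get(submitter, 0)
        extra = min(unmet, remaining)
        allocation[submitter] = allocation.get(submitter, 0) + extra
        remaining -= extra
    return allocation

# 500-node cluster; Chemistry purchased 100 nodes. Physics asks for more
# than the whole cluster, yet Chemistry still receives its 100.
result = allocate(
    total_slots=500,
    quotas={"chemistry": 100},
    demand={"physics": 600, "chemistry": 150},
)
```

Unlike a Machine Rank policy, nothing here ties Chemistry to particular machines. The quota is a count of nodes, which any 100 of the 500 can satisfy.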

  • Improved Scalability

    Much faster negotiation.

    SIGNIFICANT_ATTRIBUTES determined automatically.

    Schedd uses non-blocking TCP connects to the startd.

    Negotiator caching.

    Collector forks for queries.

    More

  • What's brewing for after v6.8.0?

    More data, data, data: Stork distributed w/ v6.8.0, incl DAGMan support. NeST manage Condor spool files, ckpt servers. Stork used for Condor job data transfers.

    Virtual Machines (and the future of Standard Universe)

    Condor and Shibboleth (with Georgetown Univ)

    Least Pr
