Grid Computing I CONDOR

  • Grid Computing I

    CONDOR

  • Agenda

    • What is Condor?
    • What is Condor good for?
    • How does Condor work?
    • How to submit a job?

  • What is Condor?

    Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing (HTC) facility.

    Condor manages both resources (machines) and resource requests (jobs).

    Condor has several unique mechanisms, such as:

    • ClassAd matchmaking
    • Process checkpoint / restart / migration
    • Remote system calls
    • Grid awareness

  • How Condor works

    Condor provides:

    • a job queueing mechanism
    • a scheduling policy
    • a priority scheme
    • resource monitoring
    • resource management

    Users submit their serial or parallel jobs to Condor. Condor places them into a queue, chooses when and where to run them based upon a policy, monitors their progress, and ultimately informs the user upon completion.
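
    The queue-then-schedule flow above can be sketched in a few lines of Python. This is only a toy model, not how Condor is implemented: the class name JobQueue, the job names, and the rule "higher priority number runs first, first-come-first-served among equals" are assumptions made for illustration.

```python
import heapq
from itertools import count


class JobQueue:
    """Toy job queue (hypothetical): higher priority first, FIFO among equals."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker that preserves submission order

    def submit(self, name, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), name))

    def run_all(self):
        """Pop jobs in scheduling order; appending stands in for run + notify."""
        finished = []
        while self._heap:
            _, _, name = heapq.heappop(self._heap)
            finished.append(name)
        return finished


queue = JobQueue()
queue.submit("render", priority=0)
queue.submit("simulate", priority=5)
queue.submit("analyze", priority=0)
completed = queue.run_all()
print(completed)
```

    A real pool also weighs machine availability and per-user priorities, which this sketch leaves out.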

  • Condor Architecture

  • Condor Daemons in action

  • condor_master

    • Starts up all other Condor daemons.
    • If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator.
    • Checks the time stamps on the binaries of the other Condor daemons; if new binaries appear, the master gracefully shuts down the currently running version and starts the new version.
    • Also supports various administrative commands, such as starting, stopping, or reconfiguring daemons remotely.
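
    The restart-and-notify behaviour can be illustrated with a small simulation. Everything here is hypothetical (the ToyMaster class, its methods, and the message text); no real processes are started and no email is sent, a list simply records what the master would report.

```python
class ToyMaster:
    """Simulated supervisor (hypothetical): restarts exited daemons and logs a notice."""

    def __init__(self, daemon_names):
        self.running = {name: True for name in daemon_names}
        self.notifications = []  # stands in for email to the administrator

    def mark_exited(self, name):
        """Simulate a daemon exiting unexpectedly."""
        self.running[name] = False

    def check(self):
        """Restart every daemon that is not running; return the names restarted."""
        restarted = []
        for name, alive in self.running.items():
            if not alive:
                self.running[name] = True  # simulated restart
                self.notifications.append(f"restarted {name}")
                restarted.append(name)
        return restarted


master = ToyMaster(["condor_startd", "condor_schedd"])
master.mark_exited("condor_schedd")
restarted = master.check()
print(restarted, master.notifications)
```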

  • condor_startd

    • Represents a machine to the Condor system.
    • Advertises information about the node's resources to the central manager (condor_collector).
    • Responsible for starting, suspending, and stopping jobs.
    • Enforces the wishes of the machine owner (the owner's policy).

  • condor_starter

    • Runs only on execution hosts.
    • Sets up the execution environment and monitors the job.

  • condor_schedd

    • Represents users to the Condor system.
    • Maintains the persistent queue of jobs.
    • Responsible for contacting available machines and sending them jobs.
    • Services user commands that manipulate the job queue: condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio.

  • condor_collector

    • Collects information from all other Condor daemons in the pool.
    • Acts as the directory service / database for a Condor pool.
    • Each daemon sends a periodic update, called a ClassAd, to the collector.
    • Services queries for information:
      • queries from other Condor daemons
      • queries from users (condor_status)
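
    The collector's role as a directory of periodically refreshed ClassAds can be sketched as follows. This is a simplified model under stated assumptions: a plain Python dict stands in for a ClassAd, the ToyCollector class and its methods are hypothetical, and the machine names and values are made up.

```python
class ToyCollector:
    """Toy directory (hypothetical): keeps the latest ad per (type, name)."""

    def __init__(self):
        self._ads = {}

    def update(self, ad):
        # A later update from the same daemon replaces its earlier ad.
        self._ads[(ad["MyType"], ad["Name"])] = dict(ad)

    def query(self, my_type, constraint=lambda ad: True):
        """Return ads of the given type that satisfy the constraint."""
        return [
            ad
            for (ad_type, _), ad in self._ads.items()
            if ad_type == my_type and constraint(ad)
        ]


collector = ToyCollector()
collector.update({"MyType": "Machine", "Name": "node1", "State": "Unclaimed", "Memory": 64})
collector.update({"MyType": "Machine", "Name": "node2", "State": "Claimed", "Memory": 128})
# Periodic update: node1 reports again and its old ad is overwritten.
collector.update({"MyType": "Machine", "Name": "node1", "State": "Unclaimed", "Memory": 32})

idle = collector.query("Machine", lambda ad: ad["State"] == "Unclaimed")
print(idle)
```

    The final query plays the part of a condor_status request restricted to unclaimed machines.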

  • condor_negotiator

    • Performs matchmaking in Condor.
    • Gets information from the collector about all available machines and all idle jobs.
    • Tries to match jobs with machines that will serve them.
    • Both the job and the machine must satisfy each other's requirements.
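
    Two-sided matching, where the job must accept the machine and the machine must accept the job, can be sketched as below. This is a minimal illustration, not the negotiator's actual algorithm: the function negotiate, the dict layout, and the first-fit strategy are assumptions, and requirements are modelled as Python predicates rather than ClassAd expressions.

```python
def negotiate(jobs, machines):
    """First-fit, two-sided matching (hypothetical sketch).

    A pair matches only when the job's requirements accept the machine
    AND the machine's requirements accept the job.
    """
    matches = {}
    free = list(machines)
    for job in jobs:
        for machine in free:
            if job["requirements"](machine) and machine["requirements"](job):
                matches[job["name"]] = machine["name"]
                free.remove(machine)  # a matched machine is no longer available
                break
    return matches


machines = [
    # m1's owner only accepts jobs submitted by alice.
    {"name": "m1", "Memory": 64, "requirements": lambda job: job["owner"] == "alice"},
    {"name": "m2", "Memory": 16, "requirements": lambda job: True},
]
jobs = [
    {"name": "j1", "owner": "bob", "requirements": lambda m: m["Memory"] >= 32},
    {"name": "j2", "owner": "alice", "requirements": lambda m: m["Memory"] >= 32},
    {"name": "j3", "owner": "bob", "requirements": lambda m: m["Memory"] >= 8},
]
matches = negotiate(jobs, machines)
print(matches)
```

    Here j1 stays idle: m1 has enough memory but rejects bob's jobs, and m2 accepts anyone but is too small.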

  • Job Life Cycle in Condor

    • Job submission: a job is submitted from a host using the condor_submit command.
    • Job request advertising: on receiving a job request, the condor_schedd daemon on the submission host advertises the request to the condor_collector.
    • Resource advertising: each condor_startd daemon running on an execution host advertises the resources available on that host to the condor_collector.

  • Job life cycle (Cont)

    • Resource matching: the condor_negotiator daemon queries the condor_collector daemon to match a resource to a user's job request. It then informs the condor_schedd on the submission host of the matched host.
    • Job execution: the condor_schedd on the submission host interacts with the condor_startd daemon running on the matched host, which spawns a condor_starter daemon. The condor_schedd on the submission host spawns a condor_shadow daemon.
    • Return output: when the job is completed, the results are sent back.
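
    Seen from a single job, the life cycle is a fixed sequence of stages. The sketch below models that sequence as a small state machine. The stage names are labels invented for this illustration, not Condor job states, and resource advertising is left out because it happens on the execution hosts independently of any one job.

```python
from enum import Enum


class Stage(Enum):
    """Hypothetical labels for the steps a job passes through."""
    SUBMITTED = 1           # condor_submit hands the job to condor_schedd
    REQUEST_ADVERTISED = 2  # condor_schedd advertises to condor_collector
    MATCHED = 3             # condor_negotiator reports a matched host
    EXECUTING = 4           # condor_starter runs the job, condor_shadow on the submit side
    OUTPUT_RETURNED = 5     # results are sent back


def next_stage(stage):
    """Return the stage that follows, or raise if the job is already finished."""
    members = list(Stage)
    index = members.index(stage)
    if index + 1 >= len(members):
        raise ValueError("job has already completed its life cycle")
    return members[index + 1]


trace = [Stage.SUBMITTED]
while trace[-1] is not Stage.OUTPUT_RETURNED:
    trace.append(next_stage(trace[-1]))
print([stage.name for stage in trace])
```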

  • Condor Universes

    • A universe in Condor defines an execution environment.
    • Condor can support various combinations of features/environments in different universes.
    • Different universes provide different functionality for your job.

  • Condor Universes

    • Serial jobs
      • Vanilla Universe
      • Standard Universe
    • Scheduler Universe
    • Parallel jobs
      • PVM Universe
      • MPI Universe
    • Java Universe
    • Globus Universe

  • Vanilla universe

    • Intended for programs that cannot be relinked.
    • The existing executable can be used without recompiling or relinking.
    • Cannot use remote system calls.
    • No checkpointing, no migration.
    • Can suspend or restart the job.

  • Standard universe

    • Checkpointing and automatic migration for sequential jobs.
    • The existing program must be relinked with the Condor instrumentation library.
    • The application cannot use some system calls (fork, socket, alarm).
    • Intercepts file operations and passes them back to the shadow process.
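
    The idea behind checkpoint and restart can be shown with a small example. Note that this is a different technique from what the standard universe does: Condor checkpoints the process transparently through the relinked library, whereas the sketch below saves its own state explicitly to a JSON file. The function run, the stop_after parameter (which simulates an eviction), and the file name are all assumptions for illustration.

```python
import json
import os
import tempfile


def run(total_steps, checkpoint_path, stop_after=None):
    """Sum 0..total_steps-1, saving progress after every step.

    Returns the sum when finished, or None if interrupted after
    `stop_after` steps (a simulated eviction from the machine).
    """
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            state = json.load(handle)  # resume from the last checkpoint
    else:
        state = {"step": 0, "acc": 0}

    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            return None
        state["acc"] += state["step"]
        state["step"] += 1
        executed += 1
        with open(checkpoint_path, "w") as handle:
            json.dump(state, handle)  # checkpoint after each unit of work
    return state["acc"]


path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first = run(10, path, stop_after=4)  # evicted after four steps
second = run(10, path)               # a later run picks up at step 4
print(first, second)
```

    The second call could just as well happen on another machine that can read the checkpoint file, which is the essence of migration.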

  • Scheduler Universe

    • The job does not wait to be matched to a machine. Instead, it executes right away on the machine where the job is submitted.
    • Machine requirements are not considered.

  • PVM universe

    • Used to run parallel jobs written for PVM 3.4.

  • MPI universe

    • MPICH programs can be used without any changes.
    • Dynamic changes are not supported.
    • The application cannot be suspended.

  • Java Universe

    • The submitted program runs on any sort of machine with a JVM, regardless of its location, owner, or JVM version.
    • Condor takes care of all the details, such as finding the JVM binary and setting the classpath.

  • Globus Universe

    • Provides the standard Condor interface to Globus users.
    • Each job submission file is translated into Globus RSL.
    • Jobs are submitted to Globus via the GRAM protocol.

  • Submitting a job

    Write a Java class and compile it.

    public class Simple {
        public static void main(String arg[]) {
            ....
        }
    }

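  • Submitting a job (Cont)

    Create a submit file. Name this file submit.java

    Universe   = java
    Executable = Simple.class
    Arguments  = Simple 4 10
    Log        = simple.log
    Output     = simple.out
    Error      = simple.error
    Queue

    In the Java universe, the first argument names the class that contains main, so it must match the compiled class (Simple).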
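
    A submit file like this can also be generated from a script, which helps keep the class name consistent between the Executable and Arguments lines. The helper below is a hypothetical sketch (java_submit_description is not part of Condor). It assumes the compiled class is Simple.class, matching the class Simple from the previous slide.

```python
def java_submit_description(main_class, program_args, stem):
    """Build the text of a Java-universe submit file (hypothetical helper).

    main_class   -- name of the class containing main, e.g. "Simple"
    program_args -- arguments passed to the program after the class name
    stem         -- base name for the log, output, and error files
    """
    settings = {
        "Universe": "java",
        "Executable": f"{main_class}.class",
        # The class name is repeated as the first argument.
        "Arguments": " ".join([main_class] + [str(arg) for arg in program_args]),
        "Log": f"{stem}.log",
        "Output": f"{stem}.out",
        "Error": f"{stem}.error",
    }
    lines = [f"{key} = {value}" for key, value in settings.items()]
    lines.append("Queue")
    return "\n".join(lines) + "\n"


text = java_submit_description("Simple", [4, 10], "simple")
print(text)
```

    Writing the text to submit.java and running condor_submit submit.java would then queue the job.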

  • Submitting a job (Cont)

  • Example job description file

    Universe     = vanilla
    Executable   = foo
    Requirements = Memory >= 32 && OpSys == "LINUX" && Arch == "x86"
    Image_Size   = 28 Meg
    Error        = err.$(Process)
    Input        = in.$(Process)
    Output       = out.$(Process)
    Log          = foo.log
    Queue 150
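
    Queue 150 queues 150 jobs, and $(Process) is replaced by each job's process number, from 0 to 149, so every job gets its own input, output, and error file. The expansion can be imitated in Python. The function name expand_process_macro is hypothetical, and the sketch handles only this one macro.

```python
def expand_process_macro(template, count):
    """Return the file names produced by substituting $(Process) = 0..count-1."""
    return [template.replace("$(Process)", str(number)) for number in range(count)]


inputs = expand_process_macro("in.$(Process)", 150)
outputs = expand_process_macro("out.$(Process)", 150)
print(inputs[0], inputs[-1], len(inputs))
```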

  • Current Limitations

    • Limitations on jobs that can be checkpointed.
    • Jobs need to be relinked to get checkpointing and remote system calls.

  • Summary

    • Special resource management (batch) system.
    • Distributed, heterogeneous system.
    • Goal: exploitation of spare computing cycles.
    • It can migrate jobs from one machine to another.
    • The ClassAds mechanism is used to match resource requirements and resources.

  • References

    This presentation was prepared from the material provided by the Condor Project Team.

    http://www.cs.wisc.edu/condor/
