CSE 160/Berman Grid Computing 2 (http://www.globus.org, http://www.cs.virginia.edu/~legion/, http://www.cs.wisc.edu/condor/) (thanks to shava and holly)


  • Slide 1
  • CSE 160/Berman Grid Computing 2 http://www.globus.org http://www.cs.virginia.edu/~legion/ http://www.cs.wisc.edu/condor/ (thanks to shava and holly [see notes for CSE 225])
  • Slide 2
  • CSE 160/Berman Outline. Today: Condor, Globus, Legion. Next class: talk by Marc Snir, architect of IBM's Blue Gene, Tuesday June 6, AP&M 4301, 1:00-2:00.
  • Slide 3
  • CSE 160/Berman Condor. Condor is a high-throughput scheduler. The main idea is to leverage free cycles on very large collections of privately owned, non-dedicated desktop workstations. The performance measure is throughput of jobs: rather than how fast a particular job can run, how many jobs can complete over a long period of time. Developed by Miron Livny et al. at U. of Wisconsin.
  • Slide 4
  • CSE 160/Berman Condor Basics. Condor = hunter of idle workstations. A Condor pool consists of a large number of privately controlled UNIX workstations (Condor now being ported to NT). WS owners define the conditions under which the WS can be allocated by Condor to an external user. External Condor jobs run while machines are idle. The user does not need a login on participating machines; the job uses remote system calls back to the submitting WS (a minimal submit example is sketched below).
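    A minimal submit description file, as a hedged sketch (the file and executable names are hypothetical): the job is handed to the local Schedd with condor_submit and runs on whichever pool machine Condor allocates, with no login needed there.

        # my_sim.submit -- illustrative submit description file
        universe   = vanilla        # plain job; the standard universe adds checkpointing
        executable = my_sim         # program to run on the allocated workstation
        arguments  = -n 1000
        output     = my_sim.out     # stdout is shipped back to the submitting WS
        error      = my_sim.err
        log        = my_sim.log     # Condor's record of matches, evictions, completions
        queue                       # place one instance of the job in the queue

        condor_submit my_sim.submit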
  • Slide 5
  • CSE 160/Berman Condor Architecture (all machines in same Condor Pool). Each WS runs Schedd and Startd daemons. Startd monitors and terminates jobs assigned by the CM; Schedd queues jobs submitted to Condor at that WS and seeks resources for them. The Central Manager (CM) WS controls allocation and execution for all jobs. (A hedged command-line illustration follows.) [Diagram: Submission Machine (Schedd, Shadow), Central Manager, Execution Machine (Startd, Starter, User Process)]
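    A hedged illustration of how these daemons appear to users: condor_status queries the Central Manager for the Startds currently advertising in the pool, while condor_q lists the jobs queued by the Schedd on the local submit machine.

        condor_status     # machines (Startds) known to the Central Manager, with state and load
        condor_q          # jobs queued by the Schedd on this submission machine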
  • Slide 6
  • CSE 160/Berman Standard Condor Protocol (all machines in same Condor Pool). (1) The Schedd on the submitting machine sends the job context to the CM, and the execution machine sends its machine context to the CM. (2) The CM identifies a match between the job requirements and the execution machine's resources. (3) The CM sends the execution machine ID to the Schedd. (4) The Schedd forks a Shadow process on the submission machine. (5) The Shadow passes the job requirements to the Startd on the execution machine and gets an acknowledgement that the execution machine is still idle. (6) The Shadow sends the executable to the execution machine, where it executes until completion or migration. [Diagram: Submission Machine (Schedd, Shadow), Central Manager, Execution Machine (Startd, Starter, User Process)]
  • Slide 7
  • CSE 160/Berman More Condor Basics. Participating Condor machines are not required to share file systems. No source code changes to the user's code are required to use Condor, but users must re-link their program in order to use checkpointing and migration (vanilla jobs vs. condor jobs; a relinking sketch follows below). Condor jobs are allocated to a good target resource using a matchmaker. Single condor jobs are automatically checkpointed, migrated between WSs, and restarted as needed.
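    A sketch of the relink step mentioned above (file names are hypothetical): no source changes are needed, but linking through condor_compile produces a standard-universe executable that Condor can checkpoint and migrate; a normally linked binary can still be submitted, but only as a vanilla job.

        # relink object files against the Condor library to enable checkpoint/migration
        condor_compile gcc -o my_sim my_sim.o utils.o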
  • Slide 8
  • CSE 160/Berman Condor Remote System Call Strategy. The job must be able to read and write files on its submit workstation. [Diagram: before allocation, the submitted process and its files reside on the Submission WS; after allocation, the process runs on the Execution WS while a Shadow process on the Submission WS handles its file access via remote system calls.]
  • Slide 9
  • CSE 160/Berman Condor Matchmaking. The matchmaking mechanism matches job specs to machine characteristics, and is done using classads. Resources produce resource offer ads, which include information such as available RAM, CPU type and speed, virtual memory size, physical location, current load average, etc. Jobs provide a resource request ad, which defines the required and desired set of resources to run on. Condor acts as a broker which matches and ranks resource offer ads against resource request ads, making sure that all requirements in both ads are satisfied. Priorities of users and certain types of ads are also taken into consideration. (A hedged classad sketch follows below.)
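    A hedged sketch of the two ads being matched (attribute values are made up). The machine's resource offer ad advertises its characteristics and the owner's policy; the job's resource request ad states Requirements that must hold and a Rank used to order acceptable machines.

        # resource offer ad (published for the machine by its Startd)
        Arch         = "INTEL"
        OpSys        = "LINUX"
        Memory       = 128                 # MB of RAM
        LoadAvg      = 0.05
        Requirements = LoadAvg < 0.3       # owner policy: accept jobs only when nearly idle
        Rank         = 0

        # resource request ad (from the job)
        Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && (Memory >= 64)
        Rank         = Memory              # among acceptable machines, prefer more RAM

    The matchmaker pairs an offer with a request only when both Requirements expressions are satisfied against the other ad, then uses Rank (and user priorities) to choose among the candidates.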
  • Slide 10
  • CSE 160/Berman Condor Checkpointing. When the WS owner returns, the job can be checkpointed and restarted on another WS. A periodic checkpoint feature can checkpoint the job at intervals so that work is not lost should the job be migrated. Condor jobs vs. vanilla jobs: Condor job executables must be relinked and can be checkpointed, migrated, and restarted; vanilla jobs are not relinked and cannot be checkpointed or migrated.
  • Slide 11
  • CSE 160/Berman Condor Checkpointing Limitations. Only single-process jobs are supported. Inter-process communication is not supported (socket, send, recv, etc. not implemented). All file operations must be idempotent (read-only and write-only access work correctly; reading and writing the same file may not). Disk space must be available to store the checkpoint file on the submitting machine; each checkpointed job has an associated checkpoint file which is approximately the size of the address space of the process.
  • Slide 12
  • CSE 160/Berman Condor-PVM and Parallel Jobs. PVM master/slave jobs can be submitted to a Condor pool (special condor-pvm universe). The master runs on the machine where the job was submitted; slaves are pulled from the Condor pool as they become available. Condor acts as the resource manager for the pvm daemon: whenever the pvm program asks for nodes, the request is remapped to Condor, which finds a machine in the pool and adds it to the pvm virtual machine. (A hedged master-side sketch follows below.)
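    A hedged C sketch of the master side of a condor-pvm job (the slave binary name and task count are hypothetical). The master simply asks PVM to spawn tasks; under the condor-pvm universe that request is remapped to Condor, which adds idle pool machines to the virtual machine as they become available.

        #include <stdio.h>
        #include "pvm3.h"

        int main(void)
        {
            int tids[8];
            int mytid = pvm_mytid();   /* enroll the master in the PVM virtual machine */

            /* request 8 slave tasks; under condor-pvm the resource request is
               forwarded to Condor, which allocates machines from the pool */
            int started = pvm_spawn("my_slave", NULL, PvmTaskDefault, "", 8, tids);
            printf("master t%x started %d slaves\n", mytid, started);

            /* ... send work to the slaves and gather results here ... */

            pvm_exit();                /* leave the virtual machine */
            return 0;
        }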
  • Slide 13
  • CSE 160/Berman Condor and the Grid. Condor and the Alliance: Condor is one of the Grid technologies deployed by the Alliance and is used for production high-throughput computing by partners. Condor and Globus: Globus can use Condor as a local resource manager; Globus RSL specs are translated into matchmaker classads.
  • Slide 14
  • CSE 160/Berman Condor and the Grid: Flock of Condors. Aggregating Condor pools into a flock enables Condor pools to cross load-sharing and protection boundaries; a Condor flock may include Condor pools connected by wide-area networks. Infrastructure: the idea is to add a Gateway (GW) machine for every pool. Gateway machines act as resource brokers for machines external to a pool; in the published description, the GW machine presents randomly chosen external pools/machines. The CM does not need to know about flocking. Each GW machine runs GW-startd and GW-schedd, analogous to the daemons within a single Condor pool.
  • Slide 15
  • CSE 160/Berman Flocking Protocol (machines in different pools). [Diagram: Submission Pool with Submission Machine (Schedd, Shadow), Central Manager, and Gateway Machine (GW-Startd, GW-Startd child); Execution Pool with Gateway Machine (GW-Schedd, GW-Simulate Shadow), Central Manager, and Execution Machine (Startd, Starter, User Process).]
  • Slide 16
  • CSE 160/Berman Globus. Globus is an integrated toolkit of Grid services, developed by Ian Foster (ANL/UC) and Carl Kesselman (USC/ISI). "Bag of services" model: applications can use Grid services without having to adopt a particular programming model.
  • Slide 17
  • CSE 160/Berman Core Globus Services. Resource allocation and process management (GRAM, DUROC, RSL); information infrastructure (MDS); security (GSI); communication (Nexus); remote access (GASS, GEM); fault detection (HBM); QoS (GARA, Gloperf).
  • Slide 18
  • CSE 160/Berman Globus Layered Architecture. [Diagram, top to bottom: Applications; High-level Services and Tools (DUROC, globusrun, MPI, Nimrod/G, MPI-IO, CC++, GlobusView, Testbed Status); Core Services (Metacomputing Directory Service, GRAM, Globus Security Interface, Heartbeat Monitor, Nexus, Gloperf, GASS); Local Services (LSF, Condor, MPI, NQE, Easy, TCP, UDP, Solaris, Irix, AIX).]
  • Slide 19
  • CSE 160/Berman Globus Resource Management Services. Resource Management services provide the mechanism for remote job submission and management. Three low-level services: GRAM (Globus Resource Allocation Manager) provides remote job submission and management; DUROC (Dynamically Updated Request Online Co-allocator) provides simultaneous job submission and layers on top of GRAM; RSL (Resource Specification Language) is the language used to communicate resource requests (a hedged RSL example follows below).
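    A hedged RSL example (hostnames and values are made up). A simple request names the executable and resource needs and can be handed to a single GRAM with globusrun; the '+' multirequest form is what DUROC co-allocates across several resource managers simultaneously.

        # submit a simple RSL request to one GRAM
        globusrun -r gram.site-a.edu "&(executable=/bin/myapp)(count=4)(maxTime=30)"

        # multirequest: simultaneous allocation on two resource managers (DUROC)
        +( &(resourceManagerContact="gram.site-a.edu")(count=8)(executable=/bin/myapp) )
         ( &(resourceManagerContact="gram.site-b.edu")(count=16)(executable=/bin/myapp) )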
  • Slide 20
  • CSE 160/Berman Globus Resource Management Architecture. [Diagram: an Application issues RSL; a Broker specializes the RSL, using queries & info from the Information Service; a Co-allocator passes ground RSL (or the application passes simple ground RSL directly) to GRAMs, which drive local resource managers such as LSF, EASY-LL, and NQE.]
  • Slide 21
  • CSE 160/Berman Globus Information Infrastructure: MDS (Metacomputing Directory Service). MDS stores information about entries, where an entry is some type of object (organization, person, network, computer, etc.). An object class associated with each entry describes the set of entry attributes. LDAP (Lightweight Directory Access Protocol) is used to store information about resources; LDAP provides a hierarchical, tree-structured information model defining the form and character of the information. (An illustrative entry is sketched below.)
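    An illustrative, hypothetical entry of the kind MDS stores for a compute resource; the DN components, object class, and attribute names below are made up for illustration (actual MDS schemas differ), and the query simply uses standard LDAP tooling.

        # hypothetical directory entry for a host
        dn: hn=compute1.site.edu, ou=Computers, o=ExampleSite, o=Grid
        objectclass: GridComputeResource
        hn: compute1.site.edu
        cpuType: sparc
        cpuCount: 16
        freeMemory: 512

        # query the directory server holding the entry
        ldapsearch -h mds.site.edu -p 389 -b "o=Grid" "(hn=compute1.site.edu)"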
  • Slide 22
  • CSE 160/Berman Globus Security Service: GSI (Grid Security Infrastructure). Provides a public key-based security system that layers on top of local site security. The user is identified to the system using an X.509 certificate containing info about the duration of permissions, the public key, and the signature of the certificate authority; the user also holds a private key. Provides users with single sign-on access to the various sites to which they are authorized.
  • Slide 23
  • CSE 160/Berman More GSI. The resource management system uses GSI to establish which machines a user may access. GSI allows for proxies, so the user only needs to log on once rather than logging on to every machine involved in a distributed computation; proxies are used for short-term authentication rather than long-term use. (A hedged command-line sketch follows below.)
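    A hedged command-line sketch of single sign-on with GSI proxies (the lifetime and contact string are illustrative): the user authenticates once with the long-term X.509 credential to create a short-lived proxy, and later Globus requests authenticate with the proxy instead of prompting again.

        grid-proxy-init -hours 12    # sign a short-lived proxy certificate with the user's private key
        grid-proxy-info              # inspect the proxy's subject, issuer, and remaining lifetime
        globusrun -r gram.site-a.edu "&(executable=/bin/hostname)"   # authenticates via the proxy
        grid-proxy-destroy           # remove the proxy when finished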
  • Slide 24
  • CSE 160/Berman Globus Communication Services. Nexus: a communication library which provides asynchronous RPC, multi-method communication, data conversion, and multi-threading facilities. I/O: a low-level communication library which provides a thin wrapper around TCP,
