The Grid, Globus Toolkit, Condor-G and What?


Slide 1

The Grid, Globus Toolkit, Condor-G and What?

Slide 2
Outline
1. The Grid Problem
2. The Grid Architecture
3. The Globus Toolkit
4. Condor-G

Slide 3
Culled From
/ http://www.globus.org
/ The Anatomy of the Grid: Enabling Scalable Virtual Organizations. I. Foster, C. Kesselman, S. Tuecke. International J. Supercomputer Applications, 15(3), 2001
/ Globus: A Metacomputing Infrastructure Toolkit. I. Foster, C. Kesselman. Intl J. Supercomputer Applications, 11(2):115-128, 1997
/ Condor-G: A Computation Management Agent for Multi-Institutional Grids. J. Frey, T. Tannenbaum, I. Foster, M. Livny, S. Tuecke. Proc. of HPDC10, 2001
/ http://charm.cs.uiuc.edu/ppl_research/faucets/

Slide 4
"The Grid"
/ Term coined in the 1990s to denote a proposed distributed computing architecture.
/ "Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources" (from "The Anatomy of the Grid")
/ Resource sharing
    Computers, storage, sensors, networks, scientific instruments
    Sharing is highly controlled: providers and consumers define what is shared, who is allowed to share, and the conditions for sharing
/ Coordinated problem solving
    Beyond client-server: distributed data analysis, visualization, computation, collaboration
/ Similar to the power grid, faucets (water supply), and the nationwide phone system.

Slide 5
Virtual Organization
/ A set of individuals and/or institutions defined by some set of sharing rules forms a Virtual Organization (VO).
/ A VO may contain
    Application Service Providers (ASPs)
    Storage Service Providers (SSPs)
    Cycle providers
/ VOs lack
    Central control
    A central location
    Existing trust relationships

Slide 6
Why Grids?
/ Civil engineers collaborate to design, execute & analyze shake table experiments
/ Climate scientists visualize, annotate & analyze terabytes of simulation datasets
/ An emergency response team couples real-time data, weather models, and population data
/ An application service provider purchases cycles from a compute cycle provider
/ A biomedical engineer exploits 10K computers to screen 100K compounds in an hour
/ NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, and each other (Argonne, NCSA, Michigan, UIUC, USC)

Slide 7
Online Access to Scientific Instruments
/ DOE X-ray grand challenge: ANL, USC/ISI, NIST, U.Chicago
[Diagram labels: Advanced Photon Source; real-time collection; tomographic reconstruction; archival storage; wide-area dissemination; desktop & VR clients with shared controls]

Slide 8
Data Grids for High Energy Physics
(Image courtesy Harvey Newman, Caltech)
[Diagram labels]
    Tiers: Tier 0, Tier 1, Tier 2, Tier 4
    Sites: Online System; Offline Processor Farm ~20 TIPS; CERN Computer Centre; FermiLab ~4 TIPS; France, Italy, and Germany Regional Centres; Tier2 Centres ~1 TIPS each (including Caltech ~1 TIPS); Institutes ~0.25 TIPS; physics data cache; physicist workstations
    Link rates: ~PBytes/sec; ~100 MBytes/sec; ~622 Mbits/sec or Air Freight (deprecated); ~622 Mbits/sec; ~1 MBytes/sec
/ There is a bunch crossing every 25 nsecs.
/ There are 100 triggers per second.
/ Each triggered event is ~1 MByte in size.
/ Physicists work on analysis "channels". Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server.
/ 1 TIPS is approximately 25,000 SpecInt95 equivalents.

Slide 9
Is it Really NEW?
/ Grid computing has much in common with existing industrial thrusts
    Application & storage service providers (ASPs, SSPs)
    Internet & peer-to-peer computing
    Enterprise computing systems
    Business-to-business exchanges
/ SSPs & ASPs allow organizations to outsource storage & computing requirements to other parties (typically via a VPN)
/ Enterprise distributed computing technologies (CORBA, Enterprise Java) enable resource sharing within a single organization
/ Business-to-business exchanges & virtual enterprise technologies focus on information sharing via central servers.

Slide 10
Is it NEW?
/ Sharing is not adequately addressed by these technologies.
    Complicated requirements: run program X at site Y subject to community policy P, using data at Z according to policy Q
    High performance requirements of most of the applications
    Users may not care where their program runs, as long as it satisfies their QoS requirements (Faucets)
/ Controlled, dynamic sharing within VOs
/ Current technology either doesn't accommodate the range of resource types or doesn't provide the flexibility and control over sharing relationships needed to establish VOs

Slide 11
But Why Now???
/ The Internet & the increasing use of wireless devices provide universal connectivity.
/ Many current research projects need teamwork and collaboration.
/ Network vs. computer performance
    Computer speed doubles every 18 months
    Network speed doubles every 9 months
[Graph: Moore's Law vs. storage improvements vs. optical improvements. From Scientific American (Jan-2001) by Cleo Vilett, source Vinod Khosla, Kleiner, Caufield and Perkins.]

Slide 12
Major Grid Projects
Name | URL & Sponsors | Focus
BlueGrid | IBM | Grid testbed linking IBM laboratories
DISCOM | www.cs.sandia.gov/discom; DOE Defense Programs | Create operational Grid providing access to resources at three U.S. DOE weapons laboratories
DOE Science Grid | sciencegrid.org; DOE Office of Science | Create operational Grid providing access to resources & applications at U.S. DOE science laboratories & partner universities
Earth System Grid (ESG) | earthsystemgrid.org; DOE Office of Science | Delivery and analysis of large climate model datasets for the climate research community
European Union (EU) DataGrid | eu-datagrid.org; European Union | Create & apply an operational grid for applications in high energy physics, environmental science, bioinformatics

Slide 13
Major Grid Projects
Name | URL/Sponsor | Focus
EuroGrid, Grid Interoperability (GRIP) | eurogrid.org; European Union | Create technology for remote access to supercomputer resources & simulation codes; in GRIP, integrate with Globus Toolkit
Fusion Collaboratory | fusiongrid.org; DOE Office of Science | Create a national computational collaboratory for fusion research
Globus Project | globus.org; DARPA, DOE, NSF, NASA, Microsoft | Research on Grid technologies; development and support of Globus Toolkit; application and deployment
GridPP | gridpp.ac.uk; U.K. eScience | Create & apply an operational grid within the U.K. for particle physics research
Information Power Grid | Ipg.nasa.gov | Create & apply a production Grid for aero sciences and other NASA missions
Grid Research Integration Dev. & Support Center | grids-center.org; NSF | Integration, deployment, support of the NSF Middleware Infrastructure for research & education

Slide 14
www.teragrid.org

Slide 15
iVDGL: International Virtual Data Grid Laboratory (www.ivdgl.org)
[Map legend: Tier0/1 facility; Tier2 facility; Tier3 facility; 10 Gbps link; 2.5 Gbps link; 622 Mbps link; other link]

Slide 16
Grid Architecture

Slide 17
Why Do We Need It?
/ To structure the development of new technology
    Common vocabulary, guidance, perspective
/ To identify the fundamental system components
    Specify the purpose and function of these components
    Indicate how these components interact
/ Emphasizes identification and definition of protocols and services, APIs and SDKs
/ A protocol architecture: a mechanism for VO users and resources to negotiate and manage sharing relationships
/ Facilitates
    Extensibility
    Interoperability

Slide 18
/ Why is interoperability a fundamental concern?
/ Standard protocols enable standard services that abstract away resource-specific details.
/ To accelerate application development in complex and dynamic execution environments we need APIs and SDKs
/ The technology + services together form the middleware

Slide 19
/ Analogy to the Internet architecture
/ The high-level description places few constraints on design & implementation
Layered Grid Architecture (top to bottom)
    Application
    Collective: "coordinating multiple resources"; ubiquitous infrastructure services, app-specific distributed services
    Resource: "sharing single resources"; negotiating access, controlling use
    Connectivity: "talking to things"; communication (Internet protocols) & security
    Fabric: "controlling things locally"; access to, & control of, resources
Internet Protocol Architecture (top to bottom)
    Application
    Transport
    Internet
    Link

Slide 20
Features
/ Open and extensible
    Built on Internet protocols & services
    Communication, routing, name resolution, etc.
    Layering is conceptual
    No constraints on who can call what
    Protocols/services/APIs/SDKs will (ideally) be largely self-contained
    Things like communication and security are very fundamental
    Advantageous for higher-layer functions to use common lower-level functions

Slide 21
The Hourglass Model
[Diagram labels: Applications; diverse global services; core services; local OS]
/ The resource and connectivity layers form the neck of the hourglass
/ Designed to be implemented on top of a diverse range of resource types (fabric layer)
/ Can be used to construct a wide range of global and application-specific services (collective layer)

Slide 22
What's the Status?
/ No official standards yet.
/ Globus Toolkit is the unofficial standard for several connectivity, resource, and collective protocols
/ Global Grid Forum (GGF) has a Grid Protocol Architecture group
/ In security, some RFCs are available
/ In scheduling & resource management, some working documents are available

Slide 23
Globus Toolkit
/ A software toolkit (GTK) that addresses key issues to pave the road
/ Offers a set of technologies (NCSA's Grid-in-a-box)
/ Tries to standardize the Grid protocols and APIs
/ Open architecture & open source (reference implementation)
/ Defines Grid protocols & APIs
/ Integrates and extends existing protocols
/ Provides software tools that make it easier to build computational grids and grid-based applications
/ Learns from the experience gained through deployment and applications

Slide 24
Key Components
/ Security
/ Communication
/ Information Infrastructure
/ Fault Detection
/ Resource Management
/ Portability
/ Data Management

Slide 25
Fabric Layer
/ What do you expect? Diverse resources that may be shared
/ Can be a logical entity such as a distributed file system or computer cluster, but the Grid architecture doesn't care
/ Components implement the local, resource-specific operations that occur on specific resources (physical or logical)
/ Trade-off (rich fabric functionality vs. ease of deployment)
    Richer functionality: more sophisticated sharing operations (e.g., reservation)
    Fewer demands: simplified Grid infrastructure deployment
    Should implement enquiry mechanisms and resource management

Slide 26
Fabric in Globus Toolkit
/ Is designed to use existing fabric components
/ If a vendor doesn't provide the necessary behavior, GTK includes the missing piece
/ Resource management is generally the domain of local resource managers.
/ GARA (General-purpose Architecture for Reservation and Allocation) can provide QoS for different types of resources

Slide 27
Connectivity
/ Communication
    Enables exchange of data between fabric layer resources
    Includes transport, routing, naming (TCP, IP, DNS)
/ Authentication
    Builds on communication protocols
    Uniform authentication, authorization and message protection in multi-institutional scenarios
    Based on existing standards whenever possible
/ Various requirements
    Single sign-on (access to multiple Grid resources without further user intervention)
    Delegation (a user's program is able to access the resources on which the user is authorized)
    Integration with various local security solutions (identity mapping)
    User-based trust relationships (must not require the various resource providers to cooperate in configuring the security environment)
    The stakeholder should have final control over authorization decisions

Slide 28
Security in GTK
/ Grid Security Infrastructure (GSI) is based on
    Public key X.509 certificates
    SSL/TLS communication
    GSS-API (Generic Security Service API)
/ Extensions are added for single sign-on and delegation
/ Stakeholder control is supported via GAA (Generic Authorization and Access).
/ GSI adheres to the IETF's standard GSS-API

Slide 29
GSI Example Scenario: Create Processes at A and B that Communicate & Access Files at C
[Diagram labels]
    Sites: Site A (Kerberos); Site B (Unix); Site C (Kerberos)
    User side: User; single sign-on via "grid-id" & generation of proxy credential, or retrieval of proxy credential from an online repository; User Proxy; proxy credential
    Requests: remote process creation requests*; remote file access request*; communication*
    GSI-enabled GRAM servers (two): authorize; map to local id; create process; generate credentials
    GSI-enabled FTP server: authorize; map to local id; access file
    Other labels: Computer (two); Storage system; Process (two); Kerberos ticket; local id; restricted proxy (two)
* With mutual authentication

Slide 30
Resource Layer
/ Concerned entirely with individual resources
/ Addresses
    Resource discovery
    Reservation & allocation
    Monitoring & control
    Secure negotiation
/ Two primary classes of protocols
    1. Information protocols: to obtain information about the structure & state of a resource
    2. Management protocols: policy application point
/ Because these protocols sit in the neck of the hourglass, their number should be limited to a small and focused set.

Slide 31
Resource Layer Components in GTK
/ GRIP (Grid Resource Information Protocol)
    Based on LDAP
    Defines standard resource information
/ GRRP (Grid Resource Registration Protocol)
    Registers resources with Grid Index Information Servers
/ GRAM (Grid Resource Access and Management)
    HTTP-based RPC
    Used for allocation of computational resources and for monitoring and control of computation on those resources
/ GridFTP
    Gives grid applications secure, ubiquitous, high-performance access to data
    Uses the GSI for authentication
    New extensions to the FTP protocol for parallel data transfer, partial file transfer, and third-party (server-to-server) data transfer

Slide 32
Collective Layer
/ Coordinates multiple resources
/ Protocols & services (APIs, SDKs) are not associated with any one resource but rather are global in nature
/ Captures interactions across collections of resources
/ Examples:
    Index servers, aka metadirectory services: custom views on dynamic resource collections
    Resource brokers (e.g., Condor-G Matchmaker, AppLeS, Nimrod-G, DRM broker): resource discovery & allocation
    Replica catalogs & services
    Co-reservation & co-allocation services
    Software discovery services: select the best software implementation and platform based on the problem parameters (e.g., NetSolve, Ninf)
    Community accounting and payment services
    Collaboratory services (e.g., CAVERNsoft)

Slide 33
Information Infrastructure in GTK
/ An infrastructure that provides coherent system information spanning virtual organizations is necessary
/ MDS (Metacomputing Directory Service)
    Uses LDAP (Lightweight Directory Access Protocol)
    Provides a uniform means of querying system information from a rich variety of system components
    Provides a uniform namespace for resource information across a system that may involve many organizations
/ GRIS (Grid Resource Information Service)
    A uniform means of querying resources for their current configuration, capabilities, and status
/ GIIS (Grid Index Information Service)
    A means of knitting together arbitrary GRIS services to provide a coherent system image
    GIISes provide a mechanism for identifying "interesting" resources

Slide 34
Resource Management Services in GTK
/ Three main components
    1. RSL (Resource Specification Language): communicates resource requirements
    2. GRAM (Grid Resource Allocation Management): standardized interface to the various local resource management tools (e.g., Condor, LSF, PBS)
    3. DUROC (Dynamically-Updated Request Online Co-allocator): coordinates a single request that may span multiple GRAMs
/ A resource broker handles the mapping of high-level application requests into requests to individual resource managers

Slide 35
Resource Management Architecture
[Diagram labels: Application; RSL; Broker; RSL specialization; Information Service; Queries & Info; Co-allocator; simple ground RSL; GRAM; local resource managers (LSF, Condor, PBS)]

Slide 36
GRAM Components
[Diagram labels: Grid Security Infrastructure; Job Manager; GRAM client API calls to request resource allocation and process creation]
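To make the RSL requests named on Slide 34 more concrete, here is a minimal Python sketch that assembles a job description as a conjunction of `(attribute=value)` relations prefixed by `&`, and a multi-request prefixed by `+` of the kind a co-allocator could split across several GRAMs. The helper names (`rsl_relation`, `rsl_conjunction`, `rsl_multirequest`), the example attribute values, and the host names are hypothetical and chosen only for illustration. The quoting rule is deliberately simplified and does not cover the full RSL grammar.

```python
def rsl_relation(attribute, value):
    """Render one (attribute=value) relation.

    Simplified assumption: values containing whitespace or parentheses
    are wrapped in double quotes; embedded double quotes are rejected.
    """
    text = str(value)
    if '"' in text:
        raise ValueError("embedded double quotes are not handled in this sketch")
    if any(ch.isspace() or ch in "()" for ch in text):
        text = '"' + text + '"'
    return f"({attribute}={text})"


def rsl_conjunction(requirements):
    """Join relations into a single request: &(a=1)(b=2)..."""
    return "&" + "".join(rsl_relation(k, v) for k, v in requirements.items())


def rsl_multirequest(requests):
    """Join several conjunctions into a multi-request: +(&...)(&...)"""
    return "+" + "".join(f"({rsl_conjunction(r)})" for r in requests)


# Hypothetical single-resource job description.
job = {"executable": "/bin/hostname", "count": 4}
print(rsl_conjunction(job))

# Hypothetical request spanning two resource managers.
parts = [
    {"resourceManagerContact": "hostA.example.org", "count": 1},
    {"resourceManagerContact": "hostB.example.org", "count": 2},
]
print(rsl_multirequest(parts))
```

Building the string from a dictionary keeps the requirement list in one place, so a broker-like component could add or specialize attributes before the request is handed on.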
[GRAM components diagram, further labels: MDS client API calls to locate resources; query current status of resource; create; RSL library; parse; request; allocate & create processes; process; monitor & control; site boundary; Client; MDS: Grid Index Info Server; Gatekeeper; MDS: Grid Resource Info Server; Local Resource Manager; MDS client API calls to get resource info; GRAM client API state change callbacks]

Slide 37
GRAM
/ GRAM-1: HTTP-based RPC
/ GRAM-1.5 (reliability improvements)
    Once-and-only-once submission
    Reliable termination detection
/ GRAM-2: towards integration with web services (SOAP) and OGSA (Open Grid Services Architecture)
/ Gatekeeper
    Single point of entry
    "Secure inetd"
/ Job Manager
    Layers on top of the local resource management system (e.g., PBS, LSF, etc.)
    Handles remote interaction with the job

Slide 38
Data Management in GTK
/ "Access to distributed data is typically as important as access to distributed computational resources" (Globus)
/ Tools for managing data in Grid systems and applications
/ Also called Data Grid.
/ GridFTP
/ Data replication
    Two tools for managing data replicas: multiple copies of data stored in different systems to improve access across geographically distributed Grids
    Replica Catalog: based on an LDAP directory
    Replica Manager: combines the Replica Catalog with GridFTP to manage data replication
/ GASS (Global Access to Secondary Storage)
    Allows applications to access data stored in any remote file system by specifying a URL
    The URL can be HTTP or FTP

Slide 39
Condor-G
A Computation Management Agent for Multi-Institutional Grids

Slide 40
What is Condor-G?
/ Condor enhanced with Globus Toolkit components to harness multi-domain resources as if they all belonged to one personal domain
/ An example of applying the general-purpose GTK components to solve a particular problem (i.e., high-throughput computing on the Grid)

Slide 41
Separation of Concerns
1. Remote Resource Access
    Secure remote resource discovery, allocation, management
    Uses Globus Toolkit components
2. Computation Management
    Via a user computation management agent responsible for job submission, job management, error recovery
    Taken from the Condor system
3. Remote Execution
    Via mobile sandboxing to create a user-tailored execution environment on a remote node
    Taken from the Condor system

Slide 42
Why Condor for Grid Jobs??
/ Advantages of using Condor-G to manage Grid jobs
    Can query a job's status or cancel a job
    Credential management
    Get informed of job termination or problems via callbacks or asynchronous mechanisms such as email
    Access to detailed logs with the complete history of the job's execution
    Fault tolerance and exactly-once semantics

Slide 43
Job Execution
/ Stages a job's standard I/O and executable using GASS
/ Submits jobs to remote machines using revised GRAM (1.5)
    Job manager checkpoint & restart
    Two-phase commit during job submission
/ Monitors job status & recovers from remote failures using revised GRAM callbacks and status calls
/ Condor-G handles resubmission of failed jobs, communications with the user concerning unusual & erroneous conditions, and recording of computation status to support restart

Slide 44
Execution Mechanism

Slide 45
Fault Tolerance
/ Tolerates four types of failure
Local crash
    1. Crash of the host on which the GridManager is running (or crash of the GridManager alone)
        Queue state is stored on disk
        Reconnect to the JobManagers that were running at the time of the crash
Remote crash
    2. Crash of the Globus JobManager
        Start a new JobManager
    3. Crash of the machine that hosts the remote resource (Gatekeeper, JobManager)
        Wait until connectivity returns
        Start a new JobManager
Network failures

Slide 46
Credential Management
/ Authentication is done with limited-lifetime X.509 proxies
/ A credential may expire before the job finishes execution
/ The Condor-G agent periodically analyzes the credentials of all users with currently queued jobs
/ Can put jobs on hold and e-mail the user to refresh the proxy
/ Can forward new credentials to execution sites
/ Can use the MyProxy system, which lets a user store a long-lived proxy credential on a secure server
    Remote services acting on behalf of the user can obtain short-lived proxies
    Condor-G can use these to refresh the user credential automatically

Slide 47
Resource Discovery & Scheduling
/ Simple: user supplies a list of GRAM servers
/ Resource broker in the Condor-G agent: Condor Matchmaking
/ Flood candidate resources: GlideIns

Slide 48
GlideIn Mechanism
/ Uses the same Condor mechanism to start on a remote node not a user job, but a daemon
/ The daemon traps system calls made by the user's job and redirects them back to the originating system
/ Periodically checkpoints the job and migrates it to another location if requested
/ The Condor-G GlideIn mechanism uses Grid protocols to dynamically create a personal Condor pool out of Grid resources by "gliding in" Condor daemons to the remote resources
/ Allows the binding of an application to a resource to be delayed
    Prevents queuing delays
    Can guarantee optimal queuing times to users

Slide 49
Remote Execution via GlideIn

Slide 50
Grid Architecture in Practice
Layer | Example
App | High Throughput Computing System
Collective (App) | Dynamic checkpoint, job management, failover, staging
Collective (Generic) | Brokering, certificate authorities
Resource | Access to data, access to computers, access to network performance data
Connect | Communication, service discovery (DNS), authentication, authorization, delegation
Fabric | Storage systems, schedulers
[Diagram labels: Compute Resource (SDK, API, Access Protocol); Checkpoint Repository (SDK, API, C-point Protocol)]

Slide 51
Future & Conclusions
/ Evolution
    Past-Present: O(10^2) computers; Mb/s networks; local (centralized) control
    Present: O(10^4-10^6) data systems, computers; Gb/s networks; restricted decentralized control
    Future: O(10^6-10^9) data, sensors, computers, instruments; highly flexible policy, control
/ A computer (including software) is a dynamically, often collaboratively constructed collection of processors, data sources, networks, sensors, instruments
/ Open the faucet, get the water; connect to the Grid, get the compute power
/ We need a powerful computational economy model (bidding systems, new optimization algorithms)

Slide 52
Summary
/ The Grid Problem: resource sharing & coordinated problem solving in dynamic multi-institutional virtual organizations
/ Grid Architecture: emphasizes protocol and service definition to enable interoperability and resource sharing
/ Globus Toolkit: a source of protocols and APIs, reference implementation
/ Condor-G: applies the general-purpose Globus Toolkit to solve high-throughput computing on the Grid

Slide 53
Thanks for Your Patience
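
Supplementary sketch: the credential-management policy on Slide 46 (periodically examine proxies, refresh them where a stored credential exists, otherwise hold the job and notify the user) could look roughly like the Python below. This is only an illustration under stated assumptions. The `QueuedJob` record, the `review_credentials` function, the `has_stored_credential` flag, and the one-hour warning window are hypothetical and are not Condor-G's actual data structures or policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class QueuedJob:
    job_id: str
    owner: str
    proxy_expires: datetime        # expiry time of the job's proxy credential
    has_stored_credential: bool    # hypothetical: user stored a long-lived credential


def review_credentials(jobs, now, warn_before=timedelta(hours=1)):
    """Return {job_id: action}, where action is 'ok', 'refresh' or 'hold'.

    'refresh': a stored long-lived credential exists, so a fresh
               short-lived proxy could be obtained automatically.
    'hold':    nothing to refresh from; hold the job and notify the owner.
    """
    actions = {}
    for job in jobs:
        remaining = job.proxy_expires - now
        if remaining > warn_before:
            actions[job.job_id] = "ok"
        elif job.has_stored_credential:
            actions[job.job_id] = "refresh"
        else:
            actions[job.job_id] = "hold"
    return actions


# Hypothetical queue state, evaluated at a fixed point in time.
now = datetime(2001, 8, 1, 12, 0)
queue = [
    QueuedJob("j1", "alice", now + timedelta(hours=6), False),
    QueuedJob("j2", "bob", now + timedelta(minutes=30), True),
    QueuedJob("j3", "carol", now - timedelta(minutes=5), False),
]
for job_id, action in review_credentials(queue, now).items():
    print(job_id, action)
```

Passing `now` in explicitly rather than reading the clock inside the function keeps the policy easy to test; an agent would call it on each periodic scan of its queue.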