1. INTRODUCTION: Grid user requirements. Users want to be able to
discover, acquire, and reliably manage computational resources
dynamically, in the course of their everyday activities. They do not
want to be bothered with the location of these resources, the
mechanisms required to use them, keeping track of the status of
computational tasks running on these resources, or reacting to
failure. They do care about how long their tasks are likely to run
and how much these tasks will cost.
Slide 4
Solution: Condor-G. Leverages software from Globus and Condor.
Allows the user to control multi-domain resources as if they all
belonged to one personal domain. Globus Toolkit: inter-domain
resource management protocols. Condor: intra-domain resource
management methods.
Slide 5
2. Large-scale sharing of computational resources. How to build
and manage a multi-site computation that uses resources that belong
to different sites? DIFFICULTIES: Different sites may feature
different authentication and authorization mechanisms, schedulers,
hardware architectures, operating systems, file systems, etc. The
user has little knowledge of the characteristics of resources at
remote sites, and no easy means of obtaining this information. Due
to the distributed nature of the multi-site computing environment,
computers, networks, and subcomputations can fail in various ways.
Keeping track of the status of different elements of a computation
involves tedious bookkeeping, especially in the event of failure
and dependencies among subcomputations.
Slide 6
2. Large-scale sharing of computational resources. How to build
and manage a multi-site computation that uses resources that belong
to different sites? APPROACH: Remote resource access issues are
addressed by requiring that remote resources speak standard
protocols for resource discovery and management. Computation
management issues are addressed via the introduction of a robust,
multi-functional user computation management agent responsible for
resource discovery, job submission, job management, and error
recovery. (From Condor.) Remote execution environment issues are
addressed via the use of mobile sandboxing technology that allows a
user to create a tailored execution environment on a remote
node.
Slide 7
3. Grid Protocols - Outline. Protocols used in the Condor-G
system: 3.1. GSI (Grid Security Infrastructure) 3.2. GRAM (Grid
Resource Allocation and Management) 3.3. MDS-2 (Monitoring and
Discovery Service) 3.4. GASS (Global Access to Secondary
Storage)
Slide 8
3.1. GSI: the Globus Toolkit's Grid Security Infrastructure. Makes
it possible to authenticate a user just once. Uses Public Key
Infrastructure (PKI). GSI employs the user's private key to create a
proxy credential, which serves as a new private-public key pair
that allows a proxy (such as the Condor-G agent) to make remote
requests on behalf of the user.
Slide 9
3.2. GRAM protocol: Grid Resource Allocation and Management.
The GRAM protocol supports remote submission, monitoring, and
control of a computational request (e.g., "run program P") to a
remote computational resource. Uses GSI for authentication and
authorization. Uses two-phase commit (request sequence numbers and
a commit command). Logs details of all active jobs
(useful for crash recovery).
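The two-phase commit idea can be sketched roughly as follows. This is a toy, in-memory illustration and not the GRAM wire protocol: the class `ToyGramServer` and its methods `submit` and `commit` are hypothetical names, and the rule that a repeated sequence number counts as a retransmission is an assumption made for the example.

```python
# Toy illustration of two-phase job submission with sequence numbers.
# All names are hypothetical; this is not the GRAM API.

class ToyGramServer:
    def __init__(self):
        self.pending = {}  # seq -> request awaiting commit
        self.active = {}   # seq -> committed job (stands in for the job log)

    def submit(self, seq, request):
        # Phase 1: record the request. A repeated sequence number is
        # treated as a retransmission, so the job is not duplicated.
        if seq not in self.pending and seq not in self.active:
            self.pending[seq] = request
        return seq

    def commit(self, seq):
        # Phase 2: the job becomes active only after the client confirms.
        if seq in self.pending:
            self.active[seq] = self.pending.pop(seq)
        return seq in self.active


server = ToyGramServer()
server.submit(1, "run program P")
server.submit(1, "run program P")  # client retries after a lost reply
started = server.commit(1)
```

Even though the request is sent twice, only one job ends up in `server.active`, which is the property the sequence number is meant to give.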
Slide 10
3.3. MDS protocols: Monitoring and Discovery Service. Allows
discovering and disseminating information about the structure and
state of Grid resources. Uses GSI for access control. The idea: 1.
A resource uses the Grid Resource Registration Protocol (GRRP) to
notify other entities that it is part of the Grid. 2. Those
entities can then use the Grid Resource Information Protocol (GRIP)
to obtain information about resource status.
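The register-then-query pattern of the two steps above might look like this in miniature. `ToyDirectory`, `register`, and `query` are hypothetical names; the expiry of stale registrations after `ttl_seconds` is an assumption added for illustration and is not stated on the slide.

```python
import time


class ToyDirectory:
    """Hypothetical stand-in for an entity that receives registrations."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # resource name -> (status, time of registration)

    def register(self, name, status, now=None):
        # Step 1 (GRRP-like): a resource announces that it is part of the Grid.
        now = time.time() if now is None else now
        self.entries[name] = (status, now)

    def query(self, name, now=None):
        # Step 2 (GRIP-like): ask for the status of a registered resource.
        now = time.time() if now is None else now
        entry = self.entries.get(name)
        if entry is None or now - entry[1] > self.ttl:
            return None  # unknown, or registration considered stale (assumed)
        return entry[0]


directory = ToyDirectory(ttl_seconds=60)
directory.register("cluster-a", {"free_cpus": 8}, now=0)
fresh = directory.query("cluster-a", now=30)
stale = directory.query("cluster-a", now=120)
```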
Slide 11
3.4. GASS service: the Globus Toolkit's Global Access to
Secondary Storage. Provides mechanisms for transferring data to and
from a remote HTTP, FTP, or GASS server. In the current context,
these mechanisms are used to stage executables and input files to a
remote computer. GSI mechanisms are used for authentication.
Slide 12
4. Computation management. The Condor-G agent: 4.1. User
interface 4.2. Supporting remote execution 4.3. Credential
management 4.4. Resource discovery and scheduling
Slide 13
4.1. User interface. The Condor-G agent allows the user to treat
the Grid as an entirely local resource, with an API and command
line tools that allow the user to perform the following job
management operations: submit jobs, indicating an executable name,
input/output files, and arguments; query a job's status, or cancel
the job; be informed of job termination or problems, via callbacks
or asynchronous mechanisms such as email; obtain access to detailed
logs, providing a complete history of their jobs' execution.
Slide 14
4.1. User interface. The innovation in Condor-G is that these
capabilities are provided by a personal desktop agent and supported
in a Grid environment, while guaranteeing fault tolerance and
exactly-once execution semantics, providing the user with a
familiar and reliable single access point to all the resources
he/she is authorized to use.
Slide 15
4.2. Supporting remote execution: Job Submission Process. 1. The
user submits jobs to the scheduler. 2. The scheduler creates a
GridManager daemon. 3. For each job, the GridManager creates a
JobManager using two-phase commit GRAM. 4. GASS is used to transfer
job executables and input files, and to return output. 5. The
JobManager submits the jobs to the local scheduling system.
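The ordering of the five steps can be traced with a small sketch. The function `submit_jobs` is a hypothetical name and only records which component acts at each step; it does not contact any real Condor or Globus service.

```python
def submit_jobs(jobs):
    """Return the ordered steps of a toy run of the submission process."""
    # Steps 1 and 2: the scheduler receives the jobs and creates a GridManager.
    trace = ["scheduler: received %d job(s)" % len(jobs)]
    trace.append("scheduler: created GridManager")
    for job in jobs:
        # Step 3: one JobManager per job, created via two-phase commit GRAM.
        trace.append("GridManager: created JobManager for %s" % job)
        # Step 4: GASS stages the executable and input files.
        trace.append("GASS: staged files for %s" % job)
        # Step 5: the JobManager hands the job to the site's own scheduler.
        trace.append("JobManager: submitted %s to local scheduler" % job)
    return trace


trace = submit_jobs(["job-1", "job-2"])
```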
Slide 16
4.2. Supporting remote execution: Crash Tolerance. Condor-G is
built to tolerate four types of failure: 1. Crash of the Globus
JobManager: the GridManager probes the GateKeeper. If the
GateKeeper responds, a new JobManager is started.
Slide 17
4.2. Supporting remote execution: Crash Tolerance. Condor-G is
built to tolerate four types of failure: 2 & 3. Crash of the
resource management machine, or network failure: the GridManager
waits until the connection is re-established, then reconnects to
the JobManager.
Slide 18
4.2. Supporting remote execution: Crash Tolerance. Condor-G is
built to tolerate four types of failure: 4. Crash of the job
submission machine: the GridManager gives the JobManager its new
IP address and port.
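The four failure cases and the GridManager's reaction to each can be summarized as a small decision function. `Failure` and `recovery_action` are hypothetical names, and treating an unresponsive GateKeeper like a connectivity failure is an assumption made for this sketch.

```python
from enum import Enum


class Failure(Enum):
    JOB_MANAGER_CRASH = 1
    RESOURCE_MACHINE_CRASH = 2
    NETWORK_FAILURE = 3
    SUBMIT_MACHINE_CRASH = 4


def recovery_action(failure, gatekeeper_responds=True):
    """Map a failure type to the GridManager's reaction (toy model)."""
    if failure is Failure.JOB_MANAGER_CRASH and gatekeeper_responds:
        return "start a new JobManager"
    if failure is Failure.SUBMIT_MACHINE_CRASH:
        return "give the JobManager the new IP address and port"
    # Resource machine crash, network failure, or a silent GateKeeper
    # (handling the last case this way is an assumption).
    return "wait for the connection, then reconnect to the JobManager"


actions = {f.name: recovery_action(f) for f in Failure}
```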
Slide 19
4.3. Credential Management. A GSI proxy credential is used to
authenticate with resources. Because proxy credentials expire, the
agent periodically checks the user's credentials. When credentials
expire, the jobs are put on hold and the user is notified. Problem:
long tasks will require frequent proxy updates.
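The periodic check described above could be sketched like this. `check_proxy`, the job dictionaries, and the `"held"` state label are hypothetical; a real agent would read the expiry time from the proxy certificate rather than take it as a number.

```python
def check_proxy(expires_at, now, jobs):
    """Hold all jobs and return a notification if the proxy has expired."""
    if now < expires_at:
        return None  # credential still valid, nothing to do
    for job in jobs:
        job["state"] = "held"
    return "Proxy credential expired: %d job(s) on hold" % len(jobs)


jobs = [{"id": 1, "state": "running"}, {"id": 2, "state": "idle"}]
early = check_proxy(expires_at=100, now=50, jobs=jobs)
late = check_proxy(expires_at=100, now=150, jobs=jobs)
```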
Slide 20
4.3. Credential Management. Solution: MyProxy system (long-lived
proxy credentials kept on a server). Remote services acting on
behalf of the user can then obtain short-lived proxies (e.g. 12
hours) from the server.
Slide 21
4.4. Resource discovery and scheduling. 1. The simple approach:
a user-supplied list of GRAM servers. 2. The resource broker:
gathers information about available GRAM servers using the
Monitoring and Discovery Service (MDS). The user can then choose
from the list of available servers. For high-throughput
computations, flooding is applied: candidate resources are flooded
with requests to execute jobs.
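A rough sketch of the broker and of flooding, under stated assumptions: `broker_candidates` and `flood` are hypothetical names, the `free_cpus` attribute is an invented example of what MDS might report, and "flooding" is modeled simply as sending a request for every job to every candidate server.

```python
def broker_candidates(mds_info, min_free_cpus):
    """Pick GRAM servers whose advertised state meets the requirement."""
    return sorted(
        name for name, info in mds_info.items()
        if info["free_cpus"] >= min_free_cpus
    )


def flood(jobs, servers):
    """Send a request for every job to every candidate server."""
    return [(server, job) for job in jobs for server in servers]


# Hypothetical snapshot of what a broker might learn from MDS.
mds_info = {
    "gram-a": {"free_cpus": 16},
    "gram-b": {"free_cpus": 2},
    "gram-c": {"free_cpus": 8},
}
candidates = broker_candidates(mds_info, min_free_cpus=4)
requests = flood(["job-1", "job-2"], candidates)
```

In a real system the surplus requests would have to be withdrawn once a job actually starts somewhere; that bookkeeping is left out here.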
Slide 22
5. GlideIn mechanism. What happens when a job executes on a
remote platform where required files are not available and local
policy may not permit access to local file systems? Solution:
Sandboxing
Slide 23
5. GlideIn mechanism. The idea: start a daemon on the remote
computer that learns about the available settings and resources.
Run each user task in a sandbox, where system calls are redirected
back to the user's originating system.
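The redirection idea can be illustrated with a toy sandbox that forwards file operations to the originating side instead of touching the remote machine's file system. `OriginFiles`, `Sandbox`, and `user_task` are hypothetical names; real sandboxing traps system calls, which this sketch only imitates with ordinary method calls.

```python
class OriginFiles:
    """Files that live on the user's originating (submitting) machine."""

    def __init__(self, files):
        self.files = dict(files)

    def read(self, path):
        return self.files[path]

    def write(self, path, data):
        self.files[path] = data


class Sandbox:
    """Runs on the remote node; forwards every file operation to the origin."""

    def __init__(self, origin):
        self.origin = origin
        self.calls = []  # record of redirected operations

    def read(self, path):
        self.calls.append(("read", path))
        return self.origin.read(path)

    def write(self, path, data):
        self.calls.append(("write", path))
        self.origin.write(path, data)


def user_task(io):
    # The task only sees the sandbox interface, never the remote disk.
    text = io.read("input.txt")
    io.write("output.txt", text.upper())


origin = OriginFiles({"input.txt": "hello grid"})
sandbox = Sandbox(origin)
user_task(sandbox)
```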
Slide 24
THANKS. QUESTIONS? FATIH UNIVERSITY, Computer Engineering. Helton
MALAMBANE