
Grid Computing: A Survey of Technologies

September 22nd, 2004

Sean Norman


1 Introduction
  1.1 What is a grid?
  1.2 Benefits of grid computing
    1.2.1 Exploiting unused and underutilized resources
    1.2.2 Access to additional resources
    1.2.3 Balance resource utilization
    1.2.4 Parallel computing
    1.2.5 Collaboration through virtualization
    1.2.6 Increased reliability
  1.3 Anatomy of a grid
  1.4 Organization
2 Globus
  2.1 Globus Toolkit
    2.1.1 Grid Security Infrastructure
    2.1.2 Grid Resource Allocation Manager
      2.1.2.1 Resource Specification Language
      2.1.2.2 Globusrun command
      2.1.2.3 Gatekeeper
      2.1.2.4 Job Manager
      2.1.2.5 Dynamically-Updated Online Co-allocator
      2.1.2.6 Global Access to Secondary Storage
    2.1.3 Monitoring and Discovery System
      2.1.3.1 Grid Resource Information Service
      2.1.3.2 Grid Index Information Service
    2.1.4 GridFTP
  2.2 Web Services
    2.2.1 Simple Object Access Protocol
    2.2.2 Web Services Description Language
    2.2.3 Web Service Inspection
  2.3 Open Grid Services Architecture
    2.3.1 What is a Service-Oriented Architecture?
    2.3.2 Exploiting Web Services
    2.3.3 Needs in a Grid Process
      2.3.3.1 Dynamic Service Creation
      2.3.3.2 Dynamic Service Management
      2.3.3.3 Service Lifetime Management
      2.3.3.4 Registration and Discovery
      2.3.3.5 Notification
      2.3.3.6 Upgradeability
3 Condor
  3.1 Classified Advertisements
  3.2 Matchmaking
    3.2.1 The Bilateral Matching Problem
    3.2.2 Gangmatching
    3.2.3 Set-Extended ClassAds
  3.3 Problem Solvers
    3.3.1 Master-Works
    3.3.2 Directed-Acyclic Graph Manager
  3.4 Split Execution
  3.5 Condor-G
4 Legion
  4.1 Core Objects
  4.2 Resource Management Infrastructure
    4.2.1 Hosts and Vaults
    4.2.2 Collection
    4.2.3 Scheduler
    4.2.4 Enactor
    4.2.5 Execution Monitor
  4.3 Job Handling
  4.4 Architectural Characteristics
5 Nimrod
  5.1 Nimrod-G
  5.2 System Architecture
    5.2.1 Client/User
    5.2.2 Parametric Engine
    5.2.3 Scheduler
    5.2.4 Dispatcher
    5.2.5 Job Wrapper
  5.3 Scheduling and Computational Economy
6 Application Level Scheduling
  6.1 Resource Management Architecture
7 Discussion and Experiences
  7.1 Globus
  7.2 Condor-G
  7.3 Legion
  7.4 Nimrod-G
  7.5 AppLeS
8 Conclusion
9 Bibliography


Abstract

As computing technology improves and the availability of computing resources increases, the demands placed on these resources grow ever higher. This trend has led to the development of grid technology, which represents the evolution of distributed and parallel computing at a global or multi-institutional scale. The purpose of this paper is to describe grid computing in its current form, provide motivation for the technology and give an overview of selected grid systems. The paper also discusses experiences with these systems and examines their characteristics as they relate to usability, adaptability, scalability and reliability.

1 Introduction

Today's world of computing continues to see improvement in raw computing power, storage capability and communication. Despite these improvements, computational resources fail to keep up with the demands that scientific and business communities place on them [25]. One of the primary reasons for this failure is that as technology improves, more demands are placed on it. A good example of this trend is given in [25]: ten years ago, biologists were content with computing a single molecular structure, whereas today they want to calculate the structures of complex assemblies of molecules and screen thousands of drug candidates. Other projects, such as the National Fusion Collaboratory [27], can produce hundreds of megabytes of data in a matter of seconds and require quick analysis of that data. CERN's Large Hadron Collider [12] is expected to produce a few petabytes of data per year by 2006 [25]. Due to the large increase in the amount of data being generated, stored and shared, collaboration between geographically dispersed colleagues now requires gigabits of data to be transferred in a short amount of time. Companies such as Hewlett-Packard [40] and IBM [20] envision a future where they can provide on-demand computing and application hosting, removing the need for customers to purchase and manage their own hardware.

Demands such as these have created both the need and the desire to share a wide variety of computing resources, any of which could be located in geographically separate regions and controlled by different organizations. One of the motivations for resource sharing is to harness the power of unused or underutilized resources. Resource sharing takes advantage of the combined power of multiple resources in order to achieve a goal. For example, rather than running a large scientific computation or large-scale job on a supercomputer, it could be run in parallel over a number of desktop systems, which are cheaper and more readily available. The SETI@home project [58] is an example of resource sharing across multiple desktop machines. The SETI@home software, once installed, "borrows" the compute power of a personal computer in order to analyze radio data whenever the computer becomes idle.

Five classes of applications that drive the development of grid computing are:

1. Distributed supercomputing
   · Deals with large classes of problems that are computation-intensive and require large amounts of CPU power, memory, disk storage, etc.

2. High-throughput computing
   · Focuses on harnessing the combined power of unused or underutilized resources

3. On-demand computing
   · Emphasizes the need to integrate remote resources with local computations

4. Data-intensive computing
   · Focuses on how to combine data from multiple or large data sources

5. Collaborative computing
   · Emphasizes the need for communication, interaction and collaboration

1.1 What is a grid?

The inspiration for grid computing, which encompasses on-demand access to services, data and resources, was articulated as early as 1969 by Leonard Kleinrock:

“...we will probably see the spread of 'computer utilities', which like present electric and telephone utilities, will service individual homes and offices around the country” [46].

Although Kleinrock's vision is far from being a reality, both this type of vision and the demands placed on computational resources inspired a great deal of research into grid computing during the 1990s. This research led to the development of many projects, such as the design and implementation of the computational power grid [24], an infrastructure for distributed and parallel computing that is widely used in scientific grid-related projects today. Some of the largest players in the business world (including HP, IBM and Microsoft) are heavily involved in research into grid computing technology.

So, what is a grid then? Many definitions exist and most organizations tend to "pick" a characterization or definition that suits their interests. For example, Sun Microsystems' Sun Grid Engine [61] and Platform's Load Sharing Facility [53] claim to be grid systems, but cannot be characterized as true grid systems due to their centralized control structure, complete knowledge of user requests and total control over system components [25]. As pointed out in [26], a universally accepted definition has yet to emerge. Here is a particular definition given by a grid research group a few years back:


“A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities” [24]

A computational grid refers to a grid that shares processing capacity. A few years later, the same grid research group refined its definition to:

“An infrastructure allowing flexible, secure, and coordinated resource sharing among dynamic collections of individuals, institutions and resources referred to as virtual organizations” [29]

A virtual organization is defined as a group of people that are bound by a set of sharing rules that govern what resources are shared, who is allowed to share and the conditions under which sharing occurs [29]. Finally, a simple, business-oriented definition of a grid (taken from IBM) is:

“With grid computing you can unite pools of servers, storage systems and networks into one large system to deliver non-trivial qualities of service. To an end user or application, it looks like one big virtual computing system” [20]

Grid resources fall into the categories of computation (e.g. a machine sharing its CPU), storage (e.g. a machine sharing its RAM or disk space), communication (e.g. sharing of bandwidth or a communication path), software and licenses, and special equipment (e.g. sharing of devices).

As an example, a grid may consist of a group of homogeneous machines contained within the same department of an organization, each connected over a Local Area Network (LAN) and managed by grid management software. Security in this type of situation isn't much of a concern due to the grid's localized utilization. Special policies for resource access, job scheduling priority, etc. are not usually necessary.


Figure 1: A simple grid [6]

A slightly more complicated grid introduces heterogeneous machines and more types of resources that may reside in different departments of an organization. These departments may be close together or geographically separated. Since the utilization of the grid never leaves the boundaries of the organization, these types of grids are referred to as intragrids. Intragrids typically use network file system mechanisms for data sharing and may have specialized policies in place for job-scheduling priorities, resource access, etc. Security is a much higher concern due to the resource access required across departmental boundaries.

A much more complicated grid introduces heterogeneous machines and resources that reside in different zones of administrative control. Since resources can be accessed from outside organizational boundaries, these grids are referred to as intergrids. Intergrids require specialized policies to address issues such as who has access to what resources, what priorities are placed on resource access and job scheduling, etc. Security is extremely important because resources are essentially being shared with the public. Intergrids also require a certain degree of standardization and virtualization due to their heterogeneous nature. They are usually organized in a hierarchical fashion in order to address issues of scalability.


Figure 2: A complex grid [6]

In the end, the goal of grid software is to enable wide-scale resource sharing and hide the issues of heterogeneity. To an end-user, all resources should appear as if they were available on a local machine, as depicted in Figure 3.


Figure 3: Users’ view of the grid [6]

1.2 Benefits of grid computing

1.2.1 Exploiting unused and underutilized resources

Many organizations have a vast amount of compute resources available at their disposal. A large number of these resources are idle for long periods of time, especially in the case of desktop machines. Grid computing provides a framework that allows these resources to be exploited. For example, an application might be scheduled to execute on a machine which has a high activity level. The grid framework could detect this and reschedule the application to run on a less-burdened or idle machine elsewhere in the grid. This concept applies to any type of resource shared on the grid, such as disk storage. If an application requires a large amount of disk storage, the grid framework could be used to aggregate several disk stores into a single virtual store in order to increase storage space or improve performance.

1.2.2 Access to additional resources

A grid allows its users to access resources that the user might not normally have access to. For example, an organization might have a grid machine connected to an electron microscope or a telescope. If the appropriate policies are in place, a grid user who does not normally have access to such equipment might be given access to it, whether they belong to the organization or not. A grid can also be used for the simple reason of gaining access to additional resources. For example, if an application requires a lot of disk space, the grid framework can be used in order to obtain additional storage.


1.2.3 Balance resource utilization

The grid framework can be used to improve resource utilization. For example, jobs can be scheduled to run on idle machines or machines with low activity levels. Grids also offer load balancing. If jobs running on the grid require a high level of communication with one another, they can be scheduled in a manner that minimizes the cost of communication or the amount of traffic on their communication lines.

1.2.4 Parallel computing

Many industries and scientific communities require the use of parallel computing in order to run applications or solve certain problems. Grid computing provides a framework that allows jobs to be split up into multiple sub jobs, and each sub job can be made to execute in parallel on different machines in the grid.

1.2.5 Collaboration through virtualization

Grid computing supports collaboration by defining standards which allow groups of heterogeneous systems to band together and present themselves as a larger virtual organization. Grid users can be organized dynamically into a number of virtual organizations, each having their own policy requirements and sharing their resources as part of a larger grid [6]. Resources are virtualized in order to increase interoperability between heterogeneous systems.

1.2.6 Increased reliability

Grids offer a certain amount of reliability at a low cost; no expensive or proprietary hardware or software is required. A failure in one part of the grid is unlikely to affect other parts of the grid. If a job is unsuccessful due to a failure and the failure is detected, grid management software can automatically re-submit the job for execution. Jobs can even be submitted multiple times in order to ensure that at least one copy of the job executes successfully.

1.3 Anatomy of a grid

The common features of a grid system are shown in Figure 4.


Figure 4: Common Features of a Grid System [52]

The client represents the end-user. It does not have to be the end-user per se; it can be an application or an agent acting on behalf of the end-user (which is typically the case). Many clients can exist at the same time. Each client acts independently and serves to further the interests of the end-user(s) it represents. One of the responsibilities of the client is to find and acquire the resources desired by the entities it represents. This is accomplished by first consulting the resource registry. The resource registry is an information source that allows entities to publish and update information about the resources they wish to share. Many resource registries might exist and each might hold a different set of information. As such, clients may be required to consult one or more resource registries before finding the resources they require. Once these resources are found, the client submits an allocation request to the resource manager(s) responsible for the desired resources. If the request can be accommodated, the resource manager(s) update the status information for the acquired resources in the resource registries. The client then sends the appropriate executables and input data to the allocated resources and receives a reference to the execution in return. This reference allows the client to monitor the execution of a job and inquire about its status, amongst other functions. The client may also receive the results of the job once its execution is complete.

A layered architecture of a grid is shown in Figure 5.


Figure 5: Layered Grid Architecture [25]

At the lowest level, we have the grid fabric. This layer consists of the resources and devices that are shared across the grid, such as computers, storage systems, networks and sensors. Also included are the local resource management systems that are in charge of managing local resources. High-level resource management systems and resource brokers must collaborate with these local resource management systems in order to gain access to local resources on a particular network or system.

Above the grid fabric layer is the grid core layer. It contains the communication and authentication protocols that must be implemented in order to enable resource control operations, secure initiation, authentication, and monitoring.

The grid service layer contains protocols, APIs and services which are used for interacting with resources. Typical services include information services, which allow resources to register themselves and be queried, data migration and replication services, etc.

The top layer is the user application layer. It consists of applications, such as the client agent described above, that make use of grid services.

1.4 Organization

The organization of this paper is as follows:


Section 1 discusses the motivations for grid computing, explains what grid computing is, discusses the benefits it can provide and shows a high level architectural view of a grid.

Section 2 discusses the Globus project. In particular, it discusses how the Globus architecture has evolved towards the use of web-services in what is now called the Open Grid Services Architecture. It also provides an overview of Web Services and the Globus toolkit.

Section 3 discusses the Condor project. It examines Condor's classified advertisement model, matchmaking, a few solutions for problems found with matchmaking when applied to a grid environment, and Condor-G – a grid-enabled Condor infrastructure.

Section 4 examines Legion, an object-oriented grid infrastructure. In particular, the paper focuses on Legion’s architectural characteristics, its core objects model and its resource management model.

Section 5 discusses the Nimrod software system. In particular, it discusses Nimrod-G: a grid-enabled parametric studies system based on a model of computational economy.

Section 6 examines Application Level Scheduling, an application-specific grid scheduler.

Section 7 contains a discussion of each of the technologies as they relate to the characteristics of usability, adaptability, scalability and reliability.

Section 8 contains the concluding remarks.

2 Globus

Globus [33] is a grid-oriented community whose goal is to define standards for protocols, APIs, service definitions, service behaviors, etc. relating to grid software. The community provides a toolkit (described in section 2.1) that implements the basic components and services required to construct and support a computational grid. This infrastructure allows both users and applications working within the environment to view distributed heterogeneous resources as if they were local resources.

The standards and technologies developed by the Globus community are currently in a state of evolution. One of the largest reasons for this evolution is the interest shown by commercial companies in integrating grid technology with technologies already in use in enterprise environments. Applications running in these environments have evolved to the point where they operate on heterogeneous resource sets that can span multiple administrative domains, and grid technologies are specifically aimed at handling these types of problems. However, organizations did not want to have to modify applications or entire software infrastructures in order to accommodate new standards that would require significant changes to the underlying code. This led to the development of the Open Grid Services Architecture (see section 2.3) [30], a set of grid standards based on Web Services standards (see section 2.2). These standards provide methods of defining how grid components are described, how their behaviors can be described and how they communicate with one another.

This section examines the Open Grid Services Architecture as well as the Globus and Web Service technologies on which it is based.

2.1 Globus Toolkit

The Globus toolkit [23] is a set of open-source libraries and services, based on open standards, designed to support the construction of computational grids. The toolkit has a layered service architecture consisting of resource management, data management, and information services, all of which sit on top of a security infrastructure. The service layers are depicted in Figure 6.

Figure 6: Globus service layers

Resource management services handle job submission, management of job execution, job monitoring, and resource allocation. Information services handle the task of collecting information such as resource status and resource availability throughout the grid. Data management services handle file transfers as well as their management. The overall architecture of the Globus toolkit is shown in Figure 7.


Figure 7: Architecture of Globus Toolkit [6]

2.1.1 Grid Security Infrastructure

The Grid Security Infrastructure (GSI) [31] deals with the issues of authentication and authorization. It makes use of the Secure Sockets Layer (SSL) protocol [60], X.509 certificates [56] and public key encryption mechanisms. Three important properties that GSI provides are (1) single sign-on, (2) mapping to local security mechanisms, and (3) delegation. GSI provides single sign-on by requiring that users be authenticated only once. When a user signs in, GSI uses public key encryption mechanisms to verify the user's grid credential. The grid credential is a global credential used to authenticate the user to the grid infrastructure. GSI handles the mapping of grid credentials to local credentials in order to make use of the local authentication and authorization mechanisms already in place. This makes it possible to run different authentication systems (such as Kerberos and FreeRADIUS) at different sites without requiring multiple credentials. GSI provides delegation by creating a proxy credential that is used for making requests on the user's behalf. The proxy can be thought of as an agent that holds the user's credentials and makes requests on behalf of the user. Instead of requiring the user to continually authenticate themselves for requests made while a job is running, the task is delegated to the proxy.
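As a concrete illustration (a hypothetical session: the exact commands, certificate subject and output format vary between toolkit versions), single sign-on and delegation in practice amount to creating a short-lived proxy credential once and letting all subsequent grid requests authenticate with it:

    $ grid-proxy-init
    Your identity: /O=Grid/OU=Example/CN=Sean Norman
    Enter GRID pass phrase for this identity: ********
    Creating proxy .................................... Done
    Your proxy is valid until: Thu Sep 23 02:34:56 2004

    $ globus-job-run tfglobus.csd.uwo.ca /bin/hostname    (authenticated via the proxy; no further prompts)

The proxy is itself a short-lived certificate signed with the user's own credential, which is what allows services such as GRAM to act on the user's behalf without ever handling the user's long-term private key.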


2.1.2 Grid Resource Allocation Manager

The Grid Resource Allocation and Management (GRAM) [19] is a protocol which supports the remote submission of a computational request to a remote computational resource, and subsequent monitoring and control of the resulting computation [28]. The protocol is designed to be fault-tolerant and makes use of GSI mechanisms for access control. GRAM is also responsible for processing Resource Specification Language (see section 2.1.2.1) specifications and periodically updating the information service. GRAM does not have any built-in scheduling or resource brokering capabilities. An overview of GRAM and its components is shown in Figure 8.

Figure 8: Globus GRAM overview [6]

2.1.2.1 Resource Specification Language

The Globus Resource Specification Language (RSL) [35] is an interchangeable language used for describing both resources and job requests. RSL descriptions contain attribute-value pairs, called relations, which control the behavior of one or more components in the system. Relations are assumed to correspond to one resource and can be combined into single expressions using conjunctions of the form (relation 1) (relation 2) ... (relation n). Figure 9 shows an example of an RSL job description. It requests five machines with at least 512 MB of RAM in order to execute the program "a.out" with arguments "arg1, arg2", and sets the working directory to "/home/test".


Figure 9: RSL job description (pre-OGSA)

(* this is a comment *)
& (executable = a.out)
  (directory = /home/test)
  (arguments = arg1 "arg 2")
  (count = 5)
  (memory >= 512)

Figure 10 shows an example of a post-OGSA (see section 2.3) RSL job description. It requests that the command "/bin/ls" be run with the argument "-l" and the directory set to "/home/mydir".

Figure 10: RSL job description (post-OGSA)

<?xml version="1.0" encoding="UTF-8"?>
<rsl:rsl xmlns:rsl="http://www.globus.org/namespaces/2003/04/rsl"
         xmlns:gram="http://www.globus.org/namespaces/2003/04/rsl/gram"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.globus.org/namespaces/2003/04/rsl /opt/IBMGrid/schema/base/gram/rsl.xsd
                             http://www.globus.org/namespaces/2003/04/rsl/gram /opt/IBMGrid/schema/base/gram/gram_rsl.xsd">
  <gram:job>
    <gram:executable>
      <rsl:path>
        <rsl:stringElement value="/bin/ls"/>
      </rsl:path>
    </gram:executable>
    <gram:arguments>
      <rsl:stringArray>
        <rsl:string>
          <rsl:stringElement value="-l"/>
        </rsl:string>
      </rsl:stringArray>
    </gram:arguments>
    <gram:directory>
      <rsl:path>
        <rsl:stringElement value="/home/mydir"/>
      </rsl:path>
    </gram:directory>
  </gram:job>
</rsl:rsl>

2.1.2.2 Globusrun command

Globusrun is a command-line tool used to request job submissions at remote machines, transfer a job's executable files to the remote site and transfer the resulting output files to the originating system.
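For example (a hypothetical invocation; the exact options vary between toolkit releases), a job described in an RSL file might be submitted to a remote resource as follows:

    globusrun -o -r tfglobus.csd.uwo.ca -f myjob.rsl

where -r names the remote resource manager contact, -f points to the RSL file and -o asks for the job's output to be streamed back to the submitting terminal via GASS.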



2.1.2.3 Gatekeeper

The gatekeeper is a process responsible for secure communication between clients and servers. It authenticates the globusrun client's right to submit jobs and determines the local account under which the jobs should run. Authentication is accomplished via GSI mechanisms and authorization is achieved through the use of a grid-map file, which maps grid identities to local user accounts. Enforcement is done according to the privileges associated with the local user account. The gatekeeper starts one Job Manager instance for each job submitted by the user. Once the GRAM client is authenticated, the gatekeeper delegates the authority to communicate with clients to the job manager instance. The gatekeeper terminates once all of the jobs submitted by the user have completed.
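A grid-map file is typically just a text file in which each line maps a certificate subject name to a local account; a small sketch (the distinguished names and account names below are invented) might look like:

    "/O=Grid/OU=UWO/CN=Sean Norman"   snorman
    "/O=Grid/OU=McGill/CN=Jane Doe"   guest01

A request arriving with a credential whose subject is not listed in the file is rejected, and a listed request runs with the privileges of the mapped local account.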

2.1.2.4 Job Manager

The Job Manager is a process that executes under the user's local credentials. Its purpose is to parse RSL descriptions, manage client callbacks, submit job requests to local resource managers, handle job status and cancellation requests and return job output to the client. Client callbacks can be embedded into grid-enabled applications if the application needs to receive feedback or change its behavior when certain conditions occur.

2.1.2.5 Dynamically-Updated Online Co-allocator

The Dynamically-Updated Request Online Co-allocator (DUROC) [34] is a co-allocation mechanism that allows jobs to be submitted to resources that are managed by different resource managers. DUROC coordinates the transactions between the resource managers and brings up the distributed pieces of the job [34]. The job transaction is atomic: either all of the subjobs are started successfully or the transaction fails. Co-allocation is specified in RSL using the "+" symbol.

Figure 11: Overview of Globus DUROC [6]

A DUROC RSL request is shown in Figure 12. It specifies one subjob consisting of a single process to be run on a system in UWO's computer science department, and a second subjob consisting of two processes to be run on a system in McGill's computer science department.

Figure 12: DUROC RSL request (pre-OGSA)

+ ( & (resourceManagerName = tfglobus.csd.uwo.ca)
      (count = 1)
      (executable = test_app1) )
  ( & (resourceManagerName = tfglobus.csd.mcgill.ca)
      (count = 2)
      (executable = test_app2) )

2.1.2.6 Global Access to Secondary Storage

Global Access to Secondary Storage (GASS) [10] is a service that provides GRAM with mechanisms for transferring and accessing data on remote resources. It can be used to access files stored in specialized GASS servers, FTP servers, HTTP servers and a variety of other servers. A typical use of GASS is to transfer a job's executable and any of its inputs to a remote resource and to transfer the resulting output back to the client.

2.1.3 Monitoring and Discovery System

The Monitoring and Discovery System (MDS) [22] provides access to static and dynamic resource information and provides resource discovery and dissemination mechanisms. MDS has a hierarchical structure based on the Lightweight Directory Access Protocol (LDAP) [39]. It consists of four components: the Grid Resource Information Service, the Grid Index Information Service, the Information Provider and the MDS Client.



Figure 13: Overview of Globus MDS [6]

The Information Provider is used to translate local resource information into the format defined in the schema and configuration files [35] while the MDS client is used to search for resource information.

2.1.3.1 Grid Resource Information Service

The Grid Resource Information Service (GRIS) acts as a gateway between a resource and the resource directory; it provides a uniform interface for querying a resource provider about the status of resources as well as configuration information.
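Because MDS is LDAP-based, a GRIS can be queried with ordinary LDAP tools. A hypothetical query (the host name is invented; the port and directory base shown are conventional MDS defaults, but deployments may differ) might look like:

    ldapsearch -x -h tfglobus.csd.uwo.ca -p 2135 -b "mds-vo-name=local, o=grid" "(objectclass=*)"

which returns the resource status and configuration entries that the GRIS publishes for that machine.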

2.1.3.2 Grid Index Information Service

The Grid Index Information Service (GIIS) is a resource directory service that combines aggregate information from all of the components under it into a single directory listing. A GRIS registers itself with a GIIS, and a GIIS can register itself with another GIIS, thus forming a hierarchy.


2.1.4 GridFTP

GridFTP [5] is a secure and reliable grid data transfer protocol based on FTP. The Globus toolkit provides implementations of both a GridFTP client and a GridFTP server which support standard and third-party file transfers. A standard transfer occurs when a client transfers a file to a remote machine running a GridFTP server (client to server), while a third-party transfer occurs when a client initiates a transfer of a file from one machine running a GridFTP server to another machine running a GridFTP server (server to server).
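As an illustration (hypothetical invocations of the toolkit's command-line client; the host names are invented), the two styles of transfer might look like:

    globus-url-copy file:///home/snorman/results.dat gsiftp://tfglobus.csd.uwo.ca/data/results.dat

    globus-url-copy gsiftp://tfglobus.csd.uwo.ca/data/results.dat gsiftp://grid.csd.mcgill.ca/archive/results.dat

The first command is a standard client-to-server transfer of a local file; the second asks the client to arrange a third-party transfer directly between the two GridFTP servers.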

2.2 Web Services

Web Services are interfaces that describe collections of network-accessible operations invoked through XML messaging; their purpose is to facilitate application-to-application communication [6]. Web Services define techniques for describing software components that allow access to themselves, methods for accessing these components, and techniques that allow for the identification and discovery of relevant service providers [6]. The three roles assumed in the Web Services model are shown in Figure 14.

Figure 14: Roles in the Web Services model [52]

The requester is an entity that requests the use of a particular service. Services are created and deployed on a server that is able to receive messages in a particular encoding over a transfer protocol [52]. The server might support several encoding/protocol pairs. These are called bindings, since high-level service definitions are bound to a low-level means of invoking the service [52]. Servers publish information about the services they provide in a registry, which, to the requester, represents a collection of services that provide implementations of various interfaces. The requester can search for the interface it needs by filtering interfaces based on criteria associated with the binding. Three Web Service standards are of particular interest when it comes to the Open Grid Services Architecture: the Simple Object Access Protocol, the Web Services Description Language and Web Service Inspection.


2.2.1 Simple Object Access Protocol

The Simple Object Access Protocol (SOAP) [59] is an XML-based messaging protocol used for encoding Web Service request and response messages before sending them over a network. SOAP defines a convention for remote procedure calls (RPC) and a convention for messaging independent of the underlying transport protocol. The messages can be transported over a variety of protocols such as FTP, SMTP, MIME and HTTP.
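As an illustration (a minimal, hand-written message; the operation name and service namespace are invented), an RPC-style SOAP request is simply an XML envelope whose body names the operation and its parameters:

    <?xml version="1.0" encoding="UTF-8"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <getJobStatus xmlns="http://example.org/jobservice">
          <jobId>job-42</jobId>
        </getJobStatus>
      </soap:Body>
    </soap:Envelope>

The same envelope could be carried over HTTP, SMTP or another transport without changing its contents, which is what makes SOAP transport-independent.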

2.2.2 Web Services Description Language

The Web Service Description Language (WSDL) [15] is an XML-based language that defines the set of messages, the encodings and the protocols that are used to communicate with a service. In other words, it defines an interface to a service. WSDL allows multiple bindings for a single interface.
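For example (a hypothetical fragment; the message, part and portType names are invented), the abstract interface portion of a WSDL document defines messages and the operations that exchange them:

    <message name="GetJobStatusRequest">
      <part name="jobId" type="xsd:string"/>
    </message>
    <message name="GetJobStatusResponse">
      <part name="status" type="xsd:string"/>
    </message>
    <portType name="JobPortType">
      <operation name="getJobStatus">
        <input message="tns:GetJobStatusRequest"/>
        <output message="tns:GetJobStatusResponse"/>
      </operation>
    </portType>

Separate binding elements then tie this portType to concrete encodings and transports (for example SOAP over HTTP), which is how a single interface can have multiple bindings.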

2.2.3 Web Service Inspection

Web Service Inspection (WSI) [11] is an XML-based language designed to assist in locating service descriptions and involves rules that define how inspection-related information should be made available. A WSI document is a collection of references to WSDL documents.

2.3 Open Grid Services Architecture

The construction of a grid system requires mechanisms common to general distributed computing infrastructures; for example, methods of identifying information by globally understood references which can be passed from one system to another, methods for subscribing to updates from an information source, etc. [52]. The demands placed on distributed programming for grid software are extremely high due to the fact that grids are heterogeneous and can span multiple administrative domains [52]. These types of issues are addressed through the proper use and development of standards. There are a number of working groups currently developing a variety of grid standards. One of these groups is the Open Grid Services Infrastructure (OGSI) [18] working group. The OGSI specification is an extension of Web Services which specifies the basic levels of function that allow grid services to be created, managed, discovered and destroyed. These basic functions are known as the grid core. Much like the core of an operating system, the grid core provides fundamental building blocks for higher-level services. OGSI does not address issues such as interoperability between heterogeneous systems or security; these are best addressed by other standards. For example, interoperability between heterogeneous systems is addressed by Web Service standards while grid security issues are addressed by other groups within the Global Grid Forum (GGF) [32]. The OGSI specification exploits these standards, using and extending them as necessary. The Open Grid Services Architecture (OGSA) [30] is a collective set of standards for the specification of higher-level services which address issues such as data access, resource configuration, accounting, etc.


OGSA has three main predecessors: the Globus toolkit, an infrastructure that provides key grid technologies considered to be standards by a variety of organizations; the autonomic computing initiative, an IBM project which seeks to provide a common set of core infrastructure functions across all platforms; and Web Service standards, which define provisions for the discovery, registration and use of distributed services.

2.3.1 What is a Service-Oriented Architecture?

A Service-Oriented Architecture (SOA) is an architectural style involving a collection of services that communicate with one another in order to pass data or to coordinate an activity involving a number of services [14]. The goal of a SOA is to achieve loose coupling among interacting agents.

A service is a well-defined, self-contained function that can be invoked through a remote interface. It is the endpoint of a connection, and involves an underlying system that supports the connection offered [14]. A service can be defined in terms of two elements: the messages used to communicate with it and the behavior expected in response to these messages. A service definition says nothing about its implementation details, thus it is possible to have multiple implementations of the same service, each of which may have different features or target a different platform. If a service has multiple implementations, a client must be able to determine the details of each implementation.

A SOA is very flexible because it allows a great deal of variety when it comes to the interconnecting protocol and the underlying platform used by both clients and servers [14].

2.3.2 Exploiting Web Services

Web Services are a form of SOA which make use of WSDL for describing service interfaces and utilize XML and HTTP in order to enable distributed heterogeneous computing [52]. They fit into the OGSA philosophy because a basic premise of OGSA is that computational resources can be represented as services whose behaviors can be described by a core set of interfaces or operations. Grid services map quite well to the Web Service concepts of registration, discovery and use as shown in Figure 15 [52].


Figure 15: Web-Service facilities used in a grid [52]

There is a slight difference when it comes to the representation of a service; in a grid, a service might represent physical resources, separate from the applications and data that exploit them [52]. As such, information about these services is dynamic and may change frequently.

OGSI exploits the following aspects of Web Services:

· The mechanisms for encoding message transmission protocols that are described as bindings in Web Services. There are many ways of encoding messages, and Web Services separates these concerns from the XML definitions of application interfaces: OGSI standardizes interfaces at the application level, independently of the bindings [52].

· The conventions used in Web Services to separate the primary application interfaces (function calls and their parameters) from functions which can be managed by standardized, generic middleware. This includes issues such as authentication and access control [52].

· The techniques used in Web Services to separate service and network management functions (i.e. workload balancing, performance monitoring, etc.) from the application interface [52].

OGSI adds the following extensions to WSDL:

· A mechanism for defining an interface in terms of an extension of one or more simpler interfaces; this is a reliable way of establishing the basic behavior of a Grid Service [52].

· An XML-based language for describing the state associated with a service [52].

2.3.3 Needs in a Grid Process


OGSA defines several Grid Services in order to support the requirements of a robust grid environment. Grid Services extend Web Services by defining a set of interfaces (WSDL refers to these interfaces as portTypes) which address issues of dynamic service creation, discovery, management, notification, and upgradeability. This section examines how OGSA addresses these needs.

2.3.3.1 Dynamic Service Creation

In order for grids to be truly dynamic in nature, grid software must have a mechanism that allows services to be created dynamically. For example, in a Web serving environment, new Web server instances might need to be created in order to deal with increasing load requirements. OGSA defines a class of grid services that implement an interface for creating new grid services. This interface is known as the Factory interface and a service that implements it is known as a factory. The factory service is used to dynamically create instances of desired services. Creating a new service returns a Grid Service Handle and an initial Grid Service Reference for the service instance (briefly explained below).

Every grid service instance is assigned a globally unique name called the Grid Service Handle (GSH). A GSH does not hold any protocol or instance specific information (i.e. network addresses, protocol bindings, etc.). This type of information, along with information about how to interact with the service instance is encapsulated into a Grid Service Reference (GSR). The GSR can change over a service’s lifetime. As such, a GSR has an expiry time and must be renewed.

2.3.3.2 Dynamic Service Management

Once a service instance is created, there needs to be a mechanism for identifying it and obtaining information about it. OGSA addresses this through the use of the HandleResolver interface, the GSH and the GSR. The GSH provides a unique way of identifying a service instance, but only a GSR contains information about how to interact with it. The HandleResolver is used to obtain a GSR from a GSH. It does this by making use of the reference to the home handle resolver contained in the GSH.

2.3.3.3 Service Lifetime Management

The inclusion of dynamic service creation raises a few issues when it comes to service termination. Typically, a service is created with the goal of executing a task. The service is terminated either when the task is complete or by explicit request from the entity that requested the service. In a distributed environment, a service may never receive a termination request due to component failures or lost messages, and thus the resources consumed by the service cannot be recovered. OGSA handles this problem through the use of soft-state lifetimes in the GridService interface. Each service instance has a certain lifetime associated with it and will terminate once that lifetime expires. However, it is possible for an entity to extend the lifetime of a service, given the right permissions.


2.3.3.4 Registration and Discovery

OGSA makes use of a service-oriented architecture. As such, clients need a method for finding services in a registry and services must be able to publish information about themselves into the registry. Information relating to these services must be represented in a standard fashion and clients must be able to identify services that perform the same task but differ in functionality. This allows clients to choose which service instances they would like to utilize according to criteria associated with the service.

OGSA defines a registry grid service which supports service discovery. A registry is defined in terms of the Registry interface, which provides operations that allow GSHs to be registered, and associated service data elements that contain information about all registered GSHs. The GridService interface contains operations that can be used to retrieve information about registered GSHs.

2.3.3.5 Notification

Distributed services need a mechanism that allows other services to be notified of any changes in state. However, most services are not interested in the state of all other services; a mechanism that allows services to subscribe to notification messages needs to be in place.

OGSA has such a system in place with its NotificationSource, NotificationSink and NotificationSubscription interfaces. A service must support the NotificationSource interface if it wants to support subscription to its notification messages; this interface allows the management of these subscriptions. A service wanting to receive notification messages must support the NotificationSink interface, which is used to deliver notification messages. Subscription is achieved through use of the NotificationSubscription interface by giving it the GSH of the NotificationSource and the GSH of the NotificationSink. Notification messages then flow from the source to the sink. The sink must periodically send keepalive messages in order to notify the source that it is still interested in notification messages.

2.3.3.6 Upgradeability

Services will sometimes require an upgrade, which brings up a few problems. Will the upgrade be compatible with the older version? What happens if clients are interacting with the service during the upgrade period? If the service now supports different protocols or has additional features, the client must be notified of this.

OGSA handles this issue through the use of the GSH and GSR. The GSH omits the protocol and instance-specific information contained in a GSR and remains valid across upgrades. Since a GSR has an expiry time, a client renewing it will obtain an updated reference that reflects any changes made to the service, such as new protocol bindings or additional features.


3 Condor

The Condor project [16], headed by the University of Wisconsin, is aimed at investigating and developing mechanisms and policies that support high-throughput computing on sets of distributed resources. The infrastructure provided by the project contains mechanisms for job queuing, scheduling policy, a priority scheme, resource monitoring and resource management. Condor uses Classified Advertisements, an expressive and extensible method of describing both resource requests and resource offers. A matchmaking framework is used to determine which resource requests and resource offers should be matched together. Condor was originally designed to run on distributed systems, but it is possible to integrate the infrastructure with grid technologies in order to make it useful in the context of a grid environment. For example, Condor-G (discussed in section 3.5) uses the resource management mechanisms and information services provided by the Globus toolkit in order to operate in a grid environment.

3.1 Classified Advertisements

A Classified Advertisement [57] (known as a ClassAd) is a semi-structured data model that represents arbitrary resources and constraints on their allocation [47]. A ClassAd is a mapping from attribute names to expressions. These expressions may consist of literals, aggregates, operators, references to other attributes or calls to built-in functions. ClassAds are flexible because they do not have to be created according to a specific template; a ClassAd is valid as long as it follows the appropriate syntax and contains the required attributes, mainly 'Rank' and 'Constraint'. ClassAds were developed because Condor needed a simple and expressive method for representing arbitrary service requests and service offers and for identifying and ranking compatible matches. Figure 16 shows the ClassAd of a machine offering itself as a resource and Figure 17 shows the ClassAd of a job request.

Figure 16: ClassAd Machine Description

[
Type        = "Machine";
Activity    = "Idle";
Name        = "durendal.syslab.csd.uwo.ca";
Arch        = "Intel";
OpSys       = "Solaris8";
Memory      = 256;
Disk        = 32768;
Kflops      = 27432;
ResearchGrp = {"snorman", "hanan", "katchab"};
Untrusted   = {"unknown"};
Rank        = member(other.Owner, ResearchGrp) * 10;
Constraint  = !member(other.Owner, Untrusted) && Rank >= 10 ? true : false;
]


Figure 17: ClassAd Job Description

[
Type       = "Job";
Owner      = "snorman";
Cmd        = run_sim;
Iwd        = /usr/snorman/;
Memory     = 31;
Rank       = Kflops/1E3 + other.Memory/32;
Constraint = other.Type == "Machine" && OpSys == "Solaris8" &&
             Disk >= 15000 && other.Memory >= self.Memory;
]

3.2 Matchmaking

Resources in Condor are represented by Resource-Owner Agents (RAs) and customers are represented by Customer Agents (CAs). RAs are responsible for enforcing the resource usage policies put into place by resource owners and maintain information relating to resource status. CAs maintain a list of jobs submitted per customer. Both RAs and CAs submit their ClassAds to a pool manager whose responsibility is to maintain a pool of ClassAds. The ClassAds must contain both a Constraint and a Rank expression, and each party must provide an address where it can be contacted. RAs can also include an authorization ticket which is used to authorize CAs during the claiming process. Periodically, the pool manager invokes a matchmaking algorithm which scans known ClassAds and creates pairs that satisfy each other's constraints and preferences.

When two ClassAds are being evaluated, the matchmaker allows each ClassAd to access the other's attributes through the keyword other. The reference self.attribute refers to an attribute of the ClassAd that contains the reference, while other.attribute refers to an attribute of the other ClassAd. For example, in Figure 17 the constraint "other.Memory >= self.Memory" is satisfied if the Memory attribute of the resource provider is greater than or equal to the Memory attribute of the requester. Two advertisements A and B are considered compatible if A.Constraint and B.Constraint both evaluate to true. A user-supplied Rank expression is used to determine preference when more than one resource satisfies a request. If a ClassAd refers to an attribute that does not exist in the other ClassAd, the expression evaluates to undefined, which is interpreted by the matchmaker as being false.
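To make this concrete, consider matching the machine ad of Figure 16 against the job ad of Figure 17 (a rough walkthrough based only on those two figures, ignoring ClassAd type-coercion details). The machine's Constraint holds because "snorman" is not in Untrusted and is in ResearchGrp, so the machine's Rank evaluates to 10 and the test Rank >= 10 succeeds. The job's Constraint also holds: the other ad has Type "Machine", OpSys "Solaris8", Disk of 32768 (at least 15000) and Memory of 256 (at least the job's 31). The two ads therefore match, and the job's Rank for this machine is roughly 27432/1E3 + 256/32 ≈ 35.4, which the matchmaker would use to order this offer against any other compatible machines.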

The matchmaking algorithm uses information about past resource usage in order to ensure an element of fairness. When a match is found, the pool manager contacts the appropriate parties at the provided addresses and gives the CA the authorization ticket submitted by the RA. When a CA is contacted by the pool manager (due to a match), it must contact the RA directly and provide it with the authorization ticket. This is called the claiming process. The RA checks the authorization ticket and re-evaluates the constraints using updated resource information. If the ticket matches the original ticket and the constraints are still satisfied, the RA accepts and runs the job submitted by the CA. Once a CA is finished with a resource, it releases its claim on it and the RA can advertise itself as unclaimed. It is possible to have an RA continue to advertise itself while claimed in order to listen for higher-priority jobs.

Figure 18: Condor Matchmaking [49]

The authors of [47] claim the separation of matching and claiming provides the following benefits:

· Weak consistency requirements
  o The state of resource providers and requesters may change once the ClassAd has been sent, resulting in stale advertisements. Claiming allows constraints to be verified with respect to the current state of the resources [47].

· Authentication
  o The claiming protocol can make use of cryptographic techniques [47].

· Bilateral specialization
  o Since the allocation phase has been pushed to the claiming stage, the matchmaker does not have to deal with specific kinds of providers and consumers – it remains generic in nature [47].

· End-to-end verification
  o The matchmaker does not have to remember any state about matches [47].

3.2.1 The Bilateral Matching Problem

The Condor project was started in 1988 and is thus fairly old by computer standards. As time passed, Condor started being used in situations it was not originally intended for (such as grid environments). Grid environments posed unique challenges to Condor, such as the problem of multilateral (one-to-many) matching. A job running in a grid environment can potentially require multiple resources from multiple sources.


However, the original Condor framework only supported bilateral (one-to-one) matching. Thus, a ClassAd describing a job could only be matched to a single ClassAd describing a resource. It would have been an extremely difficult, if not impossible, task to find a method of running a job that required multiple resources using the original Condor framework. Nowadays, the ClassAds language and matchmaking framework have evolved to make it possible to achieve multilateral mappings through a series of extensions. Two of these proposed extensions are discussed in the following sections.

3.2.2 Gangmatching

Gangmatching [48] is a model developed to overcome the bilateral matching problem. The model follows a docking paradigm in which gangs of ClassAds are created by docking (i.e. binding) together individual ClassAds with a matching operation [48]. In a traditional ClassAd, the advertisement contains an implicit declaration of a bilateral match. In the Gangmatching model, a ClassAd contains an explicit list of required bilateral matches. Gangmatching extends the ClassAds language through the inclusion of Ports and Label attributes. The Ports attribute defines the characteristics and the number of matching ads required for the ClassAd to be satisfied, while the Label attribute names the candidate bound to a port [48]. The scope of the Label extends from the port of declaration to the end of the port list.

Figure 19: Gangmatching Request ClassAd

[
  Type= "Job";
  Owner= "snorman";
  Cmd= "run_sim";
  Ports= {
    [ /* request a machine */
      Label= cpu;
      ImageSize= 32M;
      Rank= cpu.Kflops/1E3 + cpu.Memory/32;
      Constraints= cpu.Type == "Machine" && cpu.OpSys == "Solaris8" &&
                   cpu.Memory >= ImageSize;
    ],
    [ /* request extra disk storage */
      Label= storage;
      ImageSize= 32M;
      Rank= storage.Disk;
      Constraints= storage.Type == "Storage" && storage.Size > ImageSize;
    ]
  }
]


Figure 20: Gangmatching Storage ClassAd

[
  Type= "Storage";
  Size= 1024M;
  FileSystem= "FAT32";
  Ports= {
    [ Label= requester;
      Rank= 1/requester.SizeRequired;
      Constraints= requester.Type == "Job" && requester.ImageSize < 1024M;
    ]
  }
]

Figure 21: Gangmatching Machine ClassAd

[
  Type= "Machine";
  Activity= "Idle";
  Name= "durendal.syslab.csd.uwo.ca";
  Arch= "Intel";
  OpSys= "Solaris8";
  Memory= 256;
  Disk= 32768;
  Kflops= 27432;
  Ports= {
    [ Label= requester;
      Rank= 1/requester.ImageSize;
      Constraints= requester.Type == "Job";
    ]
  }
]

The matchmaking algorithm is modified so that matches are made between all of the ports in a ClassAd while the claiming protocol is modified so that all of the matches within a gang are verified during the process. If any matches are rejected during the claiming process, perhaps due to changes in state, the match is discarded for the entire gang and the matchmaking process restarts.

3.2.3 Set-Extended ClassAds

Another method of overcoming the bilateral matching problem is the set-extended ClassAd language proposed in [7]. Set-extended ClassAds extend regular ClassAds by allowing them to express both set expressions and individual expressions.



Individual expressions place constraints on each individual ClassAd in the set (e.g. disk space per resource), while set expressions place constraints on the entire ClassAd set (e.g. total disk space for all resources). This is done by extending the ClassAds language to support the aggregate functions Max, Min and Sum.

The matchmaking algorithm must be modified so that it attempts to construct ClassAd sets that satisfy both individual and set constraints. The algorithm proposed in [7] consists of two phases. In the first phase, individual ClassAds are filtered out based on the individual constraints specified in the request. In the second phase, the algorithm attempts to find ClassAds that meet the specified set requirements. Since each ClassAd must be evaluated in terms of all the others, it is quite inefficient to evaluate all possible combinations. The authors of [7] therefore suggest the use of a heuristic greedy algorithm for set construction.
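The two-phase idea can be sketched in a few lines of Python. This is an illustration under assumed attribute names (a per-resource Disk value and a total-disk set constraint) and an assumed greedy ordering; it is not the algorithm given in [7].

# Illustrative two-phase set construction (a sketch, not the algorithm of [7]).
def set_match(ads, individual_ok, set_ok, score):
    # Phase 1: keep only the ads that satisfy the per-ad (individual) constraints.
    candidates = [ad for ad in ads if individual_ok(ad)]
    # Phase 2: greedily add the best-scoring ads until the set constraint holds.
    candidates.sort(key=score, reverse=True)
    chosen = []
    for ad in candidates:
        chosen.append(ad)
        if set_ok(chosen):          # e.g. Sum(Disk) over the whole set
            return chosen
    return None                     # no satisfying set could be constructed

# Hypothetical request: every resource needs >= 5 GB free, 20 GB in total.
resources = [{"Name": n, "Disk": d} for n, d in
             [("a", 12), ("b", 4), ("c", 9), ("d", 7)]]
result = set_match(resources,
                   individual_ok=lambda ad: ad["Disk"] >= 5,
                   set_ok=lambda s: sum(ad["Disk"] for ad in s) >= 20,
                   score=lambda ad: ad["Disk"])
print([ad["Name"] for ad in result])   # prints ['a', 'c']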

3.3 Problem Solvers

Condor supports the idea of problem solvers: high-level constructs built on top of the Condor agent which provide a unique programming model for managing a large number of jobs [49]. Problem solvers are only concerned with the application-specific details of ordering and task selection; concerns about job failure are delegated to the Condor agent. The Condor system includes two problem solvers, Master-Works and the Directed-Acyclic Graph Manager, although more can be built using the public interfaces provided by the Condor agent.

3.3.1 Master-Works

The Master-Works (MW) problem solver is useful for solving problems for which the size is unknown and the workforce is large and unreliable. This model is well-suited for problems such as parameter searches, where large portions of the problem space may be examined independently, yet the progress is guided by intermediate results [49].

The MW model is shown in Figure 22. There are two main components in the model: a master process and a set of worker processes. The master process is in charge of directing the computation, while the worker processes perform the tasks of the computation. The master process consists of a work list, a tracking module and a steering module. The work list is a set of incomplete tasks that have yet to be executed. The tracking module keeps track of which tasks have been done and assigns incomplete tasks to worker processes. If a worker process fails, the tracking module returns the tasks assigned to it back into the work list. The steering module directs the computation by examining the results, modifying the work list and communicating with Condor to obtain a sufficient number of worker processes [49].
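A highly simplified sketch of this control flow is given below. It is illustrative only: the worker is modeled as an unreliable function rather than a Condor-managed process, and the names and failure behaviour are assumptions, not the Condor MW interface.

# Simplified Master-Works loop (illustrative; not the Condor MW interface).
import random
from collections import deque

def unreliable_worker(task):
    # Placeholder worker: fails about 20% of the time, otherwise squares the task.
    return None if random.random() < 0.2 else task * task

def run_master(initial_tasks, worker, steer=lambda results: []):
    work_list = deque(initial_tasks)          # incomplete tasks
    results = []
    while work_list:
        task = work_list.popleft()            # tracking module: hand out a task
        outcome = worker(task)
        if outcome is None:                   # worker failed: return task to the list
            work_list.append(task)
        else:
            results.append(outcome)
            work_list.extend(steer(results))  # steering module: may add new tasks
    return results

print(run_master(range(5), unreliable_worker))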


Figure 22: Condor Master-Works Problem Solver [49]

3.3.2 Directed-Acyclic Graph Manager

The Directed-Acyclic Graph Manager (DAGMan) problem solver is used for executing multiple jobs with dependencies expressed in a declarative form, as shown in Figure 23. A 'JOB' statement associates a file name (e.g. "a.condor") with an abstract name (e.g. "A") that describes a job. A 'PARENT - CHILD' statement describes the dependency relationship between two jobs. For example, in Figure 23, jobs B and C (the children) cannot start until after job A (the parent) has finished executing. The 'SCRIPT PRE' and 'SCRIPT POST' statements indicate commands to be run before and after a job executes. This can be used to validate the input or output of a job before moving on to the next stage. The 'RETRY' statement allows a job to be retried in case of failure.
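To make these statements concrete, a minimal DAG description along the lines of Figure 23 might look like the following; the submit file names and the PRE/POST scripts are hypothetical.

# A runs first, then B and C in parallel, then D.
JOB A a.condor
JOB B b.condor
JOB C c.condor
JOB D d.condor
PARENT A CHILD B C
PARENT B C CHILD D
SCRIPT PRE A prepare_input.sh
SCRIPT POST D check_output.sh
RETRY C 3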

Figure 23: Condor Directed-Acyclic Graph Manager [49]

3.4 Split Execution


Once a job is placed in an execution environment, it can run into a multitude of problems relating to missing libraries or files, firewalls, missing credentials, etc. One system might be aware of the user but not of the files the user needs, and vice versa. Only the execution system is aware of what file systems, networks, and databases may be accessed and how they must be reached, and only the submission system knows at run time what precise resources the job must actually be directed to [49]. These systems do not know in advance what names the job may find its resources under, since this is a function of location, time, and user preference [49].

Condor uses split execution in order to address these issues. Split execution involves the cooperation of a shadow, responsible for specifying everything the job needs at run time, and a sandbox, responsible for providing the job with a safe execution environment. The combination of a shadow and a sandbox is referred to as a universe. Condor provides several universes, such as the standard universe and the java universe. More information about these universes and how they operate can be found in [49]. Figure 24 shows an overview of the standard universe.

Figure 24: Condor Standard Universe [49]

3.5 Condor-G

Condor-G [28] is an infrastructure which makes use of grid technologies taken from the Globus toolkit. The goal of Condor-G is to allow users to harness resources that belong to multiple administrative domains as if they all belonged to a single administrative domain, by combining the standardized access and secure inter-domain protocols from Globus with Condor's mechanisms for job submission, job allocation, error recovery and execution environment setup [28].

The main component of the Condor-G system is the Condor-G agent, which allows a user to treat the grid as a local resource. It provides an API and command-line tools that allow the user to submit jobs, query a job's status, cancel a job, obtain information about job termination or any problems, and access detailed job execution logs. Once the agent receives a job submitted by the user, it stages the job's standard I/O and executable using Globus GASS, submits the job to a remote resource using Globus GRAM, handles job monitoring and failure recovery through Globus GRAM, and authenticates all requests via Globus GSI mechanisms.


Condor's implementation of the GRAM protocol is revised in order to make it more fault-tolerant. It uses a two-phase commit protocol during job submission in order to deal with lost requests and responses, and logs the details of all active jobs to stable storage on the client side so that they can be recovered in the event of local failure. The Condor-G agent also handles communication with the user concerning errors or unusual conditions, resubmits failed jobs, and stores state information for each submitted job in persistent storage in the scheduler's job queue to support restart in case of failure.
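The combination of two-phase submission and a client-side log can be sketched roughly as follows. This is only an illustration: the remote interface, the function names and the log format are invented placeholders, not the actual Condor-G or GRAM interfaces.

# Rough sketch of two-phase job submission with a durable client-side log
# (illustrative only; not the actual Condor-G/GRAM protocol).
import json

def submit_with_commit(remote, job, log_path="active_jobs.log"):
    # Phase 1: ask the remote resource to prepare the job; retry if the reply is lost.
    job_id = None
    while job_id is None:
        job_id = remote.prepare(job)
    # Record the job durably before committing, so a client crash can be recovered.
    with open(log_path, "a") as log:
        log.write(json.dumps({"job_id": job_id, "state": "prepared"}) + "\n")
    # Phase 2: commit; the remote side discards jobs that are never committed.
    remote.commit(job_id)
    with open(log_path, "a") as log:
        log.write(json.dumps({"job_id": job_id, "state": "committed"}) + "\n")
    return job_id

class FakeResource:
    # Stand-in for a remote GRAM-like service; always succeeds.
    def __init__(self):
        self._next_id = 0
    def prepare(self, job):
        self._next_id += 1
        return self._next_id
    def commit(self, job_id):
        pass

print(submit_with_commit(FakeResource(), {"executable": "run_sim"}))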

Figure 25: Condor-G Architecture

The authors of [28] suggest the use of a personal resource broker in order to determine the resource(s) on which to run a user-submitted job, although the current method is to provide a list of user-supplied GRAM servers. The broker would combine information about user authorization, resource status and application requirements in order to build a list of candidate resources [28].

4 Legion

Legion [13] is an object-oriented metasystem developed at the University of Virginia that allows heterogeneous, geographically dispersed resources to interact with one another. It makes an entire system appear as a single virtual machine. Objects in the Legion framework represent components of a grid. Each object is defined and managed by its Class or Meta-Class object, and the objects are arranged in a hierarchy. Legion does not mandate any particular programming model or language; it only defines APIs for object interaction. Thus, interactions can be implemented in a variety of different ways using different programming models and languages. Legion has built-in support for resource reservation and mechanisms for load control on host systems. The system provides a default scheduler which can be outperformed by schedulers with application-specific knowledge [13].


Scheduling policy can be extended through the use of resource brokers. Metasystems such as Legion can support many different types of layering schemes, as shown in Figure 26.

Figure 26: Legion Layering Schemes [13]

In the layering scheme shown in part (a), applications do everything and are responsible for negotiating directly with the resources and making placement decisions [13]. In part (b), the application must negotiate with resource management services in order to obtain resources, but still makes its own placement decisions [13]. In part (c), the application interacts with a combined placement and negotiation module, while in part (d), each of these functions is contained in a separate module [13].

4.1 Core Objects

Class objects have two main purposes in Legion: they define types for their instances and act as managers for these instances. The objects required for Legion to function properly are the LegionClass, HostClass and VaultClass objects. The LegionClass is a generic object much like Java's Object class; every other object is a derivative of this class. The HostClass encapsulates a machine's capabilities and is responsible for managing instantiation on that machine. It also grants reservations for future services and contains mechanisms for defining event triggers. The VaultClass is a generic storage abstraction used to store the persistent state of an object.

Figure 27: Legion Core Object Hierarchy [13]


Legion also includes several service objects which are used for improving system performance, but these objects are not part of the core functionality of the system and are not required for the system to function properly.

Legion objects are required to be associated with a Vault in order for them to be able to execute. The Vault stores an Object Persistent Representation (OPR) of the object (its persistent state). The OPR is also used for the purposes of migration, shutdown and restart.

4.2 Resource Management Infrastructure

Legion's resource management philosophy is based on negotiation between agents acting on behalf of the producer and consumer of a resource. Users request blocks of time on resources, whereas administrators regulate the control of and access to these resources and enforce usage and security policies [13]. This section presents the resource management infrastructure described in [13], which is based on layering scheme (c) of Figure 26.

Figure 28: Legion Resource Management Model [13]

4.2.1 Hosts and Vaults

Basic resources in Legion consist of Hosts and Vaults. Hosts are meant to represent actual machines and implement functions for reservation management, object management and information retrieval. Hosts also support event triggers, which are guarded statements that execute when conditions on the guard evaluate to boolean true. Hosts populate their attributes with information about their current state, architecture, operating system, etc. This information is periodically refreshed and the attributes are repopulated with current data. Vaults are used in order to store persistent state information regarding the Host objects.
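An event trigger can be pictured as a (guard, action) pair that the Host evaluates against its own state. The sketch below is purely illustrative of that idea; the class, attribute names and threshold are assumptions, not Legion's actual interface.

# Illustrative guarded event trigger (not Legion's actual interface).
class Trigger:
    def __init__(self, guard, action):
        self.guard = guard      # predicate over the host's current state
        self.action = action    # executed when the guard evaluates to true

    def evaluate(self, host_state):
        if self.guard(host_state):
            self.action(host_state)

# Hypothetical trigger: warn when the host's load average gets too high.
trigger = Trigger(guard=lambda s: s["load_avg"] > 5.0,
                  action=lambda s: print("high load on", s["name"]))
trigger.evaluate({"name": "durendal", "load_avg": 6.2})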


4.2.2 Collection

The Collection is a repository of state-related resource information. Collections may retrieve information directly from the resources themselves or may share information with other Collections. Sharing information with other Collections allows for greater scalability. Agents query the Collection to obtain information about a resource through the use of a specialized query language. At the time [13] was written, Collections were passive and only contained static information.

4.2.3 Scheduler

The Scheduler obtains resource information from the Collection and uses this information in order to compute mappings from objects to resources. A mapping consists of a (Class LOID, (Host LOID, Vault LOID)) pair which indicates on which Host and Vault the Class should be instantiated. Once these mappings are determined, they are passed to the Enactor (see section 4.2.4) as schedules for verification and instantiation. Each schedule consists of a master schedule, which may be accompanied by a list of alternate schedules.

4.2.4 Enactor

The Enactor verifies resource mappings passed to it by the Scheduler and performs instantiation for these mappings. If all of the mappings in the master schedule succeed, the scheduling is complete. Instantiation is held off until the Scheduler calls a method that tells the Enactor to perform the instantiation on the reserved resources. If any of the mappings fail, an alternate schedule for the failed mapping is selected and verified. Mappings can fail if the resources on which the object should be instantiated are unavailable.
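The verification and fall-back behaviour of the Scheduler/Enactor pair can be sketched as follows. The data shapes, the reserve() call and the sample LOIDs are illustrative assumptions rather than Legion's actual API.

# Sketch of schedule verification with alternates (illustrative; not Legion's API).
# A schedule is a list of mappings: (class_loid, (host_loid, vault_loid)).
def enact(master_schedule, alternates, reserve):
    # Try the master schedule first, then each alternate, and return the first
    # schedule whose mappings can all be reserved on the target resources.
    for schedule in [master_schedule] + alternates:
        if all(reserve(class_loid, host, vault)
               for class_loid, (host, vault) in schedule):
            return schedule        # reservations held; instantiation can proceed
    return None                    # every candidate schedule had a failed mapping

# Hypothetical mappings; reserve() fails for host "h2", so the alternate is chosen.
master = [("classA", ("h1", "v1")), ("classB", ("h2", "v1"))]
alternate = [("classA", ("h1", "v1")), ("classB", ("h3", "v2"))]
print(enact(master, [alternate], reserve=lambda c, h, v: h != "h2"))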

4.2.5 Execution Monitor

Application execution can be monitored through Legion's event-based notification mechanisms. The Enactor has the ability to register out-calls to Host objects, which are performed when a trigger's guard evaluates to true. These out-calls allow progress to be monitored and problems to be detected.

4.3 Job Handling

Job handling in Legion is currently accomplished via a set of command-line tools that allows programs to register themselves with the Legion system (create class objects and register implementation with them) and start program instances [44]. The procedure for running a Legion-enabled program is quite simple: the class associated with the program is instructed to create an instance of the class which executes the program code [44]. Running a non-Legion program involves a more complex procedure; a BatchQueueClassObject metaclass must be used in order to create a specialized kind of class object which acts as the class object for the non-Legion program [44].


JobProxyObjects are used in order to manage the execution of a program. A JobProxyObject is created whenever the class associated with a program is instructed to create an instance of itself. Upon execution, the JobProxyObject forks and executes the binary associated with the program.

There are several disadvantages to the current approach to job handling, as pointed out in [44]. One JobProxyObject is required for each running job, thus increasing the overall resource requirements for each job. There is no method of monitoring or restarting a job, and there is no control over the total number of jobs executing simultaneously.

4.4 Architectural Characteristics

· Object-Based

o Legion subscribes to an object-oriented design philosophy. This allows it to benefit from increased modularity (component complexity is contained within a single object), extensible functionality (ability to extend or re-implement base Classes) and added security (at the Object-level) [13]. However, the infrastructure does not require grid users to conform to an object-oriented design nor does it require applications to be Legion-aware.

· Naming and Transparency

o Legion provides a single namespace for all of its components. Each object in Legion has a name and a unique identifier associated with it. An object can be queried and can also be requested to perform services once its name and interface are known to others.

· Service – Policy vs. Mechanism

o Legion subscribes to the philosophy that mechanisms can be mandated but policies cannot. For example, Legion provides mechanisms for constructing a scheduler but does not mandate scheduling policy or require a single scheduler for the entire grid [13].

· Security

o Legion uses public key security mechanisms for authentication and access control lists for authorization. It does not require a central certificate authority since each object's LOID contains its public key. Before a method is invoked on an object, the protocol stack associated with the object calls the security layer in order to ensure proper permissions are available.

· Extensibility

o Legion's object-oriented design allows for greater extensibility by allowing specialized objects to be constructed from base objects. This allows new features and functionality to be added to base objects. For example, Host objects can be specialized to exploit native operating system functionality.

· Interfaces


o A wide variety of interfaces are supported by Legion, including command-line tools and programmatic interfaces.

· Integration

o Legion provides each grid it manages with a global file system called a context space. The context space is similar to a traditional file system except that its components are distributed. Directories within this file system, called contexts, may contain any Legion object (i.e. other contexts, files, applications, etc.). Context spaces can be accessed through various interfaces such as command-line, programmatic, NFS, Samba, FTP and Web interfaces.

5 Nimrod

Nimrod [1] is a software system whose purpose is to manage the execution of parametric studies across distributed computers [51]. It provides the facilities to create, execute, monitor and collect the results of individual experiments. Nimrod requires experiments to be described through the use of declarative plan files. These files contain the parameters and the commands necessary to perform the work. Nimrod uses this information in order to transfer the necessary files and schedule the work on the first available machine [51]. A sample plan file is shown below.

Figure 29: Sample Nimrod Plan File [4]

The plan file is processed by a generator tool, which allows the user to choose values for the parameters specified in the file. Once this is done, the generator builds a run file that is processed by a dispatcher tool.


The dispatcher is responsible for managing the computation across the nodes on which the computation is scheduled to run.

Nimrod does not operate well in a grid environment due to several limitations. It operates on a static set of resources and has no provision for dynamic resource discovery, it does not understand the concept of user deadlines, and it has no support for a variety of access mechanisms [4]. Nimrod-G (see section 5.1) was developed in order to address these shortcomings in a grid environment.

5.1 Nimrod-G

Nimrod-G [3] is a version of Nimrod designed to run in grid environments. It takes advantage of certain features provided by the Globus toolkit, such as automated discovery of allowed resources, and uses a model of computational economy as part of the Nimrod-G scheduler [51].

5.2 System Architecture

The Nimrod-G architecture consists of five main components: client/user, parametric engine, scheduler, dispatcher and job wrapper. The architecture is very flexible and can be integrated with grid-middleware provided by systems such as Globus and Condor.

Figure 30: Nimrod-G Architecture [3]

5.2.1 Client/User

The client is an interface for controlling and supervising an experiment under construction [3]. It allows users to modify parameters related to time and cost, which influences how the scheduler selects the resources on which to run the experiment [3].


The client also allows the user to control, monitor and check the status of jobs. The client is not bound to any particular site; the user can shut the client down and start it on another system without affecting the outcome of the experiment. It is also possible to have many clients connected at once, each controlling or monitoring the same experiment.

5.2.2 Parametric Engine

The parametric engine acts as a central coordinator for the experiment. It is responsible for the parameterization of the experiment, the actual creation of jobs, the maintenance of job status, and interaction with the clients, schedule advisor and dispatcher [3]. The engine also ensures that the state of the experiment is stored in persistent storage so that it can be recovered in the event of failure.

5.2.3 Scheduler

The scheduler is responsible for resource discovery, resource selection, and job assignment. It uses a resource selection algorithm based on a model of computational economy, which selects the resources that meet the deadline while minimizing the cost associated with the computation [3]. In the basic Nimrod model, once tasks are started they do not communicate with one another; scheduling is simply a problem of finding suitable resources and executing the experiment [4].
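A very rough sketch of this deadline-and-cost selection idea is given below. The throughput and price attributes are invented for illustration; the actual Nimrod-G scheduling algorithms are considerably more involved.

# Rough deadline/cost selection (illustrative; not the actual Nimrod-G scheduler).
def select_resource(resources, jobs_remaining, deadline_hours):
    # Keep the resources that can finish the remaining jobs before the deadline,
    # then pick the cheapest of those.
    feasible = [r for r in resources
                if jobs_remaining / r["jobs_per_hour"] <= deadline_hours]
    if not feasible:
        return None
    return min(feasible, key=lambda r: r["cost_per_job"])

# Hypothetical resources advertising a throughput and a per-job price.
resources = [{"name": "fast-expensive", "jobs_per_hour": 40, "cost_per_job": 5},
             {"name": "slow-cheap",     "jobs_per_hour": 10, "cost_per_job": 1}]
print(select_resource(resources, jobs_remaining=100, deadline_hours=4))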

5.2.4 Dispatcher

The dispatcher is responsible for initiating the execution of a task on a selected resource and updating its status to the parametric engine. This is accomplished by starting a job wrapper on the selected resource.

5.2.5 Job Wrapper

The job wrapper is responsible for staging the tasks and data associated with an experiment, executing the tasks and sending the results back to the parametric engine via the dispatcher [3].

5.3 Scheduling and Computational Economy

Resource selection can be handled in two different ways when using a model of computational economy. One method is to allow the system to work on behalf of the user and attempt to complete the assigned work within a given deadline and cost [3]. Another method is to allow the user to enter into a contract with the system. The contract specifies what the user is willing to pay for the resources if the work can be completed within a given deadline. Using this method, the system can identify a set of viable resources through the use of resource reservation or trading mechanisms. If the user is satisfied with the contract, it can be accepted. Otherwise, the contract can be re-negotiated by changing the deadline and/or cost constraints. The latter method is advantageous because it allows the user to know whether the work can be completed within the given deadline and cost constraints before the work actually starts.


However, it requires grid middleware services for resource reservation, broker services for cost negotiation, and an underlying system which has a management and accounting infrastructure [3].

The Grid Architecture for Computational Economy (GRACE) [2] is an economy-based architecture which provides these missing components and is generic enough to accommodate different economic models. The GRACE framework provides services that help resource providers and resource consumers maximize their goals. Resource providers can use GRACE mechanisms to define their charging and access policies, and the GRACE trader works according to these policies [2]. Resource consumers define their requirements through the use of resource brokers, which use GRACE services for resource trading and for identifying resource providers that meet their needs. A detailed discussion of the GRACE architecture and the services it provides can be found in [2].

6 Application Level Scheduling

The Application Level Scheduling (AppLeS) project [9] is aimed at developing scheduling agents for applications that run on production Grids. AppLeS uses an application-centric approach whereby it is more important to promote the performance of an individual application than to optimize the use of system resources or maximize the throughput of a collection of jobs [9]. The AppLeS framework uses both dynamic and static resource-related information when configuring and selecting amongst viable resources. It also uses the Network Weather Service to monitor changes in performance. In order for an application to take advantage of the AppLeS framework, it must contain an embedded AppLeS agent. AppLeS includes a scheduler that performs the mapping of jobs to specified resources, but local schedulers are still responsible for executing the application units [9].

6.1 Resource Management Architecture

An AppLeS agent consists of a coordinator, which is composed of four main subsystems:

· The resource selector chooses and filters different resource combinations for the application's execution [9].

· The planner generates a resource-dependent schedule for a given resource combination [9].

· The performance estimator generates a performance estimate for candidate schedules according to the user's performance metric [9].

· The actuator implements the best schedule on the target resource management system [9].

These subsystems are fed by an information pool consisting of a network weather service, a user interface and a models repository. The network weather service provides dynamic information about the state of the system and provides resource load forecasting services. The user interface provides information about the application's characteristics, the user's performance criteria and the user's constraints. The models repository contains models that can be used for performance estimation, planning and resource selection.
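The interaction between these subsystems can be pictured with the sketch below; the subsystem interfaces and the load-based estimate are placeholders invented for illustration, not the actual AppLeS agent API.

# Sketch of an AppLeS-style coordinator (illustrative; not the actual agent API).
def coordinate(resource_selector, planner, estimator, actuator, info_pool):
    best_schedule, best_estimate = None, None
    # Consider each candidate resource combination supplied by the resource selector.
    for combination in resource_selector(info_pool):
        schedule = planner(combination, info_pool)             # resource-dependent plan
        estimate = estimator(schedule, info_pool)              # user's performance metric
        if best_estimate is None or estimate < best_estimate:  # assume lower is better
            best_schedule, best_estimate = schedule, estimate
    return actuator(best_schedule)                             # hand off to the local RMS

# Hypothetical, trivially small stand-ins for the four subsystems.
info = {"hosts": {"h1": 2.0, "h2": 1.0}}                       # host -> predicted load
print(coordinate(
    resource_selector=lambda pool: [["h1"], ["h2"], ["h1", "h2"]],
    planner=lambda combo, pool: combo,
    estimator=lambda plan, pool: sum(pool["hosts"][h] for h in plan) / len(plan),
    actuator=lambda plan: ("submitted", plan),
    info_pool=info))                                           # ('submitted', ['h2'])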


Figure 31: AppLeS Resource Management Architecture [9]

7 Discussion and Experiences

So far, this paper has presented the descriptions of selected grid technologies. This section examines how these technologies relate to the generic system characteristics of usability, adaptability, scalability and reliability and identifies some of the features associated with the technologies themselves.

Usability involves whether the documentation associated with the system is complete and up-to-date, the difficulty involved in installing and configuring the software, how the system interacts with the end-user and whether or not research in the technology is still ongoing. Adaptability relates to how a system fares in a heterogeneous and dynamic environment. For example, can the system handle dynamic information or can it only operate on static information? Adaptability also encompasses whether it is possible to modify the system software, how easy it is to do so, and how well the system integrates with software mechanisms already in place (i.e. scheduling, resource brokering, security, etc.). Scalability involves how well a system scales in an increasingly larger environment as more resources and system components are added and higher demands are placed on the system. Reliability relates to a system's robustness and fault-tolerance.


7.1 Globus

Globus is not intended to be an “all-in-one” grid solution. Rather, it can be considered to be grid-middleware; it lets administrators pick and choose what components to use according to their system needs. For example, projects such as Condor-G and Nimrod-G make use of Globus for its GRAM and MDS components. The Globus community provides a wide variety of well maintained, up-to-date documentation on their website which includes toolkit installation and configuration instructions, an overview of system components along with their descriptions, an API reference, and a substantial number of research publications. Third-party documentation is also available from sites such as IBM DeveloperWorks grid computing [20]. The software for the toolkit is available in binary and source distributions and can be installed on machines running UNIX, Microsoft Windows or Linux operating systems (however the version for Microsoft Windows is no longer kept up-to-date). Installation and configuration of the software is a long and laborious process which requires the help of a system administrator due to certain tasks requiring high-level system privileges. Once the software is installed and set up, users interact with the grid through a variety of interfaces. For example, jobs are submitted to the grid and their status can be queried via command-line tools. Applications are not required to be grid-aware for the most part, but source code modifications are required in order to take advantage of certain functionality (such as callbacks). Applications are fully capable of interacting with the grid through programmatic interfaces. Even though Globus provides uniform, easy-to-use interfaces that present abstractions of both resources and services, end-users and resource providers are required to be familiar with RSL in order to describe resources and job requests. The Globus technology is still in its developmental stages and continues to receive research contributions from both academic and commercial organizations.

Globus is a very adaptable framework. First and foremost, Globus itself is simply a collection of protocol definitions, APIs and service behavior descriptions. The toolkit, which is open-source, is just an implementation of these. Components can be modified and re-implemented using any programming model or language desired as long as the defined conventions are followed. Globus does not provide any resource brokering or scheduling capabilities. As such, Globus can operate with scheduling systems such as PBS [55], LoadLeveler [42], Condor, LSF [54] and NQS [50], and requires the use of external resource brokering entities. Grid authentication and authorization mechanisms map to local mechanisms, which allows Globus to operate in an environment where several different types of schedulers, resource brokers and security mechanisms co-exist with one another. Services in the Globus framework are designed in such a way that it is possible to create higher-level services from a combination of lower-level services, allowing the system's functionality to be expanded as new requirements arise. The MDS service is fully capable of handling dynamic resource information.

Although many of the technologies developed by Globus are standards-related, it is still possible to talk about the scalability of the system in terms of the services that have been developed. The MDS information service scales well in an increasingly larger environment due to its hierarchical nature. Problems in scalability can arise on the host running the gatekeeper process when a large number of jobs are submitted to it.


The problems occur because each submitted job results in the creation of a jobmanager instance. Each jobmanager instance forks a Perl shell process every ten seconds which probes the local scheduler about the status of the job. This causes the load average on the host machine to skyrocket, leads to heavy swapping, and bombards the local scheduler with potentially thousands of job status requests per minute.

The Globus architecture proves to be quite reliable. The system is organized in a cell-like structure. That is, many different sub-grids (or “cells”) can be set up with core services such that each cell is its own grid and can act in an independent fashion, but can also communicate and collaborate with other cells in order to appear as a larger grid entity. If one of these cells goes down, the failure is unlikely to affect any of the other cells. Globus also includes replica location services which allow replicas to be created and strategically placed in order to increase the overall system's reliability, scalability and performance. Grid-enabled applications can be embedded with callback mechanisms which are called by job monitoring services when certain conditions occur in the grid or the application. This enables corrective actions to be taken by the application when problems occur.

7.2 Condor-G

Condor-G is an all-in-one grid solution, meaning that it contains all of the services and mechanisms necessary to support a computational grid (although it does rely on the Globus toolkit). The Condor website provides well maintained and up-to-date documentation that gives a general overview of the system and its architecture, provides a manual for the installation, configuration and use of the software, and includes a wide variety of research documents and publications. The Condor and Condor-G software are available for download from the Condor website. While Condor software is available for just about every platform and operating system, Condor-G is only available for Linux, Solaris, Digital UNIX and IRIX. The installation and configuration of Condor-G has a dependency on both the Globus toolkit and Condor; it requires that both of these systems be installed and configured before it can be installed and configured itself. Interaction with Condor-G is done via the condor_submit command, used to submit jobs for execution. This command requires a job description file as input that includes information such as the executable name, the arguments to give, the name of an output file, etc. In the software's current iteration, transfer of the job executable is in the hands of the end user and is done through the use of Globus GASS. Condor-G currently has several limitations. There is no checkpointing, no matchmaking, limited file transfer (it can only transfer the job's executable, stdin, stdout and stderr), no job exit codes and limited platform availability. Once matchmaking is implemented in Condor-G, users and resource providers will be required to know the ClassAds language in order to describe jobs, resources and services. The Condor research community is quite active and continues to provide research contributions to the project.


The Condor system performs quite well when it comes to adaptability. The framework is flexible, the ClassAds language is extensible, and the software is open-source. These characteristics, along with the fact that Condor provides both scheduling and resource-brokering mechanisms, are among the reasons why many grid research projects and papers make use of the Condor system. Since Condor provides its own scheduling and resource brokering mechanisms, it is not designed to integrate with existing mechanisms that differ from its own. The Condor Collector, which is queried by the scheduler in order to find out resource information, is capable of handling dynamic resource information. In addition, if ClassAds are matched based on stale information, the claiming process ensures that both parties (resource consumer and producer) use up-to-date information before proceeding any further. Jobs submitted through Condor-G can run in environments in which they would not normally run by using Condor's GlideIn mechanism; system calls which require a certain level of permissions on the host system can be redirected to a designated system which executes these calls on the user's behalf. In the future, Condor-G will fully support job migration services, allowing the grid to adapt to situations of high load by migrating jobs from areas with high activity levels to ones with lower levels.

Condor-G addresses the Globus Gatekeeper and Jobmanager scalability problems through the use of a grid monitor process. One grid monitor process is submitted to the Gatekeeper per user and is responsible for monitoring the user's jobs in the local scheduler. Whenever Globus reports that a job is pending, the monitor stops the Jobmanager instance for that job. The Jobmanager is only restarted once Globus reports the job is running, or, if no real-time streaming of stderr and stdout is needed, the Jobmanager is only restarted once the job completes in order to retrieve its output. Although this is not a complete solution, it does reduce the effect of the problems mentioned in section 7.1. For example, if 1000 jobs are submitted and only 100 are running, then only 100 jobmanager instances will exist instead of 1000. Condor can encounter problems with scalability because it requires that one machine be designated as a central manager responsible for matchmaking and acting as an information repository. As the size of the pool increases, the performance of the system can decrease. This is because more updates must be sent to the manager (updating resource information), thus increasing network traffic, and because the performance of the matchmaking algorithm degrades (the more ClassAds there are, the more ClassAds must be evaluated against one another in order to find matches).

Condor-G provides elements of reliability through its built-in mechanisms meant to deal with four types of failure: crash of the Globus Jobmanager, crash of the machine that manages the remote resource, crash of the machine on which the Gridmanager is executing (or crash of the Gridmanager itself), and failures in the network connecting two machines [28]. However, Condor-G has a few weaknesses when it comes to the issue of reliability. If the machine designated as the central manager goes down, matching can no longer occur and the Condor tools stop functioning altogether. Also, if GlideIn mechanisms [28] are used, mobile sandboxing techniques require that the originating machine be available at all times in order to service trapped system calls that are redirected to it.


If the originating machine is unavailable and a system call is redirected to it, the job can no longer proceed.

7.3 Legion

Legion is an all-in-one grid solution. As such, it contains all of the services and mechanisms necessary to support a computational grid without the need for external software support. Documentation for the Legion system is available from Legion's website at the University of Virginia and includes various tutorials, an overview of Legion and its components, research publications, and manuals for end-users, system administrators and developers. Rights to the Legion project were purchased by a private company, Avaki Corp., in 2001. As such, research contributions for the project stopped around 2001 and the software is no longer available for download to the public. From the information gathered from the on-line manuals, end-users and system administrators interacted with Legion through a multitude of command-line tools, whereas developers were able to develop their own objects which could interact with the system as long as the objects conformed to the interface specifications.

Legion's flexible and extensible object model helps contribute to the system's overall adaptability by allowing programmers to develop new methods of computation as Legion's requirements change over time. In addition to this, Legion does not mandate any particular programming model or programming language and does not require applications to be grid-aware in order to run properly as part of the grid. Since the software is no longer available for public download, working with it or making any modifications to it is an impossibility. Legion was designed to work with queuing systems such as PBS, LSF, LoadLeveler and NQS. It also provides a generic default scheduler, but was built with the idea that schedulers with application-specific knowledge would be developed and used in its place. As of the writing of [13], Legion's Collection was a passive database of static information and could not handle dynamic resource information.

Legion provides some elements in its design which help with the issue of scalability. The system was designed with the idea that resources would be owned and controlled by different organizations. As such, resources that fall into this category do not pose any particular problems. Legion provides a persistent, global namespace that makes it easier for applications that span multiple sites to name and locate objects contained within the grid. In addition, Legion Collections are able to share data with other Collections, thus combining data and forming a hierarchical structure. This allows resource information services to scale as the environment grows.

Legion attempts to address reliability through a variety of methods. It attempts to increase fault tolerance in communication by having each communication pass through a protocol stack which includes a variety of methods for authorization, encryption, retransmission, etc. A draft paper [36] written after the commercialization of Legion identifies several lessons learned as Legion evolved from an academic to a commercial project. Although no specific details are provided, the paper indicates that Legion took special care in handling both failure and timeout cases involving services and components and also included some type of event/exception management system.


The paper [36] also goes on to identify a problem where Legion is unable to handle devices that disconnect from and reconnect to the network, perhaps due to a change in the device's IP address. In order to address this problem, objects are required to periodically check their IP addresses and must re-register themselves with their classes if any changes in IP are detected.

7.4 Nimrod-G

Nimrod-G is a complete grid solution and does not rely on external software (except for optionally using Globus for its MDS and GRAM components). The Nimrod website contains a variety of research publications and a link to the Nimrod tools page, which contains installation instructions, tutorials and a user and system manual (however, finding this link is non-intuitive; it requires the user to click on a link to the software download page). Almost all research publications relating to Nimrod-G mention that the system relies on Globus for its GRAM and MDS components. However, the installation instructions for the software state that it is possible to install and run Nimrod-G without the use of Globus, although they do not state the consequences of this action. The software can be installed on Alpha, Solaris and Linux architectures. In any case, once the software is up and running, the user interacts with the grid through the Nimrod-G agent using command-line tools. Users are required to know how to construct plan files (see Section 5) in order to be able to submit jobs to the grid. Research within the Nimrod-G community is still ongoing.

In terms of adaptability, Nimrod-G is able to handle dynamic resource information due to its use of the Globus information services (MDS). The software is available in open-source form, and the system uses an extensible, application-oriented scheduling policy and can thus be modified in order to suit different needs. Nimrod-G provides its own scheduling and resource brokering mechanisms, and was not designed to integrate with existing services and mechanisms.

The Nimrod-G system is scalable due to its cell-like organization. Much like Globus, independent sub-grids (cells) can be set up to represent different markets and can communicate and collaborate with other cells in order to appear as a much larger grid. Much like real world markets, these grid cells trade resources with one another based on the laws of supply and demand.

Nimrod-G can run into problems when it comes to reliability. For example, how can the price of resource access be determined in a fair manner, and how can prices be regulated? What mechanisms are in place to stop an application from dominating the use of a resource if the application's owner is willing to pay any price for access to it? This type of behavior can lead to the starvation of jobs that are not willing to pay a high price for resource access, essentially driving their applications out of the marketplace. Also, if the economic model is driven by the laws of supply and demand, resources in high demand might not be affordable to certain entities. When users are paying for timely access to resources, what happens in the situation where system load is high enough to cause job execution delays?


Does this cost extend to the user? Although a system which uses a model of computational economy might work extremely well in a simulated, controlled or academic environment, a lot of issues have to be addressed before such a system can be used reliably in a production environment.

7.5 AppLeS

AppLeS is not a complete grid solution and requires an underlying resource management system such as Globus or Legion in order to operate. Documentation for AppLeS is available at the Grid Research and Innovation Laboratory (GRAIL) website. The site includes a few AppLeS related research publications and some poorly organized and incomplete documentation for the installation, configuration and use of the AppLeS software. The software is available for a variety of platforms including Linux, AIX, Irix, Mac OS X and Solaris. Installation and configuration of AppLeS requires that the underlying resource management system be set up in advance. Since AppLeS is an application-level scheduler, the user must set up an agent for each application designated to run on the grid. The user does this by providing the AppLeS agent with information about the application such as performance criteria, execution constraints, login information, etc. via the user interface. Research contributions to the AppLeS project are sparse and the community does not seem to be very active.

AppLeS is quite an adaptable software system. The software is open source and available for public download and therefore can be changed and modified to suit different needs. AppLeS is designed to work with a variety of different resource management systems and works with just about any type of application. It is able to handle dynamic resource information as long as the underlying resource management system supports it.

Scalability and reliability are characteristics not very relevant to the AppLeS system itself since AppLeS is just an application level scheduler. However, the characteristics certainly apply to the underlying resource management system and their effects on the overall system vary according to which underlying system is used.

8 Conclusion

This paper gave an overview of five selected grid technologies and provided an analysis of each based on the generic system characteristics of usability, adaptability, scalability and reliability. This provided the reader with information relating to the costs and benefits associated with the use of each technology. The Globus, Condor-G and Nimrod-G projects are still in stages of research and development, while work on the AppLeS project seems to have slowed over the years. Academic contributions to the Legion system stopped after the system was purchased by a company involved in the commercial sector in 2001. Many issues have yet to be solved by the grid computing community. For example, the growth of grids will eventually require the development of a meta-scheduler in order to manage jobs running on different grids. It will be interesting to see how these issues are dealt with in the future and which technologies will become dominant forces. As of today, development of grid technologies within the Globus community is occurring at an accelerated pace due to its support from both academic and commercial communities, likely because many companies plan on capitalizing on grid technologies and would like to see them integrated into production environments as soon as possible.


That being said, grids in their current state are not the "ultimate solution" for vastly improving application performance. Some applications cannot be parallelized, while others might require too much work to grid-enable. In addition, the configuration of a grid has a vast effect on the infrastructure of an organization when considering factors such as performance, reliability and security. All of these factors, including the direction grid computing will take in the future, must be taken into account when deciding whether or not it is beneficial to set up a grid.

9 Bibliography

1. Abramson D., Giddy J., Hall B. and Sosic, R. Nimrod: A Tool for Performing Parametised Simulations using Distributed Workstations. The 4th IEEE Symposium on High Performance Distributed Computing, Virginia, August 1995. <http://www.csse.monash.edu.au/%7Edavida/papers/nimrod.pdf>

2. Abramson, D., Buyya, R, and Giddy, J. A Case for Economy Grid Architecture for Service Oriented Grid Computing. 10th Heterogeneous Computing Workshop April 23, 2001 in conjunction with IPDPS in San Francisco, California. <http://www.csse.monash.edu.au/%7Edavida/papers/ecogrid.pdf>

3. Abramson, D., Buyya, R., and Giddy, J. Nimrod-G: An Architecture for a Resource Management and Scheduling System in a Global Computational Grid. Proc. of the 4th International Conference on High Performance Computing in Asia-Pacific Region, 2000. <http://www.csse.monash.edu.au/~davida/papers/hpcasia.pdf>

4. Abramson, D., Giddy, J. and Kotler, L. High Performance Parametric Modeling with Nimrod/G: Killer Application for the Global Grid? International Parallel and Distributed Processing Symposium (IPDPS), pp. 520-528, Cancun, Mexico, May, 2000. <http://www.csse.monash.edu.au/%7Edavida/papers/ipdps.pdf>

5. Allcock, W. GridFTP Protocol Specification (Global Grid Forum Recommendation GFD.20). March, 2003. <http://www.globus.org/research/papers/GFD-R.0201.pdf>

6. Amir, A., Armstrong, J., Berstis, V., Bieberstein, N., Bing-Wo, R., Ferreira, L., Hernandez, O., Kendzierski, M., Magowan, J., Murakawa, R., Neukoetter, A. and Takagi, M. Introduction to Grid Computing with Globus. IBM Redbook, 2004. <http://publib-b.boulder.ibm.com/abstracts/sg246895.html?Open>

7. Angulo, D., Foster, I., Liu, C. and Yang, L. Design and Evaluation of a Resource Selection Framework for Grid Applications. Proc. of IEEE International Symposium on High Performance Distributed Computing (HPDC-11), Edinburgh, Scotland, July, 2002. <http://www.globus.org/research/papers/RS-hpdc.pdf>

8. Autonomic Computing, IBM, <http://www.research.ibm.com/autonomic/>

9. Berman, F. and Wolski, R. The AppLeS project: A status report. Proc. of the 8th NEC Research Symposium, 1997. <http://www.cs.ucsd.edu/groups/hpcl/apples/pubs/nec97.ps>

10. Bester, J., Foster, I., Kesselman, C., Tedesco, J. and Tuecke, S. GASS: A Data Movement and Access Service for Wide Area Computing Systems. Sixth Workshop on I/O in Parallel and Distributed Systems, May 5th, 1999. <ftp://ftp.globus.org/pub/globus/papers/gass.pdf>

11. Brittenham, P. An Overview of the Web Services Inspection Language. DeveloperWorks SOA and Web Services, IBM, 2001. <www.ibm.com/developerworks/webservices/library/ws-wsilover/>

12. CERN Particle Physics Center, 2004. <http://public.web.cern.ch/Public/Welcome.html>

13. Chapin, S., Grimshaw, A., Karpovich, J. and Katramatos, D. The Legion Resource Management System. Proc. of the 5th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP '99), in conjunction with the International Parallel and Distributed Processing Symposium (IPDPS '99), April, 1999. <http://www.cs.virginia.edu/~legion/papers/legionrm.pdf>

14. Chatarii, J. Introduction to Service-Oriented Architecture. Developer Shed, 2004. <http://www.devshed.com/c/a/Web-Services/Introduction-to-Service-Oriented-Architecture-SOA/>

15. Christensen, E., Curbera, F., Meredith, G. Weerawarana, S. Web Services Description Language (WSDL) 1.1. W3C, Note 15, 2001.

16. Condor Project, University of Wisconsin. <http://www.cs.wisc.edu/condor/>

17. Czajkowski, K., Foster, I. and Kesselman, C. Resource Co-Allocation in Computational Grids. Proc. of the Eighth IEEE International Symposium on High Performance Distributed Computing (HPDC-8), pp. 219-228, 1999. <http://www.globus.org/research/papers/paper3.pdf>

18. Czajkowski, K., Foster, I., Frey, J., Graham, S., Kesselman, C., Maquire, T., Sandholm, T., Snelling, D., Tuecke, S. and Vanderbilt, P. Open Grid Services Infrastructure (OGSI) ver 1.0. Global Grid Forum GridForge. <https://forge.gridforum.org/projects/ogsi-wg>

19. Czajkowski, K., Foster, I., Karonis, N., Kesselman, C., Martin, S., Smith, W., and Tuecke, S. A Resource Management Architecture for Metacomputing Systems. Proc. Of the IPPS/SPDP '98 Workshop on Job Scheduling Strategies for Parallel Processing, pp. 62-82, 1998. <ftp://ftp.globus.org/pub/globus/papers/gram97.pdf>

20. DeveloperWorks Grid Computing, IBM, <http://ww136.ibm.com/developerworks/grid/>

21. Ferrari, A., Grimshaw, A., Holcomb, K. and Lindahl, G. Metasystems. Communications of the ACM, Vol. 41, No. 11, November, 1998.

22. Fitzgerald, S., Foster, I., Kesselman, C., Smith, W., Tuecke, S. and von Laszewski, G. A Directory Service for Configuring High-Performance Distributed Computations. Proc. of the 6th IEEE Symposium on High-Performance Distributed Computing, pp. 365-375, 1997. <ftp://ftp.globus.org/pub/globus/papers/hpdc97-mds.pdf>

23. Foster, I. and Kesselman, C. Globus: A Metacomputing Infrastructure Toolkit. International journal of Supercomputer Application, Vol. 11, No. 2, pp. 115-128, 1997. <ftp://ftp.globus.org/pub/globus/papers/globus.pdf>

24. Foster, I. and Kesselman, C. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco, CA, 677 pp, 1999. <http://www.globus.org/research/papers/chapter2.pdf>


25. Foster, I. The Grid: A New Infrastructure for 21st Century Science. Physics Today, Vol. 55, No. 2, pp. 42-47, February, 2002. <http://crystal.uta.edu/~levine/class/spring2003/grid/Foster_phyiscs_today.pdf>

26. Foster, I. What is the Grid? A Three Point Checklist. GRIDToday, Vol. 1, No. 6, July, 2002. <http://www-fp.mcs.anl.gov/~foster/Articles/WhatIsTheGrid.pdf>

27. Foster, I., Fredian, T., Greenwald, D., Keahey, K., McCune, D., Peng, Q., Schissel, D. and Thompson, M. Computational Grids in Action: The National Fusion Collaboratory. Future Generation Computer Systems, Vol. 18, No. 8, pp. 1005-1015, October, 2002. <http://www.globus.org/research/papers/fusion02.pdf>

28. Foster, I., Frey, J., Livny, M., Tannenbaum, T. and Tuecke, S. Condor-G: A Computation Management Agent for Multi-Institutional Grids. Proc. of the Tenth International Symposium on High Performance Distributed Computing (HPDC-10), IEEE Press, August, 2001. <http://www.cs.wisc.edu/condor/doc/condorg-hpdc10.pdf>

29. Foster, I., Kesselman, C. and Tuecke, S. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International Journal of High Performance Computing Applications, Vol. 15, No. 3, pp. 200-222, 2001. <http://www.globus.org/research/papers/anatomy.pdf>

30. Foster, I., Kesselman, C., Nick, J.M. and Tuecke, S. The Physiology of a Grid: An Open Grid Services Architecture for Distributed Systems Integration. DRAFT document, 2002. <http://www.globus.org/research/papers/ogsa.pdf>

31. Foster, I., Kesselman, C., Tsudik, G. and Tuecke, S. A Security Architecture for Computational Grids. Proc. of the 5th ACM Conference on Computer and Communications Security, pp. 83-92, 1998. <ftp://ftp.globus.org/pub/globus/papers/security.pdf>

32. Global Grid Forum, <http://www.gridforum.org/>

33. Globus Alliance, <http://www.globus.org/>

34. Globus Dynamically-Updated Request Online Coallocator (DUROC) v0.8, Globus Alliance. <http://www.globus.org/duroc/frames.html>

35. Globus Resource Specification Language (RSL) v1.0, Globus Alliance. <http://www.globus.org/gram/rsl_spec1.html>

36. Grimshaw, A. and Natrajan, A. Legion: Lessons Learned Building a Grid Operating System. IEEE Transactions on Parallel and Distributed Systems, to appear in 2006. <http://www.anandnatrajan.com/papers/TPDS06.doc>

37. Grimshaw, A. and Wulf, W. The Legion Vision of a Worldwide Virtual Computer. Communications of the ACM, Vol. 40, No. 1, pp. 39-45, January, 1997. <http://www.cs.virginia.edu/~legion/papers/cacm.ps>

38. He, H. What is Service-Oriented Architecture? O’Reilly webservices.xml.com, September 30th, 2003. <http://webservices.xml.com/pub/a/ws/2003/09/30/soa.html>

39. Howes, T. The Lightweight Directory Access Protocol. CITI Technical Report 95-8, July 27th, 1995. <http://www.kingsmountain.com/directory/doc/ldap/ldap.html>

40. HP Grid & Utility Computing, HP, <http://devresource.hp.com/drc/topics/utility_comp.jsp>

41. Humphrey, M. and Grimshaw, A. Grids: Harnessing Geographically-Separated Resources in a Multi-Organisational Context. Presented at the High Performance Computing Systems Conference, June, 2001. <http://www.cs.virginia.edu/~legion/papers/HPCS01.pdf>

42. IBM LoadLeveler: General Information. IBM, 2nd edition, 1993.

43. Kandagatla, C. Survey and Taxonomy of Grid Resource Management Systems. <http://www.cs.utexas.edu/users/browne/cs395f2003/projects/KandagatlaReport.pdf>

44. Katramatos, D. The Legion JobQueue. White paper, University of Virginia, May, 2000. <http://www.cs.virginia.edu/~legion/papers/jobQrep.ps>

45. Keahey, K., Lang, S., Liu, B., Meder, S. and Welch, V. Fine-Grain Authorization Policies in the GRID: Design and Implementation. 1st International Workshop on Middleware for Grid Computing, 2003. <http://www.globus.org/research/papers/mgc_final.pdf>

46. Kleinrock, L. MEMO on Grid Computing. University of California, Los Angeles, 1969. <http://www.lk.cs.ucla.edu/LK/Bib/REPORT/press.html>

47. Livny, M., Raman, R. and Solomon, M. Matchmaking: Distributed Resource Management for High Throughput Computing. Proc. of the Seventh IEEE International Symposium on High Performance Distributed Computing, Chicago, IL., July 28th-31st, 1998. <http://www.cs.wisc.edu/condor/doc/hpdc98.pdf>

48. Livny, M., Raman, R., and Solomon, M. Policy Driven Heterogeneous Resource Co-Allocation with Gangmatching. Proc. of the Twelfth IEEE International Symposium on High-Performance Distributed Computing, Seattle, WA., June 2003. <http://www.cs.wisc.edu/condor/doc/gangmatching-hpdc12.pdf>

49. Livny, M., Tannenbaum, T. and Thain, D. Condor and the Grid. in Fran Berman, Anthony J.G. Hey, Geoffrey Fox, editors, Grid Computing: Making The Global Infrastructure a Reality, John Wiley, 2003. <http://media.wiley.com/product_data/excerpt/90/04708531/0470853190.pdf>

50. Network Queuing System (NQS), University of Maryland, <http://umbc7.umbc.edu/nqs/nqsmain.html>

51. Nimrod: Tools for Distributed Parametric Modeling, Monash University Information Technology, <http://www.csse.monash.edu.au/~davida/nimrod/index.htm>

52. Open Grid Services Infrastructure V1.0: Primer. OGSI Working Group, March 12th, 2004. <https://forge.gridforum.org/projects/ogsi-wg/docman/>

53. Platform Computing's Load Sharing Facility, <http://www.platform.com/>

54. Platform LSF Scheduler, Platform Computing, <http://www.platform.com/products/LSF/>

55. Portable Batch System (PBS), NASA Advanced Supercomputing Division, <http://www.nas.nasa.gov/Groups/SciCon/Origins/Cluster/PBS/>

56. Public-Key Infrastructure (X.509), Internet Engineering Task Force, <http://www.ietf.org/html.charters/pkix-charter.html>

57. Raman, R. ClassAds Programming Tutorial (C++). 2000. <http://www.cs.wisc.edu/condor/classad/c++tut.html>

58. SETI@Home Project, <http://setiweb.ssl.berkeley.edu/>

59. Simple Object Access Protocol (SOAP) ver 1.1. W3C, Note 8, 2002.

60. SSL 3.0 Specification, Netscape Network, <http://wp.netscape.com/eng/ssl3/>

61. Sun Microsystems Grid Engine, Sun Microsystems, <http://wwws.sun.com/software/gridware/>
