


IST-2000-25182 PUBLIC 1 / 46

DataGrid

DEFINITION OF THE ARCHITECTURE, TECHNICAL PLAN AND EVALUATION CRITERIA FOR THE RESOURCE CO-ALLOCATION FRAMEWORK AND MECHANISMS FOR PARALLEL JOB PARTITIONING

WP1: Workload Management

Document identifier: DataGrid-01-D1.4-0127-1_0

Date: 13/09/2002

Work package: WP1: Workload Management

Partner(s): INFN

Lead Partner: INFN

Document status: APPROVED

Deliverable identifier: DataGrid-D1.4

Abstract: This document updates the architecture specification of the Workload Management System, in particular to describe how resource co-allocation and job partitioning are addressed in this framework.

The subjects of advance reservation (whose mechanisms are used for resource co-allocation), job checkpointing (in whose framework the job partitioning problem is addressed) and inter-job dependencies (necessary to “manage” partitionable jobs) are also discussed.


Delivery Slip

From: Massimo Sgaravatto (INFN)
Verified by: Francesco Prelz (INFN)
Approved by: PTB, 10/09/2002

Document Log

Issue Date Comment Author

0_0 03/06/2002 First draft F. Giacomini, A. Gianelle, R. Peluso, M. Sgaravatto

0_1 07/06/2002 F. Giacomini, M. Sgaravatto

0_2 13/06/2002 F. Giacomini, M. Sgaravatto

0_3 19/06/2002 M. Sgaravatto

0_4 28/06/2002 F. Giacomini, M. Sgaravatto

0_5 02/07/2002 F. Giacomini, M. Sgaravatto

0_6 12/07/2002 M. Sgaravatto

1_0 10/09/2002 Final PTB approval


Document Change Record

Issue Item Reason for Change

0_1 Addressed Mirek Ruda’s comments

0_2 Modified figure on new architecture; modified state machines; modified presentation page

0_3 Addressed Francesco Prelz’s comments

0_4 Addressed reviewers’ comments

0_5 Addressed D. Bosio, B. Jones and J. Montagnat’s comments

0_6 Addressed M. Parson’s comments

1_0 Final PTB approval

Files

Software Products User files

Microsoft Word 2000 DataGrid-01-D1.4-0127-1_0

Adobe Acrobat 5.0 DataGrid-01-D1.4-0127-1_0


CONTENT

1. INTRODUCTION
   1.1. OBJECTIVES OF THIS DOCUMENT
   1.2. APPLICABLE DOCUMENTS AND REFERENCE DOCUMENTS
   1.3. DOCUMENT AMENDMENT PROCEDURE
   1.4. TERMINOLOGY
   1.5. ACKNOWLEDGEMENTS
2. EXECUTIVE SUMMARY
3. REVIEW OF THE WORKLOAD MANAGEMENT SYSTEM ARCHITECTURE
   3.1. THE NEW WMS ARCHITECTURE
   3.2. SUPPORTING NEW FUNCTIONALITIES
        3.2.1. Supporting Job Partitioning
        3.2.2. Supporting Job Dependencies
4. RESOURCE RESERVATION
   4.1. EXTERNAL INTERFACE
        4.1.1. Language
        4.1.2. Application Programming Interface
   4.2. INTERNAL DESIGN
   4.3. DEPENDENCIES
5. CO-ALLOCATION
   5.1. EXTERNAL INTERFACE
        5.1.1. Language
        5.1.2. Application Programming Interface
   5.2. INTERNAL DESIGN
   5.3. DEPENDENCIES
   5.4. EVALUATION CRITERIA FOR RESOURCE CO-ALLOCATION
6. JOB PARTITIONING AND JOB CHECKPOINTING
   6.1. JOB PARTITIONING: INTRODUCTION TO THE PROBLEM
   6.2. JOB CHECKPOINTING
        6.2.1. Job state
        6.2.2. Job checkpointing scenario
        6.2.3. Job checkpointing Application Programming Interface
        6.2.4. Internal design for job checkpointing
   6.3. JOB PARTITIONING
        6.3.1. Job partitioning scenario
        6.3.2. Internal design for job partitioning
   6.4. EVALUATION CRITERIA FOR JOB PARTITIONING
7. INTER-JOB DEPENDENCIES
   7.1. INTRODUCTION
   7.2. EXTERNAL INTERFACE
        7.2.1. Language
        7.2.2. Application Programming Interface
        7.2.3. DAG State Machine
   7.3. RECOVERY
   7.4. DETAILED DESIGN
8. CONCLUSIONS


1. INTRODUCTION

In [R1] the architecture of the WP1 Workload Management System (WMS) was presented. This architecture is now reviewed and complemented, in particular:

• to increase the reliability, the efficiency and the flexibility of the system;

• to allow WP1 modules to be exploited and used also “outside” the WP1 Workload Management System, thus assuring interoperability with other Grid frameworks;

• to support new functionalities.

Resource co-allocation is one of the new functionalities that will be supported in the WMS: co-allocation means the concurrent allocation of multiple resources, which can be homogeneous or heterogeneous. The Workload Management System will provide a generic framework to support co-allocation, using techniques based on immediate or advance reservation of resources.

Job partitioning is another new functionality introduced in the revised Workload Management System framework. Job partitioning takes place when a job has to process a large set of “independent elements”, as often happens in applications such as those directly involved in the DataGrid project. In these cases it may be worthwhile to “decompose” the job into smaller sub-jobs (each one responsible for processing just a sub-set of the original large set of elements), in order to reduce the overall time needed to process all these elements through “trivial” parallelisation, and to optimize the usage of all available Grid resources.

Job partitioning will be supported in the context of logical job checkpointing, and will also make use of techniques used to support inter-job dependencies.

Section 3 of this document presents how the existing Workload Management System is reviewed and complemented. Section 4 discusses resource reservation, whose mechanisms are then exploited in the context of resource co-allocation, presented in section 5. Job partitioning and job checkpointing are the subjects of section 6, while section 7 discusses inter-job dependencies. Section 8 concludes the document.

1.1. OBJECTIVES OF THIS DOCUMENT

The goal of this document is to review and complement the first version of the WP1 architecture document ([R1]), in particular to describe how the new functionalities of resource co-allocation (addressed using techniques based on immediate or advance reservation of resources) and job partitioning (addressed in the context of logical job checkpointing, and which also requires solutions for the problem of inter-job dependencies) can be supported in the framework of the Workload Management System.


1.2. APPLICABLE DOCUMENTS AND REFERENCE DOCUMENTS

Reference documents:

[R1] DataGrid - Definition of Architecture, Technical Plan and Evaluation Criteria for Scheduling, Resource Management, Security and Job Description. http://www.infn.it/workload-grid/docs/DataGrid-01-D1.2-0112-0-3.pdf
[R2] Condor-G Home Page. http://www.cs.wisc.edu/condor/condorg
[R3] “Classified Advertisements” Home Page. http://www.cs.wisc.edu/condor/classad
[R4] I. Foster, C. Kesselman, C. Lee, R. Lindell, K. Nahrstedt, A. Roy, “A Distributed Resource Management Architecture that Supports Advance Reservations and Co-Allocation”, Intl Workshop on Quality of Service, 1999. http://www.globus.org/documentation/incoming/iwqos.pdf
[R5] A. Roy, V. Sander, “Advance Reservation API”, GGF Scheduling Working Group, Scheduling Working Document 9.4. http://www-unix.mcs.anl.gov/~schopf/ggf-sched/WD/schedwd.9.4.pdf
[R6] DataGrid - Job Description Language HowTo. http://www.infn.it/workload-grid/docs/DataGrid-01-TEN-0102-0_2.pdf
[R7] DataGrid - User Requirements and Specifications for the DataGrid Project. http://datagrid-wp8.web.cern.ch/DataGrid-WP8/Documents/Workspace/D8.1/D8.1.a_v2.1.pdf
[R8] DataGrid - Long Term Specifications of LHC Experiment. Part B. http://datagrid-wp8.web.cern.ch/DataGrid-WP8/Documents/Workspace/D8.1/D8.1.b_v2.0.pdf
[R9] DataGrid - Requirements for grid-aware biology applications. http://marianne.in2p3.fr/datagrid/wp10/documents/DataGrid-10-D10.1-0102-3-8.doc
[R10] DataGrid - Job partitioning and checkpointing. http://www.infn.it/workload-grid/docs/DataGrid-01-TED-0119-0_3.pdf
[R11] The Condor project. http://www.cs.wisc.edu/condor


[R12] DataGrid - Logging and Bookkeeping Service for the DataGrid. http://www.infn.it/workload-grid/docs/DataGrid-01-TEN-0109-1_0.pdf
[R13] DataGrid - JDL Attributes. http://www.infn.it/workload-grid/docs/DataGrid-01-NOT-0101-0_6-Note.pdf
[R14] DAGman. http://www.cs.wisc.edu/condor/dagman
[R15] Information and Monitoring (WP3) Architecture Report. http://edms.cern.ch/document/334453/

1.3. DOCUMENT AMENDMENT PROCEDURE

Since DataGrid has very fast-paced timescales, with very frequent software releases, it is difficult to follow a traditional design cycle.

The architecture described in this document therefore represents work in progress and has not been fully implemented yet. It may become necessary to revise this architecture in response to new or more focused user requirements, to the evaluation of the framework that will implement the specified architecture, to feedback received from users, or to evolutions in the available technology.

1.4. TERMINOLOGY

Glossary

API     Application Programming Interface
CA      Co-allocation Agent
CE      Computing Element
DAG     Directed Acyclic Graph
DAGMAN  DAG MANager
GARA    Globus Architecture for Reservation and Allocation
GGF     Global Grid Forum
JDL     Job Description Language
LB      Logging & Bookkeeping
MPI     Message Passing Interface
MRI     Magnetic Resonance Image
QoS     Quality of Service
RA      Reservation Agent
RB      Resource Broker
RC      Replica Catalog
RSL     Resource Specification Language
SE      Storage Element
UI      User Interface
WMS     Workload Management System

1.5. ACKNOWLEDGEMENTS

We would like to thank everyone in WP1 for their contributions. We also wish to thank Miron Livny for the very fruitful discussions. Special thanks to Diana Bosio, Bob Jones, Julian Linford, Maciej Malawski and Johan Montagnat for their useful comments.


2. EXECUTIVE SUMMARY

The initial WP1 Workload Management System software architecture, described in [R1] and implemented in the first phase of the project, is being reviewed and complemented. The objectives of the revised architecture, discussed in this document, are:

• to increase the reliability and the flexibility of the system;

• to address some of the shortcomings that emerged in the first DataGrid testbed;

• to simplify the whole system;

• to favor interoperability with other Grid frameworks, allowing the use of modules (e.g. the Resource Broker) also “outside” the WP1 Workload Management System;

• to make it easy to plug in new components implementing new functionalities.

Immediate or advance reservation of resources, which can be heterogeneous in type and implementation and independently controlled and administered, is one of the new functionalities that must be supported, in order to allow end-to-end quality of service (QoS) in emerging network-based applications.

The Workload Management System will provide a generic framework to support reservation of resources, based on concepts that have emerged and been widely discussed in the Global Grid Forum. The implementation is foreseen to address at least computing, network and storage resources, provided that adequate support exists in the local management systems.

The Reservation Agent is the core component of the resource reservation framework. Its main functionalities are to accept a generic reservation request from a user (specified via the usual Job Description Language, used also for job specification), to map it onto a specific resource matching the requirements and preferences specified by the user, to perform the allocation on that resource, and then to allow the user to use the granted reservation for his job.

The mechanisms designed and implemented for resource reservation will also be exploited for resource co-allocation, that is, the concurrent allocation of multiple, homogeneous or heterogeneous, resources. The process of performing a co-allocation, given a co-allocation request (specified using the Job Description Language), consists of three major steps:

1. discover resources compatible with the requirements and preferences included in all the resource descriptions;

2. find compatible combinations of resources that would satisfy the co-allocation request;

3. try each combination, taking into account optimisation criteria, until one succeeds or all fail.

The Co-allocation Agent is the component responsible for applying this 3-step procedure.
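As an illustration, the three steps above can be sketched as a single loop. This is not the actual Co-allocation Agent interface; all function names and data shapes here are hypothetical stand-ins.

```python
from itertools import product

def co_allocate(request, discover, try_reserve, rank=None):
    """Sketch of the 3-step co-allocation procedure (illustrative names).

    request:     list of resource descriptions, one per resource to co-allocate
    discover:    callable returning the candidates matching one description
    try_reserve: callable attempting to reserve a whole combination at once
    rank:        optional key function expressing optimisation criteria
    """
    # Step 1: discover resources compatible with each resource description.
    candidates = [discover(desc) for desc in request]

    # Step 2: form compatible combinations (one candidate per description).
    combinations = list(product(*candidates))

    # Optimisation criteria: try the most promising combinations first.
    if rank is not None:
        combinations.sort(key=rank)

    # Step 3: try each combination until one succeeds or all fail.
    for combo in combinations:
        if try_reserve(combo):
            return combo
    return None
```

The pluggable `rank` argument mirrors the document's "taking into account optimisation criteria" without fixing a policy.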


Another new functionality that must be supported is job partitioning. Job partitioning takes place when a job has to process a large set of “independent elements”. In these cases it may be worthwhile to “decompose” the job into smaller sub-jobs (each one responsible for processing just a sub-set of the original large set of elements), in order to reduce the overall time needed to process all these elements and to optimize the usage of all available Grid resources.

The proposed approach is to address the job partitioning problem in the context of job checkpointing, as described in section 6. Checkpointing a job during its execution means “saving” its state in some way, so that the job execution can be suspended and resumed later, starting from the point where it was previously stopped. Users are provided with a logical checkpointing service: through a proper API, a user can save, at any moment during the execution of a job, the state of this job; the job must also be instrumented so that it can be restarted from an intermediate (i.e. previously saved) state.

The checkpointing API can also be exploited in the context of job partitioning: if the processing of a job can be described as a set of independent steps/iterations, this characteristic can be exploited by considering different, simultaneous, independent sub-jobs, each one taking care of a step or of a sub-set of steps, which can be executed in parallel. The partial results (that is, the results of the various sub-jobs) can be represented by job states (the final job states of the various sub-jobs), which can then be merged together by a job aggregator, which must start its execution when the various sub-jobs have terminated. Hence job partitioning also requires mechanisms to address the problem of inter-job dependencies.
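The save-and-resume pattern underlying logical checkpointing can be illustrated with a minimal sketch. This is not the WP1 checkpointing API (which is defined in section 6.2.3) but a hypothetical stand-in: the job records its iteration position and partial results in a persistent store, so a restarted job skips the elements already processed.

```python
class JobState:
    """Hypothetical logical-checkpoint store (any dict-like backend)."""
    def __init__(self, store):
        self.store = store

    def save(self, **values):
        # "Saving" the job state: persist the current position and results.
        self.store.update(values)

    def get(self, key, default):
        return self.store.get(key, default)

def run_job(state, elements):
    """Instrumented job: resumes from the last saved intermediate state."""
    results = state.get("results", [])
    start = state.get("next", 0)          # 0 on a fresh run
    for i in range(start, len(elements)):
        results.append(elements[i] * 2)   # stand-in for real processing
        state.save(next=i + 1, results=results)
    return results
```

A job restarted with a pre-populated store reprocesses only the remaining elements, which is exactly the property that job partitioning later exploits per sub-job.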
Generalizing the problem, it is possible to define a whole web of dependencies on a set of program executions, building a Directed Acyclic Graph (DAG), whose nodes are program executions (jobs), and whose arcs represent dependencies between them. Within the Workload Management System, a DAG will be managed by a meta-scheduler, called DAGMan (DAG Manager), whose main purpose is to navigate the graph, determine which nodes are free of dependencies, and follow the execution of the corresponding jobs.
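The graph navigation performed by such a meta-scheduler can be sketched as follows. This is a simplified, sequential stand-in for DAGMan with hypothetical names: it repeatedly finds the nodes whose dependencies are all satisfied and runs the corresponding jobs.

```python
def run_dag(nodes, deps, execute):
    """Navigate a DAG of jobs: submit every node free of dependencies.

    nodes:   iterable of node names
    deps:    dict mapping a node to the set of nodes it depends on
    execute: callable running the job for one node
    """
    done, order = set(), []
    pending = set(nodes)
    while pending:
        # Nodes are ready when all of their parents have completed.
        ready = {n for n in pending if deps.get(n, set()) <= done}
        if not ready:
            raise ValueError("cycle or unsatisfiable dependency in DAG")
        for n in sorted(ready):   # deterministic order for this sketch
            execute(n)
            order.append(n)
            done.add(n)
        pending -= ready
    return order
```

In the job partitioning case the graph is simply "all sub-jobs, then the aggregator": the aggregator node depends on every sub-job node.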


3. REVIEW OF THE WORKLOAD MANAGEMENT SYSTEM ARCHITECTURE

The objectives that we want to achieve by reviewing the existing WP1 software architecture are:

• the simplification of the flow of control within the Workload Management System, minimizing the dependencies between the different components;

• the reduction of stateful invocations of the internal components. In particular the aim is to minimize the duplication of persistent information related to the same events, which is difficult to keep coherent;

• an increase in the reliability and the flexibility of the system;

• the support for new functionalities, which must be accommodated in the Workload Management System framework. The new functionalities include job partitioning, the possibility to reserve resources and to co-allocate them, and the management of dependencies between jobs. Other new functionalities have also been introduced, but their discussion lies outside the scope of this document: here we discuss only job partitioning and co-allocation (together with the other frameworks needed to implement these two services, that is resource reservation, job checkpointing and job dependencies);
• the possibility to exploit and use WP1 modules (i.e. the Resource Broker) also “outside” the WP1 Workload Management System. This is particularly important to guarantee interoperability with other Grid frameworks (e.g. the ones developed within the various US Grid projects).

3.1. THE NEW WMS ARCHITECTURE

The new Workload Management System architecture is represented in Figure 1.


Figure 1: UML diagram describing the new WMS architecture

The User Interface (UI) is the component that allows users to access the functionalities offered by the Workload Management System. Although there are several changes compared to the architecture described in [R1], the commands available at the user interface level are the same; this modification of the architecture therefore does not imply significant changes from the user point of view.

The Network Server is a generic network daemon, responsible for accepting incoming requests from the UI (e.g. job submission, job removal) which, if valid, are then passed to the Workload Manager. For this purpose the Network Server uses the Protocol component to check whether the incoming requests conform to the agreed protocol.

The Workload Manager is the core component of the Workload Management System. Given a valid request, it has to take the appropriate actions to satisfy it. To do so, it may need support from other components, which are specific to the different request types.


All these components that offer support to the Workload Manager provide a class whose interface is inherited from a Helper class, which consists of a single method (resolve()).

Essentially the Helper, given a JDL expression, returns a modified one, which represents the output of the required action. For example, if the request was to find a suitable resource for a job, the input JDL expression will be the one specified by the user, and the output will be the JDL expression augmented with the CE choice.
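For illustration, the Helper contract might look like the following sketch, where JDL expressions are modelled as plain dictionaries; everything except the resolve() method name is a hypothetical stand-in, not the WP1 code.

```python
class Helper:
    """Common interface: take a JDL-like expression, return a modified one."""
    def resolve(self, jdl):
        raise NotImplementedError

class MatchmakerHelper(Helper):
    """Illustrative helper: augments the expression with a CE choice."""
    def __init__(self, find_best_ce):
        self.find_best_ce = find_best_ce   # pluggable lookup function

    def resolve(self, jdl):
        augmented = dict(jdl)              # never mutate the user's request
        augmented["ce_id"] = self.find_best_ce(jdl["requirements"])
        return augmented
```

The single-method interface is what lets the Workload Manager treat matchmaking, job adaptation and similar actions uniformly, as a chain of JDL transformations.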

The Resource Broker is one of these classes offering support to the Workload Manager. It provides a matchmaking service: given a JDL expression (e.g. for a job submission, or for a resource reservation request), it finds the resources that best match the request. Actually the Resource Broker can be “decomposed” in three sub-modules:

• a sub-module responsible for performing the matchmaking, therefore returning all the resources suitable for that JDL expression;

• a sub-module responsible for performing the ranking of the matched resources, therefore returning just the “best” resource suitable for that JDL expression;

• a sub-module implementing the chosen scheduling strategy; this must be easily pluggable and replaceable with others implementing different scheduling strategies.
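The decomposition into these three sub-modules can be sketched as one small pipeline; names and data shapes below are illustrative only, with the scheduling strategy kept as a pluggable callable.

```python
def choose_resource(resources, requirements, rank_key, strategy=None):
    """Illustrative broker pipeline: matchmaking, ranking, strategy."""
    # 1. Matchmaking: all resources suitable for the request.
    matches = [r for r in resources if requirements(r)]
    # 2. Ranking: order the matched resources, best first.
    ranked = sorted(matches, key=rank_key, reverse=True)
    # 3. Scheduling strategy: pluggable choice among ranked resources;
    #    the default simply takes the top-ranked one.
    strategy = strategy or (lambda rs: rs[0] if rs else None)
    return strategy(ranked)
```

Swapping in a different `strategy` (e.g. random choice among the top few, to spread load) changes the scheduling policy without touching matchmaking or ranking, which is the point of the decomposition.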

Within this architecture, the Resource Broker is therefore re-cast as a module, implementing the Helper interface, which can be “plugged in” and used also in frameworks other than the WP1 Workload Management System.

The Job Adapter is responsible for putting the final “touches” to the JDL expression for a job, before it is passed to CondorG for the actual submission. Besides preparing the CondorG submission file, this module is also responsible for creating the wrapper script: as described in [R1], the user job is wrapped within this script, which is responsible for creating the appropriate execution environment on the CE worker node (this includes the transfer of the input and output sandboxes).

CondorG ([R2]) is the module responsible for performing the actual job management operations (job submission, job removal, etc.), issued on request of the Workload Manager. The CondorG framework is exploited for various reasons:

• the reliable two-phase commit protocol used by CondorG for job management operations;

• the persistency: CondorG keeps a persistent (crash-proof) queue of jobs. This queue will be used as a persistent database storing information, represented via Condor classads ([R3]), concerning active jobs;

• the logging system: CondorG logs all the significant events (e.g. job started its execution, job execution completed, etc.) concerning the managed jobs; this is useful to increase the reliability of the whole system;


• the increased openness of the CondorG framework;

• the need for interoperability with the US Grid projects, of which CondorG is an important component.

The Log Monitor is responsible for "watching" the CondorG log file, intercepting interesting events concerning active jobs, that is, events affecting the job state machine described in [R1] (e.g. job done, job cancelled, etc.), and triggering the appropriate actions.

The Reservation Agent and the Co-Allocation Agent are represented in the same block to simplify the figure. The Reservation Agent is the core component of the reservation framework. As explained in section 4, its main functionalities are to accept a reservation request from a user, map it into a reservation on a specific resource, perform the allocation on this resource, and allow the user to use the granted reservation. The Co-Allocation Agent, in the framework of resource co-allocation, is responsible for discovering resources compatible with the requirements and preferences included in a co-allocation request, finding compatible combinations of resources that would satisfy the co-allocation request, and trying each combination until one succeeds or all fail (see section 5).

The Logging and Bookkeeping (LB) service stores logging and bookkeeping information concerning events generated by the various components of the WMS. Using this information, the LB service keeps a state-machine view of each job. This service is essentially unchanged compared to the previous architecture discussed in [R1]; the only significant expected change is the planned use of the R-GMA framework ([R15]) to improve the efficiency of the service. The dependencies between this component and the other modules of the Workload Management System (the UI accessing the LB service to get status and logging information on jobs, and the various modules pushing events concerning jobs to the LB) are not represented in the figure, again for simplicity.

The other modules (Partitioner and DAGMan) are explained in the following section. In summary, besides introducing new components to support new functionalities (e.g. DAGMan, to support job dependencies, and the Reservation Agent, to support resource reservation), the core functionalities have been split between the Workload Manager (responsible for taking the appropriate actions given a request) and the Resource Broker (now a pluggable component, responsible only for finding the best CEs given a specific JDL expression). Moreover, the duplication of persistent information is avoided by relying more widely on the CondorG system.


3.2. SUPPORTING NEW FUNCTIONALITIES

This new proposed architecture, besides improving the efficiency of the whole system, also makes it easy to plug in new components implementing new functionalities. This is the case, for example, for job partitioning (discussed in section 6) and for job dependencies (the subject of section 7).

3.2.1. Supporting Job Partitioning

As will be discussed in section 6.3.2, in the context of job partitioning a JDL expression representing a partitionable job must be transformed into a JDL expression representing a Directed Acyclic Graph (DAG) of jobs. The component providing this functionality, the Partitioner, is an example of a Helper class, which can be called by the Workload Manager when it receives a request of type "partitionable job".

3.2.2. Supporting Job Dependencies

As will be discussed in section 7, in order to manage dependencies between jobs, a component called DAGMan (DAG Manager) is introduced. DAGMan can be seen as an iterator on a graph of jobs: when a job is free of dependencies, it can be submitted to CondorG for execution. Before passing a job belonging to the DAG to CondorG, it is first necessary to find an appropriate resource for its execution. This can be accomplished by calling the Resource Broker, in the same way as it is usually called by the Workload Manager for "single" jobs: the result of the call is the original JDL expression augmented with the resource choice. DAGMan then also has to call the Job Adapter, to make the final adjustments to the job description.

Figure 1 shows DAGMan in the overall design. The JDL expression representing the DAG, received from the UI, is first modified by the Job Adapter, which creates the Condor submit file, and the DAG is then submitted to CondorG. DAGMan, in turn, calls the Resource Broker for each job (node) composing the DAG, to bind the job to a resource.

Note that in the picture there is a DAGMan dependency on the Partitioner, since a DAG job could in turn be partitionable: in this case DAGMan has to call the Partitioner, which returns a DAG (which, like all other DAGs, is then submitted to CondorG, which then spawns another DAGMan). In other words, every time CondorG receives a job that is a DAG, it spawns a DAGMan process to manage that DAG. DAGMan, in turn, passes each job contained in the DAG to CondorG (after calling one or more Helpers).
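The "iterator on a graph of jobs" behaviour can be sketched as follows; the data structures are assumptions, and real DAGMan of course reacts to job-completion events rather than looping synchronously.

```python
# Toy sketch of the DAGMan iteration described above: a job becomes
# submittable only when all the jobs it depends on are done.

def ready_jobs(deps, done):
    """Jobs free of dependencies (all parents completed) and not yet done."""
    return sorted(j for j, parents in deps.items()
                  if j not in done and all(p in done for p in parents))

# DAG: A -> B, A -> C, (B, C) -> D
deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

order = []
done = set()
while len(done) < len(deps):
    batch = ready_jobs(deps, done)   # these would be passed to CondorG
    order.append(batch)              # (after calling the RB and Job Adapter)
    done.update(batch)
```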


4. RESOURCE RESERVATION

The realization of end-to-end quality of service guarantees in emerging network-based applications may require mechanisms that support advance or immediate reservation of resources, which can be heterogeneous in type and implementation, and independently controlled and administered. Employing reservation mechanisms can also help to reduce competition for resources and to implement higher-level abstractions, such as the concurrent allocation of multiple resources. Therefore, by reserving resources (immediately or in advance), it is possible to mediate among competing requests and to avoid oversubscription, and hence prevent degraded services and/or increased costs due to excess overprovisioning ([R4]).

The Workload Management System will provide a generic framework to support reservation of resources. Its implementation is foreseen to address at least computing, network and storage resources, provided that adequate support exists in the local management systems. The approach described here for reserving a resource and then using that reservation is based on concepts described in the Global Grid Forum (GGF) draft "Advanced Reservation API" ([R5]).

A resource reservation request is specified by the following attributes:

• Start time: the earliest time that the reservation may begin;

• Duration: how long the reservation lasts;

• Resource type: the type of the underlying resource, such as network, computation, storage;

• Reservation type: used in case a resource supports different types of reservation;

• Resource-specific parameters: parameters that are specific to the type of resource, such as bandwidth or maximum jitter for a network reservation, number of nodes for a reservation of a computing resource, etc.;

• End time: the latest time that the reservation can expire. If the difference between “end time” and “start time” exceeds the requested “duration”, any given time interval of the correct “duration” starting at or after “start time” and not ending past “end time” is acceptable.

Not all the attributes are mandatory: if not specified, "start time" defaults to "now" and "end time" defaults to "infinite". A reservation request can also contain additional information that could help the system find a better resource match. For example, if a request for a computing resource also specifies that the considered job will have to access a certain data set, the system (i.e. the Resource Broker) will try to find and reserve a resource able to provide the best access to the required data.
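The time-window rule stated above (any interval of the correct duration starting at or after "start time" and not ending past "end time" is acceptable) can be captured by a simple feasibility check; the function names are illustrative.

```python
# Sketch of the time-window rule for reservation requests. Times are
# integers in seconds since the epoch, as in the JDL attributes below.

def feasible(start, end, duration):
    """True if some interval of `duration` fits between start and end."""
    return end - start >= duration

def latest_start(start, end, duration):
    """Latest acceptable start so the interval does not end past `end`."""
    assert feasible(start, end, duration)
    return end - duration

# a 300 s reservation inside a 1344 s window is feasible, and may start
# as late as 300 s before the end time
window_ok = feasible(1021539656, 1021541000, 300)
last = latest_start(1021539656, 1021541000, 300)
```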


The process for acquiring a resource, given a reservation request, consists of two major steps:

1. Discover resources compatible with the requirements and preferences specified in the request; this phase implies querying, either directly or indirectly, the resources to know their current status and availability;

2. Perform the allocation on the "best-matched" resource.

In a Grid context, searching for resources, querying their status and allocating one of them cannot reasonably be performed in an atomic manner; hence this two-step allocation procedure is expected to fail (i.e. the process of finding and reserving a resource compatible with the requirements does not succeed) with a non-negligible rate. The system must therefore be designed and implemented to be resilient to such failures.

Once the reservation has been granted, it can be used by a user to perform the corresponding job: a computing reservation can be used for running applications, a storage reservation can be used to store files, a network reservation can be used to initiate data transfers, etc. The declaration of which reservations a job is entitled to use is expressed by the attribute UseReservation in the JDL expression of the job. UseReservation is a list of reservation identifiers:

UseReservation = { <reservation_id_1>, ..., <reservation_id_n> };

The information associated to a reservation will be used by the Workload Management System to act appropriately.

Using a previously granted reservation may require, for some resources and/or resource types, passing one or more parameters that were not known at reservation time and that are needed to use the reservation. This is the case, for example, of a network reservation manager that implements QoS by "marking" packets based on the port used by the application sending data; such information is usually available only once the application has started. Passing run-time information to the reservation manager is called binding. A binding can then be cancelled with an unbinding operation, after which the run-time information passed during the binding is no longer valid. In the network example above, this means that the router should no longer mark the packets originating from the source address (and port) specified during the bind. Depending on the resource and/or resource type, a reservation can:

• be bound-unbound several times;

• be bound multiple times without being unbound.
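The bind/unbind behaviour, including the two policy variants above, can be sketched as follows; the class and its fields are hypothetical, not part of any documented API.

```python
# Hypothetical sketch of binding run-time parameters (e.g. the sender's
# port for a network reservation) to a granted reservation.

class Reservation:
    def __init__(self, allow_multiple_bindings=True):
        self.allow_multiple = allow_multiple_bindings
        self.bindings = []                # active run-time parameter sets

    def bind(self, params):
        if self.bindings and not self.allow_multiple:
            raise RuntimeError("resource requires unbind before rebinding")
        self.bindings.append(params)

    def unbind(self, params):
        self.bindings.remove(params)      # parameters no longer valid

r = Reservation()
r.bind({"src_port": 5001})                # router starts marking this flow
r.bind({"src_port": 5002})                # bound again without unbinding
r.unbind({"src_port": 5001})              # stop marking the first flow
```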


Other useful operations on a reservation are:

• cancellation: a user can explicitly release a reservation when it is not needed anymore. Otherwise the reservation is released by the resource itself according to its own policy; for example, the reservation is released at expiration time or if not used within a certain timeout;

• monitoring: a reservation goes through a number of different states during its lifetime (resource discovery, resource allocation, reservation utilization). A user may want to be informed about changes of status, for example to know when the reservation can finally be used;

• modification: sometimes it is desirable to modify the parameters associated with an active reservation, as specified in the original request. If the new parameters are less demanding in terms of resource consumption, the modification request can usually be accepted.

Therefore the user, before submitting a job, can request to reserve a specific resource. The result of this reservation request is an identifier (see section 4.1.2). The user can then check the status of the reservation and, once it has been granted and is therefore ready to be used, the identifier can be specified along with the other attributes of the JDL expression needed to submit the considered job.

It is then up to the Workload Management System to transparently use this reservation. For example, if the reservation request was for a computing resource, the job will have to be submitted to the CE chosen at reservation time, and it will be necessary to interact with the corresponding Resource Manager to be able to use the granted reservation.
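The reserve-then-submit workflow just described can be sketched end to end, reusing the API names introduced in section 4.1.2; the in-memory implementations are illustrative placeholders only.

```python
# Sketch of the user workflow: create a reservation, wait until it is
# granted, then list it in the job's JDL via UseReservation.

import itertools

_ids = itertools.count(1)
_state = {}

def create_reservation(request):
    rid = f"res-{next(_ids)}"    # identifier assigned on the client side
    _state[rid] = "granted"      # a real system goes through discovery
    return rid                   # and allocation before reaching "granted"

def status_reservation(rid):
    return _state[rid]

rid = create_reservation({"ReservationResource": "computing"})
if status_reservation(rid) == "granted":
    # the granted reservation is declared in the job's JDL expression
    job_jdl = {"Executable": "sim.sh", "UseReservation": [rid]}
```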

4.1. EXTERNAL INTERFACE

The Reservation Agent (RA) is the core component of the resource reservation framework. The main functionalities of the RA are:

• to accept a generic reservation request, map it into a reservation on a specific resource, matching the requirements and preferences specified by the user, and perform the allocation on the specific resource;

• to allow the user to use a granted reservation for his job.

In a broad sense, the interface that the Reservation Agent presents to its clients is characterized by:

• the language used to express reservation requests;

• the Application Programming Interface (API) provided to interact with the Reservation Agent.


4.1.1. Language

The language used to specify a resource reservation request is the Job Description Language ([R6]) used for job submission.

As mentioned above, a reservation request is characterized by several attributes, which, in JDL, are mapped to the following:

• ReservationResource: to specify the resource type. The type is expressed as a string chosen in a predefined set, currently including “computing”, “network”, “storage”;

• ReservationType: to specify the reservation type. This is expressed as a string and is resource-dependent;

• ReservationStart: to specify the start time. The time is an integer value expressing the number of seconds since the epoch (1);

• ReservationEnd: to specify the end time, expressed as the number of seconds since the epoch;

• ReservationDuration: to specify the duration, described as the number of seconds the reservation should last.

• ReservationParameters: to specify resource-dependent parameters.

Besides the above attributes, the JDL expression should also contain a Rank and especially a Requirements expression, as for job submissions. In particular, the Requirements expression should require that the selected resource support reservation (boolean attribute SupportReservation): this requirement is automatically added by the Reservation Agent to JDL expressions specifying resource reservation requests. The JDL "Type" attribute identifies a JDL expression as a resource reservation request when it equals "Reservation". As an example, the following JDL expression represents a reservation request for three nodes for 300 seconds on a Computing Element running Linux on the i386 architecture:

[
Type = "Reservation";
ReservationResource = "computing";
ReservationStart = 1021539656;
ReservationEnd = 1021541000;
ReservationDuration = 300;
ReservationParameters = [ nodes = 3; ];
...
Requirements = other.Arch == "i386" && other.OpSys == "Linux" && other.SupportReservation;
]

(1) Corresponding to midnight of the 1st of January 1970 UTC.

4.1.2. Application Programming Interface

The Reservation Agent provides its clients with an Application Programming Interface designed along the guidelines specified in [R5]:

• create_reservation(): creates a reservation for the specified request;

• bind_reservation(): binds a reservation to run-time parameters;

• unbind_reservation(): unbinds a reservation;

• cancel_reservation(): cancels a reservation;

• modify_reservation(): modifies the parameters associated with a reservation;

• status_reservation(): returns the status of the resource reservation.

As previously introduced, a reservation is referenced via an identifier. This is a user-controlled handle that can be used to manipulate the reservation. The identifier is assigned and used like a job identifier (see [R1]): it is assigned on the client side at the moment the reservation request is created.

The above resource-independent API is implemented on top of a number of resource-dependent Reservation Agents. These support the same functionality as the generic Reservation Agent, but with the following differences:

• no resource discovery is done (this is a responsibility of the generic RA);

• for reasons of reliability and flexibility, the creation of a reservation is implemented as a two-phase commit. To support this, the resource-dependent API includes a commit_reservation() function.

Therefore the Reservation Agent performs all the operations that are resource-independent; for the rest it delegates to the resource-dependent Reservation Agents. So there will be a Network Reservation Agent


(possibly many, if different QoS techniques are possible for the network), a Computing Reservation Agent, and a Storage Reservation Agent.
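The split between the generic agent and the resource-dependent ones, with the two-phase creation (create plus commit_reservation()), can be sketched as follows; the class and its internal state are illustrative assumptions.

```python
# Sketch of a resource-dependent Reservation Agent: no discovery (that is
# the generic RA's job), and reservation creation as a two-phase commit.

class ComputingReservationAgent:
    def __init__(self):
        self.pending, self.committed = set(), set()

    def create_reservation(self, rid, params):
        self.pending.add(rid)         # phase 1: tentative allocation

    def commit_reservation(self, rid):
        self.pending.discard(rid)     # phase 2: make it definitive
        self.committed.add(rid)

    def cancel_reservation(self, rid):
        self.pending.discard(rid)     # an uncommitted reservation can
        self.committed.discard(rid)   # simply be dropped

agent = ComputingReservationAgent()
agent.create_reservation("res-1", {"nodes": 3})
agent.commit_reservation("res-1")
```

The two-phase shape is what later allows the co-allocator (section 5.2) to hold several tentative reservations at once and commit only the combination it finally chooses.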

4.2. INTERNAL DESIGN

The creation of a reservation is shown in Figure 2. The submission of the request (1) returns immediately, indicating whether the RA is willing to accept the request. If that is the case, the RA starts the discovery phase by contacting the RB, which returns a list of suitable resources, ordered by rank (2). The RB performs this task by querying the Information System, where the characteristics and the status of the various resources, including the schedule of reservations (provided by the Resource Managers), are published, and by performing the matchmaking with the JDL expression specifying the reservation request. The RA then tries to reserve a resource: it iterates through the list of suitable resources, contacting the corresponding Resource Managers, until it succeeds (3).

Figure 2: Creation of a reservation
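Step (3) above, iterating through the ranked list until one Resource Manager accepts, can be sketched as follows; the function and the acceptance table are illustrative only.

```python
# Sketch of the RA's reservation loop: walk the RB's ranked list and
# contact each Resource Manager in turn until one accepts.

def try_reserve(ranked_resources, reserve):
    """Return the first resource whose Resource Manager accepts, else None."""
    for resource in ranked_resources:
        if reserve(resource):
            return resource
    return None

ranked = ["ce-best", "ce-second", "ce-third"]     # output of the RB, by rank
accepts = {"ce-best": False, "ce-second": True, "ce-third": True}
chosen = try_reserve(ranked, lambda r: accepts[r])
```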


Note also that, when available, Grid accounting mechanisms will be exploited to avoid abuses, for example a malicious user reserving in advance a large amount of resources and degrading the overall performance of the Grid. Where possible, the Reservation Agent will be implemented on top of the Globus Architecture for Reservation and Allocation, GARA ([R4]), in particular to implement the interaction with the local resource managers, both for resource reservation requests and for using the granted reservations. GARA in fact provides mechanisms for QoS reservations for different types of resources, including computers, networks, and disks.

4.3. DEPENDENCIES

The reservation framework depends on services provided by other components:

• the Resource Broker finds resources that match the requirements and preferences of a resource reservation request;

• local Resource Managers must support reservation requests; they should also publish their capability in the Information System (attribute SupportReservation) and possibly also the current schedule of reservations;

• the Logging and Bookkeeping Service receives the various events concerning resource reservation and keeps a reservation state machine;

• the overall Workload Management System should provide support for allowing jobs to use granted reservations.


5. CO-ALLOCATION

By co-allocation we mean the concurrent allocation of multiple resources. These resources can be homogeneous, as for example in the concurrent allocation of multiple nodes on two different Computing Elements, needed to run a distributed parallel job, or they can be heterogeneous. An example of co-allocation of heterogeneous resources is the concurrent allocation of space on a storage element, a node on a computing element and some bandwidth on the network link between the computing and storage elements, to run a job that writes a large amount of data at a high rate.

Co-allocating resources, especially in a distributed, non-centrally-managed environment, is usually a difficult problem, for various reasons: resources can be of widely varying types, can be located in different administrative domains, and can be subject to different control policies and mechanisms ([R4]). The Workload Management System will provide a generic framework to support co-allocation of resources, using techniques based on immediate or advance reservation of resources.

When a co-allocation is requested, it is necessary to specify the time frame and the list of the required resources. As in the case of a resource reservation (see section 4), the attributes concerning the time frame are:

• Start time: the earliest time that the co-allocation may begin;

• Duration: how long the co-allocation lasts;

• End time: the latest time that the co-allocation can expire.

Each resource involved in the co-allocation is described by the following attributes:

• Resource type: the type of the underlying resource, such as network, computation or storage;

• Reservation type: used in case a resource supports multiple types of reservation (e.g. for the network a guaranteed bandwidth, or a guaranteed latency, or a limit on the jitter, etc.);

• Resource-specific parameters: parameters that are specific to the type of resource;

• A "critical" flag: if set to “true”, the whole co-allocation fails if the reservation of this particular resource fails.

The process for performing a co-allocation, given a co-allocation request, consists of three major steps:

1. discover resources compatible with the requirements and preferences included in all the resource descriptions; this phase implies querying, either directly or indirectly, the resources to know their current status and availability;


2. find compatible combinations of resources that would satisfy the co-allocation request. A combination is a sequence of specific resources that, if reserved successfully, would make the whole co-allocation successful;

3. try each combination, taking into account optimisation criteria, until one succeeds or all fail. Within each combination the resources are tried sequentially.
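Steps 2 and 3 above can be sketched as follows, including the rollback of a partially reserved combination; the functions and the acceptance table are illustrative assumptions, not the actual Co-allocation Agent code.

```python
# Sketch of the co-allocation trial loop: combinations are tried until one
# succeeds or all fail; within a combination, resources are reserved
# sequentially, and a partial failure releases what was already granted.

def try_combination(combo, reserve, cancel):
    granted = []
    for res in combo:                 # resources tried sequentially
        if reserve(res):
            granted.append(res)
        else:
            for g in granted:         # roll back the partial co-allocation
                cancel(g)
            return None
    return granted

def co_allocate(combinations, reserve, cancel=lambda r: None):
    for combo in combinations:        # until one succeeds or all fail
        granted = try_combination(combo, reserve, cancel)
        if granted is not None:
            return granted
    return None

ok = {"ce1": True, "se1": False, "ce2": True, "se2": True}
result = co_allocate([["ce1", "se1"], ["ce2", "se2"]], lambda r: ok[r])
```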

Once the co-allocation is granted, it can be used to perform the considered job; in this respect the co-allocation becomes simply a set of reservations, and it is the responsibility of the user to correctly manage those reservations. Nevertheless, the co-allocation entity can still be used as a handle for the whole set (for example for operations such as cancellation, monitoring and modification, the latter however limited to the time frame).

5.1. EXTERNAL INTERFACE

The Co-allocation Agent (CA) is responsible for accepting a co-allocation request from a user and applying the three-step procedure described above: discovering resources, identifying combinations of resources compatible with the request, and selecting an acceptable one.

The Co-allocation Agent is not responsible for the subsequent use of the reservations representing the result of the co-allocation.

In a broad sense the interface presented by the Co-allocation Agent to its clients is described by:

• the language used to express co-allocation requests;

• the API to interact with the Co-allocation Agent.

5.1.1. Language

The language used to specify a co-allocation request is, as for reservation, the Job Description Language ([R6]). As mentioned above, a co-allocation request contains information concerning the time frame for the allocation, and the description of the involved resources. The time frame is described by the following attributes, analogous to the ones used in the context of resource reservation (see section 4.1.1):

• ReservationStart: for the start time;

• ReservationEnd: for the end time;

• ReservationDuration: for the duration.


The description of a resource contains the following attributes (for a more detailed description see section 4.1.1):

• ReservationResource: for the resource type;

• ReservationType: for the reservation type;

• ReservationParameters: for the resource-dependent parameters.

Both at co-allocation and reservation level, the JDL expression should contain a Rank and especially a Requirements expression. In particular the Requirements expression for a reservation must require that the considered resource support reservation.

The "Type" attribute is used to identify a JDL expression as specifying a co-allocation, when it equals "coallocation".

As an example, the following JDL expression specifies a co-allocation request for a computing node, 100 GB of storage in a storage element "speaking" a certain protocol (gridftp), and a connection of 10 MB/s between the considered computing element and storage element. Note also that the data set used as input by the job that will use this co-allocation has also been specified (attribute "InputData"), along with the identifier of the Replica Catalog (RC) this data set refers to, so that the system will try to find the best CE-SE match with respect to the I/O access to these data:

[
Type = "coallocation";
ReservationStart = 1022248228;
ReservationEnd = 1022255428;
ReservationDuration = 3600;
Res1 = [
  Type = "Reservation";
  ReservationResource = "Computing";
  ReservationParameters = [ nodes = 3; ];
  Requirements = other.Arch == "i386" && other.OpSys == "Linux" && other.SupportReservation;
  InputData = "LF:testbed0-00019";
  ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=INFN Test RC,dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
];
Res2 = [
  Type = "Reservation";
  ReservationResource = "Storage";
  ReservationParameters = [ space = 100000; ];
  Requirements = other.SupportReservation && other.FreeSpace > ReservationParameters.space && other.Protocol == "gridftp";
];
Res3 = [
  Type = "Reservation";
  ReservationResource = "Network";
  ReservationParameters = [ Bandwidth = 10000; EndPoints = { Res1.CeId, Res2.SEId }; ];
  Requirements = other.SupportReservation;
];
]

5.1.2. Application Programming Interface

The Co-allocation Agent provides its clients with an Application Programming Interface (API) that allows management of co-allocations:

• create_allocation(): creates a co-allocation, applying the three-step procedure previously described; the result is a co-allocation represented by a set of reservations;

• cancel_allocation(): cancels a co-allocation by cancelling all the reservations belonging to the specified co-allocation;

• modify_allocation(): modifies the allocation time frame parameters;

• status_allocation(): returns the status of the co-allocation, in terms of the status of the associated reservations.


The individual reservations can also be directly manipulated through the reservation system API described in section 4.1.2.

A co-allocation is addressable with an identifier. This handle is used whenever the owner wants to operate on the co-allocation. The identifier, assigned on the client side at the moment the co-allocation request is generated, is used like a job identifier (see [R1]). The single reservations that are part of the co-allocation are identified in the same way as simple reservations are.

5.2. INTERNAL DESIGN

It is felt that resource co-allocation is best implemented on top of the resource reservation mechanisms. As described in section 4, a Reservation Agent performs two actions: resource discovery and resource reservation, the latter through resource-specific Reservation Agents. In the context of co-allocation, resource discovery must be implemented by the Co-allocation Agent itself: this task cannot be delegated to the Reservation Agent, because it requires a more global view of the available resources. The resource reservation phase can then be based directly on the resource-specific Reservation Agents, whose interface is required to be public.

To optimise the co-allocation process described in section 5 (in particular step 3 of the described procedure), especially if the failure rate (i.e. the rate of failures to find suitable co-allocations) turns out to be excessively high, the agent may try several combinations of resources in parallel. If the reservation process is implemented as a two-phase commit, the co-allocator can try several possibilities in parallel without committing; it can then choose the best co-allocation, commit the corresponding reservations and forget about the others. If such a feature is not available (i.e. create_reservation() really allocates the resource), the co-allocator should cancel the unneeded reservations.
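A minimal sketch of this try-several-then-commit-the-best approach follows. The trial and scoring logic here are invented placeholders; a real agent would issue uncommitted create_reservation() calls in the first phase.

```python
def try_combination(combination):
    """Pretend to reserve every resource in the combination (phase 1).
    Returns a (score, reservations) pair, or None on failure. The
    'bad' prefix and the scoring rule are purely illustrative."""
    if any(r.startswith("bad") for r in combination):
        return None
    # e.g. prefer combinations involving fewer resources
    return (1.0 / len(combination), list(combination))


def coallocate(candidates):
    # Phase 1: try every candidate combination without committing.
    trials = [t for t in (try_combination(c) for c in candidates) if t]
    if not trials:
        return None
    # Choose the best successful trial...
    best = max(trials, key=lambda t: t[0])
    for score, reservations in trials:
        if reservations is not best[1]:
            pass  # phase-2 abort, or cancel_reservation() if already allocated
    # ...and commit (phase 2) only its reservations.
    return best[1]


print(coallocate([["ce1", "se1"], ["bad-ce", "se2"], ["ce2", "se2", "net1"]]))
```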

5.3. DEPENDENCIES

The co-allocation framework depends on services provided by other components:

• the Resource Broker finds resources that match the requirements and preferences of a co-allocation request;

• resource-specific Reservation Agents provide the primitives for managing resource reservations;

• the Logging and Bookkeeping Service records the transitions of the state of the co-allocation.

5.4. EVALUATION CRITERIA FOR RESOURCE CO-ALLOCATION

The procedure to acquire a co-allocation described above requires various decisions to be taken, such as:

• Which preferences and requirements, besides those expressed by the user, have to be taken into account during the resource discovery phase (e.g. the availability of data, the overall status of the Grid, etc.);

• How to decide if it is worth “trying” a particular set of compatible resources. For example, given a computing-storage co-allocation and a pair <CE, SE> satisfying the requirements for this co-allocation, the agent may decide that the available bandwidth between the CE and SE is not suitable for the application that will use the co-allocation, and therefore can decide to ignore that combination.

• Given a compatible combination of resources: in which order should each reservation be tried, and how should some parameters be bound given the result of previous reservations in the same combination? For example, if the co-allocation requires some QoS on two Computing Elements and on the network link between them, the agent can decide to reserve the CEs first and then the network between them. In this case, when trying the reservation on the network link, the two endpoints are already bound. On the other hand, if the agent decides to reserve a suitable network link first, then the endpoint parameters for the two subsequent computing reservations are fixed.

• How to decide, in a set of successful co-allocations, which one to retain. Multiple co-allocations may be successful for the same request. In this case the agent has to decide which one to retain, releasing (or not committing) the others.

The agent design shall allow different strategies to be adopted, possibly at the same time.
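Such a pluggable-strategy design could look like the following sketch, where each decision point is a separate replaceable function. All names, and the mock bandwidth/start fields, are hypothetical.

```python
def bandwidth_filter(combinations, min_bw=100):
    # Strategy for "is it worth trying this combination?": skip
    # <CE, SE> pairs whose (mocked) bandwidth is too low.
    return [c for c in combinations if c["bandwidth"] >= min_bw]


def earliest_start(successes):
    # Strategy for "which successful co-allocation to retain":
    # keep the one that can start earliest.
    return min(successes, key=lambda c: c["start"])


combos = [
    {"bandwidth": 50, "start": 10},
    {"bandwidth": 200, "start": 30},
    {"bandwidth": 150, "start": 20},
]
viable = bandwidth_filter(combos)   # drops the 50-bandwidth pair
print(earliest_start(viable))
```

Because each decision is a standalone function, different strategies can be swapped in, or combined, without touching the rest of the agent.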

The behaviour of the co-allocation agent will be evaluated considering various parameters, including the following:

• the rate of successful co-allocation requests versus the total number of requests

• the average time needed to process a successful co-allocation request

• the average time needed to process an unsuccessful co-allocation request

• the number of combinations attempted before finding a successful co-allocation.

Much work still needs to be done in this area and a detailed plan is therefore still lacking.

6. JOB PARTITIONING AND JOB CHECKPOINTING

6.1. JOB PARTITIONING: INTRODUCTION TO THE PROBLEM

Job partitioning takes place when a job has to process a large set of “independent elements”. These could be, for example, HEP events ([R7], [R8]), isochromats in parallel isochromat simulation ([R9]), images of slices for some MRI applications ([R9]), etc. In these cases it may be worthwhile to decompose the job into smaller sub-jobs (each one responsible for processing just a sub-set of the original large set of elements), in order to reduce the overall time needed to process all these elements and to optimise the usage of all available Grid resources.

The peculiar feature of the problem, given the nature of the jobs that we have to consider, is that the various elements are independent, and therefore the sub-jobs responsible for processing them do not need to communicate with each other during their execution. The various sub-jobs, which can run simultaneously on different computing resources, all run the same application. Users may also require that the output data produced by the various sub-jobs be “recombined”, through another job, into a single stream, which produces the actual output that must be returned to the user.

Trying to address job partitioning in a general common framework, instead of finding different customised solutions for different applications and environments, we can say that a job can be partitioned into sub-jobs if it can be described as being composed of many independent steps. For example, a step can be the analysis of a specific sub-set of events, the processing of an isochromat, the Monte Carlo generation of a specific set of HEP events (just a sub-set of the whole set of events that must be produced in that Monte Carlo simulation), etc.

We propose here (see section 6.3) to address the job partitioning problem in the context of checkpointable jobs, discussed in section 6.2.
A comprehensive description of the problem and of the proposed solutions is presented in [R10], where the checkpointing-partitioning framework is also described from a user's point of view (e.g. the relevant JDL attributes). We realise that some of the plans reported in this document concerning job partitioning and checkpointing are at present speculative and high-risk. Only later will it be possible to understand whether the proposed architecture design needs to be revised, in particular in the light of the first evaluation of the framework that will implement the specified architecture.

6.2. JOB CHECKPOINTING

Checkpointing a job during its execution means saving its state in some way, so that the job execution can be suspended and resumed later, starting from the same point where it was previously stopped, without the need to restart the job from its beginning, and therefore without losing the computation done until that moment. Checkpointing a job from time to time can prevent data loss due to unexpected problems (e.g. failures such as machine crashes). This increases reliability, in particular for long-running jobs, since with checkpointing the number of jobs failed for external reasons will decrease.

We do not want to address here the classic checkpointing problem, that is saving somewhere all the information related to a process (the process's data and stack segments, information about open files, pending signals, CPU state, etc.), as is done in other projects (e.g. Condor [R11]). Instead, the

idea is to provide users with a “trivial”, or logical, checkpointing service: through a proper API, a user can save the state of a job at any moment during its execution. The hypothesis is, of course, that the job can be restarted later from an “intermediate” (i.e. previously saved) state.

6.2.1. Job state

As described in section 6.2, the idea is that users can insert in the code of their applications some specific function calls (see section 6.2.3) to save, from time to time, the state of their jobs. Many applications, and in particular most DataGrid applications, can be seen as being “composed” of a set of sequential steps/iterations. For example, a step can represent the processing of an MRI image, the analysis of an HEP event, the processing of a file, etc. In these cases it may be worthwhile to save the state of the job after each of these steps. A checkpointable application must, of course, be able to restart itself from a previously saved state.

In the “trivial” checkpointing service that we are going to provide, a state is defined by the user, and it is represented by a list of <var, value> pairs. It is up to the user to define which <var, value> pairs must be saved as the state of the job: he is the only one who knows which are the right ones. They must represent exactly what the job has done until that moment, and they must be chosen in such a way that, relying on them, the job can later restart its processing from this intermediate state.

6.2.2. Job checkpointing scenario

In the code of a checkpointable job, the user has to define what a state is, that is, he has to specify which <var, value> pairs are needed to represent exactly what the job has done until that moment and are “enough” to restart the processing later from an intermediate state, and he has to include calls to persistently save, from time to time, the state of the job in terms of such pairs. As mentioned in section 6.2.1, many applications can be seen as composed of a set of sequential steps. The various steps can be represented by a set of iterations (the initial set of iterations can be specified at submission time using a specific attribute in the JDL expression), and the iteration through the various steps can be done using a specific function. Normally a job would save its state after each step.

The checkpointing framework is useful when a job ends in an “abnormal” way, that is, when it exits before completing because of an “external” problem (e.g. a machine crash). If and only if it is clear that the problem was “external” to the job, the job must be automatically rescheduled (possibly on a CE different from the one where the problem happened, satisfying the same requirements for that job) and resubmitted. If a state for that job was saved in its previous execution, the job does not need to start from the beginning, but can start from the “point” corresponding to the last saved state. This means that, first of all, the last saved state is “loaded”. Then the user's code, using this “retrieved” state, must be able to start from the point corresponding to this intermediate state (if the “set of iterations” mechanism is used, it is just necessary to “jump” to the next iteration).
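The scenario above can be sketched as follows. The LB server is mocked as a dictionary and the external failure is simulated, so every name here is illustrative rather than part of the real framework.

```python
lb_server = {}   # mock Logging and Bookkeeping storage
crashed = set()  # jobs whose simulated "machine crash" already happened


def run_job(job_id, steps):
    # Load the last saved state, or start empty on first execution.
    state = lb_server.get(job_id, {"done": [], "partial_sum": 0})
    for step in steps:
        if step in state["done"]:
            continue  # already processed before the crash: skip
        state["partial_sum"] += step            # the "useful work"
        state["done"].append(step)
        # save_state(): persist a copy of the state after each step
        lb_server[job_id] = {"done": list(state["done"]),
                             "partial_sum": state["partial_sum"]}
        if len(state["done"]) == 2 and job_id not in crashed:
            crashed.add(job_id)
            raise RuntimeError("machine crash")  # external failure
    return state["partial_sum"]


try:
    run_job("job-1", [1, 2, 3, 4])
except RuntimeError:
    pass  # middleware detects the external failure and resubmits
print(run_job("job-1", [1, 2, 3, 4]))  # resumes after the crash; prints 10
```

On resubmission the job reloads the last checkpoint, skips the two steps already done, and completes without repeating any work.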

Unfortunately it is not always straightforward for the Grid middleware to “automatically” understand when a job has ended in an “abnormal” way. In some cases the Grid middleware can report that the job has completed normally when in fact it has not. If this is the case (i.e. some problems occurred), the user can retrieve an intermediate state for this job (usually the last saved one) and resubmit the job, specifying that it must start using this intermediate state. We allow retrieving2 (and then using) a state other than the last saved one since, for example, the user could find that the problem in his job was due to some errors in the code: the last saved state cannot be used, since it is affected by the buggy code, while a previously saved state is still correct and can therefore be used, after the wrong (buggy) executable has been replaced with one where the bugs have been fixed.

6.2.3. Job checkpointing Application Programming Interface

In the proposed checkpointing framework, the state of a job can be described through the use of an abstract object:

Object State {
    // Data Members
    Label_t state_id = "label";
    VarValueSet var_value_pairs[] = { "var1"="value1", "var2"="value2", ... };
    StepsSet main_stepper = { "element1", "element2", "element3", ... };
    Label_t current_step;

    // Methods
    int save_value(Pair);
    int save_state();
    <type> get_value(string);
    State load_state(Label_t);
    Label_t get_next_step();
    int get_state_size();
    int get_max_state_size();
}

2 By using the dg-get-job-chkpt command, as explained in [R10].

Each state has a name that identifies it unambiguously: the first data member (state_id) represents this identifier. The var_value_pairs data member represents the set of <var, value> pairs, that is, the variables with their values, used to describe the status of the job. main_stepper defines the list of steps/iterations that must be considered to process this job (to be used for those jobs which can be represented as a set of steps). For instance, a list of input file names, event numbers or OIDs can be specified. main_stepper can otherwise be viewed as a list of labels defining some points inside the code of the job: when a job starts from a given state, it can “jump” to the right label and start its computation from that point.

The last data member (current_step) of the object represents an element of the main_stepper set; it represents the current step (i.e. the step that is considered by the job at that moment).

The methods for this State object are:

int save_value(Pair pair) This function saves the pair pair (a pair of strings defining, respectively, the name and the value of the variable that must be saved in the job state) in the var_value_pairs data member.

int save_state() This function saves persistently the current State object.

<type> get_value(string var_name) This function returns the value of the variable named var_name stored in the var_value_pairs set.

State load_state(Label_t stateid) This function “retrieves” a state for a job, whose identifier is the one specified as the argument.

Label_t get_next_step() This function returns the next element of the main_stepper set (i.e. the next step that must be considered by the job), or “NULL” if the job is in its last step.

int get_state_size() This method returns the size of the current State object. This is used (with the get_max_state_size method) to be sure that the state can be persistently saved.

int get_max_state_size() This function returns the maximum size of a state that can be persistently saved.
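To make the API concrete, here is a Python transliteration of a typical checkpointable job. The State class below is a toy stand-in (the 1024-entry limit and the in-memory “persistence” are assumptions): a real job would link against the WMS checkpointing library.

```python
class State:
    MAX_STATE_SIZE = 1024  # assumed LB-server limit, in entries

    def __init__(self, steps):
        self.var_value_pairs = {}      # the <var, value> pairs
        self.main_stepper = list(steps)
        self._next = 0
        self.saved = []                # stand-in for persistent LB storage

    def save_value(self, name, value):
        self.var_value_pairs[name] = value
        return 0

    def save_state(self):
        if self.get_state_size() > self.get_max_state_size():
            return -1                  # state too large for the LB server
        self.saved.append(dict(self.var_value_pairs))
        return 0

    def get_value(self, name):
        return self.var_value_pairs[name]

    def get_next_step(self):
        if self._next >= len(self.main_stepper):
            return None                # "NULL": the job is past its last step
        step = self.main_stepper[self._next]
        self._next += 1
        return step

    def get_state_size(self):
        return len(self.var_value_pairs)

    def get_max_state_size(self):
        return self.MAX_STATE_SIZE


# Typical loop of a checkpointable job:
state = State(["evt-1", "evt-2", "evt-3"])
total = 0
while (step := state.get_next_step()) is not None:
    total += 1                        # process the step
    state.save_value("processed", total)
    assert state.save_state() == 0    # checkpoint after each step
print(total)
```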

6.2.4. Internal design for job checkpointing

The main functionality that must be provided is therefore the possibility to persistently save the state of a job, and to retrieve a previously saved state. For this purpose, relying on the existing Workload Management System architecture described in [R1], the LB service can be used: among the various attributes (described in [R12]) associated with each job, its job states can also be stored there. Therefore, when the function save_state is called from a user's job, the various <var, value> pairs corresponding to the State object data members are saved in the LB server.

Problems could occur if a large state has to be saved (e.g. the available space in the LB server can be limited): in case of failure, the save_state method will return a proper error. Users can check, before issuing the save_state call, the size of the current state (using the get_state_size method) and the maximum state size that can be saved in the LB server (using the get_max_state_size method). If a large state must be considered, the corresponding information could be saved in an SE, and then just the corresponding file name can be specified in the <var, value> pairs defining the state.

An initial state is also automatically saved in the LB server when a job has just been submitted to the Grid: this state will be empty, apart from the main_stepper attribute (i.e. the set of iterations) whose initial value can be specified in the JDL expression.

As described in section 6.2.2, multiple states for the same job can be kept in the LB server: for each job its last n states are persistently stored (where n is a configurable parameter of the LB server) in a cyclic way (when a new state must be saved and there are no free slots for that job, the oldest state for that job is overwritten).

When a State object is instantiated in a user's job, the last saved state for this job is automatically retrieved from the LB server, so the user's code can use the information stored in this state and restart the processing from the point corresponding to this intermediate state. If, on the contrary, this is the first execution of the job, the retrieved state will be empty (or it will include the main_stepper attribute only, as specified in the JDL expression). A job state retrieval fails if the number of retrievals of the same state for that job has reached a certain (configurable) threshold: this is done in order to avoid infinite resubmissions. As mentioned in section 6.2.2, job state retrieval from the LB server also happens when a user requests the last saved state (or a previous one) for a job. This retrieved state can then be specified as the “starting” state for another job (and therefore it will simply be stored in the LB server as the initial state of the new job).
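The cyclic keep-the-last-n behaviour described above maps naturally onto a bounded queue; a sketch follows, where the StateStore name and layout are invented for illustration.

```python
from collections import deque


class StateStore:
    """Mock of the LB server's per-job state storage: keeps only the
    last n states per job, overwriting the oldest when full."""

    def __init__(self, n):
        self.n = n
        self.states = {}  # job id -> deque of saved states

    def save(self, job_id, state):
        # deque(maxlen=n) silently drops the oldest entry when full,
        # which is exactly the cyclic overwrite described in the text.
        self.states.setdefault(job_id, deque(maxlen=self.n)).append(state)

    def last(self, job_id):
        return self.states[job_id][-1]


store = StateStore(n=3)
for i in range(5):
    store.save("job-1", {"step": i})
print([s["step"] for s in store.states["job-1"]])  # the oldest two are gone
```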

6.3. JOB PARTITIONING

The API introduced in section 6.2.3 for the checkpointing problem can also be exploited in the context of job partitioning. As we have already mentioned, the idea is that the processing of a job should be described as a set of independent steps/iterations. The goal is to exploit this characteristic by considering different, simultaneous, independent sub-jobs, each one taking care of a step or of a sub-set of steps, which can be executed in parallel. The partial results (that is, the results of the various sub-jobs) can be represented by job states (the final job states of the various sub-jobs), as defined in section 6.2.1, which can then be “merged” together by a job aggregator. It must be stressed again that the hypothesis is, of course, that the various steps are independent, and therefore can be processed in parallel by different independent jobs.

6.3.1. Job partitioning scenario

When a user submits a partitionable job to the Grid, he must specify, using a specific JDL attribute, the independent steps/iterations “composing” the job. Besides specifying, in the JDL expression, the characteristics and the requirements of the partitionable job, the user can also specify the “job aggregator” (that is, the job which is responsible for collating and merging together the results of the various sub-jobs into which the original job is partitioned) and a “pre-job”, that is, a job which must be executed before the various sub-jobs:

JobType = Partitionable;
Executable = ...;
JobSteps = ...;
StepWeight = ...;
Requirements = ...;
...
Prejob = [
    Executable = ...;
    Requirements = ...;
    ...
];
Aggregator = [
    Executable = ...;
    Requirements = ...;
    ...
];

JobType = Partitionable specifies that the job is a partitionable one. JobSteps is used to list the independent steps, while StepWeight is used to specify the “weights” of the various elements of the list: as will be discussed in section 6.3.2, this is a “hint” given by the user to help the Grid middleware perform a good partitioning.

Once the user has submitted the job, the Partitioner is then responsible for partitioning the list of independent steps into sub-lists, each one then corresponding to a sub-job.

At this point a DAG is created and submitted to the Grid. The resulting DAG will look like the one represented in Figure 3, where SJ1, SJ2, …, SJm represent the sub-jobs in which the original partitionable job was decomposed.

Figure 3: DAG created for a partitionable job

Both the m sub-jobs and the job aggregator will rely on the checkpointing APIs described in section 6.2.3.

Each sub-job will be given (through the specific JDL attribute), as its initial set of iterations, one of the sub-lists previously created by the Partitioner (one sub-list per sub-job). Therefore the initial state of each sub-job (defined by the initial set of iterations) is defined and stored in the LB server by the Partitioner.

Before exiting, each sub-job will have to save its final state using the save_state method. Each sub-job will have the same Requirements and Rank as the initial partitionable job. As far as the InputData attribute (specifying the list of input data) is concerned, if the list of independent steps represents this attribute, each sub-job will be given as InputData one of the sub-lists previously created by the Partitioner (one sub-list per sub-job).

When all the sub-jobs have completed their execution and have all saved their final state, the job aggregator can start its execution (this is triggered by the DAGMan mechanisms, described in section 7). This job will be given (as usual, through the proper JDL attribute), as its initial set of iterations, the list of the identifiers of the final states of the various sub-jobs. The job aggregator will have to “retrieve” (using the load_state method) these final states of the sub-jobs, and it will then perform some processing to collect these partial results (i.e. the results of the various sub-jobs) together. This is the pseudo-code of a possible job aggregator:

#include <api.h>

State AggState;
Label_t SubJobId;

while (SubJobId = AggState.get_next_step()) {
    State tmpstate(AggState.load_state(SubJobId));
    // the following instruction uses the tmpstate.get_value() functions
    processing_state(tmpstate);
}

print_overall_results();

First of all, a State object (AggState) for the aggregator job is instantiated. Its initial state is thereby retrieved, and the main_stepper data member (the list of iterations) is initialised with the identifiers of the final states of the various sub-jobs, which were specified in the proper JDL attribute. These sub-job states are retrieved (using the load_state function). Then the <var, value> pairs associated with each state are processed in some way, to produce the overall result of the original partitionable job.

6.3.2. Internal design for job partitioning

When a partitionable job is submitted, as discussed in section 6.3.1, the first problem is to partition it, in other words to decompose its list of independent steps/iterations into sub-lists, each sub-list corresponding to a different sub-job. This is the responsibility of a particular Helper module, the Partitioner, whose role in the new Workload Management System architecture has been introduced in section 3.2.1.

The strategies, the policies and the parameters that should/could be taken into account in such a partitioning can be various, and will be investigated to find the best approaches. As a starting point, and as a proof of concept for the whole framework, the initial list will be partitioned into m sub-lists, where m represents the total number of available CPUs over all the CEs satisfying the requirements of the job. Optimising the partition of the n elements of the initial list of independent iterations into m sub-lists is not trivial, since the Grid knows nothing about the complexity of, and the time taken to process, a single element (and these times can differ between the elements of the list). For this purpose a “hint” can be given by the user, to help the Grid middleware perform a good partitioning: the user can specify in the JDL expression a list of “weights” for the various elements of the initial list. If the user does not specify this attribute, all the elements are considered to have the same weight, and a simple partitioning is done: each sub-list will have n/m elements.
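As an illustration, the simple policy above, together with its weighted refinement, might look like the following greedy sketch; the function is a hypothetical example, not the actual Partitioner code.

```python
def partition(steps, m, weights=None):
    """Split the n independent iterations into m sub-lists. With no
    weights this degenerates into sub-lists of roughly n/m elements."""
    if weights is None:
        weights = [1] * len(steps)       # equal weights: ~n/m each
    sublists = [[] for _ in range(m)]
    loads = [0] * m
    # Greedy rule: place the heaviest remaining element on the
    # currently least-loaded sub-list.
    for step, w in sorted(zip(steps, weights), key=lambda p: -p[1]):
        i = loads.index(min(loads))
        sublists[i].append(step)
        loads[i] += w
    return sublists


print(partition(["e1", "e2", "e3", "e4", "e5", "e6"], m=3))
```

With user-supplied weights the same rule keeps the total weight per sub-job balanced, e.g. a single heavy element ends up alone in its sub-list.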

Other possible approaches and algorithms will be investigated in the future (possibly during the third year of the DataGrid project).

The next step is the definition of a DAG, composed by m+2 nodes: m nodes representing the m sub-jobs (which can be executed in parallel) and 2 other nodes, representing the pre-job (which is executed before the m sub-jobs), and the job aggregator (which can start its execution when the other m sub-jobs have terminated their run).

As far as the scheduling of the DAG is concerned, the matchmaking is performed for each node of the graph, and therefore in general the various nodes can be submitted to different CEs (“lazy” scheduling: see section 7.4).

6.4. EVALUATION CRITERIA FOR JOB PARTITIONING

As discussed in section 6.3.2, given a list of independent iterations, the Partitioner module must partition this list into m sub-lists (each one corresponding to a “parallel” sub-job). Deciding how many sub-lists have to be considered (i.e. choosing the value of m), and then partitioning the initial list into m sub-lists, is a complex problem, since many different policies and strategies could be considered. Many factors can affect this choice, such as the number of existing and available processors belonging to CEs “suitable” for that job, the resources (e.g. CPU time) needed to process each element of the list of independent iterations (which can be different for different elements), possible “hints” given by users, etc.

The choice of the right balance of information on which the decision can be based is an open problem: in section 6.3.2 a first, very rough strategy was proposed, but cleverer and more efficient strategies will emerge from the experiences of the application areas and will therefore be analysed later. The partitioning module will have to be modular and easily replaceable, since it is very unlikely that a single strategy will be the best one for every kind of application and every environment.

The strategies will be evaluated considering various parameters, such as the ones in this (non-exhaustive) list:

• the number of sub-jobs created, compared with the number of existing and free resources, suitable for those sub-jobs, in the Grid;

• the permanence time of individual sub-jobs in the Workload Management System queue;

• the difference in wall-clock time between the fastest sub-job and the slowest sub-job;

• the amount of resources consumed by each sub-job.

7. INTER-JOB DEPENDENCIES

7.1. INTRODUCTION

Very often the execution of a program Y relies on the outcome of a previously run program X, meaning that Y cannot start before X has successfully finished. Therefore there is a temporal dependency between the execution of X and the execution of Y. Generalizing this concept, it is possible to define a whole web of dependencies on a set of program executions, building a Directed Acyclic Graph (DAG), whose nodes are program executions (jobs), and whose arcs represent dependencies between them. A node represents a job and consists of three parts:

• a PRE-script, which is executed before the user's job is run;

• a user's job;

• a POST-script, which is executed after the user's job has run.

The PRE-script can be provided as a means to prepare the execution environment for a job, while the POST-script can be useful to check the job's outcome and to perform some post-processing actions.

A DAG node is considered failed if any of these three steps fails, while a whole DAG succeeds if and only if all its member jobs succeed.

An arc between two nodes represents a dependency between the jobs corresponding to those two nodes. At the moment we consider just temporal dependencies (e.g. run job B only when job A has finished), but the specification of other types of dependencies between nodes will be investigated.

The arcs are directed since there is a clear time order specifying which job must run first. Moreover, there cannot be loops in the graph, otherwise the time order would not be preserved.
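The acyclicity requirement can be checked mechanically. The following sketch (a standard Kahn topological sort, not part of the WMS code) takes the dependency graph as a mapping from each node name to the list of its children, mirroring the "Children" attribute introduced later in section 7.2.1:

```python
from collections import deque

def is_acyclic(children):
    """Return True if the dependency graph has no loops.

    `children` maps each node name to the list of its children.
    Kahn's algorithm: repeatedly remove nodes with no incoming arcs;
    a cycle leaves some nodes unremoved.
    """
    indegree = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            indegree[kid] = indegree.get(kid, 0) + 1
    ready = deque(node for node, deg in indegree.items() if deg == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for kid in children.get(node, []):
            indegree[kid] -= 1
            if indegree[kid] == 0:
                ready.append(kid)
    return visited == len(indegree)


# A diamond-shaped graph is acyclic; adding an arc D -> A creates a loop.
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
```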

Apart from the temporal dependencies, the jobs represented by the various nodes are independent: each one has its own executable, input, output, running environment, requirements, preferences, etc.

A DAG is considered as a single unit of work, and therefore it is represented (having its own entries) in the Logging and Bookkeeping Service, and it can be managed (e.g. killed or monitored) like a single “normal” job. Moreover the “visibility” of the single jobs (i.e. nodes) composing a DAG is also provided, and therefore users can, for example, kill or monitor such single jobs.

Within the Workload Management System, a DAG is managed by a meta-scheduler, called DAGMan (DAG Manager) [R14], whose main purpose is to navigate the graph, determine which nodes are free of dependencies, and follow the execution of the corresponding jobs. DAGMan is a product originally developed within the Condor project [R11]. A DAGMan process is started by CondorG for each DAG submitted to it.


7.2. EXTERNAL INTERFACE

Three aspects must be specified when describing how DAGs are seen from the client's point of view:

• the language used to specify a DAG;

• the API provided to manage a DAG;

• the state machine representing the status of a DAG during its lifetime.

7.2.1. Language

The language used to specify a DAG is the same Job Description Language (JDL) ([R6]) used for normal jobs.

The specification of a DAG consists of a ClassAd, where the attribute “Type” is set to "DAG", containing a set of ClassAd attributes, each one representing a job. A job ClassAd is specified using the usual attributes listed in [R13], plus the following ones:

• Children = <array of strings>; The identifiers (names) of the children of the considered job, i.e. the jobs that can be executed only after the considered job has completed its execution;

• PreScript = <string>; The script to run before job execution;

• PreScriptArguments = <array of strings>; The list of arguments for the PRE-script;

• PostScript = <string>; The script to run after the job has completed;

• PostScriptArguments = <array of strings>; The list of arguments for the POST-script.

For example the DAG represented in Figure 4 can be specified by the following JDL expression:


[
  Type = "DAG";
  A = [
    Executable = "A.sh";
    PreScript = "PreA.sh";
    PreScriptArguments = { "1" };
    Children = { "B", "C" }
  ];
  B = [
    Executable = "B.sh";
    PostScript = "PostA.sh";
    PostScriptArguments = { "$RETURN" };
    Children = { "D" }
  ];
  C = [
    Executable = "C.sh";
    Children = { "D" }
  ];
  D = [
    Executable = "D.sh";
    PreScript = "PreD.sh";
    PostScript = "PostD.sh";
    PostScriptArguments = { "1", "a" }
  ]
]


Figure 4: Example of DAG

The "$RETURN" macro, included in PostScriptArguments for job B in the above example, represents the exit status of B.sh and is provided as a way to pass the POST-script the exit status of its corresponding job. Note, however, that an exit status other than zero implies that the node, and hence the whole DAG, has failed.
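The three-step node semantics (PRE-script, job, POST-script, with $RETURN carrying the job's exit status and any failing step failing the node) can be sketched as follows. This is a hypothetical runner, not the actual DAGMan code; it models each step as a callable returning an exit status, and runs the POST-script even when the job fails, since its purpose is to inspect the job's outcome:

```python
def run_node(pre, job, post):
    """Run a DAG node's three steps in order.

    `pre`, `job` and `post` are callables returning an exit status
    (0 = success); `pre` and `post` may be None if absent. The
    POST-script receives the job's exit status, standing in for the
    "$RETURN" macro. The node fails if any of the three steps fails.
    """
    if pre is not None:
        status = pre()
        if status != 0:
            return ("failed", status)
    job_status = job()
    if post is not None:
        post_status = post(job_status)  # $RETURN passed to the POST-script
        if post_status != 0:
            return ("failed", post_status)
    if job_status != 0:
        return ("failed", job_status)
    return ("done", 0)
```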

7.2.2. Application Programming Interface

The operations on DAGs supported by the Workload Management System are specified using the following functions:

dg-job-submit: submits a DAG to the Workload Management System;

dg-job-cancel: kills a previously submitted DAG. All the jobs that are part of the DAG get killed as well. A rescue DAG (see section 7.3) is produced;

dg-job-status: returns the current status of the job (see section 7.2.3);

dg-job-get-output: retrieves the output sandboxes for all the DAG member jobs, assuming that the DAG has completed.


A DAG is addressable with an identifier. The identifier is assigned and used like a single-job identifier (see [R1]); in particular, it is assigned on the client side when the DAG is submitted to the Workload Management System.

7.2.3. DAG State Machine

A DAG goes through a number of states during its lifetime. Since, as introduced above, a DAG must be considered a single unit of work, the job state machine described in [R1] applies to it as well, although some of the states are not meaningful in this scenario, since there is no scheduling for the DAG as a whole, but only for its member jobs.

The states that apply to a DAG are (see Figure 5):

SUBMITTED: the user has submitted the DAG, using the User Interface;

WAITING: the DAG has been submitted to CondorG, but a DAGMan process has not been started yet to manage it;

RUNNING: a DAGMan process has been started for the DAG;

DONE: the DAG has completed (which does not mean that all its jobs completed successfully) or has been explicitly terminated by the user. If at least one job in the DAG did not complete successfully, a rescue DAG has been produced;

ABORTED: the DAG has been aborted for external reasons;

CLEARED: the user has successfully retrieved all the output files. Bookkeeping information is purged some time after the DAG enters this state.

Figure 5: DAG State Machine

Moreover, since, as reported in section 7.1, the visibility of the single jobs (nodes) composing a DAG is also provided, tools can be easily written to retrieve the status of all the jobs within the DAG.
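A minimal encoding of these states is given below. The transition set is an assumption inferred from the state descriptions above, not taken from the WMS code or from Figure 5 itself:

```python
# Hypothetical sketch of the DAG state machine; the allowed transitions
# are inferred from the state descriptions, not from the actual WMS code.
TRANSITIONS = {
    "SUBMITTED": {"WAITING", "ABORTED"},
    "WAITING":   {"RUNNING", "ABORTED"},
    "RUNNING":   {"DONE", "ABORTED"},
    "DONE":      {"CLEARED"},
    "ABORTED":   set(),
    "CLEARED":   set(),  # bookkeeping information is purged later
}

def advance(state, new_state):
    """Move a DAG to `new_state`, rejecting transitions the machine forbids."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```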


7.3. RECOVERY

Given the distributed and uncontrolled nature of the Grid, a failure in one of its components is not an uncommon event. If failures are likely to happen for single jobs, the problem is even worse for DAGs, since their lifetime is usually much longer than that of single jobs. The software infrastructure must therefore be resilient to any kind of fault that can occur, and it must be able to recover whenever possible.

When a DAG node fails, it does not make sense to rerun all its “previous” jobs. Instead, a recovery mechanism is used: if a DAG fails, a recovery file is created. This recovery file stores the so-called rescue DAG: the original DAG, where the already completed jobs have been marked as done. Therefore, when resubmitted, the rescue DAG executes only the jobs that are still to be run (i.e. the ones not marked as done). The resubmission of the rescue DAG is triggered automatically (without requiring user intervention) by the Workload Manager, if requested to do so.
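The rescue-DAG idea reduces to marking completed nodes as done and resubmitting only the rest. A sketch under that reading (hypothetical helper names; the real rescue DAG is a file produced by DAGMan, not an in-memory structure):

```python
def make_rescue_dag(dag, completed):
    """Build a rescue DAG: the original DAG with completed nodes marked done.

    `dag` maps node names to their job descriptions; `completed` is the
    set of nodes that finished successfully before the failure.
    """
    return {
        name: dict(desc, done=(name in completed))
        for name, desc in dag.items()
    }

def nodes_to_run(rescue_dag):
    """Nodes the resubmitted rescue DAG still has to execute."""
    return {name for name, desc in rescue_dag.items() if not desc["done"]}


# Example: A finished before the failure, so only B and C are rerun.
dag = {"A": {"Executable": "A.sh"},
       "B": {"Executable": "B.sh"},
       "C": {"Executable": "C.sh"}}
rescue = make_rescue_dag(dag, {"A"})
```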

7.4. DETAILED DESIGN

DAGMan can be seen as an iterator through the nodes of a DAG, looking for free nodes (i.e. nodes without dependencies). The corresponding jobs can then be submitted for execution, via CondorG.

Before doing this, it is of course necessary to choose the resource to which the job will be submitted: a scheduling decision must therefore be made.

Different models could be considered:

• Eager scheduling: the job is bound to a resource before the whole DAG is submitted;

• Lazy scheduling: the job is bound to a resource just before that job is submitted;

• Very lazy scheduling: the job is bound to a resource only once the resource has actually been acquired; when a resource becomes available, the best-matching job among those in the queue is selected for it.

It is felt that the later a job is bound to a resource, the better, although this can increase the complexity of the software. As for “very lazy scheduling”, supporting this approach would heavily affect the current Workload Management System design, so it is not considered viable, at least in the near future. Therefore the Workload Management System will implement the lazy scheduling model, although a user remains free to bind a job to a specific resource.

The implementation of the lazy model implies that a job is scheduled (i.e. bound to a resource) just before it is submitted to that resource via CondorG. To support this, DAGMan will call one or more Helper modules, along the same lines adopted within the Workload Manager, as introduced in section 3.2.2.
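Under the lazy model, a DAGMan-style loop finds dependency-free nodes, asks a Helper for a matching resource just before submission, and only then submits via CondorG. A sketch with hypothetical `match_resource` and `run_job` hooks standing in for the Helper module and CondorG submission:

```python
def run_dag_lazily(children, run_job, match_resource):
    """Execute a DAG with lazy scheduling (hypothetical sketch).

    `children` maps each node to its children (temporal dependencies).
    `match_resource(node)` stands in for the Helper module: it binds the
    job to a resource only just before submission. `run_job(node,
    resource)` stands in for submission via CondorG and returns True on
    success. Returns the list of (node, resource) submissions in order.
    """
    # Count the unfinished parents of every node.
    pending_parents = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            pending_parents[kid] += 1
    ready = [n for n, count in pending_parents.items() if count == 0]
    submitted = []
    while ready:
        node = ready.pop(0)
        resource = match_resource(node)  # lazy binding happens here
        if not run_job(node, resource):
            raise RuntimeError(f"node {node} failed; a rescue DAG is needed")
        submitted.append((node, resource))
        # A child becomes free of dependencies when its last parent finishes.
        for kid in children[node]:
            pending_parents[kid] -= 1
            if pending_parents[kid] == 0:
                ready.append(kid)
    return submitted
```

With eager scheduling, by contrast, `match_resource` would be called for every node before the loop starts, pinning all bindings at submission time.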


8. CONCLUSIONS

In this document we presented and discussed the proposed revised and final WP1 Workload Management System software architecture. With this new architecture, besides increasing the reliability and flexibility of the system, we aim to address some of the shortcomings that emerged in the first DataGrid testbed and to favour interoperability with other Grid frameworks; we also think it will be easier to plug in new components implementing new functionalities.

One of the new functionalities that we are going to provide is resource co-allocation, that is, the concurrent allocation of multiple, homogeneous or heterogeneous, resources. Resource co-allocation will be provided relying on mechanisms for immediate or advance reservation of resources, which can be heterogeneous in type and implementation and independently controlled and administered.

Another new functionality that must be supported is job partitioning. Job partitioning takes place when a job has to process a large set of "independent elements". In these cases it may be worthwhile to "decompose" the job into smaller sub-jobs, each one responsible for processing just a sub-set of the original large set of elements. The proposed approach, discussed in the document, is to address the job partitioning problem in the context of job checkpointing. Checkpointing a job during its execution means "saving" its state in some way, so that the job execution can be suspended and resumed later, starting from the point where it was previously stopped.

Job partitioning also requires mechanisms to address inter-job dependencies, which can be represented by Directed Acyclic Graphs (DAGs), whose nodes are program executions (jobs) and whose arcs represent dependencies between them. As discussed in the document, within the Workload Management System a DAG will be managed by a meta-scheduler, called DAGMan (DAG Manager).