EGEE is a project funded by the European Union under contract INFSO-RI-508833 Practical approaches to Grid workload management in the EGEE project Massimo Sgaravatto INFN Padova On behalf of the EGEE JRA1 IT-CZ cluster CHEP 2004 www.eu-egee.org


EGEE is a project funded by the European Union under contract INFSO-RI-508833

Practical approaches to Grid workload management in the

EGEE project

Massimo Sgaravatto, INFN Padova

On behalf of the EGEE JRA1 IT-CZ cluster

CHEP 2004

www.eu-egee.org

Chep 2004 - 2

EGEE project

• EGEE project
  - Aim: build a consistent, robust and secure Grid infrastructure
  - Focus first on two pilot application areas (HENP, biomedical applications)
    • But the goal is to bring in other researchers from academia and industry

• Middleware activity (JRA1)
  - Re-engineer Grid software to provide production-quality middleware
  - Evolution towards emerging standards, based on Service Oriented Architectures
  - Taking into account application requirements and production/deployment/management needs

• See talk #247 (E. Laure)

Chep 2004 - 3

Workload management

• Grid workload and resource management is one of the key Grid middleware functionalities
  - How to efficiently schedule a large number of different data-intensive jobs, submitted by a distributed community of users, to a Grid encompassing many heterogeneous resources

• Progress was made in various projects with different integrated software solutions:
  - DataGrid Workload Management System
  - Condor
  - EuroGrid-Unicore resource broker
  - …

• Still a lot to do
  - Scalability, reliability
  - Identification and handling of failures originating from different software layers, and possibly from 'foreign' Grid systems and resources
  - Distributed (hierarchical?) super-scheduling
  - Proper semantics of resource information collection and distribution (push, pull, index, cache, refresh)
  - …

Chep 2004 - 4

Workload Management System

• Provision of Grid Workload Management System services assigned to the “EGEE JRA1 Italian Czech cluster”
  - CESNET
  - Datamat S.p.A.
  - INFN

• Architecture of the EGEE WMS designed and being implemented
  - Taking into account feedback and requirements from reference applications and deployment/production/management activities
  - Taking into account previous experiences from other Grid projects (in particular the DataGrid WMS)
  - Set of Grid services
    • Workload Manager (WM)
    • Computing Element (CE): resource access
    • Logging & Bookkeeping (L&B)
    • Job Provenance (JP)
    • Grid Accounting service
  - Interoperating among them and with other EGEE Grid Services

Chep 2004 - 5

Workload Manager

Chep 2004 - 6

Workload Manager

Job management requests (submission, cancellation) expressed via a Job Description Language (JDL)

Chep 2004 - 7

Workload Manager

Keeps submission requests

Requests are kept for a while if no matching resources are available
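
The retention behaviour on this slide can be sketched roughly as follows. This is an illustration only, not the EGEE implementation: the class name `TaskQueue` echoes the component named on the Status slide, while `max_age_s`, `add`, `retry` and the injected `clock` are hypothetical names chosen for the example.

```python
import time
from collections import deque


class TaskQueue:
    """Holds submission requests that could not be matched yet (hypothetical API)."""

    def __init__(self, max_age_s, clock=time.monotonic):
        self.max_age_s = max_age_s      # how long "a while" is (assumption)
        self._clock = clock             # injectable clock, handy for testing
        self._pending = deque()         # entries are (enqueue_time, request)

    def add(self, request):
        self._pending.append((self._clock(), request))

    def retry(self, try_match):
        """Retry every pending request once.

        try_match(request) returns a CE name, or None if nothing matches.
        Returns (matched, expired); the remaining requests stay queued.
        """
        matched, expired = [], []
        still_pending = deque()
        now = self._clock()
        for enqueued_at, request in self._pending:
            ce = try_match(request)
            if ce is not None:
                matched.append((request, ce))
            elif now - enqueued_at > self.max_age_s:
                expired.append(request)
            else:
                still_pending.append((enqueued_at, request))
        self._pending = still_pending
        return matched, expired

    def __len__(self):
        return len(self._pending)
```

A caller would invoke `retry` whenever new resource information arrives, so that held requests get another chance before they expire.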

Chep 2004 - 8

Workload Manager

Repository of resource information available to matchmaker

Updated via notifications and/or active polling on sources
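
The two update paths named on this slide (notifications and active polling) could look like the following sketch. The class `InformationRepository` and its methods are hypothetical names for illustration; the slides do not specify an interface.

```python
class InformationRepository:
    """Resource information made available to the matchmaker (hypothetical API)."""

    def __init__(self):
        self._info = {}

    def on_notification(self, source, info):
        """Update path 1: a source notifies the repository of its new state."""
        self._info[source] = dict(info)

    def poll(self, sources):
        """Update path 2: actively query each source.

        `sources` maps a source name to a zero-argument callable returning a dict.
        """
        for name, fetch in sources.items():
            self._info[name] = dict(fetch())

    def snapshot(self):
        """Copy of the current view, as the matchmaker would read it."""
        return {name: dict(info) for name, info in self._info.items()}
```

Whichever path wrote last wins in this sketch; a real repository would also need to decide how stale an entry may become before it is refreshed or dropped.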

Chep 2004 - 9

Workload Manager

Finds an appropriate CE for each submission request, taking into account job requests and preferences, Grid status, utilization policies on resources
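
As a rough illustration of this matchmaking step, the sketch below filters resources by policy and job requirements, then picks the one the job prefers most. All field names (`allowed_vos`, `queue_len`, `free_cpus`, `requirements`, `rank`) and the function `find_ce` are assumptions made for the example, not the WMS data model.

```python
def find_ce(job, resources):
    """Return the name of the best CE for `job`, or None if none qualifies."""
    candidates = [
        r for r in resources
        if job["vo"] in r["allowed_vos"]    # utilization policy on the resource
        and job["requirements"](r)          # hard requirements of the job
    ]
    if not candidates:
        return None
    # Preferences: the candidate with the highest rank wins.
    return max(candidates, key=job["rank"])["ce"]


# Grid status as the matchmaker might see it (made-up values).
resources = [
    {"ce": "ce-a", "os": "linux", "free_cpus": 2, "queue_len": 10,
     "allowed_vos": {"hep", "bio"}},
    {"ce": "ce-b", "os": "linux", "free_cpus": 8, "queue_len": 1,
     "allowed_vos": {"hep"}},
    {"ce": "ce-c", "os": "linux", "free_cpus": 64, "queue_len": 0,
     "allowed_vos": {"bio"}},
]

job = {
    "vo": "hep",
    "requirements": lambda r: r["os"] == "linux" and r["free_cpus"] >= 1,
    "rank": lambda r: -r["queue_len"],      # prefer the shortest queue
}

best = find_ce(job, resources)
```

Here `ce-c` has the shortest queue but does not admit the job's VO, so the policy filter removes it before ranking.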

Chep 2004 - 10

Scheduling policies

• Different possible policies
  - Eager scheduling: a job is bound to a resource as soon as possible
    • Job is then forwarded to that CE, where it will very likely end up in a queue
  - Lazy scheduling: job is held by the WM until a resource becomes available
    • Job is then forwarded to that CE for immediate execution

• WM architecture able to accommodate both models (and the intermediate solutions)
  - Eager scheduling: matching a job against multiple resources
  - Lazy scheduling: matching a resource against multiple jobs

• Strengths and weaknesses of the different policies in different scenarios need further investigation
  - Evaluation of relevant metrics, covering both resource utilization and user satisfaction
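
The difference between the two policies is the direction of the match. A minimal sketch, with hypothetical field names (`total_cpus`, `free_cpus`, `queue_len`, `priority`) and selection criteria that the slides do not prescribe:

```python
def eager_schedule(job, resources):
    """Eager: match ONE job against MANY resources and bind it at once.

    The chosen CE may be busy, so the job will likely wait in its queue.
    """
    fitting = [r for r in resources if r["total_cpus"] >= job["cpus"]]
    if not fitting:
        return None
    return min(fitting, key=lambda r: r["queue_len"])["ce"]


def lazy_schedule(resource, held_jobs):
    """Lazy: match ONE free resource against MANY jobs held by the WM.

    The chosen job can start executing immediately on that resource.
    """
    fitting = [j for j in held_jobs if j["cpus"] <= resource["free_cpus"]]
    if not fitting:
        return None
    return max(fitting, key=lambda j: j["priority"])["id"]
```

An intermediate solution could, for example, apply the eager path only when some CE reports an empty queue and hold the job otherwise.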

Chep 2004 - 11

Computing Element

• Service representing a computing resource
• Main functionality: job management
  - Run jobs
  - Cancel jobs
  - Suspend and resume jobs
  - Provide info on “quality of service”
    • How many resources match the job requirements?
    • What is the estimated time until the job starts executing?
    • …
  - …
• Used by the WM or by any other client (e.g. end-user)
• CE architecture designed to support both push and pull models
  - Push model: the job is pushed to the CE by the WM
  - Pull model: the CE asks the WM for jobs
• These two models are somewhat mirrored in the resource information flow
  - In order to 'pull' a job, a resource must choose where to 'push' information about itself
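
The push and pull models, and the mirrored information flow, can be sketched with two small classes. The method names (`accept`, `pull_from`, `advertise`, `push_to`, `request_job`) are hypothetical and only serve to show who initiates each interaction.

```python
from collections import deque


class ComputingElement:
    def __init__(self, name, slots):
        self.name = name
        self.slots = slots
        self.running = []

    def free_slots(self):
        return self.slots - len(self.running)

    def accept(self, job):
        """Push model entry point: the WM hands a job to the CE."""
        self.running.append(job)

    def pull_from(self, wm):
        """Pull model: first push information about itself, then ask for jobs."""
        wm.advertise(self.name, self.free_slots())
        while self.free_slots() > 0:
            job = wm.request_job(self.name)
            if job is None:
                break
            self.running.append(job)


class WorkloadManager:
    def __init__(self):
        self.pending = deque()
        self.advertised = {}    # CE name -> free slots, as pushed by the CEs

    def submit(self, job):
        self.pending.append(job)

    def advertise(self, ce_name, free_slots):
        self.advertised[ce_name] = free_slots

    def push_to(self, ce):
        """Push model: the WM forwards the next pending job to a CE."""
        if self.pending:
            ce.accept(self.pending.popleft())

    def request_job(self, ce_name):
        """Pull model: only CEs that advertised themselves receive a job."""
        if ce_name not in self.advertised or not self.pending:
            return None
        return self.pending.popleft()
```

In this sketch a CE that never advertised itself gets nothing back from `request_job`, which mirrors the point that a resource must push information about itself before it can pull work.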

Chep 2004 - 12

CE Architecture

[Diagram components: Client, WEB, CE, Mon, LSF, PBS, ?, Worker Nodes]

Web service accepting job management requests:
JobSubmit, JobAssess, JobKill, JobSuspend, JobResume, JobGetStatus

Chep 2004 - 13

CE Architecture

[Diagram components: Client, WEB, CE, Mon, LSF, PBS, ?, Worker Nodes]

Notifications: asynchronous notifications about job/CE events

Job requests: for a CE working in pull mode

Chep 2004 - 14

Logging & Bookkeeping

• Collects and manages job-related events (e.g. submission, suitable CE found, start of execution, …) from the WMS components

• Processes these events to give a higher level view on job states

• Both job states and raw data available to users
  - Also via Web Service interface

• Possible to subscribe to receive notifications on particular job state changes

• L&B event trail can be analyzed to identify problems with resources ("black holes", unusual failure rates, etc.).

• See poster #419 for more details
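
Deriving a higher-level job state from raw events, and mining the trail for problematic resources, might look like the sketch below. The event names, state names and both functions are illustrative assumptions and do not reproduce the actual L&B event or state model.

```python
from collections import Counter

# Hypothetical mapping from raw event names to higher-level job states.
EVENT_TO_STATE = {
    "submitted": "SUBMITTED",
    "ce_found": "READY",
    "transferred": "SCHEDULED",
    "started": "RUNNING",
    "finished": "DONE",
    "failed": "ABORTED",
    "cancelled": "CANCELLED",
}


def job_state(events):
    """Derive the current state from raw events given as (timestamp, name)."""
    state = "UNKNOWN"
    for _, name in sorted(events):          # replay in timestamp order
        state = EVENT_TO_STATE.get(name, state)
    return state


def failure_rates(outcomes):
    """Fraction of aborted jobs per CE; `outcomes` holds (ce_name, final_state)."""
    total, failed = Counter(), Counter()
    for ce, state in outcomes:
        total[ce] += 1
        if state == "ABORTED":
            failed[ce] += 1
    return {ce: failed[ce] / total[ce] for ce in total}
```

A CE whose failure rate stays close to 1.0 while it keeps accepting jobs quickly would be a candidate "black hole".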

Chep 2004 - 15

Job Provenance

• Keeps track of the definition of submitted jobs, execution conditions and job life cycle for a long time
  - Job life logs (JDL, timestamps, job IDs, …)
  - Executable and input/output files
  - Execution environment (OS, installed software version, …)
  - Custom data provided by user

• Used for
  - Debugging
  - Post-mortem analysis
  - Comparison of job executions in an evolving environment

• Service components
  - Primary Storage Server
    • Keeps data in the most compact and economical form
  - Index Servers
    • Configured to support a set of queryable attributes

• See poster #419 for more details

Chep 2004 - 16

Grid Accounting

• Accumulates information about the usage of Grid resources by users / groups (e.g. VOs)

• To be used
  - To track resource usage
  - To discover abuses (and help avoid them)

• Also possible to charge users for the resources they have used

• Allows implementation of submission policies based on resource usage
  - Exchange market among Grid users and Grid resource owners, which should result in market equilibrium
  - Load balancing on the Grid
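
Accumulating usage per user or per VO amounts to grouping metered records by a key. A minimal sketch, where the record fields (`user`, `vo`, `resource`, `cpu_s`) and the function `usage_report` are assumptions for illustration:

```python
from collections import defaultdict


def usage_report(records, key):
    """Sum CPU seconds per value of `key` ('user', 'vo' or 'resource')."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec[key]] += rec["cpu_s"]
    return dict(totals)


# Made-up usage records, as they might arrive from resource metering.
records = [
    {"user": "alice", "vo": "hep", "resource": "ce-a", "cpu_s": 1200},
    {"user": "bob", "vo": "bio", "resource": "ce-a", "cpu_s": 300},
    {"user": "alice", "vo": "hep", "resource": "ce-b", "cpu_s": 800},
]

per_vo = usage_report(records, "vo")
```

The same totals could feed a submission policy, for instance by lowering the priority of a VO that has already consumed more than its share.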

Chep 2004 - 17

Accounting architecture

[Diagram components: Accounting, Computing Element, Storage Element]

Resource metering: getting info about resource usage

Chep 2004 - 18

Accounting architecture

[Diagram components: Accounting, Computing Element, Storage Element]

Reports about resource usage per user / VO / resource

Chep 2004 - 19

Accounting architecture

[Diagram components: Accounting, Computing Element, Storage Element, Resource pricing, Resource owner]

Chep 2004 - 20

Accounting architecture

[Diagram components: Accounting, Computing Element, Storage Element, Resource pricing, Resource owner, Cost computation]
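
One simple way to combine metering and pricing into a cost is a price-weighted sum of the metered quantities. The slides do not give a formula, so the function `job_cost`, the metric names and the linear pricing model below are all assumptions.

```python
def job_cost(usage, prices):
    """Cost of a job as the sum of metered quantity times unit price.

    usage:  metric name -> metered quantity (from resource metering)
    prices: metric name -> price per unit (set by the resource owner)
    A metric without a price raises KeyError rather than being billed as free.
    """
    return sum(quantity * prices[metric] for metric, quantity in usage.items())


# Made-up numbers in arbitrary credit units.
usage = {"cpu_s": 3600, "storage_mb": 500}
prices = {"cpu_s": 2, "storage_mb": 1}

cost = job_cost(usage, prices)
```

Letting the resource owner adjust `prices` over time is what would allow the exchange market and load balancing mentioned on the Grid Accounting slide.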

Chep 2004 - 21

Status

• Workload Manager, Logging & Bookkeeping, Grid Accounting software inherited from the DataGrid WMS software
  - Being revised and complemented according to the new architecture
    • E.g. Information Supermarket and TaskQueue are new developments
    • Web services interfaces
  - First implementation already deployed in the EGEE gLite prototype testbed

• Computing Element
  - New developments
  - CEMon prototype already implemented

• Job Provenance
  - New component being implemented

Chep 2004 - 22

Links

• EGEE JRA1 IT-CZ cluster homepage
  - http://egee-jra1-wm.mi.infn.it/egee-jra1-wm

• EGEE JRA1 (middleware activity) homepage
  - http://egee-jra1.web.cern.ch/egee-jra1

• EGEE project homepage
  - http://www.eu-egee.org