
Overview of release 2 of the EDG WP1 Workload Management System deployed in the INFN production Grid

Massimo Sgaravatto INFN Padova - DataGrid WP1



WP1 Workload Management System

Working Workload Management System prototype implemented by WP1 in the first phase of the EDG project

Software released in September 2001

One of the very few components in the first release of EDG software

Application users have been working with this first release of the workload management system for a while

Stress tests and semi-production activities (e.g. CMS stress tests, ATLAS efforts)

Significant achievements exploited by the experiments, but various problems were also spotted

Impacting in particular the reliability and scalability of the system


Review of WP1 WMS architecture

WP1 WMS architecture reviewed

To apply the "lessons" learned and address the shortcomings that emerged with the first release of the software, in particular:

To address the reliability problems

To address the scalability problems

To support new functionalities

To favor interoperability with other Grid frameworks, by allowing WP1 modules (e.g. the RB) to be used also "outside" the EDG WMS


WMS Revised Architecture

[Architecture diagram (WP1): UI; RLS; Information Service (CE and SE characteristics and status); RB node with Network Server, Workload Manager, Match-Maker/Broker, Job Adapter, Job Controller (CondorG), Log Monitor and RB storage; Logging & Bookkeeping]


Job submission example

[Diagram: UI; RB node with Network Server, Workload Manager and Job Controller (CondorG); RLS; Information Service (CE and SE characteristics and status); Computing Element; Storage Element]

Job submission

UI: allows users to access the functionalities of the WMS

Job Description Language (JDL) to specify job characteristics and requirements

edg-job-submit myjob.jdl

myjob.jdl:

    JobType = "Normal";
    Executable = "$(CMS)/exe/sum.exe";
    InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
    OutputSandbox = {"sim.err", "test.out", "sim.log"};
    Requirements = other.GlueHostOperatingSystemName == "linux" && other.GlueHostOperatingSystemRelease == "Red Hat 6.2" && other.GlueCEPolicyMaxWallClockTime > 10000;
    Rank = other.GlueCEStateFreeCPUs;

Job status: submitted

Job submission

NS: network daemon responsible for accepting incoming requests

Diagram labels: Job, Input Sandbox files, RB storage

Job status: submitted, waiting

Job submission

WM: responsible for taking the appropriate actions to satisfy the request

Job status: submitted, waiting

Job submission

Match-Maker/Broker: where must this job be executed?

Job status: submitted, waiting

Job submission

Matchmaker: responsible for finding the "best" CE to which a job should be submitted

Job status: submitted, waiting

Job submission

Match-Maker/Broker queries:

RLS: where (on which SEs) are the needed data?

Information Service: what is the status of the Grid?

Job status: submitted, waiting

Job submission

Match-Maker/Broker: CE choice

Job status: submitted, waiting
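The CE choice made by the Match-Maker/Broker can be pictured with a small Python sketch. It is only an illustration, not the WP1 implementation: the CE records, their values and the names `make_ce`, `requirements`, `rank` and `choose_ce` are assumptions, while the attribute names mirror the Glue attributes used in the Requirements and Rank expressions of myjob.jdl.

```python
# Illustrative sketch only: CE data and function names are hypothetical.
# It mimics the JDL semantics: Requirements filters the CEs, Rank orders them.

def make_ce(ce_id, os_release, max_wallclock, free_cpus):
    """Build a made-up CE description, as an information service might publish it."""
    return {
        "id": ce_id,
        "GlueHostOperatingSystemName": "linux",
        "GlueHostOperatingSystemRelease": os_release,
        "GlueCEPolicyMaxWallClockTime": max_wallclock,
        "GlueCEStateFreeCPUs": free_cpus,
    }

def requirements(ce):
    """Mirror of the Requirements expression in myjob.jdl."""
    return (ce["GlueHostOperatingSystemName"] == "linux"
            and ce["GlueHostOperatingSystemRelease"] == "Red Hat 6.2"
            and ce["GlueCEPolicyMaxWallClockTime"] > 10000)

def rank(ce):
    """Mirror of Rank = other.GlueCEStateFreeCPUs."""
    return ce["GlueCEStateFreeCPUs"]

def choose_ce(ces):
    """Return the matching CE with the highest rank, or None if nothing matches."""
    matching = [ce for ce in ces if requirements(ce)]
    return max(matching, key=rank) if matching else None

ces = [
    make_ce("ce-a", "Red Hat 6.2", 20000, 4),
    make_ce("ce-b", "Red Hat 6.2", 20000, 12),
    make_ce("ce-c", "Red Hat 6.2", 5000, 50),   # fails the wall clock requirement
]
best = choose_ce(ces)
print(best["id"])  # ce-b
```

Here ce-c has the most free CPUs but is discarded by the requirements, so the rank only decides between ce-a and ce-b.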

Job submission

JA (JobAdapter): responsible for the final "touches" to the job before submission (e.g. creation of the wrapper script)

Job status: submitted, waiting

Job submission

JC (Job Controller): responsible for the actual job management operations (done via CondorG)

Job status: submitted, waiting, ready

Job submission

Diagram labels: Job, Input Sandbox files

Job status: submitted, waiting, ready, scheduled

Job submission

Diagram labels: Job, Input Sandbox, "Grid enabled" data transfers/accesses

Job status: submitted, waiting, ready, scheduled, running

Job submission

Diagram labels: Output Sandbox files

Job status: submitted, waiting, ready, scheduled, running, done

Job submission

edg-job-get-output <dg-job-id>

Diagram labels: Output Sandbox

Job status: submitted, waiting, ready, scheduled, running, done

Job submission

Diagram labels: Output Sandbox files

Job status: submitted, waiting, ready, scheduled, running, done, cleared
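The sequence of job states shown across the submission slides can be summarised as a simple linear progression. The Python sketch below is an assumption-laden illustration: the state names are taken from the slides, but the `JobStatus` class is hypothetical and deliberately ignores failure and resubmission paths.

```python
# Illustrative sketch: state names come from the slides; the class is
# hypothetical and models only the successful path of a job.
STATES = ["submitted", "waiting", "ready", "scheduled", "running", "done", "cleared"]

class JobStatus:
    def __init__(self):
        # Every job starts in the first state of the sequence.
        self.history = [STATES[0]]

    @property
    def current(self):
        return self.history[-1]

    def advance(self):
        """Move to the next state; refuse to go past the final one."""
        idx = STATES.index(self.current)
        if idx == len(STATES) - 1:
            raise ValueError("job already cleared")
        self.history.append(STATES[idx + 1])
        return self.current

job = JobStatus()
while job.current != "cleared":
    job.advance()
print(" -> ".join(job.history))
```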


Job monitoring

[Diagram: UI; Logging & Bookkeeping; RB node with Network Server, Workload Manager, Job Controller (CondorG) and Log Monitor; Computing Element. Labels: log of job events, job status]

LM (Log Monitor): parses the CondorG log file (where CondorG logs information about jobs) and notifies the LB

LB (Logging & Bookkeeping): receives and stores job events; computes the corresponding job status

edg-job-status <dg-job-id>

edg-job-get-logging-info <dg-job-id>


Interoperability with other services

Information services
  Queried by the RB to see the status of the Grid (characteristics and status of CEs and SEs)
  WMS able to work with:
    Globus MDS based information services (as in LCG and INFN-GRID testbeds)
    RGMA-GOUT (as in EDG testbed)
  Evaluating direct (i.e. not via GOUT) interoperability with RGMA

Replica Location Service
  Queried by the RB to see where the specified data are physically available (in which SEs)
  WMS able to work with EDG-WP2 RLS
  On-going plans to work also with US RLS

VOMS
  Used for VO based security


Deployment of WMS services

[Deployment diagram: submitting machine (UI); "RB node" (NS, WM, JC, LM, PR); LB server; II/GOUT; RLS; MyProxy server; VOMS; CEs and SEs]

Diagram annotations:

One or more for each RB

Usually deployed in the "RB node" (but not required)

One for each RB

"Community" (VO) RB or "Personal" RB

CE: queue of a LRMS (LSF, PBS)

Usually one per VO

Possibility to submit to more than one RB from a single UI

MyProxy server: used for proxy renewal

VOMS: used for VO based security


WMS release 2: new functionalities

User APIs

GUI

Support for interactive jobs

Job checkpointing

Support for parallel jobs

Gangmatching

Support for automatic output data upload and registration

VOMS support

…


GUI & APIs


Interactive jobs

Specified by setting JobType = "Interactive" in the JDL

When an interactive job is executed, a window for the stdin, stdout and stderr streams is opened

Possibility to send the stdin to the job

Possibility to have the stderr and stdout of the job while it is running

Possibility to start a window for the standard streams of a previously submitted interactive job with the command edg-job-attach


Job checkpointing

Checkpointing: saving the job state from time to time
  Useful to prevent data loss due to unexpected failures

Approach: provide users with a "trivial" logical job checkpointing service
  User can save from time to time the state of the job (defined by the application)
  A job can be restarted from an intermediate (i.e. previously saved) job state

Different from "classical" checkpointing (i.e. saving all the information related to a process: the process's data and stack segments, open files, etc.)
  Very difficult to apply (e.g. problems in saving the state of open network connections)
  Not necessary for many applications

To submit a checkpointable job
  Code must be instrumented (see next slides)
  JobType=Checkpointable to be specified in JDL


Job checkpointing example

Example of application (e.g. HEP Monte Carlo simulation):

    int main () {
        …
        for (int i = event; i < EVMAX; i++) {
            < process event i >;
        }
        ...
        exit(0);
    }


Job checkpointing example

    #include "checkpointing.h"

    int main () {
        JobState state(JobState::job);
        event = state.getIntValue("first_event");
        PFN_of_file_on_SE = state.getStringValue("filename");
        ….
        var_n = state.getBoolValue("var_n");
        < copy file_on_SE locally >;
        …
        for (int i = event; i < EVMAX; i++) {
            < process event i >;
            ...
            state.saveValue("first_event", i+1);
            < save intermediate file on a SE >;
            state.saveValue("filename", PFN of file_on_SE);
            ...
            state.saveValue("var_n", value_n);
            state.saveState();
        }
        …
        exit(0);
    }

User code must be easily instrumented in order to exploit the checkpointing framework …

User defines what is a state
  Defined as <var, value> pairs
  Must be "enough" to restart a computation from a previously saved state

User can save from time to time the state of the job

Retrieval of the last saved state: the job can restart from that point
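The logical checkpointing idea (state as <var, value> pairs, saved from time to time and reloaded on restart) can be tried out with the following Python sketch. It is not the EDG checkpointing API: the `JobState` class here is a hypothetical stand-in that keeps the pairs in a local JSON file, the variable name `partial_sum` is invented for the example, and summing the event numbers stands in for real event processing.

```python
import json
import os
import tempfile

class JobState:
    """Hypothetical stand-in for a checkpoint state: <var, value> pairs in a JSON file."""

    def __init__(self, path):
        self.path = path
        self.values = {}
        if os.path.exists(path):          # retrieval of the last saved state
            with open(path) as f:
                self.values = json.load(f)

    def get(self, name, default=None):
        return self.values.get(name, default)

    def save_value(self, name, value):
        self.values[name] = value

    def save_state(self):
        # Write to a temporary file first so a crash never leaves a half-written state.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.values, f)
        os.replace(tmp, self.path)

def run(path, ev_max, fail_at=None):
    """Process events first_event..ev_max-1, saving the state after each one."""
    state = JobState(path)
    first = state.get("first_event", 0)
    total = state.get("partial_sum", 0)
    processed = []
    for i in range(first, ev_max):
        if i == fail_at:
            raise RuntimeError("simulated failure")
        total += i                        # stands in for "process event i"
        processed.append(i)
        state.save_value("first_event", i + 1)
        state.save_value("partial_sum", total)
        state.save_state()
    return total, processed

path = os.path.join(tempfile.mkdtemp(), "job.state")
try:
    run(path, 10, fail_at=6)              # first attempt dies before event 6
except RuntimeError:
    pass
total, processed = run(path, 10)          # restart: resumes from the saved state
print(total, processed)                   # 45 [6, 7, 8, 9]
```

The second run only processes events 6 to 9, yet the total equals the sum over all ten events, because the partial result was part of the saved state.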


Job checkpointing scenarios

Scenario 1
  Job submitted to a CE
  When the job runs it saves its state from time to time
  Job failure, due to a Grid problem (e.g. CE problem)
  Job resubmitted by the WMS, possibly to a different CE
  Job restarts its computation from the last saved state
    No need to restart from the beginning
    The computation done until that moment is not lost

Scenario 2
  Job failure, but not detected by the Grid middleware
  User can retrieve a saved state for the job (typically the last one)
    edg-job-get-chkpt –o <state> <edg-jobid>
  User resubmits the job, specifying that the job must start from a specific initial state (the retrieved one)
    edg-job-submit –chkpt <state> <JDL file>


Submission of parallel jobs

Possibility to submit MPI jobs

MPICH implementation supported

Only parallel jobs inside a single CE can be submitted

Submission of parallel jobs very similar to normal jobs

Just needed to specify in the JDL:

    JobType = "MPICH";
    NodeNumber = n;

where n is the number of requested CPUs

Matchmaking

CE chosen by the RB has to have the MPICH software installed, and at least n total CPUs

If there are two or more CEs satisfying all the requirements, the one with the highest number of free CPUs is chosen
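The matchmaking rule for MPICH jobs stated above (MPICH installed, at least n total CPUs, then the highest number of free CPUs) can be written down as a short Python sketch. The CE records, the field names `software`, `total_cpus` and `free_cpus`, and the function `choose_ce_for_mpich` are assumptions made for illustration; they are not the attributes actually used by the RB.

```python
# Illustrative sketch only: field names and CE data are hypothetical.

def choose_ce_for_mpich(ces, node_number):
    """Pick a CE for an MPICH job requesting node_number CPUs.

    A CE is eligible if it has MPICH installed and at least node_number
    total CPUs; among eligible CEs, the one with most free CPUs wins.
    """
    eligible = [ce for ce in ces
                if "MPICH" in ce["software"] and ce["total_cpus"] >= node_number]
    if not eligible:
        return None
    return max(eligible, key=lambda ce: ce["free_cpus"])

ces = [
    {"id": "ce1", "software": {"MPICH"}, "total_cpus": 16, "free_cpus": 2},
    {"id": "ce2", "software": {"MPICH"}, "total_cpus": 32, "free_cpus": 10},
    {"id": "ce3", "software": set(),     "total_cpus": 64, "free_cpus": 60},  # no MPICH
    {"id": "ce4", "software": {"MPICH"}, "total_cpus": 4,  "free_cpus": 4},   # too small
]
chosen = choose_ce_for_mpich(ces, node_number=8)
print(chosen["id"])  # ce2
```

Note that eligibility is checked against the total CPUs, while the tie-break uses the free CPUs, as on the slide.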


Gangmatching

With "standard" matchmaking there are only 2 "involved entities": the job and the CE

Gangmatching allows SE information, besides CE information, to be taken into account in the matchmaking

Typical use case for gangmatching: my job has to run on a CE close to a SE with at least 200 MB of available space:

Requirements = anyMatch(other.storage.CloseSEs, target.GlueSAStateAvailableSpace > 200);
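A rough Python analogue of this requirement may help to read it: a CE matches if at least one of its close SEs satisfies the space condition. The data layout (`close_ses` lists nested in CE records), the CE and SE names, and the helper names `any_match` and `gang_requirements` are assumptions for illustration; the numbers are simply compared against the threshold 200 as in the expression above.

```python
# Illustrative sketch only: the data layout and helper names are hypothetical.

def any_match(items, predicate):
    """True if the predicate holds for at least one item (cf. anyMatch in the JDL)."""
    return any(predicate(item) for item in items)

def gang_requirements(ce, min_space=200):
    """A CE matches if one of its close SEs has more than min_space available."""
    return any_match(ce["close_ses"],
                     lambda se: se["GlueSAStateAvailableSpace"] > min_space)

ces = [
    {"id": "ce1", "close_ses": [{"id": "se1", "GlueSAStateAvailableSpace": 50}]},
    {"id": "ce2", "close_ses": [{"id": "se2", "GlueSAStateAvailableSpace": 120},
                                {"id": "se3", "GlueSAStateAvailableSpace": 900}]},
    {"id": "ce3", "close_ses": []},
]
matching = [ce["id"] for ce in ces if gang_requirements(ce)]
print(matching)  # ['ce2']
```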


Other new functionalities

VOMS support
  VO taken from VOMS user proxy
  Matchmaking performed with respect to the VO
  No longer necessary to publish the list of authorized users in the information service (only the list of authorized VOs is needed)
  In any case WMS works also with non-VOMS proxies

Compliance with Glue Schema
  Common Information Service schema between US and EU HENP Grid Projects

LB ACLs
  Allow setting who can query the status of a given job

Output data upload and registration
  Possibility to trigger (via JDL) output data upload into a SE and registration in the Replica Location Service (RLS)


Output data registration

OutputData = {
    [
        OutputFile = "filename1";
        LogicalFileName = "lfn:mylfn1";
        StorageElement = "testbed007.cnaf.infn.it"
    ],
    [
        OutputFile = "filename2"
    ],
    [
        LogicalFileName = "lfn:mylfn2"
    ],
    [
        StorageElement = "testbed007.cnaf.infn.it"
    ]
}

First entry: both LFN and target SE specified

Second entry: neither LFN nor target SE specified

Third entry: only LFN specified

Fourth entry: only target SE specified


Status

WMS release 2 being used and evaluated by applications
  Deployed in LCG testbed
  Deployed in EDG testbeds
  Being deployed, customized and exploited in CrossGrid testbed
  Deployed in INFN-GRID testbeds
    INFN-GRID development testbed used to test new functionality

So far very good feedback
  Users are reporting great improvement with respect to release 1, in particular concerning reliability
  Also quite happy about the level of support (e.g. bug fixes) that we are able to provide


Ratio of successful jobs of retrieved jobs (4963 of 5000 = 99.26%)

[Bar chart: percentage per site; x-axis: Sites (CNAF, Taiwan, Hungary, FNAL, adc0018, adc0015, Germany); y-axis: Percent, 0 to 100; bar labels: 99.9, 0.4, 91.0, 99.3, 95.4, 98.7, 100.0]

Geographical Job Distribution

CNAF; 838; 16%

Taiwan; 224; 5%

Hungary; 564; 11%

FNAL; 838; 17%

adc0018; 819; 17%

adc0015; 849; 17%

Germany; 831; 17%

LCG 1.0 Test (19./20. Sept. 2003):
  5 streams
  5000 jobs in total
  Input and OutputSandbox
  Brokerinfo query
  30 sec sleep

Slide presented at the latest EDG workshop by Markus Schulz (LCG)
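The figures quoted on this slide can be cross-checked with a few lines of Python. The sketch only re-does the arithmetic on the numbers given above; the variable names are arbitrary.

```python
# Per-site job counts as listed under "Geographical Job Distribution".
site_jobs = {
    "CNAF": 838, "Taiwan": 224, "Hungary": 564, "FNAL": 838,
    "adc0018": 819, "adc0015": 849, "Germany": 831,
}

total = sum(site_jobs.values())        # per-site counts added up
ratio = 100 * total / 5000             # as a percentage of the 5000 jobs of the test

print(total)                           # 4963
print(round(ratio, 2))                 # 99.26
```

The per-site counts add up to the 4963 jobs quoted in the chart title, which is 99.26% of the 5000 jobs of the test.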


Future activities

Support and bug fixes

Working on new functionalities

Dependencies of jobs
  Integration of Condor DAGMan
  "Lazy" scheduling: a job (node) is bound to a resource (by the RB) just before it can be submitted (i.e. when it is free of dependencies)

Support for job partitioning
  Use of job checkpointing and DAGMan mechanisms
  Original job partitioned into sub-jobs which can be executed in parallel
  At the end each sub-job must save a final state, which is then retrieved by a job aggregator, responsible for collecting the results of the sub-jobs and producing the overall output

Grid Accounting
  Based upon a computational economy model
  Users pay in order to execute their jobs on the resources, and the owners of the resources earn credits by executing the user jobs
  To take account of resource usage
  And to make possible a nearly stable equilibrium able to satisfy the needs of both resource "producers" and "consumers"

Getting ready for EGEE …
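The "lazy" scheduling of dependent jobs mentioned among the future activities can be illustrated with a toy Python sketch. It does not use DAGMan: `lazy_schedule` and the `choose_resource` callback are hypothetical names, nodes are assumed to complete as soon as they are submitted, and the callback merely alternates between two made-up CEs to stand in for a matchmaking call made at submission time.

```python
# Illustrative sketch only: not DAGMan, and not the planned WP1 design.

def lazy_schedule(dependencies, choose_resource):
    """Return (node, resource) pairs in submission order.

    dependencies maps each node to the set of nodes it depends on.
    A node is bound to a resource only when it is free of dependencies,
    i.e. just before it can be submitted.
    """
    remaining = {node: set(deps) for node, deps in dependencies.items()}
    order = []
    while remaining:
        ready = sorted(n for n, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cyclic dependencies")
        for node in ready:
            # Matchmaking happens only now, against the current Grid status.
            order.append((node, choose_resource(node, len(order))))
            del remaining[node]
        # Simplification: submitted nodes are treated as completed at once.
        for deps in remaining.values():
            deps.difference_update(ready)
    return order

# D depends on B and C, which both depend on A.
dag = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

def choose_resource(node, step):
    # Hypothetical matchmaker: alternates between two CEs as "status" changes.
    return ["ce1", "ce2"][step % 2]

plan = lazy_schedule(dag, choose_resource)
print(plan)  # [('A', 'ce1'), ('B', 'ce2'), ('C', 'ce1'), ('D', 'ce2')]
```

The point of the late binding is that the resource for D is picked after B and C are out of the way, so the choice can reflect the state of the Grid at that moment rather than at the time the whole DAG was submitted.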


Conclusions

Revised WMS architecture
  To address emerged shortcomings
  To support new functionalities

Deployed in various testbeds (in particular by LCG, our main customer)

Very good feedback

Working on some new functionalities, to be shown at the last EDG review and possibly to be exploited by LCG (e.g. DAGMan)

More info http://www.infn.it/workload-grid

Documents
