M. Sgaravatto – n° 1
Overview of release 2 of the EDG WP1 Workload Management System deployed in the INFN production Grid
Massimo Sgaravatto INFN Padova - DataGrid WP1
http://presentation.address
M. Sgaravatto – n° 2
WP1 Workload Management System
Working Workload Management System prototype implemented by WP1 in the first phase of the EDG project
Software released in September 2001
One of the very few components in the first release of EDG software
Application users have been experimenting for some time with this first release of the workload management system
  Stress tests and semi-production activities (e.g. CMS stress tests, Atlas efforts)
The experiments exploited significant achievements, but various problems were also spotted
  Affecting in particular the reliability and scalability of the system
M. Sgaravatto – n° 3
Review of WP1 WMS architecture
WP1 WMS architecture reviewed
  To apply the “lessons” learned and address the shortcomings that emerged with the first release of the software, in particular
    To increase reliability
    To address the scalability problems
  To support new functionalities
  To favor interoperability with other Grid frameworks, by allowing WP1 modules (e.g. RB) to be used also “outside” the EDG WMS
M. Sgaravatto – n° 4
WMS Revised Architecture
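[Diagram: revised WMS architecture (WP1). Components shown: UI, Network Server, Workload Manager, Match-Maker/Broker, Job Adapter, Job Controller (CondorG), Log Monitor, Logging & Bookkeeping, RB storage, RLS, and Information Service (CE and SE characteristics and status); the WMS services are grouped in the “RB node”.]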
M. Sgaravatto – n° 5
Job submission example
[Diagram: components involved in a job submission. UI; RB node (Network Server, Workload Manager, Job Controller/CondorG); RLS; Information Service (CE and SE characteristics and status); Computing Element; Storage Element.]
Job submission
UI: allows users to access the functionalities of the WMS
Job Description Language (JDL) to specify job characteristics and requirements

edg-job-submit myjob.jdl

myjob.jdl:
  JobType = "Normal";
  Executable = "$(CMS)/exe/sum.exe";
  InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
  OutputSandbox = {"sim.err", "test.out", "sim.log"};
  Requirements = other.GlueHostOperatingSystemName == "linux" &&
                 other.GlueHostOperatingSystemRelease == "Red Hat 6.2" &&
                 other.GlueCEPolicyMaxWallClockTime > 10000;
  Rank = other.GlueCEStateFreeCPUs;

Job Status: submitted
Job submission
NS: network daemon responsible for accepting incoming requests
[Diagram: the Job and its Input Sandbox files are transferred from the UI to the RB node (RB storage)]
Job Status: submitted, waiting
Job submission
WM: responsible for taking the appropriate actions to satisfy the request
[Diagram: the Job reaches the Workload Manager]
Job Status: submitted, waiting
Job submission
Match-Maker/Broker: where must this job be executed?
Job Status: submitted, waiting
Job submission
Match-Maker/Broker
Matchmaker: responsible for finding the “best” CE to which a job should be submitted
Job Status: submitted, waiting
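The choice of the “best” CE can be illustrated with the Requirements and Rank expressions of the JDL example shown earlier. The following is only a minimal sketch, not the WP1 implementation: the struct, its field names and the function names are assumptions made for illustration, and the Requirements expression is hard-coded instead of being evaluated from the JDL.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified view of what a CE publishes. The fields mirror
// the Glue attributes used in the JDL example; the struct is an assumption.
struct ComputingElement {
    std::string id;
    std::string osName;       // GlueHostOperatingSystemName
    std::string osRelease;    // GlueHostOperatingSystemRelease
    int maxWallClockTime;     // GlueCEPolicyMaxWallClockTime
    int freeCpus;             // GlueCEStateFreeCPUs
};

// Requirements of the JDL example, written as a fixed predicate.
bool satisfiesRequirements(const ComputingElement& ce) {
    return ce.osName == "linux" && ce.osRelease == "Red Hat 6.2" &&
           ce.maxWallClockTime > 10000;
}

// Rank of the JDL example: a CE with more free CPUs ranks higher.
int rankOf(const ComputingElement& ce) { return ce.freeCpus; }

// Returns the id of the highest-ranked CE among those that satisfy the
// requirements, or an empty string when no CE matches.
std::string chooseCE(const std::vector<ComputingElement>& ces) {
    const ComputingElement* best = nullptr;
    for (const auto& ce : ces) {
        if (!satisfiesRequirements(ce)) continue;
        if (best == nullptr || rankOf(ce) > rankOf(*best)) best = &ce;
    }
    return best ? best->id : std::string();
}
```

Calling chooseCE on a list of candidate CEs returns the matching CE with the most free CPUs; CEs that fail any requirement are never considered, however high their rank.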
Job submission
Match-Maker/Broker needs to know:
  Where (on which SEs) are the needed data?
  What is the status of the Grid?
Job Status: submitted, waiting
Job submission
Match-Maker/Broker: CE choice
Job Status: submitted, waiting
Job submission
JobAdapter
JA: responsible for the final “touches” to the job before submission is performed (e.g. creation of the wrapper script, etc.)
Job Status: submitted, waiting
Job submission
JC: responsible for the actual job management operations (done via CondorG)
[Diagram: the Job reaches the Job Controller]
Job Status: submitted, waiting, ready
Job submission
[Diagram: the Job and its Input Sandbox files move from the RB node towards the Computing Element]
Job Status: submitted, waiting, ready, scheduled
Job submission
[Diagram: the Job runs on the Computing Element with its Input Sandbox and performs “Grid enabled” data transfers/accesses]
Job Status: submitted, waiting, ready, scheduled, running
Job submission
[Diagram: the Output Sandbox files are returned to the RB node]
Job Status: submitted, waiting, ready, scheduled, running, done
Job submission
edg-job-get-output <dg-job-id>
[Diagram: the user requests the Output Sandbox from the UI]
Job Status: submitted, waiting, ready, scheduled, running, done
Job submission
[Diagram: the Output Sandbox files are delivered to the UI]
Job Status: submitted, waiting, ready, scheduled, running, done, cleared
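The sequence of job states built up over the job submission slides can be summarised as a simple state progression. The sketch below is illustrative only: the enum and function names are assumptions, and it covers just the successful path shown above, with no failure or cancellation states.

```cpp
#include <string>

// Job states in the order in which they appear in the walkthrough above.
enum class JobStatus { Submitted, Waiting, Ready, Scheduled, Running, Done, Cleared };

// Next state on the successful path; Cleared is terminal in this sketch.
JobStatus nextStatus(JobStatus s) {
    switch (s) {
        case JobStatus::Submitted: return JobStatus::Waiting;
        case JobStatus::Waiting:   return JobStatus::Ready;
        case JobStatus::Ready:     return JobStatus::Scheduled;
        case JobStatus::Scheduled: return JobStatus::Running;
        case JobStatus::Running:   return JobStatus::Done;
        case JobStatus::Done:      return JobStatus::Cleared;
        case JobStatus::Cleared:   return JobStatus::Cleared;
    }
    return s;
}

// Lower-case label as written on the slides.
std::string toString(JobStatus s) {
    switch (s) {
        case JobStatus::Submitted: return "submitted";
        case JobStatus::Waiting:   return "waiting";
        case JobStatus::Ready:     return "ready";
        case JobStatus::Scheduled: return "scheduled";
        case JobStatus::Running:   return "running";
        case JobStatus::Done:      return "done";
        case JobStatus::Cleared:   return "cleared";
    }
    return "unknown";
}
```

Starting from JobStatus::Submitted and applying nextStatus repeatedly walks through the same labels as the Job Status column of the slides.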
M. Sgaravatto – n° 20
Job monitoring
LM: parses the CondorG log file (where CondorG logs information about jobs) and notifies the LB
LB: receives and stores job events; processes the corresponding job status
From the UI:
  edg-job-status <dg-job-id> (returns the job status)
  edg-job-get-logging-info <dg-job-id> (returns the log of job events)
[Diagram: UI, Logging & Bookkeeping, Computing Element, and the RB node with Network Server, Workload Manager, Job Controller (CondorG) and Log Monitor]
M. Sgaravatto – n° 21
Interoperability with other services
Information services
  Queried by the RB to see the status of the Grid (characteristics and status of CEs and SEs)
  WMS able to work with:
    Globus MDS based information services (as in LCG and INFN-GRID testbeds)
    RGMA-GOUT (as in EDG testbed)
  Evaluating direct (i.e. not via GOUT) interoperability with RGMA
Replica Location Service
  Queried by the RB to see where the specified data are physically available (in which SEs)
  WMS able to work with EDG-WP2 RLS
  Ongoing plans to work also with US RLS
VOMS
  Used for VO based security
M. Sgaravatto – n° 22
Deployment of WMS services
[Diagram: deployment of WMS services]
Submitting machine (UI)
  Possibility to submit to more than one RB from a single UI
“RB node”: NS, WM, JC, LM, PR
  “Community” (VO) RB or “Personal” RB
LB server
II/GOUT
RLS
CE: queue of a LRMS (LSF, PBS)
SE
MyProxy server
  Used for proxy renewal
VOMS
  Used for VO based security
Cardinality notes shown in the diagram (the component each one is attached to is not recoverable from the extracted text):
  One or more for each RB
  Usually deployed in the “RB node” (but not required)
  One for each RB
  Usually one per VO
M. Sgaravatto – n° 23
WMS release 2: new functionalities
User APIs
GUI
Support for interactive jobs
Job checkpointing
Support for parallel jobs
Gangmatching
Support for automatic output data upload and registration
VOMS support
…
M. Sgaravatto – n° 24
GUI & APIs
M. Sgaravatto – n° 25
Interactive jobs
Specified by setting JobType = "Interactive" in the JDL
When an interactive job is executed, a window for the stdin, stdout and stderr streams is opened
  Possibility to send the stdin to the job
  Possibility to have the stderr and stdout of the job while it is running
Possibility to start a window for the standard streams of a previously submitted interactive job with the command edg-job-attach
M. Sgaravatto – n° 26
Job checkpointing
Checkpointing: saving the job state from time to time
  Useful to prevent data loss due to unexpected failures
Approach: provide users with a “trivial” logical job checkpointing service
  User can save from time to time the state of the job (defined by the application)
  A job can be restarted from an intermediate (i.e. previously saved) job state
Different from “classical” checkpointing (i.e. saving all the information related to a process: process’s data and stack segments, open files, etc.)
  Very difficult to apply (e.g. problems saving the state of open network connections)
  Not necessary for many applications
To submit a checkpointable job
  Code must be instrumented (see next slides)
  JobType = "Checkpointable" to be specified in the JDL
M. Sgaravatto – n° 27
Job checkpointing example
Example of application (e.g. HEP Monte Carlo simulation):

int main () {
    …
    for (int i = event; i < EVMAX; i++) {
        <process event i>;
    }
    ...
    exit(0);
}
M. Sgaravatto – n° 28
Job checkpointing example
User code must be easily instrumented in order to exploit the checkpointing framework …

#include "checkpointing.h"

int main () {
    JobState state(JobState::job);
    event = state.getIntValue("first_event");
    PFN_of_file_on_SE = state.getStringValue("filename");
    ….
    var_n = state.getBoolValue("var_n");
    <copy file_on_SE locally>;
    …
    for (int i = event; i < EVMAX; i++) {
        <process event i>;
        ...
        state.saveValue("first_event", i+1);
        <save intermediate file on a SE>;
        state.saveValue("filename", PFN_of_file_on_SE);
        ...
        state.saveValue("var_n", value_n);
        state.saveState();
    }
    …
    exit(0);
}
M. Sgaravatto – n° 29
Job checkpointing example (same code, annotated)
• User defines what a state is
• Defined as <var, value> pairs
• Must be “enough” to restart a computation from a previously saved state
M. Sgaravatto – n° 30
Job checkpointing example (same code, annotated)
User can save from time to time the state of the job
M. Sgaravatto – n° 31
Job checkpointing example (same code, annotated)
Retrieval of the last saved state
The job can restart from that point
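To make the save/restart cycle concrete, here is a small self-contained sketch. It does not use the real checkpointing.h: SketchJobState is a hypothetical stand-in that keeps <var, value> pairs in a map, the "persistent" copy is simply a map owned by the caller, and the fallback argument of getIntValue and the failAfter parameter are assumptions added so that the example can run and simulate a failure.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the JobState class of the slides: it holds
// <var, value> pairs and commits them to a store on saveState().
class SketchJobState {
public:
    explicit SketchJobState(std::map<std::string, int>& store)
        : store_(store), values_(store) {}

    // Value saved under name, or fallback if the job never saved it.
    int getIntValue(const std::string& name, int fallback) const {
        auto it = values_.find(name);
        return it == values_.end() ? fallback : it->second;
    }
    void saveValue(const std::string& name, int value) { values_[name] = value; }
    // Commit the current pairs so that a restarted job can read them.
    void saveState() { store_ = values_; }

private:
    std::map<std::string, int>& store_;  // survives a job restart
    std::map<std::string, int> values_;  // working copy of this run
};

// Processes events from the saved first_event up to evMax, checkpointing
// after each event. Stops after failAfter events to simulate a failure
// (a negative failAfter never fails). Returns the events processed.
int runJob(std::map<std::string, int>& store, int evMax, int failAfter) {
    SketchJobState state(store);
    int processed = 0;
    for (int i = state.getIntValue("first_event", 0); i < evMax; ++i) {
        if (processed == failAfter) return processed;  // simulated failure
        ++processed;                                   // "process event i"
        state.saveValue("first_event", i + 1);
        state.saveState();
    }
    return processed;
}
```

Running runJob twice on the same store, first with a simulated failure and then without, shows the second run resuming from the saved first_event instead of starting from event 0.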
M. Sgaravatto – n° 32
Job checkpointing scenarios
Scenario 1
  Job submitted to a CE
  When the job runs, it saves its state from time to time
  Job failure, due to a Grid problem (e.g. CE problem)
  Job resubmitted by the WMS, possibly to a different CE
  Job restarts its computation from the last saved state
    No need to restart from the beginning
    The computation done until that moment is not lost
Scenario 2
  Job failure, but not detected by the Grid middleware
  User can retrieve a saved state for the job (typically the last one)
    edg-job-get-chkpt -o <state> <edg-jobid>
  User resubmits the job, specifying that the job must start from a specific (the retrieved one) initial state
    edg-job-submit -chkpt <state> <JDL file>
M. Sgaravatto – n° 33
Submission of parallel jobs
Possibility to submit MPI jobs
MPICH implementation supported
Only parallel jobs inside a single CE can be submitted
Submission of parallel jobs is very similar to that of normal jobs
  It is only necessary to specify in the JDL:
    JobType = "MPICH";
    NodeNumber = n;
  where n is the number of requested CPUs
Matchmaking
  The CE chosen by the RB must have the MPICH software installed and at least n total CPUs
  If two or more CEs satisfy all the requirements, the one with the highest number of free CPUs is chosen
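The matchmaking rule for MPICH jobs uses two different CPU counts: eligibility depends on the total CPUs of a CE, while the tie-break depends on the free CPUs. A minimal sketch of that rule follows; the struct and function names are assumptions for illustration, not the RB code.

```cpp
#include <string>
#include <vector>

// Hypothetical summary of a CE for parallel-job matchmaking.
struct ParallelCE {
    std::string id;
    bool hasMpich;   // MPICH software installed on the CE
    int totalCpus;   // all CPUs of the CE, busy or not
    int freeCpus;    // CPUs currently free
};

// A CE is eligible if it has MPICH and at least nodeNumber total CPUs.
// Among the eligible CEs, the one with the most free CPUs is returned.
// Returns an empty string when no CE is eligible.
std::string chooseParallelCE(const std::vector<ParallelCE>& ces, int nodeNumber) {
    const ParallelCE* best = nullptr;
    for (const auto& ce : ces) {
        if (!ce.hasMpich || ce.totalCpus < nodeNumber) continue;
        if (best == nullptr || ce.freeCpus > best->freeCpus) best = &ce;
    }
    return best ? best->id : std::string();
}
```

Note that under this rule a CE with fewer than n free CPUs can still be selected, as long as its total is at least n; the job then presumably waits in the CE queue until enough CPUs are available.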
M. Sgaravatto – n° 34
Gangmatching
With “standard” matchmaking only two entities are involved: the job and the CE
Gangmatching allows SE information, besides CE information, to be taken into account in the matchmaking
Typical use case for gangmatching:
  My job has to run on a CE close to an SE with at least 200 MB of available space:
    Requirements = anyMatch(other.storage.CloseSEs, target.GlueSAStateAvailableSpace > 200);
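The anyMatch expression in the use case above can be read as "keep a CE if at least one of its close SEs satisfies the condition". The sketch below shows that reading with plain data structures; the struct and function names are assumptions, and the available space is compared as a bare number, as in the JDL expression.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical view of an SE and of a CE with its list of close SEs.
struct StorageElementInfo {
    std::string id;
    long availableSpace;  // stands for GlueSAStateAvailableSpace
};
struct CEWithCloseSEs {
    std::string id;
    std::vector<StorageElementInfo> closeSEs;
};

// True if at least one close SE of the CE has more than minSpace available.
bool anyCloseSEHasSpace(const CEWithCloseSEs& ce, long minSpace) {
    return std::any_of(ce.closeSEs.begin(), ce.closeSEs.end(),
                       [minSpace](const StorageElementInfo& se) {
                           return se.availableSpace > minSpace;
                       });
}

// Ids of the CEs that pass the gangmatching requirement, in input order.
std::vector<std::string> gangmatch(const std::vector<CEWithCloseSEs>& ces,
                                   long minSpace) {
    std::vector<std::string> matching;
    for (const auto& ce : ces) {
        if (anyCloseSEHasSpace(ce, minSpace)) matching.push_back(ce.id);
    }
    return matching;
}
```

A CE with no close SE at all never matches, since there is no SE for which the condition can hold.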
M. Sgaravatto – n° 35
Other new functionalities
VOMS support
  VO taken from the VOMS user proxy
  Matchmaking performed with respect to the VO
    No longer necessary to publish the list of authorized users in the information service (only the list of authorized VOs is needed)
  In any case the WMS also works with non-VOMS proxies
Compliance with the Glue Schema
  Common Information Service schema between US and EU HENP Grid projects
LB ACLs
  Allow setting who can query the status of a given job
Output data upload and registration
  Possibility to trigger (via JDL) output data upload into an SE and registration in the Replica Location Service (RLS)
M. Sgaravatto – n° 36
Output data registration
OutputData = {
[
OutputFile = "filename1";
LogicalFileName = "lfn:mylfn1";
StorageElement = "testbed007.cnaf.infn.it"
],
[
OutputFile = "filename2"
],
[
OutputFile = "filename3";
LogicalFileName = "lfn:mylfn2"
],
[
OutputFile = "filename4";
StorageElement = "testbed007.cnaf.infn.it"
]
}
First entry: both LFN and target SE specified
Second entry: neither LFN nor target SE specified
Third entry: only LFN specified
Fourth entry: only target SE specified
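The four OutputData entries differ only in which optional attributes are present. The following sketch classifies an entry into one of the four cases; the struct, the enum and the convention that an omitted attribute is an empty string are assumptions made for illustration. What the WMS does when the LFN or the SE is omitted is not stated on the slide and is not modelled here.

```cpp
#include <string>

// Hypothetical in-memory form of one OutputData entry.
struct OutputDataEntry {
    std::string outputFile;
    std::string logicalFileName;  // empty when omitted in the JDL
    std::string storageElement;   // empty when omitted in the JDL
};

enum class OutputDataCase { BothSpecified, OnlyLfn, OnlySe, NeitherSpecified };

// Maps an entry to one of the four cases listed above.
OutputDataCase classify(const OutputDataEntry& e) {
    const bool hasLfn = !e.logicalFileName.empty();
    const bool hasSe = !e.storageElement.empty();
    if (hasLfn && hasSe) return OutputDataCase::BothSpecified;
    if (hasLfn) return OutputDataCase::OnlyLfn;
    if (hasSe) return OutputDataCase::OnlySe;
    return OutputDataCase::NeitherSpecified;
}
```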
M. Sgaravatto – n° 37
Status
WMS release 2 is being used and evaluated by applications
  Deployed in LCG testbed
  Deployed in EDG testbeds
  Being deployed, customized and exploited in CrossGrid testbed
  Deployed in INFN-GRID testbeds
    INFN-GRID development testbed used to test new features
So far very good feedback
  Users are reporting great improvement with respect to release 1, in particular as far as reliability is concerned
  Also quite happy about the level of support (e.g. bug fixes) that we are able to provide
M. Sgaravatto – n° 38
LCG 1.0 test (19-20 Sept. 2003):
  5 streams
  5000 jobs in total
  Input and Output Sandbox
  Brokerinfo query
  30 sec sleep
Ratio of successful jobs to retrieved jobs: 4963 of 5000 = 99.26%
[Bar chart: percentage of successful jobs per site (CNAF, Taiwan, Hungary, FNAL, adc0018, adc0015, Germany); bar labels as extracted: 99.9, 0.4, 91.0, 99.3, 95.4, 98.7, 100.0]
Geographical job distribution (site; jobs; share):
  CNAF; 838; 16%
  Taiwan; 224; 5%
  Hungary; 564; 11%
  FNAL; 838; 17%
  adc0018; 819; 17%
  adc0015; 849; 17%
  Germany; 831; 17%
Slide presented at the latest EDG workshop by Markus Schulz (LCG)
M. Sgaravatto – n° 39
Future activities
Support and bug fixes
Working on new functionalities
  Dependencies of jobs
    Integration of Condor DAGMan
    “Lazy” scheduling: a job (node) is bound to a resource (by the RB) just before that job can be submitted (i.e. when it is free of dependencies)
  Support for job partitioning
    Use of job checkpointing and DAGMan mechanisms
      Original job partitioned into sub-jobs which can be executed in parallel
      At the end each sub-job must save a final state, which is then retrieved by a job aggregator, responsible for collecting the results of the sub-jobs and producing the overall output
  Grid Accounting
    Based upon a computational economy model
      Users pay in order to execute their jobs on the resources, and the owners of the resources earn credits by executing the user jobs
    To take account of resource usage
    And to make possible a nearly stable equilibrium able to satisfy the needs of both resource “producers” and “consumers”
Getting ready for EGEE …
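The “lazy” scheduling idea listed above under job dependencies amounts to deferring the resource choice for a node until all the nodes it depends on have finished. A minimal sketch of that selection step follows; the Dag type and the function name are assumptions, and this is not DAGMan or RB code.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory DAG: each node maps to the nodes it depends on.
using Dag = std::map<std::string, std::vector<std::string>>;

// Nodes that have not finished and whose dependencies have all finished.
// Only these would be handed to the RB for matchmaking at this point.
// The result follows the map's key order.
std::vector<std::string> readyToBind(const Dag& dag,
                                     const std::set<std::string>& finished) {
    std::vector<std::string> ready;
    for (const auto& entry : dag) {
        if (finished.count(entry.first) != 0) continue;
        bool allParentsDone = true;
        for (const auto& parent : entry.second) {
            if (finished.count(parent) == 0) {
                allParentsDone = false;
                break;
            }
        }
        if (allParentsDone) ready.push_back(entry.first);
    }
    return ready;
}
```

Binding late in this way lets the matchmaking for each node use the Grid status at the time the node can actually run, instead of a status observed when the whole DAG was submitted.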
M. Sgaravatto – n° 40
Conclusions
Revised WMS architecture
  To address the shortcomings that emerged
  To support new functionalities
Deployed in various testbeds (in particular by LCG, our main customer)
  Very good feedback
Working on some new functionalities, to be shown at the last EDG review and possibly to be exploited by LCG (e.g. DAGMan)
More info: http://www.infn.it/workload-grid