The FNAL/CMS GlideinWMS: experience at BNL

Maxim Potekhin

Panda/DDM Workshop

October 4, 2007

BNL

glideinWMS

What will be covered:

a very brief overview of a few existing Workload Management Systems

the general idea of the FNAL/CMS glideinWMS (I. Sfiligoi, FNAL)

the glideinWMS test bench at BNL

strengths and weaknesses of the glideinWMS, and what we can learn from it in the Panda system context

Workload Management Systems Overview

for details, see a talk by Igor Sfiligoi and Burt Holzman at CHEP07:

http://indico.cern.ch/contributionDisplay.py?contribId=216&sessionId=26&confId=3580

systems considered: Condor-G, ReSS, gLite WMS, glideinWMS (Igor’s effort)

what was looked at: performance, scalability, reliability

Workload Management Systems Overview

notes on Condor-G:

significantly, it is used as the underlying submission mechanism for most other WMSs

part of the Condor distribution

job submission with Condor-G:

scales up to 7k jobs in the queue

start-up speed depends on the number of Grid Managers in the configuration (from 30 with 30 managers to 60 with 100)

the same applies to job-removal speed

empirical result: with 100 managers, Condor-G has enough throughput to saturate the batch system

with CE crashes, jobs may stay in the queue forever

Workload Management Systems Overview

notes on ReSS:

Resource Selection System is a matchmaker for Condor-G, using the information harvested from CEMon on the Grid sites

submission is still done via Condor-G, with ReSS responsible for deciding where to submit

tested with 4x10k jobs queued

characteristics similar to plain Condor-G
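To illustrate the division of labour described above (the matchmaker picks the site, Condor-G does the actual submission), here is a minimal matchmaking sketch. It is not ReSS code: the site records, the field names (`free_slots`, `os`) and the `match_site` helper are all hypothetical, standing in for the information that would be harvested from CEMon.

```python
def match_site(job, site_ads):
    """Pick a site for `job`, or return None if no site qualifies.

    job      -- dict with two callables (hypothetical layout):
                "requirements": site_ad -> bool, "rank": site_ad -> number
    site_ads -- list of dicts describing sites (hypothetical fields)
    """
    # Keep only the sites that satisfy the job's requirements.
    candidates = [ad for ad in site_ads if job["requirements"](ad)]
    if not candidates:
        return None
    # Among the qualifying sites, prefer the one the job ranks highest.
    return max(candidates, key=job["rank"])["name"]


# Made-up site information, for illustration only.
SITE_ADS = [
    {"name": "site_a", "free_slots": 10, "os": "linux"},
    {"name": "site_b", "free_slots": 50, "os": "linux"},
    {"name": "site_c", "free_slots": 99, "os": "other"},
]

JOB = {
    "requirements": lambda ad: ad["os"] == "linux" and ad["free_slots"] > 0,
    "rank": lambda ad: ad["free_slots"],
}

chosen = match_site(JOB, SITE_ADS)
```

The chosen site name would then be handed to the submission layer; the sketch stops at the decision, which is the only part attributed to ReSS in these notes.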

Workload Management Systems Overview

notes on gLite WMS:

relies on BDII for information on Grid sites

has a dedicated client

uses Condor-G for submission

performance: slow, down to 5 submissions per minute; however, in collection mode, submission can be very fast, with an effective rate of 1000 jobs per minute. Caveat: occasional failures of collection submissions due to overload of the WMS itself

monitoring: no easy way to find IDs of owned jobs

glideinWMS

For a complete overview of the FNAL/CMS glideinWMS, see http://home.fnal.gov/~sfiligoi/glideinWMS/doc/manual/index.html#overview (the two following slides are borrowed from that source)

in the context of glideinWMS, a glidein is a Condor startd submitted as a Grid job

once it starts, it registers with a Condor pool available from the submission node

the users then submit their payload jobs to Condor, while being insulated from the Grid implementation details

obvious benefits of a familiar environment and the monitoring tools that come with it

glideinWMS

glideinWMS

glideinWMS

our experience with glideinWMS:

the installation script is complex, mostly works and is being improved

thanks to readily available support from the developer (I. Sfiligoi) during the installation stage, we got it up and running after a few initial misconfigurations

for testing purposes, the Front End and the Factory were colocated on the submission machine

used an instance of the Apache server on a separate node to host the payload

we successfully ran a test job on Panda Pilots, which in turn were deployed on the BNL cluster via the glideinWMS mechanism – a nice proof of principle

the monitoring tools that come with the product work rather nicely (provide detailed stats of the factory operation in graphic and XML formats)

glideinWMS

The payload hosted on the Apache server: note the sha1sum file that is used to verify the payload’s authenticity
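A minimal sketch of how a downloaded payload could be checked against such a sha1sum file. It assumes the usual `sha1sum` text layout (`<hex digest>  <file name>` per line); the function names are illustrative and are not taken from glideinWMS. Note that a checksum only proves the payload matches the list, so the list itself has to come from a trusted source.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sha1sum(text):
    """Parse sha1sum-style lines into a {file name: hex digest} dict."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, name = line.split(None, 1)
        # sha1sum marks binary-mode entries with a leading '*'.
        expected[name.lstrip("*")] = digest.lower()
    return expected


def verify_payload(path, name, sha1sum_text):
    """True only if `name` is listed and the file's digest matches."""
    expected = parse_sha1sum(sha1sum_text)
    return name in expected and expected[name] == sha1_of_file(path)
```

A pilot or glidein start-up script would refuse to run the payload when `verify_payload` returns False.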

glideinWMS

observations:

Condor-G is used for glidein submissions; we can, therefore, expect the same intrinsic limitations as with other WMSs

Q: how does glideinWMS apparently do better in certain tests?

A: preemptive submission of startds, which allows the following:

sequential execution of jobs off the same startd

effectively, advance reservation on remote sites by occupying batch slots

Q: is that compatible with current practices and policies?

A: remains to be seen, and probably not... (cf. the large number of slots that need to be “hot” for speedy submission)

Q: what happens to the unused glideins?

A: they die after a predetermined timeout, typically 20 minutes

empirically, the latencies behave as expected, i.e. when a sufficient number of glideins are active, the submissions are indeed speedy; when there are not enough glideins, the submission is effectively throttled
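The throttling behaviour can be sketched with a toy model. This is not glideinWMS code: it assumes a fixed set of glideins that is never replenished, a fixed job length, and an idle timeout after which an unused glidein disappears (the 20-minute default is taken from the slide; everything else, including the `simulate` name, is an assumption).

```python
def simulate(job_arrivals, n_glideins, job_minutes, idle_timeout=20):
    """Return the start time (in minutes) of each job, or None if no
    glidein was left alive to run it.

    job_arrivals -- job arrival times in minutes, in ascending order
    n_glideins   -- number of glideins already running at time 0
    job_minutes  -- run time of every job (assumed identical)
    """
    # Each glidein is represented by the time at which it becomes free.
    free_at = [0] * n_glideins
    starts = []
    for arrival in job_arrivals:
        # A glidein that has sat idle longer than the timeout has exited.
        free_at = [t for t in free_at if arrival - t <= idle_timeout]
        if not free_at:
            starts.append(None)
            continue
        # Jobs run one after another on whichever glidein frees up first.
        i = min(range(len(free_at)), key=free_at.__getitem__)
        start = max(arrival, free_at[i])
        starts.append(start)
        free_at[i] = start + job_minutes
    return starts


# Four jobs arriving together: enough glideins vs. too few.
plenty = simulate([0, 0, 0, 0], n_glideins=4, job_minutes=10)
scarce = simulate([0, 0, 0, 0], n_glideins=2, job_minutes=10)
```

With four glideins all jobs start at once; with two, the second pair waits for the first pair to finish, which is the throttling noted above. In the real system the Factory would submit more glideins rather than leave jobs unserved.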


Q: how does glideinWMS handle inter-site, inter-process communications despite the presence of firewalls?

A: by using the GCB (Generic Connection Broker), which is reached by the two communicating nodes via an outgoing connection on either side. The GCB must reside on the public Internet.

Q: How does using the GCB affect security?

A: Most likely slightly adversely, though this remains to be seen (see the GCB site for details). Almost by definition, anything that defeats a firewall can’t possibly enhance the security of the system.

Q: are there known scalability problems with GCB?

A: yes, a GCB instance will keel over if more than approx. 600 connections are made, and will take all the associated Condor jobs with it. Work is being done to rectify that; currently, however, this is a non-trivial limitation and a potential vulnerability of the production system
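A back-of-the-envelope sketch of what the approx. 600-connection limit means for sizing. It assumes that load can be spread over several GCB instances and that each glidein holds a fixed number of connections; neither is stated in these slides, and the `headroom` safety factor is likewise an assumption.

```python
GCB_MAX_CONNECTIONS = 600  # approximate per-instance limit quoted above


def gcb_instances_needed(n_glideins, connections_per_glidein=1, headroom=1.0):
    """Estimate how many GCB instances a pool of glideins would need.

    connections_per_glidein -- assumed value, not given in the slides
    headroom -- fraction of the limit we are willing to use (0 < headroom <= 1)
    """
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    usable = int(GCB_MAX_CONNECTIONS * headroom)
    total = n_glideins * connections_per_glidein
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-total // usable)


# Example: the 4k running jobs mentioned elsewhere in this talk,
# one connection each, using the full limit.
estimate = gcb_instances_needed(4000)
```

The point of the estimate is only to show that a single GCB falls well short of the job counts discussed in this talk.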


Q: what are other scalability problems with the glideinWMS?

A: Due to the memory requirements of the current Condor implementations, the submission machine itself can be a bottleneck. However, tests were successful with up to 4k running jobs, and this problem can be further circumvented by using multiple submission machines; the number of queued jobs can be significantly higher.

Scalability issues can be addressed with multiple Front Ends and multiple Factories

Hardware matters! Dedicated machines are needed for critical functions. Redundancy?

one of the useful features of the glideinWMS is that the loss of a startd process is handled gracefully by the system, i.e. no user jobs are lost. However, this overlaps with an identical feature of the Panda Pilot submission... Do we need an extra layer of indirection?

glideinWMS

Conclusions:

we have learned a lot about the glideinWMS and, having installed it locally, demonstrated that it can be used to instantiate Panda Pilots on the Grid

we formed an opinion that the glideinWMS is, in principle, subject to the same limitations as any WMS that uses Condor-G as the underlying remote job submission protocol. However, it works around them by running payloads serially on the same CE (which is effectively sequestered by the user) and, in addition, by keeping a configurable number of jobs idling on the remote site, waiting for payload. The latter can be achieved with the existing Panda system, yet it is not likely to be aligned with site policies

we are in the process of integrating our experience with the glideinWMS into our work with the Panda Pilot

certain features of the glideinWMS are highly practical and can be used in our systems, such as using checksums to validate payloads, thus mitigating security gaps in the Panda system
