
Page 1: Batch System Operation & Interaction with the Grid, LCG/EGEE Operations Workshop, May 25th 2005, Tony.Cass@cern.ch

Batch System Operation & Interaction with the Grid

LCG/EGEE Operations Workshop

May 25th 2005

Tony.Cass@cern.ch

Page 2

2 Tony.Cass@cern.ch

Why a Batch Workshop at HEPiX?
Proposed after the last Operations Workshop. Remember the complaints then?
– “ETT doesn’t work”
– “ETT is meaningless when fairsharing is in place”
– “The solution of a queue per VO, while easy to implement now, is not a good or long-term solution.”
– “The [ETT] algorithm was questioned and other proposals were given.”
The idea was to bring together site managers and grid & local scheduler developers.

Page 3


Workshop Aims
Understand how different batch scheduling systems are used at HEP sites
– Are there any commonalities?
How do sites see the Grid interface? How would sites like to see the Grid interface? What is the impact of the current interface?
How do developers of local and Grid-level schedulers see the future?
How/can HEP site managers influence future developments?
Well attended (70-80)
– Definite interest in this area from site managers
See http://www.fzk.de/hepix

Page 4


Agenda
Local Scheduler usage
– SLAC, RAL, LeSC, JLab, IN2P3, FNAL, DESY, CERN, BNL
– LSF, PBS, Torque/Maui, SGE (N1GE6), BQS, Condor
Impact of Grid on sites
– Jeff Templon overview (c.f. previous talk), BQS@IN2P3
Local scheduler view
– LSF, PBS, LoadLeveler, Condor, BQS
Grid Developments
– EGEE/BLAHP, GLUE
Common batch environment
– See earlier.

Page 5


Site Presentations --- I
Site reports covered:
– Brief overview of the available computing resources, showing the (in)homogeneity of resources
– Queue configuration---what and why
– How do users select queues---by cpu time alone, or by specifying other resources (e.g. memory, local disk space availability)?
– Need for, and use of, “special” queues---for “production managers”, sudden high-priority work, other reasons.
» Question from LHCC referee: “If there is some urgent analysis, how can [gLite] send this to a special queue?”
– Level of resource utilisation

Page 6


Site Presentations --- II
Overall, configurations and concerns were broadly equivalent across sites.
Concerns centred on:
– Scheduling
– Security
– Interface Scalability
These issues are covered in the next few slides.

Page 7

Scheduling Issues

Page 8


Local Load Scheduling: summary
Batch schedulers at local sites enable fine-grained control over heterogeneous systems; they are used to enforce local policies on resource allocation and to provide an “SLA” for users (turnround time).
– Large sites subdivide user groups
Scheduling is by CPU time; some sites also need jobs to request:
– minimum CPU capacity of the server
– memory requirement
– available disk work space (/pool, /scratch, /tmp)
Sites want the Grid interface to use the existing queue(s)
– NOT to create a queue per VO.
– EMPHATICALLY NOT to replicate the queue structure per VO.
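As an illustrative sketch of the kind of resource requests listed above (the queue name, limits, and payload are all invented, and exact resource names vary by site and scheduler), a PBS/Torque-style submission might look like:

```shell
#!/bin/sh
# Hypothetical submit script: queue name and limits are invented.
#PBS -q prod                 # an existing site queue, not a per-VO queue
#PBS -l cput=12:00:00        # CPU time request (the primary scheduling key)
#PBS -l mem=512mb            # memory requirement
# Disk/work-space requests use site-specific resource names (/pool, /scratch).
cd "$TMPDIR"                 # job-local scratch area provided by the batch system
./run_analysis               # hypothetical payload
```

The point of the slide is that the Grid interface should feed requests like these into the queues a site already runs, rather than forcing a parallel per-VO queue structure.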

Page 9


Grid/Local interface problems
Jeff’s presentation! In short:
– Not enough information is passed from the site to the Grid
– No information is passed from the Grid to the site
Result:
– Queues build up at some sites whilst others sit empty
– Confused/frustrated site managers
– Inefficient behaviour as people work the system
» “Tragedy of the commons”

Page 10


Should sites (be able to) enforce policies?
Sites are funded for particular tasks and need to show funding agencies and users that they are fulfilling their mission.
This is a Grid: why does it matter if you are running jobs for X and not Y? Y may be happily running jobs at another site.
My view:
– Sites need to understand and feel comfortable with the way they accept jobs from the Grid.
– If they are comfortable, account may be taken of global activity when setting local priorities.
– Let’s walk before we try to run…

Page 11


Can/Should we fix this?
… or should we wait to see some general standard emerge?
Strong support from commercial people (especially Platform and Sun) for HEP to work out solutions to this problem.
– They are interested in what we do.
Standards bodies (GGF, …) won’t come up with any common solution soon.
– But this doesn’t mean HEP shouldn’t participate:
» Raise the profile of problems of interest to us
» Give practical input based on real-world experience.

Page 12


How to fix?
Improve the information available to the Grid scheduler
– VO information added in the GLUE schema (v1.2)
» Need a volunteer per batch system to maintain the dynamic plug-ins and the job manager. CERN will do this for LSF. Need other volunteers!
– But still an assumption of homogeneous resources at a site.
– There is a plan to start work on GLUE v2 in November
» No requirement for backwards compatibility.
» Discussion should start NOW!
» But we need to assess the impact of the v1.2 changes before rushing into anything.
The Grid scheduler should pass job resource requirements to the local resource manager.
– Not yet. When? How?
– Needs normalisation… Does this need to be per VO?
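For illustration, the per-VO publication that GLUE v1.2 enables looks roughly like the LDIF fragment below (the CE name, VO, and numbers are invented, and attribute details should be checked against the actual v1.2 schema):

```ldif
# Hypothetical GlueVOView entry: one VO's view of a shared queue on a CE
dn: GlueVOViewLocalID=cms,GlueCEUniqueID=ce.example.org:2119/jobmanager-lcglsf-prod,mds-vo-name=local,o=grid
objectClass: GlueVOView
GlueVOViewLocalID: cms
GlueCEAccessControlBaseRule: VO:cms
GlueCEStateRunningJobs: 120
GlueCEStateWaitingJobs: 14
GlueCEStateEstimatedResponseTime: 600
```

The dynamic plug-in per batch system is what keeps the per-VO state values here fresh.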

Page 13

Security

Page 14


Security Issues
Sites are still VERY concerned about the traceability of users.
Mechanisms seem to be in place to allow this, but sites have little practical experience.
– c.f. the delays for CERN to block a user systematically crashing worker nodes.
– The security group has doubts that sites are fulfilling their obligations in terms of log retention.
– “Security Challenges” have been mooted; these may help increase confidence…
Whatever happens, it does NOT seem to be a good idea to have a portal handling user job requests and passing these on with a common certificate…

Page 15

Interface Scalability

Page 16


Interface Scalability
IN2P3 example: “GridJobManager asks for job status once per minute (even for 15-hour jobs).
– 5000 queued jobs + 1000 running jobs = 100 queries/s”
Being solved by the EGEE BLAHP
– Caches the query response
But…
– this is a further example of the need for discussion between sites & developers (IN2P3 is fixing this issue independently)
– are there other similar issues out there?
» c.f. LSF targets:
Scalability: 5K hosts, 500K active jobs, 100 concurrent users, 1M completed jobs per day
Performance: >90% slot utilisation, 5s max command response time, 4kB memory/job, master failover in <5 mins
» What are the targets for the CE? The RB?
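The caching fix described above can be sketched as a toy model (not the BLAHP implementation; `query_scheduler` stands in for whatever bulk status command the local batch system offers):

```python
import time

class StatusCache:
    """Collapse per-job status polling (e.g. one query per job per
    minute) into at most one bulk scheduler query per refresh interval."""

    def __init__(self, query_scheduler, ttl=60.0):
        self.query_scheduler = query_scheduler  # callable -> {job_id: state}
        self.ttl = ttl                          # seconds between real queries
        self._cache = {}
        self._stamp = float("-inf")

    def status(self, job_id):
        now = time.monotonic()
        if now - self._stamp > self.ttl:
            self._cache = self.query_scheduler()  # one query serves all jobs
            self._stamp = now
        return self._cache.get(job_id, "UNKNOWN")
```

With 6000 jobs each polled once a minute, the scheduler then sees roughly one bulk query per minute instead of 100 queries per second.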

Page 17

Some other Topics

Page 18


End-to-End Guarantees
The Condor talk raised many interesting points. One in particular was the (in)ability of the overall system to offer end-to-end execution guarantees to the users.
Condor “glide-in”: a pilot job submitted via the Grid which takes a job from a Condor queue.
Fair enough [modulo security…] for system managers, PROVIDED the pilot job expresses the same resource requests as it advertises in a class-ad when it starts.
– It shouldn’t claim to be of the maximum possible length and then run a short job.
– Class ads and the GLUE schema are not so different: both are ways of saying what a node/site can do, in a way that can be used to express (and then match) requirements.
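The symmetric express-and-match idea can be shown in miniature with a toy matchmaker (not Condor’s implementation; every attribute name and value below is invented):

```python
# Each side publishes attributes plus a Requirements predicate over the
# other side's attributes; a match needs both predicates to hold.
def matches(job, machine):
    return job["Requirements"](machine) and machine["Requirements"](job)

job = {
    "CpuMinutes": 600,   # what the job asks for up front...
    "MemoryMB": 512,
    "Requirements": lambda m: m["OpSys"] == "LINUX" and m["MemoryMB"] >= 512,
}
machine = {
    "OpSys": "LINUX",    # ...and what the node advertises it can provide
    "MemoryMB": 2048,
    "MaxCpuMinutes": 1440,
    "Requirements": lambda j: j["CpuMinutes"] <= 1440,
}
```

The slide’s proviso in these terms: a pilot should publish the same CpuMinutes etc. at Grid submission as in the class-ad it advertises once it starts, otherwise the two matching steps are matching different claims.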

Page 19


Pre-emption & Virtualisation
Strong message from batch system developers that pre-emption is A GOOD THING. With pre-emption, schedulers can maximise throughput/resource usage by
– suspending many jobs to allow a parallel job to run
– suspending long-running jobs to provide quick turnround for priority jobs.
Interest in virtualisation as a method to ease this
– Also discussed at the last operations workshop as a way to ease the handling of multiple (conflicting) requirements for OS versions.
– Something to watch.
How would (pre-empted) users like this?
– No guarantee of time to completion once a job starts…

Page 20


Push vs Pull
A false dichotomy
– Sites can manipulate the pull model to create a local queue
The real issue is early vs. late allocation of a task to a resource
– Early: site resource utilisation is maximised: a free cpu can be filled immediately with a job from the local queue
– Late: a user doesn’t see their job sent to site A just before a cpu becomes free at site B.
Questions:
– Long term, will most cpu resources be full?
– What do people want to maximise? Throughput, or something else?
» Efficient scheduling is important anyway… transparency of the grid/local interface will be key.
– Pre-emption, anyone?
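The early/late trade-off reduces to a toy calculation (all numbers invented): with early binding the job’s wait depends on the site chosen at submission, with late binding it is the minimum over sites.

```python
# Minutes until each site next frees a CPU (invented numbers).
free_in = {"siteA": 30, "siteB": 5}

def wait_early(chosen_site):
    """Early allocation: the job is committed to one site's local queue
    at submission time and waits for that site."""
    return free_in[chosen_site]

def wait_late():
    """Late allocation: the job stays in a shared queue and is pulled by
    whichever site frees a CPU first."""
    return min(free_in.values())
```

Early binding keeps siteA’s queue primed (utilisation); late binding spares the user the 30-minute wait when siteB frees up after 5.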

Page 21

Conclusion

Page 22

Summary

Page 23


Workshop Summary
Useful workshop. [IMHO…]
Good that there has been progress since the November workshop at CERN (GLUE schema update), but much is still to be done.

Page 24

The Service is the Challenge

Page 25


Workshop Summary
Useful workshop. Good that there has been progress since the November workshop at CERN (GLUE schema update), but much is still to be done.
[Still] need to increase the dialogue between site managers and Grid [scheduler] developers
– Site managers know a lot about running services.
– Unfortunate that a meeting change created a clash and reduced the scope for EGEE developers to participate in the Karlsruhe discussions.
– A smaller session is pencilled in for HEPiX at SLAC, October 10th-14th. More dialogue then?
Not too early to start thinking about GLUE v2!
