
Page 1

Care and feeding of a gatekeeper

Burt Holzman ([email protected])

Fermilab Computing Division and CMS

Page 2

Overview

[Architecture diagram: a Submitter sends jobs to “The Grid”; each Site runs a Gatekeeper / Batch Scheduler, which farms the jobs out to its Worker Nodes]

Page 3

Outline

Job lifecycle
  GRAM
  Authentication & authorization
  Job Managers and Grid Monitors
  Batch submission
  Clean-up

The biggest problem?

Some fixes

(Not covering: GIP, Gratia, RSV)

Page 4

GRAM

Globus Resource Allocation Manager
GRAM listens on your gatekeeper (quick check below) and speaks Globus RSL (Resource Specification Language)
It comes in three flavors:

  GRAM2 (a.k.a. “pre-WS GRAM”) – I’ll talk about this quite a bit
  GRAM4 (a.k.a. “WS GRAM”) – big design changes; never widely adopted; deprecated, mostly in use by TeraGrid
  GRAM5 – built on GRAM2
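
A quick sanity check (not on the original slide): pre-WS GRAM conventionally listens on port 2119, so standard tools will tell you whether the gatekeeper is accepting connections (port and tool choice are assumptions; adjust for your site):

# on the gatekeeper: is anything listening on the GRAM port?
netstat -tlnp | grep 2119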

Page 5

A note on plumbing

"No, no. Lemme think," Harry interrupted himself. "It's more like you're hired as a plumber to work in an old house full of ancient, leaky pipes laid out by some long-gone plumbers who were even weirder than you are. Most of the time you spend scratching your head and thinking: Why the !@#$ did they do that?"

"Why the !@#$ did they?" Ethan said.

Which appeared to amuse Harry to no end. "Oh, you know," he went on, laughing hoarsely, "they didn't understand whatever the !@#$ had come before them, and they just had to get something working in some ridiculous time. Hey, software is just a !@#$load of pipe fitting you do to get something the hell working. Me," he said, holding up his chewed, nail-torn hands as if for evidence, "I'm just a plumber."

Ellen Ullman, “The Bug”

Page 6

Submitting to GRAM (the user’s view)

globus-job-run – if you speak Globus RSL (and for simple tests):
  globus-job-run cmsosgce.fnal.gov /usr/bin/hostname

Condor-G for the rest of us! (submit file sketch below)
  Submit a job of type grid; the condor_schedd notices and starts the condor_gridmanager
  The condor_gridmanager translates the Condor JDL into Globus RSL, listens on a port, and submits two processes:
    Your job
    grid_monitor.sh
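
For reference, a minimal Condor-G submit file looks something like the sketch below. This is not from the original slides: the gatekeeper name is borrowed from the globus-job-run example above, and the jobmanager-condor suffix is an assumption about the CE's batch system.

# hypothetical test job: run the remote /usr/bin/hostname via GRAM2 ("gt2")
cat > hostname.sub <<'EOF'
universe            = grid
grid_resource       = gt2 cmsosgce.fnal.gov/jobmanager-condor
executable          = /usr/bin/hostname
transfer_executable = false
output              = hostname.out
error               = hostname.err
log                 = hostname.log
queue
EOF
condor_submit hostname.sub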

Page 7

The grid monitor runs on the gatekeeper itself

Page 8

What is the grid monitor anyhow?

By default, GRAM spawns a separate globus-job-manager for every job queued on a gatekeeper, each constantly polling and reporting status

In contrast, the grid monitor wakes up and does collective polling (per user id per submit host)

Grid monitors expire by default (for OSG submission hosts) after one hour and are resubmitted

Grid monitor binary is sent by Condor-G – not resident on the gatekeeper

If grid monitor is killed (or new ones cannot start after expiry of the old), gatekeeper will revert to old behavior of one globus-job-manager per job!
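
A hedged diagnostic, not from the slides (exact process names vary between releases – the monitor may appear as grid_monitor.sh or grid_manager_monitor_agent):

# many globus-job-manager processes but no grid monitor for a user suggests
# the gatekeeper has fallen back to one job manager per job
ps -eo user,args | grep '[g]lobus-job-manager' | awk '{print $1}' | sort | uniq -c | sort -rn
ps -eo user,args | grep -e '[g]rid_monitor' -e '[g]rid_manager_monitor_agent'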

Page 9

Job stage-in

Stage-in is a PULL from the gatekeeper: GridFTP from the globus-job-manager on the gatekeeper to the condor_gridmanager on the submit host

~user/.globus/job/`hostname`/X.Y/ – where X, Y are integers and form a unique job contact
  remote_io_url: remote grid manager contact URL
  x509_up: grid proxy of the submitter
  scheduler_condor_submit_script: JDL for the local batch system
  scheduler_condor_submit_stderr: error from submission to the local batch system (if any)
  stdout, stderr

~user/.globus/job/.gass_cache/{hash, hash}
  Stage-in and stage-out directories for input/output files
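
As an illustration (not from the slides – the user name and the X.Y job contact are invented, and exact file names vary with the jobmanager type), a stage-in directory on the gatekeeper looks roughly like:

ls ~uscms01/.globus/job/`hostname -f`/12345.1234567890/
# remote_io_url  scheduler_condor_submit_script  scheduler_condor_submit_stderr
# stderr  stdout  x509_up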

Page 10

Authentication & Authorization

Two modes:

/etc/grid-security/grid-mapfile
  Maps X.509 DNs to local unix UIDs (example entries below)
  edg-mkgridmap can autogenerate this by polling each supported VO

Callouts
  The gatekeeper supports PRIMA callouts to a GUMS server
  Much more flexible – can assign UIDs based on extended attributes (groups, roles, etc.)
  More complex – a separate service to maintain

Globus error 7: failed authentication
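
A sketch of grid-mapfile entries – the DNs and local accounts below are invented, but the format (one quoted X.509 DN per line, mapped to a unix account) is standard:

cat /etc/grid-security/grid-mapfile
"/DC=org/DC=doegrids/OU=People/CN=Jane Doe 123456" uscms01
"/DC=org/DC=doegrids/OU=People/CN=John Smith 654321" osgedu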

Page 11

Job Manager

Every job submit fires off a globus-job-manager:
  $VDT_LOCATION/globus/lib/perl/Globus/GRAM/JobManager/condor.pm
  Uses the Globus perl API to construct a JDL for submission to the local batch system
  Note: if you hack about in there, clean up after yourself – the gatekeeper will try to execute anything named *.pm (quick audit below)

Globus Error 47: non-zero Job Manager exit status
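
A trivial audit, not from the slides – make sure nothing but the job managers you expect lives there, since any *.pm gets picked up:

ls -l $VDT_LOCATION/globus/lib/perl/Globus/GRAM/JobManager/*.pm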

Page 12

The Big Picture

[Diagram: the submitter’s condor_gridmanager contacts GRAM on the gatekeeper and authenticates; the submit file and X509 proxy land in .globus/job/`hostname`/X.Y/ and input files in the .gass_cache; the fork jobmanager runs the grid_monitor, while the batch globus-job-manager performs the batch submit]

Page 13

Job submission and stage-out

After generating the JDL, the job manager does the submit (i.e. condor_submit)

Grid monitor polls for job completion, and stages files out via PUSH when done

Globus Error 155: stage-out failed – mostly due to files not being found for stage-out
  The job crashed and output files were never generated
  An administrator removed the job from the batch system before it generated output

Page 14

Cleaning up

Successful jobs are tidy
Failed jobs leave a mess (disk-usage check below):
  ~user/.globus/job/`hostname`/X.Y/
  ~user/.gass_cache/…
  ~user/gram_condor_log.* (GRAM logs)
  $VDT_LOCATION/globus/tmp/ (globus state files)

What I said in March 2008:

“Plans to provide clean-up tools using tmpwatch and/or find but not in production, so watch your disk space !”

Sadly nothing has changed
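
One way to keep an eye on it (not from the slides – the /home/* pattern assumes user homes live under /home, as in the passwd grep two pages down):

# biggest offenders, in KB
du -sk /home/*/.globus/job 2>/dev/null | sort -rn | head
du -sk /home/*/.gass_cache 2>/dev/null | sort -rn | head
du -sk $VDT_LOCATION/globus/tmp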

Page 15

Biggest problem?

Overloaded gatekeeper
  Gatekeeper filesystem issues
  Shared filesystem issues
  grid_monitor fails to start
  setup.sh (OSG 1.2.13 and below)
  Too many users and jobs

Page 16

Gatekeeper filesystem issues

Massive cruft accumulation
  Only failed jobs leave cruft, but it sticks around forever
  Write your own tmpwatch script (example below):
  VDT should provide a standard one!

Slow disk
  Move colocated services to other machines
  Faster spindles (SSDs?)
  Ramdisks for some services (condor spool?)

#!/bin/sh
# clean files untouched for 120 hours from each /home user's directory
for i in `grep ":/home" /etc/passwd | awk -F: '{print $6}'`; do
  /usr/bin/tmpwatch 120 $i
done
# ... and from the globus state directory (-d: leave directories in place)
/usr/bin/tmpwatch 120 -d /opt/osg/globus/tmp
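
If you go this route, a cron entry along these lines (the script path is illustrative) keeps it running nightly:

# /etc/cron.d/gatekeeper-tmpwatch
0 3 * * * root /usr/local/sbin/gatekeeper-tmpwatch.sh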

Page 17

Shared filesystem issues

NFS is notoriously inefficient at small writes
Grid gatekeepers love small writes!
If you think your NFS solution is rock-solid, you don’t have enough users yet (q.v. James Letts’ talk, Doug Johnson’s talk)

Not all batch systems require sharing user home directories between gatekeeper and workers
  Condor NFSLite – uses the Condor file transfer mechanisms to move the files to the worker (and the admin can use Condor configuration options to limit the scale of the transfers; sketch below)
  In principle this is possible with PBS and LSF; maybe not SGE
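
A hedged sketch of the “limit the scale of the transfers” knob: these are standard Condor schedd parameters rather than anything NFSLite-specific, the values are illustrative, and the config file location depends on your install.

# in the schedd's condor_config.local: cap simultaneous file transfers
MAX_CONCURRENT_DOWNLOADS = 10
MAX_CONCURRENT_UPLOADS   = 10
# then run condor_reconfig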

Page 18

Grid monitor problems

Load abnormally high; ps shows no grid_monitor processes
  Paging Capt. Yossarian: new grid monitors may not be able to start because of the increased load created by the missing grid monitors!
  Check the StarterLog for the managedfork jobmanager – if managedfork is failing, new grid monitors can’t take the place of the expired ones
  Counterintuitive band-aid: shut off managedfork (revert to the non-managed fork jobmanager)

Resource contention between the grid monitor and the batch scheduler (or other services)
  Renice globus-gatekeeper by editing /etc/xinetd.d/globus-gatekeeper and adding “nice = 20” (sketch below)
  VDT should do this automatically!
  Increase the polling interval? (from Brian)
    Add sleep(10 + rand(10)) to the poll subroutine in fork.pm
    VDT should do this automatically!
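
For reference, the renice change is just an extra attribute in the existing xinetd stanza (excerpt below; the service name and other attributes come from your existing file – only the nice line is new), followed by a reload:

# /etc/xinetd.d/globus-gatekeeper (excerpt)
service globus-gatekeeper
{
        ...
        nice = 20
}
# pick up the change
/etc/init.d/xinetd reload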

Page 19

Setup.sh execution

In OSG 1.2.13 and below, job managers executed $VDT_LOCATION/setup.sh, which in turn executed a number of other setup scripts

On busy systems, this created a large amount of load (classified as “system” load) just from the sheer number of fork/exec calls

As of 1.2.14, it is no longer executed – so upgrade!

If you can’t upgrade, you can replace setup.sh – take a snapshot of your environment before and after, and replace the file with lines of static exports!
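
A hedged sketch of the before-and-after snapshot trick (temporary file names are illustrative, and any values containing spaces will need quoting by hand):

env | sort > /tmp/env.before
. $VDT_LOCATION/setup.sh
env | sort > /tmp/env.after
# keep only the variables setup.sh added or changed, as plain exports
comm -13 /tmp/env.before /tmp/env.after | sed 's/^/export /' > /tmp/setup.sh.static
# review the result, then use it in place of $VDT_LOCATION/setup.sh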

Page 20

Still overloaded?

Add a gatekeeper!
  Multi-gatekeeper CEs are widely supported in OSG
  The Tier 1 and many Tier 2s are using them (FNAL, CIT, UW, MIT, UCSD, Vanderbilt …)
  Tony Tiradani @ the CMS T1 has written an automated install/update script – many things hard-coded for the T1, but if you’re interested, it may be useful

If all else fails – ask for help from Rob, Doug, and osg-sites!