
Page 1: CASTOR2 Disk Cache Scheduling

CASTOR2 Disk Cache Scheduling
LSF, Job Manager and Python Policies

Dennis Waldron
CERN / IT

Castor External Operation Face-to-Face Meeting, CNAF, October 29-31, 2007
CERN - IT Department, CH-1211 Genève 23, Switzerland (www.cern.ch/it)

Page 2: Outline

• LSF limitations, pre 2.1.3
• v2.1.3:
  – Resource Monitoring and Shared Memory.
  – LSF changes and New Scheduler Plugin.
  – Python Policies.
• v2.1.4:
  – Scheduling Requirements/Problems
  – Job Manager
• v2.1.6+:
  – Future Developments (v2.1.6 & v2.1.7)

Page 3: LSF Limitations, pre 2.1.3 releases

What was killing us:

• The LSF queue was limited to ~2000 jobs; more than this resulted in instabilities.

• LSF jobs remained in PSUSP after timeout between stager and rmmaster (#17153)

• Poor submission rates into LSF, ~10 jobs/second. Half of the advertised LSF rate.

• RmMaster did not keep node status after restart (#15832)

• Database latency between LSF plugin (schmod_castor) and stager DB resulted in poor scheduling performance.

• These were just the start!!!

• Additional Information available at: http://castor.web.cern.ch/castor/presentations/2006/

Page 4: Resource Monitoring and Shared Memory

• In 2.1.3 both the LSF plugin and the Resource Monitor (rmMasterDaemon) share a common area of memory for exchanging information between the two processes (sketched below).
  – Advantage: Access to monitoring information from inside the LSF Plugin is now a pure memory operation on the scheduler machine. (extremely fast!)
  – Disadvantage: the rmMasterDaemon and LSF must operate on the same machine! (no possibility for LSF failover)

• Changes to daemons in 2.1.3:
  – rmmaster became a pure submission daemon.
  – rmMasterDaemon was introduced for collecting monitoring information.
  – rmnode was replaced by rmNodeDaemon on all diskservers.
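
The idea can be pictured with a small Python sketch (illustrative only: the real exchange is a C++ shared memory segment, and every name and field below is invented for the example). The monitoring side publishes the latest metrics into a named segment; the plugin side, running on the same machine, reads them back with no database round trip:

    # Illustrative sketch only; the real rmMasterDaemon/plugin exchange is a C++
    # shared memory segment, and the segment name and record layout are invented.
    import struct
    from multiprocessing import shared_memory

    RECORD = struct.Struct("dddd")   # free space, load, read rate, write rate

    # "rmMasterDaemon" side: publish the latest monitoring values.
    shm = shared_memory.SharedMemory(name="castor_monitoring", create=True,
                                     size=RECORD.size)
    RECORD.pack_into(shm.buf, 0, 1.2e12, 0.75, 35.0e6, 28.0e6)

    # "LSF plugin" side, same machine: a pure memory read, no DB latency.
    view = shared_memory.SharedMemory(name="castor_monitoring")
    free_space, load, read_rate, write_rate = RECORD.unpack_from(view.buf, 0)

    view.close()
    shm.close()
    shm.unlink()

Because the segment only exists on the scheduler machine, this also makes the disadvantage above concrete: anything that needs the data must run on that same host.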

Page 5: Resource Monitoring Cont.

• New monitoring information contains (summarised in the sketch below):
  – On diskservers: ram (total + free), memory (total + free), swap (total + free), load, status and adminStatus.
  – For each filesystem: space (total + free), nbRead/ReadWrite/WriteStreams, read/writeRate, nbMigrators, nbRecallers, status and adminStatus.

• Monitoring intervals:
  – 1 minute for slow moving info (total*, *status)
  – 10s for fast moving info (*Streams, *rate, load)

• Status can be Production, Draining or Down.
• Admin status can be None, Force or Deleted.
  – Set via rmAdminNode.
  – Force prevents updates from monitoring.
  – Deleted deletes it from the DB.
  – Release allows moving back from Force to None.

• By default, new diskservers are in status DOWN and admin status FORCE.
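
Taken together, the fields above amount to two records per diskserver, roughly as in this sketch (field names follow the slide; these are not the actual CASTOR structures):

    # Sketch of the monitoring data described above; not the actual CASTOR types.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class FileSystemMetrics:
        # fast moving, refreshed every ~10s
        nbReadStreams: int = 0
        nbWriteStreams: int = 0
        nbReadWriteStreams: int = 0
        readRate: float = 0.0        # bytes/s
        writeRate: float = 0.0       # bytes/s
        nbMigrators: int = 0
        nbRecallers: int = 0
        # slow moving, refreshed every ~1 minute
        totalSpace: int = 0          # bytes
        freeSpace: int = 0           # bytes
        status: str = "Production"   # Production | Draining | Down
        adminStatus: str = "None"    # None | Force | Deleted

    @dataclass
    class DiskServerMetrics:
        totalRam: int = 0
        freeRam: int = 0
        totalMemory: int = 0
        freeMemory: int = 0
        totalSwap: int = 0
        freeSwap: int = 0
        load: float = 0.0
        status: str = "Down"         # new diskservers start in status DOWN...
        adminStatus: str = "Force"   # ...and admin status FORCE
        fileSystems: List[FileSystemMetrics] = field(default_factory=list)

Later sketches in this talk reuse these two records.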

Page 6: LSF changes and New Scheduler Plugin

• Added multiple LSF queues, one per svcclass.
  – Not for technical reasons!!!
  – Allows for user restrictions at queue level and better visualization of jobs on a per svcclass basis via bqueues.

• Utilisation of External Scheduler options during job submission.
  – Recommended by LSF experts.
  – Increased job submission from 10 to 14 jobs/second.
  – Calls to LSF (mbatchd) from CASTOR2 components reduced from 6 to 1. As a result queue limitations are no longer needed. (Not totally disappeared!!)
  – Removed the need for message boxes, i.e. jobs are no longer suspended and resumed at submission time.
  – Requires LSF_ENABLE_EXTSCHEDULER to be enabled in lsf.conf on both the scheduler and rmmaster machines (see below).
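
For reference, enabling the external scheduler is a single lsf.conf setting (the parameter name comes from the slide; the value Y is the usual LSF convention, stated here as an assumption rather than quoted from the slides):

    # lsf.conf, on both the scheduler and rmmaster machines
    LSF_ENABLE_EXTSCHEDULER=Y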

Page 7: LSF Changes Cont.

• Filesystem selection is now transferred between LSF and the job (stagerJob) via the SharedLSFResource.
  – The location of the SharedLSFResource can be defined in castor.conf.
  – Can be a shared filesystem, e.g. NFS, or a web server.

• Why is it needed?
  – LSF is CPU aware, not filesystem aware.
  – The LSF scheduler plugin has all the logic for filesystem selection based on monitoring information and policies.
  – The final decision needs to be transferred between the plugin and the LSF execution host (sketched below).
  – Could have been LSF message boxes or the SharedLSFResource. Neither is great! But we select the lesser of two evils!
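
The hand-off can be pictured as follows (illustrative only: the file naming, record format and paths are invented; only the idea of writing the decision into the SharedLSFResource and reading it back on the execution host comes from the slide):

    # Illustrative hand-off via the SharedLSFResource; paths, file names and the
    # record format are invented for this sketch.
    import os

    SHARED_LSF_RESOURCE = "/shared/castor/lsf"   # e.g. an NFS mount, per castor.conf

    def publish_decision(job_id: str, diskserver: str, filesystem: str) -> None:
        """Scheduler plugin side: record the chosen diskserver:filesystem."""
        with open(os.path.join(SHARED_LSF_RESOURCE, job_id), "w") as f:
            f.write(f"{diskserver}:{filesystem}\n")

    def read_decision(job_id: str):
        """stagerJob side, on the LSF execution host: fetch the decision."""
        with open(os.path.join(SHARED_LSF_RESOURCE, job_id)) as f:
            diskserver, filesystem = f.read().strip().split(":")
        return diskserver, filesystem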

Page 8: LSF Python Policies

• Why?
  – Filesystem selection has moved from the Stager DB to the Plugin. The Plugin must now take over its functionality.
  – Scheduling needs to be sensitive to other, non-scheduled activity and respond accordingly.

• The initial implementation was a basic equation with coefficients set in castor.conf (sketched below).
  – Advantage: Simplicity
  – Disadvantages:
    • Simplicity
    • Every new internal release during testing of 2.1.3 required changes to this equation inside the code!!
    • We couldn't ask the operations team to make these changes at runtime, so another language was needed for defining policies.

• The winner was Python!
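
For comparison, the hard-coded approach boiled down to a weighting along these lines (purely illustrative: the real equation and the castor.conf coefficient names are not shown in these slides; the records are those sketched on page 5):

    # Purely illustrative stand-in for the old hard-coded equation; the real
    # coefficients and their castor.conf names are not given in the slides.
    def filesystem_weight(ds, fs, c_space=1.0, c_streams=-0.5, c_load=-0.2):
        """Rank a (DiskServerMetrics, FileSystemMetrics) pair; higher is better."""
        streams = fs.nbReadStreams + fs.nbWriteStreams + fs.nbReadWriteStreams
        return c_space * fs.freeSpace + c_streams * streams + c_load * ds.load

Changing the coefficients, let alone the shape of the formula, meant a new code release, which is exactly what the Python policies remove.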

Page 9: Python Policies Cont.

• Examples: /etc/castor/policies.py.example
• Policies are defined on a per svcclass level. Many underestimate their importance!

• Real example:
  – 15 diskservers, 6 LSF slots each, all slots occupied transferring 1.2GB files in both read and write directions. Expected throughput per stream ~20MB/s (optimal).

• Problems:
  – At 20 MB/s migration and recall streams suffer.
  – Migrations and Recalls are unscheduled activities.

• Solution:
  – Define a policy which favours migration and recall streams by restricting user activity on the diskserver, allowing more resources (bandwidth, disk I/O) to be used by migrations and recalls (see the sketch after this list).
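
A minimal policy in that spirit might look like the sketch below. The function name, signature and thresholds are assumptions for illustration, not a copy of policies.py.example; the input records are the ones sketched on page 5:

    # Hypothetical policy sketch; names, signature and thresholds are assumptions,
    # not the contents of /etc/castor/policies.py.example.
    def favour_tape_policy(diskserver, filesystem):
        """Return a weight for this filesystem: higher means more likely to be
        chosen for a user job, 0 means keep new user jobs away for now."""
        user_streams = (filesystem.nbReadStreams
                        + filesystem.nbWriteStreams
                        + filesystem.nbReadWriteStreams)
        tape_streams = filesystem.nbMigrators + filesystem.nbRecallers

        # While migrations/recalls are active, throttle user activity so that
        # bandwidth and disk I/O go to the tape streams.
        if tape_streams > 0 and user_streams >= 2:
            return 0

        # Otherwise prefer emptier filesystems on less loaded diskservers.
        return filesystem.freeSpace / (1.0 + user_streams + diskserver.load)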

Page 10: LSF Limitations, pre 2.1.3 releases

What was killing us (and how 2.1.3 addressed it):

• The LSF queue was limited to ~2000 jobs; more than this resulted in instabilities. → No message boxes; LSF calls reduced from 6 to 1.

• LSF jobs remained in PSUSP after timeout between stager and rmmaster (#17153)

• Poor submission rates into LSF, ~10 jobs/second. Half of the advertised LSF rate. → Now at 14 jobs/second.

• RmMaster did not keep node status after restart (#15832). → States are now stored in the Stager DB for persistence.

• Database latency between LSF plugin (schmod_castor) and stager DB resulted in poor scheduling performance. → Shared memory implementation.

• These were just the start!!!

• Additional Information available at: http://castor.web.cern.ch/castor/presentations/2006/

Page 11: Scheduling Requirements/Problems

• Job submission rates still not at the advertised LSF rate of 20 jobs per second.

• Jobs remain in a PEND'ing status indefinitely in LSF if no resources exist to run them (#15841).

• Administrative actions such as bkills do not notify the client of a request termination (#26134).

• CASTOR cannot throttle requests if they exceed a certain amount (#18155) - the infamous LSF meltdown.

A daemon was needed to manage and monitor jobs whilst in LSF and take appropriate actions where needed.

Page 12: Job Manager - Improvements

• The stager no longer communicates directly with the submission daemon.
  – All communication is done via the DB, making the jobManager stateless.
  – Two new statuses exist in the subrequest table:
    • SUBREQUEST_READYSCHED 13
    • SUBREQUEST_BEINGSCHED 14
  – No more timeouts between stager and rmmaster resulting in duplicate submissions and rmmaster meltdowns.

• Utilises a forked process pool for submitting jobs into LSF (sketched below).
  – The previous rmmaster forked a process for each submission into LSF, which is expensive.
  – The number of LSF related processes is now restricted to 2 x the number of submission processes.
  – Improved submission rates from 14 to 18.5 jobs/second.

• New functionality added to detect when a job has been terminated by an administrator (`bkill`) and notify the client of the job's termination.
  – New error code: 1719 - 'Job killed by service administrator'

Page 13: Job Manager – Improvements Cont.

• Jobs can now be killed if they remain in LSF for too long in a PEND'ing status (see the sketch below).
  – The timeout value can be defined on a per svcclass basis.
  – The user receives error code: 1720 - 'Job timed out while waiting to be scheduled'.

• Jobs whose resource requirements can no longer be satisfied can be terminated:
  – Error code: 1718 - 'All copies of this file are unavailable for now. Please retry later'
  – Must be enabled in castor.conf via option JobManager/ResReqKill.

• Multiple JobManagers can operate in parallel for a redundant, high availability solution.

• All known rmmaster related bugs closed!
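
The PEND timeout check amounts to something like this sketch (the constant name, job fields and timeout lookup are assumptions; only the per-svcclass timeout and error code 1720 come from the slide):

    # Hypothetical sketch of the PEND timeout check; the constant name, job fields
    # and timeout lookup are invented. Only the per-svcclass timeout idea and the
    # 1720 error code come from the slide.
    import subprocess
    import time

    JOB_TIMED_OUT = 1720   # 'Job timed out while waiting to be scheduled'

    def kill_timed_out_jobs(pending_jobs, timeouts, default_timeout=3600):
        """pending_jobs: list of dicts with 'jobid', 'svcclass', 'submit_time'.
        timeouts: per-svcclass timeout in seconds. Returns (jobid, error_code)
        pairs so the clients can be notified of the termination."""
        killed = []
        now = time.time()
        for job in pending_jobs:
            timeout = timeouts.get(job["svcclass"], default_timeout)
            if now - job["submit_time"] > timeout:
                subprocess.call(["bkill", str(job["jobid"])])
                killed.append((job["jobid"], JOB_TIMED_OUT))
        return killed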

Page 14: Future Developments 2.1.6+

• Disk-2-Disk copy scheduling
• Support for multiple rmMasterDaemons running in parallel on a single CASTOR 2 instance.

Page 15: Comments, questions?