When the Grid Comes to Town
Chris Smith, Senior Product Architect
Platform Computing
© Platform Computing Inc. 2004
LSF 6.0 Feature Overview
Comprehensive Set of Intelligent Scheduling Policies
Goal-oriented SLA Scheduling
Queue-Based Fairshare Enhancements
Job Groups
Advanced Self-Management
Job-level Exception Management
Job Limit Enhancements
Non-normalized Job Run Limit
Resource Allocation Limit Display
LSF 6.1 was focused on performance and scalability
Scalability Targets
5K hosts per cluster
500K active jobs at any one time
100 concurrent users executing LSF commands
1M completed jobs per day
Performance Targets
90% min slot utilization
5 seconds max command response time
20 seconds real pending reason time
4 KB max memory usage per job (mbatchd + mbschd)
5 minutes max master failover time
2 minutes max reconfig time
Industry leading performance, reliability, & scalability
Supporting the largest and most demanding enterprise clusters
Extending leadership over the competition
Feature Benefits
Faster response times for user submission & query commands: improved user experience
Faster scheduling and dispatch times: increased throughput and cluster utilization
Faster master fail-over: improved availability, minimized downtime
Dynamic host membership improvements (host groups now supported): reduced administration effort, higher degree of self-management
Pending job management (limiting the number of pending jobs): prevents accidental overloading of the cluster with error jobs
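As a sketch of what the pending-job cap looks like in practice, an illustrative lsb.params fragment; the MAX_PEND_JOBS parameter name is recalled from LSF 6.x documentation, so treat the exact name and syntax as an assumption:

```
# lsb.params (illustrative fragment; parameter name assumed from LSF 6.x)
Begin Parameters
MAX_PEND_JOBS = 10000    # new submissions rejected once 10,000 jobs are pending
End Parameters
```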
Performance, Reliability and Scalability
When we tested Platform LSF V6.0 with a 100K-job load, we observed that mbatchd's size increased to 1.3 GB and it used 99.8% CPU.
Results – Platform LSF V6.0 vs V6.1

| Version (hosts, jobs) | Slot utilization | Job throughput (jobs/hr) | Job memory (KB) | bjobs response (s) | bqueues response (s) | "Daemon not responding" msgs | Failover time, mbdrestart (s) | Reconfig time (s) |
|---|---|---|---|---|---|---|---|---|
| Target | > 90% | – | < 4 | < 5 | < 5 | No | < 300 | < 120 |
| LSF 6.0 (3K, 50K) | 74% | 45,117.70 | 8.39 | 67.83 | 64.82 | Yes | 637 | 181 |
| LSF 6.1 (3K, 50K) | 94% | 68,960.60 | 1.38 | 0.86 | 0.48 | No | 200 | 82 |
| LSF 6.1 (3K, 100K) | 94% | 66,635.40 | 1.52 | 0.90 | 0.49 | No | – | – |
| LSF 6.1 (3K, 500K) | 93% | 70,773.90 | 1.08 | 1.00 | 0.91 | No | 318 | 58 |
| LSF 6.1 (5K, 100K) | 79% | 90,017.40 | 1.29 | 1.72 | 1.16 | No | – | – |
| LSF 6.1 (5K, 500K) | 73% | 77,947.90 | 1.11 | 1.68 | 1.20 | No | – | – |
Grid Computing Issues
Grid level scheduling changes some things
With the wider adoption of computing Grids as access mechanisms to local cluster resources, some of the requirements for the cluster resource manager have changed.
Users are coming from different organizations. Have they been authenticated? Do they have a user account?
I have to stage in data from where?!
Local policies must reflect some kind of balance between meeting local user requirements, and promoting some level of sharing.
How can the sites involved in a Grid get an idea what kind of workload is being run, and how it impacts the resources?
How can users access resources without needing a 30” display to show load graphs and queue lengths for the 10 different clusters they have access to?
Thinking about these issues can keep one awake at night.
Grid Identities are not UNIX user identities
Traditionally, LSF’s notion of users is very much tied to the UNIX user identity
Local admins must define local users for all users of the system
Can use some (brittle) form of user name mapping
Grid middleware (globus based) uses the GSI (PKI)
Grid map file maps users to local uids
Same management nightmare
Grid users are usually "second-class citizens"
It would be nice to have an identity model where the grid and the local scheduler share a notion of a consumer, and which perhaps allows more flexible use of local user accounts (e.g. Legion)
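For concreteness, the GSI grid map file mentioned above is a flat file mapping certificate subject DNs to local accounts; a hypothetical example (DNs and account names invented). Every grid user needs a hand-maintained entry on every cluster, which is the management nightmare in question:

```
"/O=Grid/OU=ExampleVO/CN=Jane Researcher" jresearch
"/O=Grid/OU=ExampleVO/CN=Bob Operator"    gridguest
```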
Where are applications located, and how are they configured?
Users get used to their local configurations
local installations of applications
environment variable names
there is a learning curve per site
Need some kind of standardization
could do TeraGrid-style software stack standardization, but this is very inflexible
need a standardized job description database
application location
local instantiation of environment variables
tie in with DRMAA job category
Platform Professional Services has used the "jsub" job starter for this
Are provisioning services the answer?
would be nice to dynamically install an application image and environment on demand with a group of jobs
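One way to picture the standardized job description database is a per-site catalog keyed by an abstract application name (in the spirit of a DRMAA job category), resolving to the local install path and environment variables. A minimal Python sketch; all application names, paths and variables below are hypothetical:

```python
# Hypothetical job-description database: one entry per (application, site).
JOB_CATALOG = {
    "blast": {
        "siteA": {"path": "/opt/blast/bin/blastall",
                  "env": {"BLASTDB": "/data/blastdb"}},
        "siteB": {"path": "/usr/local/blast/blastall",
                  "env": {"BLASTDB": "/scratch/db"}},
    },
}

def resolve(category, site):
    """Turn an abstract job category into a site-local command and environment."""
    entry = JOB_CATALOG[category][site]
    return entry["path"], entry["env"]
```

With a shared catalog like this, the user submits "blast" everywhere and never learns per-site paths or variable names.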
How do administrators set scheduler policy?
It's probably easiest to make those pesky grid users second-class citizens (back to the identity issue)
A federated identity system (based on a user's role within a VO) could make sure that they get into the "right queue"
There are too many tunables within local schedulers. It would be nice to have some kind of "self-configuration" based on higher-level policies
Platform’s goal based scheduling (project based scheduling)
Current “goals” include deadline, throughput, and velocity
How are resources being used, and who is doing what?
Need some kind of insight into the workload, users and projects
Needs to be “VO aware”
Something like Platform’s analytics packages
Data set management/movement for batch jobs
Should a job go to its data, or should data flow to a job
current schedulers don’t take this into consideration
ideally would like to flow jobs using the same data to a site (set of hosts) which have already “cached” the data
but where’s the sweet spot where this becomes a hot spot?
The scheduler's job submission mechanism (both local and Grid) needs to be able to specify data set usage, and the scheduler should use this as a factor in scheduling
Moreover, there needs to be some kind of feedback loop between the flowing of data between sites and the flowing of jobs between sites
If I had a predictive scheduler, I could have data transfers happen “just in time”
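The "just in time" idea above can be sketched simply: given predicted job start times and transfer durations, start each data transfer as late as possible while still finishing before its job is due to run. A hypothetical Python sketch of such a prefetch planner:

```python
def prefetch_plan(jobs):
    """Plan just-in-time dataset transfers.

    jobs: list of (dataset, predicted_start_s, transfer_duration_s) tuples.
    Returns (dataset, transfer_start_s) pairs, ordered by start time, so
    each transfer completes exactly when its job is predicted to begin.
    """
    plan = []
    for dataset, start, xfer in jobs:
        # Start as late as possible; clamp at "now" (t = 0) if already late.
        plan.append((dataset, max(0, start - xfer)))
    return sorted(plan, key=lambda p: p[1])
```

This only works if the predictive scheduler's start-time estimates are trustworthy, which is exactly the feedback loop between data flow and job flow described above.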
Platform’s Activities
So how do we find the solution to these issues?
We (Platform) need some experience working within Grid environments.
CSF (Community Scheduler Framework - not RAL’s scheduler) provides a framework we can use to experiment with metascheduling concepts and issues
But CSF doesn't offer the wide array of features or the scalability we have in LSF
Why not use LSF itself as a metascheduler?
We are engaged in Professional Services contracts doing this right now
Sandia National Lab - Job Scheduler interface to many PBS resources using LSF as the bridge. Integrates Kerberos and external file transfer.
National Grid Office of Singapore - LSF (and its WebGUI) will be the interface to computing resources at multiple sites. There are PBS, SGE and LL clusters (some with Maui). Automatic matching of jobs to clusters is desired.
CSF Architecture
[Diagram: Platform LSF users and Globus Toolkit users submit work to a Grid Service Hosting Environment containing a Job Service, a Reservation Service, a Queuing Service, and a Meta-Scheduler backed by a Global Information Service. Meta-scheduler plugins dispatch to Platform LSF clusters directly, and through an RM Adapter and GRAM to SGE and PBS clusters. Each resource manager reports into the information service via a RIPS (Resource Information Provider Service).]
LSF as a Metascheduler (60,000 ft view)
[Diagram: a Web Portal and Job Scheduler front a top-level LSF Scheduler, which uses MultiCluster to forward work to LSF, PBS, SGE and LL clusters and desktops, each running its own local scheduler.]
Data Centric Scheduling
The solution comes in two parts:
Data Centric Scheduling
Dispatch compute jobs to machines to which the cost of accessing data is “cheapest”
cache-aware scheduler
topology-aware scheduler, e.g. one that uses distance vectors to measure how far a host is from a data set
Workload Driven Data Management
Just as the workload scheduler is cognizant of data locality, a data manager needs to be cognizant of future workload that will exercise given data sets
If data sets can be transferred before they are needed, the latency of synchronous data transfer is mitigated
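The two parts above can be sketched together: a site's data-access cost comes from a topology distance vector, a cached data set costs nothing, and a load penalty keeps a popular cache from becoming a hot spot. All structures and numbers below are hypothetical:

```python
# Sketch: cache- and topology-aware site selection (hypothetical data).
def cheapest_site(datasets, sites, cache, distance, load):
    """Pick the site minimizing data-access cost plus a load penalty.

    A cached dataset costs nothing; otherwise the topology distance
    vector gives the cost of reaching it from that site.
    """
    def cost(site):
        data_cost = sum(
            0 if ds in cache.get(site, set()) else distance[site][ds]
            for ds in datasets
        )
        return data_cost + load[site]
    return min(sites, key=cost)

# Example mirroring the cache-aware scheduling scenario later in the deck:
# Site 1 caches MOL and MOL2, Site 2 nothing, Site 3 MOL. Site 1 is
# overloaded, so a job needing MOL should be forwarded to Site 3.
cache = {"site1": {"MOL", "MOL2"}, "site2": set(), "site3": {"MOL"}}
distance = {"site1": {}, "site2": {"MOL": 5}, "site3": {}}
load = {"site1": 10, "site2": 3, "site3": 1}
```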
Data cache aware scheduling
[Diagram: Sites 1–3 connected to a Data Management Service. Cached datasets: Site 1 holds MOL and MOL2, Site 2 holds none, Site 3 holds MOL.]
1. Poll the Data Management Service for datasets
2. Update cache info
3. User submits: bsub -extsched MOL
4. The local site is overloaded; the data-cache-aware scheduler plug-in decides to forward the job to Site 3, since it already has the MOL data set cached
5. Job forwarded to Site 3
Goal-Oriented SLA-Driven Scheduling
What is it?
Goal-oriented "just-in-time" scheduling policy
Unlike current scheduling policies based on configured shares or limits, SLA-driven scheduling is based on customer-provided goals:
Deadline-based goal: specify the deadline for a group of jobs
Velocity-based goal: specify the number of jobs running at any one time
Throughput-based goal: specify the number of finished jobs per hour
Allows users to focus on the "what and when" of a project instead of "how"
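To see what a deadline goal implies for the scheduler, a back-of-the-envelope sketch (this illustrates the general principle, not Platform's actual algorithm): the concurrency the scheduler must sustain follows from the remaining work divided by the time left.

```python
import math

def slots_for_deadline(pending_jobs, avg_runtime_hours, hours_to_deadline):
    """Minimum concurrent slots needed to finish all jobs by the deadline."""
    return math.ceil(pending_jobs * avg_runtime_hours / hours_to_deadline)

# e.g. 500 two-hour jobs due in 10 hours require 100 concurrent slots
```

The scheduler can recompute this as jobs finish and raise or lower the project's slot allocation "just in time", which is what lets the user state the what and when while the policy works out the how.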
Goal-Oriented SLA-Driven Scheduling
Benefits
Guarantees projects are completed on time according to explicit SLA definitions
Provides visibility into the progress of projects to see how well projects are tracking to SLAs
Allows the admin to focus on "what work and when" needs to be done, not "how" the resources are to be allocated
Guarantees service-level delivery to the user community, reducing project risk and administration cost
Summary
Local scheduler technology continues to progress well …. within the cluster.
Grid level schedulers raise issues which haven’t been dealt with before
cluster users are no longer “local”
local scheduling policies aren’t really applicable
data management and environment management are more difficult
Platform is working to solve some of these issues
implementing meta-schedulers
researching new scheduling policies
Need to work closely with the HEP community since they are causing the biggest problems!
Questions?