© 2012 IBM Corporation IBM Platform LSF Training Nov 19-21, 2014

IBM Platform LSF Architecture Overview

Platform Computing

Without IBM Platform LSF

Which node can run my job or task?

In a distributed environment (hundreds of hosts)

Monitoring and control of resources is complex

Workload is “silo”-based

Resource usage imbalance

Users perceive a lack of resources

With IBM Platform LSF

All nodes are grouped into a “cluster”

Now, IBM Platform LSF will run my job or task on the best node available!

Virtual pool of computing resources managed by IBM Platform LSF

Key LSF Objectives

• Provide the means to create a powerful computer system made up of many smaller systems to increase productivity and lower operating costs

• Match limited supply of resources with demand

IBM Platform LSF

LSF 8 Architecture

[Figure: architecture diagram. Labels: Enterprise Applications (five Application boxes); Platform LSF®; Services in Real-time; Application Servers On-demand; Platform Enterprise Grid Orchestrator™ (EGO); Heterogeneous Enterprise Resources (Network Bandwidth, Servers, Licenses, Data Storage)]

LSF Terminology

• Cluster

A collection of TCP/IP networked hosts running Platform LSF

• Master host

A cluster requires a master host. The master host controls the rest of the hosts in the grid

• Master candidates

Master failover hosts

• Server host

A host within the cluster that submits and executes jobs and tasks

• Client host

A host within the cluster that only submits jobs and tasks

LSF Terminology (cont.d)

• Execution host

The host that executes the job or task

• Submission host

The host from which a job or task is submitted

• Job

A command submitted to Platform LSF. Can take more than one Job Slot

• Queue

A network-wide holding place for jobs which implements different job scheduling and control policies

• Job Slot

The basic unit of processor allocation in Platform LSF. Can be more than one per physical processor

LSF Overview

Individual machines are grouped into a cluster to be managed by Platform LSF

One machine in the cluster is selected as the “master” of LSF

Each slave machine in the cluster collects its own “vital signs” periodically and reports them back to the master

Users then submit their jobs to Platform LSF and the master makes a decision on where to run the job based on the collected vital signs

How It Works – IBM Platform LSF

[Figure: component diagram. Labels: Load Information Manager; Host Workload Manager; LSF Web Services; Broker; Web Application; Job Submission API; Cluster Workload Manager; Job Queue; Intelligent Scheduler; Plugin Schedulers (Fairshare, Preemption, Resource Reservation, Advance Reservation, License Scheduling, SLA (Service Level Agreement) Scheduling, MultiCluster, Other Scheduling Modules)]

Load Information Manager (LIM)

• Runs on every server host in the cluster

• Defines the cluster configuration

– Identifies the master host

– Licenses the static server hosts, dynamic server hosts and fixed client hosts

– Gathers built-in resource load information directly from /dev/kmem and forwards the information to the master LIM

– Reports site-defined resource load information gathered by the master ELIM and slave ELIMs to the master LIM

Master LIM

• Selected based on the order of static server hosts in the LSF_MASTER_LIST variable in lsf.conf

• Stores built-in and site-defined resource load information gathered by the master ELIM and slave ELIMs

• MBD queries the resource load information from the master LIM for MBSCHD

• If the master LIM becomes unavailable, a new master LIM is automatically started on the failover master host
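
The selection and failover rule described above can be sketched as follows. This is a minimal illustration rather than LSF's implementation: the `pick_master` helper and the host names are hypothetical, and the only behavior modeled is "the first available host in LSF_MASTER_LIST order becomes the master".

```python
def pick_master(master_list, available):
    """Return the first host in LSF_MASTER_LIST order that is available, else None."""
    for host in master_list:
        if host in available:
            return host
    return None


# Hypothetical LSF_MASTER_LIST value from lsf.conf (space-separated host names).
master_list = "hostA hostB hostC".split()

# All candidates are up: the first entry is the master.
print(pick_master(master_list, {"hostA", "hostB", "hostC"}))

# hostA is down: the next candidate in the list takes over.
print(pick_master(master_list, {"hostB", "hostC"}))
```

Listing the candidates in order of preference therefore decides which host takes over first when the current master is lost.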

Master and Slave ELIMs

• Site-defined resources can be managed by LSF

• These site-defined resources are gathered using slave ELIMs (elim.slave)

• Slave ELIMs are managed by a Master ELIM (melim)

• The slave ELIMs are written by the LSF Administrator

• Static ELIMs are used to report user-defined static resources

Process Information Manager (PIM)

• Runs on every server host in the cluster

• Responsible for gathering information about every process running on the server

• Information gathered is used:

– By SBD to enforce load thresholds

– By MBD to calculate fairshare

• Automatically started by LIM

Remote Execution Server (RES)

• Runs on every server host in the cluster

• Provides fast, transparent and secure remote execution of interactive tasks

• To prevent interactive task submissions, disable the RES daemon

Slave Batch Daemon (SBD)

• Runs on every server host in the cluster

• Receives job requests from MBD

• Is responsible for enforcing load thresholds

• Maintains the state of jobs on a server host

• Launches MBD on the master host

Master Batch Daemon (MBD)

• One MBD per cluster that runs on the master host

• Responds to user queries (bjobs, bhosts, etc.)

• Receives job requests (bsub)

• Responsible for the overall state of all jobs in the system

• Sends resource load information from the master LIM and pending job information to MBSCHD for scheduling

• Receives scheduling decisions from MBSCHD and dispatches jobs to SBD on the designated server host

• Keeps a transaction file on jobs

• Manages queues

LSF Scheduling Daemon (MBSCHD)

• One MBSCHD per cluster that runs on the master host

• Receives resource load information and pending job information from MBD

• Makes scheduling decisions based on job requirements, policies and resource availability

• Sends scheduling decisions to MBD for job dispatching

• Launched automatically by MBD on the master host

• If MBSCHD fails, a new MBSCHD is started

• Reads the lsf.conf file for environment information

Starting LSF Daemons

• Upon successful completion of an LSF installation, the daemons must be started

LSF Server: LIM, PIM, RES, PEM, SBD

LSF Master: LIM, PIM, RES, SBD, MBD, MBSCHD

Starting LSF Daemons (cont.d)

• To start all daemons on all hosts in the cluster you can use the following script

lsfstartup

• Useful for cold starting a cluster

Starting LSF Daemons (cont.d)

• To start daemons on local host:

Command                   Daemons started
% lsadmin limstartup      LIM, PIM
% lsadmin resstartup      RES
% badmin hstartup         SBD [MBD, MBSCHD]

• To start daemons on remote host(s):

% lsadmin limstartup host1 [host2…hostn]
% lsadmin resstartup host1 [host2…hostn]
% badmin hstartup host1 [host2…hostn]

• To start daemons on all hosts in the lsf.cluster file:

% lsadmin limstartup all
% lsadmin resstartup all
% badmin hstartup all

Starting LSF Daemons (cont.d)

• To start daemons on local host:

# $LSF_SERVERDIR/lsf_daemons start

or

# /etc/init.d/lsf start

or

# cd $LSF_SERVERDIR
# ./lim
# ./res
# ./sbatchd

IBM Platform LSF Daemons

[Figure: daemons per host. Master host: LIM, RES, SBD, PIM, MELIM, SELIM, plus MBD and MBSCHD. Host 2 through Host N: LIM, RES, SBD, PIM, MELIM, SELIM. Each host reads load information from /dev/kmem.]

LSF LIM & RES Status

% lsload
HOST_NAME  status   r15s  r1m   r15m  ut      pg    ls  it     tmp   swp   mem
training8  ok       0.0   0.0   0.0   0%      0.0   1   11728  112M  114M  52M
training3  ok       1.9   1.0   1.1   19%     1.9   1   0      121M  113M  31M
training1  -ok      2.9   3.0   1.6   49%     9.9   2   0      127M  143M  35M
training5  busy     9.5   12.0  8.1   *100%   10.9  5   0      30M   21M   40M
training2  unavail  -     -     -     -       -     -   -      -     -     -

* Indicates that a load threshold has been exceeded
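
As a rough illustration of how this output can be read programmatically, the sketch below parses the sample rows shown on this slide and lists, per host, the load indices marked with `*`. The helper names `parse_lsload` and `exceeded` are hypothetical, and the parser assumes the whitespace-separated layout of this sample only.

```python
# Sample copied from the slide; a real script would capture the command's output instead.
SAMPLE = """\
HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem
training8 ok 0.0 0.0 0.0 0% 0.0 1 11728 112M 114M 52M
training3 ok 1.9 1.0 1.1 19% 1.9 1 0 121M 113M 31M
training1 -ok 2.9 3.0 1.6 49% 9.9 2 0 127M 143M 35M
training5 busy 9.5 12.0 8.1 *100% 10.9 5 0 30M 21M 40M
training2 unavail - - - - - - - - - -
"""


def parse_lsload(text):
    """Return one dict per host, keyed by the column names in the header row."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def exceeded(row):
    """Names of load indices whose value carries the '*' threshold marker."""
    return [name for name, value in row.items()
            if name not in ("HOST_NAME", "status") and value.startswith("*")]


rows = parse_lsload(SAMPLE)
for row in rows:
    print(row["HOST_NAME"], row["status"], exceeded(row))
```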

LSF SBD Status

% bhosts

HOST_NAME  STATUS   JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
training8  ok       -     32   10     5    4      1      0
training1  ok       -     16   0      0    0      0      0
training3  unreach  -     1    0      0    0      0      0
training5  closed   -     2    1      1    0      0      0
training2  unavail  -     8    0      0    0      0      0

LSF SBD Status (cont.d)

% bhosts -l training5

HOST training5

STATUS       CPUF   JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOWS
closed_Busy  18.60  -     2    1      1    0      0      0    ()

CURRENT LOAD USED FOR SCHEDULING:
           r15s  r1m  r15m  ut    pg   io   ls  it  tmp  swp  mem
Total      7.8   6.9  5.4   *75%  5.5  3.4  3   0   82M  44M  52M
Reserved   0.0   0.0  0.0   0%    0.0  0    0   0   0M   0M   0M

LOAD THRESHOLD USED FOR SCHEDULING:
           r15s  r1m  r15m  ut    pg   io   ls  it  tmp  swp  mem
loadSched  -     -    -     *0.2  -    -    -   -   -    -    -
loadStop   -     -    -     *0.7  -    -    -   -   -    -    -

* Indicates that a load threshold has been exceeded

LSF Job Process

• By default, LSF handles a job as follows:

– Receives the job

– During the next dispatch turn, considers the job for dispatch

– Places the job on the best available host

– Sets the environment on the host

– Starts the job

Queue Definition

• User access restriction

• Host restriction

• Queue status

• Exclusive execution restriction

• Job resource requirements

Job Dispatching

• Every JOB_SCHEDULING_INTERVAL, MBD sends jobs for scheduling to MBSCHD

• Jobs may not be dispatched in order of submission

• MBD sends job information and resource information to MBSCHD for scheduling

• As soon as MBD receives scheduling decisions from MBSCHD, it immediately dispatches the job for execution

Job Scheduling

• MBSCHD evaluates jobs and makes scheduling decisions based on:

– Job priority

– Scheduling policies

– Available resources

• MBSCHD selects the most appropriate execution host and sends its decision to MBD

Host Selection

• A host is eligible to run a job if all conditions are met:

– Job slot availability on host

– Job slot limits

– Host load levels

– Host dispatch windows

– Resource requirements of the job

– Resource requirements of the queue
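
The "all conditions are met" rule can be pictured as a conjunction of independent checks. The sketch below is illustrative only: the `HostState` fields and the `is_eligible` helper are hypothetical stand-ins for the conditions listed above, not LSF data structures.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Hypothetical snapshot of one host, one field per condition in the list above."""
    name: str
    free_slots: int                  # job slot availability on the host
    within_slot_limits: bool         # job slot limits not exceeded
    load_ok: bool                    # host load levels below thresholds
    dispatch_window_open: bool       # host dispatch window currently open
    meets_job_requirements: bool     # resource requirements of the job
    meets_queue_requirements: bool   # resource requirements of the queue


def is_eligible(host, slots_needed=1):
    """A host is eligible only if every condition holds."""
    return all([
        host.free_slots >= slots_needed,
        host.within_slot_limits,
        host.load_ok,
        host.dispatch_window_open,
        host.meets_job_requirements,
        host.meets_queue_requirements,
    ])


good = HostState("hostA", 4, True, True, True, True, True)
busy = HostState("hostB", 4, True, False, True, True, True)
print(is_eligible(good), is_eligible(busy))
```

A single failing condition, such as the load level on `hostB`, is enough to exclude a host from consideration.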

Job Execution

• By default:

– The execution environment is maintained to be as close to the submission environment as possible

– LSF transfers global environment variables from the submission host to the execution host

• LSF sets LSF-specific environment variables for jobs

Execution Environment Variables

LSB_JOBID: Job ID assigned by LSF

LSB_MCPU_HOSTS: The list of hosts that are used to run the batch job

LSB_QUEUE: The name of the queue the job is dispatched from

LSB_JOBNAME: The name of the job

LSB_INTERACTIVE: Set to “Y” if the job is submitted with the -I option

LS_JOBPID: Set to the process ID of the job

LS_SUBCWD: The directory on the submission host when the job was submitted

Refer to the LSF Reference Guide for more information
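
A job script can read these variables like any other environment variables. The sketch below is a minimal example: the helper names are hypothetical, and the "host count host count" layout assumed for LSB_MCPU_HOSTS is not stated on this slide, so treat it as an assumption to check against the LSF Reference Guide.

```python
import os

# Variable names taken from the list above.
LSF_JOB_VARS = ("LSB_JOBID", "LSB_MCPU_HOSTS", "LSB_QUEUE", "LSB_JOBNAME",
                "LSB_INTERACTIVE", "LS_JOBPID", "LS_SUBCWD")


def job_environment(env=None):
    """Collect the LSF job variables; variables that are not set map to None."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in LSF_JOB_VARS}


def parse_mcpu_hosts(value):
    """Turn 'hostA 2 hostB 1' into {'hostA': 2, 'hostB': 1} (assumed layout)."""
    fields = value.split()
    return {fields[i]: int(fields[i + 1]) for i in range(0, len(fields) - 1, 2)}


# Outside an LSF job, every value printed here is simply None.
for name, value in job_environment().items():
    print(f"{name}={value}")
```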

IBM Platform LSF Key Features Overview

Key LSF Concepts

• Resource: computers, applications, licenses, storage, and so on. A cluster can be thought of as a collection of resources

• Transparency: a job executing on any node in the cluster must appear to the end user as if it is running on his/her local node

• Policies: resources are allocated to jobs according to centrally configured policies

LSF continuously matches demand with supply. A job is dispatched to run on a remote node once its resource requirements are matched with the resources supplied by the host(s) in the cluster and the job meets the scheduling policies currently in effect.

Resource Requirement String (cont.d)

• A resource requirement string is divided into the following sections:

– Selection - select[selection_string]

– Usage - rusage[rusage_string]

– Ordering - order[order_string]

– Locality - span[span_string]

– Same - same[same_string]

– CU - cu[cu_string]

• The span and same sections are specifically used for parallel jobs

Selection String

• A logical expression built from a set of resource names

• Specifies the characteristics of a server host to be considered as a potential execution host

• Evaluated for each host

– If the expression evaluates to ‘true’, then that host is considered a candidate
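
The per-host evaluation can be sketched as a filter over host attributes. The example below is only a model: the host records and the `candidates` helper are hypothetical, and the predicate is a hand-written Python equivalent of an expression such as `swp>=300 && mem>500`, not a parser for LSF's selection syntax.

```python
# Hypothetical host attributes; swp and mem are in MB.
hosts = [
    {"name": "hostA", "type": "LINUX64", "swp": 512, "mem": 1024},
    {"name": "hostB", "type": "LINUX64", "swp": 256, "mem": 2048},
    {"name": "hostC", "type": "SUNSOL", "swp": 400, "mem": 500},
    {"name": "hostD", "type": "SUNSOL", "swp": 300, "mem": 501},
]


def candidates(hosts, predicate):
    """Evaluate the expression once per host; hosts where it is true become candidates."""
    return [h["name"] for h in hosts if predicate(h)]


def enough_swap_and_memory(h):
    """Python equivalent of the expression swp>=300 && mem>500."""
    return h["swp"] >= 300 and h["mem"] > 500


print(candidates(hosts, enough_swap_and_memory))
```

Here `hostB` fails the swap test and `hostC` fails the strict memory comparison, so only `hostA` and `hostD` remain candidates.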

Selection String (cont.d)

• The select keyword can be omitted if the selection section is first in the resource requirement string

• The default selection string for execution type commands such as bsub or lsrun is:

select[type==local]

• The default selection string for query type commands such as lsload is:

select[type==any]

Selection String Examples

• $ bsub -R "select[type==any && swp>=300 && mem>500]" job1

Selects a candidate execution host of any type that has at least 300MB of available swap and more than 500MB of available memory

• $ lsload -R "select[type==local && cpuf<18.0]"

Displays all candidate execution hosts of the same type as the submission host that have a CPU factor less than 18.0

Selection String Examples (cont.d)

• $ bsub -R "(ut<0.50 && ncpus==2) || (ut<0.75 && ncpus==4)" job2

Selects a candidate execution host where the CPU utilization is less than 0.50 and the number of CPUs is 2, or where the CPU utilization is less than 0.75 and the number of CPUs is 4

• $ bsub -R "type==SUNSOL && swp>300 || type==HPPA && swp>400" task1

Selects a candidate execution host where the type is SUNSOL and more than 300MB of swap is available, or where the type is HPPA and more than 400MB of swap is available

Resource Usage String

• Specifies resource reservations for jobs on execution hosts

• Ignored when running interactive tasks

• By default no resources are reserved

• If rusage is defined at the job level and queue level, the job level takes precedence

• Keywords duration and decay can be used

• If a job can run with more than one rusage string, it is possible to specify multiple strings with an “OR” operator and have LSF pick the first one that matches

Resource Usage String Examples

• $ bsub -R "select[type==any && swp>=300 && mem>500] order[swp:mem] rusage[swp=300,mem=500]" job1

On the selected execution host, reserve 300MB of swap space and 500MB of memory for the duration of the job

• $ bsub -R "rusage[mem=500:app_lic_v2=1 || mem=400:app_lic_v1.5=1]" job1

The job will use 500MB with app_lic_v2, or 400MB with app_lic_v1.5

• Resource reservation is ignored for interactive tasks (e.g. lsload, lsrun)

Resource Usage String Examples (cont.d)

• $ bsub -R "select[ut<0.50 && ncpus==2] rusage[ut=0.50:duration=20:decay=1]" job2

On the selected execution host, reserve 50% of CPU utilization and linearly decay the amount of CPU utilization reserved over the duration of the period

• $ bsub -R "select[type == SUNSOL && mem > 300] rusage[mem=300:duration=1h]" job3

On the selected execution host, reserve 300MB of memory for 1 hour
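
The effect of duration and decay on a reservation can be sketched numerically. The `reserved` helper below is hypothetical and assumes a straight-line decay from the full amount at the start to zero at the end of the duration, which is one reading of "linearly decay"; the scheduler's exact bookkeeping may differ.

```python
def reserved(amount, duration, elapsed, decay=False):
    """Amount still reserved after `elapsed` time units (same units as `duration`)."""
    if elapsed >= duration:
        return 0.0                      # the reservation ends with the duration
    if decay:
        return amount * (1 - elapsed / duration)   # assumed linear decay to zero
    return float(amount)                # no decay: full amount until the duration ends


# rusage[ut=0.50:duration=20:decay=1], halfway through the 20-unit period
print(reserved(0.50, 20, 10, decay=True))

# rusage[mem=300:duration=1h], expressed in minutes, after 30 minutes
print(reserved(300, 60, 30))
```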

Resource Usage String Examples (cont.)

$ bsub -q reserve -R "rusage[res1=4,tmp=1000]" Job1

% bjobs -lp
Job <215>, User <john>, Project <default>, Status <PEND>, Queue <reserve>, Command <job1>
Thu Jul 24 07:12:16: Submitted from host <delpe07>, CWD </home/john>, Requested Resources <rusage[res1=4, tmp=1000]>;
Thu Jul 24 07:12:21: Reserved <1> job slot on host <delpe07>;
Thu Jul 24 07:12:21: Reserved <581> megabyte tmp on host <581M*delpe07>;
Thu Jul 24 07:12:21: Reserved <2> res1

Order String

• Allows candidate execution hosts to be sorted according to the value of resources

• The first index is the primary sort key, the second is the secondary sort key, and so on

• Hosts are ordered from best to worst on the given index or indices

• If defined at the job and queue level, the job level takes precedence

• The default order string used is order[r15s:pg]

Order String Examples

• $ bsub -R "select[type==any && swp>=300 && mem>500] order[mem]" job1

Order the candidate execution hosts from the highest to lowest amount of available memory

• $ lsload -R "select[type==local && cpuf < 18.0] order[cpuf]"

Order the candidate execution hosts from highest to lowest CPU factor

• If the order section is not specified, the candidate execution hosts are sorted using the default order string, order[r15s:pg]

Order String Examples (cont.d)

• $ bsub -R "select[ut<0.8 && mem>200] order[r1m:ut:-mem]" small_mem_job.sh

Order the candidate execution hosts from lowest to highest one-minute run queue length, from lowest to highest CPU utilization, and from lowest to highest amount of available memory
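
The multi-key ordering described for order[r1m:ut:-mem] behaves like a sort on a tuple of keys. The sketch below uses hypothetical host values and treats all three keys as ascending, matching the description above; it models the ordering only, not how LSF evaluates the string.

```python
# Hypothetical load values for four candidate hosts (mem in MB).
hosts = [
    {"name": "hostA", "r1m": 0.5, "ut": 0.30, "mem": 800},
    {"name": "hostB", "r1m": 0.5, "ut": 0.30, "mem": 400},
    {"name": "hostC", "r1m": 0.2, "ut": 0.60, "mem": 900},
    {"name": "hostD", "r1m": 0.5, "ut": 0.10, "mem": 600},
]

# Primary key r1m, secondary key ut, tertiary key mem, each from lowest to highest.
ranked = sorted(hosts, key=lambda h: (h["r1m"], h["ut"], h["mem"]))
names = [h["name"] for h in ranked]
print(names)
```

`hostC` wins on the primary key alone; the three hosts tied on r1m are then separated by ut, and the remaining tie between `hostA` and `hostB` is broken by the smaller amount of memory.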

Span String

• Specifies the locality of a parallel job

• Supported options:

– span[hosts=1] which indicates that all processors allocated to this job must be on the same execution host

– span[ptile=n] which indicates that up to n processors on each execution host should be allocated to the job

– span[ptile=![,HOSTTYPE:n] uses the predefined maximum job slot limit in lsb.hosts (MXJ per host type/model) as the value for any host type or model other than those specified explicitly

• When defined at both job-level and queue-level, the job-level definition takes precedence

Span String Examples

• $ bsub -n 16 -R "select[ut<0.15] order[ut] span[hosts=1]" parallel_job1

All processors required to complete this job must reside on the same execution host

• $ bsub -n 16 -R "select[ut<0.15] order[ut] span[ptile=2]" parallel_job2

Up to 2 CPUs per execution host can be used to execute this job, therefore at least 8 execution hosts are required to complete this job
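
The host count in the ptile example follows from a ceiling division of the requested processors by the per-host limit. The `min_hosts` helper below is a hypothetical name for that arithmetic.

```python
import math


def min_hosts(n_processors, ptile):
    """Fewest execution hosts needed when at most `ptile` processors are used per host."""
    return math.ceil(n_processors / ptile)


print(min_hosts(16, 2))    # the -n 16, span[ptile=2] example above
print(min_hosts(10, 4))    # a non-divisible case: the last host is only partly used
```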

Same CPU String

• Used to specify that all processes of a parallel job must run on hosts with the same resources

• The parallel job scheduler plugin must be installed to use this option

• When defined at both job-level and queue-level, both requirements are combined to allocate processors

• Any static resource can be specified

Same CPU String Examples

• $ bsub -n 64 -R "select[type==SGI6||type==SOL7] same[type]" parallel_job1

Run all parallel processes on the same host type, either SGI IRIX or Solaris 7, but not both

• $ bsub -n 64 -R "select[type==any] same[type:model]" parallel_job2

Run all parallel processes on the same host type and model
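
One way to picture the same section is as a grouping step: hosts are partitioned by the named static resources, and the job must fit entirely inside one partition. The sketch below uses hypothetical host records, slot counts, and model names, and the helpers `groups_by` and `viable_groups` are illustrative, not part of LSF.

```python
from collections import defaultdict

# Hypothetical hosts with static attributes and free slots.
hosts = [
    {"name": "h1", "type": "SGI6", "model": "m1", "slots": 32},
    {"name": "h2", "type": "SGI6", "model": "m1", "slots": 32},
    {"name": "h3", "type": "SOL7", "model": "m2", "slots": 48},
    {"name": "h4", "type": "SOL7", "model": "m3", "slots": 16},
]


def groups_by(hosts, keys):
    """Partition hosts by the values of the resources named in same[...]."""
    groups = defaultdict(list)
    for h in hosts:
        groups[tuple(h[k] for k in keys)].append(h)
    return dict(groups)


def viable_groups(hosts, keys, slots_needed):
    """Keep only the partitions that can hold the whole parallel job."""
    return {key: [h["name"] for h in members]
            for key, members in groups_by(hosts, keys).items()
            if sum(h["slots"] for h in members) >= slots_needed}


print(viable_groups(hosts, ["type"], 64))            # same[type]
print(viable_groups(hosts, ["type", "model"], 64))   # same[type:model]
```

With these sample values both host types can hold 64 processes under same[type], but adding the model splits the SOL7 hosts into partitions that are too small.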

Multiple Resource Requirements

• Support for multiple resource requirement string (-R) options

• The administrator can easily change resource requirements in job-level submission

bsub -R "select[swp > 15]" -R "select[hpux] order[r15m]" -R "rusage[mem=100]" -R "order[ut]" -R "same[type]" -R "rusage[tmp=50:duration=60]" -R "same[model]" myjob

• LSF merges the multiple -R options into one string and selects a host that meets all of the resource requirements

• The number of -R option sections is unlimited

• Up to a maximum of 512 characters for the entire string per -R option

Compound Resource Requirements

• Specify different requirements for some slots within a job

Example:

$ bsub -R "1*{ select[type==X86_64 && mem>16000] rusage[mem=16000] } + 31*{ select[type==X86_64] }" myjob

Requests 32 processors: one on an X86_64 machine, on which the job reserves 16000 MB of memory, and the rest on X86_64 machines.

$ bsub -R "16*{ select[type==XT5] span[ptile=8] } + 32*{ select[type==XT4] span[ptile=4] }" crayjob

Requests 48 processors to run an MPMD application: the first 16 must be on 2 XT5 nodes, each with 8 cores; the remaining 32 processors must be on 8 XT4 nodes, each with 4 cores.
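
The processor totals quoted for the two compound examples are the sums of the per-term multipliers. The sketch below checks that arithmetic with a simple regular expression; `total_slots` is a hypothetical helper that only recognizes terms written as `N*{...}` and is not a parser for the full compound syntax.

```python
import re


def total_slots(compound):
    """Sum the N in every 'N*{' term of a compound resource requirement string."""
    return sum(int(n) for n in re.findall(r"(\d+)\*\{", compound))


first = ("1*{ select[type==X86_64 && mem>16000] rusage[mem=16000] } "
         "+ 31*{ select[type==X86_64] }")
second = ("16*{ select[type==XT5] span[ptile=8] } "
          "+ 32*{ select[type==XT4] span[ptile=4] }")

print(total_slots(first), total_slots(second))
```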

Multi-phase Resource Reservation

• The bsub -R option can contain multiple durations with multiple memory and decay requirements.

Example:

bsub -R "rusage[mem=(500 100):duration=(10)]" myjob

After the job has run for 10 minutes, it reserves 100M of memory and releases the other 400M.

bsub -R "rusage[mem=(500 400 300):duration=(20 10 5):decay=(1 0 1)]" myjob

The job reserves 500M of memory with decay for the first 20 minutes, then reserves 400M of memory with no decay for the next 10 minutes, then reserves 300M of memory with decay for the next 5 minutes, then reserves no memory for the rest of the job’s life cycle.
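
The phase boundaries in these examples can be worked out with a short lookup. The `phase_amount` helper below is hypothetical and returns only the nominal amount for the phase that is active at a given time; decay within a phase is deliberately not modeled. It assumes, as the first example suggests, that a final amount with no matching duration applies for the rest of the job.

```python
def phase_amount(amounts, durations, elapsed_min):
    """Nominal reservation (before any decay) for the phase active at `elapsed_min`."""
    start = 0
    for i, amount in enumerate(amounts):
        if i >= len(durations):
            return amount            # last phase has no duration: lasts for the rest of the job
        end = start + durations[i]
        if elapsed_min < end:
            return amount
        start = end
    return 0                         # every phase has expired: nothing is reserved


# rusage[mem=(500 100):duration=(10)]
print(phase_amount((500, 100), (10,), 5), phase_amount((500, 100), (10,), 15))

# rusage[mem=(500 400 300):duration=(20 10 5):decay=(1 0 1)]
print([phase_amount((500, 400, 300), (20, 10, 5), t) for t in (0, 25, 32, 40)])
```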

Job Submission & Control Commands

• bsub [options] command [cmdargs]

• bjobs [-a][-J jobname][-u usergroup|-u all][…] jobID

• bhist [-a][-J jobname][-u usergroup|-u all][…] jobID

• bbot/btop [jobID | "jobID[index_list]"] [position]

• bkill [-J jobname] [-m hostname] [-u username] [-q queuename] [-s signalvalue] jobID

• bmod [bsub_options] jobID

• bpeek [-f] jobID

• bstop/bresume jobID

• bswitch destination_queue jobID


LSF Job Submission Commands

• bsub [commonly used options]

-n # - number of CPUs required for the job

-o filename – redirect stdout, stderr and resource usage information of the job to the specified output file

-e filename – redirect stderr to the specified error file

-i filename – use the specified file as standard input for the job

-q qname – submits the job to the specified queue

-m hname – select host(s) or host group. Keywords “all” and “others” can be used

-J jobname – assigns the specified name to the job

-Q "[exit_code] [EXCLUDE(exit_code)]" – success exit codes


LSF Job Submission Commands (cont.d)

• bsub [options]

-oo filename - same as -o, but overwrite file if it exists

-eo filename - same as -e, but overwrite file if it exists

-L login_shell - Initializes the execution environment using the specified login shell

-n number - if PARALLEL_SCHED_BY_SLOT=Y in lsb.params, then specify number of job slots, not CPUs

-g jobgroup - submit job to specified group

-sla serviceclass - submit job to specified service class

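-W runlimit - set the run limit; if ABS_RUNLIMIT=Y in lsb.params, the limit is absolute wall-clock time (not normalized by the host CPU factor)

-app profile_name - submit job to specified application profile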


LSF Job Submission Commands (cont.d)

• bsub [setting limits]

-C core_limit Set the per-process core file size limit (KB)

-c cpu_time Limit the total CPU time for the job ([HH:]MM)

-cwd directory Specify the current working directory for the job

-W runlimit Set the run time limit for the job ([HH:]MM)

-We estimated_runtime Specify an estimated run time for the job


LSF Job Submission Commands (cont.d)

• bsub [options]

-B Send email when the job is dispatched

-H Hold the job in the PSUSP state after submission

-N Email job header only when job completes

-b begin_time Dispatch the job after the specified time with year & time

-G user_group Associate job with specified user group (fairshare)

-L login_shell Initialize the execution environment using the specified login shell

-t term_time Specify the job termination deadline with year & time

-u mail_user Send email to the specified address


LSF Interactive Job Submission

• bsub

-I – submit an interactive job

-Ip – submit an interactive job with pseudo-tty support

-Is – submit an interactive job and create a pseudo-tty with shell mode support

-XF – submit an interactive job with SSH X11 forwarding

• By default, LSF uses ssh for interactive X-Window jobs


bsub - Condensed Host Notations

bsub -m "host[1-12,24] host[12-64]+2" job1

• Optional Configuration in lsf.conf

– LSB_MAX_ASKED_HOSTS_NUMBER=integer

– Limits the number of hosts a user can specify with the bsub -m option. The request is rejected if more hosts are specified than the value set

– Default value is 512

• Commands can use the notation:

bsub brun bmod brestart brsvadd brsvmod bswitch bjobs bhist bacct brsvs bmig bpeek


Available Methods for Submitting Jobs

• By script or command

% cd /home/user/project_dir
% bsub -q parallel -a fluent -n 4 ./my_fluent_launcher.sh

• By job spooling

% bsub < spoolfile

• Interactively

% bsub
bsub> #BSUB -q parallel -n 4
bsub> #BSUB -a fluent
bsub> cd /home/user/project_dir
bsub> ./my_fluent_launcher.sh
bsub> ^D
Job <1234> submitted to queue <parallel>

Example spoolfile

#BSUB -q parallel
#BSUB -n 4 -a fluent
cd /home/user/project_dir
./my_fluent_launcher.sh
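When many similar jobs are submitted by spooling, the spool file can be generated rather than typed. The following Python sketch builds the same text as the example spoolfile on this slide. `build_spoolfile` is a hypothetical helper name, and the sketch only produces the text; it does not call bsub.

```python
def build_spoolfile(options, commands):
    """Return spool-file text: one #BSUB line per option group, then the commands."""
    lines = [f"#BSUB {opt}" for opt in options]
    lines.extend(commands)
    return "\n".join(lines) + "\n"


spool = build_spoolfile(
    options=["-q parallel", "-n 4 -a fluent"],
    commands=["cd /home/user/project_dir", "./my_fluent_launcher.sh"],
)
print(spool, end="")
```

The resulting text can be written to a file and submitted with `bsub < spoolfile`, as shown above.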


View Job Information

bjobs Can display parallel jobs and condensed host groups in an aggregate format

-a Display information about jobs in all states (including finished jobs)

-A Display summarized information about job arrays

-d Display information about jobs that finished recently

-l|-w Display information in long or wide format

-p Display information about pending jobs

-r Display information about running jobs

-g job_group Display information about jobs in specified group

-J job_name Display information about specified job or array

-m host_list Display information about jobs on specified hosts or groups

-P project_name Display information about jobs in specified project

-q queue_name Display information about jobs in specified queue

-u user_name Display information about jobs for specified users/groups


View Job Information (cont.d)

bhist

-a Display information about all jobs (overrides -d, -p, -r, and -s)

-b|-l|-w Display information in brief, long, or wide format

-d Display information about finished jobs

-p Display information about pending jobs

-s Display information about suspended jobs

-t Display job events chronologically

-C|-D|-S|-T start_time,end_time Display information about completed, dispatched, submitted, or all jobs in specified time window

-P project Display information about jobs belonging to specified project

-q queue Display information about jobs submitted to specified queue

-u username|all Display information about jobs submitted by user


View Submitted Job Information Example

% bjobs -u all -a
JOBID  USER   STAT  QUEUE     FROM_HOST  EXEC_HOST  JOB_NAME   SUBMIT_TIME
1233   user1  DONE  normal    training8  training1  *sortName  Nov 21 10:00
1234   user1  RUN   priority  training8  training1  *verilog   Nov 21 10:00
1235   user2  PEND  night     training9             *sortFile  Nov 21 10:03
1236   user2  PEND  normal    training9             *sortName  Nov 21 10:04


View Detailed Submitted Job Information

% bjobs -l 1235

Job <1235>, User <user2>, Project <default>, Status <PEND>, Queue <night>,

Command <night_job>

Wed Nov 21 10:03:51: Submitted from host <training9>, CWD <$HOME>;

PENDING REASONS:

Dispatch window closed: 1 queue;

SCHEDULING PARAMETERS:

r15s r1m r15m ut pg io ls it tmp swp mem

loadSched - - - - - - - - - - -

loadStop - - - - - - - - - - -


View Historical Job Information Example

% bhist
Summary of time in seconds spent in various states:
JOBID  USER    JOB_NAME  PEND  PSUSP  RUN  USUSP  SSUSP  UNKWN  TOTAL
2299   alfred  *eep 100  5     0      2    0      0      0      7
2300   alfred  *eep 100  4     0      2    0      0      0      6
2301   alfred  *eep 100  4     0      2    0      0      0      6
2302   alfred  *eep 100  3     0      2    0      0      0      5


View Historical Job Information Example (cont.d)

% bhist -l 2302

Job <2302>, User <alfred>, Project <default>, Command <sleep 100>

Wed Mar 30 13:53:44: Submitted from host <blade02>, to Queue

<normal>,CWD<$HOME>;

Wed Mar 30 13:53:47: Dispatched to <blade02>;

Wed Mar 30 13:53:47: Starting (Pid 24595);

Wed Mar 30 13:53:52: Running with execution home </home/alfred>, Execution CWD

</home/alfred>, Execution Pid <24595>;

Wed Mar 30 13:55:32: Done successfully. The CPU time used is unknown;

Wed Mar 30 13:55:32: Post job process done successfully;

Summary of time in seconds spent in various states by Wed Mar 30 13:55:32

PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL

3 0 105 0 0 0 108
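The summary row can be reproduced from the event timestamps in the listing above: PEND is the time from submission to dispatch, and RUN is the time from dispatch to completion. The Python sketch below redoes that arithmetic. The helper name `seconds_between` is illustrative, and the sketch assumes all events fall on the same day.

```python
from datetime import datetime

def seconds_between(start, end):
    """Whole seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Timestamps taken from the bhist -l output for job 2302.
submitted, dispatched, done = "13:53:44", "13:53:47", "13:55:32"

pend = seconds_between(submitted, dispatched)
run = seconds_between(dispatched, done)
total = pend + run
print(pend, run, total)  # 3 105 108
```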


Manipulating Jobs

• bbot – moves a pending job to the bottom of the queue

• btop – moves a pending job to the top of the queue

• bkill – sends a signal to kill, suspend or resume unfinished jobs (use a job ID of "0" to kill all your jobs).

• bmod – modifies job submission options of a job

• bpeek – displays the stdout and stderr of an unfinished job

• bstop – suspends unfinished jobs

• bresume – resumes one or more suspended jobs

• bswitch – switches unfinished jobs to another queue


LSF Job Submission Examples

• Example #1

% bsub /home/LSF/scripts/sleeper
Job <1234> is submitted to default queue <normal>
% bstop 1234
Job <1234> is being stopped
% bresume 1234
Job <1234> is being resumed

• Example #2

% bsub -q night -m "host1 host2" night_job
Job <1235> is submitted to queue <night>
% bmod -m "hostGroupA" 1235
Parameters of job <1235> are being changed


LSF Job Submission Examples (cont.d)

• Example #3

% bsub -i "/home/user1/in.dat" \
-o "/home/user1/out.dat" \
-e "/home/user1/error.dat" long_job

Job <1236> is submitted to default queue <normal>

% bswitch night 1236

Job <1236> is switched to queue <night>

• Example #4

% bsub -P Research -J Projection1 project_job

Job <1238> is submitted to default queue <normal>

% btop 1238

Job <1238> has been moved to position 1 from top

% bbot 1238

Job <1238> has been moved to position 1 from bottom


File Transfer Option (-f )

• Copies files between the local (submission) host and the remote (execution) host when there is no shared file system:

bsub -f "local_file operator [remote_file]"

The following operators can be used:

> Copies the local file to the remote file before the job starts

< Copies the remote file to the local file after the job completes

<< Appends the remote file to the local file after the job completes

>< or <> Copies the local file to the remote file before the job starts, and the remote file to the local file after the job completes


File Transfer Examples

• Submit myjob with input file /data/data2. After the job has completed, copy the output file out to /data/out2

bsub -f "/data/data2 > data2" \
-f "/data/out2 < out" myjob data2 out
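The direction and timing implied by each -f operator are easy to mix up, so the Python sketch below spells them out for a given specification. It is a reading aid, not an LSF interface: `describe_transfer` is a hypothetical name, file names are assumed to contain no spaces, and an omitted remote_file is assumed to default to the local file name.

```python
def describe_transfer(spec):
    """Return (when, source, destination, mode) tuples for one -f argument."""
    parts = spec.split()
    local, op = parts[0], parts[1]
    # Assumption: a missing remote_file means "same name as local_file".
    remote = parts[2] if len(parts) > 2 else local
    before = ("before job starts", local, remote, "copy")
    after = ("after job completes", remote, local, "copy")
    actions = {
        ">": [before],
        "<": [after],
        "<<": [("after job completes", remote, local, "append")],
        "><": [before, after],
        "<>": [before, after],
    }
    return actions[op]


print(describe_transfer("/data/data2 > data2"))
print(describe_transfer("/data/out2 < out"))
```

For the slide's example, the first call reports a copy of /data/data2 to data2 before the job starts, and the second a copy of out back to /data/out2 after the job completes.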


LSF Cluster Query Commands

• lsid

• lsinfo [-l][-r][-m][-t][resourcename ...]

• lshosts [-w|-l][-R "res_req"][hostname ...]

• lsload [-N|-E][-l][-R "res_req"] [hostname... ]

• lsmon [-i][-L logfile]

• bhosts [-w|-l][-R "res_req"][hostname|hostgroup]

• bmgroup [-r][hostgroup …]

• bqueues [-w|-l|-r][-m hostname|-m all] [queuename …]

• busers [username …|usergroup …| all]

• bugroup [-r][usergroup …]


View Cluster and Master Names

% lsid

Platform LSF 8.0, Jan 20 2011

Copyright 1992-2011 Platform Computing Corporation

My cluster name is cluster8

My master name is training1

View load sharing configuration information

% lsinfo
RESOURCE_NAME  TYPE     ORDER  DESCRIPTION
r15s           Numeric  Inc    15-second CPU run queue length
maxtmp         Numeric  Dec    Maximum /tmp space (Mbytes)
cpuf           Numeric  Dec    CPU factor
hname          String   N/A    Host name

TYPE_NAME
DEFAULT
LINUX
LINUXPPC64

MODEL_NAME  CPU_FACTOR  ARCHITECTURE
PC1133      23.10       x6_1189_PentiumIIICoppermine


Server Host Configuration Information

% lshosts
HOST_NAME  type     model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
training1  LINUX86  PC1133    23.1  16     2.0G    1.0G    Yes     (mg)
training8  LINUX64  Intel64   30.3  32     6.0G    2.0G    Yes     (mg)
training5  SOL32    Ultra450  25.0  8      1.0G    740M    Yes     ()
training2  LINUX86  PC1133    23.1  1      512M    256M    Yes     ()
training3  LINUX86  PC1133    23.1  1      512M    256M    Yes     ()
training9  HPPA     HP735     -     -      -       -       No      ()


Detailed Server Host Configuration Information

% lshosts -l training3

HOST_NAME: training3
type    model   cpuf   ncpus  ndisks  maxmem  maxswp  maxtmp  rexpri  server  nprocs  ncores  nthreads
X86_64  PC6000  116.1  1      1       2007M   4094M   40319M  0       Yes     1       1       1

RESOURCES: (mg)
RUN_WINDOWS: (always open)
LICENSES_ENABLED: (LSF_Base LSF_Manager LSF_Make)

LOAD_THRESHOLDS:
r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
-     3.5  -     -   -   -   -   -   -    -    -


Server Host Load Information

% lsload
HOST_NAME  status  r15s  r1m  r15m  ut   pg   ls  it     tmp   swp   mem
training8  ok      0.0   0.0  0.0   0%   0.0  1   11728  112M  114M  52M
training2  ok      0.3   0.7  0.1   9%   0.2  0   2467   330M  201M  290M
training5  ok      1.5   2.0  0.1   25%  2.9  5   0      130M  101M  90M
training1  ok      2.9   3.0  1.6   49%  6.9  2   0      127M  143M  35M
training3  ok      3.1   4.0  2.6   69%  9.9  6   0      117M  103M  25M

Detailed server host load information (including I/O information and external load indices)

% lsload -l
HOST_NAME  status  r15s  r1m  r15m  ut   pg   io   ls  it     tmp   swp   mem   licA
training8  ok      0.0   0.0  0.0   0%   0.0  3    0   11728  112M  114M  52M   12
training2  ok      0.3   0.7  0.1   9%   0.2  12   0   2467   330M  201M  290M  12
training5  ok      1.5   2.0  0.1   25%  2.9  21   5   0      130M  101M  90M   12
training1  ok      2.9   3.0  1.6   49%  6.9  80   2   0      127M  143M  35M   12
training3  ok      3.1   4.0  2.6   69%  9.9  381  6   0      117M  103M  25M   12


Current Load Information

% lsmon
Hostname: training1    Refresh Rate: 10 secs
HOST_NAME  status  r15s  r1m  r15m  ut   pg   ls  it     tmp   swp   mem
training8  ok      0.0   0.0  0.0   0%   0.0  1   11728  112M  114M  52M
training2  ok      0.3   0.7  0.1   9%   0.2  0   2467   330M  201M  290M
training5  ok      1.5   2.0  0.1   25%  2.9  5   0      130M  101M  90M
training1  ok      2.9   3.0  1.6   49%  6.9  2   0      127M  143M  35M
training3  ok      3.1   4.0  2.6   69%  9.9  6   0      117M  103M  25M


Server Host Information

% bhosts

HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV

training8 ok 2 32 10 5 4 1 0

training5 ok 1 8 7 6 0 1 0

training2 ok - 16 3 1 1 1 0

training1 ok - - 0 0 0 0 0

training3 closed - 1 0 0 0 0 0


Detailed Server Host Information

% bhosts -l training3

HOST training3

STATUS CPUF JL/U MAX NJOBS RUN SSUSP USUSP RSV DISPATCH_WINDOW

closed_Wind 10.3 - 1 0 0 0 0 0 12:00-15:00

CURRENT LOAD USED FOR SCHEDULING:

r15s r1m r15m ut pg io ls it tmp swp mem

Total 0.0 0.0 0.0 0% 0.0 1 1 11720 512M 256M 412M

Reserved 0.0 0.0 0.0 0% 0.0 0 0 0 0M 0M 0M

LOAD THRESHOLD USED FOR SCHEDULING:

r15s r1m r15m ut pg io ls it tmp swp mem

loadSched - - - - - - - - - - -

loadStop - - - - - - - - - - -


Queue Information

% bqueues
QUEUE_NAME  PRIO  STATUS       MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SUSP
priority    73    Open:Active  10   1     1     1     1      0     1    0
interact    70    Open:Active  1    -     -     -     5      4     1    0
night       40    Open:Inact   -    -     1     -     3      3     0    0
normal      30    Open:Active  -    1     -     -     10     4     4    2


Detailed Queue Information

% bqueues -l normal

QUEUE: normal
-- For normal low priority jobs. This is the default queue.

PARAMETERS/STATISTICS
PRIO  NICE  STATUS       MAX  JL/P  JL/H  NJOBS  PEND  RUN  SSUSP  USUSP  RSV
30    20    Open:Active  -    -     -     10     4     4    1      1      0

SCHEDULING PARAMETERS
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched  -     0.7  -     -   -   -   -   -   -    -    -
loadStop   -     2.0  -     -   -   -   -   -   -    -    -

USERS: all users
HOSTS: all


User and User Group Information

% busers all
USER/GROUP  JL/P  MAX  NJOBS  PEND  RUN  SSUSP  USUSP  RSV
default     -     1    -      -     -    -      -      -
user1       -     -    15     10    4    1      0      0
user6       1     4    24     20    4    0      0      0
userGroupA  1     -    35     25    7    3      0      0

View user group member information

% bugroup
GROUP_NAME  USERS
userGroupA  user2 user3 UnixAdminGrp
userGroupB  NISgroup NTgroup


bparams -a

• Displays all parameters configured in lsb.params

$ bparams -a

DEFAULT_QUEUE = normal

DEFAULT_HOST_SPEC = NULL

MBD_SLEEP_TIME = 20 (seconds)

SBD_SLEEP_TIME = 15 (seconds)

JOB_ACCEPT_INTERVAL = 1

PG_SUSP_IT = 180

CLEAN_PERIOD = 3600

MAX_JOB_NUM = 1000

MAX_SBD_FAIL = 3

HIST_HOURS = 5

DEFAULT_PROJECT = default


Queue Configuration: Example 1

• Queue used for short running jobs

Begin Queue

QUEUE_NAME = short # mandatory

DESCRIPTION = for short running jobs

ADMINISTRATORS = userGroupA

PRIORITY = 75

USERS = userGroupA engineer/ engineer

CPULIMIT = 2

RUNLIMIT = 5 10

CORELIMIT = 0

MEM = 800/100

SWP = 500/50

HOSTS = hostGroupC

End Queue


Queue Configuration: Example 2

Begin Queue

QUEUE_NAME = normal

DESCRIPTION = default queue for single CPU jobs

PRIORITY = 30

USERS = all

INTERACTIVE = NO

NICE = 20

MEMLIMIT = 204800 # 200MB of memory

PROCLIMIT = 1

HOSTS = all

End Queue

General submission queue


Queue Configuration: Example 3

Begin Queue

QUEUE_NAME = night

DESCRIPTION = used for jobs running at night

ADMINISTRATORS = lsfadmin user1

PRIORITY = 40

DISPATCH_WINDOW = (18:00-07:30)

RUN_WINDOW = (18:00-08:00)

RES_REQ = select[(type==SUNSOL && r1m<2.0)||\

(type==HPPA && r1m<1.0)]

HOSTS = all ~training4

End Queue

After hours queue


Queue Configuration: Example 4

Begin Queue

QUEUE_NAME = interactive

DESCRIPTION = default for interactive jobs

ADMINISTRATORS = user2 userGroupA

PRIORITY = 80

INTERACTIVE = ONLY

NEW_JOB_SCHED_DELAY = 0

HOSTS = hostGroupB+5 hostGroupA+2 others

End Queue

Interactive job queue


Disabling a Queue

• Either remove the queue section and its configuration from the file

• Or comment out the Begin Queue and End Queue lines

• There is no need to comment out the queue parameters in between

# Begin Queue

QUEUE_NAME = interactive

DESCRIPTION = default for interactive jobs

ADMINISTRATORS = user2 userGroupA

PRIORITY = 80

INTERACTIVE = ONLY

NEW_JOB_SCHED_DELAY = 0

HOSTS = hostGroupB+5 hostGroupA+2 others

# End Queue
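Why commenting out only the Begin/End lines is enough can be illustrated with a toy section reader: parameters that are not enclosed by an active Begin Queue ... End Queue pair are simply never attached to a queue. The Python sketch below is a simplified model written for this explanation, not LSF's actual parser. `parse_queues` is a hypothetical name, and line continuations (such as the backslash in a multi-line RES_REQ) are not handled.

```python
def parse_queues(text):
    """Return {queue_name: {param: value}} for each active Queue section."""
    queues, current = {}, None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line == "Begin Queue":
            current = {}
        elif line == "End Queue":
            if current is not None and "QUEUE_NAME" in current:
                queues[current["QUEUE_NAME"]] = current
            current = None
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
        # Lines outside an active section are ignored.
    return queues


sample = """
Begin Queue
QUEUE_NAME = normal
PRIORITY = 30
End Queue

# Begin Queue
QUEUE_NAME = interactive
PRIORITY = 80
# End Queue
"""

print(list(parse_queues(sample)))  # ['normal']
```

In this model the `interactive` parameters are still read line by line, but with no open section they are discarded, which mirrors the behaviour described on the slide.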


lsb.applications - LSF Application Profile

• Application Encapsulation

– Application-specific job containers

– Maps common execution requirements for an application

– Centralized application job submission properties

– Different applications may be submitted through the same queue

– Minimize the number of queues

– Minimize system administration overhead for maintaining queues

– User manageable application profiles

– New configuration file: lsb.applications


lsb.applications - LSF Application Profile (cont.d)

• Profiles provide a central definition for application-specific LSF attributes, including:

– Pre/Post-exec and job starter

– Automatic job controls

– Process and processor limits

– Runtime hints/estimates, in addition to limits, for improved scheduling

– Re-runnable jobs

– Re-queue exit values

– Default resource requirements

– Control of job chunking

– Estimated runtime


lsb.applications - LSF Application Profile (cont.d)

• Example: execution requirements for the FLUENT application:

• bsub -app fluent -q overnight myjob

Begin Application

NAME = fluent

DESCRIPTION = FLUENT Version 6.2

CPULIMIT = 180/hostA # 3 hours of CPU time, normalized to host hostA

FILELIMIT = 20000

DATALIMIT = 20000 # job's data segment limit

CORELIMIT = 20000

PROCLIMIT = 5 # job processor limit

PRE_EXEC = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out

REQUEUE_EXIT_VALUES = 55 34 78

End Application