J. Skovira 5/05 v1
Agenda
- Batch Scheduling Basics
- LoadLeveler basics
- LoadLeveler configuration
- Basic Commands
  - Job submission
  - Job cancellation
  - Job monitoring
- Job command files
- Advanced Functions
- Questions and Answers
Who Needs a Job Scheduler?
Single machine:
- Job 1, Job 2, …, Job N
- OS multi-tasks a single CPU: time-shared scheduling
HPC machine, many machines and users:
- User 1: Job 1, Job 2, …, Job N
- User 2: Job 1, Job 2, …, Job N
- User 3: Job 1, Job 2, …, Job N
- More jobs
- Parallel dimension
- A user may impact a distant job
Scheduler runs jobs according to:
- Scheduling theory
- Site-defined policy
Scheduling Terms
- HPC Cluster
- Resource manager
- Scheduler: starts jobs on specific resources at specific times
- Job Queue: Job 1, Job 2, Job 3, …
- Batch Scheduler
More Tasks for User?
- A job command file is a small set of job directives
- Job command files can be “borrowed” from samples
- Simple command files take predefined defaults
- Experienced users may enhance command files
Application Code
Job Meta Data
Once control is handed to the job, the scheduler is out of the way
LoadLeveler Components
- LoadLeveler Central Manager: Negotiator daemon
- Schedd machines
- Worker nodes: Startd daemon
- High Performance Switch
- IBM Cluster
Priority and Scheduling
Jobs arrive:
- from different users
- at different times
- in different job classes
- with different priorities

Job    Nodes  Time
Job A      8     2
Job B     12     1
Job C     10     1
Job D      4     1
Job E      4     5

LoadLeveler sorts the job queue: Job E, Job A, Job C, Job D, Job B
LoadLeveler schedules the jobs in queue order
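A minimal Python sketch of the sort step. The priority numbers are hypothetical, chosen only so that the result matches the order shown on this slide (E, A, C, D, B); the slide does not say how LoadLeveler computed that order.

```python
# Each entry: (job name, nodes, time, priority).
# Nodes and time come from the slide; the priority values are ASSUMED.
jobs = [
    ("A", 8, 2, 80),
    ("B", 12, 1, 20),
    ("C", 10, 1, 60),
    ("D", 4, 1, 40),
    ("E", 4, 5, 100),
]

# Highest priority first; sorted() is stable, so ties keep arrival order.
queue = sorted(jobs, key=lambda job: job[3], reverse=True)
order = [name for name, *_ in queue]
print(order)
```

Jobs are then considered for dispatch in this order, whatever order they arrived in.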
Reservation vs Backfill
Reservation (standard) scheduling
- Top job waits a short time for resources to free
- Defer if not available
Backfill
- Top job starts if it can
- If there are not enough resources, compute when they will become available and which resources the job will use
- Backfill jobs onto available nodes
Backfill is superior for parallel machines
Backfill
Job Queue

Job    Nodes  Time
Job A      8     2
Job B     12     1
Job C     10     1
Job D      4     1
Job E      4     5
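The backfill idea can be sketched in Python using the job queue from this slide. This is a simplified EASY-style backfill written for illustration, not LoadLeveler's actual algorithm. The 16-node cluster size, the queue order A to E, and the names `Job` and `easy_backfill` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    time: int  # wall clock estimate, arbitrary units


def easy_backfill(queue, total_nodes):
    """Return {job name: start time} for a simple EASY-style backfill."""
    if any(job.nodes > total_nodes for job in queue):
        raise ValueError("a job requests more nodes than the cluster has")
    waiting = list(queue)
    running = []  # (end_time, nodes) for each running job
    starts = {}
    now = 0
    while waiting:
        # Release the nodes of jobs that have finished by `now`.
        running = [(end, n) for end, n in running if end > now]
        free = total_nodes - sum(n for _, n in running)

        # Start jobs from the head of the queue while they fit.
        while waiting and waiting[0].nodes <= free:
            job = waiting.pop(0)
            starts[job.name] = now
            running.append((now + job.time, job.nodes))
            free -= job.nodes

        if waiting:
            top = waiting[0]
            # Compute when the top job can start (its reservation time).
            avail, shadow = free, now
            for end, n in sorted(running):
                avail += n
                shadow = end
                if avail >= top.nodes:
                    break
            spare = avail - top.nodes  # nodes the top job will not need

            # Backfill later jobs that cannot delay the top job.
            for job in list(waiting[1:]):
                if job.nodes > free:
                    continue
                ends_in_time = now + job.time <= shadow
                if ends_in_time or job.nodes <= spare:
                    waiting.remove(job)
                    starts[job.name] = now
                    running.append((now + job.time, job.nodes))
                    free -= job.nodes
                    if not ends_in_time:
                        spare -= job.nodes

            # Advance the clock to the next job completion.
            now = min(end for end, _ in running)
    return starts


queue = [Job("A", 8, 2), Job("B", 12, 1), Job("C", 10, 1),
         Job("D", 4, 1), Job("E", 4, 5)]
starts = easy_backfill(queue, total_nodes=16)
print(starts)
```

Under these assumptions, jobs D and E start at time 0 alongside job A on nodes that would otherwise sit idle, while job B still starts at time 2, exactly when it would have started without backfill.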
Job Command File Basics
Command file contains job “directives”
Basic items include:
- Shell
- Class
- Input/output directories
- Notification control
- Queue keyword
Two ways to specify the job executable:
- Executable keyword
- Script invocation after the queue keyword
Application Code
Job Command File
Basic Job Command File
#!/bin/ksh
# @ class = demo
# @ queue
perlspin2 > /tmp
More Job Command File Keywords
Requirements allow you to select:
- I/O directives
- Node requirements
- Wallclock limit
- Locally defined requirements
- Etc…
notification controls what LL sends about the job
- From never to always
notify_user tells LL where to send job info
- An email address
Serial Job Command File
#!/bin/ksh
# @ error = ./out/job2.$(jobid).err
# @ output = ./out/job2.$(jobid).out
# @ wall_clock_limit = 180
# @ class = demo
# @ notification = complete
# @ notify_user = [email protected]
# @ queue
perlspin2
Communication on the System
Each node has a connection to the high-performance switch
There are two ways to use the switch:
- IP mode: "unlimited" channels, slower communication performance
- User space (US) mode: limited number of channels, faster than IP mode
The mode can be selected in the job command file
Parallel Job Command File Keywords
node: How many nodes your job requires
tasks_per_node: How many tasks will run on each node
network: How your job will communicate
wall_clock_limit: An estimate of how long your job runs
The Network Keyword
network.protocol = network_type, usage, mode
protocol: MPI, LAPI, PVM
network_type: sn_single or sn_all for switch adapter
usage: shared or not_shared
mode: IP, US
An example:
# @ network.MPI = sn_single, shared, us
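To make the field order concrete, here is a small Python sketch that splits a network statement into its four parts and checks them against the values listed on this slide. The function name `parse_network` is hypothetical, and the accepted values are limited to those shown above; LoadLeveler itself may accept others.

```python
import re

# Allowed values as listed on this slide (an assumption that the list is complete).
PROTOCOLS = {"MPI", "LAPI", "PVM"}
NETWORK_TYPES = {"sn_single", "sn_all"}
USAGES = {"shared", "not_shared"}
MODES = {"IP", "US"}


def parse_network(line):
    """Parse '# @ network.<protocol> = <network_type>, <usage>, <mode>'."""
    match = re.match(r"\s*#\s*@\s*network\.(\w+)\s*=\s*(.+)$", line)
    if not match:
        raise ValueError("not a network statement: " + line)
    protocol = match.group(1).upper()
    fields = [field.strip() for field in match.group(2).split(",")]
    if len(fields) != 3:
        raise ValueError("expected network_type, usage, mode")
    network_type, usage, mode = fields
    mode = mode.upper()  # the slide writes the mode as both 'us' and 'US'
    if protocol not in PROTOCOLS:
        raise ValueError("unknown protocol: " + protocol)
    if network_type not in NETWORK_TYPES:
        raise ValueError("unknown network_type: " + network_type)
    if usage not in USAGES:
        raise ValueError("unknown usage: " + usage)
    if mode not in MODES:
        raise ValueError("unknown mode: " + mode)
    return {"protocol": protocol, "network_type": network_type,
            "usage": usage, "mode": mode}


parsed = parse_network("# @ network.MPI = sn_single, shared, us")
print(parsed)
```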
Parallel Job Command File
#!/bin/ksh
# @ job_type = parallel
# @ node = 1
# @ tasks_per_node = 4
# @ error = ./out/job3.$(jobid).err
# @ output = ./out/job3.$(jobid).out
# @ wall_clock_limit = 05:00
# @ class = demo
# @ notification = complete
# @ notify_user = [email protected]
# @ network.MPI = sn_all,shared,us
# @ queue
poe perlspin2
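Because a job command file is just a shell line, a list of "# @ keyword = value" directives, the queue keyword and the command to run, it is easy to generate one from a script. The Python sketch below shows one way to do that; the helper name `render_job_command_file` is hypothetical, and the keyword values are copied from the slide.

```python
def render_job_command_file(directives, command, shell="/bin/ksh"):
    """Build job command file text from (keyword, value) pairs."""
    lines = ["#!" + shell]
    for keyword, value in directives:
        lines.append(f"# @ {keyword} = {value}")
    lines.append("# @ queue")  # queue ends the keywords for this job step
    lines.append(command)      # script invocation after the queue keyword
    return "\n".join(lines) + "\n"


directives = [
    ("job_type", "parallel"),
    ("node", 1),
    ("tasks_per_node", 4),
    ("wall_clock_limit", "05:00"),
    ("class", "demo"),
    ("network.MPI", "sn_all,shared,us"),
]
text = render_job_command_file(directives, "poe perlspin2")
print(text)
```

The resulting text can be written to a file and handed to llsubmit in the same way as a hand-written command file.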
Basic LoadLeveler Commands
llsubmit – submits a job to LoadLeveler
llcancel – cancels a submitted job
llq – queries the status of jobs in the job queue
llstatus – queries the status of machines in the cluster
llq example
v01n08:/u/skoviraj $ llsubmit mybasic.cmd
llsubmit: The job "v01n08.vendor.pok.ibm.com.205" has been submitted

v01n08:/u/skoviraj $ llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
v01n08.204.0             skoviraj   11/11 22:29 R  50  No_Class     v01n02
v01n08.205.0             skoviraj   11/11 22:30 R  50  No_Class     v01n02
v01n08.203.0             skoviraj   11/11 22:28 I  50  No_class

3 job steps in queue, 1 waiting, 0 pending, 2 running, 0 held
llstatus example
v01n08:/u/skoviraj/suspender1.0/suspender_stuff $ llstatus v01n02
Name                      Schedd InQ Act Startd Run LdAvg Idle Arch  OpSys
v01n02.vendor.pok.ibm.com Avail    0   0 Run      1 0.00  9999 R6000 AIX43

v01n08:/u/skoviraj/suspender1.0/suspender_stuff $ llstatus | more
Name                      Schedd InQ Act Startd Run LdAvg Idle Arch  OpSys
v01n01.vendor.pok.ibm.com Avail    0   0 Idle     0 0.05  9999 R6000 AIX43
v01n02.vendor.pok.ibm.com Avail    0   0 Run      1 0.00  9999 R6000 AIX43
v01n03.vendor.pok.ibm.com Avail    0   0 Idle     0 0.00  9999 R6000 AIX43
v01n04.vendor.pok.ibm.com Avail    0   0 Idle     0 0.00  9999 R6000 AIX43
v01n05.vendor.pok.ibm.com Avail    0   0 Idle     0 0.02  9999 R6000 AIX43
v01n06.vendor.pok.ibm.com Avail    0   0 Idle     0 0.05  9999 R6000 AIX43
v01n07.vendor.pok.ibm.com Avail    1   0 Idle     0 0.06   155 R6000 AIX43
v01n08.vendor.pok.ibm.com Avail    1   0 Idle     0 0.00    83 R6000 AIX43
v01n09.vendor.pok.ibm.com Avail    0   0 Idle     0 0.00  9999 R6000 AIX43
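Output in this column layout is easy to post-process. The Python sketch below parses a few rows copied from the llstatus example and lists the machines whose Startd is Idle. It assumes the text contains only the header line and machine rows; any summary lines a real llstatus prints would need to be filtered out first. The name `parse_llstatus` is hypothetical.

```python
SAMPLE = """\
Name                      Schedd InQ Act Startd Run LdAvg Idle Arch  OpSys
v01n01.vendor.pok.ibm.com Avail    0   0 Idle     0 0.05  9999 R6000 AIX43
v01n02.vendor.pok.ibm.com Avail    0   0 Run      1 0.00  9999 R6000 AIX43
v01n07.vendor.pok.ibm.com Avail    1   0 Idle     0 0.06   155 R6000 AIX43
"""


def parse_llstatus(text):
    """Turn header + whitespace-separated rows into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


machines = parse_llstatus(SAMPLE)
# Machines whose Startd daemon reports Idle, i.e. not running a job.
idle = [m["Name"] for m in machines if m["Startd"] == "Idle"]
print(idle)
```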
llctl Examples
llctl -h hostname command
Useful Commands:
reconfig - Forces all daemons to reread the configuration files.
start - Starts the LoadLeveler daemons on the specified machine.
stop - Stops the LoadLeveler daemons on the specified machine.
Commands sometimes used:
flush - Terminates running jobs on this machine and places them back in the Idle state.
recycle - Stops all LoadLeveler daemons and restarts them.
llctl Example
drain [schedd|startd [classlist |allclasses]]
With no options: (1) no more LoadLeveler jobs can begin running on this machine, (2) no more LoadLeveler jobs can be submitted through this machine.
When you issue drain schedd, the following happens: (1) the schedd machine accepts no more LoadLeveler jobs for submission. (2) jobs in the Starting or Running state in the queue are allowed to continue running. (3) jobs in the Idle state in the schedd queue are drained
When you issue drain startd, the following happens: (1) the startd machine accepts no more LoadLeveler jobs to be run (2) jobs already running on the startd machine are allowed to complete.
More LoadLeveler Commands
llclass - returns information about available classes
llprio - changes the user priority of a job step
llclass Example
v60n129:/u/skoviraj $ llclass -l X_Class
=============== Class X_Class ===============
Name: X_Class
Priority: 0
Exclude_Users:
Include_Users:
Exclude_Groups:
Include_Groups:
Admin:
NQS_class: F
NQS_submit:
NQS_query:
Max_processors: -1
Maxjobs: -1
Resource_requirement:
Class_comment:
Class_ckpt_dir:
Ckpt_limit: undefined, undefined
Wall_clock_limit: 11+13:46:39, 11+13:46:39 (999999 seconds, 999999 seconds)
Job_cpu_limit: undefined, undefined
…

v60n129:/u/skoviraj $ llclass
Name            MaxJobCPU      MaxProcCPU     Free  Max   Description
                d+hh:mm:ss     d+hh:mm:ss     Slots Slots
--------------- -------------- -------------- ----- ----- ---------------------
inter_class     undefined      undefined      192   192
X_Class         undefined      undefined      192   192
llprio Example
v01n07:/u/skoviraj/suspender1.0/suspender_stuff $ llq
Id            Owner     Submitted    ST PRI Class     Running On
v01n07.137.0  skoviraj  11/11 22:51  I  50  No_class
1 job steps in queue, 1 waiting, 0 pending, 0 running, 0 held

v01n07:/u/skoviraj/suspender1.0/suspender_stuff $ llprio -p 100 v01n07.137.0
llprio: Priority command has been sent to the central manager.

v01n07:/u/skoviraj/suspender1.0/suspender_stuff $ llq
Id            Owner     Submitted    ST PRI Class     Running On
v01n07.137.0  skoviraj  11/11 22:51  I  100 No_class
1 job steps in queue, 1 waiting, 0 pending, 0 running, 0 held
Advanced Topics
Job Preemption
Job Checkpointing
Submit filter
LoadLeveler APIs (data access, scheduling)
Workload Manager (WLM) integration
Advance Reservation
Consumable resource control
Job Suspension
1. 4 Node job runs
2. 4 Node job suspended
3. 16 way job runs
4. 16 way job completes
5. 4 way job restarts
Job Checkpoint
1. 4 Node job runs
2. 4 Node job checkpoints and ends; 4 Node job state saved (GPFS)
3. 16 way job runs
4. 16 way job completes
5. 4 way job restarts from saved state
Submit Filter
use strict;
use warnings;

# Submit filter: copies a job command file from STDIN to STDOUT and adds
# a network keyword to any job step that does not already have one.
my $NetKey = 0;
while (my $value = <STDIN>) {
    chomp $value;
    if ($value =~ /network/) {    # If we find the network keyword...
        $NetKey = 1;              # remember it!
    }
    if ($value =~ /queue/) {      # If at the end of LL keywords for this job step...
        if (!$NetKey) {           # if no network keyword...
            # Add one which uses the switch
            print "# @ network.MPI = sn_all,not_shared,US\n";
        }
        $NetKey = 0;              # Reset network keyword memory
    }
    print "$value\n";             # Copy a single LL cmd file line to the new cmd file
}
Tips for Efficient Job Processing
Assumptions:
- One task per CPU
- Classes configured
Get your job to the TOP of the queue:
- Short run
- Small number of nodes
- Use IP communication over the switch
- Priority?
- Submit during low use periods (evening)
These are FREE! All of the tips above (except priority) impact no other job.
More Tips for Efficient Job Processing
Allow your job to run as QUICKLY as possible:
Balance node operations
Keep data entirely in physical memory
Use processors of similar types (system admin?)
Use distributed data load and store
Profile your applications for efficient compiler use
This could be an entirely new presentation!