Gridengine Configuration Review
● Gridengine overview
● Our current setup
● The scheduler
● Scheduling policies
● Stats from the clusters
Gridengine Overview
● Accepts jobs from the outside world
● Puts jobs in a holding area until they can be run
● Sends jobs from the holding area to an execution device
● Manages running jobs
● Records details about finished jobs
Gridengine Overview (2)
● Four types of hosts
  – Execution: runs jobs.
  – Submit: jobs may be submitted from these hosts.
  – Master: schedules jobs.
  – Admin: cluster administration commands may be run from these hosts.
● A host can be of several types at once, but there is only one master (plus an optional hot spare).
● Could run everything on one host...silly but possible.
Queues (Cluster Queues)
● Container for a class of jobs
● Can define specific resources
  – large-memory machines
  – specific processor architecture
  – time restricted (runtime or time of day/week)
● Contain one or more execution hosts
● Can be preemptive
● Can contain subqueues
Queues (2)
● Queue instance
  – Each queue is bound to each of its execution hosts via a queue instance.
  – Each execution host can have multiple queue instances attached.
  – A queue instance can have one or more job slots.
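As a sketch, a cluster queue definition (as shown by `qconf -sq`) binds a host list and per-host slot counts; the queue name, hostgroup and node name here are hypothetical:

```
qname     all.q
hostlist  @allhosts
slots     2,[node01=4]
```

Each host in `@allhosts` gets a queue instance with two slots; the bracketed override gives `node01`'s instance four.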
Simple configuration
● One cluster queue
● Each execution host has one queue instance
● Jobs are scheduled in FIFO order
● This is the default configuration gridengine ships with
Our Hardware
● 4 clusters running gridengine
  – Lion: 64+ nodes (GX240)
  – Lutzow: 16 nodes (PE530)
  – Townhill: 34 nodes (PE1425, dual CPU)
  – Hermes: 24 nodes (PE1425, single CPU)
● 4 head nodes (1 per cluster)
● 1TB local home directories
● 1TB “scratch” space
Current setup
● All hosts are admin hosts
● A single “head node” is configured as submit/master
● Execution hosts have ssh blocked
● Users ssh onto the head node and submit jobs
  – In practice they tend to run scripts which submit jobs
  – Lots of jobs
  – Not all of them will run properly
The Scheduler Process
Prioritisation
● Prioritisation is based on
  – Entitlement
  – Urgency
  – Custom policies
● These generate a dispatch priority: a real number based on a combination of the above.
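A rough sketch of how the dispatch priority combines the three policies. The weight names mirror gridengine's `sched_conf` parameters `weight_ticket`, `weight_urgency` and `weight_priority`; the default values shown are assumptions, not our cluster's settings:

```python
def dispatch_priority(tickets_norm, urgency_norm, posix_norm,
                      w_ticket=0.01, w_urgency=0.1, w_priority=1.0):
    """Combine the three normalised policy inputs into one real number.

    tickets_norm: entitlement (share tree / functional tickets), 0..1
    urgency_norm: deadline / wait-time / resource urgency, 0..1
    posix_norm:   custom POSIX priority (qsub -p), 0..1
    """
    return (w_ticket * tickets_norm
            + w_urgency * urgency_norm
            + w_priority * posix_norm)
```

With the weights above, the custom POSIX priority dominates, then urgency, then entitlement; tuning the weights shifts the balance between the three policies.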
Entitlement
● Priority based on users/groups
● Can be explicit (user A's jobs run before user B's)
● Can allocate a ratio of resources (group A gets 60% of CPU usage, group B gets 40%)
● A share tree allows the allocation to be spread over a defined time period.
● Requires user/group information to be configured
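The 60/40 split above could be expressed as a share tree loaded with `qconf -Astree`; the group names are hypothetical and the node-file format below is a sketch of the share_tree syntax:

```
id=0
name=Root
type=0
shares=1
childnodes=1,2
id=1
name=groupA
type=0
shares=60
childnodes=NONE
id=2
name=groupB
type=0
shares=40
childnodes=NONE
```

The scheduler then steers actual usage towards the 60/40 ratio over the configured half-life period, rather than enforcing it instant by instant.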
Share tree example
Urgency
● Deadline contribution
  – Priority rises closer to the deadline specified at submission
● Wait-time contribution
  – Priority rises with time spent waiting
● Resource contribution
  – Can assign urgency to a resource (e.g. Matlab licences)
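The three contributions above sum to a job's urgency. A minimal sketch, assuming linear wait-time growth and a deadline term that rises as the deadline nears; the weight values are made up for illustration:

```python
def job_urgency(wait_seconds, seconds_to_deadline, resource_urgency,
                w_wait=0.1, w_deadline=3600.0):
    """Sum the wait-time, deadline and resource urgency contributions."""
    # Wait-time contribution: grows the longer the job sits pending.
    wait_contr = w_wait * wait_seconds
    # Deadline contribution: rises sharply as the deadline approaches.
    if seconds_to_deadline is None:
        deadline_contr = 0.0  # no deadline requested at submission
    else:
        deadline_contr = w_deadline / max(seconds_to_deadline, 1)
    # Resource contribution: e.g. urgency attached to Matlab licences.
    return resource_urgency + wait_contr + deadline_contr
```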
Custom
● Allows prioritisation based on site-specific requirements
● Run an arbitrary script which alters priority.
● Defaults to POSIX priority (like nice)
  – Users can lower priority
  – Admins can raise priority
Summary
● Can control job execution based on
  – Queues: assign specific execution hosts for specific tasks or users/groups. Queues can be calendar controlled.
  – Scheduler: prioritise jobs based on who submitted them or what resources they require.
Current setup
● Single queue containing all nodes in a cluster
● Limited user/group support (FC5)
● Allocates equal priority to each user with jobs in the pending queue
It's mostly downhill from here
Gathering job data
● Sun dbwriter
● A Java program runs over the accounting/reporting file and populates a PostgreSQL database (42GB footprint)
● Data from Dec/Jan until yesterday, “with holes”
● Difficult to analyse some jobs (parallel jobs, stopped jobs)
How many jobs
[Chart: job throughput per cluster (Hermes, Lion, Lutzow, Townhill)]
Hmm, that's a lot of short jobs
[Chart: % of jobs by run time (hours, buckets 0-1 to 9+) for Hermes, Lion, Lutzow, Townhill and the average]
That's really a lot of short jobs
● Remember all those scripts?
● How many of these jobs actually run for any length of time?
How many jobs (>3min)
[Chart: total jobs longer than 3 minutes, per cluster (Hermes, Lion, Lutzow, Townhill)]
Remove the <3min jobs
[Chart: % of jobs by runtime (hours, buckets 0-1 to 9+), short jobs excluded, for Hermes, Lion, Lutzow, Townhill and the average]
[Chart: % of system run time by job length (CPU hours, 1 to 24+) for Hermes, Lion, Lutzow, Townhill and the average]
[Chart: % of system time by job run length (CPU days, 1 to 28+) for Hermes, Lion, Lutzow, Townhill and the average]
[Chart: % of jobs by submission time of day for Hermes, Lion, Lutzow, Townhill and the average]
[Chart: % of jobs by wait time (hours, 1 to 24+) for Hermes, Lion, Lutzow, Townhill and the average]
[Chart: free slots by time of day for Hermes, Lion, Lutzow, Townhill and the average]
Tentative conclusions
● Could add more submit hosts and a backup scheduler for redundancy (virtualisation).
● Need to set up a queue to handle short jobs with quick turnaround.
● Also need a preemptable queue for longer-running jobs.
● User scripts can muddy the water; we can't assume quiet time for system admin tasks.
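A sketch of the short-job/preemption idea in queue-configuration terms. The queue names, slot counts and the 3-minute limit are illustrative; the mechanism uses the queue_conf `subordinate_list` and `h_rt` fields:

```
# short.q: quick turnaround, 3-minute hard runtime limit.
# When a short job occupies short.q on a host, long.q on that
# host is suspended (subordinate_list threshold of 1).
qname            short.q
hostlist         @allhosts
slots            1
h_rt             0:03:00
subordinate_list long.q=1

# long.q: no runtime limit; gets suspended while short.q is in use.
qname            long.q
hostlist         @allhosts
slots            2
h_rt             INFINITY
```

Short jobs then drain quickly without waiting behind multi-day jobs, and long jobs only pause briefly while a short job runs on their host.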