Gridengine Configuration Review
● Gridengine overview
● Our current setup
● The scheduler
● Scheduling policies
● Stats from the clusters
Gridengine Overview
● Accepts jobs from the outside world
● Puts jobs in a holding area until they can be run
● Sends jobs from the holding area to an execution device
● Manages running jobs
● Records details about finished jobs
Gridengine Overview (2)
● Four types of hosts
  – Execution: runs jobs.
  – Submit: jobs may be submitted from these hosts.
  – Master: schedules jobs.
  – Admin: cluster administration commands may be run from these hosts.
● A host can be of several types at once, but there is only one master (plus an optional hot spare).
● Could run everything on one host...silly but possible.
Queues (Cluster Queues)
● Container for a class of jobs
● Can define specific resources
  – large-memory machines
  – specific processor architecture
  – time restricted (runtime or time of day/week)
● Contain one or more execution hosts
● Can be preemptive
● Can contain subqueues
Queues (2)
● Queue instance
  – Each queue is bound to each of its execution hosts via a queue instance.
  – Each execution host can have multiple queue instances attached.
  – A queue instance can have one or more job slots.
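As a sketch, a cluster queue definition (as shown by `qconf -sq`) binds a host list and per-host slot counts; the queue name, hostgroup and node name here are hypothetical:

```
qname     all.q
hostlist  @allhosts
slots     2,[node01=4]
```

Each host in `@allhosts` gets a queue instance with two slots; the bracketed override gives `node01`'s instance four.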
Simple configuration
● One cluster queue
● Each execution host has one queue instance
● Jobs are scheduled in FIFO order
● This is the default configuration gridengine ships with
Our Hardware
● 4 clusters running gridengine
  – Lion: 64+ nodes (GX240)
  – Lutzow: 16 nodes (PE530)
  – Townhill: 34 nodes (PE1425, dual CPU)
  – Hermes: 24 nodes (PE1425, single CPU)
● 4 head nodes (1 per cluster)
● 1TB local home directories
● 1TB “scratch” space
Current setup
● All hosts are admin hosts
● A single “head node” is configured as submit/master
● Execution hosts have ssh blocked
● Users ssh onto the head node and submit jobs
  – In practice they tend to run scripts which submit jobs
  – Lots of jobs
  – Not all of them will run properly
The Scheduler Process
Prioritisation
● Prioritisation is based on
  – Entitlement
  – Urgency
  – Custom policies
● These generate a dispatch priority: a real number based on a combination of the above.
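A rough sketch of how the dispatch priority combines the three policies. The weight names mirror gridengine's `sched_conf` parameters `weight_ticket`, `weight_urgency` and `weight_priority`; the default values shown are assumptions, not our cluster's settings:

```python
def dispatch_priority(tickets_norm, urgency_norm, posix_norm,
                      w_ticket=0.01, w_urgency=0.1, w_priority=1.0):
    """Combine the three normalised policy inputs into one real number.

    tickets_norm: entitlement (share tree / functional tickets), 0..1
    urgency_norm: deadline / wait-time / resource urgency, 0..1
    posix_norm:   custom POSIX priority (qsub -p), 0..1
    """
    return (w_ticket * tickets_norm
            + w_urgency * urgency_norm
            + w_priority * posix_norm)
```

With the weights above, the custom POSIX priority dominates, then urgency, then entitlement; tuning the weights shifts the balance between the three policies.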
Entitlement
● Priority based on users/groups
● Can be explicit (user A's jobs run before user B's)
● Can allocate a ratio of resources (group A gets 60% of CPU usage, group B gets 40%)
● A share tree allows the allocation to be spread over a defined time period.
● Requires user/group information to be configured
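The 60/40 split above could be expressed as a share tree loaded with `qconf -Astree`; the group names are hypothetical and the node-file format below is a sketch of the share_tree syntax:

```
id=0
name=Root
type=0
shares=1
childnodes=1,2
id=1
name=groupA
type=0
shares=60
childnodes=NONE
id=2
name=groupB
type=0
shares=40
childnodes=NONE
```

The scheduler then steers actual usage towards the 60/40 ratio over the configured half-life period, rather than enforcing it instant by instant.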
Share tree example
Urgency
● Deadline contribution
  – Priority rises closer to the deadline specified at submission
● Wait-time contribution
  – Priority rises with time spent waiting
● Resource contribution
  – Can assign urgency to a resource (e.g. Matlab licences)
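The three contributions above sum to a job's urgency. A minimal sketch, assuming linear wait-time growth and a deadline term that rises as the deadline nears; the weight values are made up for illustration:

```python
def job_urgency(wait_seconds, seconds_to_deadline, resource_urgency,
                w_wait=0.1, w_deadline=3600.0):
    """Sum the wait-time, deadline and resource urgency contributions."""
    # Wait-time contribution: grows the longer the job sits pending.
    wait_contr = w_wait * wait_seconds
    # Deadline contribution: rises sharply as the deadline approaches.
    if seconds_to_deadline is None:
        deadline_contr = 0.0  # no deadline requested at submission
    else:
        deadline_contr = w_deadline / max(seconds_to_deadline, 1)
    # Resource contribution: e.g. urgency attached to Matlab licences.
    return resource_urgency + wait_contr + deadline_contr
```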
Custom
● Allows prioritisation based on site-specific requirements
● Run an arbitrary script which alters priority.
● Defaults to POSIX priority (like nice)
  – Users can lower priority
  – Admins can raise priority
Summary
● Can control job execution based on
  – Queues: assign specific execution hosts for specific tasks or users/groups. Queues can be calendar controlled.
  – Scheduler: prioritise jobs based on who submitted them or what resources they require.
Current setup
● Single queue containing all nodes in a cluster
● Limited user/group support (FC5)
● Allocates equal priority to each user with jobs in the pending queue
It's mostly downhill from here
Gathering job data
● Sun dbwriter
● A Java program runs over the accounting/reporting file and populates a PostgreSQL database (42GB footprint)
● Data from Dec/Jan until yesterday, “with holes”
● Difficult to analyse some jobs (parallel jobs, stopped jobs)
How many jobs
[Chart: job throughput per cluster (Hermes, Lion, Lutzow, Townhill)]
Hmm, that's a lot of short jobs
[Chart: % of jobs by run time (hours, buckets 0-1 to 9+) for Hermes, Lion, Lutzow, Townhill and the average]
That's really a lot of short jobs
● Remember all those scripts?
● How many of these jobs actually run for any length of time?
How many jobs (>3min)
[Chart: total jobs longer than 3 minutes, per cluster (Hermes, Lion, Lutzow, Townhill)]
Remove the <3min jobs
[Chart: % of jobs by runtime (hours, buckets 0-1 to 9+), short jobs excluded, for Hermes, Lion, Lutzow, Townhill and the average]
[Chart: % of system run time by job length (CPU hours, 1 to 24+) for Hermes, Lion, Lutzow, Townhill and the average]
[Chart: % of system time by job run length (CPU days, 1 to 28+) for Hermes, Lion, Lutzow, Townhill and the average]
[Chart: % of jobs by submission time of day for Hermes, Lion, Lutzow, Townhill and the average]
[Chart: % of jobs by wait time (hours, 1 to 24+) for Hermes, Lion, Lutzow, Townhill and the average]
[Chart: free slots by time of day for Hermes, Lion, Lutzow, Townhill and the average]
Tentative conclusions
● Could add more submit hosts and a backup scheduler for redundancy (virtualisation).
● Need to set up a queue to handle short jobs with quick turnaround.
● Also need a preemptable queue for longer-running jobs.
● User scripts can muddy the water; we can't assume quiet time for system admin tasks.
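A sketch of the short-job/preemption idea in queue-configuration terms. The queue names, slot counts and the 3-minute limit are illustrative; the mechanism uses the queue_conf `subordinate_list` and `h_rt` fields:

```
# short.q: quick turnaround, 3-minute hard runtime limit.
# When a short job occupies short.q on a host, long.q on that
# host is suspended (subordinate_list threshold of 1).
qname            short.q
hostlist         @allhosts
slots            1
h_rt             0:03:00
subordinate_list long.q=1

# long.q: no runtime limit; gets suspended while short.q is in use.
qname            long.q
hostlist         @allhosts
slots            2
h_rt             INFINITY
```

Short jobs then drain quickly without waiting behind multi-day jobs, and long jobs only pause briefly while a short job runs on their host.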