moabcon2012 - Transitioning from Grid Engine

Calcul Québec

frederick.lefebvre@calculquebec.ca
MoabCon - April 2012

MOAB: Transitioning from Grid Engine

Plan

- Colosse
- Rationale
- Transition
- Status update

Colosse

- Room for 56 racks
- Power: 1.1 MW
- Cooling: ~1.5 MW
- UPS and power generator for filesystems and servers

Colosse

- Cold air plenum (32 m²)
- Hot air core (25 m²)

Colosse

- Sun Constellation System deployed in 2009
- 960 diskless compute nodes
- 7680 Nehalem cores
- QDR InfiniBand only, full bisection
- 1 PB of Lustre storage

Colosse: Current architecture

- Everything is tied together with custom scripts
- Accounting is extracted from Grid Engine and moved to a SQL database
- Has been working well for 2 years...

Diagram labels: Provisioning; Scheduler + resource manager
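
The accounting step above can be sketched roughly as follows. This is only an illustration, not the site's actual script: it assumes the standard colon-separated Grid Engine accounting file, keeps just the first 14 fields described in accounting(5), and writes to a hypothetical `jobs` table in SQLite. The table name, the column name `group_name`, and the sample record are assumptions.

```python
import sqlite3

# First 14 colon-separated fields of a Grid Engine accounting record,
# in the order given by accounting(5). "group" is renamed "group_name"
# here because GROUP is an SQL keyword.
FIELDS = [
    "qname", "hostname", "group_name", "owner", "job_name", "job_number",
    "account", "priority", "submission_time", "start_time", "end_time",
    "failed", "exit_status", "ru_wallclock",
]


def parse_record(line):
    """Return a dict for one accounting line, or None for comments and short lines.

    Limitation: a field that itself contains ':' would be split incorrectly.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    parts = line.split(":")
    if len(parts) < len(FIELDS):
        return None
    return dict(zip(FIELDS, parts[:len(FIELDS)]))


def load_accounting(lines, conn):
    """Insert parsed records into the hypothetical `jobs` table; return the row count."""
    columns = ", ".join('"%s"' % f for f in FIELDS)
    conn.execute("CREATE TABLE IF NOT EXISTS jobs (%s)" % columns)
    rows = [r for r in map(parse_record, lines) if r is not None]
    placeholders = ", ".join(":" + f for f in FIELDS)
    conn.executemany("INSERT INTO jobs VALUES (%s)" % placeholders, rows)
    conn.commit()
    return len(rows)


# Hypothetical record, for illustration only.
SAMPLE = ("short.q:node001:grp1:alice:myjob:42:sge:0:"
          "1333000000:1333000100:1333003700:0:0:3600:extra")

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(load_accounting([SAMPLE], conn), "record(s) loaded")
```

Values are stored as text exactly as they appear in the file; a real import would likely convert timestamps and counters to numeric types.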

Grid Engine vs. Moab

Why switch?

Grid Engine vs. Moab

SGE

- Used on only 1 large Compute Canada system
- Unknown vendor commitment
- Fractured community
- Limited support available for large HPC deployments

Grid Engine vs. Moab

Moab

- Already well known to our users
- Single vendor
- Commercial support
- Strong community
- Known to scale

Colosse: basic plan

Diagram labels: Scheduler; Resource Manager; Torque; Scheduler + Resource Manager

Transition plan

- Implement the existing scheduling policy
- Train users and get them on the new scheduler
- Train staff to work with Moab
- Adapt/port our management scripts
- Give control of the cluster to Moab/Torque

Scheduling policy

- Priority based on share tree
- Dedicated nodes only
- Max 200 jobs per project
- 4 queues
  - test (15m, 16 cores) - 2 nodes
  - short (24h, 256 cores) - all nodes
  - med (48h, 128 cores) - all nodes
  - long (7 days, ? cores) - 120 nodes
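
To make the queue limits concrete, here is a minimal sketch that models the four queues as data and reports which ones would accept a job of a given wallclock time and core count. The `Queue` class and `eligible_queues` function are hypothetical names. The slide does not state the core limit of the long queue, so it is left as `None` and not checked. The node counts per queue (2, all, 120) are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Queue:
    name: str
    max_walltime_s: int
    max_cores: Optional[int]  # None: limit not stated on the slide, so not enforced


# Limits as listed on the slide.
QUEUES = [
    Queue("test", 15 * 60, 16),
    Queue("short", 24 * 3600, 256),
    Queue("med", 48 * 3600, 128),
    Queue("long", 7 * 24 * 3600, None),
]


def eligible_queues(walltime_s: int, cores: int) -> List[str]:
    """Return the names of queues whose limits the request fits within."""
    names = []
    for q in QUEUES:
        if walltime_s > q.max_walltime_s:
            continue
        if q.max_cores is not None and cores > q.max_cores:
            continue
        names.append(q.name)
    return names


if __name__ == "__main__":
    # A 1-hour, 256-core job is too long for test and too wide for med.
    print(eligible_queues(3600, 256))
```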

Scheduling policy - exceptions

- Analysts use override tickets on users' jobs
- Users can qualify for more cores
- No exception on maximum wallclock times
- BLCR is used to allow checkpoint/restart of serial jobs

Share Tree

- 80% of system reserved for special allocations
- 20% for groups without an allocation

Share tree (100%)
  Project1 (20%)
  Project2 (15%)
  Project9 (5%)
  ...
  Project10 (0.1%)
  Project11 (0.1%)
  Project13 (0.1%)
  ...

Share Tree

Share tree (100%)
  Special allocations (80%)
    Project1 (20%)
    Project2 (15%)
    Project9 (5%)
    ...
  Default allocations (20%)
    Project10 (0.1%)
    Project11 (0.1%)
    Project13 (0.1%)
    ...
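
As a rough illustration of how share-based priority behaves, the sketch below ranks projects by the gap between their target share and their recent share of usage. This is a deliberately simplified model, not the algorithm that Grid Engine or Moab actually implements. The function names and the core-hour figures are hypothetical; only the target percentages come from the slide.

```python
def usage_fractions(core_hours):
    """Convert per-project core-hours into fractions of total usage."""
    total = sum(core_hours.values())
    if total == 0:
        return {project: 0.0 for project in core_hours}
    return {project: hours / total for project, hours in core_hours.items()}


def rank_by_deficit(targets, core_hours):
    """Order projects from most under-served to most over-served.

    deficit = target share - observed usage share; a larger deficit
    means the project should be favoured by the scheduler.
    """
    used = usage_fractions(core_hours)
    deficit = {p: targets[p] - used.get(p, 0.0) for p in targets}
    return sorted(deficit, key=lambda p: deficit[p], reverse=True)


if __name__ == "__main__":
    # Target shares from the slide, expressed as fractions of the system.
    targets = {"Project1": 0.20, "Project2": 0.15, "Project10": 0.001}
    # Hypothetical recent usage in core-hours.
    core_hours = {"Project1": 500.0, "Project2": 100.0, "Project10": 0.0}
    print(rank_by_deficit(targets, core_hours))
```

In this example Project10 has used nothing, so it ranks first despite its small share, while Project1 has consumed far more than its 20% target and ranks last.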

Users' transition

- Much easier than expected for users
- Submit files are very similar
- Job submission is easy (qsub becomes msub), but there are more commands to learn to monitor jobs
- A lot of questions about the difference between the Torque and Moab commands

SGE:

    #!/bin/bash
    #$ ...
    #$ ...

    mpirun ...

Moab:

    #!/bin/bash
    #PBS ...
    #PBS ...

    mpirun ...
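
Because the two submit files differ mainly in their directive prefix, a first-pass conversion can be partly automated. The sketch below is an assumption-laden illustration, not a tool used at the site: it rewrites `#$` lines to `#PBS` only for a small set of options whose meaning carries over (`-N`, `-o`, `-e`, `-q`), maps `-l h_rt=H:MM:SS` to `-l walltime=H:MM:SS`, and flags everything else (for example parallel environment requests) for manual review. The function names and the `TODO(manual)` marker are hypothetical.

```python
import re

# Options assumed to keep the same flag and argument in both systems.
# Queue names passed to -q may still need to be changed by hand.
SAME_FLAGS = {"-N", "-o", "-e", "-q"}


def translate_line(line):
    """Translate one submit-script line from '#$' to '#PBS' where the mapping is known."""
    if not line.startswith("#$"):
        return line  # shebang, commands, and ordinary comments are left untouched
    args = line[2:].split()
    flag = args[0] if args else ""
    if flag in SAME_FLAGS:
        return "#PBS " + " ".join(args)
    if flag == "-l" and len(args) == 2:
        # SGE hard run-time limit -> Torque walltime (H:MM:SS form only).
        match = re.fullmatch(r"h_rt=(\d+:\d{2}:\d{2})", args[1])
        if match:
            return "#PBS -l walltime=" + match.group(1)
    # Anything else has no safe automatic mapping in this sketch.
    return "# TODO(manual): " + line


def translate_script(text):
    """Apply translate_line to every line of a submit script."""
    return "\n".join(translate_line(line) for line in text.splitlines())


if __name__ == "__main__":
    sge_script = "\n".join([
        "#!/bin/bash",
        "#$ -N myjob",
        "#$ -l h_rt=24:00:00",
        "#$ -pe mpi 16",
        "mpirun ./a.out",
    ])
    print(translate_script(sge_script))
```

Leaving unknown directives as flagged comments rather than guessing keeps the converted script from silently requesting the wrong resources.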

Staff's transition

- Harder than expected
- The workflow for handling users' issues will need to be reviewed
  - how to find the original submit file
  - where each process is running and how much memory it uses
  - ...
- More internal documentation will be required

Management scripts

- Old habits die hard...
- Accounting/reporting
  - Accounting data is read from Moab event files
  - Queue status
- Maintenance-related scripts
  - monitoring, account creation, node maintenance
  - prolog/epilog

Deployment - progress report

- Progressive deployment
- Grid Engine and Moab will live together for a couple of months

Deployment - progress report

- We built a different oneSIS image for Torque compute nodes
- Also rebuilt the MPI implementation with Torque support
- Rebooting a node in the Torque image switches it over to Moab

Deployment - progress report

- 10% of nodes controlled by Moab right now
- Open to all users to test their workflow

Deployment - progress report

The Moab partition will grow over the next weeks

Thank you

frederick.lefebvre@calculquebec.ca
MoabCon - April 2012
