Ian C. Smith Experiences with running MATLAB jobs on a power-saving Condor Pool


Ian C. Smith

Experiences with running MATLAB jobs on a power-saving Condor Pool

University of Liverpool Condor Pool

Contains around 300 machines running the University’s Managed Windows (XP) Service.

Most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots / machine.

Software updates via a weekly re-imaging process.

Single combined submit host / central manager running on Sun V440 SMP server.

Restricted access to submit host for registered Condor users.

Currently running Condor 7.0.2 (moving to 7.2.x soon).

Policy: during office hours, run jobs only after at least 10 minutes of inactivity and with a low load average; outside office hours, run jobs at any time.
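The job-start policy above can be modelled as a small decision function. This is only an illustrative sketch in Python, not the pool's actual Condor configuration: the 10-minute idle limit comes from the slide, while the load threshold of 0.3 and the names MachineState and may_start_job are assumptions made for the example.

```python
from dataclasses import dataclass

MIN_IDLE_MINUTES = 10      # from the slide
LOW_LOAD_THRESHOLD = 0.3   # assumed: the slide only says "low load average"


@dataclass
class MachineState:
    idle_minutes: float     # time since last keyboard/mouse activity
    load_average: float     # load not caused by Condor jobs
    in_office_hours: bool   # True during office hours


def may_start_job(m: MachineState) -> bool:
    """Return True if the stated policy would allow a job to start."""
    if not m.in_office_hours:
        # Outside office hours jobs may run at any time.
        return True
    # During office hours both conditions must hold.
    return (m.idle_minutes >= MIN_IDLE_MINUTES
            and m.load_average < LOW_LOAD_THRESHOLD)
```

For example, may_start_job(MachineState(5, 0.0, True)) is False because the machine has been idle for less than 10 minutes during office hours.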

MATLAB advantages

Originally developed for linear algebra algorithm development, but now contains many built-in functions geared to different disciplines, divided into toolboxes.

Intuitive interactive environment allows rapid code development.

Simple but powerful file I/O: save <filename>, load <filename> (useful for checkpointing).

Allows users to create their own functions stored as M-files.

“Standalone” applications can be built from M-files:
    can run on platforms without MATLAB installed
    do not need a licence to be able to run
    can include all toolbox functions

APIs available for FORTRAN and C codes (“MEX files”)

MATLAB disadvantages

Even standalone applications can run slower than equivalent C or FORTRAN implementations.

Standalone applications aren’t quite what they may seem:
    more than just an .exe: several files need to be packaged and deployed
    need access to MATLAB run-time libraries, usually via the MATLAB Component Runtime (150 MB self-extracting .exe)
    luckily we have MATLAB pre-installed on all PCs in the Condor pool (originally used a network drive)

Run-time errors can be difficult to trace when MATLAB jobs are run under Condor:
    need to run under Condor on local PC
    configure with USE_VISIBLE_DESKTOP=True to see pop-up messages

Jobs submitted in a UNIX environment but code developed under Windows.

Minor MATLAB irritations

Output files occasionally go missing:
    specify all required files using transfer_output_files
    identify problem jobs with condor_q -held
    resubmit with condor_release -all

Jobs sometimes run “forever”:
    use condor_vacate to move job to another machine
    less of a problem during term time as jobs usually get evicted by logins

Difficult to reproduce these problems:
    happen quite rarely (< 1 in ~1000 jobs)
    many jobs based on stochastic methods

MATLAB Research Applications

Predicting the spread of avian influenza outbreaks in poultry flocks (Veterinary Clinical Science).

Modelling of E. coli propagation in dairy cattle (Veterinary Clinical Science).

Testing of parallel genetic algorithms in a complex classification system (Electrical Engineering and Electronics).

Simulation of the infection of a bacterial cell by a virus (Mathematical Sciences).

Modelling the effects of radiotherapy on normal tissue using 3D voxel arrays (Medical Imaging and Radiotherapy).

Power-saving at Liverpool

Have around 2 000 centrally managed PCs across campus which were powered up overnight, at weekends and during vacations.

Original power-saving policy was to power off machines after 30 minutes of inactivity; the current policy is to hibernate them after 10 minutes of inactivity.

Policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a.

Makes extensive use of PowerMAN system from Data Synergy comprising:
    service which forces machines into a low-power state and reports machine activity to Management Reporting Platform
    Management Reporting Platform: central server from where usage stats can be retrieved and viewed via a web browser
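As a rough cross-check of the savings figures on this slide, the quoted hours and energy can be related arithmetically. The sketch below uses only the numbers from the slide; the derived values (about 100 W per idle PC, and an electricity price of roughly 10 to 12 pence per kWh) are implied by those numbers rather than stated in the talk, and the calculation assumes the weekly saving applies to all 52 weeks of the year.

```python
# Figures quoted on the slide.
HOURS_PER_WEEK = (200_000, 250_000)   # idle machine-hours avoided per week
MWH_PER_WEEK = (20, 25)               # equivalent energy per week
SAVING_GBP_PER_YEAR = 125_000         # estimated annual saving
WEEKS_PER_YEAR = 52                   # assumption: saving applies every week


def implied_watts(hours: float, mwh: float) -> float:
    """Average power draw per idle PC implied by the hours/energy pair."""
    return mwh * 1_000_000 / hours    # Wh divided by h gives W


def implied_price_per_kwh(mwh_per_week: float) -> float:
    """Electricity price (GBP per kWh) implied by the annual saving."""
    kwh_per_year = mwh_per_week * 1_000 * WEEKS_PER_YEAR
    return SAVING_GBP_PER_YEAR / kwh_per_year


watts = [implied_watts(h, e) for h, e in zip(HOURS_PER_WEEK, MWH_PER_WEEK)]
prices = [implied_price_per_kwh(e) for e in MWH_PER_WEEK]

print(watts)                           # implied watts per idle PC
print([round(p, 3) for p in prices])   # implied GBP per kWh
```

Both ends of the quoted range give the same per-PC power figure, so the hours and MWh figures are mutually consistent.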

Adapting Condor for use with power-saving PCs

Two main problems:
    how to ensure Condor jobs are not evicted by hibernating/powered-off PCs
    how to wake up dormant PCs to run Condor jobs on-demand

Originally used Microsoft system service to power-down PCs after 30 min inactivity:
    runs .bat file which checks if a user is logged in and shuts machine down if not
    doesn’t detect owner of Condor job as a logged-in user
    need to check for presence of condor_exe.bat

PowerMAN service now prevents job eviction:
    can provide PowerMAN with a list of “protected programs”
    ensures that system remains active if a protected program is running
    include condor_starter process as a protected program (only present while a Condor job is running).
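The “protected programs” rule can be illustrated with a short Python sketch. This is a model of the rule as described on the slide, not PowerMAN's implementation or API: the function name may_enter_low_power, the process name condor_starter.exe and the default 10-minute idle limit are assumptions chosen for the example.

```python
# Assumed Windows process name for the Condor starter; the slide only
# says that the condor_starter process is listed as protected.
PROTECTED_PROGRAMS = {"condor_starter.exe"}


def may_enter_low_power(running_processes, idle_minutes,
                        idle_limit=10, protected=PROTECTED_PROGRAMS):
    """Return True if the machine may hibernate under the modelled rule.

    running_processes: iterable of process image names on the machine.
    idle_minutes: minutes since the last user activity.
    """
    running = {name.lower() for name in running_processes}
    guarded = {name.lower() for name in protected}
    if running & guarded:
        # A protected program (e.g. a running Condor job) keeps the PC awake.
        return False
    return idle_minutes >= idle_limit
```

Because condor_starter exists only while a job is running, listing it as protected keeps a machine awake for exactly the lifetime of the job and no longer.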

Adapting Condor for use with power-saving PCs

Wake-on-LAN (“WoL”) used to bring hibernating machines back to full power:
    NICs must remain powered up during hibernation/power-off
    NICs must be capable of waking machines on receipt of a “magic packet”
    network must be able to route “magic packets”

A cron job on the submit host examines the state of the queue (condor_q) and pool (condor_status):
    if more idle jobs in queue than Unclaimed machines then need to wake up hibernating machines
    find number of powered-up machines in each “teaching centre” (classroom)
    estimate the number of hibernating machines in each teaching centre from total number of machines in each
    sort centres from highest number of available machines to lowest
    wake up centres in turn until sufficient machines woken to meet the demand (or all centres woken up)
    MAC addresses of machines are stored in files sorted according to teaching centre (needed for Wake-on-LAN)
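The wake-up procedure on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual cron script: the Centre class, the function names and the choice to rank centres by their estimated number of hibernating machines are assumptions (the slide says only “available machines”). The magic packet layout (six 0xFF bytes followed by the target MAC address repeated 16 times) is the standard Wake-on-LAN format; the broadcast address and UDP port 9 are conventional defaults, not values from the talk.

```python
import socket
from dataclasses import dataclass, field


@dataclass
class Centre:
    name: str
    total_machines: int
    powered_up: int                       # e.g. counted from condor_status
    macs: list = field(default_factory=list)

    @property
    def estimated_hibernating(self) -> int:
        # Estimate only: some machines may be broken or switched off.
        return max(self.total_machines - self.powered_up, 0)


def machines_needed(idle_jobs: int, unclaimed: int) -> int:
    """Shortfall between idle jobs and Unclaimed machines."""
    return max(idle_jobs - unclaimed, 0)


def choose_centres(centres, needed: int):
    """Pick centres, largest estimated yield first, until demand is met."""
    chosen, woken = [], 0
    ranked = sorted(centres, key=lambda c: c.estimated_hibernating,
                    reverse=True)
    for centre in ranked:
        if woken >= needed:
            break
        if centre.estimated_hibernating == 0:
            continue
        chosen.append(centre)
        woken += centre.estimated_hibernating
    return chosen


def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet for a MAC like 00:11:22:33:44:55."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 6-byte MAC address: {mac!r}")
    return b"\xff" * 6 + raw * 16


def send_magic_packet(mac: str, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP (address and port are assumed)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))
```

A cron-driven script would call machines_needed with counts parsed from condor_q and condor_status, pass the result to choose_centres, and then call send_magic_packet for every MAC address in each chosen centre. Since whole centres are woken at once, this can wake more machines than strictly needed.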

Automatic wake up issues

Assumes that any job can run on any machine:
    users cannot choose particular teaching centres or machines in their job Requirements
    ideally, pool needs to be homogeneous
    errors in Requirements specification can cause severe problems (machines repeatedly wake up then hibernate)
    cron now includes a “sanity check” for this

Large clusters of jobs can cause the Condor scheduler to become overloaded:
    condor_q times out so cron cannot determine queue state
    only a transient problem: load eventually drops off and condor_q responds again

Can only estimate number of hibernating machines in each centre

May wake up more machines than needed

Automatic wake up in action – Condor pool machine statistics

Automatic wake up in action – PowerMAN statistics

Recent and Future Developments

Recently moved to a policy of hibernating machines after 10 minutes of inactivity:
    submit host / central manager needs to work harder to get jobs running before recently woken machines go back to hibernation
    move execute hosts from Owner to Unclaimed state after just 5 minutes idle
    update activity timer every 1 minute (default is 5 minutes)
    increase number of scheduler and negotiator cycles using SCHEDD_INTERVAL=60, NEGOTIATOR_INTERVAL=60
    around 25 % of machines still hibernate after first wakeup
    see a ramp up in machines running Condor jobs over about an hour
    little impact on Condor users
    energy wastage offset by savings with user logouts

Recent and Future Developments

Migrating to Condor 7.2 shortly:
    has some interesting power-management features
    automatic power-down on execute hosts could provide a useful “safety net” but PowerMAN likely to remain primary power management tool
    can retain records of ClassAds of machines in low-power state
        could be useful in matchmaking jobs to powered-down machines
        matchmaking logic already in Condor
        nice if Condor could use this to provide a list of machines to wake up on demand ... and wake them up with condor_wakeup ?
        would like to ensure that powered-down machines are still out there (not broken, permanently turned off, not listening etc)
        also useful to see powered-off machines represented in condor_status output

Couple of extra “wishes”:
    allow jobs to claim all slots on a machine (useful if they have large memory requirements)
    provide a “logged-in user” machine ClassAd attribute

Further Information

http://www.liv.ac.uk/e-science/condor

[email protected]