Ian C. Smith* Harvesting unused clock cycles with Condor *Advanced Research Computing The University of Liverpool


Ian C. Smith*

Harvesting unused clock cycles with Condor

*Advanced Research Computing

The University of Liverpool


Overview

what is Condor ?

High Performance versus High Throughput Computing

Condor fundamentals

setting up and running a Condor Pool

The University of Liverpool Condor Pool

example applications


What is Condor ?

a specialized system for delivering High Throughput Computing

a harvester of unused computing resources

developed by Computer Science Dept at University of Wisconsin in late ‘80s

free and (now) open source software

widely used in academia and increasingly in industry

available for many platforms: Linux, Solaris, AIX, Windows XP/Vista/7, Mac OS


HPC vs HTC (1)

High Performance Computing (HPC)

delivers large amounts of computing power over relatively short periods of time (peak FLOPS ratings important)

can also provide lots of memory, large amounts of fast (parallel) storage

fairly exotic hardware, may need plenty of TLC

large capital outlay on hardware

need to run specialised parallel (MPI) codes to get the benefit (can run serial codes but these are a poor use of resources)

users run relatively small numbers of parallel jobs

essential for certain time-critical applications


HPC vs HTC (2)

High Throughput Computing (HTC)

allows many computational tasks to be completed over a long period of time (peak FLOPS ratings not so important)

users more concerned with running large numbers of jobs over a long time span than a few short burst computations

makes use of existing commodity hardware (e.g. desktop PCs)

small capital outlay on hardware possible

limited memory and storage available generally

mostly aimed at running concurrent serial jobs (although MPI and PVM are supported by Condor)


Types of Condor application

large numbers of independent calculations typically (“pleasantly parallel”)

data parallel applications – split large datasets into smaller parts and analyse independently

biological sequence analysis

processing of census data

optimisation problems

microprocessor design and testing

applications based on Monte Carlo methods

radiotherapy treatment analysis

epidemiological studies


A “typical” Condor pool

[Diagram, built up over five slides: a central manager, a submit host, a submit/execute host and several execute hosts. Successive slides add arrows labelled “ClassAds”, “Match Info”, “Jobs” and “Output”.]


ClassAds and Matchmaking

ClassAds are a fundamental part of Condor

similar to classified advertisements in a paper

“Job Ads” represent jobs to Condor (similar to “wanted” ads)

“Machine Ads” represent compute resources in a Condor Pool (similar to “for sale” ads)

Condor central manager matches Machine Ads to Job Ads and hence machines to jobs

Job Ads are created using submit description files
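The matchmaking idea above can be illustrated with a toy model. This is only a sketch: the dictionaries, the `match` function and the machine names are hypothetical and are not part of Condor. It is also simplified to one direction, with only the job placing requirements on the machine, whereas in a real pool machines can place requirements on jobs too.

```python
# Toy model of ClassAd matchmaking. All names here are illustrative only;
# this is not Condor's API or its ClassAd language.

machine_ads = [
    {"Name": "pc-01", "Arch": "Intel", "OpSys": "WINNT51", "Memory": 2048, "Kflops": 900_000},
    {"Name": "pc-02", "Arch": "Intel", "OpSys": "LINUX", "Memory": 4096, "Kflops": 1_200_000},
    {"Name": "pc-03", "Arch": "Intel", "OpSys": "WINNT51", "Memory": 1024, "Kflops": 1_500_000},
    {"Name": "pc-04", "Arch": "Intel", "OpSys": "WINNT51", "Memory": 4096, "Kflops": 1_100_000},
]

job_ad = {
    "Cmd": "example.exe",
    # Requirements: a yes/no test applied to each machine ad (the "wanted" side).
    "Requirements": lambda m: (
        m["Arch"] == "Intel" and m["OpSys"] == "WINNT51" and m["Memory"] >= 2000
    ),
    # Rank: among acceptable machines, a higher value is preferred.
    "Rank": lambda m: m["Kflops"],
}


def match(job, machines):
    """Return the machines that satisfy the job's requirements, best rank first."""
    candidates = [m for m in machines if job["Requirements"](m)]
    return sorted(candidates, key=job["Rank"], reverse=True)


matched = match(job_ad, machine_ads)
print([m["Name"] for m in matched])
```

Here pc-02 is rejected on operating system and pc-03 on memory; the two remaining machines are ordered by their `Kflops` value.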


Simple submit description file

# simple submit description file
# (anything following a # is comment and is ignored by Condor)
# this would be used for Windows XP based execute hosts

universe = vanilla
# what to run
executable = example.exe
# job's standard output
output = stdout.out$(PROCESS)
# log job's activities
log = mylog.log$(PROCESS)
# input files needed
transfer_input_files = common.txt, myinput$(PROCESS).txt
# what machines to run on
requirements = ( Arch == "Intel" ) && ( OpSys == "WINNT51" )
# number of jobs to queue
queue 2
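To make the role of `$(PROCESS)` concrete, the sketch below mimics how the file names in the submit description file would differ between the two queued jobs. The `expand` helper is hypothetical and handles only this one macro; Condor performs the substitution itself, numbering the jobs in a cluster from 0.

```python
# Illustration only: mimics $(PROCESS) substitution for the two jobs queued
# by "queue 2". The expand() helper is hypothetical, not a Condor function.

def expand(template, process):
    """Replace the $(PROCESS) macro with the job's process number."""
    return template.replace("$(PROCESS)", str(process))


templates = {
    "output": "stdout.out$(PROCESS)",
    "log": "mylog.log$(PROCESS)",
    "transfer_input_files": "common.txt, myinput$(PROCESS).txt",
}

queue_count = 2  # from "queue 2"; process numbers run from 0 to queue_count - 1
jobs = [
    {key: expand(value, process) for key, value in templates.items()}
    for process in range(queue_count)
]

for process, job in enumerate(jobs):
    print(process, job)
```

Each job therefore reads the shared `common.txt` plus its own input file, and writes its own output and log files.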


Requirements and Rank

Requirements expression determines where (and when) a job will run e.g.

# Windows XP OS wanted, on an Intel/compatible processor,
# with at least 2 GB memory and at least 32 GB of free disk;
# must have MATLAB installed;
# only run jobs after 5 pm OR at weekends
Requirements = ( OpSys == "WINNT51" ) && \
               ( Arch == "Intel" ) && \
               ( Memory >= 2000 ) && \
               ( Disk >= 33554432 ) && \
               ( HAS_MATLAB == TRUE ) && \
               ( ( ClockMin > 1020 ) || ( ClockDay == 6 ) || ( ClockDay == 0 ) )

Rank is used to express a preference

# run on machines with best floating point performance first
Rank = Kflops
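The time-of-day clause is the least obvious part of the expression above. The sketch below reproduces it in Python, taking `ClockMin` to be minutes since midnight and `ClockDay` to be the day of the week with Sunday as 0 and Saturday as 6, and taking the weekend test to be `ClockDay == 6` or `ClockDay == 0`. The function names are hypothetical.

```python
from datetime import datetime


def clock_attrs(now):
    """Return (ClockMin, ClockDay) for a given local time."""
    clock_min = now.hour * 60 + now.minute  # minutes since midnight
    clock_day = now.isoweekday() % 7        # Sunday -> 0, Monday -> 1, ..., Saturday -> 6
    return clock_min, clock_day


def may_run(now):
    """Mirror of: (ClockMin > 1020) || (ClockDay == 6) || (ClockDay == 0)."""
    clock_min, clock_day = clock_attrs(now)
    return clock_min > 1020 or clock_day == 6 or clock_day == 0


print(may_run(datetime(2010, 4, 5, 13, 55)))   # Monday afternoon
print(may_run(datetime(2010, 4, 5, 17, 30)))   # Monday after 5 pm
print(may_run(datetime(2010, 4, 10, 9, 0)))    # Saturday morning
```

Note that 1020 is 17 × 60, so the strict inequality means a job may start from 17:01 onwards on a weekday.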


Job submission and monitoring

[einstein@submit ~]$ condor_submit example.sub
Submitting job(s).
2 job(s) submitted to cluster 100.
[einstein@submit ~]$ condor_q
-- Submitter: submit.chtc.wisc.edu : <128.104.55.9:51883> : submit.chtc.wisc.edu
 ID      OWNER        SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   sagan       7/22 14:19 172+21:28:36 R  0   22.0 checkprogress.cron
   2.0   heisenberg  1/13 13:59   0+00:00:00 I  0    0.0 env
   3.0   hawking     1/15 19:18   0+04:29:33 R  0    0.0 script.sh
   4.0   hawking     1/15 19:33   0+00:00:00 R  0    0.0 script.sh
   5.0   hawking     1/15 19:33   0+00:00:00 H  0    0.0 script.sh
   6.0   hawking     1/15 19:34   0+00:00:00 R  0    0.0 script.sh
...
  96.0   bohr        4/5  13:46   0+00:00:00 I  0    0.0 c2b_dops.sh
  97.0   bohr        4/5  13:46   0+00:00:00 I  0    0.0 c2b_dops.sh
  98.0   bohr        4/5  13:52   0+00:00:00 I  0    0.0 c2b_dopc.sh
  99.0   bohr        4/5  13:52   0+00:00:00 I  0    0.0 c2b_dopc.sh
 100.0   einstein    4/5  13:55   0+00:00:00 I  0    0.0 cosmos

557 jobs; 402 idle, 145 running, 1 held
[einstein@submit ~]$
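When a queue listing like this is saved to a file, the job states can be tallied with a few lines of scripting. The sketch below is a rough illustration that assumes the nine-column layout shown in the listing (ID, OWNER, date, time, RUN_TIME, ST, PRI, SIZE, CMD); the `tally` function and the sample rows passed to it are illustrative, and other versions of `condor_q` may lay the output out differently.

```python
from collections import Counter

# A few rows in the same column layout as the listing above.
listing = """\
1.0   sagan       7/22 14:19 172+21:28:36 R  0   22.0 checkprogress.cron
2.0   heisenberg  1/13 13:59   0+00:00:00 I  0   0.0  env
3.0   hawking     1/15 19:18   0+04:29:33 R  0   0.0  script.sh
5.0   hawking     1/15 19:33   0+00:00:00 H  0   0.0  script.sh
100.0 einstein    4/5  13:55   0+00:00:00 I  0   0.0  cosmos
"""

# State codes as they appear in the ST column and in the summary line.
STATES = {"I": "idle", "R": "running", "H": "held"}


def tally(text):
    """Count jobs per state, skipping lines that do not have nine columns."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue
        state = fields[5]  # the ST column
        counts[STATES.get(state, state)] += 1
    return counts


counts = tally(listing)
print(dict(counts))
```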


Condor policies

Condor supports a wide range of policies for when to start jobs e.g.

run jobs only outside office hours

run jobs only if load average on host is small and there has been no recent activity

run jobs at any time on one core (at low priority)

run jobs only submitted by certain users

also a wide choice of what to do when a job is about to be interrupted e.g.

suspend the job for a limited time then let it resume

checkpoint the job and migrate it to another machine

kill off the job immediately


UNIX or Windows execute hosts ? (1)

UNIX

Condor’s natural environment

not widely installed on desktop machines (but depends on institution...)

supports the Condor “standard universe” containing many useful features

checkpointing allows jobs to be migrated from one machine to another without loss of useful work

Remote Procedure Calls give transparent access to files on submit host

streaming of standard output (stdout) from jobs to submit host

Network filesystems work well making installation and configuration much easier

leverages large amount of scientific and engineering codes which have been developed under UNIX


UNIX or Windows execute hosts ? (2)

Windows

world’s most widely installed OS – rich source of execute hosts

many commercial 3rd party applications run on Windows

using shared (network) filesystems can be difficult under Condor

only supports the “vanilla” Condor universe

no checkpointing – evicted jobs may waste a lot of cycles

all input and output files need to be transferred to/from execute host

output streaming not supported

may be difficult to port “legacy” UNIX codes (although Cygwin and Co-Linux can make life easier)

Windows support from the U-W Condor Team tends to lag behind UNIX


Setting up a Condor pool

best to start off small and build up pool slowly

need to understand Condor fundamentals:

role of Condor processes and how they interact

life-cycle of jobs

ClassAds and Matchmaking

avoid firewalls if possible (may be easier said than done ...)

talk to central IT services (particularly network and PC teams)

submit hosts may need to be fairly high spec if large numbers of jobs are to be run - ideally want

multi-core/processor machine (quad core at least)

plenty of memory (say 8 GB or more)

large fast access filestore (e.g. 1 TB RAID)


Where to go for help

Read The Fine Manual !

log files contain a lot of useful information

take a look at the presentations, tutorials and “how-to recipes” on the Condor website: (www.cs.wisc.edu/condor)

search the condor-users mail list archive: (lists.cs.wisc.edu/archive/condor-users)

subscribe to the condor-users mail list

join the Campus Grids SIG: (wikis.nesc.ac.uk/escinet/Campus_Grids)

commercial support is also available (e.g. Cycle Computing)


University of Liverpool Condor Pool

contains around 400 machines running the University’s Managed Windows Service (currently XP but moving to Windows 7 soon)

most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots / machine

single submission point for Condor jobs provided by Sun Solaris V445 SMP server

policy is to run jobs during office hours only after at least 5 minutes of inactivity and with a low load average, and at any time outside office hours

job will be killed off if running when a user logs in to a PC

web interface for specific applications

support for running large numbers of MATLAB jobs
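The start policy described above can be written down as a simple predicate. This is a sketch only, not the pool's actual configuration: the office-hours window and the load threshold are assumptions (the slide gives neither), and only the 5-minute idle time comes from the slide.

```python
from datetime import datetime, time

# Assumed values, not stated in the slides:
OFFICE_START = time(9, 0)
OFFICE_END = time(17, 0)
MAX_LOAD = 0.3

# From the slide: at least 5 minutes of inactivity.
MIN_IDLE_SECONDS = 5 * 60


def in_office_hours(now):
    """Monday to Friday, between the assumed office start and end times."""
    return now.weekday() < 5 and OFFICE_START <= now.time() < OFFICE_END


def may_start(now, idle_seconds, load_avg):
    """Outside office hours: always. During office hours: idle and lightly loaded."""
    if not in_office_hours(now):
        return True
    return idle_seconds >= MIN_IDLE_SECONDS and load_avg < MAX_LOAD


print(may_start(datetime(2010, 4, 5, 11, 0), idle_seconds=60, load_avg=0.1))
print(may_start(datetime(2010, 4, 5, 11, 0), idle_seconds=600, load_avg=0.1))
print(may_start(datetime(2010, 4, 5, 22, 0), idle_seconds=0, load_avg=0.9))
```

The eviction rule (a running job is killed when a user logs in) is a separate decision and is not modelled here.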


Condor service caveats

only suitable for DOS-based applications running in batch mode

no communication between processes possible (“pleasantly parallel” applications only)

statically linked executables work best (although can cope with DLLs)

all files needed by application must be present on local disk (cannot access network drives)

shorter jobs more likely to run to completion (10-20 min seems to work best)

very long running jobs can be accommodated using Condor DAGMan or user level check-pointing (details available soon on the Condor website)
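User-level checkpointing means the application itself saves its state periodically and resumes from the saved state when restarted. The sketch below shows the general pattern on a trivial calculation; the file name, the `run` function and the `stop_after` argument (which simulates an eviction) are all hypothetical, and how the checkpoint file is returned to the submit host after an eviction is not covered.

```python
import json
import os
import tempfile


def run(total_steps, checkpoint_path, stop_after=None):
    """Sum 0..total_steps-1, saving progress after every step.

    Returns the result, or None if the run was cut short (simulated eviction).
    """
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        # Resume from the last saved state instead of starting again.
        with open(checkpoint_path) as f:
            state = json.load(f)

    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None
        state["total"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        # Write to a temporary file first, then rename, so that an interruption
        # during the write cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    first = run(10, path, stop_after=4)  # interrupted after 4 steps
    second = run(10, path)               # picks up at step 4 and finishes

print(first, second)
```

A real job would checkpoint far less often than every step, balancing the cost of writing the file against the work lost on eviction.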


Running MATLAB jobs under Condor

many users prefer to create applications using MATLAB rather than traditional compiled languages (e.g. FORTRAN, C)

need to create standalone application from M-file(s) using MATLAB compiler

standalone application can run without a MATLAB license

run-time libraries still need to be accessible to MATLAB jobs

nearly all toolbox functions available to standalone applications

simple (but powerful) file I/O makes checkpointing easier

see Liverpool Condor website for more information


Power-saving and Green IT at Liverpool

we have around 2 000 centrally managed classroom PCs across campus which were powered up overnight, at weekends and during vacations.

original power-saving policy was to power-off machines after 30 minutes of inactivity; we now hibernate them after 15 minutes of inactivity

policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a.

3rd party power management software (PowerMAN) prevents machines hibernating whilst Condor jobs are running

Condor’s own power management features allow machines to be woken up automatically according to demand
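The figures quoted for the power-saving policy can be cross-checked with some simple arithmetic. The per-PC power draw and the electricity price below are derived from the quoted numbers rather than stated in the slides, and a 52-week year is assumed.

```python
# Figures quoted in the slide.
hours_saved_per_week = (200_000, 250_000)  # machine-hours of inactivity avoided
energy_saved_per_week_mwh = (20, 25)
annual_saving_gbp = 125_000

# Implied average draw per idle PC: energy / time (1 MWh = 1,000,000 Wh).
implied_watts = tuple(
    mwh * 1_000_000 / hours
    for mwh, hours in zip(energy_saved_per_week_mwh, hours_saved_per_week)
)

# Annual energy saved, assuming the weekly figure holds for 52 weeks.
annual_mwh = tuple(52 * mwh for mwh in energy_saved_per_week_mwh)

# Implied electricity price in pounds per kWh (1 MWh = 1,000 kWh).
implied_price_per_kwh = tuple(annual_saving_gbp / (mwh * 1_000) for mwh in annual_mwh)

print(implied_watts)
print(annual_mwh)
print([round(p, 3) for p in implied_price_per_kwh])
```

Both ends of the range correspond to about 100 W per idle PC, and the quoted saving implies a price of roughly 10 to 12 pence per kWh.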


Condor-G and Grid Computing

Condor-G is an extension to Condor allowing job submission to remote resources using Globus

provides familiar Condor-like interface to users hiding the underlying middleware complexity

we have used Condor-G to give users grid access to a variety of HPC resources:

local HPC clusters (UL-Grid)

NW-Grid resources at Daresbury Lab, Lancaster and Manchester

National Grid Service facilities

Grid Computing Server tools provide a batch environment similar to that of cluster systems (e.g. Sun Grid Engine)

Web portal removes the need for command line use completely


Radiotherapy example

3D model of normal tissue was developed in which complications are generated when ‘irradiated’ [1]

aim is to provide insight into connection between dose-distribution characteristics, different organ architectures and complication rates beyond that of analytical methods

code written in MATLAB and compiled into standalone executable

set of 800 simulations took ~ 36 hours to run on Condor pool

would require 4-5 months of computing time on a single PC

several dozen sets of simulations have since been completed

[1] Rutkowska E., Baker C.R. and Nahum A.E. Mechanistic simulation of normal-tissue damage in radiotherapy—implications for dose–volume analyses. Phys. Med. Biol. 55 (2010) 2121–2136.


Personalised Medicine example

project is a Genome-Wide Association Study

aims to identify genetic predictors of response to anti-epileptic drugs

try to identify regions of the human genome that differ between individuals (referred to as SNPs)

800 patients genotyped at 500 000 SNPs along the entire genome

test statistically the association between SNPs and outcomes (e.g. time to withdrawal of drug due to adverse effects)

very large data-parallel problem – ideal for Condor

divide datasets into small partitions so that individual jobs run for 15-30 minutes

batch of 26 chromosomes (2 600 jobs) required ~ 5 hours compute time on Condor but ~ 5 weeks on a single PC
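The data-parallel approach can be sketched as splitting the SNP list into near-equal slices, one per job. The even split over all 500 000 SNPs below is an assumption made for illustration (the real partitioning was presumably done per chromosome), and the `partition` function is hypothetical. The second half of the block checks the quoted timings against each other.

```python
def partition(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous (start, stop) slices of near-equal size."""
    base, extra = divmod(n_items, n_parts)
    slices = []
    start = 0
    for i in range(n_parts):
        stop = start + base + (1 if i < extra else 0)
        slices.append((start, stop))
        start = stop
    return slices


# Assumption for illustration: all 500 000 SNPs spread evenly over 2 600 jobs.
slices = partition(500_000, 2_600)
print(len(slices), slices[0], slices[-1])

# Consistency check on the quoted timings.
single_pc_hours = 5 * 7 * 24                      # ~5 weeks on a single PC
condor_hours = 5                                  # ~5 hours on the pool
speedup = single_pc_hours / condor_hours
cpu_hours = (2_600 * 15 / 60, 2_600 * 30 / 60)    # 2 600 jobs at 15-30 minutes each
print(speedup, cpu_hours)
```

The 2 600 jobs add up to 650 to 1 300 CPU-hours, or roughly 4 to 8 weeks of serial computing, which agrees with the quoted ~5 weeks on a single PC.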


Epidemiology example

researchers have simulated the consequences of an incursion of H5N1 avian influenza into the British poultry flock [2]

Monte Carlo type method - highly parallel

original code written in MATLAB and compiled into standalone application

individual simulations take only 10-15 minutes to run – ideal for Condor

require ~ 10 000 - 20 000 simulations per scenario

would have needed several years of compute time on a single machine; on Condor it needed a few weeks
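Monte Carlo work splits naturally into independent jobs provided each job uses a different random number stream. The sketch below uses a stand-in calculation (estimating pi) in place of the epidemiological model, which was written in MATLAB. Deriving the seed from a job's process number is one simple, assumed approach; `BASE_SEED` and `simulate` are hypothetical names.

```python
import random

BASE_SEED = 20080101  # hypothetical; any fixed value makes the runs reproducible


def simulate(process, n_draws=10_000):
    """Stand-in simulation: estimate pi from random points in the unit square."""
    rng = random.Random(BASE_SEED + process)  # a different seed for each job
    hits = sum(
        1 for _ in range(n_draws)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * hits / n_draws


# Each call corresponds to one independent job; results are combined afterwards.
estimates = [simulate(process) for process in range(8)]
combined = sum(estimates) / len(estimates)
print(combined)
```

Because each job depends only on its own process number, the jobs can run in any order and on any machine, and a job that is evicted can simply be rerun.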

[2] Sharkey, K.J., Bowers R.G., Morgan K.L., Robinson S.E. and Christley R.M. Epidemiological consequences of an incursion of highly pathogenic H5N1 avian influenza into the British poultry flock. Proc. R. Soc. B 2008 275, 19-28


Further Information

http://www.liv.ac.uk/e-science/condor

[email protected]