
Introduction to CamGrid

Mark Calleja

Cambridge eScience Centre

www.escience.cam.ac.uk

Why grids?

• The idea comes from electricity grids: you don't care which power station your kettle's using.

• Also, there are lots of underutilised resources around. The trick is to access them transparently.

• Not all resources need to be HPC with large amounts of shared memory and fast interconnects.

• Many research problems are “embarrassingly parallel”, e.g. phase space sampling.

• We’d like to use “anything”: dedicated servers or desktops.

What is Condor?

• Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing (HTC) facility.

• Machines in a Condor pool can submit and/or service jobs in the pool.

• Highly configurable control over how/when/whose jobs can run.

• Condor has several useful mechanisms, such as:
  – Process checkpoint / restart / migration
  – MPI support (with some effort)
  – Failure resilience
  – Workflow support
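
For a flavour of the checkpoint/restart mechanism: standard-universe jobs are relinked against Condor's libraries with condor_compile. A minimal sketch (mysim.c is just a placeholder for your own code):

# Relink your program against Condor's checkpointing libraries
# (only needed for the standard universe; vanilla jobs run unmodified)
condor_compile gcc -o mysim mysim.c
# The resulting binary can checkpoint and migrate if the machine it is
# running on is reclaimed by its owner.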

Getting Started: Submitting Jobs to Condor

• Choose a "Universe" for your job (i.e. the sort of environment the job will run in): vanilla, standard, Java, parallel (MPI)…

• Make your job "batch-ready" (i.e. it takes its input from stdin).

• It must be able to run in the background: no windows, GUI, etc.

• Create a submit description file.

• Run condor_submit on your submit description file.

A Submit Description File

# Example condor_submit input file
# (Lines beginning with # are comments)
Universe     = vanilla
Executable   = job.$$(OpSys).$$(Arch)
InitialDir   = /home/mark/condor/run_$(Process)
Input        = job.stdin
Output       = job.stdout
Error        = job.stderr
Arguments    = arg1 arg2
Requirements = Arch == "X86_64" && OpSys == "Linux"
Rank         = KFlops
Queue 100
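
With a description file in hand, submission and monitoring are one-liners (job.submit is just an example file name):

# Hand the description file to the local scheduler; this queues the 100 jobs
condor_submit job.submit

# Inspect your queued/running jobs, and the machines in the pool
condor_q
condor_status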

DAGMan – Condor’s workflow manager

• Directed Acyclic Graph Manager

• DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.

• Allows complicated workflows to be built up (can embed DAGs).

• E.g., "Don't run job B until job A has completed successfully."

• Failed nodes can be automatically retried.
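
A minimal DAG file for the "B after A" example above might look like the following sketch (a.submit and b.submit are hypothetical submit description files):

# mydag.dag: run A, then B only if A exits successfully
JOB A a.submit
JOB B b.submit
PARENT A CHILD B
# Retry B up to 3 times if it fails
RETRY B 3

Submit the workflow with condor_submit_dag mydag.dag; DAGMan itself then runs as a Condor job that shepherds A and B through the queue.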

Condor Flocking

• Condor attempts to run a submitted job in its local pool.

• However, queues can be configured to try sending jobs to other pools: "flocking".

• The user-priority system is "flocking-aware":
  – A pool's local users can have priority over remote users "flocking" in.

• This is how CamGrid works: each group/department maintains its own pool and flocks with the others.
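
By way of illustration, flocking is switched on in a pool's condor_config with settings along these lines (the host names here are invented):

# Try these remote central managers, in order, when the local pool is busy
FLOCK_TO = cm.otherdept.cam.ac.uk, cm.thirddept.cam.ac.uk

# Accept flocked-in jobs submitted from these pools
FLOCK_FROM = cm.otherdept.cam.ac.uk

(The remote pools must, of course, also authorise our hosts in their own configuration.)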

CamGrid

• Started in Jan 2005 by five groups (now up to eleven; 13 pools).

• UCS has its own, separate Condor facility known as "PWF Condor".

• Each group sets up and runs its own pool, and flocks to/from other pools.

• Hence a decentralised, federated model.

• Strengths:
  – No single point of failure
  – Sysadmin tasks shared out

• Weaknesses:
  – Debugging is complicated, especially networking issues.
  – Many linux variants: can cause library problems.

Participating departments/groups

• Cambridge eScience Centre

• Dept. of Earth Science (2)

• High Energy Physics

• School of Biological Sciences

• National Institute for Environmental eScience (2)

• Chemical Informatics

• Semiconductors

• Astrophysics

• Dept. of Oncology

• Dept. of Materials Science and Metallurgy

• Biological and Soft Systems

Local details (1)

• CamGrid uses a set of RFC 1918 ("CUDN-only") IP addresses. Hence each machine needs to be given an (extra) address in this space.

• A CamGrid Management Committee, with members drawn from participating groups, maps out policy.

• Currently have ~1,000 cores/processors, mostly 4-core Dell 1950s (8GB memory) like HPCF.

• Aside: SMP/MPI works very nicely!

• Pretty much all linux, and mostly 64-bit.

• Administrators can decide the configuration of their pool, e.g. such issues as:
  – Extra priority for local users
  – Renicing Condor jobs
  – Only running jobs at certain times
  – Having a preemption policy
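
As a rough sketch only (real CamGrid pools will differ), those knobs map onto condor_config entries such as:

# Run jobs niced so interactive users barely notice them
JOB_RENICE_INCREMENT = 10

# Only start jobs outside office hours; ClockMin is minutes past midnight
START = (ClockMin < 540) || (ClockMin >= 1020)

# Preempt a job if the machine's owner comes back to the keyboard
PREEMPT = KeyboardIdle < 5 * 60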

Local details (2)

• It's the responsibility of individual pools to authenticate local submitters.

• Need to trust root on remote machines, especially for the Standard Universe.

• There's no shared FS across CamGrid, but Parrot (from the Condor project) is a nice user-space file system tool for linux. It means a job can mount a remote data source like a local file system (à la NFS); see the example after this list.

• Firewalls: a submit host must be able to communicate with every possible execute node. However, Condor can be confined to a well-defined port range.

• Two mailing lists have been set up: one for users (92 currently registered) and the other for sysadmins.

• We have a nice web-based utility for viewing job files in real time on execute hosts.
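
To give a flavour of Parrot (host names and paths below are made up): parrot_run intercepts a job's I/O so remote servers appear as ordinary directories:

# Start a shell in which remote HTTP and Chirp servers look local
parrot_run bash

# Inside that shell, read remote data as if it were on a local file system
ls /chirp/data.example.cam.ac.uk/results/
cat /http/www.example.cam.ac.uk/inputs/params.dat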

41 refereed publications to date (Science, Phys. Rev. Lett., PLOS, …).

[Slide graphic: recruitment poster – "USERS", "YOUR GRID", "GOD SAVE THE GRID"]

How you can help us help you

• Pressgang local resources. Why aren’t those laptops/desktops on CamGrid?

• When applying for grants, please ask for funds to put towards computational resources (~£10k?)

• Publications, publications, publications! Please remember to mention CamGrid and inform me of accepted articles.

• Evangelise locally, especially to the hierarchy.

• Tell us what you’d like to see (centralised storage, etc.)

[Word cloud of digital asset types: papers, reports, books, research data, preprints, conference papers, digital theses, learning objects, images, audio, video, web pages, manuscripts, source code, …]

We can archive your digital assets…

Elin Stangeland, Repository Manager – [email protected]

Take home message

• It works. It has cranked out 386 years of CPU time since Feb '06 (386 years takes you back to King James I and the Jamestown Massacre).

• Those who put the effort in and get over the initial learning curve are very happy with it:

"Without CamGrid this research would simply not be feasible." – Prof. Bill Amos (population geneticist)

“We acknowledge CamGrid for invaluable help." – Prof. Fernando Quevedo (theoretical physicist)

• Needs no outlay for new hardware, and the middleware is free (and open source).

• This is a grass-roots initiative: you need to help recruit more/newer resources.

Links

• CamGrid: www.escience.cam.ac.uk/projects/camgrid/

• Condor: www.cs.wisc.edu/condor/

• Email: [email protected]

Questions?