John Kewley
e-Science Centre
CCLRC Daresbury Laboratory

15th March 2005
Paradyn / Condor Week
Madison, WI

Caging the CCLRC Compute Zoo
(Activities at CCLRC)

John Kewley
[email protected]
http://www.e-science.clrc.ac.uk/web/staff/john_kewley

Outline

What is a Compute Zoo?

Caging Problems

A Trip to the Zoo

Uses for a Compute Zoo

What is a Compute Zoo?

Compute Farm

Homogeneous: large numbers of (near) identical resources

Often co-located physically: a training room, lab workstations or a large cluster

Centrally managed, often by dedicated staff

Typical of many Condor Pools: excellent for High Throughput Computing

Compute Zoo

Heterogeneous: resources are of many different operating systems and architectures

Located across a site

Individually, or variously managed

Of minimal use for HTC

Caging Problems (Firewall Mirroring)

Firewalls within a Condor Pool

Some resource owners have firewalls on their personal workstations.

Since Condor needs each submit node to be able to talk to every potential execute node, every firewall in the pool must be opened to each new submit node when it is added.

Between adding the new node and the firewalls being updated, the firewalled nodes will be unavailable to the new node.

Or are they? Maybe someone should tell Condor!

Adding a new machine to the pool

If we add a new machine to the pool, the existing firewalls may not have anticipated this.

The firewalls will likely block this new machine.

A job from the newly added machine may still be matched to a firewalled resource.

This job will not be able to run.

Parts of the system can jam as a result:
o condor_q on the submitting node
o Subsequent parts of the submit script
o (maybe also parts of the central node)

Private networks

Similar "jams" occur if part of your pool (or flock of pools) is on a network that is unavailable to some of the other nodes.

How can we permit jobs from submit nodes that can access the private network to run on these nodes whilst preventing Condor sending jobs from other submit nodes there?

How can we get round this?

1. Restrict the number of submit nodes

2. Automatically update the firewall files

3. Ensure everything is up-to-date

4. Permit pool to evolve whilst persuading Condor to “avoid” going to nodes where the job can’t run

Firewall Mirroring (1)

1. Each machine with a firewall declares the fact in its ClassAds:

    HAS_FIREWALL = TRUE

2. It also declares which machines and/or subnets it permits to access its Condor ports (mirroring FW table settings):

    FW_ALLOWS_113 = TRUE
    FW_ALLOWS_rjavig6 = TRUE

3. Finally, it needs to export these settings:

    STARTD_EXPRS = HAS_FIREWALL, FW_ALLOWS_113, \
                   FW_ALLOWS_rjavig6
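The three steps above lend themselves to generation from a single allow-list, which is one way to keep the ClassAd settings in step with the firewall table. The following is a minimal sketch only: the function name `mirror_firewall_config` and the idea of passing the allowed names as a Python list are assumptions made for illustration, not part of the original talk or of Condor itself.

```python
def mirror_firewall_config(allowed):
    """Return Condor config lines that mirror a firewall allow-list.

    `allowed` is a list of short host names or subnet labels
    (hypothetical input; in practice it would be taken from the
    machine's firewall table).
    """
    # One FW_ALLOWS_<name> attribute per permitted host or subnet.
    attrs = ["FW_ALLOWS_" + name for name in allowed]

    lines = ["HAS_FIREWALL = TRUE"]
    lines += [attr + " = TRUE" for attr in attrs]

    # Export the attributes so that they appear in the machine's ClassAd.
    lines.append("STARTD_EXPRS = " + ", ".join(["HAS_FIREWALL"] + attrs))
    return lines


# The allow-list from the slide: subnet 113 and host rjavig6.
config_lines = mirror_firewall_config(["113", "rjavig6"])
print("\n".join(config_lines))
```

The sketch assumes that every name in the list is already usable inside an attribute name; host names containing characters such as dots or hyphens would probably need to be mapped to something safe first.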

Firewall Mirroring (2)

To ensure that jobs can only go to resources they can reach:

1. Ensure that submit machines declare their subnet and hostname:

    MY_SUBNET = 113
    MY_HOST = condor

2. Use these values in the following expression, which is added to all REQUIREMENTS for jobs from this machine:

    APPEND_REQUIREMENTS = ( \
        (HAS_FIREWALL =!= TRUE) || \
        (FW_ALLOWS_$(MY_HOST) == TRUE) || \
        (FW_ALLOWS_$(MY_SUBNET) == TRUE) )
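To see what the appended expression does during matching, it can help to model it outside Condor. The sketch below is a simplified stand-in, not a real ClassAd evaluator: machine ads are plain Python dicts, a missing key plays the role of an undefined attribute, and the names `can_reach`, `open_node`, `walled`, `newbox` and subnet `200` are all hypothetical.

```python
def can_reach(machine_ad, my_host, my_subnet):
    """Approximate the appended requirements expression.

    machine_ad: dict of attributes advertised by an execute machine.
    my_host, my_subnet: the submit machine's MY_HOST and MY_SUBNET values.
    """
    # (HAS_FIREWALL =!= TRUE): no firewall declared, so nothing blocks us.
    if machine_ad.get("HAS_FIREWALL") is not True:
        return True
    # Otherwise the firewall must allow this host or this subnet.
    return (machine_ad.get("FW_ALLOWS_" + my_host) is True
            or machine_ad.get("FW_ALLOWS_" + my_subnet) is True)


# An execute node without a firewall, and one mirroring the slide's settings.
open_node = {"Name": "open_node"}
walled = {
    "Name": "walled",
    "HAS_FIREWALL": True,
    "FW_ALLOWS_113": True,
    "FW_ALLOWS_rjavig6": True,
}

# An established submit node (host "condor" on subnet 113) and a newly
# added one (host "newbox" on subnet 200) that the firewall does not know.
for host, subnet in [("condor", "113"), ("newbox", "200")]:
    for ad in (open_node, walled):
        print(host, "->", ad["Name"], ":", can_reach(ad, host, subnet))
```

Under these assumptions, jobs from the new submit node are simply never matched to the firewalled machine until its owner adds a corresponding FW_ALLOWS entry, so they avoid the jam rather than being matched and then failing to start.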

And Private Networks?

The same solution can be used for private networks: nodes on the private network pretend to have a firewall and declare which other nodes have access to that network.

A Trip to the Zoo (Viewing the Pool)

The CCLRC Compute Zoo

2x Windows XP Professional
2x Windows 2000 Professional
1x Windows NT 4.0 Workstation
7x SuSE Linux 9.0
2x SuSE Linux 8.0
1x SuSE Linux 9.1
5x White Box Enterprise Linux 3.0
1x Red Hat Enterprise Linux AS release 3.0
1x Red Hat Enterprise Linux WS release 3.0
3x Red Hat Linux 9
2x Red Hat Linux 8.0
2x Red Hat Linux 7.3
1x Mandrake Linux 10.1
1x Gentoo Linux 1.4

Viewing the Pool

http://tardis.dl.ac.uk/Condor/cgi-bin/CondorStatus.cgi

http://tardis.dl.ac.uk/Condor/cgi-bin/WiscStatus.cgi

Uses of a Zoo

“Build and Test”

The CCLRC pool was part of the UK Grid Engineering Task Force “Build and Test” project.

Software bundles were distributed to a variety of OS types around the flocked pool for building and testing.

This type of (flocked) pool relies on heterogeneity, and small numbers of each type are all that are required.

http://polaris.ecs.soton.ac.uk:65000/
http://wiki.nesc.ac.uk/read/sfct?HomePage

Other non-HTC Uses

I want to ensure my code compiles without warnings and/or runs its basic tests:
o On as many OSs as possible
o With as many different compilers as possible

I want to perform a release build of my product for platform X, but I only have accounts on A, B and C

I have several server-licensed products and many potential occasional users. How can these be made available to them more easily (within the bounds of the licence, of course)?

What other uses are there for a Compute Zoo?