Condor Project Computer Sciences Department University of Wisconsin-Madison [email protected] http://www.cs.wisc.edu/condor Grids and Condor Barcelona, 2006

Page 1: Grids and Condor Barcelona, 2006

Condor Project
Computer Sciences Department
University of Wisconsin-Madison

[email protected]
http://www.cs.wisc.edu/condor

Grids and Condor

Barcelona, 2006

Page 2: Grids and Condor Barcelona, 2006

Agenda

Extended user’s tutorial

Advanced Uses of Condor

• Java programs
• DAGMan
• Stork
• MW
• Grid Computing

Case studies, and a discussion of your application’s needs

Page 3: Grids and Condor Barcelona, 2006

Resources

There are many resources (machines) in the world, and many are or can be made available!

Groups of machines may be labeled as grids

Welcome to the power of the grid!

Page 4: Grids and Condor Barcelona, 2006

Condor and Grids

Condor has always been a tool to harness grid computing

Condor’s mechanisms have evolved as technologies have evolved. Roughly categorized:

• Flocking
• Glidein
• The grid universe

Page 5: Grids and Condor Barcelona, 2006

Flocking

• A way for jobs to run within a different, separate Condor pool

• Condor runs here, and Condor runs there

[Figure: diagram with the labels "here" and "there"]

Page 6: Grids and Condor Barcelona, 2006

Connect Condor Pools with Flocking

Flocking is a Condor-specific technology

Flocking is enabled with configuration

Jobs flock from here to there when they cannot be run here due to lack of available machines

Page 7: Grids and Condor Barcelona, 2006

Configuration

Configuration files contain lots of the administrative information used by Condor

Format is like that in submit description files:

AttributeName = Value

Page 8: Grids and Condor Barcelona, 2006

Configuration here

For jobs to be able to flock from here to there

In the configuration file on the pool where jobs flock from:

    FLOCK_TO = <central manager machine name>
    FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
    FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
    HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
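The settings above refer to one another with the $(NAME) notation. As a rough illustration of how such references resolve, here is a minimal Python sketch. It is not Condor's own configuration parser: the function names, the host names, and the choice to expand an undefined name to an empty string are all assumptions made for this example.

```python
import re

# Matches a $(NAME) reference inside a value.
MACRO = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def parse_config(text):
    """Parse 'Name = Value' lines into a dict, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip()
    return settings


def expand(name, settings, _seen=()):
    """Return the value of `name` with every $(OTHER) reference substituted."""
    if name in _seen:
        raise ValueError(f"circular reference: {name}")

    def substitute(match):
        return expand(match.group(1), settings, _seen + (name,))

    # Assumption of this sketch: an undefined name expands to "".
    return MACRO.sub(substitute, settings.get(name, ""))


# Hypothetical host names, used only for illustration.
example = """
COLLECTOR_HOST = cm.here.example.org
FLOCK_TO = cm.there.example.org
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
"""

settings = parse_config(example)
print(expand("HOSTALLOW_NEGOTIATOR_SCHEDD", settings))
```

With these sample values, the last setting resolves to the local central manager followed by the central manager of the pool being flocked to.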

Page 9: Grids and Condor Barcelona, 2006

Configuration there

In the configuration file on the pool where jobs flock to:

    FLOCK_FROM = <submit machine name>, . . . , <submit machine name>

To make security work:

    HOSTALLOW_WRITE_COLLECTOR = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
    HOSTALLOW_WRITE_STARTD = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
    HOSTALLOW_READ_COLLECTOR = $(HOSTALLOW_READ), $(FLOCK_FROM)
    HOSTALLOW_READ_STARTD = $(HOSTALLOW_READ), $(FLOCK_FROM)

Page 10: Grids and Condor Barcelona, 2006

Submit Description File

Enable file transfer:

    universe = vanilla
    executable = myjob.exe
    input = myjob.input
    output = myjob.output
    log = myjob.log
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    queue

Page 11: Grids and Condor Barcelona, 2006

The Glidein Concept

Assume: We need more machines, and we have permission to use a set of machines

Glidein temporarily adds a set of machines to the local pool

Page 12: Grids and Condor Barcelona, 2006

Glidein

In addition, Glidein solves the problem:

“My job needs to run on that particular resource, and my job needs Condor.”

For example: a job that must run under the standard universe

Page 13: Grids and Condor Barcelona, 2006

Glidein

Condor sends and runs its own executables on the resource

The needed resource appears to temporarily join the local Condor pool!

Page 14: Grids and Condor Barcelona, 2006

Glidein

Run condor_glidein to add the remote resource to the local pool

[Figure: the local pool and the remote resource; the master and startd daemons become grid universe jobs using gt2]

Page 15: Grids and Condor Barcelona, 2006

Making Glidein Work

Change the configuration to give access permission (HOSTALLOW_WRITE) to the remote resource

No changes to jobs’ submit description files!

But, do enable file transfer in the submit description file:

    universe = vanilla
    executable = myjob.exe
    input = myjob.input
    output = myjob.output
    log = myjob.log
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    queue

Page 16: Grids and Condor Barcelona, 2006

Force Job to Glidein Resource

In the submit description file:

    universe = standard
    executable = ajob.exe
    input = ajob.input
    output = ajob.output
    log = ajob.log
    requirements = \
        ( machine == "example.mcs.anl.gov" ) \
        && Arch != "" && OpSys != ""
    queue

Page 17: Grids and Condor Barcelona, 2006

The Grid Universe

Most useful when

1. We want to send a job off to a far away machine

2. We want to hand a job to another batch processing system on the local machine

3. We want to send a job off to a far away machine, in order to hand that job to another batch processing system on that machine

Page 18: Grids and Condor Barcelona, 2006

The Grid Universe

All handled in the submit description file

Supports several back end types:

• Globus: GT2, GT3, GT4
• NorduGrid
• UNICORE
• Condor
• PBS
• LSF

Page 19: Grids and Condor Barcelona, 2006

Condor-G

Condor-G describes jobs to be handed off to a machine that is running Globus middleware

• gt2: Globus Toolkit 1 or 2 or the pre-web services GRAM
• gt3: Globus Toolkit 3
• gt4: Globus Toolkit 4 or WS GRAM

Page 20: Grids and Condor Barcelona, 2006

Submit Description File

For gt2:

    universe = grid
    input = job1.input
    output = job1.result
    log = job1.log
    grid_resource = gt2 example.wisc.edu/jobmanager
    queue

The last component of grid_resource (jobmanager in the example) is one of:

• jobmanager
• jobmanager-condor
• jobmanager-pbs
• jobmanager-lsf
• jobmanager-sge
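Because the jobmanager name must be one of a small fixed set, a script that generates gt2 submit description files can check it before writing anything. The following Python sketch shows one way to do that. The helper names are hypothetical and are not part of Condor; the generated text simply mirrors the gt2 example on this slide.

```python
# The jobmanager names listed on the slide.
JOBMANAGERS = (
    "jobmanager",
    "jobmanager-condor",
    "jobmanager-pbs",
    "jobmanager-lsf",
    "jobmanager-sge",
)


def gt2_grid_resource(host, jobmanager="jobmanager"):
    """Build the value of grid_resource for a gt2 back end."""
    if jobmanager not in JOBMANAGERS:
        raise ValueError(f"unknown jobmanager: {jobmanager!r}")
    return f"gt2 {host}/{jobmanager}"


def submit_description(host, jobmanager, job):
    """Render a submit description file shaped like the slide's example."""
    lines = [
        "universe = grid",
        f"input = {job}.input",
        f"output = {job}.result",
        f"log = {job}.log",
        f"grid_resource = {gt2_grid_resource(host, jobmanager)}",
        "queue",
    ]
    return "\n".join(lines) + "\n"


print(submit_description("example.wisc.edu", "jobmanager-pbs", "job1"))
```

Rejecting an unknown jobmanager name in the script gives an immediate error, rather than a failure that only shows up after the job has been handed to the remote site.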

Page 21: Grids and Condor Barcelona, 2006

Submit Description File

For gt3:

    universe = grid
    input = job2.input
    output = job2.result
    log = job2.log
    grid_resource = gt3 http://198.51.254.40:8080/ogsa/services/base/gram/XXXManagedJobFactoryService
    queue

In the URL, 198.51.254.40:8080 is the IP address:Port number

XXX is one of:

• Fork
• Condor
• PBS
• LSF
• SGE

Page 22: Grids and Condor Barcelona, 2006

Submit Description File

For gt4:

    universe = grid
    input = job3.input
    output = job3.result
    log = job3.log
    grid_resource = gt4 https://198.51.254.40:8080/wsrf/service/ManagedJobFactoryService XXX
    queue

In the URL, 198.51.254.40:8080 is the IP address:Port number OR the Host name:Port number

XXX is one of:

• Fork
• Condor
• PBS
• LSF
• SGE

Page 23: Grids and Condor Barcelona, 2006

NorduGrid and the Submit Description File

    universe = grid
    input = job4.input
    output = job4.result
    log = job4.log
    grid_resource = nordugrid ngexample.com
    queue

Page 24: Grids and Condor Barcelona, 2006

Unicore and the Submit Description File

    universe = grid
    input = job5.input
    output = job5.result
    log = job5.log
    grid_resource = unicore usite.example.com vsite
    keystore_file = /frieda/certificates/keystore
    keystore_alias = "frieda"
    keystore_passphrase_file = /frieda/private/passphrase
    queue

vsite is the name of the Unicore virtual resource

Page 25: Grids and Condor Barcelona, 2006

PBS and the Submit Description File

Details of the PBS installation in $(GLITE_LOCATION)/etc/batch_gahp.config

    universe = grid
    input = job6.input
    output = job6.result
    log = job6.log
    grid_resource = pbs
    queue

Page 26: Grids and Condor Barcelona, 2006

LSF and the Submit Description File

Details of the LSF installation in $(GLITE_LOCATION)/etc/batch_gahp.config

    universe = grid
    input = job7.input
    output = job7.result
    log = job7.log
    grid_resource = lsf
    queue

Page 27: Grids and Condor Barcelona, 2006

Condor-C

Condor is running here, and Condor is running over there

For the case where we want to send a job off to a far away machine, in order to hand that job to another batch processing system on that machine

Page 28: Grids and Condor Barcelona, 2006

Condor-C and the Submit Description File

    universe = grid
    input = job8.input
    output = job8.result
    log = job8.log
    grid_resource = condor [email protected] remotecentralmanager.example.com
    +remote_jobuniverse = 5
    +remote_requirements = True
    +remote_ShouldTransferFiles = "YES"
    +remote_WhenToTransferOutput = "ON_EXIT"
    queue

In grid_resource, the first argument is the schedd name and the second is the collector machine name

A remote_jobuniverse value of 5 selects the vanilla universe

Page 29: Grids and Condor Barcelona, 2006

Credentials

Not just anybody can use any resource at any time. . .

Key concepts:

Authentication: verification of an identity

Authorization: permission to do something

Page 30: Grids and Condor Barcelona, 2006

Authentication

If Frieda says “I am Frieda.”, how do we distinguish this from if Frieda says “I am George Bush.”?

Page 31: Grids and Condor Barcelona, 2006

Authentication

Bush can do whatever he pleases

If Frieda claims to be Bush (and this is accepted), then Frieda can do whatever she pleases

Authentication attempts to verify the identity of the entity that is communicating

Page 32: Grids and Condor Barcelona, 2006

Authorization

Who is allowed (permitted) to do what

• Frieda may run gt4 jobs on the Open Science Grid machines
• Fred may write to files in /usr/bin
• The Unix user root may do anything!

Can be implemented with a list of those authorized
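A list of those authorized can be as simple as a table that maps each identity to the actions it may perform. The Python sketch below uses the three examples from this slide. The action strings and the function name are invented for illustration and do not correspond to any Condor setting.

```python
# Each authenticated identity maps to the set of actions it may perform.
# The action strings are hypothetical labels for the slide's examples.
AUTHORIZED = {
    "frieda": {"run gt4 jobs on Open Science Grid"},
    "fred": {"write to /usr/bin"},
}

# Identities that may do anything, like the Unix user root.
SUPERUSERS = {"root"}


def is_authorized(user, action):
    """Return True if `user` appears in the list as permitted to do `action`."""
    if user in SUPERUSERS:
        return True
    return action in AUTHORIZED.get(user, set())


print(is_authorized("frieda", "run gt4 jobs on Open Science Grid"))
print(is_authorized("frieda", "write to /usr/bin"))
print(is_authorized("root", "write to /usr/bin"))
```

The check only makes sense after authentication: the list answers what an identity may do, and says nothing about whether the identity was verified.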

Page 33: Grids and Condor Barcelona, 2006

Condor and Authentication

Authentication within Condor comes in many forms. Here are three.

1. File system: Have the entity write a file. The OS attaches a name to the file owner. Condor checks that the entity’s claim is the same as the file owner.

2. GSI (Grid Security Infrastructure)

3. Kerberos
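The file system method in item 1 can be illustrated with a few lines of Python on a Unix-like system. This is only a sketch of the idea, not Condor's implementation: the entity creates a file, and the verifier compares the owner recorded by the operating system with the claimed identity. Numeric user IDs stand in for names here, and the function names are assumptions.

```python
import os
import tempfile


def owner_uid(path):
    """Return the numeric user ID that the OS recorded as the file's owner."""
    return os.lstat(path).st_uid


def verify_claim(claimed_uid, path):
    """Accept the claim only if the file's owner matches the claimed identity."""
    return owner_uid(path) == claimed_uid


with tempfile.TemporaryDirectory() as challenge_dir:
    # The entity proves who it is by creating a file in the agreed location.
    challenge = os.path.join(challenge_dir, "challenge")
    with open(challenge, "w"):
        pass

    # An honest claim matches the owner; a claim to be someone else does not.
    honest = verify_claim(os.geteuid(), challenge)
    forged = verify_claim(os.geteuid() + 1, challenge)

print(honest, forged)
```

The method relies on both sides seeing the same file system and on trusting the operating system's record of file ownership.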

Page 34: Grids and Condor Barcelona, 2006

Authentication Idea

• A centralized certificate authority (CA) does verification of an entity’s identity.

• When satisfied, the CA issues a signed certificate (also called a credential)

[Figure: a certificate labeled "I am Frieda" and "CA"]

Page 35: Grids and Condor Barcelona, 2006

Authentication

• To authenticate, the entity presents the certificate

• All is well, if we trust the CA and the remote machine

[Figure: a certificate labeled "I am Frieda" and "CA"]

Page 36: Grids and Condor Barcelona, 2006

GSI Authentication

GSI uses X.509 certificates

Grid universe jobs submitted to back end types that use Globus middleware (gt2, gt3, gt4), as well as to nordugrid and unicore, use X.509 certificates

Condor can also use GSI

Page 37: Grids and Condor Barcelona, 2006

Revocation, Trust, and Proxies

The CA may revoke a credential

Frieda gives the signed credential to the remote machine. If the remote machine is malicious, it could impersonate Frieda. Therefore, a password protects the credential.

A proxy is a credential that includes the password, but is only valid for a specific (short) time period.

MyProxy software enables GSI proxy management
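The short lifetime is what limits the damage if a proxy falls into the wrong hands. The Python sketch below models only that validity window. It involves no X.509 handling or cryptography, the class and function names are hypothetical, and the 12-hour default lifetime is an assumption chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Proxy:
    """A stand-in for a proxy credential: a subject plus a validity window."""
    subject: str
    not_before: datetime
    not_after: datetime

    def is_valid(self, now):
        """A proxy is usable only inside its validity window."""
        return self.not_before <= now <= self.not_after


def issue_proxy(subject, now, lifetime=timedelta(hours=12)):
    """Create a proxy valid from `now` for `lifetime` (assumed default: 12 hours)."""
    return Proxy(subject, now, now + lifetime)


issued = datetime(2006, 1, 1, 9, 0, tzinfo=timezone.utc)
proxy = issue_proxy("frieda", issued)

print(proxy.is_valid(issued + timedelta(hours=1)))   # inside the window
print(proxy.is_valid(issued + timedelta(hours=13)))  # after expiry
```

A remote machine that keeps a copy of the proxy can misuse it only until it expires, which is why a long-running job needs its proxy renewed.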