European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi

European Condor Week 2006

Using Condor Glide-Ins and GCB to run on the Grid

The CDF experience

Elliot Lipeles, Matthew Norman, Frank Würthwein, University of California, San Diego

Subir Sarkar, Igor Sfiligoi, INFN

Traditional use of Condor in CDF

Single Point of Submission system
● CAF daemons accept and authenticate user jobs, handle output.
● Condor Schedd does the real job

[Diagram: CDF Analysis Farm (CAF). Labels: CAF (Portal) with Submitter Daemons, Monitoring Daemons, Schedd, Collector and Negotiator; user jobs arriving over a legacy protocol; Worker Nodes (WN), each running a Startd.]

Had to move to the Grid

● HEP moving to the Grid
● Nobody wants to finance dedicated nodes anymore
● Need to move
● Want to preserve user interface

[Figure: “The Grid”]

Why not plain Condor-G?

Works, but...
● No central matchmaking
● No control over priorities
● Site selection problem
● Black holes can eat most of the user jobs

[Diagram: Labels: CAF (Portal) with Submitter Daemons and Schedd; user jobs; two Grid Pools, each with Globus and a Batch queue; a Batch Slot running a User Job.]

CDF decided to go with Condor glide-ins

[Diagram: Labels: CAF (Portal) with Submitter Daemons, Schedd, Collector/Neg and Monitor; a Glidekeeper with its own Schedd; glide-ins sent through Globus to the Batch queues of two Grid Pools; a Batch Slot running a Glidein, which runs a User Job. Animation label: "All jobs done".]

Glide-ins solve all the problems

What is the Glidekeeper

● Just a simple script
  – Checks the number of jobs and the number of glide-ins
  – Makes sure there is always at least one glide-in per CE when there are idle jobs in the queue
● Does not need to be complex
  – Just blindly submit to all the CEs
  – But could be made as complex as needed
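
The check described above can be sketched in shell. This is a minimal sketch, not CDF's actual Glidekeeper: the function name `ces_needing_glidein` and its input format (one `CE queued_glideins` pair per line on standard input) are assumptions made for illustration. In practice the idle-job and glide-in counts would come from querying the schedds.

```shell
#!/bin/sh
# Hypothetical Glidekeeper decision step (illustrative names, not CDF's script).
# Usage: ces_needing_glidein IDLE_JOBS   < lines of "CE_NAME QUEUED_GLIDEINS"
# Prints the name of every CE that currently has no glide-in queued.
ces_needing_glidein() {
    idle_jobs=$1
    # No idle user jobs in the queue: nothing to submit.
    [ "$idle_jobs" -gt 0 ] || return 0
    while read -r ce queued; do
        # Keep at least one glide-in waiting at every CE.
        if [ "${queued:-0}" -lt 1 ]; then
            printf '%s\n' "$ce"
        fi
    done
}
```

For example, `printf 'ce1 0\nce2 3\n' | ces_needing_glidein 5` prints only `ce1`; a more elaborate policy could replace the single comparison without changing the overall loop.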

What is a glide-in

condor_config

    DAEMON_LIST = MASTER,STARTD
    NEGOTIATOR_HOST = $(HEAD_NODE)
    COLLECTOR_HOST = $(HEAD_NODE)

    # 288000 seconds = 80 hours
    TMOUT=288000
    MaxJobRetirementTime=$(TMOUT)
    SHUTDOWN_GRACEFUL_TIMEOUT=$(TMOUT)

    # How long will it wait in an unclaimed state (1200 seconds = 20 minutes)
    STARTD_NOCLAIM_SHUTDOWN = 1200

    HEAD_NODE = cdfhead.fnal.gov

glidein_startup.sh

    # validate_node and local_config are shell functions defined elsewhere in the script
    validate_node
    local_config
    ./condor_master -r $mins -dyn -f

● Glide-ins are properly configured startds
  – Same old binaries used
● CDF uses a startup script to validate the node before starting startd
  – Prevent job failure
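
The slides do not say what CDF's node validation actually tests. As a rough illustration only, a `validate_node` function might check that the working directory is writable and that the tools the startup script relies on are present. The argument convention (working directory followed by tool names) and the specific checks are assumptions.

```shell
#!/bin/sh
# Hypothetical node validation; the real checks used by CDF are not given in the slides.
# Usage: validate_node WORKDIR TOOL...
validate_node() {
    workdir=$1
    shift
    # The glide-in needs a writable area to unpack and run the Condor binaries.
    if [ ! -d "$workdir" ] || [ ! -w "$workdir" ]; then
        echo "validate_node: $workdir is not a writable directory" >&2
        return 1
    fi
    # Every tool named on the command line must be found in PATH.
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "validate_node: missing tool $tool" >&2
            return 1
        fi
    done
    return 0
}
```

A startup script could call it as `validate_node "$PWD" tar wget sha1sum || exit 1`, so that a bad node gives up before a startd ever advertises itself and attracts user jobs.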

CDF using glide-ins for a year, now

GlideCAF sites:
● CNAF (Bologna, Italy), the first GlideCAF
● SDSC (San Diego, CA, USA)
● Fermilab (Batavia, IL, USA)
● IN2P3 (Lyon, France)
● MIT (Boston, MA, USA)

[Plot: CNAF one-year history. Note over a thousand running jobs for long periods of time.]

[Plot: SDSC six-month history.]

Problems found along the road

● File delivery
● Firewalls
● Security concerns

Condor binaries delivery

glidein.submit

    Universe = Globus
    GlobusScheduler = mysite/jobmanager-mybatch

    Executable = glidein_startup.sh
    transfer_Input_files = condor.tgz

    Queue

Only the executable is guaranteed to be transferred on EGEE. Other files need gLite tools. Works fine on OSG.

Cannot use the Condor-G transfer mechanism!
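
Since only the executable is guaranteed to arrive, one option is a submit description that ships nothing but the startup script. The helper below is a sketch under that assumption: the function name `make_glidein_submit` is hypothetical, and only submit keywords that appear in the slide are used.

```shell
#!/bin/sh
# Hypothetical helper: print a glide-in submit description for one gatekeeper.
# It deliberately omits transfer_Input_files, since only the executable
# is guaranteed to be transferred on EGEE.
# Usage: make_glidein_submit GATEKEEPER
make_glidein_submit() {
    cat <<EOF
Universe = Globus
GlobusScheduler = $1

Executable = glidein_startup.sh

Queue
EOF
}
```

Calling `make_glidein_submit mysite/jobmanager-mybatch > glidein.submit` yields a file that can be handed to `condor_submit`; looping over a list of gatekeepers gives one submit file per CE.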

HTTPd for file transfer

glidein_startup.sh (before: expects condor.tgz to have been delivered with the job)

    validate_node
    local_config
    tar -xzf condor.tgz
    ./condor_master -r $retmins -dyn -f

glidein_startup.sh (after: fetches condor.tgz from the HTTPd)

    validate_node
    wget http://cdfhead.fnal.gov/condor.tgz
    # knownSHA1 stands for the expected SHA1 hash of the tarball
    if echo "$knownSHA1  condor.tgz" | sha1sum -c -; then
        tar -xzf condor.tgz
        local_config
        ./condor_master -r $retmins -dyn -f
    fi

[Diagram: Labels: GlideCAF (Portal) with Collector/Neg, Main Schedd, Glidekeeper, Glide-in Schedd, Submitter Daemons and HTTPd; numbered steps (1) to (4).]

HTTPd serves files
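
The download-then-verify step can be factored into small functions. This is a sketch, not the CDF script: the names `verify_sha1` and `fetch_verified` are hypothetical, and it assumes GNU-style `sha1sum` and `wget` on the worker node.

```shell
#!/bin/sh
# verify_sha1 FILE EXPECTED_SHA1
# Succeeds only if FILE hashes to EXPECTED_SHA1.
verify_sha1() {
    # sha1sum -c reads "HASH  FILENAME" lines (two spaces) from stdin.
    printf '%s  %s\n' "$2" "$1" | sha1sum -c - >/dev/null 2>&1
}

# fetch_verified URL FILE EXPECTED_SHA1
# Downloads URL into FILE and removes FILE again if the hash does not match.
fetch_verified() {
    wget -q -O "$2" "$1" || return 1
    verify_sha1 "$2" "$3" || { rm -f "$2"; return 1; }
}
```

With these, the startup script reduces to `fetch_verified http://cdfhead.fnal.gov/condor.tgz condor.tgz "$knownSHA1" || exit 1` before unpacking. Checking the hash matters because plain HTTP gives no guarantee that the tarball arrived unmodified.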

Add HTTP Proxy to reduce load

glidein_startup.sh

    validate_node
    env http_proxy=proxy1.fnal.gov wget http://cdfhead.fnal.gov/condor.tgz
    # knownSHA1 stands for the expected SHA1 hash of the tarball
    if echo "$knownSHA1  condor.tgz" | sha1sum -c -; then
        tar -xzf condor.tgz
        local_config
        ./condor_master -r $retmins -dyn -f
    fi

[Diagram: Labels: GlideCAF (Portal) with Collector/Neg, Main Schedd, Glidekeeper, Glide-in Schedd, Submitter Daemons and HTTPd; an HTTP Proxy annotated "(once)"; numbered steps (1) to (4).]

Problems found along the road

● File delivery
● Firewalls
● Security concerns

Working with remote Grid sites - Dumb

[Diagram: Labels: Collector, Main Schedd and Glide-in Schedd on one side of the WAN; on the other side, Globus sites with Startds in available batch slots running user jobs, one site behind a Firewall/NAT; links labelled UDP. Annotations: "UDP packets often lost over WAN"; "Firewalls and NATs make it impossible".]

Working with remote Grid sites - Smart

[Diagram: Labels: Collector, Main Schedd and Glide-in Schedd on one side of the WAN; on the other side, Globus sites with Startds in available batch slots running user jobs; links across the WAN labelled TCP, some links still labelled UDP; a GCB next to the Firewall/NAT with an "Outgoing TCP" link. Annotations: "UDP is fast"; "TCP traffic is reliable"; "Use GCB to bridge firewalls".]

Generic Connection Brokering (GCB)

● Condor proxy service
  – Fully integrated in Condor
  – Just a configuration parameter
● For scalability, use multiple instances
  – Point different glide-ins to different GCBs

condor_config

    DAEMON_LIST = MASTER,STARTD
    NEGOTIATOR_HOST = $(HEAD_NODE)
    COLLECTOR_HOST = $(HEAD_NODE)

    TMOUT=288000
    MaxJobRetirementTime=$(TMOUT)
    SHUTDOWN_GRACEFUL_TIMEOUT=$(TMOUT)

    # How long will it wait in an unclaimed state
    STARTD_NOCLAIM_SHUTDOWN = 1200

    HEAD_NODE = cdfhead.fnal.gov

    # GCB configuration
    NET_REMAP_INAGENT = cdfgcb1.fnal.gov
    NET_REMAP_ENABLE = true
    NET_REMAP_SERVICE = GCB
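
One simple way to point different glide-ins at different GCBs is to pick a broker from a list when the glide-in starts. The sketch below assumes a hypothetical helper `pick_gcb` that selects by a numeric seed modulo the number of brokers; the slides do not describe how CDF actually distributes glide-ins, and `cdfgcb2.fnal.gov` in the usage note is an invented host name.

```shell
#!/bin/sh
# Hypothetical helper: choose one GCB host out of a list.
# Usage: pick_gcb SEED HOST...
# SEED is a non-negative integer (for example the process ID);
# the host at position SEED modulo the number of hosts is printed.
pick_gcb() {
    seed=$1
    shift
    [ "$#" -gt 0 ] || return 1
    skip=$(( seed % $# ))
    shift "$skip"
    printf '%s\n' "$1"
}
```

In a startup script, `gcb=$(pick_gcb "$$" cdfgcb1.fnal.gov cdfgcb2.fnal.gov)` followed by `echo "NET_REMAP_INAGENT = $gcb"` appended to the local configuration would spread glide-ins roughly evenly across the brokers.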

How GCB works

1) Establish a persistent TCP connection to a GCB
2) Advertise itself and the GCB connection
3) Send own address
4) Schedd address forwarded
5) Establish a TCP connection to the schedd
6) A job can be sent to the startd

● GCB must be on a public network
● Just a proxy

[Diagram: Labels: Collector, Schedd, GCB Broker; behind a Firewall/NAT, a Grid Pool where Globus starts the Startd that runs the User Job; links labelled TCP.]

The use of GCB at CDF

● Using a pre-production version of GCB
  – Waiting for Condor v6.7.20
● N. American CAF just opened to users
  – Condor collector and schedds at Fermilab
  – Using 2 GCBs, both at Fermilab
  – Gliding into Fermilab, MIT and San Diego
  – Rest of US and Canada Grid sites to follow

Problems found along the road

● File delivery
● Firewalls
● Security concerns

Security in the pull model

● Real user known only after glide-in starts
  – Cannot use user proxy when submitting to CE
  – Real user proxy delivered by Condor
    ● User job can use it for further authentication
● All glide-ins use a single service proxy
  – Site admin does not know about the real user
  – All jobs potentially run under the same UID
    ● No system protection between users
● Glide-in and job run under the same UID
  – User can steal service proxy

Security problem at a glance

[Diagram: Labels: CAF (Portal) with Submitter Daemons, Schedd, Collector/Neg and Monitor; Glidekeeper with its own Schedd; Globus and Batch queue of a Grid Pool; on one Node, two Batch Slots, each running a Glidein and a User Job. The glide-ins carry ProxyX; the two user jobs are labelled "uidX ProxyA" and "uidX ProxyB".]

Use gLExec to authenticate on the WN

● Authenticate on WN before starting the user job
  – Can change UID on the fly
● Grid admin knows who is running

[Diagram: Labels: Schedd, Collector/Neg and a second Schedd; Globus and Batch queue of a Grid Pool; a Batch Slot with the Glidein labelled "uidX ProxyX"; gLExec receiving ProxyA; the User Job labelled "ProxyA uidA".]

Status of gLExec

● Current gLExec works only on CE
  – Cannot use it as-is
● Work in progress to make it usable for the glide-ins
  – See later talk
● Condor team promised to make use of gLExec once available

Summary

● Glide-ins and GCB took CDF to the Grid
  – Without leaving the Condor environment
● Using the pull model made our life easy
  – No need to select a site in advance
  – Matchmaking done at a global level
  – Policies kept in CDF hands
● No need for any add-on at Grid sites
  – Will use what is found
● gLExec and a local Squid desirable