Author
others
View
6
Download
0
Embed Size (px)
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 1
European Condor Week 2006European Condor Week 2006
Using Condor Glide-Ins and GCB to run on the Grid
The CDF experience
Elliot Lipeles, Matthew Norman, Frank Würthwein,University of California, San Diego
Subir Sarkar, Igor Sfiligoi,INFN
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 2
Traditional use of Condor in CDFTraditional use of Condor in CDF
Single Point of Submission system● CAF Daemons accept and authenticate user jobs, handle output.● Condor Schedd does the real job
CAF (Portal)
Worker Node (WN)
Worker Node (WN)
Startd
Startd
Collector
Schedd
SubmitterDaemons Negotiator
User jobsUser jobs
Legacyprotocol
MonitoringDaemons
CDF Analysis Farm (CAF)...
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 3
Had to move to the GridHad to move to the Grid
“The Grid”
●HEP moving to the Grid●Nobody wants to finance dedicated nodes, anymore
●Need to move●Want to preserve user interface
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 4
Why not plain Condor-G?Why not plain Condor-G?
Works, but...●No central matchmaking●No control over priorities●Site selection problem●Black holes can eat most of the user jobs
Grid Pool
Globus
SubmitterDaemons
Grid Pool
Globus
Batch queue
CAF (Portal)
User jobsUser jobs
Schedd
Batch queue
Batch Slot
User Job
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 5
All jobsdone
CDF decided to go with Condor glide-insCDF decided to go with Condor glide-ins
Grid Pool
Globus
SubmitterDaemons
Grid Pool
Globus
Batch queue
CAF (Portal)
User jobsUser jobs
Schedd
Batch queue
Collector/Neg
ScheddGlid
eke
eper
Monit
or
Monit
or
Glid
ein
sG
lidein
s Batch Slot
Glidein
User Job
Glide-ins solve all the problems
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 6
What is the GlidekeeperWhat is the Glidekeeper
● Just a simple script– Checks number jobs and number glide-ins– Makes sure there is always at least a glide-
in per CE, when idle jobs in the queue● Does not need to be complex
– Just blindly submit to all the CEs– But could be made as complex as needed
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 7
What is a glide-inWhat is a glide-in
condor_config
DAEMON_LIST = MASTER,STARTDNEGOTIATOR_HOST = $(HEAD_NODE)COLLECTOR_HOST = $(HEAD_NODE)
TMOUT=288000MaxJobRetirementTime=$(TMOUT)SHUTDOWN_GRACEFUL_TIMEOUT=$(TMOUT)
# How long will it wait in an unclaimed stateSTARTD_NOCLAIM_SHUTDOWN = 1200
HEAD_NODE = cdfhead.fnal.gov
glidein_startup.sh
validate_node()local_config()./condor_master -r $mins -dyn -f
● Glide-ins are properly configured startds– Same old binaries used
● CDF uses a startup script to validatethe node before starting startd– Prevent job failure
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 8
CDF using glide-ins for a year, now CDF using glide-ins for a year, now
CNAF one year history plot: Note over a thousand running jobs for long period of time.
SDSC six months history plot
CNAF (Bologna, Italy), the first GlideCAF SDSC (San Diego, CA, USA)
Fermilab (Batavia, IL, USA) IN2P3 (Lyon, France) MIT (Boston, MA, USA)
1k1k
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 9
Problems found along the roadProblems found along the road
● File delivery● Firewalls● Security concerns
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 10
Problems found along the roadProblems found along the road
● File delivery● Firewalls● Security concerns
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 11
Condor binaries deliveryCondor binaries delivery
glidein.submit
Universe = GlobusGlobusScheduler = mysite/jobmanager-mybatch
Executable = glidein_startup.shtransfer_Input_files = condor.tgz
Queue
Only the executable guaranteedto be transfered on EGEE.
Other files need gLite tools.Works fine on OSG
Cannot useCondor-G
transfer mechanism!
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 12
glidein_startup.sh
validate_node()local_config()tar -czf condor.tgz./condor_master -r $retmins -dyn -f
HTTPd for file transferHTTPd for file transfer
glidein_startup.sh
validate_node()wget http://cdfhead.fnal.gov/condor.tgzsha1sum knownSHA1 condor.tgzif [ $? -eq 0 ]; then tar -xzf condor.tgz local_config() ./condor_master -r $retmins -dyn -ffi
GlideCAF (Portal)
Collector/Neg
Main Schedd
Glidekeeper Glide-inSchedd
HTTPd (1)
(2)
(3)
(4)SubmitterDaemons
(2)
(3)
HTTPd serves files
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 13
Add HTTP Proxy to reduce loadAdd HTTP Proxy to reduce load
GlideCAF (Portal)
Collector/Neg
Main Schedd
Glidekeeper Glide-inSchedd
HTTPd (1) (2)
(3)
(4)SubmitterDaemons
glidein_startup.sh
validate_node()env http_proxy=proxy1.fnal.gov wget http://cdfhead.fnal.gov/condor.tgzsha1sum knownSHA1 condor.tgzif [ $? -eq 0 ]; then tar -xzf condor.tgz local_config() ./condor_master -r $retmins -dyn -ffi
HTTP Proxy(once)
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 14
Problems found along the roadProblems found along the road
● File delivery● Firewalls● Security concerns
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 15
Working with remote Grid sites - DumbWorking with remote Grid sites - Dumb
Collector
MainSchedd
Available Batch Slot
StartdGlobus
WAN ??
UDP packetsoften lostover WAN
Available Batch Slot
Startd
Firewall/NAT
Firewalls and NATsmake it impossible
GlobusGlide-inSchedd
Available Batch Slot
Startd
Globus
UD
P
UD
P
User Job User Job
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 16
Working with remote Grid sites - SmartWorking with remote Grid sites - Smart
Collector
MainSchedd
Available Batch Slot
StartdGlobus
WAN
TCPTC
P
Available Batch Slot
Startd
GlobusGlide-inSchedd
Available Batch Slot
Startd
Globus
UD
P
UD
P
User Job User Job
UDPis fast
TCP trafficis reliable
Firewall/NAT
Use GCB to bridge firewalls
GCB
Outgoing
TCP User Job
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 17
Generic Connection Brokering (GCB)Generic Connection Brokering (GCB)
● Condor proxy service– Fully integrated in
Condor– Just a configuration
parameter● For scalability, use
multiple instances– Point different
glide-ins to different GCBs
condor_config
DAEMON_LIST = MASTER,STARTDNEGOTIATOR_HOST = $(HEAD_NODE)COLLECTOR_HOST = $(HEAD_NODE)
TMOUT=288000MaxJobRetirementTime=$(TMOUT)SHUTDOWN_GRACEFUL_TIMEOUT=$(TMOUT)
# How long will it wait in an unclaimed stateSTARTD_NOCLAIM_SHUTDOWN = 1200
HEAD_NODE = cdfhead.fnal.gov
# GCB configurationNET_REMAP_INAGENT = cdfgcb1.fnal.govNET_REMAP_ENABLE = trueNET_REMAP_SERVICE = GCB
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 18
How GCB worksHow GCB works
Firewall/NAT
Collector
Schedd
Grid Pool
GCB Broker
Startd Globus
4) Schedd address forwarded
3) Send own address
6) A job can be sent to the startd
GCB must be on a public network
TCP
2) Advertise itself and the GCB connection
5) Establish a TCP connection to the schedd
Just a proxy
User Job
1) Establish a persistent TCP connection to a GCB
TCP
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 19
The use of GCB at CDFThe use of GCB at CDF
● Using a pre-production version of GCB– Waiting for Condor v6.7.20
● N.American CAF just opened to users– Condor collector and schedds at Fermilab– Using 2 GCBs, both at Fermilab– Gliding into Fermilab, MIT and San Diego– Rest of US and Canada Grid sites to follow
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 20
Problems found along the roadProblems found along the road
● File delivery● Firewalls● Security concerns
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 21
Security in the pull modelSecurity in the pull model
● Real user known only after glide-in starts– Cannot use user proxy when submitting to CE– Real user proxy delivered by Condor
● User job can use it for further authentication
● All glide-ins use a single service proxy– Site admin does not know about the real user– All jobs potentially run under the same UID
● No system protection between users
● Glide-in and job run under the same UID– User can steal service proxy
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 22
Security problem at a glanceSecurity problem at a glance
SubmitterDaemons
Grid Pool
Globus
CAF (Portal)
User jobsUser jobs
ScheddBatch queue
Collector/Neg
ScheddGlid
eke
eper
Monit
or
Monit
or
Glid
ein
sG
lidein
s
Batch Slot
Glidein
User Job
Batch Slot
Glidein
User Job
Node
uidXProxyA
uidXProxyB
Pro
xyX
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 23
Use gLExec to authenticate on the WN Use gLExec to authenticate on the WN
● Authenticate on WN before starting the user job– Can change UID on the fly
● Grid admin knows who is running
GlobusBatch queue
Batch Slot
GlideinuidX
Pro
xyX
User Job ProxyAuidA
gLExecPro
xyA
Pro
xyA
Schedd
Collector/Neg
Schedd
GridPool
ProxyX
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 24
Status of gLExecStatus of gLExec
● Current gLExec works only on CE– Cannot use it as-is
● Work in progress to make it usable for the glideins– See later talk
● Condor team promised to make use of gLExec once available
European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi 25
SummarySummary
● Glide-ins and GCB took CDF to the Grid– Without leaving the Condor environment
● Using the pull model made our life easy– No need to select a site in advance– Matchmaking done at a global level– Policies kept in CDF hands
● No need for any add-on at Grid sites– Will use what is found
● gLExec and a local Squid desirable