Zach Miller, Todd Tannenbaum, Dan Bradley, Igor Sfiligoi
University of Wisconsin-Madison
http://www.cs.wisc.edu/condor
CHEP 2009, Prague
Flexible Session Management in a Distributed System
Condor Communication Layer: CEDAR
› CEDAR – Condor External DAta Representation
  • C++ API for network sockets
  • Cross-platform data representation
  • Special attention to UDP issues
  • Flexibility with private networks
  • Strong focus on security
Basic CEDAR Security Features
› Authentication
› Encryption
› Integrity Checks
› Credential Mapping
› Authorization Policy
Basic CEDAR Security Features
› Autonegotiation of supported features
Client wants: KERBEROS, NTSSPI, 3DES, MD5
Server wants: KERBEROS, GSI, OPENSSL, 3DES, BLOWFISH, MD5
Policy: KERBEROS, 3DES, MD5
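The negotiation above can be illustrated with a minimal sketch. It is not CEDAR's actual algorithm: the grouping of methods into authentication, encryption, and integrity, the `negotiate` helper, and the rule that the client's preference order wins are all assumptions made for illustration.

```python
def negotiate(client_prefs, server_prefs):
    """Return the methods both sides support, in the client's order (assumed rule)."""
    server_set = set(server_prefs)
    return [method for method in client_prefs if method in server_set]

# Method lists from the slide; splitting them by kind is an assumption.
client = {
    "authentication": ["KERBEROS", "NTSSPI"],
    "encryption": ["3DES"],
    "integrity": ["MD5"],
}
server = {
    "authentication": ["KERBEROS", "GSI", "OPENSSL"],
    "encryption": ["3DES", "BLOWFISH"],
    "integrity": ["MD5"],
}

# Pick the first mutually supported method of each kind.
policy = {}
for kind in client:
    common = negotiate(client[kind], server[kind])
    if common:
        policy[kind] = common[0]

print(policy)
```

With the lists from the slide, the only methods common to both sides are KERBEROS, 3DES, and MD5, which matches the negotiated policy shown.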
Strong Authentication
› Strong Authentication can be expensive!
  • PKI (OpenSSL and GLOBUS) is relatively CPU-intensive
  • KERBEROS hits the KDC
  • All require network round trips
› Network round trips in a distributed grid environment (like glideinWMS) may be over a wide area network having relatively large latency
Client: "Hi! I want to talk" (0.1s)
Server: "OK, let’s try GSI" (0.1s)
Client: "Here you go…" (0.1s)
› In a single-threaded client blocking on network, all of those 0.1s quickly become a problem for scalability
› Solutions?
  • Don’t block on network.
    • recent progress, but more to do
  • When something is expensive, cache it
    • fortunately, already supported
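A rough back-of-the-envelope sketch of why the 0.1s delays add up for a blocking, single-threaded client. The three messages and the 0.1s delay come from the earlier slide; the peer count of 1000 is a hypothetical figure chosen only for illustration, and CPU cost is ignored.

```python
ONE_WAY_LATENCY_S = 0.1  # per-message delay shown on the slide
MESSAGES_PER_AUTH = 3    # "I want to talk", "let's try GSI", "Here you go"

def blocking_auth_time(peers, messages=MESSAGES_PER_AUTH, latency=ONE_WAY_LATENCY_S):
    """Seconds a blocking client spends waiting if it authenticates to each peer in turn."""
    return peers * messages * latency

print(f"{blocking_auth_time(1):.1f} s for one peer")
print(f"{blocking_auth_time(1000):.0f} s for 1000 peers (hypothetical count)")
```

Because the client does nothing else while it waits, the total grows linearly with the number of peers, which is why both removing blocking and caching the result are attractive.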
Security Session Cache
› Session is a semi-permanent information exchange which is set up and torn down
› Setup is costly, but done only once
› Resuming is done often, but is faster
› Tearing down is either explicit or based on expiration times
Session Management
› Session Set Up
Client: "I want to talk" (0.1s)
Server: "OK, let’s authenticate" (0.1s)
Client and server: Authentication (0.2s)
› Authentication results in secure key exchange
Client: secret key: 0x3E42
Server: secret key: 0x3E42
(secret key is actually 192 bits)
› The secret key is associated with a session ID
"Call this session 1234"
Client: sess 1234, key 0x3E42
Server: sess 1234, key 0x3E42
› Resuming a session
  • Sends the ID and uses the secret key
Client: "Use session 1234. Here is my request, encrypted (with 0x3E42) to prove who I am."
Server: "Here is your response, encrypted (with 0x3E42) to prove who I am."
Client: sess 1234, key 0x3E42
Server: sess 1234, key 0x3E42
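The idea of proving identity by using the shared session key can be sketched as follows. The slides describe encrypting the request; this sketch swaps in an HMAC-SHA256 tag instead, since that is available in the Python standard library. The function names and message layout are hypothetical and are not CEDAR's wire format.

```python
import hashlib
import hmac
import os

def make_request(session_id, key, payload):
    """Client side: send the session ID plus a tag only a key holder can compute."""
    tag = hmac.new(key, session_id.encode() + payload, hashlib.sha256).digest()
    return session_id, payload, tag

def verify_request(sessions, session_id, payload, tag):
    """Server side: look up the key by session ID and check the tag."""
    key = sessions.get(session_id)
    if key is None:
        return False  # unknown session: a full authentication would be needed
    expected = hmac.new(key, session_id.encode() + payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

key = os.urandom(24)               # 192-bit secret, as noted on the slide
server_sessions = {"1234": key}    # server's copy of session 1234

sid, payload, tag = make_request("1234", key, b"here is my request")
print(verify_request(server_sessions, sid, payload, tag))
```

No authentication round trips are needed here: the request itself carries the proof, so resuming costs a single message in each direction.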
› The session is stored by both sides
  sess 1234
  key: 0x3E42
  authn: KERBEROS
  user: [email protected]: cobalt.cs.wisc.edu
  authz: ALLOW *.wisc.edu
  valid until: 2009.03.24.17.50.00
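A minimal sketch of a session cache holding records like the one above, with both explicit and expiration-based teardown. The class names, the method names, and the user value are hypothetical; the field names follow the slide.

```python
import time
from dataclasses import dataclass

@dataclass
class Session:
    key: bytes
    authn: str          # authentication method used at setup
    user: str           # authenticated identity
    authz: str          # cached authorization decision
    valid_until: float  # expiration as a Unix timestamp

class SessionCache:
    def __init__(self):
        self._sessions = {}

    def store(self, session_id, session):
        self._sessions[session_id] = session

    def lookup(self, session_id, now=None):
        """Return the session if present and unexpired, else None."""
        now = time.time() if now is None else now
        session = self._sessions.get(session_id)
        if session is None:
            return None
        if now >= session.valid_until:
            del self._sessions[session_id]  # teardown based on expiration
            return None
        return session

    def invalidate(self, session_id):
        self._sessions.pop(session_id, None)  # explicit teardown

cache = SessionCache()
cache.store("1234", Session(
    key=b"\x3e\x42",
    authn="KERBEROS",
    user="alice@example.org",   # placeholder identity, not from the slides
    authz="ALLOW *.wisc.edu",
    valid_until=1000.0,
))
print(cache.lookup("1234", now=999.0) is not None)
print(cache.lookup("1234", now=1000.0) is None)
```

A miss in `lookup` is the signal to fall back to the costly full authentication and store a fresh session.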
Basic Condor Operation
› User submits a job
› Condor schedules the job on an execute node
› Submit point connects to execute node and sends job
› Job runs to completion
› Execute node returns results
High Level Condor Operation (with authentication turned on)
Diagram: Central Manager, Job Submit Point, Execute Node. Red lines represent authenticated connections.
› Execute node advertises itself. (authenticated connection)
› User submits a job: % condor_submit work.sub
› Condor performs matchmaking. (authenticated connection)
› Submit node sends job to execute node. (authenticated connection)
› Job runs to completion.
› Execute node sends job results back. (authenticated connection)
› Job complete! Repeat ad infinitum, reusing cached security sessions.
Problems Crossing the Atlantic
› Experience in CMS CCRC-08 with glideinWMS (dynamic Condor pool on top of grid):
  • Even with sessions, authentication cost is killing performance.
  • Why didn’t we notice this in previous scale tests at Fermilab?!
  • Because network latency adds significantly to cost of authentication in Condor (blocking, single-threaded).
Larger Scale View
Diagram: Job Submit Point, Central Manager, Execute Nodes.
› User submits 100,000 jobs
› Condor schedules jobs and passes security session info for each match
› Send jobs to execute nodes
  • Lots of authentications!
Meeting
› Igor: I will never give up on you guys (yet)
› Dan: all blocking network operations MUST be exterminated!
› Todd: don’t unwind all our code; use cooperative threads
› Miron: guys, listen
The Plan
› The Central Manager authenticates both the submit point and the execute nodes
› Using this trust relationship, the Central Manager can help establish a security session between the two, good for the duration of the match.
Integrated Security Sessions and Matchmaking
Diagram: Central Manager, Job Submit Point, Execute Node. Red lines represent authenticated connections.
› Execute node advertises itself AND match session info (encrypted): match sess 1278, key: 0x72A9…
› User submits a job: % condor_submit work.sub
› Condor schedules the job AND passes match session info to submitter (encrypted): sess 1278, key: 0x72A9…
› Submit node sends job to execute node. The red line here represents resuming the match session, saving an authentication. Both sides hold sess 1278, key: 0x72A9…
› Job runs to completion.
› Execute node sends job results back. The red line here represents resuming the match session, saving an authentication.
› Job complete!
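The match-session flow above can be sketched in a few lines. This is an illustrative model only: the classes, method names, and the trivial "first advertised node" matchmaking are hypothetical, and the encrypted, authenticated channels to the Central Manager are represented by plain function calls.

```python
import os

class ExecuteNode:
    def __init__(self, name):
        self.name = name
        self.sessions = {}  # session ID -> key

    def advertise(self):
        """Create a match session and include it in the ad sent to the Central Manager."""
        session_id = os.urandom(4).hex()
        key = os.urandom(24)
        self.sessions[session_id] = key
        return {"name": self.name, "match_session": (session_id, key)}

class CentralManager:
    def __init__(self):
        self.ads = {}

    def receive_ad(self, ad):
        self.ads[ad["name"]] = ad

    def match(self, job):
        """Toy matchmaking: hand back any advertised node plus its match session."""
        name, ad = next(iter(self.ads.items()))
        return name, ad["match_session"]

class SubmitPoint:
    def __init__(self):
        self.sessions = {}  # node name -> (session ID, key)

    def accept_match(self, node_name, match_session):
        self.sessions[node_name] = match_session

node = ExecuteNode("exec1")
manager = CentralManager()
submit = SubmitPoint()

manager.receive_ad(node.advertise())          # authenticated, encrypted in practice
node_name, session = manager.match("job 1")   # authenticated, encrypted in practice
submit.accept_match(node_name, session)

# Submit point and execute node now share a key without authenticating to each other.
session_id, key = submit.sessions["exec1"]
print(node.sessions[session_id] == key)
```

The submit point and execute node never run the expensive handshake with each other; they rely on each having authenticated to the Central Manager.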
What about Central Manager?
› Communication between submit node and execute node is now cheaper
› But Central Manager still has to authenticate everyone (at least once).
Execute Node Ads
Diagram: Job Submit Point, Central Manager, Execute Nodes. Execute nodes sending ads to the Central Manager: many authentications.
Two Ideas
› Easy: 2-tier ClassAd collection (done)
› Hard: remove blocking network operations and/or use threads (in progress)
2-tier Collector
Diagram: Execute Nodes, sub-collectors, Central Manager.
› Authentication workload distributed across multiple collectors.
› Big benefit, even with collectors all on just one machine.
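One way to spread execute nodes across sub-collectors is sketched below. The slides do not say how nodes are assigned; hashing the node name is an assumption made for illustration, as are the function name, the node names, and the counts of 1000 nodes and 8 sub-collectors.

```python
import hashlib
from collections import Counter

def pick_sub_collector(node_name, num_sub_collectors):
    """Map a node name to a sub-collector index, deterministically."""
    digest = hashlib.sha256(node_name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_sub_collectors

# Hypothetical pool: 1000 execute nodes, 8 sub-collectors.
nodes = [f"node{i:04d}.example.org" for i in range(1000)]
load = Counter(pick_sub_collector(name, 8) for name in nodes)

for index in sorted(load):
    print(f"sub-collector {index}: {load[index]} nodes to authenticate")
```

Each sub-collector then authenticates only its share of the nodes, so no single blocking process has to wait on every handshake.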
Can we Cross the Atlantic?
› glideinWMS tests with one submit node:
  • Before (condor 7.1.2)
    • max 4,000 jobs/day
    • (500 simultaneously running)
  • After (condor 7.3.1)
    • 200,000 jobs/day
    • (22k-25k simultaneously running)
    • now limited by port usage, not scheduler throughput
Conclusion
› Security sessions essential in Condor
› 2-tier collector works
› Establishing security sessions through matchmaking is a big win.
  • one submit node much more convenient than 50
› Efficient delegation and caching of trust can be an important optimization in distributed systems.