Todd Tannenbaum
Computer Sciences Department
University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor

Condor RoadMap
Condor Week 2007


Current Situation
› Stable Series
  Current: Condor ver 6.8.4 (Feb 5th)
› Development Series
  Current: Condor ver 6.9.2 (April 10th)

Major v6.9 Changes
› Virtual Machine Universe
  See Jaeyoung’s talk
› Quill 2.0
  See Jeff and Erik’s talk
› Scalability Improvements
› GCB Improvements
› Privilege Separation

Team Scalability!

Dan “Two-Faces” Bradley (half CMS, half Condor), “Papa” Todd Tannenbaum, and “Uncle” Greg Thain

The new condor_q GUI?

How we looked…
› Intuition / Heated hallway discussion
› Wagering
› Examination of the log files
› Some Tools
  callgrind and kcachegrind
    • Or Compuware DevPartner on Win32
  gprof
  strace
  tcpdump

What we found…
› Inappropriate Data Structures
› Deadly embraces
› Implementation issues
  Bad hash function, table sizes
  Buffer copy, copy, copy, copy, copy, copy

“I am sooo embarrassed!”

[Figure: schedd, schedd (“Bill and Monica”)]

What we did…
› More non-blocking I/O in critical areas to eliminate timeouts/embraces
› Cleansed a bunch of embarrassments
› Reuse of a claim
  Carefully cache candidate jobs
  Use “autoclusters”
› Auto-adapting parameters based on workload
  Old way: “do some bookkeeping every 5 seconds”
  New way: “spend 5% of time doing bookkeeping”
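
The "spend 5% of time" idea can be sketched as a timer that derives its next delay from how long the last run took. This is a minimal illustration in Python, not Condor's implementation; the class name, the clamping bounds, and the helper are all assumptions made for the example.

```python
import time


class DutyCycleTimer:
    """Schedule periodic work so it uses about `fraction` of wall-clock time."""

    def __init__(self, fraction=0.05, min_interval=1.0, max_interval=60.0):
        self.fraction = fraction          # target share of time, e.g. 5%
        self.min_interval = min_interval  # hypothetical lower bound (seconds)
        self.max_interval = max_interval  # hypothetical upper bound (seconds)

    def next_interval(self, last_runtime):
        """Return how long to wait before the next run.

        If a run took `last_runtime` seconds, waiting
        last_runtime * (1 - fraction) / fraction keeps the long-run share
        of time spent in the task near `fraction`.
        """
        wait = last_runtime * (1.0 - self.fraction) / self.fraction
        return max(self.min_interval, min(self.max_interval, wait))


def run_once(task, timer):
    """Run `task`, measure it, and return (runtime, delay before next run)."""
    start = time.monotonic()
    task()
    runtime = time.monotonic() - start
    return runtime, timer.next_interval(runtime)
```

With a fixed 5-second period, bookkeeping that takes 4 seconds would eat most of the daemon's time; with the duty-cycle rule the delay grows with the cost of the work, so a busy queue slows bookkeeping down instead of starving everything else.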

Creepy fork()
Got This: 0.125sec/fork!
Wanted this:

Let’s fix this one, please. (patch circa 2003)

And last but not least, thank you to Francesco Prelz!
  Buffered writes to schedd transaction log
  Numerous scalability improvements to Condor-C

Let’s see some results to date

SWEEEEET!! Todd, let’s see that graph again!!!

condor_q performance (sans Quill)
› Already done
  batch sending of ads (eliminate latency, let tcp window warm up)
  projection of attributes (note: now “condor_q -l” more expensive than “condor_q -format”)
› Still Todo?
  i/o in another thread
  href protocol on the wire
  caching of parsed expressions -- classads are very redundant
  same improvements into condor_status
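
To see why projection makes a full listing more expensive than a formatted query, here is a small Python sketch that models job ads as dictionaries. It is an illustration only: the two helper functions and the size estimate are assumptions for the example, and the sample ads are made up.

```python
def project(ad, attrs):
    """Keep only the requested attributes that are present in the ad."""
    return {name: ad[name] for name in attrs if name in ad}


def wire_size(ads):
    """Rough proxy for bytes on the wire: length of 'Name = Value' lines."""
    return sum(len(f"{k} = {v}\n") for ad in ads for k, v in ad.items())


# Toy job ads; the attribute values are illustrative only.
queue = [
    {"ClusterId": 1, "ProcId": 0, "Owner": "alice", "Cmd": "/bin/sim",
     "Requirements": "(Arch == \"INTEL\") && (OpSys == \"LINUX\")"},
    {"ClusterId": 1, "ProcId": 1, "Owner": "alice", "Cmd": "/bin/sim",
     "Requirements": "(Arch == \"INTEL\") && (OpSys == \"LINUX\")"},
]

# A formatted query needs only a few attributes; a long listing needs them all.
wanted = ["ClusterId", "ProcId", "Owner"]
projected = [project(ad, wanted) for ad in queue]
print(wire_size(queue), "bytes unprojected vs", wire_size(projected), "projected")
```

The repeated Requirements string in the sample also hints at the "classads are very redundant" remark: identical expressions recur across ads, which is what caching parsed expressions would exploit.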

Collector Performance
› Fixing Dropped updates
  increased incoming buffer sizes; problems caused by synchronization via condor_reconfig -all etc.
  Also, with Winsock, UDP sendto() is always successful. (!)
› Added DNS caching for unauthenticated connections in Condor.
  profiling was important; we had no suspicion this was the problem.
  Collector was spending 20% of its time in the DNS resolver library!
› Todo: Ian Alderman discovered non-blocking communication assumptions violated by authentication methods that require round-trips.
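
A DNS cache of the kind mentioned above can be sketched in a few lines of Python. This is not the collector's code; the class name, the 300-second lifetime, and the injectable resolver and clock are assumptions chosen to keep the example testable.

```python
import socket
import time


class ReverseDnsCache:
    """Cache IP -> hostname lookups for `ttl` seconds."""

    def __init__(self, resolver=None, ttl=300.0, clock=time.monotonic):
        # Default resolver does a real reverse lookup; callers can inject a fake.
        self._resolver = resolver or (lambda ip: socket.gethostbyaddr(ip)[0])
        self._ttl = ttl          # hypothetical entry lifetime in seconds
        self._clock = clock
        self._entries = {}       # ip -> (hostname, expiry time)

    def lookup(self, ip):
        now = self._clock()
        hit = self._entries.get(ip)
        if hit is not None and hit[1] > now:
            return hit[0]        # still fresh: no resolver call
        hostname = self._resolver(ip)
        self._entries[ip] = (hostname, now + self._ttl)
        return hostname
```

Because a collector hears from the same machines over and over, even a short lifetime removes most resolver calls from the update path.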

Negotiation Performance
› v6.8 -> automatic “significant attributes”.
› v6.9 -> “resource request” ads
  Simple explanation: Resource request ad == a count plus all significant attributes.
  Inserted into a schedd submitter ad.
  “Give me 400 resources like this, and 200 resources like that, etc”.
› Matchmaking algorithm remains the same, just how it “learns” about jobs changes.
› Disabled by default.
› Possibilities, possibilities…
  More robust against unresponsive schedds
  No startd Rank preemption?
  Others?
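
The "count plus all significant attributes" idea can be illustrated with a short Python sketch that collapses a queue of jobs into one entry per distinct combination of significant attributes. The function name, the `Count` key, and the sample attributes are assumptions for the example; real resource request ads are ClassAds, not dictionaries.

```python
from collections import Counter


def resource_requests(jobs, significant_attrs):
    """Collapse jobs into (significant attribute values -> count) entries."""
    counts = Counter(
        tuple((name, job.get(name)) for name in significant_attrs)
        for job in jobs
    )
    # One "request" per distinct combination: the shared attributes plus a count.
    return [dict(key, Count=n) for key, n in counts.items()]


# 600 toy jobs that differ only in an attribute matchmaking does not look at.
jobs = (
    [{"Arch": "INTEL", "ImageSize": 512, "Args": str(i)} for i in range(400)]
    + [{"Arch": "X86_64", "ImageSize": 2048, "Args": str(i)} for i in range(200)]
)

for request in resource_requests(jobs, ["Arch", "ImageSize"]):
    print(request)
```

Here 600 jobs shrink to two requests, which is the "400 resources like this, and 200 resources like that" shape from the slide.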

Impact of negotiation changes
› UW CS Pool – Negotiation cycle times:
  2583 seconds baseline
  Dropped to 366 secs w/ autoclustering
  Add matchlist caching, dropped to 223 secs
  Add resource request ads, drops again to 129 seconds.
  CM memory footprint increased by 80k.
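
As a quick check on what those cycle times mean, the following Python snippet computes the speedup of each step relative to the baseline, using only the numbers on the slide.

```python
baseline = 2583  # seconds, from the slide
steps = {
    "autoclustering": 366,
    "+ matchlist caching": 223,
    "+ resource request ads": 129,
}

for name, secs in steps.items():
    # Speedup is the baseline cycle time divided by the improved cycle time.
    print(f"{name}: {secs} s, {baseline / secs:.1f}x faster than baseline")
```

The three steps work out to roughly 7x, 12x, and 20x faster than the baseline negotiation cycle.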

Team GCB!

Derek “Mr. Follow-CVS-rules-or-ELSE” Wright, Alan “Ask me about Social Security #s” DeSmet, and Jaime “The GridMan” Frey

› Improved Scalability: Only use the broker if required!
  Local Host Optimizations (6.9.1)
    • Bypass GCB if two daemons are talking on the same host
  Local Network Optimizations (6.9.3)
    • Two hosts on the same private net bypass the broker
    • Every network is assigned a unique network name
    • Daemons advertise (a) publicly accessible IP; (b) real IP; (c) network name.
    • Names match ? use real IP : use public IP.
› Improved Robustness
  Broker dies -> master finds another broker and restarts.
  When master starts up, it pings a list of brokers and randomly chooses from those that respond.
  Bug fixes
› Improved Logging – the logs are now helpful and sane.
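
The address rule and the broker choice described above can be sketched in Python as follows. This is an illustration under assumptions, not GCB's code: the `Advert` record, both function names, and the `responds` callback (standing in for a ping) are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Advert:
    public_ip: str     # (a) publicly accessible IP, reached through the broker
    real_ip: str       # (b) the daemon's real (possibly private) IP
    network_name: str  # (c) unique name of the daemon's network


def address_for(peer, my_network_name):
    """Names match ? use real IP : use public IP."""
    if peer.network_name == my_network_name:
        return peer.real_ip
    return peer.public_ip


def choose_broker(brokers, responds, rng=random):
    """Pick a random broker among those that answer a ping."""
    alive = [b for b in brokers if responds(b)]
    if not alive:
        raise RuntimeError("no broker responded")
    return rng.choice(alive)
```

For example, a peer advertising `Advert("203.0.113.10", "192.168.1.5", "lab-net")` would be contacted at 192.168.1.5 by another daemon on "lab-net" and at 203.0.113.10 by everyone else. Choosing randomly among responsive brokers spreads load without any coordination between masters.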

Team Privilege Separation!

“Cousin” Greg Quinn, Pete “Psilord” Keller, and Zach “When the grid relaxes, it’s Zmiller time” Miller

Condor’s Privilege Separation
› Apply principle of least privilege to Condor

› No more root / super-user privilege required

› Currently completed on execute side (v6.9.3), “almost” on submit side

› Use glexec or Condor’s own sudo

› Can still run the “old way” if you want

› Refer to Greg Quinn’s Talk

Minor v6.9 changes
› Leases added to COD.
› Simple best-fit algorithm added to dedicated scheduler.
› Can reference resource usage and quota information in preemption policy.
› condor_config_val -dump [-v]
› Chirp improvements
  Jobs can write messages into the user log
  Can use proc 0 ClassAd as a “scratch pad”
› Condor shutdown via expressions
  External Awareness
  Plug: Talk w/ Joe Meehan @ the Research BOF!

Minor v6.9 changes, cont.
› More types of jobs can survive across a shutdown/crash of submit machine
  Such as jobs that stream stdout/err.
› User’s job log changes.
  Can have a centralized job log file.
  Get values of any job ad attribute in log.

Hoping for v6.9, but no promises

› Rich Wolski’s prediction work

› Support for VOMS attributes

› Update condor binaries on job boundaries

› Secure install by default
  Via pool password?

So 2-3 more developer releases, then new stable series Condor 7.0 (… or Condor Vista? …)

And the next developer series after v6.9?

Terms of License
Any and all dates in these slides are relative from a date hereby unspecified in the event of a likely situation involving a frequent condition. Viewing, use, reproduction, display, modification and redistribution of these slides, with or without modification, in source and binary forms, is permitted only after a deposit by said user into PayPal accounts registered to Todd Tannenbaum ….

Beyond v6.9
› For the next year, so far we have identified the following initial focus areas:
  Continue our work w/ Storage Management (MOPS)
    • Refer to Dan Fraser’s talk
  Continue our work w/ Virtual Machines
    • Refer to Jaeyoung Yoon’s talk
  Scheduling Work
  Startd Enhancements

Scheduling in Condor Today

[Diagram: CMs, schedds, and startds]

› Distributed Ownership
› Settings reflect 3 separate viewpoints: Pool manager, Resource Owner, Job Submitter

But some sites want to use Condor like this:

[Diagram: one schedd and several startds]

› Just one submission point (schedd)
› All resources owned by one entity
› We can do better for these sites.
  Policy configurations are complicated.
  Some useful policies not present because they are hard to do in a wide-area distributed system.
  Today the dedicated “scheduler” only supports FIFO and a naive Best Fit algorithm.

So what to do?

[Diagram: one schedd and several startds]

› Give the schedd more scheduling options.
  Examples: why can’t the schedd do priority preemption without the matchmaker’s help? Or move jobs from slow to fast claimed resources?
› Pluggable scheduler routines.
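
One way to picture "pluggable scheduler routines" is a scheduler that takes its job-selection function as a parameter. The Python sketch below is purely hypothetical: `ToySchedd`, the two policy functions, and the `memory` and `submitted` job fields are assumptions for illustration, and nothing here reflects an actual Condor interface.

```python
def fifo(jobs, slot):
    """Pick the oldest queued job that fits the slot."""
    fitting = [j for j in jobs if j["memory"] <= slot["memory"]]
    return min(fitting, key=lambda j: j["submitted"], default=None)


def best_fit(jobs, slot):
    """Pick the fitting job that leaves the least memory unused."""
    fitting = [j for j in jobs if j["memory"] <= slot["memory"]]
    return min(fitting, key=lambda j: slot["memory"] - j["memory"], default=None)


class ToySchedd:
    """Toy scheduler whose job-selection routine can be swapped."""

    def __init__(self, policy=fifo):
        self.policy = policy  # any callable (jobs, slot) -> job or None
        self.queue = []

    def next_job(self, slot):
        job = self.policy(self.queue, slot)
        if job is not None:
            self.queue.remove(job)
        return job
```

A site would then select a policy with `ToySchedd(policy=best_fit)` or supply its own function with the same signature, without touching the rest of the scheduler.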

StartD Enhancements: New sources of work

› Condor-G enabled the SchedD to talk to many different scheduling systems to run jobs…

› Now the StartD will be able to talk to different managers to fetch jobs and work.

› StartD configured to be “claimed at boot” so that you don’t need the overhead of match-making.

› Don’t necessarily need a SchedD -- fetch jobs (work units) from other systems (DB of jobs, etc).

StartD Enhancements: Dynamic slots

› Currently, resource slots are static -- some changes require restarting the StartD.

› Would like to add dynamic computing slots:
  Dual-core machines are ubiquitous.
  1 or 2 gigs of RAM is “commodity”.
  Instead of statically partitioning the RAM (1 gig for each slot), it’d be nice to advertise “2 CPUs, 2 gigs of RAM”, and once 1 CPU is claimed for 1/2 gig, to advertise “1 CPU, 1.5 gigs of RAM” for the other slot.
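
The dual-core example above can be traced with a small Python model. It is a sketch of the proposal only, not how the StartD works; the class, its method names, and the `Cpus`/`Memory` keys are assumptions for the example, with memory in megabytes.

```python
class DynamicMachine:
    """Toy machine that carves slots out of a shared pool of resources."""

    def __init__(self, cpus, memory_mb):
        self.free_cpus = cpus
        self.free_memory_mb = memory_mb
        self.claims = []

    def advertise(self):
        """What the machine would offer to the next job."""
        return {"Cpus": self.free_cpus, "Memory": self.free_memory_mb}

    def claim(self, cpus, memory_mb):
        """Allocate a new slot sized to the request, if it fits."""
        if cpus > self.free_cpus or memory_mb > self.free_memory_mb:
            raise ValueError("request does not fit in remaining resources")
        self.free_cpus -= cpus
        self.free_memory_mb -= memory_mb
        self.claims.append({"Cpus": cpus, "Memory": memory_mb})
        return self.claims[-1]


machine = DynamicMachine(cpus=2, memory_mb=2048)  # "2 CPUs, 2 gigs of RAM"
machine.claim(cpus=1, memory_mb=512)              # 1 CPU claimed for 1/2 gig
print(machine.advertise())  # {'Cpus': 1, 'Memory': 1536}
```

With static partitioning the second slot would still offer only 1 gig; here the leftover 1.5 gigs stays available to the next job.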

StartD Enhancements: Dynamic slots (cont’d)

› Can also be used to simplify complex policies:
  Currently “checkpoint to swap” implemented with static slots and pre-configured policy.
  Would like to just dynamically allocate new slots, and make it easier to have global, slot-wide policy expressions, not just per-slot policies.
› Could have implications for COD, GlideIn and other uses of the StartD…
  GlideIn under an existing Condor pool might just allocate a new slot on the “parent” StartD, instead of spawning a whole new StartD under the parent StartD
  COD claims could allocate new dynamic slots, too…

Thank you!