What’s New in Condor-G

Jaime Frey
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor



Outline
› What is Condor-G
› Released New Features
› In Development


What Is Condor-G
› Use Condor to run jobs on the Grid
› Uses Globus Toolkit
    › GRAM (submit a remote job)
    › GASS (transfer job’s files)
› Two components
    › Globus Universe
    › GlideIn


Globus Universe
› Run a job on a Grid resource
› Features
    › Job management
    › Fault tolerance
    › Credential management
› Roughly equivalent to the vanilla universe


How It Works
[Diagram, built up over five slides. Condor-G side: Schedd, GridManager, 600 Globus jobs. Grid Resource side: JobManager, LSF, User Job.]


GlideIn
› Run the Condor daemons on Grid resources as user jobs
› Create your own personal Condor pool from temporarily-acquired Grid resources
› Brings the full power of Condor to the Grid


[Diagram, built up over seven slides. Labels: Condor-G, Globus Grid, PBS, LSF, Condor, Condor glide-ins, 600 Condor jobs.]


Released New Features
› Stuff we’ve added in the past year
› Released and ready for use in Condor 6.6


Globus ASCII Helper Protocol (GAHP)

› Encapsulates Globus libraries in separate process

› Simple ASCII protocol

› Easy for legacy applications to use Globus when they can’t link directly with the libraries
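
The idea of a helper process driven by a simple ASCII protocol can be sketched in a few lines of Python. This is only an illustration of the pattern, not the actual GAHP wire format: the command names (`PING`, `SUBMIT`), the `S`/`E` reply prefixes, and the `BACKEND` table standing in for the wrapped library are all hypothetical.

```python
# Hypothetical stand-in for library calls the helper would wrap.
def _submit(resource, executable):
    return f"submitted {executable} to {resource}"

# Hypothetical command table: command name -> callable.
BACKEND = {
    "PING": lambda: "pong",
    "SUBMIT": _submit,
}

def handle_request(line, backend=BACKEND):
    """Turn one ASCII request line into one ASCII reply line."""
    parts = line.split()
    if not parts:
        return "E empty-request"
    command, args = parts[0].upper(), parts[1:]
    handler = backend.get(command)
    if handler is None:
        return f"E unknown-command {command}"
    try:
        # "S" marks success, "E" marks an error (assumed convention).
        return "S " + handler(*args)
    except TypeError:
        # Wrong number of arguments for this command.
        return f"E bad-arguments {command}"

def serve(lines, backend=BACKEND):
    """Answer a sequence of request lines; a real helper would read stdin."""
    return [handle_request(line, backend) for line in lines]
```

A client that cannot link against the library only needs to write text lines to the helper and read text lines back, for example `serve(["PING", "SUBMIT gk.example.org /bin/hostname"])`.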


How It Works - GAHP
[Diagram. Condor-G side: Schedd, GridManager, GAHP Client, GAHP Server. Grid Resources side: three JobManagers.]


File Staging
› Arbitrary input and output files can be staged to and from execution site
› Same syntax as other universes
› Limitation
    › Output files must be explicitly named


File Staging (cont)
› Input, Output, and Error can be URLs
    › Files will be transferred directly to and from execution site
› Output and Error can be staged or streamed


Credential Refresh
› Renewed credentials are used by Condor-G and forwarded to the execution site automatically
› No processes need to be restarted


Better Credential Management

› One GridManager process can handle multiple credential files with same subject

› More efficient when you want to have different credential lifetimes for different jobs


Grid Match-Making
› Globus jobs matched with Globus resources by the Condor match-maker using ClassAds
› Current limitation
    › User/admin must create resource ads
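
Match-making can be pictured as "filter by Requirements, then pick the highest Rank". The Python sketch below illustrates that idea only; it is not the Condor match-maker. The attribute names (`gatekeeper_url`, `Arch`, `OpSys`, `Mflops`) are borrowed from the job ad example later in the talk, while the resource ads themselves and their hostnames are made-up placeholders of the kind a user or admin would have to write by hand.

```python
def requirements(resource):
    """Job's Requirements: TARGET.Arch == "LINUX" && TARGET.OpSys == "LINUX"."""
    return resource.get("Arch") == "LINUX" and resource.get("OpSys") == "LINUX"

def rank(resource):
    """Job's Rank: TARGET.Mflops (higher is preferred)."""
    return resource.get("Mflops", 0)

def match(resource_ads):
    """Return the best acceptable resource ad, or None if nothing matches."""
    candidates = [ad for ad in resource_ads if requirements(ad)]
    if not candidates:
        return None
    return max(candidates, key=rank)

# Hypothetical hand-written resource ads.
RESOURCE_ADS = [
    {"gatekeeper_url": "gk1.example.org/jobmanager-lsf",
     "Arch": "LINUX", "OpSys": "LINUX", "Mflops": 120},
    {"gatekeeper_url": "gk2.example.org/jobmanager-pbs",
     "Arch": "LINUX", "OpSys": "LINUX", "Mflops": 300},
    {"gatekeeper_url": "gk3.example.org/jobmanager-pbs",
     "Arch": "SUN4u", "OpSys": "SOLARIS", "Mflops": 900},
]

best = match(RESOURCE_ADS)
# The matched ad supplies the contact string for the job.
contact_string = best["gatekeeper_url"] if best else None
```

The third ad has the highest `Mflops` but fails the requirements, so it is never considered for ranking.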


Fault Tolerance
› Condor-G does its best to automatically recover from failures
› User can guide decisions with job policy expressions
    › PeriodicRelease
    › GlobusResubmit
    › Rematch


PeriodicRelease Expression

› Condor-G puts problematic jobs on hold

› This expression tells Condor-G when to release and retry such jobs


GlobusResubmit Expression

› Tells Condor-G when a problematic job submission should be abandoned

› When this expression becomes true
    › Best effort is made to clean up current job submission
    › New job submission is attempted


Rematch Expression
› Tells Condor-G when a problematic resource should be abandoned
› Evaluated when GlobusResubmit evaluates to true
› When this expression becomes true
    › Best effort is made to clean up current job submission
    › Job is rematched


Job Ad Example

    GlobusContactString = TARGET.gatekeeper_url
    Requirements = TARGET.Arch == "LINUX" && TARGET.OpSys == "LINUX"
    Rank = TARGET.Mflops
    PeriodicRelease = ((NumMatches < 10) && ((CurrentTime-EnteredCurrentStatus) > 600))
    GlobusResubmit = NumSystemHolds >= NumMatches
    Rematch = True
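
The three policy expressions in this job ad can be read as ordinary boolean tests over job attributes. The Python sketch below restates them that way; it is an illustration, not Condor-G's evaluator. The helper names (`held_job_action`, `submission_action`) and the returned strings are hypothetical, and the `now` argument stands in for `CurrentTime`.

```python
def periodic_release(job, now):
    """PeriodicRelease: fewer than 10 matches and in the current state > 600 s."""
    return job["NumMatches"] < 10 and (now - job["EnteredCurrentStatus"]) > 600

def globus_resubmit(job):
    """GlobusResubmit: NumSystemHolds >= NumMatches."""
    return job["NumSystemHolds"] >= job["NumMatches"]

def rematch(job):
    """Rematch: always True in this job ad."""
    return True

def held_job_action(job, now):
    """What to do with a job that has been put on hold."""
    return "release" if periodic_release(job, now) else "stay held"

def submission_action(job):
    """What to do with a problematic submission.

    Rematch is only consulted once GlobusResubmit is true, as on the
    Rematch Expression slide. Retrying on the same resource when Rematch
    is false is an assumption of this sketch.
    """
    if not globus_resubmit(job):
        return "keep current submission"
    return "rematch" if rematch(job) else "resubmit to same resource"

# Hypothetical job state: matched twice, held by the system twice.
job = {"NumMatches": 2, "NumSystemHolds": 2, "EnteredCurrentStatus": 1000}
```

With this job state, `held_job_action(job, now=1700)` releases the job because it has been held for 700 seconds, and `submission_action(job)` chooses a rematch because the system holds have caught up with the number of matches.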


Hardening
› Regular testing on the CMS testbed with real applications
› Many bugs and integration issues found and fixed
    › Hostile Environment


Hostile Environment
› Full disks
› Machine crashes
› File server lock-ups
› Network outages
› Power outages


One CMS Dataset Run
› 300 jobs
› Last fall
    › ~50 (16%) of the jobs stalled and required human recovery
    › Multiple service restarts (20 daemon crashes over 6 hours)
› Now
    › 0 jobs stalled
    › 0 service restarts


Integration Work
› Dozens of Condor-G improvements and bug fixes
› Over 40 Globus “bugzilla” incidents, many with patches
    › Globus 2.2.4 has 21 “Advisories” as of 4/11/04
› Use latest version of both


Scalability
› Submitting several hundred jobs produced high load on server
    › Machine became unresponsive
    › We saw a load average of 1000 at one point
› Caused by Globus JobManager processes


Grid Manager Monitor Agent

› New tool Condor-G can use to reduce this load

› Efficient job status polling program

› Allows Condor-G to shut down JobManager processes when they’re not needed


Load Reduced
› 400 jobs (/bin/sleep 900)
› Without Grid Monitor
    › 42 hours to complete
    › Peak load average of 610
› With Grid Monitor
    › 40 minutes
    › Peak load average of 104
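
To put the two runs side by side, the figures on this slide can be turned into ratios. The short Python calculation below uses only the numbers quoted above; the variable names are chosen for this sketch.

```python
# Figures quoted on the slide for the 400-job /bin/sleep 900 test.
without_monitor_minutes = 42 * 60   # 42 hours to complete
with_monitor_minutes = 40           # 40 minutes to complete
peak_load_without = 610
peak_load_with = 104

# How many times faster the run finished with the Grid Monitor.
completion_speedup = without_monitor_minutes / with_monitor_minutes

# How many times lower the peak load average was with the Grid Monitor.
load_reduction = peak_load_without / peak_load_with

print(f"completion time: {completion_speedup:.0f}x shorter")
print(f"peak load average: {load_reduction:.1f}x lower")
```

On these figures the run completes 63 times sooner and the peak load average drops by a factor of about 5.9.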


Miscellaneous Stuff
› Email notification on job completion
› Port range restrictions
› Problem jobs put on hold


In Development
› Stuff we’re currently working on
› Will be released sometime in the next year


Job Policy Expressions
› PeriodicHold
› PeriodicRemove
› OnExitHold
› OnExitRemove


Improved GlideIn
› MDS use optional
    › User specifies necessary information
› Automatic setup
    › GlideIn job transfers and installs binaries if needed
    › Binaries can come from submit machine


New Job Types
› Submit jobs directly to other schedulers (not through Globus)
› Why?
    › Richer interface semantics
    › Not supported by Globus


NorduGrid
› Grid batch system designed by Nordic countries
› Globus GRAM didn’t offer necessary semantics
    › Client control of file staging
    › Automatic cleanup of abandoned jobs


Oracle
› Oracle DBMS supports a job queue
    › Run this query in 5 hours
    › Run this query every Monday
› Condor can add more management features


Generic Job Interface
› Re-arrange GridManager to allow easy addition of new job types
› Define appropriate interface
› Plug-ins for new job types?


Globus Toolkit 3.0
› OGSA (Open Grid Services Architecture)
› Submit jobs to GT3 sites
› Grid Service client interface to Condor-G


Miscellaneous
› Condor-G for Windows
› MyProxy credential management
› URLs for executable, staged files


Thank You!
› Questions?
› Also…
    › Condor-G & Globus Q/A session
        • Wednesday, 9am-12pm, room TBA
    › E-mail [email protected]