Miron Livny
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/~miron
Condor-G - Your Window to the Grid
www.cs.wisc.edu/condor
The Condor Project (Established ‘85)
Distributed systems CS research performed by a team that faces
software engineering challenges in a UNIX/Linux/NT environment,
active interaction with users and collaborators, daily maintenance and support challenges of a
distributed production environment, and educating and training students.
Funding - NSF, NASA, DoE, DoD, IBM, INTEL, Microsoft and the UW Graduate School.
National Grid Efforts
› National Technology Grid - NCSA Alliance (NSF-PACI)
› Information Power Grid (NASA)
› Particle Physics Data Grid (DoE)
› Grid Physics Network (NSF-ITR)
Driving Concepts
“… Since the early days of mankind the primary motivation for the establishment of communities has been the idea that by being part of an organized group the capabilities of an individual are improved. The great progress in the area of inter-computer communication led to the development of means by which stand-alone processing sub-systems can be integrated into multi-computer ‘communities’. …”
Miron Livny, “Study of Load Balancing Algorithms for Decentralized Distributed Processing Systems”, Ph.D. thesis, July 1983.
Every Community needs a Matchmaker!
Why? Because ...
... someone has to bring together members of the community who have requests for goods and services with members who offer them.
› Both sides are looking for each other
› Both sides have constraints
› Both sides have preferences
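The two-sided matching described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Condor's matchmaker: the `Ad` class, the `match` function, and every attribute name (`memory`, `kflops`, `owner`) are hypothetical. Each party advertises attributes, a constraint on the other party, and a preference over acceptable partners.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Ad:
    """Hypothetical advertisement published by a requester or a provider."""
    name: str
    attrs: Dict[str, object]
    # Constraint: must hold for the OTHER party's attributes.
    constraint: Callable[[Dict[str, object]], bool]
    # Preference: higher value means a more desirable partner.
    preference: Callable[[Dict[str, object]], float]


def match(request: Ad, offers: List[Ad]) -> Optional[Ad]:
    """Return the best offer whose constraint and the request's both hold."""
    acceptable = [
        o for o in offers
        if request.constraint(o.attrs) and o.constraint(request.attrs)
    ]
    if not acceptable:
        return None
    # Rank by the requester's preference, then by the provider's preference.
    return max(
        acceptable,
        key=lambda o: (request.preference(o.attrs), o.preference(request.attrs)),
    )


job = Ad(
    name="job",
    attrs={"owner": "alice"},
    constraint=lambda m: m["memory"] >= 128,   # needs at least 128 MB
    preference=lambda m: m["kflops"],          # prefers faster machines
)
machines = [
    Ad("slow", {"memory": 256, "kflops": 1000}, lambda j: True, lambda j: 0.0),
    Ad("fast", {"memory": 512, "kflops": 5000},
       lambda j: j["owner"] != "mallory", lambda j: 0.0),
    Ad("tiny", {"memory": 64, "kflops": 9000}, lambda j: True, lambda j: 0.0),
]

best = match(job, machines)
print(best.name if best else "no match")
```

The machine named `tiny` is the fastest but fails the job's memory constraint, so the match falls to `fast`; a job owned by `mallory` would be rejected by `fast` and matched with `slow` instead.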
High Throughput Computing
For many experimental scientists, scientific progress and quality of research are strongly linked to computing throughput. In other words, they are less concerned about instantaneous computing power. Instead, what matters to them is the amount of computing they can harness over a month or a year: they measure computing power in units of scenarios per day, wind patterns per week, instruction sets per month, or crystal configurations per year.
HW is a Commodity
Raw computing power is everywhere - on desk-tops, shelves, and racks. It is cheap, dynamic, distributively owned, heterogeneous and evolving.
Master-Worker (MW) computing is common and Naturally Parallel.
It is by no means Embarrassingly Parallel.
Doing it right is by no means trivial.
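One reason MW computing is not trivial is that workers can disappear mid-task and the master has to notice and reissue the work. The sketch below illustrates that bookkeeping with local threads; it is not how Condor or any MW library is implemented. The `WorkerLost` exception, the `run_task` workload, and the rule that every third task loses its worker on the first attempt are all made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class WorkerLost(Exception):
    """Simulated failure: the worker vanished before finishing its task."""


def run_task(task_id, attempt):
    # Hypothetical workload. Every third task loses its worker on the
    # first attempt, so the master is forced to retry it.
    if task_id % 3 == 0 and attempt == 0:
        raise WorkerLost(task_id)
    return task_id * task_id


def master(task_ids, num_workers=4, max_attempts=3):
    """Hand tasks to workers in rounds and requeue any task whose worker was lost."""
    results = {}
    pending = [(t, 0) for t in task_ids]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while pending:
            futures = {pool.submit(run_task, t, a): (t, a) for t, a in pending}
            pending = []
            for fut in as_completed(futures):
                task_id, attempt = futures[fut]
                try:
                    results[task_id] = fut.result()
                except WorkerLost:
                    if attempt + 1 >= max_attempts:
                        raise
                    pending.append((task_id, attempt + 1))
    return results


results = master(range(10))
print(sorted(results.items()))
```

The master here works in rounds for brevity; a production master would typically reissue a lost task as soon as the loss is detected.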
The Tool
Our Answer to High Throughput MW Computing on commodity resources
The Condor System
A High Throughput Computing system that supports large dynamic MW applications on large collections of distributively owned resources, developed, maintained and supported by the Condor Team at the University of Wisconsin - Madison since ‘86.
› Originally developed for UNIX workstations
› Based on matchmaking technology
› Fully integrated NT version is available
› Deployed world-wide by academia and industry
› More than 1300 CPUs at the U of Wisconsin
› Available at www.cs.wisc.edu/condor
[Figure: Condor CPUs on the UW Campus. Series: CS and Other. Years shown: '88, '94, '99, '00. Vertical axis: 0 to 1400 CPUs.]
Some Numbers: UW-CS Pool
Total since 6/98              4,000,000 hours   ~450 years
“Real” Users                  1,700,000 hours   ~260 years
  CS-Optimization               610,000 hours
  CS-Architecture               350,000 hours
  Physics                       245,000 hours
  Statistics                     80,000 hours
  Engine Research Center         38,000 hours
  Math                           90,000 hours
  Civil Engineering              27,000 hours
  Business                          970 hours
“External” Users                165,000 hours   ~19 years
  MIT                            76,000 hours
  Cornell                        38,000 hours
  UCSD                           38,000 hours
  CalTech                        18,000 hours
I have a job-parallel MW application with 600 workers. How can I benefit from Condor?
The Application …
Study the behavior of F(x,y,z) for 20 values of x, 10 values of y and 3 values of z (20*10*3 = 600)
› F takes on average 3 hours to compute on a “typical” workstation (total = 1800 hours)
› F requires a “moderate” (128MB) amount of memory
› F performs “little” I/O - (x,y,z) is 15 MB and F(x,y,z) is 40 MB
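The numbers on this slide can be checked with a few lines of arithmetic. The sketch below only restates the slide's figures; the `days_with` helper and its assumption of perfectly even scaling with no scheduling or transfer overhead are illustrative, not a performance claim.

```python
# Figures taken from the slide.
x_values, y_values, z_values = 20, 10, 3
hours_per_run = 3
input_mb, output_mb = 15, 40

runs = x_values * y_values * z_values      # 600 evaluations of F
total_hours = runs * hours_per_run         # 1800 workstation-hours
days_on_one_machine = total_hours / 24     # 75 days, about 2.5 months
total_input_mb = runs * input_mb           # 9,000 MB of input overall
total_output_mb = runs * output_mb         # 24,000 MB of output overall


def days_with(machines):
    """Idealized wall-clock days if the runs spread evenly over `machines`."""
    return total_hours / machines / 24


print(runs, total_hours, days_on_one_machine)
print(total_input_mb, total_output_mb)
print(days_with(100))
```

Seventy-five days on a single workstation is where the "2.5 months" vacation in Step I comes from.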
Step I - get organized!
› Turn your workstation into a “Personal Condor”
› Write a script that creates the 600 input files, one for each of the (x,y,z) combinations
› Submit a cluster of 600 jobs to your personal Condor
› Write a script that collects the data from the 600 output files
› Go on a long vacation … (2.5 months)
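The two scripts mentioned in these steps might look like the following minimal sketch. The directory names `worker_dir.N` and the file names `in` and `out` follow the submit file in this deck; everything else is an assumption made for illustration, including the one-line "x y z" input format, the integer parameter values, and the function names `create_inputs` and `collect_outputs`.

```python
import itertools
import tempfile
from pathlib import Path


def create_inputs(root, xs, ys, zs):
    """Create one worker_dir.<N> per (x, y, z) combination, each holding a file 'in'."""
    root = Path(root)
    combos = list(itertools.product(xs, ys, zs))
    for process, (x, y, z) in enumerate(combos):
        job_dir = root / f"worker_dir.{process}"
        job_dir.mkdir(parents=True, exist_ok=True)
        # Assumed input format: the three parameters on one line.
        (job_dir / "in").write_text(f"{x} {y} {z}\n")
    return len(combos)


def collect_outputs(root, count):
    """Gather the contents of every 'out' file that exists so far."""
    root = Path(root)
    results = {}
    for process in range(count):
        out_file = root / f"worker_dir.{process}" / "out"
        if out_file.exists():
            results[process] = out_file.read_text()
    return results


# Demonstration in a throwaway directory with placeholder parameter values.
with tempfile.TemporaryDirectory() as root:
    n_jobs = create_inputs(root, range(20), range(10), range(3))
    first_input = (Path(root) / "worker_dir.0" / "in").read_text()
    # Pretend job 0 has finished and written its output.
    (Path(root) / "worker_dir.0" / "out").write_text("result\n")
    collected = collect_outputs(root, n_jobs)

print(n_jobs, repr(first_input), collected)
```

Because `collect_outputs` skips missing files, it can be run at any time to see how many of the 600 jobs have finished.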
A Condor Job-Parallel Submit File
executable = worker
requirement = ((OS == "Linux2.2") && (Memory >= 128))
rank = KFlops
initialdir = worker_dir.$(process)
input = in
output = out
error = err
log = log
queue 600
Your Personal Condor will ...
› ... keep an eye on your jobs and will keep you posted on their progress
› ... implement your policy on when the jobs can run on your workstation
› ... implement your policy on the execution order of the jobs
› ... add fault tolerance to your jobs
› ... keep a log of your job activities
[Diagram: your workstation runs a personal Condor holding 600 Condor jobs]
Condor Layers
[Diagram: layers listed in extracted order: Resource, Local Resource Management, Owner Agent, Environment Agent, Customer Agent, Application Agent, Application; with the labels Tasks and Jobs]
Step II - build your personal Grid
› Install Condor on the desk-top machine next door.
› Install Condor on the machines in the classroom.
› Install Condor on the O2K in the basement.
› Configure these machines to be part of your Condor pool.
› Go on a shorter vacation ...
[Diagram: your workstation runs a personal Condor holding 600 Condor jobs, now connected to a Group Condor pool]
Step III - Take advantage of your friends
› Get permission from “friendly” Condor pools to access their resources
› Configure your personal Condor to “flock” to these pools
› Reconsider your vacation plans ...
[Diagram: your workstation runs a personal Condor holding 600 Condor jobs, connected to a Group Condor pool and a friendly Condor pool]
Think big.
Go to the Grid
Condor-G
A Grid-enabled version of Condor that uses the inter-domain services of Globus to bring Grid resources into the domain of your Personal-Condor
› Supports Grid Universe jobs
› Uses GSIFTP to move glide-in software
› Uses MDS for submit information
Condor-glide-in
Enable an application to dynamically turn allocated grid resources into members of a Condor pool for the duration of the allocation.
› Easy to use on different platforms
› Robust
› Supports SMPs
X509 Certificates
We are in the process of adding X509-based authentication capabilities to Condor services.
› Job submission
› Local file access
› Access to Condor-glide-in software
› Resource authentication
GSIFTP
Enable Condor I/O services to use remote GSIFTP servers.
› Move glide-in tar files
› Read executables
› Move data from/to data repositories
› Access disk caches
Grid Universe
Grid Universe jobs submitted to Condor are transformed into Globus jobs and submitted (via GlobusRun) to a grid resource.
› Use MDS to locate resource
› Monitor status of job on remote resource
› Report status via Condor services
› Rewrite in progress with new Globus library
Step IV - Think big (Grid)!
› Get access (account(s) + certificate(s)) to a “Computational” Grid
› Submit 599 “Grid Universe” Condor-glide-in jobs to your personal Condor
› Take the rest of the afternoon off ...
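[Diagram: your workstation runs a personal Condor holding 600 Condor jobs, connected to a Group Condor pool, a friendly Condor pool, and, through 599 glide-ins, a Globus Grid fronting PBS, LSF and Condor resources]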
Does it work?
An example - NUG28
We are pleased to announce the exact solution of the nug28 quadratic assignment problem (QAP). This problem was derived from the well known nug30 problem using the distance matrix from a 4 by 7 grid, and the flow matrix from nug30 with the last 2 facilities deleted. This is to our knowledge the largest instance from the nugxx series ever provably solved to optimality.
The problem was solved using the branch-and-bound algorithm described in the paper "Solving quadratic assignment problems using convex quadratic programming relaxations," N.W. Brixius and K.M. Anstreicher. The computation was performed on a pool of workstations using the Condor high-throughput computing system in a total wall time of approximately 4 days, 8 hours. During this time the number of active worker machines averaged approximately 200. Machines from UW, UNM and INFN all participated in the computation.
NUG30 Personal Condor …
For the run we will be flocking to
-- the main Condor pool at Wisconsin (600 processors)
-- the Condor pool at Georgia Tech (190 Linux boxes)
-- the Condor pool at UNM (40 processors)
-- the Condor pool at Columbia (16 processors)
-- the Condor pool at Northwestern (12 processors)
-- the Condor pool at NCSA (65 processors)
-- the Condor pool at INFN (200 processors)
We will be using glide_in to access the Origin 2000 (through LSF) at NCSA.
We will use "hobble_in" to access the Chiba City Linux cluster and Origin 2000 here at Argonne.
It works!!!
Date: Thu, 8 Jun 2000 22:41:00 -0500 (CDT)
From: Jeff Linderoth <[email protected]>
To: Miron Livny <[email protected]>
Subject: Re: Priority
This has been a great day for metacomputing! Everything is going wonderfully. We've had over 900 machines (currently around 890), and all the pieces are working great…
Date: Fri, 9 Jun 2000 11:41:11 -0500 (CDT)
From: Jeff Linderoth <[email protected]>
Still rolling along. Over three billion nodes in about 1 day!
Up to a Point …
Date: Fri, 9 Jun 2000 14:35:11 -0500 (CDT)
From: Jeff Linderoth <[email protected]>
Hi Gang,
The glory days of metacomputing are over. Our job just crashed. I watched it happen right before my very eyes. It was what I was afraid of -- they just shut down denali, and losing all of those machines at once caused other connections to time out -- and the snowball effect had bad repercussions for the Schedd.
Back in Business
Date: Fri, 9 Jun 2000 18:55:59 -0500 (CDT)
From: Jeff Linderoth <[email protected]>
Hi Gang,
We are back up and running. And, yes, it took me all afternoon to get it going again. There was a (brand new) bug in the QAP "read checkpoint" information that was making the master coredump. (Only with optimization level -O4). I was nearly reduced to tears, but with some supportive words from Jean-Pierre, I made it through.
The First 600K seconds …
The First 35K seconds …
We made it!!!
Sender: [email protected]
Subject: Re: Let the festivities begin.
Hi dear Condor Team,
you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days !
More stats tomorrow !!! We are off celebrating !
condor rules !
cheers,
JP.
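As a rough reading of the figures in this message, the calculation below estimates how many machines were busy on average. It assumes that "Condor Time" means accumulated machine time, that "seven days" is wall-clock time, and that a year is 365 days; none of these is stated in the message, so treat the result as an estimate only.

```python
condor_years = 10.9      # accumulated "Condor Time" reported in the message
wall_days = 7            # elapsed time reported in the message
days_per_year = 365      # assumption: calendar year

machine_days = condor_years * days_per_year
average_busy_machines = machine_days / wall_days

print(round(machine_days, 1), round(average_busy_machines))
```

Under these assumptions the run kept several hundred machines busy on average for the whole week.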
Do not be picky, be agile!!!