Condor-G - Your Window to the Grid
Condor - A Project and a System
The Condor Project (Established ‘85)
Distributed systems CS research performed by a team that faces
software engineering challenges in a UNIX/Linux/NT environment,
active interaction with users and collaborators,
daily maintenance and support challenges of a distributed production environment,
and educating and training students.
Information Power Grid (NASA)
Grid Physics Network (NSF-ITR)
www.cs.wisc.edu/condor
Driving Concepts
“… Since the early days of mankind the primary motivation for the
establishment of communities has been the idea that by being part
of an organized group the capabilities of an individual are
improved. The great progress in the area of inter-computer
communication led to the development of means by which stand-alone
processing sub-systems can be integrated into multi-computer
‘communities’. …”
Miron Livny, “Study of Load Balancing Algorithms for Decentralized
Distributed Processing Systems,” Ph.D. thesis, July 1983.
Why? Because ...
.. someone has to bring together members of the community who have
requests for goods and services with members who offer them.
Both sides are looking for each other
Both sides have constraints
Both sides have preferences
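The idea of pairing requests with offers, where each side has constraints and preferences, can be sketched in a few lines of Python. This is only an illustration of the concept, not Condor's matchmaking implementation: the `Ad` class, the `match` function, and the attribute names (`memory_mb`, `kflops`, `owner`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Ad:
    """One side of a match: attributes plus a constraint and a preference."""
    name: str
    attrs: dict
    # Constraint: must return True for the other party to be acceptable.
    requirements: Callable[[dict], bool]
    # Preference: a higher value means the other party is more desirable.
    rank: Callable[[dict], float]


def match(request: Ad, offers: List[Ad]) -> Optional[Ad]:
    """Return the mutually acceptable offer that the request ranks highest."""
    acceptable = [
        offer for offer in offers
        if request.requirements(offer.attrs) and offer.requirements(request.attrs)
    ]
    if not acceptable:
        return None
    # Ties on the request's rank are broken by the offer's own preference.
    return max(acceptable,
               key=lambda o: (request.rank(o.attrs), o.rank(request.attrs)))


# Hypothetical job: needs at least 128 MB and prefers faster machines.
job = Ad(
    name="job",
    attrs={"owner": "alice", "memory_mb": 128},
    requirements=lambda m: m["memory_mb"] >= 128,
    rank=lambda m: m["kflops"],
)

# Hypothetical machines: all accept any job and have no preference.
machines = [
    Ad("slow", {"memory_mb": 256, "kflops": 20_000}, lambda j: True, lambda j: 0.0),
    Ad("fast", {"memory_mb": 256, "kflops": 90_000}, lambda j: True, lambda j: 0.0),
    Ad("small", {"memory_mb": 64, "kflops": 500_000}, lambda j: True, lambda j: 0.0),
]

best = match(job, machines)
print(best.name if best else "no match")  # prints "fast"
```

The machine named "small" has the highest rating but fails the job's memory constraint, so the preference is only applied among offers that both sides accept.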
High Throughput Computing
For many experimental scientists, scientific progress and quality
of research are strongly linked to computing throughput. In other
words, they are less concerned about instantaneous computing power.
Instead, what matters to them is the amount of computing they can
harness over a month or a year --- they measure computing power in
units of scenarios per day, wind patterns per week, instruction
sets per month, or crystal configurations per year.
HW is a Commodity
Raw computing power is everywhere - on desk-tops, shelves, and
racks. It is
cheap,
dynamic,
Naturally Parallel.
Doing it right is by no means trivial.
The Condor System
A High Throughput Computing system that supports large dynamic
Master-Worker (MW) applications on large collections of
distributively owned resources, developed, maintained and supported
by the Condor Team at the University of Wisconsin - Madison since ‘86.
Originally developed for UNIX workstations
Based on matchmaking technology.
More than 1300 CPUs at the U of Wisconsin.
Available at www.cs.wisc.edu/condor.
“Real” Users 1,700,000 hours ~260 years
CS-Optimization 610,000 hours
CS-Architecture 350,000 hours
Physics 245,000 hours
Statistics 80,000 hours
Math 90,000 hours
MIT 76,000 hours
Cornell 38,000 hours
UCSD 38,000 hours
CalTech 18,000 hours
600 workers.
The Application …
Study the behavior of F(x,y,z) for 20 values of x, 10 values of y
and 3 values of z (20*10*3 = 600)
F takes on the average 3 hours to compute on a “typical”
workstation (total = 1800 hours)
F requires a “moderate” (128MB) amount of memory
F performs “little” I/O - (x,y,z) is 15 MB and F(x,y,z) is 40 MB
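The size of this workload follows directly from the numbers above. The short Python sketch below reproduces the arithmetic; the variable names are illustrative, and the totals assume every job takes exactly the 3-hour average and that 1 GB = 1000 MB.

```python
# Parameter sweep dimensions taken from the slide.
x_values, y_values, z_values = 20, 10, 3
hours_per_job = 3                  # average run time of F on a "typical" workstation
input_mb, output_mb = 15, 40       # size of (x,y,z) and of F(x,y,z)

jobs = x_values * y_values * z_values       # 600 jobs
total_hours = jobs * hours_per_job          # 1800 hours
total_days = total_hours / 24               # 75 days on a single workstation
total_input_gb = jobs * input_mb / 1000     # 9 GB of input
total_output_gb = jobs * output_mb / 1000   # 24 GB of output

print(jobs, total_hours, total_days, total_input_gb, total_output_gb)
# prints: 600 1800 75.0 9.0 24.0
```

Run back to back on one workstation, 1800 hours is 75 days, or roughly two and a half months.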
Turn your workstation into a “Personal Condor”
Write a script that creates 600 input files, one for each of the
(x,y,z) combinations
Submit a cluster of 600 jobs to your personal Condor
Write a script that collects the data from the 600 output
files
Go on a long vacation … (2.5 months)
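The two scripts mentioned above might look like the following Python sketch. It assumes the directory layout used by the submit file in this talk (`worker_dir.<n>` holding files named `in` and `out`); the function names and the one-line "x y z" input format are hypothetical, since the talk does not say what F expects as input.

```python
import itertools
import tempfile
from pathlib import Path


def create_inputs(base, xs, ys, zs):
    """Create worker_dir.<n>/in for every (x, y, z) combination; return the job count."""
    count = 0
    for n, (x, y, z) in enumerate(itertools.product(xs, ys, zs)):
        job_dir = Path(base) / f"worker_dir.{n}"
        job_dir.mkdir(parents=True, exist_ok=True)
        # Hypothetical input format: the three parameter values on one line.
        (job_dir / "in").write_text(f"{x} {y} {z}\n")
        count = n + 1
    return count


def collect_outputs(base, n_jobs):
    """Return ({job number: contents of out}, [job numbers with no out file yet])."""
    results, missing = {}, []
    for n in range(n_jobs):
        out_file = Path(base) / f"worker_dir.{n}" / "out"
        if out_file.exists():
            results[n] = out_file.read_text()
        else:
            missing.append(n)
    return results, missing


# Demonstration in a temporary directory: 20 * 10 * 3 = 600 job directories.
with tempfile.TemporaryDirectory() as tmp:
    n_jobs = create_inputs(tmp, range(20), range(10), range(3))
    results, missing = collect_outputs(tmp, n_jobs)
    print(n_jobs, len(results), len(missing))  # prints: 600 0 600
```

Before any job has run, every job number is reported as missing; rerunning `collect_outputs` after the jobs finish gathers whatever `out` files exist.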
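# Prefer machines with a higher KFlops rating
rank = KFlops
# Each job runs in its own directory; $(process) is the job's number in the cluster
initialdir = worker_dir.$(process)
# Standard input, output and error files, relative to initialdir
input = in
output = out
error = err
# Job event log
log = log
# Submit 600 jobs, with $(process) running from 0 to 599
queue 600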
Your Personal Condor will ...
... keep an eye on your jobs and keep you posted on their progress
... implement your policy on when the jobs can run on your workstation
... implement your policy on the execution order of the jobs
... add fault tolerance to your jobs
... keep a log of your job activities
[Diagram: your workstation running your personal Condor]
Install Condor on the desk-top machine next door.
Install Condor on the machines in the class room.
Install Condor on the O2K in the basement.
Configure these machines to be part of your Condor pool.
Go on a shorter vacation ...
[Diagram: your workstation running your personal Condor]
Get permission from “friendly” Condor pools to access their
resources
Configure your personal Condor to “flock” to these pools
Reconsider your vacation plans ...
Condor-G
A Grid-enabled version of Condor that uses the inter-domain
services of Globus to bring Grid resources into the domain of your
Personal Condor
Supports Grid Universe jobs
Uses MDS for submit information
Condor-glide-in
Enable an application to dynamically turn allocated grid resources
into members of a Condor pool for the duration of the
allocation.
Easy to use on different platforms
Robust
X509 Certificates
We are in the process of adding X509-based authentication
capabilities to Condor services.
Job submission
Move glide-in tar files
Access disk caches
Grid Universe
Grid Universe jobs submitted to Condor are transformed into
Globus jobs and submitted (via GlobusRun) to a grid resource.
Use MDS to locate resource
Monitor status of job on remote resource
Report status via Condor services
Rewrite in progress with new Globus library.
Get access (account(s) + certificate(s)) to a “Computational” Grid
Submit 599 “Grid Universe” Condor glide-in jobs to your personal Condor
Take the rest of the afternoon off ...
[Diagram: your workstation]
An example - NUG28
We are pleased to announce the exact solution of the nug28
quadratic assignment problem (QAP). This problem was derived from
the well known nug30 problem using the distance matrix from a 4 by
7 grid, and the flow matrix from nug30 with the last 2 facilities
deleted. This is to our knowledge the largest instance from the
nugxx series ever provably solved to optimality.
The problem was solved using the branch-and-bound algorithm
described in the paper "Solving quadratic assignment problems using
convex quadratic programming relaxations," N.W. Brixius and K.M.
Anstreicher. The computation was performed on a pool of
workstations using the Condor high-throughput computing system in a
total wall time of approximately 4 days, 8 hours. During this time
the number of active worker machines averaged approximately 200.
Machines from UW, UNM and INFN all participated in the
computation.
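As a rough check on the scale of the nug28 run, the wall time and average worker count quoted above can be multiplied out. The Python sketch below treats both figures as exact, although the announcement gives them as approximations, and uses a 365-day year; the variable names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365

wall_hours = 4 * 24 + 8     # "4 days, 8 hours" of wall time = 104 hours
avg_workers = 200           # average number of active worker machines

machine_hours = wall_hours * avg_workers            # aggregate machine time
machine_years = machine_hours / HOURS_PER_YEAR

print(f"{machine_hours} machine-hours, about {machine_years:.1f} machine-years")
# prints: 20800 machine-hours, about 2.4 machine-years
```

This counts time during which workers were active, not necessarily time spent doing useful computation.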
-- the Condor pool at Georgia Tech (190 Linux boxes)
-- the Condor pool at UNM (40 processors)
-- the Condor pool at Columbia (16 processors)
-- the Condor pool at Northwestern (12 processors)
-- the Condor pool at NCSA (65 processors)
-- the Condor pool at INFN (200 processors)
We will be using glide_in to access the Origin 2000 (through LSF) at NCSA.
We will use "hobble_in" to access the Chiba City Linux cluster and
Origin 2000 here at Argonne.
From: Jeff Linderoth
To: Miron Livny
Subject: Re: Priority
This has been a great day for metacomputing! Everything is going
wonderfully. We've had over 900 machines (currently around 890),
and all the pieces are working great…
Date: Fri, 9 Jun 2000 11:41:11 -0500 (CDT)
From: Jeff Linderoth
Still rolling along. Over three billion nodes in about 1 day!
From: Jeff Linderoth
Hi Gang,
The glory days of metacomputing are over. Our job just crashed. I
watched it happen right before my very eyes. It was what I was
afraid of -- they just shut down denali, and losing all of those
machines at once caused other connections to time out -- and the
snowball effect had bad repercussions for the Schedd.
From: Jeff Linderoth
Hi Gang,
We are back up and running. And, yes, it took me all afternoon to
get it going again. There was a (brand new) bug in the QAP "read
checkpoint" information that was making the master core dump (only
with optimization level -O4). I was nearly reduced to tears, but
with some supportive words from Jean-Pierre, I made it through.
Hi dear Condor Team,
you all have been amazing. NUG30 required 10.9 years of Condor
Time. In just seven days!
More stats tomorrow!!! We are off celebrating!
condor rules!
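If "Condor Time" is read as aggregate machine time, the two figures in this message imply an average level of concurrency. The Python sketch below works it out under that assumption, using a 365-day year and treating "seven days" as exact; the variable names are illustrative.

```python
condor_years = 10.9   # aggregate "Condor Time" reported for NUG30
wall_days = 7         # elapsed time of the run

# Average number of machines that must have been busy at any moment.
avg_machines = condor_years * 365 / wall_days

print(f"about {avg_machines:.0f} machines busy on average")
# prints: about 568 machines busy on average
```

An average in the high 500s is consistent with the earlier report of a peak of over 900 machines.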