
Page 1: Commodity Computing

Miron Livny
Computer Sciences Department
University of Wisconsin-Madison

[email protected]
http://www.cs.wisc.edu/~miron

Commodity Computing

Page 2: Commodity Computing


Computing power is everywhere; how can we make it usable by anyone?

Page 3: Commodity Computing


Here is what empowered scientists can do with commodity computing …

Page 4: Commodity Computing


NUG28 - Solved!!!! We are pleased to announce the exact solution of the nug28 quadratic assignment problem (QAP). This problem was derived from the well known nug30 problem using the distance matrix from a 4 by 7 grid, and the flow matrix from nug30 with the last 2 facilities deleted. This is to our knowledge the largest instance from the nugxx series ever provably solved to optimality.

The problem was solved using the branch-and-bound algorithm described in the paper "Solving quadratic assignment problems using convex quadratic programming relaxations," N.W. Brixius and K.M. Anstreicher. The computation was performed on a pool of workstations using the Condor high-throughput computing system in a total wall time of approximately 4 days, 8 hours. During this time the number of active worker machines averaged approximately 200. Machines from UW, UNM and INFN all participated in the computation.

Page 5: Commodity Computing


NUG30 Personal Condor …

For the run we will be flocking to

-- the main Condor pool at Wisconsin (600 processors)

-- the Condor pool at Georgia Tech (190 Linux boxes)

-- the Condor pool at UNM (40 processors)

-- the Condor pool at Columbia (16 processors)

-- the Condor pool at Northwestern (12 processors)

-- the Condor pool at NCSA (65 processors)

-- the Condor pool at INFN (200 processors)

We will be using glide_in to access the Origin 2000 (through LSF) at NCSA. We will use "hobble_in" to access the Chiba City Linux cluster and Origin 2000 here at Argonne.

Page 6: Commodity Computing


It works!!!

Date: Thu, 8 Jun 2000 22:41:00 -0500 (CDT)
From: Jeff Linderoth <[email protected]>
To: Miron Livny <[email protected]>
Subject: Re: Priority

This has been a great day for metacomputing! Everything is going wonderfully. We've had over 900 machines (currently around 890), and all the pieces are working great…

Date: Fri, 9 Jun 2000 11:41:11 -0500 (CDT)
From: Jeff Linderoth <[email protected]>

Still rolling along. Over three billion nodes in about 1 day!

Page 7: Commodity Computing


Up to a Point …

Date: Fri, 9 Jun 2000 14:35:11 -0500 (CDT)
From: Jeff Linderoth <[email protected]>

Hi Gang,

The glory days of metacomputing are over. Our job just crashed. I watched it happen right before my very eyes. It was what I was afraid of -- they just shut down denali, and losing all of those machines at once caused other connections to time out -- and the snowball effect had bad repercussions for the Schedd.

Page 8: Commodity Computing


Back in Business

Date: Fri, 9 Jun 2000 18:55:59 -0500 (CDT)
From: Jeff Linderoth <[email protected]>

Hi Gang,

We are back up and running. And, yes, it took me all afternoon to get it going again. There was a (brand new) bug in the QAP "read checkpoint" information that was making the master coredump. (Only with optimization level -O4). I was nearly reduced to tears, but with some supportive words from Jean-Pierre, I made it through.

Page 9: Commodity Computing


The First 600K seconds …

Page 10: Commodity Computing


NUG30 - Solved!!!

Sender: [email protected]
Subject: Re: Let the festivities begin.

Hi dear Condor Team,

you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days!

More stats tomorrow!!! We are off celebrating!

condor rules!

cheers,

JP.

Page 11: Commodity Computing


What can commodity computing do for you?

Page 12: Commodity Computing


I have a job parallel MW application with 600 workers. How can I benefit from Commodity Computing?

Page 13: Commodity Computing


My Application …

› Study the behavior of F(x,y,z) for 20 values of x, 10 values of y and 3 values of z (20*10*3 = 600)
› F takes on the average 3 hours to compute on a “typical” workstation (total = 1800 hours)
› F requires a “moderate” (128MB) amount of memory
› F performs “little” I/O - (x,y,z) is 15 MB and F(x,y,z) is 40 MB

Page 14: Commodity Computing


Step I - get organized!

› Write a script that creates 600 input files for each of the (x,y,z) combinations

› Write a script that collects the data from the 600 output files

› Turn your workstation into a “Personal Condor”

› Submit a cluster of 600 jobs to your personal Condor

› Go on a long vacation … (2.5 months)
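As a rough sketch of the 600-job cluster mentioned above (not taken from the talk; the executable name, directory naming and memory figure are illustrative assumptions), the submit description could look like:

# Hypothetical submit description for the 600 (x,y,z) evaluations.
# The script above pre-creates one directory per combination: run_dir.0 ... run_dir.599
executable   = F
requirements = (Memory >= 128)
initialdir   = run_dir.$(Process)
# "in" holds the 15 MB (x,y,z) input; "out" receives the 40 MB F(x,y,z) result
input        = in
output       = out
error        = err
log          = F.log
queue 600

Each queued job gets the next value of $(Process) (0-599), so the 600 input directories map one-to-one onto jobs.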

Page 15: Commodity Computing


Your Personal Condor will ...

› ... keep an eye on your jobs and will keep you posted on their progress

› ... implement your policy on when the jobs can run on your workstation

› ... implement your policy on the execution order of the jobs

› … add fault tolerance to your jobs

› … keep a log of your job activities

Page 16: Commodity Computing


[Diagram: your workstation running a personal Condor with 600 Condor jobs queued]

Page 17: Commodity Computing


Step II - build your personal Grid

› Install Condor on the desk-top machine next door.

› Install Condor on the machines in the class room.

› Install Condor on the O2K in the basement.

› Configure these machines to be part of your Condor pool.

› Go on a shorter vacation ...
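A rough sketch of the configuration this involves (not from the slides; the host name is a placeholder and the exact macro set depends on the Condor version): each added machine mainly needs to know who its central manager is and that it should only execute jobs.

# condor_config.local on the added machines (illustrative)
# Point the machine at your pool's central manager (placeholder host name)
CONDOR_HOST = your-workstation.example.edu
# Run only the daemons needed to advertise and execute jobs
DAEMON_LIST = MASTER, STARTD
# Owner policy: when this machine is willing to start jobs
START = TRUE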

Page 18: Commodity Computing


[Diagram: your workstation's personal Condor, with its 600 Condor jobs, now part of a group Condor pool]

Page 19: Commodity Computing


Condor Layers

[Figure: the Condor layers - Application, Application Agent, Customer Agent, Environment Agent, Owner Agent, Local Resource Management, Resource - with Tasks and Jobs as the units of work handed down between the upper layers.]

Page 20: Commodity Computing


Matchmaking in Condor

[Figure: the Customer Agent (CA) on the submit machine and the Resource Agent (RA) on the execution machine advertise themselves (ClassAds A, B, C) to the Collector; the Negotiator, acting as the environment agent, matches them and notifies both parties.]

Page 21: Commodity Computing


[Figure: submission and execution sides - on the submission side the Customer Agent manages the Request Queue, data and object files, and checkpoint files; on the execution side the Owner Agent and Execution Agent run the Application Process, with an Application Agent handling Remote I/O and checkpointing back to the submission site.]

Page 22: Commodity Computing


Step III - Take advantage of your friends

› Get permission from “friendly” Condor pools to access their resources

› Configure your personal Condor to “flock” to these pools

› reconsider your vacation plans ...
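A hedged sketch of the “flock” configuration mentioned above (host names are placeholders; both sides have to agree):

# On your submit machine: pools to flock to when your own pool is busy
FLOCK_TO = condor.friendly-pool.example.edu
# On the friendly pool's central manager: submitters allowed to flock in
FLOCK_FROM = your-workstation.example.edu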

Page 23: Commodity Computing


[Diagram: your workstation's personal Condor with 600 Condor jobs, now drawing on both the group Condor pool and a friendly Condor pool.]

Page 24: Commodity Computing


Think big.

Go to the Grid

Page 25: Commodity Computing


Upgrade to Condor-G

A Grid enabled version of Condor that uses the inter-domain services of Globus to bring Grid resources into the domain of your Personal-Condor

› Supports Grid Universe jobs
› Uses GSIFTP to move glide-in software
› Uses MDS for submit information
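As an illustration only of the Grid Universe item above (the universe keyword and resource syntax varied across Condor-G releases, and the gatekeeper name is a placeholder), a Condor-G job description looks roughly like:

# Submit through Globus to a remote resource (illustrative)
executable      = worker
universe        = globus
# Placeholder Globus gatekeeper / jobmanager
globusscheduler = gatekeeper.example.edu/jobmanager-lsf
output          = out
error           = err
log             = grid.log
queue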

Page 26: Commodity Computing


Condor-glide-in

Enable an application to dynamically turn allocated grid resources into members of a Condor pool for the duration of the allocation.

› Easy to use on different platforms
› Robust
› Supports SMPs

Page 27: Commodity Computing


Step IV - Go for the Grid

› Get access (account(s) + certificate(s)) to a “Computational” Grid

› Submit 599 “Grid Universe” Condor-glide-in jobs to your personal Condor

› Take the rest of the afternoon off ...

Page 28: Commodity Computing


[Diagram: your workstation's personal Condor with 600 Condor jobs, drawing on the group Condor pool, a friendly Condor pool, and a Globus Grid of PBS, LSF and Condor resources reached through 599 glide-ins.]

Page 29: Commodity Computing


Driving Concepts

Page 30: Commodity Computing


HW is a Commodity

Raw computing power is everywhere - on desk-tops, shelves, and racks. It is cheap, dynamic, distributively owned, heterogeneous and evolving.

Page 31: Commodity Computing


“ … Since the early days of mankind the primary motivation for the establishment of communities has been the idea that by being part of an organized group the capabilities of an individual are improved. The great progress in the area of inter-computer communication led to the development of means by which stand-alone processing sub-systems can be integrated into multi-computer ‘communities’. … “

Miron Livny, “Study of Load Balancing Algorithms for Decentralized Distributed Processing Systems,” Ph.D. thesis, July 1983.

Page 32: Commodity Computing


Every Community needs a Matchmaker!

Page 33: Commodity Computing


Why? Because ...

… someone has to bring together community members who have requests for goods and services with members who offer them.

› Both sides are looking for each other
› Both sides have constraints
› Both sides have preferences

Page 34: Commodity Computing


We use Matchmakers to build Computing Communities out of Commodity Components

Page 35: Commodity Computing


The Matchmaking Process

› Advertising Protocol - each party uses a ClassAd to declare its type, target type, constraints on target, ranking of new match, ranking of current match ...
› Matchmaking Algorithm - used by the Matchmaker to create matches
› Match Notification Protocol - used by the Matchmaker to notify matched parties
› Claiming Protocol - used by matched parties to claim each other
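A hedged sketch of the two advertisements involved in the Advertising Protocol above; the attribute names and values are illustrative, written in the bracketed ClassAd style:

Resource (machine) ClassAd:
[
  Type         = "Machine";
  Memory       = 128;
  Arch         = "INTEL";
  OpSys        = "LINUX";
  Requirements = other.Type == "Job" && other.ImageSize <= Memory;
  Rank         = other.Owner == "friendly_user";
]

Customer (job) ClassAd:
[
  Type         = "Job";
  Owner        = "friendly_user";
  ImageSize    = 64;
  Requirements = other.Type == "Machine" && other.Arch == "INTEL" && other.OpSys == "LINUX";
  Rank         = other.KFlops;
]

The matchmaker pairs ads whose Requirements are mutually satisfied and uses each side's Rank to choose among candidate matches.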

Page 36: Commodity Computing


High Throughput Computing

For many experimental scientists, scientific progress and quality of research are strongly linked to computing throughput. In other words, they are less concerned about instantaneous computing power. Instead, what matters to them is the amount of computing they can harness over a month or a year --- they measure computing power in units of scenarios per day, wind patterns per week, instruction sets per month, or crystal configurations per year.

Page 37: Commodity Computing


High Throughput Computing

is a 24-7-365 activity

FLOPY ≠ (60*60*24*7*52) * FLOPS
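For scale (my arithmetic, not from the slide): 60*60*24*7*52 = 31,449,600, i.e. roughly 3.1 x 10^7 seconds in a 52-week year of round-the-clock operation, which is why the floating-point work actually harvested over a year is a different quantity from peak operations per second.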

Page 38: Commodity Computing


Obstacles to HTC

› Ownership Distribution (Sociology)

› Customer Awareness (Education)

› Size and Uncertainties (Robustness)

› Technology Evolution (Portability)

› Physical Distribution (Technology)

Page 39: Commodity Computing


Basic HTC Mechanisms

› Matchmaking - enables requests for services and offers to provide services to find each other (ClassAds).

› Checkpointing - enables preemptive resume scheduling (go ahead and use it as long as it is available!).

› Remote I/O - enables remote (from execution site) access to local (at submission site) data.

› Asynchronous API - enables management of dynamic (opportunistic) resources.

Page 40: Commodity Computing


Master-Worker (MW) computing is Naturally Parallel. It is by no means Embarrassingly Parallel.

Doing it right is by no means trivial.

Page 41: Commodity Computing


The Tool

Page 42: Commodity Computing


Our Answer to High Throughput MW Computing on commodity resources

Page 43: Commodity Computing


The Condor System

A High Throughput Computing system that supports large dynamic MW applications on large collections of distributively owned resources:

› developed, maintained and supported by the Condor Team at the University of Wisconsin - Madison since ‘86
› originally developed for UNIX workstations
› based on matchmaking technology
› fully integrated NT version is available
› deployed world-wide by academia and industry
› more than 1300 CPUs at the U of Wisconsin
› available at www.cs.wisc.edu/condor

Page 44: Commodity Computing


Condor CPUs on the UW Campus

[Bar chart: number of Condor CPUs on the UW campus, split into CS and Other, for '88, '94, '99 and '00; the total grows to more than 1300 by '00.]

Page 45: Commodity Computing


Page 46: Commodity Computing


Some Numbers: UW-CS Pool

6/98-6/00                      4,000,000 hours   ~450 years

“Real” Users                   1,700,000 hours   ~260 years
    CS-Optimization              610,000 hours
    CS-Architecture              350,000 hours
    Physics                      245,000 hours
    Statistics                    80,000 hours
    Engine Research Center        38,000 hours
    Math                          90,000 hours
    Civil Engineering             27,000 hours
    Business                         970 hours

“External” Users                 165,000 hours    ~19 years
    MIT                           76,000 hours
    Cornell                       38,000 hours
    UCSD                          38,000 hours
    CalTech                       18,000 hours

Page 47: Commodity Computing


Key Condor User Services

› Local control - jobs are stored and managed locally by a personal scheduler.

› Priority scheduling - execution order controlled by priority ranking assigned by user.

› Job preemption - re-linked jobs can be checkpointed, suspended, held and resumed.

› Local executing environment preserved - re-linked jobs can have their I/O re-directed to submission site.
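The re-linking referred to above is done with the condor_compile wrapper; roughly (program and file names are placeholders):

# Re-link the program against the Condor checkpointing / remote I/O library
condor_compile gcc -o worker worker.c

# Then submit it as a checkpointable job (illustrative submit description)
universe   = standard
executable = worker
log        = worker.log
queue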

Page 48: Commodity Computing


More Condor User Services

› Powerful and flexible means for selecting execution site (requirements and preferences)

› Logging of job activities.

› Management of large (10K) numbers of jobs per user.

› Support for jobs with dependencies - DAGMan (Directed Acyclic Graph Manager)

› Support for dynamic MW (PVM and File) applications
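For the DAGMan service listed above, job dependencies are described in a small text file of their own; a minimal hedged sketch (submit-file names are placeholders):

# diamond.dag - illustrative DAGMan input
JOB A a.sub
JOB B b.sub
JOB C c.sub
JOB D d.sub
PARENT A CHILD B C
PARENT B C CHILD D

Handing this file to condor_submit_dag makes DAGMan run B and C only after A completes, and D only after both B and C do.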

Page 49: Commodity Computing


A Condor Job-Parallel Submit File

executable   = worker
requirements = ((OS == "Linux2.2") && (Memory >= 64))
rank         = KFLOP
initialdir   = worker_dir.$(process)
input        = in
output       = out
error        = err
log          = Condor_log
queue 1000
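To run it, the file would be handed to condor_submit (the file name is a placeholder), e.g. condor_submit worker.sub; Condor then queues 1000 jobs numbered 0-999, each executing in its own worker_dir.$(process) directory.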

Page 50: Commodity Computing


Task Parallel MW Application

potential = start
FOR cycle = 1 to 36
    FOR location = 1 to 31
        totalEnergy += Energy(location, potential)
    END
    potential = F(totalEnergy)
END

Implemented as a PVM application with the Condor MW services. Two traces (execution and performance) visualized by DEVise.

[Figure labels: Master Tasks, Worker Tasks]

Page 51: Commodity Computing


[Figure: DEVise visualization of the 36*31 worker tasks over a total of 6 hours - logical worker ID vs. time, node utilization, number of workers, one cycle of 31 worker tasks, task duration vs. location, and three successive allocations with a preemption.]

Page 52: Commodity Computing


We have customers who ...

› … have job parallel MW applications with more than 5000 jobs.

› … have task parallel MW applications with more than 1000 tasks.

› … run their job parallel MW applications for more than six months.

› … run their task parallel MW applications for more than four weeks.

Page 53: Commodity Computing

www.cs.wisc.edu/condor

Page 54: Commodity Computing


Who are we?

Page 55: Commodity Computing


The Condor Project (Established ‘85)

Distributed systems CS research performed by a team that faces
› software engineering challenges in a UNIX/Linux/NT environment,
› active interaction with users and collaborators,
› daily maintenance and support challenges of a distributed production environment, and
› educating and training students.

Funding - NSF, NASA, DoE, DoD, IBM, INTEL, Microsoft and the UW Graduate School.

Page 56: Commodity Computing


Users and collaborators

› Scientists - Biochemistry, high energy physics, computer sciences, genetics, …

› Engineers - Hardware design, software building and testing, animation, ...

› Educators - Hardware design tools, distributed systems, networking, ...

Page 57: Commodity Computing


National Grid Efforts

› National Technology Grid - NCSA Alliance (NSF-PACI)

› Information Power Grid - IPG (NASA)

› Particle Physics Data Grid - PPDG (DoE)

› Grid Physics Network - GriPhyN (NSF-ITR)

Page 58: Commodity Computing


Do not be picky, be agile!!!