
An Introduction to Grid Computing Research at Notre Dame

Prof. Douglas Thain

University of Notre Dame

http://www.cse.nd.edu/~dthain

What is Grid Computing?

Grid computing is the idea that we can attack problems of enormous scale by harnessing lots of machines to work on one problem.

When people refer to The Grid, they are imagining a future where computers all over the globe are connected in one colossal system open for use.

Today, we have a variety of large, useful grids, but we don’t yet have The Grid.

Campus Scale Grids at Notre Dame

ND BOB: Bunch of Boxes
– A “closet grid” of conventional PCs.
– 212 CPUs in Stepan Hall
– http://bob.nd.edu

ND Center for Research Computing
– A “cluster grid” of dedicated rackmount computers downtown.
– 900 CPUs in Union Station.
– http://crc.nd.edu

ND Condor Pool
– A “workstation grid” of classroom and desktop machines used when idle.
– 405 CPUs in Fitzpatrick/Nieuwland
– http://www.nd.edu/~condor

Volunteer Grids

Simple Idea:
– Most computers are idle 90% of the day.
– Can we harness their unused capacity for real work?

Examples:
– Pioneered by Condor in 1987 at the University of Wisconsin.
– Popularized by SETI@Home in 1999 at Berkeley.
    Over 300,000 active participants today.
    Successor is the more general BOINC.
– Folding@Home
    About 200,000 CPUs today.
    Makes use of GPU cards: about 100x faster than CPU!
– Xgrid: deployed with every Macintosh today.

Challenge: The user must be flexible!

National Computing Grids

NSF Teragrid
– Open to any NSF research.
– 21,972 CPUs / 220 TB / 6 sites

Open Science Grid
– Open to any university.
– 21,156 CPUs / 83 TB / 61 sites

Condor Worldwide:
– Anyone can install a pool.
– 96,352 CPUs / 1608 sites

PlanetLab
– Open to CS research sites.
– 753 CPUs / 363 sites

Who Needs Grid Computing?

Anyone with unlimited computing needs!

High Energy Physics:
– Simulating the detector of a particle accelerator before turning it on allows one to understand the output.

Biochemistry:
– Simulate complex molecules under different forces to understand how they fold/mate/react.

Biometrics:
– Given a large database of human images, evaluate matching algorithms by comparing all to all.

Climatology:
– Given a starting global climate, simulate how climate develops under varying assumptions or events.

What are the Challenges?

Why don’t we have The Grid yet?

Technical Challenges:
– Enforcing the wishes of all the owners.
– Automatically negotiating expectations.
– Limiting what resources a user can consume.
– Performance and scalability.
– Debugging and troubleshooting.
– Managing access to data!
– Making it easy to use!

An Example of a Workstation Grid at Notre Dame

Computing Environment

[Diagram: CPU/disk machines grouped into the Fitzpatrick Workstation Cluster, the CCL Research Cluster, the CVRL Research Cluster, and Miscellaneous CSE Workstations. Queued jobs are assigned to machines by the Condor MatchMaker.]

Each machine owner states a policy, for example:
– “I will only run jobs when there is no-one working at the keyboard.”
– “I will only run jobs between midnight and 8 AM.”
– “I prefer to run a job submitted by a CCL student.”
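The three owner policies quoted on this slide (run only when nobody is at the keyboard, run only between midnight and 8 AM, prefer jobs from CCL students) can be read as two kinds of rule: a hard condition on whether a machine accepts work at all, and a soft preference among queued jobs. The following is a minimal sketch of that idea, not Condor's actual matchmaking mechanism. The `Job`, `Machine`, and `match` names, the `willing` and `rank` fields, and the machine names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Job:
    name: str
    group: str  # submitting group, e.g. "CCL" (hypothetical field)


@dataclass
class Machine:
    name: str
    # Hard policy: is the owner willing to run any job in this state?
    willing: Callable[[dict], bool]
    # Soft policy: a higher rank means the owner prefers this job.
    rank: Callable[[Job], int] = lambda job: 0


def match(jobs: List[Job], machines: List[Machine],
          state: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Give each willing machine the queued job its owner ranks highest."""
    queue = list(jobs)
    assignments = []
    for m in machines:
        if not queue or not m.willing(state[m.name]):
            continue
        best = max(queue, key=m.rank)  # ties go to the earliest queued job
        queue.remove(best)
        assignments.append((m.name, best.name))
    return assignments


machines = [
    Machine("desktop", willing=lambda s: s["keyboard_idle"]),
    Machine("classroom", willing=lambda s: 0 <= s["hour"] < 8),
    Machine("ccl01", willing=lambda s: True,
            rank=lambda job: 1 if job.group == "CCL" else 0),
]
jobs = [Job("j1", "CVRL"), Job("j2", "CCL")]

# 2 PM with someone typing at the desktop: only ccl01 accepts work,
# and it picks the CCL job ahead of the CVRL job.
afternoon = {m.name: {"keyboard_idle": m.name != "desktop", "hour": 14}
             for m in machines}
print(match(jobs, machines, afternoon))
```

Keeping the hard condition and the preference as separate functions means an owner can change one without touching the other, which is the property the slide's three example policies rely on.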

CPU History

Storage History

Flocking Between Universities

– Notre Dame: 300 CPUs
– Wisconsin: 1200 CPUs
– Purdue A: 541 CPUs
– Purdue B: 1016 CPUs

http://www.cse.nd.edu/~ccl/operations/condor/

http://www.cse.nd.edu/~ccl/viz

An Example of Grid Computing Research at Notre Dame

Scalable I/O for Biometrics

Computer Vision Research Lab in CSE
– Goal: Develop robust algorithms for identifying humans from (non-ideal) images.
– Technique: Collect lots of images. Think up clever new matching function. Compare them.

How do you test a matching function?
– For a set S of images,
– Compute F(Si,Sj) for all Si and Sj in S.
– Compare the result matrix to known functions.

Credit: Patrick Flynn at Notre Dame CSE

Computing Similarities

Similarity matrix produced by F (upper triangle):

1    0    .1   .8   0    .1
     1    0    .1   .1   0
          1    0    .1   .7
               1    0    0
                    1    .1
                         1

A Big Data Problem

Data Size: 10k images of 1 MB = 10 GB

Total I/O: 10k * 10k * 2 MB * 1/2 = 100 TB

Would like to repeat many times!

In order to execute such a workload, we must be careful to partition both the I/O and the CPU needs, taking advantage of distributed capacity.
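The two figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 MB = 10^6 bytes), which is what makes the slide's round numbers come out exactly. The variable names are illustrative.

```python
MB = 10**6
GB = 10**9
TB = 10**12

n_images = 10_000
image_size = 1 * MB

# Storing the whole image set once.
data_size = n_images * image_size

# Each comparison reads two images (2 MB). The factor 1/2 counts each
# unordered pair once, since (Si, Sj) and (Sj, Si) cover the same two images.
comparisons = n_images * n_images // 2
total_io = comparisons * 2 * image_size

print(data_size / GB, "GB")  # 10.0 GB
print(total_io / TB, "TB")   # 100.0 TB
```

The gap between 10 GB of stored data and 100 TB of reads is a factor of 10,000, which is why where the data sits relative to the CPUs matters so much for this workload.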

Conventional Solution

[Diagram: a row of CPU/disk machines running jobs, with the data held on two separate disks.]

Move 200 TB at Runtime!

A More Scalable Solution

[Diagram: a row of CPU/disk machines, with jobs running next to the disks.]

1. Break array into MB-size chunks.

2. Replicate data to many disks.

3. Jobs find nearby data copy, and make full use before discarding.

Result: Biometric users can accomplish in three days what used to take one month!
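The three steps on this slide (chunk the data, replicate the chunks, read from a nearby copy) can be sketched as follows. This is an illustration under simple assumptions, not the system described in the talk: placement is plain round-robin, "nearby" just means "on the job's own disk", and the function names `make_chunks`, `replicate`, and `pick_replica` are hypothetical.

```python
from typing import Dict, List, Sequence


def make_chunks(items: Sequence, chunk_size: int) -> List[Sequence]:
    """Step 1: break the data into fixed-size chunks (the last may be shorter)."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def replicate(n_chunks: int, disks: List[str], copies: int) -> Dict[int, List[str]]:
    """Step 2: place each chunk on `copies` consecutive disks, round-robin.

    Assumes copies <= len(disks), so a chunk never lands twice on one disk.
    """
    return {c: [disks[(c + k) % len(disks)] for k in range(copies)]
            for c in range(n_chunks)}


def pick_replica(chunk_id: int, placement: Dict[int, List[str]],
                 local_disk: str) -> str:
    """Step 3: prefer the copy on the job's own disk, else fall back to any copy."""
    replicas = placement[chunk_id]
    return local_disk if local_disk in replicas else replicas[0]


chunks = make_chunks(list(range(10)), chunk_size=4)
placement = replicate(len(chunks), ["d0", "d1", "d2", "d3"], copies=2)

print(chunks)                               # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(pick_replica(1, placement, "d2"))     # d2 (a local copy exists)
print(pick_replica(1, placement, "d0"))     # d1 (no local copy, use a remote one)
```

Making "full use before discarding" would mean that a job holding a chunk runs every comparison that needs it before dropping it. That scheduling is left out here to keep the sketch short.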

The All-Pairs Abstraction

All-Pairs:
– For a set S and a function F:
– Compute F(Si,Sj) for all Si and Sj in S.

The end user provides:
– Set S: A bunch of files.
– Function F: A self-contained program.

Applies to lots of different problems:
– Comparing proteins for interactions.
– Searching documents for similarities.
– Any kind of optimization problem.
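Stated as code, the abstraction is a single function of S and F. The sketch below is an in-memory stand-in: on the slide, S is a set of files and F is a self-contained program, while here S is a list of strings and F is an ordinary Python function. The `all_pairs` name and the `jaccard` example function are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def all_pairs(S: Sequence[T], F: Callable[[T, T], float]) -> List[List[float]]:
    """Return the matrix M with M[i][j] = F(S[i], S[j]) for all i and j."""
    return [[F(a, b) for b in S] for a in S]


def jaccard(a: str, b: str) -> float:
    """Example F: fraction of distinct characters the two strings share."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


matrix = all_pairs(["abc", "abd", "xyz"], jaccard)
for row in matrix:
    print(row)
```

The user supplies only S and F. How the work is split and where it runs are left to the system behind the interface, which is what lets the same interface serve the other problems listed on the slide.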

An All-Pairs Facility at Notre Dame

[Diagram: an All-Pairs Web Portal in front of 100s-1000s of CPU/disk machines, each running F.]

1 – User uploads S and F into the system.

2 – Backend decides where to run, how to partition, when to retry failures...

3 – Return result matrix to user.
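Two of the backend decisions named in step 2, how to partition and when to retry, can be illustrated in a few lines. This is only a sketch under assumed choices: the result matrix is cut into square blocks of a fixed size, and a failed task is simply re-run up to a fixed number of attempts. The names `partition` and `run_with_retry` are hypothetical and do not describe the Notre Dame backend.

```python
from typing import Callable, List, Tuple

Range = Tuple[int, int]  # half-open index range [start, stop)


def partition(n: int, block: int) -> List[Tuple[Range, Range]]:
    """Cut an n x n result matrix into (row range, column range) blocks."""
    edges = [(i, min(i + block, n)) for i in range(0, n, block)]
    return [(rows, cols) for rows in edges for cols in edges]


def run_with_retry(task: Callable[[], object], attempts: int = 3) -> object:
    """Run task, re-running it after a failure, up to `attempts` tries."""
    last_error = None
    for _ in range(attempts):
        try:
            return task()
        except Exception as err:  # a real backend would be more selective
            last_error = err
    raise RuntimeError(f"task failed after {attempts} attempts") from last_error


blocks = partition(5, block=2)
print(len(blocks))   # 9 blocks for a 5 x 5 matrix
print(blocks[0])     # ((0, 2), (0, 2))
print(blocks[-1])    # ((4, 5), (4, 5))
```

Each block is an independent unit of work, so a failure costs one block's worth of recomputation rather than the whole matrix.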

Research Opportunities

Openings for undergraduate students.
– Research for class credit during the year.
– Research for paycheck during the summer.
– Must enjoy programming and making things work.

Some Project Ideas:
– Build an easy-to-use web front-end for using a grid computing system to process biometric data.
– Find a way to get data from your workstation to 500 other machines as fast as possible.
– Build and manage a filesystem that ties together 500 disks at once to create one gigantic 20TB system.

For more information...

To learn more about Condor@ND
– http://www.nd.edu/~condor

Prof. Douglas Thain
– [email protected]@nd.edu
– http://www.cse.nd.edu/~dthain
– 382 Fitzpatrick Hall