An Introduction to Grid Computing Research at Notre Dame
Prof. Douglas Thain
University of Notre Dame
http://www.cse.nd.edu/~dthain
What is Grid Computing?
Grid computing is the idea that we can attack problems of enormous scale by harnessing lots of machines to work on one problem.
When people refer to The Grid, they are imagining a future where computers all over the globe are connected in one colossal system open for use.
Today, we have a variety of large, useful grids, but we don’t yet have The Grid.
Campus Scale Grids at Notre Dame
ND BOB: Bunch of Boxes
– A “closet grid” of conventional PCs.
– 212 CPUs in Stepan Hall.
– http://bob.nd.edu
ND Center for Research Computing
– A “cluster grid” of dedicated rackmount computers downtown.
– 900 CPUs in Union Station.
– http://crc.nd.edu
ND Condor Pool
– A “workstation grid” of classroom and desktop machines used when idle.
– 405 CPUs in Fitzpatrick/Nieuwland.
– http://www.nd.edu/~condor
Volunteer Grids
Simple Idea:
– Most computers are idle 90% of the day.
– Can we harness their unused capacity for real work?
Examples:
– Pioneered by Condor in 1987 at the Univ. of Wisconsin.
– Popularized by SETI@Home in 1999 at Berkeley.
  – Over 300,000 active participants today.
  – Successor is the more general BOINC.
– Folding@Home
  – About 200,000 CPUs today.
  – Makes use of GPU cards: about 100x faster than CPU!
– Xgrid: deployed with every Macintosh today.
Challenge: The user must be flexible!
National Computing Grids
NSF Teragrid
– Open to any NSF research.
– 21,972 CPUs / 220 TB / 6 sites
Open Science Grid
– Open to any university.
– 21,156 CPUs / 83 TB / 61 sites
Condor Worldwide:
– Anyone can install a pool.
– 96,352 CPUs / 1608 sites
PlanetLab
– Open to CS research sites.
– 753 CPUs / 363 sites
Who Needs Grid Computing?
Anyone with unlimited computing needs!
High Energy Physics:
– Simulating the detector of a particle accelerator before turning it on allows one to understand the output.
Biochemistry:
– Simulate complex molecules under different forces to understand how they fold/mate/react.
Biometrics:
– Given a large database of human images, evaluate matching algorithms by comparing all to all.
Climatology:
– Given a starting global climate, simulate how climate develops under varying assumptions or events.
What are the Challenges?
Why don’t we have The Grid yet?
Technical Challenges:
– Enforcing the wishes of all the owners.
– Automatically negotiating expectations.
– Limiting what resources a user can consume.
– Performance and scalability.
– Debugging and troubleshooting.
– Managing access to data!
– Making it easy to use!
An Example of a Workstation Grid at Notre Dame
Computing Environment
[Diagram: CPU/Disk machines grouped into the Fitzpatrick Workstation Cluster, the CCL Research Cluster, the CVRL Research Cluster, and Miscellaneous CSE Workstations, with a Condor MatchMaker and a queue of jobs.]
Example owner policies shown in the diagram:
– I will only run jobs when there is no one working at the keyboard.
– I will only run jobs between midnight and 8 AM.
– I prefer to run a job submitted by a CCL student.
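The owner policies above can be read as predicates that a matchmaker checks before placing a job. What follows is a minimal Python sketch of that idea under stated assumptions; it is not Condor's actual policy language, and `Job`, `MachineState`, `Machine`, and `match` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Job:
    name: str
    owner_group: str  # hypothetical field, e.g. "CCL" or "CVRL"


@dataclass
class MachineState:
    keyboard_idle: bool  # True when nobody is working at the keyboard
    hour: int            # local hour of day, 0-23


def no_preference(job: Job) -> int:
    """Default rank: the owner likes every job equally."""
    return 0


@dataclass
class Machine:
    name: str
    state: MachineState
    # requirement: must be True for the owner to accept the job at all
    requirement: Callable[[MachineState, Job], bool]
    # rank: a higher value means the owner prefers this job
    rank: Callable[[Job], int] = no_preference


def match(job: Job, machines: List[Machine]) -> Optional[Machine]:
    """Return the willing machine that ranks this job highest, or None."""
    willing = [m for m in machines if m.requirement(m.state, job)]
    if not willing:
        return None
    return max(willing, key=lambda m: m.rank(job))


# The three policies from the slide, expressed as (assumed) predicates.
machines = [
    Machine("desktop", MachineState(keyboard_idle=False, hour=14),
            requirement=lambda s, j: s.keyboard_idle),
    Machine("classroom", MachineState(keyboard_idle=True, hour=2),
            requirement=lambda s, j: 0 <= s.hour < 8),
    Machine("ccl-node", MachineState(keyboard_idle=True, hour=14),
            requirement=lambda s, j: True,
            rank=lambda j: 1 if j.owner_group == "CCL" else 0),
]

chosen = match(Job("simulation", "CCL"), machines)
print(chosen.name if chosen else "no match")
```

Separating a hard requirement from a soft rank keeps "I will only..." policies distinct from "I prefer..." policies, which is the distinction the three example statements draw.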
CPU History
Storage History
Flocking Between Universities
– Notre Dame: 300 CPUs
– Wisconsin: 1200 CPUs
– Purdue A: 541 CPUs
– Purdue B: 1016 CPUs
http://www.cse.nd.edu/~ccl/operations/condor/
http://www.cse.nd.edu/~ccl/viz
An Example of Grid Computing Research at Notre Dame
Scalable I/O for Biometrics
Computer Vision Research Lab in CSE
– Goal: Develop robust algorithms for identifying humans from (non-ideal) images.
– Technique: Collect lots of images. Think up clever new matching function. Compare them.
How do you test a matching function?
– For a set S of images,
– Compute F(Si,Sj) for all Si and Sj in S.
– Compare the result matrix to known functions.
Credit: Patrick Flynn at Notre Dame CSE
Computing Similarities
[Figure: the matching function F is applied to every pair of images, filling the upper triangle of a similarity matrix.]
1   0   .1  .8  0   .1
    1   0   .1  .1  0
        1   0   .1  .7
            1   0   0
                1   .1
                    1
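A short Python sketch of how such a matrix could be filled in. It assumes F is symmetric, so only pairs with i <= j are evaluated, which matches the upper-triangular layout of the figure. The `jaccard` function is a toy stand-in, not the lab's matching function, and all names here are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def similarity_matrix(items: Sequence[T],
                      f: Callable[[T, T], float]) -> List[List[float]]:
    """Fill an n x n matrix with f(items[i], items[j]).

    Assumes f is symmetric: each pair with i <= j is evaluated once
    and the score is mirrored into the lower triangle.
    """
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = f(items[i], items[j])
            matrix[i][j] = score
            matrix[j][i] = score
    return matrix


def jaccard(a: frozenset, b: frozenset) -> float:
    """Toy similarity: shared features divided by all features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Each "image" is reduced to a small set of made-up feature labels.
images = [frozenset("abc"), frozenset("abd"), frozenset("xyz")]
result = similarity_matrix(images, jaccard)
for row in result:
    print(row)
```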
A Big Data Problem
Data Size: 10k images of 1 MB = 10 GB
Total I/O: 10k * 10k * 2 MB * 1/2 = 100 TB
Would like to repeat many times!
In order to execute such a workload, we must be careful to partition both the I/O and the CPU needs, taking advantage of distributed capacity.
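The arithmetic above can be checked in a few lines of Python. This sketch assumes decimal units (1 MB = 10^6 bytes) and reads the factor of 1/2 as counting each unordered pair once; both are interpretations, not statements from the slide.

```python
MB = 10**6   # decimal units assumed
GB = 10**9
TB = 10**12

num_images = 10_000
image_size = 1 * MB

# Storing the whole image set once.
data_size = num_images * image_size

# Each comparison reads two images; symmetry roughly halves the pair count.
total_io = num_images * num_images * (2 * image_size) // 2

print(data_size / GB, "GB")
print(total_io / TB, "TB")
```

The input data fits on a single disk, but reading it pair by pair multiplies the traffic by a factor of ten thousand, which is why the I/O pattern matters more than the raw data size.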
Conventional Solution
[Diagram: jobs, CPU/Disk nodes, and central disks.]
Move 200 TB at Runtime!
A More Scalable Solution
[Diagram: jobs running on CPU/Disk nodes.]
1. Break array into MB-size chunks.
2. Replicate data to many disks.
3. Jobs find nearby data copy, and make full use before discarding.
Result: Biometric users can accomplish in three days what used to take one month!
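A rough Python sketch of the replicate-and-read-locally steps, assuming a simple round-robin placement. The actual placement and lookup strategy is not described in the slides, so `place_replicas` and `pick_copy` are hypothetical helpers, and the "make full use before discarding" part is not modeled.

```python
from typing import Dict, List


def place_replicas(chunks: List[str], disks: List[str],
                   copies: int) -> Dict[str, List[str]]:
    """Assign each chunk to `copies` distinct disks, round-robin."""
    if copies > len(disks):
        raise ValueError("cannot place more copies than there are disks")
    placement = {}
    for i, chunk in enumerate(chunks):
        placement[chunk] = [disks[(i + k) % len(disks)]
                            for k in range(copies)]
    return placement


def pick_copy(chunk: str, local_disk: str,
              placement: Dict[str, List[str]]) -> str:
    """Prefer the replica on the job's own disk, else fall back to any."""
    replicas = placement[chunk]
    return local_disk if local_disk in replicas else replicas[0]


chunks = [f"chunk{i}" for i in range(4)]   # data already split into chunks
disks = ["disk0", "disk1", "disk2"]
placement = place_replicas(chunks, disks, copies=2)

# A job on the node with disk2 reads chunk1 locally; one on disk0 cannot.
print(pick_copy("chunk1", "disk2", placement))
print(pick_copy("chunk1", "disk0", placement))
```

With more copies per chunk, more jobs find their data on the local disk, at the cost of extra storage and replication time.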
The All-Pairs Abstraction
All-Pairs:
– For a set S and a function F:
– Compute F(Si,Sj) for all Si and Sj in S.
The end user provides:
– Set S: A bunch of files.
– Function F: A self-contained program.
Applies to lots of different problems:
– Comparing proteins for interactions.
– Searching documents for similarities.
– Any kind of optimization problem.
An All-Pairs Facility at Notre Dame
[Diagram: an All-Pairs Web Portal in front of 100s-1000s of CPU/Disk machines, each running F.]
1. User uploads S and F into the system.
2. Backend decides where to run, how to partition, when to retry failures...
3. Return result matrix to user.
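To make the backend's role concrete, here is a minimal single-process Python sketch of partitioning the result matrix into tiles and retrying a tile that fails. It is an illustration under assumptions, not the facility's implementation: `partition`, `run_all_pairs`, the tile size, and the retry limit are all hypothetical, and a real backend would dispatch tiles to remote machines rather than loop locally.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple


def partition(n: int, block: int) -> List[Tuple[range, range]]:
    """Split the n x n result matrix into tiles of at most block x block."""
    spans = [range(start, min(start + block, n))
             for start in range(0, n, block)]
    return list(product(spans, spans))


def run_all_pairs(items: Sequence, f: Callable, block: int = 2,
                  max_attempts: int = 3) -> Dict[Tuple[int, int], object]:
    """Compute f(items[i], items[j]) for all i, j, one tile at a time."""
    results: Dict[Tuple[int, int], object] = {}
    for rows, cols in partition(len(items), block):
        for attempt in range(1, max_attempts + 1):
            try:
                tile = {(i, j): f(items[i], items[j])
                        for i in rows for j in cols}
            except Exception:
                # Treat any error as a failed tile (a deliberately
                # broad assumption) and retry until attempts run out.
                if attempt == max_attempts:
                    raise
                continue
            results.update(tile)
            break
    return results


matrix = run_all_pairs([1, 2, 3], lambda a, b: a * b)
print(matrix[(1, 2)])
```

Working in tiles means a failure costs only one tile's worth of recomputation, and each tile needs only the items for its rows and columns.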
Research Opportunities
Openings for undergraduate students.
– Research for class credit during the year.
– Research for paycheck during the summer.
– Must enjoy programming and making things work.
Some Project Ideas:
– Build an easy-to-use web front-end for using a grid computing system to process biometric data.
– Find a way to get data from your workstation to 500 other machines as fast as possible.
– Build and manage a filesystem that ties together 500 disks at once to create one gigantic 20 TB system.
For more information...
To learn more about Condor@ND
– http://www.nd.edu/~condor
Prof. Douglas Thain
– [email protected]@nd.edu
– http://www.cse.nd.edu/~dthain
– 382 Fitzpatrick Hall