Scaling Up Data Intensive Scientific Applications to Campus Grids
Douglas Thain
University of Notre Dame
LSAP Workshop
Munich, June 2009

Overview

Challenges in Using Campus Grids

Solution: Abstractions

Examples and Applications
– All-Pairs: Biometrics, Data Mining
– Wavefront: Genomics and Economics
– Some-Pairs: Genomics

Abstractions, Workflows, and Languages

What is a Campus Grid?

A campus grid is an aggregation of all available computing power found in an institution:
– Idle cycles from desktop machines.
– Unused cycles from dedicated clusters.

Examples of campus grids:
– 600 CPUs at the University of Notre Dame
– 2000 CPUs at the University of Wisconsin
– 13,000 CPUs at Purdue University

Cluster, cloud, and grid are all similar concepts.

Campus grids can give us access to more machines than we can possibly use.

But are they easy to use?

Example: Biometrics Research

Goal: Design a robust face comparison function.

[Figure: the comparison function F applied to two pairs of images, producing scores of 0.05 and 0.97.]

Similarity Matrix Construction

1.0  0.8  0.1  0.0  0.0  0.1
     1.0  0.0  0.1  0.1  0.0
          1.0  0.0  0.1  0.3
               1.0  0.0  0.0
                    1.0  0.1
                         1.0

Challenge Workload:
– 60,000 iris images
– 1 MB each
– 0.02 s per F
– 833 CPU-days
– 600 TB of I/O
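The CPU-days figure in the challenge workload can be checked with a short calculation. This is a sketch that assumes F is evaluated on every ordered pair of images, including each image with itself; the function name is hypothetical, and the I/O figure is not rederived here because the slide does not state how it was computed.

```python
# Back-of-the-envelope check of the challenge workload quoted on the slide.
N_IMAGES = 60_000        # iris images (from the slide)
SECONDS_PER_F = 0.02     # runtime of one call to F (from the slide)


def sequential_cpu_days(n_images: int, seconds_per_call: float) -> float:
    """CPU-days needed to evaluate F on every ordered pair of images."""
    comparisons = n_images * n_images          # assumption: full n x n matrix
    return comparisons * seconds_per_call / 86_400  # 86,400 seconds per day


print(round(sequential_cpu_days(N_IMAGES, SECONDS_PER_F)))  # 833
```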

I have 60,000 iris images acquired in my research lab. I want to reduce each one to a feature space, and then compare all of them to each other. I want to spend my time doing science, not struggling with computers.

I have a laptop.

I own a few machines. We have access to a campus grid.

What should I do?

We said: Try using our campus grid!

(How hard could it be?)

Non-Expert User Using 500 CPUs

Try 1: Each F is a batch job.
Failure: Dispatch latency >> F runtime.

Try 2: Each row is a batch job.
Failure: Too many small operations on the file system.

Try 3: Bundle all files into one package.
Failure: Everyone loads 1GB at once.

Try 4: User gives up and attempts to solve an easier or smaller problem.

[Diagram for Tries 1-3: HN dispatching F tasks to a row of CPUs.]

Why are Grids Hard to Use?

System Properties:
– Wildly varying resource availability.
– Heterogeneous resources.
– Unpredictable preemption.
– Unexpected resource limits.

User Considerations:
– Jobs can’t run for too long... but they can’t run too quickly, either!
– I/O operations must be carefully matched to the capacity of clients, servers, and networks.
– Users often do not even have access to the necessary information to make good choices!

Overview

Challenges in Using Campus Grids

Solution: Abstractions

Examples and Applications
– All-Pairs: Biometrics, Data Mining
– Wavefront: Genomics and Economics
– Some-Pairs: Genomics

Abstractions, Workflows, and Languages

Observation

In a given field of study, many people repeat the same pattern of work many times, making slight changes to the data and algorithms.

If the system knows the overall pattern in advance, then it can do a better job of executing it reliably and efficiently.

If the user knows in advance what patterns are allowed, then they have a better idea of how to construct their workloads.

What’s the Most Successful Parallel Programming Language?

OpenGL:
– A declarative specification of a workload.
– Ported to a wide variety of HW over 20 years.
– The graphics pipeline is very specific:
    Transform points to coordinate space.
    Connect polygons to transformed points.
    Stretch textures across polygons.
    Sort everything by Z-depth.
– Can we apply the same idea to grids?

Abstractions for Distributed Computing

Abstraction: a declarative specification of the computation and data of a workload.

A restricted pattern, not meant to be a general purpose programming language.

Uses data structures instead of files.

Provides users with a bright path.

Regular structure makes it tractable to model and predict performance.

Abstractions as Higher-Order Functions

AllPairs( set A, set B, function F )
– returns M[i,j] = F( A[i], B[j] )

SomePairs( set A, list(i,j) L, function F )
– returns list of F( A[i], A[j] )

Wavefront( matrix R, function F )
– returns R[i,j] = F( R[i-1,j], R[i,j-1] )
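The three signatures above can be read as ordinary higher-order functions. The following is a minimal sequential sketch of their meaning only, not the distributed implementation described in the talk. It assumes 0-based indices (the slides' examples count from 1) and assumes that row 0 and column 0 of the Wavefront matrix are supplied as boundary values.

```python
from typing import Callable, List, Sequence, Tuple


def all_pairs(a: Sequence, b: Sequence, f: Callable) -> List[list]:
    """Return M with M[i][j] = f(a[i], b[j]) for every i, j."""
    return [[f(x, y) for y in b] for x in a]


def some_pairs(a: Sequence, pairs: Sequence[Tuple[int, int]], f: Callable) -> list:
    """Return [f(a[i], a[j])] for only the listed (i, j) index pairs."""
    return [f(a[i], a[j]) for i, j in pairs]


def wavefront(r: List[list], f: Callable) -> List[list]:
    """Fill r in place with r[i][j] = f(r[i-1][j], r[i][j-1]).

    Row 0 and column 0 are treated as given boundary values (an assumption).
    """
    for i in range(1, len(r)):
        for j in range(1, len(r[i])):
            r[i][j] = f(r[i - 1][j], r[i][j - 1])
    return r
```

For example, `all_pairs([1, 2], [10, 20], lambda x, y: x + y)` evaluates F four times, while `some_pairs` evaluates it once per listed pair.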

Working with Abstractions

[Diagram: the input sets and the function F are expressed as a compact data structure, AllPairs( A, B, F ), which a custom workflow engine runs on the campus grid.]

Overview

Challenges in Using Campus Grids

Solution: Abstractions

Examples and Applications
– All-Pairs: Biometrics, Data Mining
– Wavefront: Genomics and Economics
– Some-Pairs: Genomics

Abstractions, Workflows, and Languages

All-Pairs Abstraction

AllPairs( set A, set B, function F )
returns matrix M where
M[i][j] = F( A[i], B[j] ) for all i,j

allpairs A B F.exe

[Diagram: F is applied to every combination of A1, A2, A3 with B1, B2, B3 to fill the result matrix.]

How Does the Abstraction Help?

The custom workflow engine:
– Chooses the right data transfer strategy.
– Chooses the right number of resources.
– Chooses blocking of functions into jobs.
– Recovers from a larger number of failures.
– Predicts overall runtime accurately.

All of these tasks are nearly impossible for arbitrary workloads, but are tractable (not trivial) to solve for a specific abstraction.
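As one illustration of "blocking of functions into jobs", the sketch below groups calls to F into square sub-blocks of the result matrix so that each batch job runs long enough to amortize dispatch latency. The policy, the target job length, and both function names are assumptions made for illustration; the slides do not describe how the real engine chooses its blocks.

```python
import math
from typing import Iterator, Tuple


def choose_block_side(seconds_per_call: float, target_job_seconds: float) -> int:
    """Smallest square block side whose job runs at least target_job_seconds."""
    calls_needed = target_job_seconds / seconds_per_call
    return max(1, math.ceil(math.sqrt(calls_needed)))


def blocks(n_rows: int, n_cols: int, side: int) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (row_start, row_end, col_start, col_end) covering the whole matrix."""
    for r0 in range(0, n_rows, side):
        for c0 in range(0, n_cols, side):
            yield (r0, min(r0 + side, n_rows), c0, min(c0 + side, n_cols))


# With F at 0.02 s per call and a hypothetical 10-minute target per job:
side = choose_block_side(0.02, 600.0)
print(side)  # 174
```

Each yielded block would become one job that reads only the rows of A and B it needs, which is also what lets the engine reason about data transfer per job.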

Distribute Data Via Spanning Tree

Choose the Right # of CPUs

Conventional vs Abstraction

All-Pairs in Production

Our All-Pairs implementation has provided over 57 CPU-years of computation to the ND biometrics research group over the last year.

Largest run so far: 58,396 irises from the Face Recognition Grand Challenge. The largest experiment ever run on publicly available data.

Competing biometric research relies on samples of 100-1000 images, which can miss important population effects.

Reduced computation time from 833 days to 10 days, making it feasible to repeat multiple times for a graduate thesis. (We can go faster yet.)

Overview

Challenges in Using Campus Grids

Solution: Abstractions

Examples and Applications
– All-Pairs: Biometrics, Data Mining
– Wavefront: Genomics and Economics
– Some-Pairs: Genomics

Abstractions, Workflows, and Languages

Wavefront( matrix M, function F(x,y,d) )
returns matrix M such that
M[i,j] = F( M[i-1,j], M[i,j-1], M[i-1,j-1] )

[Diagram: the boundary cells M[0,0] through M[4,0] and M[0,1] through M[0,4] are given; each remaining cell, up to M[4,4], is computed by F from its three neighbors x, y, and d.]
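The dependency structure of Wavefront means that all cells on the same anti-diagonal (the same value of i + j) are independent of each other. The sketch below makes that explicit. It is a sequential illustration under the assumption that row 0 and column 0 hold the given boundary values; `diagonals` is a hypothetical helper name, not part of the system described in the talk.

```python
from typing import Callable, Iterator, List, Tuple


def diagonals(n: int) -> Iterator[List[Tuple[int, int]]]:
    """Yield the interior cells (1 <= i, j < n) grouped by anti-diagonal i + j."""
    for k in range(2, 2 * n - 1):
        yield [(i, k - i) for i in range(max(1, k - n + 1), min(k, n))]


def wavefront(m: List[list], f: Callable) -> List[list]:
    """Fill m in place: m[i][j] = f(m[i-1][j], m[i][j-1], m[i-1][j-1])."""
    for cells in diagonals(len(m)):
        # Every cell in `cells` depends only on earlier diagonals,
        # so a parallel engine could dispatch them all at once.
        for i, j in cells:
            m[i][j] = f(m[i - 1][j], m[i][j - 1], m[i - 1][j - 1])
    return m
```

The available parallelism grows from one cell to n - 1 cells and then shrinks again, which is where the wavefront name comes from.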

Applications of Wavefront

Bioinformatics:
– Compute the alignment of two large DNA strings in order to find similarities between species. Existing tools do not scale up to complete DNA strings.

Economics:
– Simulate the interaction between two competing firms, each of which has an effect on resource consumption and market price. E.g. When will we run out of oil?

Applies to any kind of optimization problem solvable with dynamic programming.

Problem: Dispatch Latency

Even with an infinite number of CPUs, dispatch latency controls the total execution time: O(n) in the best case.

However, job dispatch latency in an unloaded grid is about 30 seconds, which may outweigh the runtime of F.

Things get worse when queues are long!

Solution: Build a lightweight task dispatch system. (Idea from Falkon@UC)
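The O(n) bound follows from the critical path: an n x n block of computed cells has 2n - 1 anti-diagonals, and they must run one after another. The sketch below estimates that lower bound under the simplifying assumption that every diagonal pays the full dispatch latency once; the function name is hypothetical, and the 500 x 500 size is used only as an example.

```python
def critical_path_seconds(n: int, dispatch_latency: float, f_runtime: float) -> float:
    """Lower bound on wall-clock time for an n x n wavefront with unlimited CPUs."""
    steps = 2 * n - 1  # number of anti-diagonals, which run strictly in sequence
    return steps * (dispatch_latency + f_runtime)


# Latency alone, at 30 s per dispatch, for a 500 x 500 problem:
latency_only = critical_path_seconds(500, 30.0, 0.0)
print(latency_only / 3600)  # about 8.3 hours before F does any work
```

Cutting the per-task latency is therefore worth more than adding CPUs, which motivates the lightweight dispatch system.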

[Diagram: the wavefront engine queues tasks to a work queue and receives finished tasks back; the work queue drives 1000s of workers dispatched via Condor/SGE/SSH. For each task, F reads in.txt and writes out.txt:]

put F.exe
put in.txt
exec F.exe <in.txt >out.txt
get out.txt
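The master and worker pattern in the diagram can be imitated on a single machine. The toy sketch below uses Python threads in place of remote workers and an in-memory queue in place of the network protocol, so it shows only the control flow (queue tasks, workers pull and execute, master collects results). It is not the API of the system shown in the slides, and `run_tasks` is a hypothetical name.

```python
import queue
import threading
from typing import Callable, List, Sequence


def run_tasks(func: Callable, inputs: Sequence, n_workers: int = 4) -> List:
    """Run func on every input using a shared task queue and a pool of workers."""
    tasks: queue.Queue = queue.Queue()
    done: queue.Queue = queue.Queue()
    for idx, item in enumerate(inputs):
        tasks.put((idx, item))              # "queue tasks"

    def worker() -> None:
        while True:
            try:
                idx, item = tasks.get_nowait()
            except queue.Empty:
                return                      # no work left, worker exits
            done.put((idx, func(item)))     # "tasks done"

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    results: List = [None] * len(inputs)
    while not done.empty():
        idx, value = done.get()
        results[idx] = value                # restore submission order
    return results
```

Because workers pull tasks as they become free, a slow worker simply completes fewer tasks instead of stalling a fixed share of the work.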

Problem: Performance Variation

Tasks can be delayed for many reasons:
– Heterogeneous hardware.
– Interference with disk/network.
– Policy based suspension.

Any delayed task in Wavefront has a cascading effect on the rest of the workload.

Solution - Fast Abort: Keep statistics on task runtimes, and abort those that lie significantly outside the mean. Prefer to assign jobs to machines with a fast history.
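A minimal sketch of the Fast Abort bookkeeping might look like the following. The slide does not say what "significantly outside the mean" means, so the cutoff used here (a fixed multiple of the mean, applied only after a few samples) is an assumption, as are the class and function names.

```python
import statistics
from typing import Dict, List, Optional


class FastAbort:
    """Tracks completed task runtimes and flags tasks that run far past the mean."""

    def __init__(self, multiplier: float = 3.0, min_samples: int = 5) -> None:
        self.multiplier = multiplier      # hypothetical cutoff: multiplier x mean
        self.min_samples = min_samples    # wait for some history before aborting
        self.runtimes: List[float] = []

    def record(self, runtime: float) -> None:
        """Store the runtime of a task that finished normally."""
        self.runtimes.append(runtime)

    def threshold(self) -> Optional[float]:
        """Current abort threshold in seconds, or None if history is too short."""
        if len(self.runtimes) < self.min_samples:
            return None
        return self.multiplier * statistics.mean(self.runtimes)

    def should_abort(self, elapsed: float) -> bool:
        limit = self.threshold()
        return limit is not None and elapsed > limit


def rank_hosts(history: Dict[str, List[float]]) -> List[str]:
    """Order hosts from fastest to slowest mean runtime, for task assignment."""
    return sorted(history, key=lambda host: statistics.mean(history[host]))
```

An aborted task would be returned to the queue and retried elsewhere, ideally on a host near the front of `rank_hosts`.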

500x500 Wavefront on ~200 CPUs

Wavefront on a 200-CPU Cluster

Wavefront on a 32-Core CPU

Performance Prediction is Possible

Often, users have no idea whether a task will take one day or one year -> better to find out at the beginning!

Allows the system to choose automatically whether to run locally or on the campus grid.

Of course, performance prediction is technically and philosophically dangerous: we simply argue that abstractions are more predictable than general programs.

Overview

Challenges in Using Campus Grids

Solution: Abstractions

Examples and Applications
– All-Pairs: Biometrics, Data Mining
– Wavefront: Genomics and Economics
– Some-Pairs: Genomics

Abstractions, Workflows, and Languages

The Genome Assembly Problem

AGTCGATCGATCGATAATCGATCCTAGCTAGCTACGA

Chemical Sequencing

AGTCGATCGATCGAT
AGCTAGCTACGA
TCGATAATCGATCCTAGCTA

Millions of “reads”, 100s of bytes long.

Computational Assembly

AGTCGATCGATCGAT
AGCTAGCTACGA
TCGATAATCGATCCTAGCTA

Sample Genomes

                       Reads   Data    Pairs   Sequential Time
A. gambiae scaffold    101K    80MB    738K    12 hours
A. gambiae complete    180K    1.4GB   12M     6 days
S. bicolor simulated   7.9M    5.7GB   84M     30 days

Genome Assembly Today

Several commercial firms provide an assembly service that takes weeks on a dedicated cluster, costs O($10K), and is based on human genome heuristics.

Genome researchers would like to be able to perform custom assemblies using their own data and heuristics.

Can this be done on a campus grid?

Some-Pairs Abstraction

SomePairs( set A, list (i,j), function F(x,y) )
returns
list of F( A[i], A[j] )

[Diagram: F is applied only to the listed pairs of A1, A2, A3: (1,2) (2,1) (2,3) (3,3).]

Distributed Genome Assembly

[Diagram: the somepairs master takes the set A1 ... An, the function F, and the pair list (1,2) (2,1) (2,3) (3,3), queues tasks to a work queue, and receives finished tasks back. The work queue drives 100s of workers dispatched to Notre Dame, Purdue, and Wisconsin.]

Detail of a single worker:

put align.exe
put in.txt
exec F.exe <in.txt >out.txt
get out.txt

Small Genome (101K reads)

Medium Genome (180K reads)

Large Genome (7.9M reads)

From Workstation to Grid

What’s the Upshot?

We can do full-scale assemblies as a routine matter on existing conventional machines.

Our solution is faster (wall-clock time) than the next fastest assembler run on 1024x BG/L.

You could almost certainly do better with a dedicated cluster and a fast interconnect, but such systems are not universally available.

Our solution opens up research in assembly to labs with “NASCAR” instead of “Formula-One” hardware.

Overview

Challenges in Using Campus Grids

Solution: Abstractions

Examples and Applications
– All-Pairs: Biometrics, Data Mining
– Wavefront: Genomics and Economics
– Some-Pairs: Genomics

Abstractions, Workflows, and Languages

Other Abstractions for Computing

– Directed Graph
– Bag of Tasks
– Map-Reduce

[Diagrams of each pattern; the Map-Reduce diagram labels its stages M and R.]

Partial Lattice of Abstractions

[Diagram: a partial lattice containing Directed Graph, Bag of Tasks, Wavefront, Map Reduce, Some Pairs, All-Pairs, and Lambda Calculus, annotated with the labels Robust Performance and Expressive Power.]

Two Abstractions Compared

AllPairs( set A, set B, F(x,y) )
SomePairs( set S, list L, F(x,y) )

Assuming that A = B = S…

Can you express AllPairs using SomePairs?
– Yes, but you must enumerate all pairs explicitly. It is not trivial for SomePairs to minimize the amount of data transferred to each node.

Can you express SomePairs using AllPairs?
– Yes, but only by doing excessive amounts of work, and then winnowing the results.
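Both directions of the comparison can be written down directly. This is a sequential sketch under the stated assumption A = B = S, with 0-based indices and hypothetical function names; it shows the costs the slide mentions (an explicit list of n x n pairs in one direction, n x n calls to F followed by filtering in the other), not the distributed behavior.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple


def all_pairs(a: Sequence, b: Sequence, f: Callable) -> List[list]:
    return [[f(x, y) for y in b] for x in a]


def some_pairs(s: Sequence, pairs: Sequence[Tuple[int, int]], f: Callable) -> list:
    return [f(s[i], s[j]) for i, j in pairs]


def all_pairs_via_some_pairs(s: Sequence, f: Callable) -> List[list]:
    """AllPairs(S, S, F) built on SomePairs: every pair is enumerated explicitly."""
    n = len(s)
    pairs = list(product(range(n), repeat=2))      # n * n index pairs
    flat = some_pairs(s, pairs, f)
    return [flat[i * n:(i + 1) * n] for i in range(n)]


def some_pairs_via_all_pairs(s: Sequence, pairs: Sequence[Tuple[int, int]], f: Callable) -> list:
    """SomePairs built on AllPairs: compute everything, then winnow."""
    m = all_pairs(s, s, f)                         # n * n calls to F, most discarded
    return [m[i][j] for i, j in pairs]
```

Both rewrites return the same values as the direct versions; they differ only in how much work and data movement they imply.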

Abstractions as a Social Tool

Collaboration with outside groups is how we encounter the most interesting, challenging, and important problems in computer science.

However, often neither side understands which details are essential or non-essential:
– Can you deal with files that have upper case letters?
– Oh, by the way, we have 10TB of input, is that ok?
– (A little bit of an exaggeration.)

An abstraction is an excellent chalkboard tool:
– Accessible to anyone with a little bit of mathematics.
– Makes it easy to see what must be plugged in.
– Forces out essential details: data size, execution time.

Conclusion

Grids, clouds, and clusters provide enormous computing power, but are very challenging to use effectively.

An abstraction provides a robust, scalable solution to a narrow category of problems; each requires different kinds of optimizations.

Limiting expressive power results in systems that are usable, predictable, and reliable.

Is there a menu of abstractions that would satisfy many consumers of grid computing?

Acknowledgments

Cooperative Computing Lab
– http://www.cse.nd.edu/~ccl

Grad Students
– Chris Moretti
– Hoang Bui
– Li Yu
– Mike Olson
– Michael Albrecht

Faculty:
– Patrick Flynn
– Nitesh Chawla
– Kenneth Judd
– Scott Emrich

NSF Grants CCF-0621434, CNS-0643229

Undergrads
– Mike Kelly
– Rory Carmichael
– Mark Pasquier
– Christopher Lyon
– Jared Bulosan