High Level Abstractions for Data-Intensive Computing

Christopher Moretti, Hoang Bui, Brandon Rich, and Douglas Thain

University of Notre Dame

4/30/2008



Page 2

Computing’s central challenge, “How not to make a mess of it,” has not yet been met.

- Edsger Dijkstra

Page 3

Overview

Many systems today give end users access to hundreds or thousands of CPUs.

But it is far too easy for the naive user to create a big mess in the process.

Our Solution: Deploy high-level abstractions that describe both data and computation needs.

Some examples of current work:

  • All-Pairs: An abstraction for biometric workloads.
  • Distributed Ensemble Classification
  • DataLab: A system and language for data-parallel computation.


Page 5

Distributed Computing is Hard!

  • How do I fit my workload into jobs?
  • Which resources?
  • How many?
  • What is Condor?
  • What happens when things fail?
  • What do I do with the results?
  • How can I measure job stats?
  • What about job input data?
  • How long will it take?

ARGH!

Page 6

The All-Pairs Problem

All-Pairs( Set S1, Set S2, Function F ) yields a matrix M:

M[i,j] = F(S1[i], S2[j])

Example workload:

  • 60K images of 20KB each: >1GB of input
  • 3.6B comparisons
  • at 50 comparisons/s: 2.3 CPU-years
  • at 8B of output per comparison: 29GB of results

[Figure: result matrix with S1 1 through S1 7 on one axis and S2 1 through S2 7 on the other.]
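The definition and the workload figures on this slide can be checked with a short Python sketch. The names `all_pairs` and `workload_estimate` are illustrative assumptions, not part of the authors' system, and the comparison function is a placeholder. The sketch runs serially; it only shows what the abstraction computes and how the slide's numbers follow from its inputs.

```python
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")
B = TypeVar("B")
R = TypeVar("R")


def all_pairs(s1: Sequence[A], s2: Sequence[B], f: Callable[[A, B], R]) -> list[list[R]]:
    """Return the matrix M with M[i][j] = f(s1[i], s2[j])."""
    return [[f(a, b) for b in s2] for a in s1]


def workload_estimate(n_items: int, item_bytes: int,
                      rate_per_s: float, result_bytes: int) -> dict[str, float]:
    """Back-of-the-envelope cost of comparing a set of n_items against itself."""
    comparisons = n_items * n_items
    seconds = comparisons / rate_per_s
    return {
        "input_gb": n_items * item_bytes / 1e9,
        "comparisons": comparisons,
        "cpu_years": seconds / (365 * 24 * 3600),
        "output_gb": comparisons * result_bytes / 1e9,
    }


if __name__ == "__main__":
    # Placeholder comparison function: multiply the two items.
    print(all_pairs([1, 2, 3], [10, 20], lambda a, b: a * b))
    # Slide figures: 60K images, 20KB each, 50 comparisons/s, 8B per result.
    print(workload_estimate(60_000, 20_000, 50, 8))
```

With the slide's inputs the estimate gives 1.2GB of input, 3.6 billion comparisons, about 2.28 CPU-years, and 28.8GB of output, which round to the figures shown.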

Page 7

Biometric All-Pairs Comparison

[Figure: the function F applied to every pair of items, producing an upper-triangular matrix of scores.]

1   .8  .1  0   0   .1
    1   0   .1  .1  0
        1   0   .1  .7
            1   0
                1   .1
                    1

Page 8

Naïve Mistakes

Computing Problem: Even expert users don’t know how to tune jobs optimally, and can make 100 CPUs even slower than one by overloading the file server, network, or resource manager.

For all $X :
  For all $Y :
    cmp $X to $Y

[Figure: a batch system runs the loop on four CPUs that all read from one file server. Each CPU reads 10TB!]
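To see why the naive loop overloads the file server, the sketch below compares two simplified traffic models. Both models are assumptions made for illustration: the "naive" model fetches both inputs from the file server for every comparison with no caching, and the "staged" model has each node fetch the whole set once and then work from local disk. The workload figures come from the All-Pairs slide and the node count from the "100 CPUs" remark. The sketch does not attempt to reproduce the slide's 10TB figure, whose underlying assumptions are not given.

```python
def naive_bytes_from_server(n_files: int, file_bytes: int) -> int:
    """Every comparison fetches both of its inputs from the file server."""
    comparisons = n_files * n_files
    return comparisons * 2 * file_bytes


def staged_bytes_from_server(n_files: int, file_bytes: int, n_nodes: int) -> int:
    """Each node fetches the whole set once, then compares from local disk."""
    return n_nodes * n_files * file_bytes


if __name__ == "__main__":
    n, size = 60_000, 20_000  # 60K files of 20KB (All-Pairs slide)
    nodes = 100               # assumed pool size
    naive = naive_bytes_from_server(n, size)
    staged = staged_bytes_from_server(n, size, nodes)
    print(f"naive:  {naive / 1e12:.0f} TB")   # naive:  144 TB
    print(f"staged: {staged / 1e12:.2f} TB")  # staged: 0.12 TB
```

Under these assumptions the naive loop pulls roughly a thousand times more data through the file server than staging the set once per node.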

Page 9

Consequences of Naïve Mistakes

Page 10

All Pairs Abstraction

set S of files
binary function F

invocation: M = AllPairs(F,S)

Page 11

All-Pairs Production System at Notre Dame

Web Portal; All-Pairs Engine; 300 active storage units (500 CPUs, 40TB disk)

1 - Upload F and S into web portal.
2 - AllPairs(F,S)
3 - O(log n) distribution by spanning tree.
4 - Choose optimal partitioning and submit batch jobs.
5 - Collect and assemble results.
6 - Return result matrix to user.

[Figure labels: functions F, G, H; sets S, T; copies of F on the storage units.]
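Step 3 can be illustrated with one simple spanning tree: a binomial-tree broadcast, in which every node that already holds the data copies it to one new node per round. The slide does not say which tree the production system builds, so this choice and the function name `transfer_plan` are assumptions for illustration only.

```python
def transfer_plan(n_nodes: int) -> list[list[tuple[int, int]]]:
    """Return, per round, the (sender, receiver) pairs; node 0 is the source."""
    plan = []
    have = 1  # nodes 0..have-1 hold the data; initially only the source
    while have < n_nodes:
        # Each current holder s sends to node s + have, if that node exists.
        pairs = [(s, s + have) for s in range(have) if s + have < n_nodes]
        plan.append(pairs)
        have += len(pairs)
    return plan


if __name__ == "__main__":
    for round_no, pairs in enumerate(transfer_plan(8), start=1):
        print(round_no, pairs)
    # 300 nodes, matching the number of storage units on the slide.
    print(len(transfer_plan(300)), "rounds for 300 nodes")
```

Because the number of holders doubles each round, n nodes are covered in ceil(log2 n) rounds, which is 9 rounds for 300 nodes, rather than the 299 sequential copies a single source would need.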


Page 15

Returning the Result Matrix

[Figure: sample result values (4.37, 6.01, 2.22, 7.13, 8.94, 6.72, 1.34, 0.98) shown in alternative storage layouts.]

  • Too many files. Hard to do prefetching.
  • Too large files. Must scan entire file.
  • Row/Column ordered. How can we build it?

Page 16

Result Storage by Abstraction

Chirp_array allows users to create, manage, and modify large arrays without having to know their underlying form.

Operations on chirp_array:

  • create a chirp_array
  • open a chirp_array
  • set value A[i,j]
  • get value A[i,j]
  • get row A[i]
  • get column A[j]
  • set row A[i]
  • set column A[j]

[Figure: three CPU/Disk nodes.]
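The operations listed above amount to a small two-dimensional array interface that hides its storage layout. The class below is a minimal in-memory sketch of such an interface in Python. The class name `ArraySketch` and every method name are hypothetical and are not the chirp_array API; "create" is modeled by the constructor, and "open" is not modeled because nothing is persisted.

```python
class ArraySketch:
    """In-memory stand-in for a 2-D array abstraction; not the chirp_array API."""

    def __init__(self, rows: int, cols: int, fill: float = 0.0) -> None:
        self.rows, self.cols = rows, cols
        # Row-major backing list; callers never see this layout.
        self._data = [fill] * (rows * cols)

    def _index(self, i: int, j: int) -> int:
        if not (0 <= i < self.rows and 0 <= j < self.cols):
            raise IndexError(f"({i}, {j}) is outside {self.rows}x{self.cols}")
        return i * self.cols + j

    def get(self, i: int, j: int) -> float:
        return self._data[self._index(i, j)]

    def set(self, i: int, j: int, value: float) -> None:
        self._data[self._index(i, j)] = value

    def get_row(self, i: int) -> list[float]:
        return [self.get(i, j) for j in range(self.cols)]

    def get_col(self, j: int) -> list[float]:
        return [self.get(i, j) for i in range(self.rows)]

    def set_row(self, i: int, values: list[float]) -> None:
        if len(values) != self.cols:
            raise ValueError("row length mismatch")
        for j, value in enumerate(values):
            self.set(i, j, value)

    def set_col(self, j: int, values: list[float]) -> None:
        if len(values) != self.rows:
            raise ValueError("column length mismatch")
        for i, value in enumerate(values):
            self.set(i, j, value)


if __name__ == "__main__":
    a = ArraySketch(2, 3)
    a.set_row(1, [7.13, 8.94, 6.72])
    print(a.get(1, 2), a.get_col(1))
```

Because callers only use `get`, `set`, and the row and column helpers, the backing store could be swapped for files spread over several disks without changing user code, which is the point the slide makes about not needing to know the underlying form.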

Page 17

Result Storage with chirp_array

[Figure: three CPU/Disk nodes.]

chirp_array_get(i,j)


Page 20

Data Mining on Large Data Sets

Problem: Supercomputers are expensive, and not all scientists have access to them for very large memory problems. Classification on large data sets without sufficient memory can degrade throughput, degrade accuracy, or fail outright.

Page 21

Data Mining Using Ensembles

training data
  -> partitioning/sampling (optional)
  -> algorithm 1 ... algorithm n
  -> classifier 1 ... classifier n

test instance -> classifiers -> voting -> classification

(From Steinhaeuser and Chawla, 2007)
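The flow in this diagram can be sketched in a few lines of Python: several algorithms each train a classifier, every classifier votes on the test instance, and the majority label wins. The three toy algorithms and the one-feature data format below are assumptions chosen to keep the example self-contained; they are not the algorithms used in the cited work.

```python
from collections import Counter
from typing import Callable, Sequence

Example = tuple[float, str]  # (feature value, label); a toy one-feature data set
Classifier = Callable[[float], str]
Algorithm = Callable[[Sequence[Example]], Classifier]


def nearest_mean(train: Sequence[Example]) -> Classifier:
    """Predict the label whose mean feature value is closest to the input."""
    sums: dict[str, tuple[float, int]] = {}
    for x, y in train:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    means = {y: total / count for y, (total, count) in sums.items()}
    return lambda x: min(means, key=lambda y: abs(means[y] - x))


def most_common_label(train: Sequence[Example]) -> Classifier:
    """Always predict the most frequent training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def nearest_neighbor(train: Sequence[Example]) -> Classifier:
    """Predict the label of the closest training example."""
    return lambda x: min(train, key=lambda ex: abs(ex[0] - x))[1]


def ensemble_classify(algorithms: Sequence[Algorithm],
                      train: Sequence[Example], test_instance: float) -> str:
    """Train one classifier per algorithm, then take a majority vote."""
    classifiers = [algorithm(train) for algorithm in algorithms]
    votes = Counter(clf(test_instance) for clf in classifiers)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    data = [(1.0, "a"), (1.2, "a"), (0.8, "a"), (5.0, "b"), (5.5, "b")]
    algs = [nearest_mean, most_common_label, nearest_neighbor]
    print(ensemble_classify(algs, data, 5.2))  # two of three classifiers vote "b"
```

With an odd number of classifiers and two labels there are no ties; a fuller version would need an explicit tie-breaking rule.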


Page 23

Abstraction for Ensembles Using Natural Parallelism

User to the Abstraction Engine: Here are my algorithms. Here is my data set. Here is my test set.

Abstraction Engine:

  • Choose optimal partitioning and submit batch jobs.
  • Return local votes for tabulation and final prediction.

[Figure: the Abstraction Engine in front of four CPUs, which send back Local Votes.]
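The partition, local-vote, and tabulation steps can be sketched as follows. This is a simplified, serial illustration under stated assumptions: the round-robin `partition` is a placeholder rather than an optimal partitioning, the loop stands in for submitted batch jobs, and the single toy algorithm and all function names are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence

Example = tuple[float, str]  # (feature value, label)
Classifier = Callable[[float], str]


def nearest_neighbor(train: Sequence[Example]) -> Classifier:
    """Toy algorithm: predict the label of the closest training example."""
    return lambda x: min(train, key=lambda ex: abs(ex[0] - x))[1]


def partition(data: Sequence[Example], n_parts: int) -> list[list[Example]]:
    """Round-robin split; a placeholder, not an 'optimal' partitioning."""
    return [list(data[k::n_parts]) for k in range(n_parts)]


def local_votes(algorithm, part: Sequence[Example],
                test_set: Sequence[float]) -> list[str]:
    """What one job returns: its classifier's vote for each test instance."""
    classifier = algorithm(part)
    return [classifier(x) for x in test_set]


def tabulate(all_votes: Sequence[Sequence[str]]) -> list[str]:
    """Combine the per-job votes into one prediction per test instance."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*all_votes)]


if __name__ == "__main__":
    data = [(1.0, "a"), (5.0, "b"), (1.5, "a"), (5.5, "b"), (0.5, "a"), (4.5, "b")]
    tests = [1.2, 3.1, 5.1]
    # Each loop iteration stands in for one batch job on one CPU.
    votes = [local_votes(nearest_neighbor, part, tests) for part in partition(data, 3)]
    print(votes)
    print(tabulate(votes))
```

Each job sees only its own partition, so no single machine needs enough memory for the whole data set; only the short vote lists travel back for tabulation.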

Page 24

DataLab Abstractions

[Figure: three abstractions built on chirp servers, each chirp server exporting a unix file system.]

  • file system: tcsh, emacs, perl, via parrot
  • distributed data structures: set S (A, B, C), file F
  • function evaluation: Y = F(X), with job_start, job_commit, job_wait, job_remove

Page 25

DataLab Language Syntax

apply F on S into T

[Figure: set S (A, B, C) is spread across chirp servers; F runs on each member, and the results form set T (A, B, C).]
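The meaning of `apply F on S into T` can be approximated in Python as a parallel map over the members of a named set. In this sketch a dictionary stands in for a distributed set and a thread pool stands in for the chirp servers; both are assumptions for illustration, and `apply_into` is a hypothetical name, not DataLab's implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


def apply_into(f: Callable[[X], Y], s: dict[str, X], workers: int = 4) -> dict[str, Y]:
    """Evaluate f on every member of set s and return the new set T.

    Members keep their names, so T has the same keys as S.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the same order as its inputs.
        results = list(pool.map(f, s.values()))
    return dict(zip(s.keys(), results))


if __name__ == "__main__":
    S = {"A": "alpha", "B": "beta", "C": "gamma"}
    T = apply_into(str.upper, S)
    print(T)
```

The user states only the function, the input set, and the output set; where each evaluation runs is left to the system, which is what lets it place computation next to the data.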

Page 26

For More Information

Christopher Moretti [email protected]

Douglas Thain [email protected]

Cooperative Computing Lab http://cse.nd.edu/~ccl