High Level Abstractions for Data-Intensive Computing

Christopher Moretti, Hoang Bui, Brandon Rich, and Douglas Thain

University of Notre Dame

Christopher Moretti – University of Notre Dame, 4/30/2008


Computing’s central challenge, “How not to make a mess of it,” has not yet been met.

- Edsger Dijkstra


Overview

Many systems today give end users access to hundreds or thousands of CPUs. But it is far too easy for the naive user to create a big mess in the process.

Our Solution: Deploy high-level abstractions that describe both data and computation needs. Some examples of current work:

All-Pairs: An abstraction for biometric workloads.
Distributed Ensemble Classification
DataLab: A system and language for data-parallel computation.


Distributed Computing is Hard!

How do I fit my workload into jobs?
Which resources?
How many?
What is Condor?
What about job input data?
What happens when things fail?
How can I measure job stats?
How long will it take?
What do I do with the results?



ARGH!


The All-Pairs Problem

All-Pairs(Set S1, Set S2, Function F) yields a matrix M:

M[i,j] = F(S1[i], S2[j])

Example workload:
60K images of 20KB each: >1GB of input
60K x 60K = 3.6B comparisons
3.6B comparisons @ 50/s = 2.3 CPU-years
3.6B results x 8B of output each = 29GB

[Figure: matrix M with rows S1 1 through S1 7 and columns S2 1 through S2 7.]
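The definition and the cost estimate above can be sketched in a few lines of Python. This is only an illustration, assuming the sets fit in memory as lists and F is an ordinary two-argument function; the names `all_pairs` and `estimate_cost` are hypothetical and not part of the authors' system.

```python
from typing import Callable, List, Sequence, TypeVar

A = TypeVar("A")
B = TypeVar("B")
R = TypeVar("R")


def all_pairs(s1: Sequence[A], s2: Sequence[B], f: Callable[[A, B], R]) -> List[List[R]]:
    """Return the matrix M where M[i][j] = f(s1[i], s2[j])."""
    return [[f(a, b) for b in s2] for a in s1]


def estimate_cost(n_items: int, item_bytes: int, rate_per_s: float, result_bytes: int) -> dict:
    """Back-of-the-envelope cost of comparing a set of n_items with itself."""
    comparisons = n_items * n_items
    seconds = comparisons / rate_per_s
    return {
        "input_gb": n_items * item_bytes / 1e9,
        "comparisons": comparisons,
        "cpu_years": seconds / (365 * 24 * 3600),
        "output_gb": comparisons * result_bytes / 1e9,
    }


# Toy function standing in for a biometric comparison (hypothetical).
m = all_pairs([1, 2, 3], [10, 20], lambda a, b: a * b)

# The workload from the slide: 60K images of 20KB, 50 comparisons/s, 8-byte results.
cost = estimate_cost(60_000, 20_000, 50, 8)
print(m)
print(cost)
```

The sequential version is trivial; the difficulty the talk addresses is running the same computation on hundreds of CPUs without overloading shared resources.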


Biometric All-Pairs Comparison

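[Figure: F is applied to each pair of images, filling an upper-triangular matrix of scores with 1 on the diagonal. Rows as extracted:]

1 .8 .1 0 0 .1
1 0 .1 .1 0
1 0 .1 .7
1 0
1 .1
1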


Naïve Mistakes

Computing Problem: Even expert users don’t know how to tune jobs optimally, and can make 100 CPUs even slower than one by overloading the file server, network, or resource manager.

[Figure: a batch system runs jobs on four CPUs that all read from a single file server. Each CPU reads 10TB!]

The naive workload:

For all $X:
    For all $Y:
        cmp $X to $Y
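To see why the nested loop above stresses a shared file server, the following sketch counts bytes fetched under two simplified policies. Both policies, the function names, and the workload numbers are assumptions made for illustration (in particular, the naive case assumes no caching at all); they are not measurements from the talk.

```python
def naive_bytes_read(n_items: int, item_bytes: int) -> int:
    """Every comparison fetches both of its operands from the file server."""
    return n_items * n_items * 2 * item_bytes


def staged_bytes_read(n_items: int, item_bytes: int, n_nodes: int) -> int:
    """Each node copies the whole set to local disk once, then compares locally."""
    return n_nodes * n_items * item_bytes


# Hypothetical workload: 1,000 files of 20KB compared on 100 nodes.
n, size, nodes = 1_000, 20_000, 100
naive = naive_bytes_read(n, size)
staged = staged_bytes_read(n, size, nodes)
ratio = naive / staged
print(naive, staged, ratio)
```

Under these assumptions the naive traffic grows with the square of the set size, while staging grows only linearly in the set size and the node count.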


Consequences of Naïve Mistakes


All Pairs Abstraction

Inputs: set S of files, binary function F

Invocation: M = AllPairs(F,S)


All-Pairs Production System at Notre Dame

[Figure: a web portal holding functions F, G, H and sets S, T; the All-Pairs Engine; 300 active storage units; 500 CPUs, 40TB disk.]

1 - Upload F and S into web portal.
2 - AllPairs(F,S)
3 - O(log n) distribution by spanning tree.
4 - Choose optimal partitioning and submit batch jobs.
5 - Collect and assemble results.
6 - Return result matrix to user.
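One way to get O(log n) distribution by spanning tree is to let every node that already holds the data copy it to one new node per round, so the number of holders doubles each round. The sketch below plans such a schedule; it is an assumption about how the step could work, not the engine's actual algorithm, and `spanning_tree_schedule` is a hypothetical name.

```python
from typing import List, Tuple


def spanning_tree_schedule(n_nodes: int) -> List[List[Tuple[int, int]]]:
    """Plan copies so the number of nodes holding the data doubles each round.

    Node 0 starts with the data. Returns one list of (source, target)
    pairs per round; transfers within a round can run in parallel.
    """
    rounds = []
    have = 1  # nodes 0 .. have-1 currently hold the data
    while have < n_nodes:
        transfers = [(src, src + have) for src in range(have) if src + have < n_nodes]
        rounds.append(transfers)
        have += len(transfers)
    return rounds


# 300 nodes, matching the number of storage units on the slide.
schedule = spanning_tree_schedule(300)
print(len(schedule), "rounds")
```

With 300 nodes the plan needs 9 rounds, compared with 299 sequential copies if a single source served every node.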





Returning the Result Matrix

[Figure: candidate layouts for storing the result matrix, with sample values such as 4.37, 6.01, 2.22, 7.13, 8.94, 6.72, 1.34, ..., 0.98.]

Too many files. Hard to do prefetching.
Too large files. Must scan entire file.
Row/Column ordered. How can we build it?


Result Storage by Abstraction

Chirp_array allows users to create, manage, and modify large arrays without having to know the underlying storage layout.

Operations on chirp_array:
create a chirp_array
open a chirp_array
set value A[i,j]
get value A[i,j]
get row A[i]
get column A[j]
set row A[i]
set column A[j]
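The operations listed for chirp_array can be illustrated with a small in-memory stand-in. This is not the chirp_array API: the class `ArraySketch`, its method names, and the row-major layout are assumptions chosen to show how an interface can hide the underlying form from the caller.

```python
from array import array
from typing import List, Sequence


class ArraySketch:
    """In-memory stand-in for a 2-D array of doubles with a hidden layout."""

    def __init__(self, rows: int, cols: int) -> None:
        self.rows, self.cols = rows, cols
        # Row-major storage; callers never see this detail.
        self._data = array("d", [0.0]) * (rows * cols)

    def _offset(self, i: int, j: int) -> int:
        if not (0 <= i < self.rows and 0 <= j < self.cols):
            raise IndexError(f"index ({i}, {j}) out of range")
        return i * self.cols + j

    def set(self, i: int, j: int, value: float) -> None:
        self._data[self._offset(i, j)] = value

    def get(self, i: int, j: int) -> float:
        return self._data[self._offset(i, j)]

    def get_row(self, i: int) -> List[float]:
        start = self._offset(i, 0)
        return list(self._data[start:start + self.cols])

    def get_col(self, j: int) -> List[float]:
        return [self.get(i, j) for i in range(self.rows)]

    def set_row(self, i: int, values: Sequence[float]) -> None:
        if len(values) != self.cols:
            raise ValueError("row has the wrong length")
        start = self._offset(i, 0)
        self._data[start:start + self.cols] = array("d", values)

    def set_col(self, j: int, values: Sequence[float]) -> None:
        if len(values) != self.rows:
            raise ValueError("column has the wrong length")
        for i, value in enumerate(values):
            self.set(i, j, value)


a = ArraySketch(2, 3)
a.set(0, 1, 4.37)
a.set_row(1, [7.13, 8.94, 6.72])
a.set_col(0, [1.0, 2.0])
print(a.get_row(1), a.get_col(1))
```

Because callers only use indices, the storage behind the interface could be swapped for files spread over many disks without changing user code.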

[Figure: results stored across several nodes, each with a CPU and a disk; two nodes are marked with an X.]

Result Storage with chirp_array

[Figure: a chirp_array_get(i,j) request served by an array spread across several nodes, each with a CPU and a disk.]



Data Mining on Large Data Sets

Problem: Supercomputers are expensive, and not all scientists have access to them for very large memory problems. Classification on large data sets without sufficient memory can degrade throughput, degrade accuracy, or fail outright.


Data Mining Using Ensembles

training data -> partitioning/sampling (optional) -> algorithm 1 ... algorithm n -> classifier 1 ... classifier n
test instance -> classifier 1 ... classifier n -> voting -> classification

(From Steinhaeuser and Chawla, 2007)
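The ensemble pipeline can be sketched end to end on toy data. Everything here is illustrative: the one-feature threshold learner, the labels "lo" and "hi", and the helper names are assumptions, and the sketch trains the same algorithm on each partition rather than n different algorithms.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, str]          # (feature value, label)
Classifier = Callable[[float], str]


def partition(data: Sequence[Example], n: int) -> List[List[Example]]:
    """Split the training data into n interleaved partitions."""
    return [list(data[k::n]) for k in range(n)]


def train_threshold(part: Sequence[Example]) -> Classifier:
    """Toy learner: threshold halfway between the two class means."""
    lo = [x for x, label in part if label == "lo"]
    hi = [x for x, label in part if label == "hi"]
    threshold = (sum(lo) / len(lo) + sum(hi) / len(hi)) / 2
    return lambda x: "hi" if x >= threshold else "lo"


def classify(classifiers: Sequence[Classifier], x: float) -> str:
    """Each classifier casts one vote; the majority label wins."""
    votes = Counter(c(x) for c in classifiers)
    return votes.most_common(1)[0][0]


training = [(1, "lo"), (2, "lo"), (3, "lo"), (8, "hi"), (9, "hi"), (10, "hi")]
ensemble = [train_threshold(p) for p in partition(training, 3)]
print(classify(ensemble, 5.0), classify(ensemble, 6.0))
```

Each classifier sees only its own partition, so no single machine needs to hold the full training set in memory.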




Abstraction for Ensembles Using Natural Parallelism

User: Here are my algorithms. Here is my data set. Here is my test set.
Abstraction Engine: Choose optimal partitioning and submit batch jobs.
CPUs: Return local votes for tabulation and final prediction.


DataLab Abstractions

[Figure: three abstractions built on chirp servers, each of which exports a unix file system.]

file system: tcsh, emacs, and perl reach the chirp servers through parrot.
distributed data structures: set S (elements A, B, C) and file F, stored on chirp servers.
function evaluation: Y = F(X), controlled with job_start, job_commit, job_wait, job_remove.


DataLab Language Syntax

apply F on S into T

[Figure: set S has elements A, B, C spread across chirp servers. F runs once per element and the outputs form the elements A, B, C of set T.]
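The meaning of "apply F on S into T" can be approximated in Python as a parallel map over a named set. This is a sketch of the semantics only, not DataLab syntax or its implementation: the function `apply`, the use of a dict for a set, and the thread pool standing in for chirp servers are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


def apply(f: Callable[[X], Y], s: Dict[str, X], max_workers: int = 4) -> Dict[str, Y]:
    """Run f on every element of s and return a new set with the same element names."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so names and results stay aligned.
        results = pool.map(f, s.values())
        return dict(zip(s.keys(), results))


# Set S with elements A, B, C; F here is simply len (hypothetical choice).
S = {"A": "acgt", "B": "gg", "C": ""}
T = apply(len, S)
print(T)
```

In the figure each evaluation of F runs next to the data it reads; the thread pool here only mimics the one-task-per-element structure.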


For More Information

Christopher Moretti [email protected]

Douglas Thain [email protected]

Cooperative Computing Lab http://cse.nd.edu/~ccl
