Distributed Tera-Mining

R. L. Grossman

Laboratory for Advanced Computing

University of Illinois &

Magnify, Inc.

Trend 1. Explosion of Data …

… All in the Wrong Format

With no one to analyze it.

The Data Gap

[Figure: The Data Gap, 1995 to 1999. Series: total new disk (TB) since 1995, and new Ph.D.s. Vertical axis from 0 to 4,000,000.]

Most data comes a GB and a TB at a time.

Trend 2. SONET is dead. Lambda Rules.

Gigabytes can be moved in seconds.
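As a rough check on this claim, the sketch below computes idealized transfer times for one gigabyte. The OC-3, OC-12 and OC-48 figures are the nominal SONET line rates (standard values, not taken from the slides); the decimal gigabyte and the absence of protocol overhead are simplifying assumptions, and the function name is illustrative.

```python
# Nominal SONET line rates in Mbit/s (standard values, not from the slides).
LINE_RATES_MBPS = {
    "OC-3": 155.52,
    "OC-12": 622.08,
    "OC-48": 2488.32,
}

GIGABYTE = 1e9  # decimal gigabyte in bytes (an assumption)


def transfer_seconds(size_bytes: float, rate_mbps: float) -> float:
    """Idealized transfer time: payload bits divided by the line rate.

    Ignores framing, protocol overhead, and end-host bottlenecks.
    """
    return size_bytes * 8 / (rate_mbps * 1e6)


for name, rate in LINE_RATES_MBPS.items():
    print(f"{name}: 1 GB in about {transfer_seconds(GIGABYTE, rate):.1f} s")
```

Under these assumptions a gigabyte takes on the order of a minute at OC-3 and a few seconds at OC-48; real transfers would be slower once overhead is included.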

Trend 3: Most Data is Distributed

Bush’s Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.
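One possible reading of the quadratic growth, offered only as an illustration and not as a derivation of the law: the number of distinct column pairs that can be compared grows as n(n-1)/2. The column names below are hypothetical.

```python
from itertools import combinations


def pairwise_comparisons(n_columns: int) -> int:
    """Number of distinct unordered column pairs among n_columns columns."""
    return n_columns * (n_columns - 1) // 2


# Hypothetical column names, used only to enumerate the pairs explicitly.
columns = ["buchanan_votes", "reform_voters", "enso_anomaly", "cholera_cases"]
pairs = list(combinations(columns, 2))

# The closed form agrees with explicit enumeration.
assert len(pairs) == pairwise_comparisons(len(columns))

for n in (2, 10, 100, 1000):
    print(f"{n} columns -> {pairwise_comparisons(n)} possible comparisons")
```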

Example 1: ENSO & Cholera

El Nino Data at NCAR

Cholera Data at WHO

Example 2: Voting

Table 1

County      BUCHANAN
ALACHUA     263
BAKER       73
BAY         248
BRADFORD    65
BREVARD     570
BROWARD     788

Table 2

County      Reform
Alachua     91
Baker       4
Bay         55
Bradford    3
Brevard     148
Broward     332

[Figure: Correlation: Reform Voters vs Votes for Buchanan. Vertical axis from 0 to 4000, horizontal axis from 0 to 450. The Palm Beach point is labeled.]
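The comparison in this example can be sketched as a join on county name followed by a correlation. The sketch uses only the six counties listed in Table 1 and Table 2, so it does not reproduce the full plot or the Palm Beach point; the helper names are illustrative, and the in-memory dictionaries stand in for the two separate databases.

```python
from math import sqrt

# Rows copied from Table 1 and Table 2 (only the six counties shown there).
buchanan_votes = {"ALACHUA": 263, "BAKER": 73, "BAY": 248,
                  "BRADFORD": 65, "BREVARD": 570, "BROWARD": 788}
reform_voters = {"Alachua": 91, "Baker": 4, "Bay": 55,
                 "Bradford": 3, "Brevard": 148, "Broward": 332}


def normalize(county: str) -> str:
    """The two tables spell county names in different cases."""
    return county.strip().upper()


def join_on_county(left: dict, right: dict) -> dict:
    """Inner join of two county-keyed columns on the normalized name."""
    right_norm = {normalize(k): v for k, v in right.items()}
    return {normalize(k): (v, right_norm[normalize(k)])
            for k, v in left.items() if normalize(k) in right_norm}


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


joined = join_on_county(buchanan_votes, reform_voters)
votes = [v for v, _ in joined.values()]
reform = [r for _, r in joined.values()]
r = pearson(reform, votes)
print(f"{len(joined)} counties joined, Pearson r = {r:.3f}")
```

Neither column says much alone; the join on a shared key is what makes the correlation, and any county that departs from it, visible.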

DataSpace – One Approach to Making Data Useful

Today’s Multi-media Web

16 terabytes of documents, 4 billion documents

• html
• http
• search by keyword
• workstations, servers

Tomorrow’s Data Web

petabytes of data, tens of billions to trillions of records

• pmml & dtml
• dstp
• correlate & mine
• data & compute clusters

Complementary to the grid, which we view as a distributed computer.

[Diagram: DSTP Server 1 and DSTP Server 2. Labels: attributes [aid]; UCK [uckid]; k[i], x[i]; k[i], y[j].]

Terra Mining Testbed

Optical testbed for distributed tera mining of scientific data.

Goal also to be a testbed for broadband-based business services.

Lessons Learned

1. It’s the data, stupid. Cycles, cylinders & lambdas are all commodities.

2. The fundamental challenge: lower the cost to make data useful.

3. The emergence of internet infrastructure for data is inevitable.

Opens up possibilities for new types of scientific discoveries.

For More Information

DataSpace
http://www.dataspaceweb.net
http://www.ncdm.uic.edu

DataSpace Standards
http://www.dmg.org

Selected articles
http://www.twocultures.net

Magnify – http://www.magnify.com

End of Slides

FTP Still Lives

Trend 2. Bandwidth is a Commodity

OC-3 OC-12 OC-48

[Figure: El Nino Anomalies and Indonesia Cholera Cases. Axis label: Cholera Cases.]

Distributed Exabytes (New Disks)

[Figure: New disk capacity in petabytes, 1995 to 2003. Vertical axis from 0 to 14000, with the 1 exabyte level marked.]

Source: IDC (1999) "1999 Winchester Disk Drive Market Forecast and Review"

Trend 3: Most Data is Distributed

W’s Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.

Example 2: Voting

Database 1: Total Votes for Buchanan by County

County      BUCHANAN
ALACHUA     263
BAKER       73
BAY         248
BRADFORD    65
BREVARD     570
BROWARD     788

Database 2: Total Registered Reform Voters by County

County      Reform
Alachua     91
Baker       4
Bay         55
Bradford    3
Brevard     148
Broward     332