The Future of Data Management
or
The Structure of (Computer) Scientific Revolutions

EECS BEARS Conference, February 2007

Michael Franklin
UC Berkeley & Amalgamated Insight, Inc.

The Structure Spectrum

• Structured (schema-first): Relational Database, Formatted Messages
• Semi-Structured (schema-later): XML, Tagged Text/Media
• Unstructured (schema-never): Plain Text, Media

Structured Data Management

A “Modern” View of Data Management

Whither Structured Data?

• Conventional wisdom: only 20% of data is structured.
• Decreasing due to:
  • Consumer applications
  • Enterprise search
  • Media applications

Structured Data Management

Two reasons why this is where the future is:

• The Data Integration quagmire: the perennial IT problem. Structure provides crucial cues.
• The “Data Industrial Revolution*”: data used to be hand-crafted; now it’s machine-generated!

* Credit to Prof. Joe Hellerstein for this analogy.

Reason 1: Data Integration

• The ultimate schema-first problem.
• In the future, required for all applications.
• Structure is both an enabler and a key impediment.

[Figure labels: Mediated Schema; Semantic mappings; wrapper (×5). Courtesy of Alon Halevy.]

Why Structure?

What if you wanted to find out which actors donated to John Kerry’s 2004 presidential campaign…

• Text “Search” can return only what’s been previously “stored”.

What if you wanted to…

• find out the average donation of actors to each candidate?

• compare actor donations this campaign to the last one?

• find out who gave the most to each candidate?

• organize the information by source or age?

A “Deep-Web” Query Approach

    SELECT y.name, f.occupation, …
    FROM Yahoo_Actors y, FECInfo f
    WHERE y.name = f.name
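As a rough illustration of what this join buys over keyword search, the sketch below loads two tiny in-memory tables with Python's built-in sqlite3 module and runs the slide's join plus one of the aggregate questions from the previous slide (average donation per candidate). The column names `candidate` and `amount`, the helper function names, and every row of data are assumptions made up for the example; they are not taken from the real Yahoo or FEC sources.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Yahoo_Actors (name TEXT);
        CREATE TABLE FECInfo (name TEXT, occupation TEXT,
                              candidate TEXT, amount REAL);
    """)
    conn.executemany("INSERT INTO Yahoo_Actors VALUES (?)",
                     [("Actor A",), ("Actor B",)])
    conn.executemany("INSERT INTO FECInfo VALUES (?, ?, ?, ?)", [
        ("Actor A", "Actor", "Candidate X", 1000.0),
        ("Actor A", "Actor", "Candidate X", 500.0),
        ("Actor B", "Actor", "Candidate Y", 2000.0),
        ("Donor C", "Engineer", "Candidate X", 250.0),  # not an actor
    ])
    return conn


def actor_donations(conn):
    """Join the two sources on name, mirroring the slide's query."""
    return conn.execute("""
        SELECT y.name, f.occupation, f.candidate, f.amount
        FROM Yahoo_Actors y, FECInfo f
        WHERE y.name = f.name
        ORDER BY y.name, f.amount
    """).fetchall()


def average_by_candidate(conn):
    """Average actor donation per candidate (hypothetical columns)."""
    return conn.execute("""
        SELECT f.candidate, AVG(f.amount)
        FROM Yahoo_Actors y JOIN FECInfo f ON y.name = f.name
        GROUP BY f.candidate
        ORDER BY f.candidate
    """).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for row in actor_donations(conn):
        print(row)
    print(average_by_candidate(conn))
```

The join key here is a plain name string, so the sketch only works because the toy names match exactly; real sources give no such guarantee.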

Did it Work?

What’s Missing?

• Common Schema
• Any Schema
• Strong Identifiers (keys)
• Data Independence
• Metadata
• Consistency Guarantees
• Access Control

The Fundamental Tradeoff

[Figure axes: Functionality; Time (and cost). Curves: Structured (schema-first), Semi-Structured (schema-later), Unstructured (schema-less).]

Structure enables computers to help users manipulate and maintain the data.

“Flexible” Structure: Dataspaces*

• Deal with all the data from an enterprise, in whatever form.
• Data co-existence: no integrated schema, no single warehouse.
• Pay-as-you-go services:
  • Keyword search is the bare minimum.
  • Data manipulation and increased consistency as you add work.

* “From Databases to Dataspaces: A New Abstraction for Information Management”, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.

Databases vs. Dataspaces

Databases:
• Single Schema
• Centralized Administration
• Structured Query
• Strict Integrity Constraints

Dataspaces:
• Data Coexistence
• Autonomous Sources
• Search, Browse, Approximate Answer, Structured Query
• Best Effort Guarantees

The World of Dataspaces

[Figure: systems placed by Semantic Integration (High to Low) and Administrative Proximity (Near to Far): Desktop Search, Web Search, Virtual Organization, Federated DBMS, DBMS.]

DataSpace Technology

• Probabilistic Databases
• Schema Matching
• Judicious use of User Input
• Approx. Query Answering
• Probabilistic Reasoning
• Uncertainty Management
• Data Model Learning
• Structured & Unstructured Search

Reason 2: Data Industrial Revolution

Bell’s Law: Every decade, a new, lower-cost class of computers emerges, defined by platform, interface, and interconnect.

• Mainframes: 1960s
• Minicomputers: 1970s
• Microcomputers/PCs: 1980s
• Web-based computing: 1990s
• Devices (cell phones, PDAs, wireless sensors, RFID): 2000s

Enabling a new generation of applications for operational visibility, monitoring, and alerting.

Data Streams Data Flood

[Figure labels (data sources): Clickstream, Barcodes, PoS System, Sensors, RFID, Telematics, Inventory, Phones, Transactional Systems.]

• Exponential data growth
• New challenges: continuous, inter-connected, distributed, physical
• Shrinking business cycles
• More complex decisions

Device Data Management

• Devices generate streams of structured data.

• Widespread deployment will lead to huge data volumes.

• Can we develop the right infrastructure to support large-scale data streaming apps?

• Can we incorporate devices into existing (legacy) IT infrastructure?

High Fan In Systems*

• A data management infrastructure for large-scale data streaming environments.
• Uniform Declarative Framework
  • Every node is a SQL data stream processor; stream-oriented queries at all levels.
  • Hierarchical, stream-based views as an organizing principle.
  • Can impose a “view” over messy devices.

* “Design Considerations for High Fan In Systems - The HiFi Approach”, CIDR 2005.

HiFi - Taming the Data Flood

[Figure: hierarchy levels, from the edge upward: Receptors; Dock doors, Shelves; Warehouses, Stores; Regional Centers; Headquarters.]

Hierarchical Aggregation
• Spatial
• Temporal

In-network Stream Query Processing and Storage

Fast Data Path vs. Slow Data Path
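The spatial side of hierarchical aggregation can be sketched in a few lines of Python. This is only an illustration of the roll-up idea, not the HiFi implementation: in HiFi each level runs its own stream queries, whereas this sketch computes every level in one place. The node names, the `PARENT` map, and the `roll_up` function are all hypothetical.

```python
from collections import Counter

# Hypothetical hierarchy: each node maps to its parent, loosely following
# the levels on the slide (shelf/dock door -> store -> region -> HQ).
PARENT = {
    "shelf-1": "store-A",
    "dock-1": "store-A",
    "shelf-2": "store-B",
    "store-A": "region-West",
    "store-B": "region-West",
    "region-West": "hq",
}


def roll_up(leaf_counts, parent=PARENT):
    """Propagate per-device tag counts to every ancestor in the hierarchy."""
    totals = Counter(leaf_counts)
    for node, n in leaf_counts.items():
        ancestor = parent.get(node)
        while ancestor is not None:
            totals[ancestor] += n
            ancestor = parent.get(ancestor)
    return dict(totals)


if __name__ == "__main__":
    # Made-up counts of tags seen at each edge node in one time window.
    print(roll_up({"shelf-1": 3, "dock-1": 2, "shelf-2": 5}))
```

Temporal aggregation would follow the same pattern, with higher levels summarizing over longer windows.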

“Virtual Device (VICE) API”

The VICE API is a natural place to hide much of the complexity arising from physical devices.

VICE: Virtual Device Interface [Jeffery et al., Pervasive 2006, VLDBJ 07]

Device Issues: example

Shelf RFID Test - Ground Truth

Actual RFID Readings

“Restock every time inventory goes below 5”

Query-based Data Cleaning

Stages shown: Point, Smooth

    CREATE VIEW smoothed_rfid_stream AS
      (SELECT receptor_id, tag_id
       FROM cleaned_rfid_stream [range by '5 sec', slide by '5 sec']
       GROUP BY receptor_id, tag_id
       HAVING count(*) >= count_T)
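To make the Smooth step concrete, here is a minimal Python sketch of the same idea: group readings into non-overlapping 5-second windows (range and slide are both 5 seconds in the view) and keep a (receptor_id, tag_id) pair only if it was read at least `count_t` times in that window. The input tuple layout, the function name, and the default threshold of 2 are assumptions for illustration; the slide leaves `count_T` unspecified.

```python
from collections import Counter


def smooth(readings, window_sec=5, count_t=2):
    """Filter sporadic reads using tumbling windows.

    readings: iterable of (timestamp_sec, receptor_id, tag_id).
    Returns {window_start: sorted [(receptor_id, tag_id), ...]} containing
    only the pairs seen at least count_t times in that window.
    """
    counts = Counter()
    for ts, receptor_id, tag_id in readings:
        window_start = int(ts // window_sec) * window_sec
        counts[(window_start, receptor_id, tag_id)] += 1

    result = {}
    for (window_start, receptor_id, tag_id), n in counts.items():
        if n >= count_t:
            result.setdefault(window_start, []).append((receptor_id, tag_id))
    return {w: sorted(pairs) for w, pairs in sorted(result.items())}


if __name__ == "__main__":
    # Made-up readings: t1 is read twice in the first window, t2 only once.
    demo = [(0.5, "r1", "t1"), (1.0, "r1", "t1"), (2.0, "r1", "t2"),
            (6.0, "r1", "t1"), (7.5, "r1", "t2"), (9.9, "r1", "t2")]
    print(smooth(demo))
```

A higher threshold suppresses more spurious reads at the cost of dropping tags that the reader only catches occasionally.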

Query-based Data Cleaning

Stages shown: Point, Smooth, Arbitrate

    CREATE VIEW arbitrated_rfid_stream AS
      (SELECT receptor_id, tag_id
       FROM smoothed_rfid_stream rs [range by '5 sec', slide by '5 sec']
       GROUP BY receptor_id, tag_id
       HAVING count(*) >= ALL (SELECT count(*)
                               FROM smoothed_rfid_stream [range by '5 sec', slide by '5 sec']
                               WHERE tag_id = rs.tag_id
                               GROUP BY receptor_id))
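The Arbitrate step resolves tags seen by more than one receptor: within a window, a tag is attributed to the receptor that read it most often. A minimal Python sketch of that rule follows, assuming the per-window counts have already been computed; the function name, the input shape, and the sample receptor names are hypothetical. As with `>= ALL` in the view, ties keep every tied receptor.

```python
from collections import defaultdict


def arbitrate(window_counts):
    """Pick the winning receptor(s) for each tag in one window.

    window_counts: {(receptor_id, tag_id): count}.
    Returns a sorted list of (receptor_id, tag_id) pairs whose count is
    >= the count of every receptor that saw the same tag.
    """
    best = defaultdict(int)
    for (receptor_id, tag_id), n in window_counts.items():
        best[tag_id] = max(best[tag_id], n)

    return sorted(
        (receptor_id, tag_id)
        for (receptor_id, tag_id), n in window_counts.items()
        if n >= best[tag_id]
    )


if __name__ == "__main__":
    # Made-up counts: t1 is mostly read by shelf0; t2 is tied.
    counts = {("shelf0", "t1"): 4, ("shelf1", "t1"): 1,
              ("shelf0", "t2"): 2, ("shelf1", "t2"): 2}
    print(arbitrate(counts))
```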

After Query-based Cleaning

“Restock every time inventory goes below 5”

SQL Abstraction Makes it Easy

• “Soft Sensors”
• Quality and lineage
• Optimization (power, etc.)
• Pushdown of external validation information
• Data archiving
• Imperative processing
• …

Amalgamated Insight: The Company

Next-Generation Business Intelligence

[Figure labels. Axes and regions: Complexity, Performance; Centralized, Distributed; Event-Driven, Query-Driven. Database/Data Warehouse Products: RDBMS, Data Warehouse, Appliance, In-Memory Accelerators. Data Analytics Products: Reporting, Analysis, Predictive Analytics, Data Mining, “Operational” BI/BAM.]

Stream Query Processing is the Key

• Visibility: “What’s happening now?”
• Notification: “Tell me when something happens.”
• Learning: “Why is it happening and how to improve it?”
• Intelligent Action: “Automatically react when things happen.”

[Other figure labels: Integrated Event Handling and Alerting; Interfaces to Operational Systems; Drill Down, Replay, Reports.]

Company Overview

• Founded November 2005
• Headquarters in Foster City, CA
• Series A Financing: May 2006
• 10 Employees (and growing!)

Technology
• Breakthrough technology for stream query processing
• Proven software base, leveraging open source platform
• Used in demanding high-volume networked applications

Key Team Members
• Boyd Pearce, President and CEO
• Michael Franklin, Ph.D., CTO
• Michael Trigg, EVP, Marketing
• Sailesh Krishnamurthy, Ph.D., Chief Architect
• Robert Krauss, VP, Business Development

Conclusions

• Structured data increasingly important.
  • In fact, there will be lots more of it.
  • And it must be processed as fast as it is created.
• Traditional (structured) database technology is not up to the task.
• Great opportunities for innovation.
  • HiFi, Dataspaces (and Amalgamated Insight!) are examples.

http://www.cs.berkeley.edu/~franklin