What's Next for Database?
Jim Gray, Microsoft
http://research.microsoft.com/~Gray


Keynote ▪ 30 September 2005 ▪ 9:00

Outline

Looking at the past: old problems now look easy

Looking forward:
  data avalanche here
  integrate ALL kinds of data

Watershed: The new world
  Programs + data: Info Ecosystem
  All data classes (Objectifying Information)
  Approximate answers


Old Problems Now Look Easy

1985 goal: 1,000 transactions per second
  Couldn’t do it at the time
  At the time:
    100 transactions/second
    50 M$ for the computer (y2005 dollars)
Now: easy
  Laptop does 8,200 debit-credit tps
  ~$400 desktop

Thousands of DebitCredit Transactions-Per-Second: Easy and Inexpensive, Gray & Levine, MSR-TR-2005-39, ftp://ftp.research.microsoft.com/pub/tr/TR-2005-39.doc

Hardware & Software Progress

Throughput: 2x per 2 years, tracks MHz
  [Chart: X86 & X64 tpmC per CPU over time, 1995 to 2006, log scale from 100 to 100,000 tpmC/cpu]
  30x in 10 years; 41%/year; double every 2 years
  [Chart: X86 & X64 tpmC per MHz over time, 1995 to 2006, scale from 0 to 20]
Throughput/$: 2x per 1.5 years
  40%/y hardware, 20%/y software
  [Chart: TPC-A and TPC-C tps/$ trends, 1990 to 2004, throughput/k$ on a log scale from 0.01 to 1000]
  ~100x in 10 years; ~2x per 1.5 years
No obvious end in sight!

A Measure of Transaction Processing 20 Years Later ftp://ftp.research.microsoft.com/pub/tr/TR-2005-57.doc IEEE Data Engineering Bulletin, V. 28.2, pp. 3-4, June 2005
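The growth figures on this slide are tied together by compound-growth arithmetic. The sketch below is an illustration only; the function names are assumptions made here, not anything from the talk. It converts "N-fold in Y years" into an annual rate and a doubling time.

```python
import math

def annual_rate(factor: float, years: float) -> float:
    """Compound annual growth rate implied by a `factor`-fold gain over `years`."""
    return factor ** (1.0 / years) - 1.0

def doubling_time(rate: float) -> float:
    """Years needed to double at a given compound annual growth rate."""
    return math.log(2.0) / math.log(1.0 + rate)

# Throughput per CPU: 30x in 10 years (figure from the slide).
cpu_rate = annual_rate(30, 10)
# Throughput per dollar: ~100x in 10 years (figure from the slide).
dollar_rate = annual_rate(100, 10)

print(f"per CPU: {cpu_rate:.1%}/year, doubles every {doubling_time(cpu_rate):.1f} years")
print(f"per $:   {dollar_rate:.1%}/year, doubles every {doubling_time(dollar_rate):.1f} years")
```

Under these assumptions the per-CPU rate comes out near the slide's 41%/year with a doubling time of about 2 years, and the per-dollar doubling time comes out near 1.5 years.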

100x Improvement Every Decade

$1B job becomes $10M job
$1M job becomes 10K$ job
Terabytes common now (~500$ today)
Petabytes in a decade.

Challenge: We can capture & store everything.
  What’s interesting?
  What can you tell me about X?

Q: How much is “everything”?
A: About 15 exabytes

Q: How much is digital?
A: 70% and growing

Q: Where does it come from?
A: Video, voice, sensors, …

Q: How fast is it growing?
A: Growing 10%/y now, 55%/y when ALL digital

Information Growth vs Storage Media

Medium      PB/y    CAG
print        0.2     2%
film         427     4%
video        300     5%
computer   1,693    55%

Source: Lyman & Varian, “How Much Information”, as of 2003, http://www.sims.berkeley.edu/research/projects/how-much-info/

Where is the Data? Smart Objects Everywhere

Phones, PDAs, Cameras,… have small DBs.
Disk drives have enough cpu, memory to run a full-blown DBMS.
All these devices want-need to share data.
Need a simple-but-complete DBMS
They need an Esperanto: a data exchange language and paradigm.

Billions of Clients
Millions of Servers

The Perfect System

Knows everything
Knows what you want to know
Tells you the answer…
  in an easy-to-understand way;
  just before you ask
Tells you what you should have asked
And…
  It is inexpensive to buy
  It is inexpensive to own.

Well, maybe not everyone wants this… but every organization does.

Oh! And the PEOPLE COSTS are HUGE!

People costs have always exceeded IT capital.
But now that hardware is “free” …
Self-managing, self-configuring, self-healing, self-organizing and … is key goal.
No DBAs for cell phones or cameras.
Requires
  Clear and simple knobs on modules
  Software manages these knobs

Our Challenge

Capture, Store, Organize, Search, Display all information.
  Personal
  Organizational
  Societal

There is a huge gap between what we have today and what we need.

Data capture is relatively easy
Curate, Organize, Search, Display still too hard.

Outline

Looking at the past: old problems now look easy

Looking forward:
  data avalanche here
  integrate ALL kinds of data

Watershed: The new world
  Programs + data: Info Ecosystem
  All data classes (Objectifying Information)
  Approximate answers

DBMS Re-conceptualization

Re-Unification of Programs & Data
Allows Objectification of Information
  eg: What is a gene? What properties & methods?
      What is a person? What properties & methods?
      What is an X? What properties & methods?
Need to “glue” all these models together
Time, Space, text,… are core types
Person, event, document, gene,.. are extensions.
The “Action” is in these extensions.

Code and Data: Separated at Birth

COBOL
  IDENTIFICATION: document
  ENVIRONMENT: OS
  DATA: Files/Records
  PROCEDURE: code

  AUTHOR, PROGRAM-ID, INSTALLATION, SOURCE-COMPUTER, OBJECT-COMPUTER, SPECIAL-NAMES, FILE-CONTROL, I-O-CONTROL, DATE-WRITTEN, DATE-COMPILED, SECURITY.
  CONFIGURATION SECTION. INPUT-OUTPUT SECTION.
  FILE SECTION. WORKING-STORAGE SECTION. LINKAGE SECTION. REPORT SECTION. SCREEN SECTION.

CODASYL - DBTG
  COnference on DAta SYstems Languages, Data Base Task Group
  Defined DDL for a network data model
  Set-Relationship semantics
  Cursor Verbs
  Isolated from procedures.
  No encapsulation

[Diagram labels: “knowledge”, “data”]

The Object-Relational World: marry programming languages and DBMSs

Stored procedures evolve to “real” languages
  VB, Java, C#,.. with real object models.
Data encapsulated: a class with methods
Tables are enumerable & indexable record sets with foreign keys
Records are vectors of objects
Opaque or transparent types
Set operators on transparent classes
Transactions:
  Preserve invariants
  A composition strategy
  An exception strategy
Ends Inside-DB Outside-DB dichotomy

Niklaus Wirth: Programs = Algorithms + Data Structures

[Diagram label: Business Objects]

Ask not “How to add objects to databases?”
Ask “What kind of object is a database?”

Q: Given an object model, what is a DB?
A: DataSet class and methods (nested relation with metadata)

The basis for the ecosystem
  Distributed DB
  Extensible DB
  Interoperable DB
  ….
  implicit in ODBC, OleDB
  explicit within the DBMS ecosystem

Input: Command (any language)
Output: Dataset (tables or text or cube or …)

[Diagram: Question in, Dataset out]
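The "command in, DataSet out" view of a database can be sketched in a few lines. This is a minimal illustration, not the API of any product named in the talk: the `DataSet` and `Database` classes, the `Books` table, and its rows are all hypothetical, and SQLite (through Python's standard `sqlite3` module) stands in for the DBMS.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class DataSet:
    """Hypothetical stand-in for the slide's DataSet: rows plus metadata."""
    columns: list   # metadata: column names
    rows: list      # data: a list of tuples

    def __iter__(self):          # enumerable
        return iter(self.rows)

    def __getitem__(self, i):    # indexable
        return self.rows[i]


class Database:
    """A database viewed as an object: command in, DataSet out."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")

    def execute(self, command: str, params=()) -> DataSet:
        cur = self._conn.execute(command, params)
        # cursor.description is None for statements that return no rows
        columns = [d[0] for d in cur.description] if cur.description else []
        return DataSet(columns, cur.fetchall())


db = Database()
db.execute("create table Books (BookID integer primary key, Title text)")
db.execute("insert into Books (Title) values (?), (?)", ("Book A", "Book B"))
result = db.execute("select BookID, Title from Books order by BookID")
for row in result:
    print(dict(zip(result.columns, row)))
```

Because the result carries its own column metadata, a consumer can interpret it without knowing which command, or which language, produced it.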

DB System Architecture

The classic DBMS model
  [Diagram layers: os, records, sets, utilities]
  but applications need to query other data types
Added:
  + Text, Time, Space
  + Triggers and queues
  + Replication, Pub/sub
  + Extract-Transform-Load
  + Cubes, Data mining
  + XML, XQuery
  + Programming Languages
  + Many more extensions coming
A Mess?
  [Diagram: the classic layers (os, records, sets, utilities) surrounded by Replication, ETL, Text, Cubes, Data Mine, Time, Space, Notification, Procedures, Queues, XML]

Evolving to be Information Services Container

Develop, deploy, and execution environment
Classic ++
  + Programming Languages
  + Triggers and queues
  + Replication, Pub/sub
  + Extract-Transform-Load
  + Text, Time, Space
  + Cubes, Data mining
  + XML, XQuery
  + Many more extensions coming
DBMS is an ecosystem
OO is the key structuring strategy:
  Everything is a class
  Database is a complex object
  Core object is DataSet
  Classes publish/consume them
  Depends on strong Object Model

[Diagram labels: os, records, sets, utilities, DataSet]

What’s Outside?

[Diagram labels: Internet; Our API; Applications; Competitor1; Competitor2; Other us (x4); Remote Node (x2); Query Processor; catalogs; iterators; Buffer Pool; data]

Classic: What’s Outside? Three Tier Computing

Clients gather input, do presentation, do some workflow (script)
Send high-level requests to ORB (Object Request Broker)
ORB dispatches workflows, orchestrates flows & queues
Workflows invoke business objects
Business objects read/write database

[Diagram tiers: Presentation, workflows, Business Objects, Databases]

DBMS is Web Service!
Client/server is back; the revenge of TP-lite

Web servers and runtimes (Apache, IIS, J2EE, .NET) displaced TP monitors & ORBs
  Give persistent objects
  Holistic programming model & environment
Web services (SOAP, WSDL, XML) are displacing current brokers
DBMS listening to Port 80
  publishing WSDL, DISCO, WS-Sec
  Servicing SOAP calls.
DBMS is a web service
  Basis for distributed systems.
  A consequence of OR DBMS

[Diagram tiers: Presentation, workflows, Business Objects, Databases; DBMS]

Queues & Workflows

Apps are loosely connected via queued messages
Queues are databases.
  Basis for workflow
Queues: the first class to add to an OR DBMS
Queues fire triggers.
  Active databases
Synergy with DBMS: security, naming, persistence, types, query,…
Workflow: Script, Execute, Administer & Expedite, all built on queues
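"Queues are databases" and "queues fire triggers" can be made concrete with a small sketch. It assumes SQLite via Python's standard `sqlite3` module as a stand-in DBMS; the table, trigger, and function names (`OrderQueue`, `WorkflowLog`, `OnEnqueue`, `enqueue`, `dequeue`) are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The queue is an ordinary table: persistent, typed, queryable.
    create table OrderQueue (
        MsgID   integer primary key autoincrement,
        Payload text not null,
        Status  text not null default 'queued'
    );
    -- Hypothetical workflow step: a log table fed by the trigger.
    create table WorkflowLog (MsgID integer, Note text);
    -- Enqueueing a message fires a trigger (the "active database" part).
    create trigger OnEnqueue after insert on OrderQueue
    begin
        insert into WorkflowLog values (new.MsgID, 'received ' || new.Payload);
    end;
""")

def enqueue(payload: str) -> None:
    with conn:  # one transaction per enqueue
        conn.execute("insert into OrderQueue (Payload) values (?)", (payload,))

def dequeue():
    """Take the oldest queued message, or None if the queue is empty."""
    with conn:  # the select and the update commit together or not at all
        row = conn.execute(
            "select MsgID, Payload from OrderQueue "
            "where Status = 'queued' order by MsgID limit 1").fetchone()
        if row is not None:
            conn.execute("update OrderQueue set Status = 'done' where MsgID = ?",
                         (row[0],))
        return row

enqueue("order-1")
enqueue("order-2")
first = dequeue()
log = conn.execute("select MsgID, Note from WorkflowLog order by MsgID").fetchall()
```

The queue inherits transactions, persistence, and query from the DBMS for free, which is the synergy the slide points at; a production queue would also need to handle concurrent consumers, which this sketch does not.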

What’s new here?

DBMS have tight integration with language classes (Java, C#, VB,.. )
The DB is a class
You can add classes to DB.
Adding indices is “easy” if you have a new idea.
Now have solid queue systems
Adding workflow is “easy” if you have a new idea.
This is a vehicle for publishing data on the Web.

[Diagram: Question in, Dataset out (tables or text or cube or …), published over the Internet as a web service]

Text, Temporal, and Spatial Data Access

Q: What comes after queues?
A: Basic types: text, time, space,…
Great application of OR technology
Key idea: table valued functions == indices
  An index is a table, organized differently
  Query executor uses index to map: Key → set (aka sequence of rows)
  Table valued function can do this map
  Optimizer can use it.
  + extras: cost function, cardinality,…
BIG DEAL: Approximate answers: Rank and Support

  select Title, Abstract, T.Rank
  from Books join FreeTextTable(Title, Abstract, 'XML semistructured') T
    on BookID = T.Key

  select store, holiday, sum(sales)
  from Sales join HolidayDates(2004) T
    on Sales.day = T.day
  group by store, holiday

  select galaxy, distance
  from GetNearbyObjEQ(22,37)
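The "table valued functions == indices" idea can be sketched without a DBMS. The example below is a toy under stated assumptions: the `books` rows are made up, `free_text_table` is a hypothetical analogue of the slide's FreeTextTable, and the rank is simply the count of matching query terms rather than any real ranking formula.

```python
from collections import defaultdict

# Hypothetical sample table: BookID -> (Title, Abstract)
books = {
    1: ("XML Basics", "An introduction to XML documents"),
    2: ("Relational Design", "Normal forms and schema design"),
    3: ("Semistructured Data", "Querying semistructured XML data"),
}

# Build the index once: term -> set of BookIDs (a table, organized differently).
inverted = defaultdict(set)
for book_id, (title, abstract) in books.items():
    for term in f"{title} {abstract}".lower().split():
        inverted[term].add(book_id)

def free_text_table(query: str):
    """Table-valued function: returns rows of (Key, Rank), best match first.

    Rank is the number of query terms a row matches, a toy scoring rule
    chosen for illustration.
    """
    hits = defaultdict(int)
    for term in query.lower().split():
        for book_id in inverted.get(term, ()):
            hits[book_id] += 1
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))

# Rough equivalent of:
#   select Title, T.Rank from Books join FreeTextTable(...) T on BookID = T.Key
result = [(books[key][0], rank) for key, rank in free_text_table("XML semistructured")]
```

The function maps a key (the query text) to a set of rows, which is exactly the job of an index, and the rank column is what makes the answer approximate rather than exact.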

Data Mining and Machine Learning

Tasks: classification, association, prediction
Tools: Decision trees, Bayes, A Priori, clustering, regression, Neural net,…
  now unified with DBs
Create table T (x,y,z,u,v,w)
  Learn “x,y,z” from “u,v,w” using <algorithm>
Train T with data.
Then can ask:
  Probability x,y,z,u,v,w
  What are the u,v,w probabilities given x,y,z
Example: Learn height from age.
Anyone with a data mining algorithm has full access to the DBMS infrastructure.
Challenge: Better learning algorithms.
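The "learn height from age" example follows a train-then-ask pattern that can be sketched with ordinary least squares standing in for `<algorithm>`. The slide does not name an algorithm, so that choice is an assumption; the training rows are made up for illustration and `learn` is a hypothetical helper.

```python
def learn(pairs):
    """Ordinary least squares fit of y = a + b*x over (x, y) pairs.

    Returns a function that predicts y for a new x.
    """
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

# Made-up training rows (age in years, height in cm), for illustration only.
training = [(2, 86), (4, 102), (6, 115), (8, 128), (10, 139)]

# "Train T with data", then ask the model about an age that was never stored.
predict_height = learn(training)
print(round(predict_height(7), 1))
```

In the slide's framing the training rows would live in table T and the fitted model would be queried like any other table; here a plain list and a closure play those roles.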

Notification: Stream and Sensor Processing

Traditionally: Query billions of facts
Streams: millions of queries, one new fact
  New protein compare to all DNA
  Change in price or time
Implications
  New aggregation operators (extension)
  New programming style
Streams in products:
  Queries represented as records
  New query optimizations.
Sensor networks push queries out to sensors.
  Simpler programming model
  Optimizes power & bandwidth

[Diagram: traditionally, one query (Q?) over stored facts returns an answer (A!); with notification, a stream of facts (fact, fact, fact…) is matched against many standing queries (Q)]
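"Queries represented as records" inverts the usual roles: the queries are the stored data and each arriving fact is the probe. A minimal sketch follows, with hypothetical symbols, thresholds, and names throughout.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}

# Standing queries stored as records: (QueryID, Symbol, Op, Threshold).
# All names and values here are hypothetical.
standing_queries = [
    (1, "AAA", ">", 30.0),
    (2, "AAA", "<", 20.0),
    (3, "BBB", ">", 90.0),
]

# Index the query records by symbol so one fact touches only relevant queries.
by_symbol = {}
for query in standing_queries:
    by_symbol.setdefault(query[1], []).append(query)

def notify(fact):
    """Match one new fact (symbol, price) against all standing queries.

    Returns the IDs of the queries whose condition the fact satisfies.
    """
    symbol, price = fact
    return [qid for qid, _, op, threshold in by_symbol.get(symbol, [])
            if OPS[op](price, threshold)]

alerts = notify(("AAA", 31.5))
```

Because the queries are plain records, they can be indexed and optimized like any other table, which is one reading of the slide's "new query optimizations".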

Semi-Structured Data

“Everyone starts with the same schema: <stuff/>. Then they refine it.” J. Widom
“Strong schema” has pros-and-cons.
Files <stuff/> and XML <<foo/> <bar/>> are here to stay. Get over it!
File directories are databases;
  Pivot on any attribute
  Folders are standing queries.
  Freetext+schema search (better precision/recall)
Cohabit with row-stores

Publish-Subscribe, Replication, Extract-Transform-Load (ETL)

Data has many users
Replicas for availability and/or performance
Mobile users do local updates, synchronize later.
Classic Warehouse
  Replicate to data warehouse
  Data marts subscribe to publications
Disaster Recovery geoplex
ETL is a major application & component
  Data loading
  Data scrubbing
  Publish/subscribe workflows.
Key to data integration (capture / scrub)

Restatement: DB Systems evolved to be containers for information services

Develop, deploy, and execution environment
DBMS is an ecosystem
Key structuring strategy:
  Everything is a class
  Database is a complex object
  Core object is DataSet
Approximate answers
This architecture lets you add your new ideas.

[Diagram labels: os, records, sets, utilities, DataSet]

Summary:

Looking at the past: old problems now look easy

Looking forward:
  data avalanche here
  integrate ALL kinds of data

Watershed: The new world
  Programs + data: Info Ecosystem
  All data classes (Objectifying Information)
  Approximate answers

Additional Resources

Papers at: http://research.microsoft.com/~gray/JimGrayPublications.htm
Talks at: http://research.microsoft.com/~gray/JimGrayTalks.htm
Basis for this talk: “The Revolution in Database Architecture” http://research.microsoft.com/research/pubs/view.aspx?tr_id=735
Very interesting & related: David Campbell, “Service Oriented Database Architecture: App Server-Lite?” http://research.microsoft.com/research/pubs/view.aspx?tr_id=983

Thank you!

Thank you for attending this session and the 2005 PASS Community Summit in Grapevine! Please help us improve the quality of our conference by completing your session evaluation form. Completed evaluation forms may be given to the room monitor as you exit or to staff at the registration desk.