Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS

Martin Theobald, Jennifer Widom

Stanford University

Outline

• Motivation and Applications
• Working Model for ULDBs
• Trio-One System Architecture
• Efficient Confidence Computations

Motivation

• Many applications involve data that is uncertain (approximate, probabilistic, inexact, incomplete, imprecise, fuzzy, inaccurate, ...)
• Many of the same applications need to track the lineage of their data

Neither uncertainty nor lineage is supported by conventional Database Management Systems (DBMSs)

Coincidence or Fate?

Sample Applications

Deduplication / Entity Resolution
• Uncertainty: Match and merge
• Lineage: Source records

Information extraction
• Uncertainty: Extracted labels and values
• Lineage: Original context

Information integration
• Uncertainty: Inconsistent information
• Lineage: Original sources

Sample Applications (continued)

Scientific experiments
• Uncertainty: Captured (and derived) data
• Lineage: Layers of views

Sensor data
• Uncertainty: Sensor values, aggregations, missing readings
• Lineage: Original readings, views


Claim

The connection between uncertainty and lineage goes deeper than just a shared need by several applications

Coincidence or Fate?

Lineage and Uncertainty

Lineage...
• Enables simple and consistent representation of uncertain data
• Allows for intuitive and complete working models
• Correlates uncertainty in query results with uncertainty in the input data
• Can make confidence computations over uncertain data more efficient

Applications use lineage to reduce or resolve uncertainty

Goal

A new kind of DBMS in which
1. Data
2. Uncertainty
3. Lineage
are all first-class interrelated concepts

(the three together: Trio)

The Trio Trio

1. Data Model
   Simplest extension to the relational model that’s sufficiently expressive

2. Query Language
   Simple extension to SQL with well-defined semantics and intuitive behavior

3. System
   A complete open-source DBMS that people want to use

The Present

1. Data Model
   Uncertainty-Lineage Databases (ULDBs)

2. Query Language
   TriQL

3. System
   Trio-One: Layered prototype deployable on top of any standard relational DBMS

Running Example: Crime-Solving

Saw(witness, car)      // may be uncertain
Drives(person, car)    // may be uncertain

Suspects(person) = π_person(Saw ⋈_car Drives)

Data Model: Uncertainty

An uncertain database represents a set of possible instances of data values

• Amy saw either a Honda or a Toyota
• Jimmy drives a Toyota, a Mazda, or both
• Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3
• Hank is a suspect with confidence 0.7

Our Model for Uncertainty

1. Alternatives (mutually exclusive)
2. ‘?’ (Maybe) Annotations
3. Confidences

Our Model for Uncertainty

1. Alternatives: uncertainty about data value(s)
2. ‘?’ (Maybe) Annotations
3. Confidences

Saw (witness, car)
(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

=

witness  car
Amy      { Honda, Toyota, Mazda }

Three possible instances

Our Model for Uncertainty

1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence
3. Confidences

Saw (witness, car)
(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)
(Betty, Acura)  ?

Six possible instances

Our Model for Uncertainty

1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences: weighted uncertainty

Saw (witness, car)
(Amy, Honda): 0.5 ∥ (Amy, Toyota): 0.3 ∥ (Amy, Mazda): 0.2
(Betty, Acura): 0.6  ?

Six possible instances, each with a probability
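Sketch (not from the slides): the six possible instances and their probabilities can be enumerated directly from the Saw relation above. The in-memory encoding and the function name `instance_probabilities` are assumptions made for illustration, and the x-tuples are assumed to be independent of each other.

```python
from itertools import product

# Hypothetical encoding: each x-tuple is a list of (tuple, confidence) pairs.
saw = [
    [(("Amy", "Honda"), 0.5), (("Amy", "Toyota"), 0.3), (("Amy", "Mazda"), 0.2)],
    [(("Betty", "Acura"), 0.6)],
]

def instance_probabilities(relation):
    """Map each possible instance to its probability (x-tuples independent)."""
    choices = []
    for alternatives in relation:
        absent = 1.0 - sum(conf for _, conf in alternatives)
        options = list(alternatives)
        if absent > 1e-9:  # leftover mass: the x-tuple may be absent ('?')
            options.append((None, absent))
        choices.append(options)
    result = {}
    for combo in product(*choices):
        instance = frozenset(t for t, _ in combo if t is not None)
        prob = 1.0
        for _, conf in combo:
            prob *= conf
        result[instance] = result.get(instance, 0.0) + prob
    return result

probs = instance_probabilities(saw)
for instance, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(sorted(instance), round(p, 2))
```

Amy's three alternatives combine with Betty's tuple being present or absent, which gives the six instances; their probabilities sum to 1.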

Models for Uncertainty

• Our model (so far) is not especially new
• We spent some time exploring the space of models for uncertainty
• Tension between understandability and expressiveness
  – Our model is understandable
  – But it is not complete, or even closed under common operations

Closure and Completeness

Completeness
   Can represent any finite set of possible instances

Closure
   Can represent any result of relational operators

Note: Completeness ⇒ Closure

Models for Uncertainty

• A-Tuples: uncertainty on attribute values
  – (Amy, {Toyota, Mazda})
  – Not closed

• X-Tuples: uncertainty on entire tuples
  – { (Amy, Toyota), (Amy, Mazda) }
  – Still not closed

• C-Tables: tuples + Boolean constraints
  – { (Amy, Toyota, X=1), (Amy, Mazda, X<>1), (Cathy, Honda, X<>1) }
  – Closed and complete!
  – Understandable?

Our Model is Not Closed

Saw (witness, car)
(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person, car)
(Jimmy, Toyota) ∥ (Jimmy, Mazda)
(Billy, Honda) ∥ (Frank, Honda)
(Hank, Honda)

Suspects = π_person(Saw ⋈_car Drives)

Suspects
Jimmy  ?
Billy ∥ Frank  ?
Hank  ?

Does not (in fact, CANNOT) correctly capture the possible instances in the result
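Sketch (not from the slides): a brute-force check of why the Suspects table above fails. It evaluates the join in every possible world of the base data, then compares against the instances allowed by three independent maybe-tuples. The list-based encoding and helper names are assumptions for illustration.

```python
from itertools import product

# Base x-tuples from the slide; every alternative list is mutually exclusive.
saw = [[("Cathy", "Honda"), ("Cathy", "Mazda")]]
drives = [
    [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    [("Billy", "Honda"), ("Frank", "Honda")],
    [("Hank", "Honda")],
]

def worlds(relation):
    """All ways to pick one alternative per x-tuple (no '?' in base data)."""
    return [set(combo) for combo in product(*relation)]

# True possible results: run the join and projection in every possible world.
true_results = set()
for s, d in product(worlds(saw), worlds(drives)):
    suspects = frozenset(p for (_, c1) in s for (p, c2) in d if c1 == c2)
    true_results.add(suspects)

# Result without lineage: three independent maybe x-tuples.
naive = [["Jimmy", None], ["Billy", "Frank", None], ["Hank", None]]
naive_results = {frozenset(x for x in combo if x is not None)
                 for combo in product(*naive)}

print(sorted(map(sorted, true_results)))
print(frozenset({"Jimmy", "Hank"}) in naive_results)  # True: model allows it
print(frozenset({"Jimmy", "Hank"}) in true_results)   # False: never occurs
```

Jimmy is a suspect only if Cathy saw a Mazda and Hank only if she saw a Honda, so they can never appear together; the lineage-free table loses that correlation.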

Lineage to the Rescue

Lineage (provenance): “where data came from”
• Internal lineage: pointers to base data
• External lineage: files, URLs, web portals, etc.

In Trio: A Boolean function λ from alternatives to other alternatives (or external sources)

Example with Lineage

ID  Saw (witness, car)
11  (Cathy, Honda) ∥ (Cathy, Mazda)

ID  Drives (person, car)
21  (Jimmy, Toyota) ∥ (Jimmy, Mazda)
22  (Billy, Honda) ∥ (Frank, Honda)
23  (Hank, Honda)

Suspects = π_person(Saw ⋈_car Drives)

ID  Suspects
31  Jimmy  ?
32  Billy ∥ Frank  ?
33  Hank  ?

λ(31) = (11,2),(21,2)
λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)
λ(33) = (11,1), 23

Correctly captures possible instances in the result

Trio Data Model

Uncertainty-Lineage Databases (ULDBs)

1. Alternatives (mutually exclusive)
2. ‘?’ (Maybe) Annotations
3. Confidences
4. Lineage

ULDBs are closed and complete

ULDBs: Lineage

Conjunctive lineage sufficient for most operations
• Disjunctive lineage for duplicate-elimination
• Negative lineage for difference

ULDBs: Minimality

A ULDB relation R represents a set of possible instances

Data-minimality
• Does every tuple in R appear in some possible instance? (no extraneous tuples)
• Is every maybe-tuple in R absent from some possible instance? (no extraneous ‘?’s)

Also: Lineage-minimality

Data Minimality Examples: Extraneous ‘?’

ID  Tuple
. . .
10  (Billy, Honda) ∥ (Frank, Honda)
. . .

ID  Tuple
. . .
40  Billy ∥ Frank  ?    (this ‘?’ is extraneous)
. . .

λ(40,1) = (10,1); λ(40,2) = (10,2)

Tuple 10 has no ‘?’, so one of its alternatives is always chosen and tuple 40 appears in every possible instance.

ID  Tuple
. . .
20  (Billy, Honda)  ?
. . .
30  (Frank, Honda)  ?
. . .

Data Minimality Examples: Extraneous tuple

(Diane, Mazda) ∥ (Diane, Acura)

(Diane, Mazda)  ?
(Diane, Acura)  ?

Diane  ?    (extraneous)

ULDBs: Membership Questions

• Does a given tuple t appear in some (all) possible instance(s) of R?
  Polynomial algorithms based on data-minimization

• Is a given table T one of (all of) the possible instances of R?
  NP-Hard


Non-Theorists...

Wake Back Up!

Querying ULDBs

TriQL
• Simple extension to SQL
• Formal semantics, intuitive meaning
• Query uncertainty, confidences, and lineage

Initial TriQL Example

ID  Saw (witness, car)
11  (Cathy, Honda) ∥ (Cathy, Mazda)

ID  Drives (person, car)
21  (Jimmy, Toyota) ∥ (Jimmy, Mazda)
22  (Billy, Honda) ∥ (Frank, Honda)
23  (Hank, Honda)

SELECT Drives.person INTO Suspects
FROM Saw, Drives
WHERE Saw.car = Drives.car

ID  Suspects
31  Jimmy  ?
32  Billy ∥ Frank  ?
33  Hank  ?

λ(31) = (11,2),(21,2)
λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)
λ(33) = (11,1), 23

Formal Semantics

Query Q on ULDB D

[Diagram: a commuting square]
• D has possible instances D1, D2, …, Dn
• Q on each instance yields Q(D1), Q(D2), …, Q(Dn)
• Implementation of Q (operational semantics) takes D to D’ = D + Result
• D’ is a representation of the instances Q(D1), Q(D2), …, Q(Dn)

Operational Semantics

SELECT attr-list [ INTO table ]
FROM X1, X2, ..., Xn
WHERE predicate

Over conventional relational database:
For each tuple in cross-product of X1, X2, ..., Xn
1. Evaluate the predicate
2. If true, project attr-list to create result tuple
3. If INTO clause, insert into table

Operational Semantics

SELECT attr-list [ INTO table ]
FROM X1, X2, ..., Xn
WHERE predicate

Over ULDB:
For each tuple in cross-product of X1, X2, ..., Xn
1. Create “super tuple” T from all combinations of alternatives
2. Evaluate predicate on each alternative in T; keep only the true ones
3. Project attr-list on each alternative to create result tuple
4. Details: ‘?’, lineage, confidences

Operational Semantics: Example

SELECT Drives.person INTO Suspects
FROM Saw, Drives
WHERE Saw.car = Drives.car

Saw (witness, car)
(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person, car)
(Jim, Mazda) ∥ (Bill, Mazda)
(Hank, Honda)

Super tuple for the Saw tuple and the first Drives tuple:
(Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)

Only the alternatives with Saw.car = Drives.car survive the predicate: (Cathy,Mazda,Jim,Mazda) and (Cathy,Mazda,Bill,Mazda).

Super tuple for the Saw tuple and the second Drives tuple:
(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

Only (Cathy,Honda,Hank,Honda) survives the predicate.

Suspects
Jim ∥ Bill  ?
Hank  ?

λ( ) = ...
λ( ) = ...
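Sketch (not from the slides): the four operational steps applied to this example in plain Python. The x-tuple IDs (11, 21, 22), the dictionary layout, and the function `select_person` are hypothetical, and the ‘?’ rule is simplified by assuming the input x-tuples themselves carry no ‘?’.

```python
from itertools import product

# X-tuples as lists of alternatives (hypothetical in-memory encoding).
saw = {11: [("Cathy", "Honda"), ("Cathy", "Mazda")]}
drives = {21: [("Jim", "Mazda"), ("Bill", "Mazda")], 22: [("Hank", "Honda")]}

def select_person(saw, drives):
    """SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car"""
    result = []
    for (sid, s_alts), (did, d_alts) in product(saw.items(), drives.items()):
        # Step 1: super tuple = every combination of alternatives.
        super_tuple = [((sid, i), s, (did, j), d)
                       for (i, s), (j, d) in product(enumerate(s_alts, 1),
                                                     enumerate(d_alts, 1))]
        # Step 2: keep only the combinations that satisfy the predicate.
        kept = [c for c in super_tuple if c[1][1] == c[3][1]]
        # Step 3: project Drives.person and record lineage to both sources.
        alts = [(d[0], [s_ref, d_ref]) for s_ref, s, d_ref, d in kept]
        if alts:
            # Step 4 (simplified): '?' if some combination was filtered out.
            result.append({"alts": alts, "maybe": len(kept) < len(super_tuple)})
    return result

for x_tuple in select_person(saw, drives):
    print(x_tuple)
```

The first result x-tuple holds Jim and Bill, the second holds Hank, and both are marked as maybe-tuples because some combination of base alternatives produces no result.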

Confidences

Confidences supplied with base data
Trio computes confidences on query results
• Default probabilistic interpretation
• Can choose to plug in different arithmetic

Saw (witness, car)
(Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4

Drives (person, car)
(Jim, Mazda): 0.3 ∥ (Bill, Mazda): 0.6  ?
(Hank, Honda): 1.0

Suspects (Probabilistic)
Jim: 0.12 ∥ Bill: 0.24  ?
Hank: 0.6  ?

Suspects (Min)
Jim: 0.3 ∥ Bill: 0.4  ?
Hank: 0.6  ?
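Sketch (not from the slides): both arithmetics applied to the lineage of each Suspects alternative. The dictionary keys and the function `result_conf` are assumptions for illustration; the product rule assumes the Saw and Drives alternatives are independent.

```python
import math

# Base confidences keyed by (relation, alternative), taken from the slide.
conf = {
    ("Saw", "Honda"): 0.6, ("Saw", "Mazda"): 0.4,
    ("Drives", "Jim"): 0.3, ("Drives", "Bill"): 0.6, ("Drives", "Hank"): 1.0,
}
# Conjunctive lineage of each Suspects alternative.
lineage = {
    "Jim": [("Saw", "Mazda"), ("Drives", "Jim")],
    "Bill": [("Saw", "Mazda"), ("Drives", "Bill")],
    "Hank": [("Saw", "Honda"), ("Drives", "Hank")],
}

def result_conf(alternative, combine):
    """Combine the confidences of the base alternatives in the lineage."""
    return combine(conf[source] for source in lineage[alternative])

for name in lineage:
    print(name, round(result_conf(name, math.prod), 2), result_conf(name, min))
```

Passing `math.prod` gives the probabilistic values 0.12, 0.24, and 0.6; passing `min` gives 0.3, 0.4, and 0.6.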

TriQL: Querying Confidences

Built-in function: Conf()

SELECT Drives.person INTO Suspects
FROM Saw, Drives
WHERE Saw.car = Drives.car
  AND Conf(Saw) > 0.5 AND Conf(Drives) > 0.8

SELECT Drives.person INTO Suspects
FROM Saw, Drives
WHERE Saw.car = Drives.car
  AND Conf(*) > 0.4

TriQL: Querying Lineage

Built-in join predicate: Lineage()

SELECT Saw.witness INTO AccusesHank
FROM Suspects, Saw
WHERE Lineage(Suspects,Saw)
  AND Suspects.person = 'Hank'

Also works for transitive lineage:

SELECT witness
FROM AccusesHank
WHERE Lineage(Saw,AccusesHank)

Additional Query Constructs

• “Horizontal subqueries”
  Refer to tuple alternatives as a relation
• Unmerged (allow horizontal duplicates)
• Flatten, GroupAlts
• NoLineage, NoConf, NoMaybe
• Query-computed confidences
• SQL-like data modification statements

Final Example Query

List suspects with conf values based on accuser credibility

PrimeSuspect (crime#, accuser, suspect)
(1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank)
(2, Cathy, Frank) ∥ (2, Betty, Freddy)

Credibility
person  score
Amy     10
Betty   15
Cathy   5

SELECT suspect, score/[sum(score)] as conf
FROM (SELECT suspect,
             (SELECT score FROM Credibility C
              WHERE C.person = P.accuser)
      FROM PrimeSuspect P)

Alternatives to the computed expression: “Scaled as conf”, “Uniform as conf”

Suspects
Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166
Frank: 0.25 ∥ Freddy: 0.75
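Sketch (not from the slides): the confidence expression score/[sum(score)] reproduced in plain Python, where the sum runs over the alternatives of one x-tuple. The list and dictionary layout and the function name are assumptions for illustration, not TriQL itself.

```python
# Data from the slide; the layout is an assumption for illustration.
credibility = {"Amy": 10, "Betty": 15, "Cathy": 5}
prime_suspect = [
    [(1, "Amy", "Jimmy"), (1, "Betty", "Billy"), (1, "Cathy", "Hank")],
    [(2, "Cathy", "Frank"), (2, "Betty", "Freddy")],
]

def suspects_with_conf(x_tuples, scores):
    """Per x-tuple: confidence = accuser score / sum of scores in the x-tuple."""
    result = []
    for alternatives in x_tuples:
        total = sum(scores[accuser] for _, accuser, _ in alternatives)
        result.append([(suspect, scores[accuser] / total)
                       for _, accuser, suspect in alternatives])
    return result

for x_tuple in suspects_with_conf(prime_suspect, credibility):
    print(" || ".join(f"{s}: {c:.3f}" for s, c in x_tuple))
```

For crime 1 the scores sum to 30, giving 10/30, 15/30, and 5/30; for crime 2 they sum to 20, giving 5/20 and 15/20, which matches the Suspects table up to rounding.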

The Trio System

Version 1
• Entirely on top of a conventional DBMS
• Surprisingly easy and complete, reasonably efficient

The Trio System: Version 1

[Architecture diagram]
• Client statements: create trio table T(A,B); select C into R ...
• Trio API
  – Result cursors
  – Traverse lineage
• SQL commands are sent to the Relational DBMS, which stores:
  – Trio Metadata
  – T (aid, xid, A, B)
  – R (aid, xid, C)
  – Lin:R (aidFr, table, aidTo)

Relation Encoding

Saw (witness, car)
(Cathy, Honda) ∥ (Cathy, Mazda)

Saw (encoded)
aid  xid  conf  witness  car
11   1    2     Cathy    Honda
12   1    2     Cathy    Mazda

Drives (person, car)
(Jim, Mazda) ∥ (Bill, Mazda)
(Hank, Honda)

Drives (encoded)
aid  xid  conf  person  car
21   1    2     Jim     Mazda
22   1    2     Bill    Mazda
23   2    1     Hank    Honda

Suspects
Jim ∥ Bill  ?
Hank  ?

Suspects (encoded)
aid  xid  conf  person
31   1    3     Jim
32   1    3     Bill
33   2    2     Hank

Lin:Suspects
aidFr  table   aidTo
31     Saw     12
31     Drives  21
32     Saw     12
32     Drives  22
33     Saw     11
33     Drives  23

Relation Encoding with Confidences

Saw (witness, car)
(Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4

Saw (encoded)
aid  xid  conf  witness  car
11   1    0.6   Cathy    Honda
12   1    0.4   Cathy    Mazda

Drives (person, car)
(Jim, Mazda): 0.3 ∥ (Bill, Mazda): 0.6
(Hank, Honda)

Drives (encoded)
aid  xid  conf  person  car
21   1    0.3   Jim     Mazda
22   1    0.6   Bill    Mazda
23   2    1.0   Hank    Honda

Suspects (encoded)
aid  xid  conf  person
31   1    NULL  Jim     (0.12 once computed)
32   1    NULL  Bill    (0.24 once computed)
33   2    NULL  Hank    (0.6 once computed)

Query Translation

Persistent results: Query Q into result table R
1. Run query Q’ to produce “super-result” R
   Q’ ≈ Q, but adds the aid/xid’s of source tuples and joins lineage tables for lineage() predicates
2. Group R into alternatives, generate new xid’s
3. Materialize lineage data into lin:R
4. Compute confidences? (optional)
5. Add/update metadata: schemas, confidence info, lineage structure

Transient results: stop at 2, return cursor over new xid’s
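Sketch (not from the slides): steps 1 to 3 of the translation for the Suspects query, run on SQLite from Python. The table layout omits the conf column, the new aids start at 31 only to mirror the example numbering, and grouping result rows by the combination of source xids is an assumption about how alternatives are formed, not Trio's actual code.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Saw    (aid INTEGER, xid INTEGER, witness TEXT, car TEXT);
CREATE TABLE Drives (aid INTEGER, xid INTEGER, person TEXT, car TEXT);
INSERT INTO Saw VALUES (11, 1, 'Cathy', 'Honda'), (12, 1, 'Cathy', 'Mazda');
INSERT INTO Drives VALUES (21, 1, 'Jim', 'Mazda'), (22, 1, 'Bill', 'Mazda'),
                          (23, 2, 'Hank', 'Honda');
""")

# Step 1: Q' is Q plus the aid/xid columns of the source tuples.
super_result = db.execute("""
    SELECT Drives.person, Saw.aid, Saw.xid, Drives.aid, Drives.xid
    FROM Saw, Drives
    WHERE Saw.car = Drives.car
    ORDER BY Saw.xid, Drives.xid, Drives.aid
""").fetchall()

# Steps 2-3: rows sharing the same source xids become alternatives of one
# result x-tuple; each new aid gets lineage rows pointing at its sources.
suspects, lin_suspects, xids = [], [], {}
for new_aid, (person, s_aid, s_xid, d_aid, d_xid) in enumerate(super_result, 31):
    new_xid = xids.setdefault((s_xid, d_xid), len(xids) + 1)
    suspects.append((new_aid, new_xid, person))
    lin_suspects += [(new_aid, "Saw", s_aid), (new_aid, "Drives", d_aid)]

print(suspects)
print(lin_suspects)
```

The resulting rows correspond to an encoded Suspects table (aid, xid, person) and a Lin:Suspects table (aidFr, table, aidTo); a real implementation would insert them into the DBMS rather than hold them in Python lists.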


The Trio System: Version 1

[Architecture diagram]
• Clients: TrioExplorer (GUI client), command-line client
  – TriQL queries
  – DDL commands
  – Schema browsing
  – Table browsing
  – Explore lineage
  – On-demand confidence computation
• Trio API and translator (TriQL parser in Python), issuing Standard SQL to the Relational DBMS
• Inside the Relational DBMS:
  – Encoded Data Tables: data & system columns; A-IDs for alternatives; “verticalized” through shared X-IDs; confidences
  – Lineage Tables: one lineage table per derived table; connecting A-IDs
  – Trio Metadata: table types; schema-level lineage structure
  – Trio Stored Procedures: conf(), lineage()
• SPI-Plugin for PostgreSQL, replacing “Fetch-All”

The Trio System: Version 2

[Architecture diagram]
• Clients: TrioExplorer (GUI client), command-line client
• Trio API and translator
• Trio DBMS, reached through Standard SQL and direct function calls
  – Specialized Trio Data Structures
  – Specialized Trio Processing & Optimizing

Efficient Confidence Computation

Previous approach (probabilistic databases)
• Each operator computes confidences during query execution
• Only certain (“safe”) query plans allowed

In ULDBs
   Confidence of alternative A is a function of the confidences in λ(A)

Our approach
• Decoupled data and confidence computation
• Allow any query plan for data computation
• Compute confidences on-demand based on lineage

CIDs & Confidence Memoization

Identify the set of Closest Independent Descendants (CIDs) for each relation in the schema-level lineage DAG (relations with no shared ancestors)

Batched confidence computations in SQL
• Select unions of distinct base alternatives from all root-to-leaf lineage paths
• With CIDs and memoization enabled
  – if confidences are memoized, then stop at CIDs
  – else traverse down to the leaves (base data) and memoize at CIDs

(ongoing work)
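Sketch (not from the slides): collecting the union of distinct base alternatives over all lineage paths, with memoization, and multiplying their confidences once each. The alternative names and confidences are made up; the sketch assumes purely conjunctive lineage over independent base alternatives and memoizes every visited node rather than selecting CIDs.

```python
from functools import lru_cache
import math

# Hypothetical lineage DAG: derived alternative -> alternatives it came from.
lineage = {
    "olp1": ("ol1", "lp1"),
    "ol1": ("o1", "l1"),
    "lp1": ("l1", "p1"),
}
base_conf = {"o1": 0.9, "l1": 0.5, "p1": 0.8}  # made-up base confidences

@lru_cache(maxsize=None)  # memoize the base set of every visited alternative
def base_alternatives(alt):
    """Union of distinct base alternatives over all lineage paths from alt."""
    if alt in base_conf:
        return frozenset([alt])
    return frozenset().union(*(base_alternatives(p) for p in lineage[alt]))

def confidence(alt):
    # Multiply each distinct base alternative once; l1 is reached via two
    # paths from olp1 but must not be counted twice.
    return math.prod(base_conf[b] for b in base_alternatives(alt))

print(round(confidence("olp1"), 2))  # 0.36
```

Multiplying along each path separately would count l1 twice and give 0.18, which is why the distinct union matters when relations share ancestors.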

“Probabilistic TPC-H” Setup

• Horizontal split of TPC-H tables
• Uniform-random number of alternatives [1-10]
• Uniform confidences

[Figure: schema-level lineage DAG. Base tables Partsupps (0.8M), Lineitems (6M), and Orders (1.5M) are split horizontally into O1–O3, L1–L4, and P1–P3, which feed O, L, and P, then OL and LP, then OLP. Other sizes shown on the figure: 1.5M; 1.0M, 0.1M; 1.0M, 0.5M, 0.7M.]

CID(OLP) = {O, L, P}

Confidence Computation Times

[Figure: bar chart of execution time (seconds, 0 to 5,000) for O, L, P, OL, LP, OLP, and Total; series: No CIDs vs. CIDs]


Per-Tuple vs. Batching (no CIDs)


Memoization & Join Multiplicity

Current Topics

System
• Full query language
• Nice interface
• Performance experiments
• Demo applications

Algorithms: confidence & lineage computation, extraneous data, membership questions
• Minimize lineage traversal
• Confidence memoization
• Efficient batch computations

Future Directions

Theory, Model, Algorithms
• Unlimited opportunities

System
• Storage, indexing, partitioning
• Statistics and query optimization

More features
• Top-k by confidence
• Incomplete relations; continuous uncertainty; correlated uncertainty
• External lineage; update lineage; versioning

Search “stanford trio”

Current Trio team:
Parag Agrawal, Anish Das Sarma, Raghotham Murthy, Michi Mutsuzaki, Shubha Nabar, Tomoe Sugihara, Martin Theobald, Jennifer Widom

Special thanks to:
Ashok Chandra, Alon Halevy, Jeff Ullman, Omar Benjelloun

but don’t forget the lineage…