UncertaintyLineage
DataBases
VeryLargeDataBases
19752006
ULDBs: Databases with ULDBs: Databases with Uncertainty and LineageUncertainty and Lineage
Omar Benjelloun, Anish Das Sarma,
Alon Halevy, Jennifer WidomStanford InfoLab
33
MotMotivivationation
• Many applications involve data that is uncertain (approximate, probabilistic, inexact, incomplete,
imprecise, fuzzy, inaccurate, ...)
• Many of the same applications need to track the lineage of their data
Neither uncertainty nor lineage are supported by conventional DBMSs
Coincidence or Fate?Coincidence or Fate?
44
Sample Applications Needing Sample Applications Needing Uncertainty and LineageUncertainty and Lineage
•Scientific databases
•Sensor databases
•Data cleaning
•Data integration
• Information extraction
55
Trio ProjectTrio Project
Building a new kind of DBMS in which:1. Data2. Uncertainty3. Lineage
are all first-class interrelated concepts
66
Lineage and UncertaintyLineage and Uncertainty
•Lots of independent work in lineage and uncertainty (related work at end of talk)
•Turns out: The connection between uncertainty and lineage goes deeper than just a shared need by several applications
Coincidence or Fate?Coincidence or Fate?
77
Lineage and UncertaintyLineage and Uncertainty
• Lineage...1) Enables simple and consistent
representation of uncertain data2) Correlates uncertainty in query results
with uncertainty in the input data3) Can make computation over uncertain
data more efficient
88
Outline of the TalkOutline of the Talk
•The ULDB data model
•Querying ULDBs
•ULDB properties
•Membership and extraction operations
•Confidences
•Current, related, and future work
99
Running Example: Crime Running Example: Crime SolverSolver
Saw(witness,car) Drives(person,car)
Suspects(person) = πperson(Saw ⋈ Drives)
1010
UncertaintyUncertainty
•An uncertain database represents a set of possible instances. Examples:– Amy saw either a Honda or a Toyota
– Jimmy drives a Toyota, a Mazda, or both
– Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3
– Hank is a suspect with confidence 0.7
1111
Uncertainty in a ULDBUncertainty in a ULDB
1. Alternatives2. ‘?’ (Maybe) Annotations3. Confidences
1212
Uncertainty in a ULDBUncertainty in a ULDB
1. Alternatives: uncertainty about value
2. ‘?’ (Maybe) Annotations3. Confidences
Saw (witness,car)
(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)
Three possibleinstances
witness carAmy { Honda, Toyota,
Mazda }
=
1313
Six possibleinstances
Uncertainty in a ULDBUncertainty in a ULDB
1. Alternatives2. ‘?’ (Maybe): uncertainty about
presence3. Confidences
Saw (witness,car)
(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)
(Betty, Acura)?
1414
Uncertainty in a ULDBUncertainty in a ULDB
1. Alternatives2. ‘?’ (Maybe) Annotations3. Confidences: weighted
uncertaintySaw (witness,car)
(Amy, Honda): 0.5 ∥ (Amy,Toyota): 0.3 ∥ (Amy, Mazda): 0.2
(Betty, Acura): 0.6?
Six possible instances, each with a probability
1515
Data Models for Data Models for UncertaintyUncertainty•Our model (so far) is not especially new•We spent some time exploring the
space of models for uncertainty [ICDE 2006]
•Tension between understandability and expressiveness– Our model is understandable– But it is not complete, or even closed under
common operations
1616
Closure and Closure and CompletenessCompleteness
•Completeness Can represent all sets of possible
instances
•Closure Can represent results of operations
•Note: Completeness Closure
1717
Model (so far) Not ClosedModel (so far) Not Closed
Saw (witness,car)
(Cathy, Honda) ∥ (Cathy, Mazda)
Drives (person,car)
(Jimmy, Toyota) ∥ (Jimmy, Mazda)
(Billy, Honda) ∥ (Frank, Honda)
(Hank, Honda)
Suspects
Jimmy
Billy ∥ Frank
Hank
Suspects = πperson(Saw ⋈ Drives)
?
??
Does not correctlycapture possibleinstances in theresult
CANNOT
1818
LineageLineage to the Rescue to the Rescue
•Lineage: “where data came from”– Internal lineage– External lineage (not covered in this
talk)
• In ULDBs: A function λ from alternatives to sets of alternatives (or external sources)
1919
Example with LineageExample with Lineage
ID Saw (witness,car)
11
(Cathy, Honda) ∥ (Cathy, Mazda)
ID Drives (person,car)
21
(Jimmy, Toyota) ∥ (Jimmy, Mazda)
22
(Billy, Honda) ∥ (Frank, Honda)
23
(Hank, Honda)
ID Suspects
31
Jimmy
32
Billy ∥ Frank
33
Hank
?
?
?
Suspects = πperson(Saw ⋈ Drives) λ(31) = (11,2),(21,2)
λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)
λ(33) = (11,1), 23
Correctly captures possible instances inthe result
2020
ULDBsULDBs
1. Alternatives2. ‘?’ (Maybe) Annotations3. Confidences4. Lineage
ULDBs are Closed and Complete
2121
Outline of the TalkOutline of the Talk
•The ULDB data model
•Querying ULDBs
•ULDB properties
•Membership and extraction operations
•Confidences
•Current, related, and future work
2222
Querying ULDBsQuerying ULDBs
•Query Q on ULDB D
DD
D1, D2, …, DnD1, D2, …, Dn
possibleinstances
Q on eachinstance
representationof instances
Q(D1), Q(D2), …, Q(Dn)Q(D1), Q(D2), …, Q(Dn)
D’D’implementation of Q
D + ResultD + Result
2323
Well-Behaved ULDBsWell-Behaved ULDBs
• If we start with a well-behaved ULDB and perform standard queries, it remains well-behaved
• Intuitively (details in paper):– Acyclic: No cycles in the lineage– Deterministic: Non-empty lineages of distinct
alternatives are distinct– Uniform: Alternatives of same tuple are derived
from the same set of tuples
2424
ULDB MinimalityULDB Minimality
• Data-minimality• Does every alternative appear in some
possible instance? (no extraneous alternatives)
• Does every maybe-tuple in R not appear in some possible instance? (no extraneous ‘?’s)
•Lineage-minimality
2525
Data-Minimality ExamplesData-Minimality ExamplesExtraneous ‘?’
. . .
10 (Billy, Honda) ∥ (Frank, Honda)
. . .
. . .
20
Billy ∥ Frank
. . .
?λ(20,1)=(10,1); λ(20,2)=(10,2)
extraneous
2626
Data-Minimality ExamplesData-Minimality ExamplesExtraneous alternative
(Diane, Mazda) ∥ (Diane, Acura)
Dianeextraneous
(Diane, Mazda)
(Diane, Acura)
?
? ?
2727
Data-MinimizationData-Minimization
• Extraneous alternative theorem:– An alternative is extraneous iff it is (possibly
transitively) derived from multiple alternatives of the same tuple.
• Extraneous “?” theorem– A “?” on tuple t is extraneous iff
1. it is derived from base tuples without “?”2. t has as many alternatives as the product of the
number in its base tuples
• Minimization algorithm based on the theorems
(see paper)
2828
Data-minimalLineage-minimal
Lineage-minimal
Data-minimal
Data-minimize
Queries
Lineage-minimize
ULDB Properties and ULDB Properties and OperationsOperations
Mem
bers
hip
Extraction
2929
Membership QuestionsMembership Questions
• Does a given tuple t appear in some (all) possible instance(s) of R ?– Polynomial algorithms based on Data-
minimization
• Is a given table T one of (all of) the possible instances of R ?– NP-Hard
t?, T?
RR
I1, I2, …, InI1, I2, …, In
possibleinstances
3030
ExtractionExtraction
•Extraction algorithm in paper
Suspects Eats
Saw Drives
3131
Outline of the TalkOutline of the Talk
•The ULDB data model
•Querying ULDBs
•ULDB properties
•Membership and extraction operations
•Confidences
•Current, related, and future work
3232
ConfidencesConfidences
• Confidences supplied with base data
• Trio computes confidences on query results– Default probabilistic interpretation– Can choose to plug in different arithmetic
Saw (witness,car)
(Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4
Drives (person,car)
(Jimmy, Mazda): 0.3 ∥ (Bill, Mazda): 0.6
(Hank, Honda): 1.0
Suspects
Jimmy: 0.12 ∥ Bill: 0.24
Hank: 0.6
?
?
?
0.3 0.4
0.6
ProbabilisticProbabilistic Min
3333
Query Processing with Query Processing with ConfidencesConfidences• Previous approach (probabilistic
databases)– Each operator computes confidences during
query execution– Only certain query plans allowed
• In ULDBs– Confidence of alternative A is function of
confidences in its transitive lineage
• Our approach: Decouple data and confidence computation– Use any query plan for data computation– Compute confidences on-demand using
lineage– Can give arbitrarily large improvements
3434
Current Work: AlgorithmsCurrent Work: Algorithms
•Algorithms: confidence computation, extraneous data, membership questions– Minimize lineage traversal– Memoization– Batch computations
3535
The Trio TrioThe Trio Trio
• Data Model– ULDBs (Coming: incomplete relations;
continuous uncertainty; correlation uncertainty)
• Query Language– Simple extension to SQL– Query uncertainty, confidences, and lineage
• System– Did you see our demo? – Version 1: Entirely on top of conventional DBMS– Surprisingly easy and complete, reasonably
efficient
TriQL
3636
Brief Related WorkBrief Related Work
•Uncertainty– Modeling
•C-tables [IL84], Probabilistic Databases [CP87], using Nested Relations [F90]
– Systems•ProbView [LLRS97], MYSTIQ [BDM+05],
ORION [CSP05], Trio [BDHW05]
•Lineage– DBNotes [CTV05], Data Warehouses
[CW03]
but don’t forgetthe lineage…
Thank YouThank You
Search “stanford trio”
(or, http://i.stanford.edu/trio)