Top-K Query Evaluation on Probabilistic Data Christopher Ré, Nilesh Dalvi and Dan Suciu University of Washington

Page 1: Top-K Query Evaluation on Probabilistic Data

Top-K Query Evaluation on Probabilistic Data
Christopher Ré, Nilesh Dalvi and Dan Suciu
University of Washington

Page 2: Top-K Query Evaluation on Probabilistic Data

Evaluating Complex SQL on PDBs, 12/8/2006

High Level Overview

• DBMS: precise answers over clean data
• Data are often imprecise
    • Information integration
    • Information extraction
• Probabilistic DBs (PDBs) handle imprecision
    • Many low-quality answers
    • Top-K ranked by probability
• This talk: compute Top-K efficiently

Page 3: Top-K Query Evaluation on Probabilistic Data

Overview

• Motivating Example
• Query Processing Background
• Multisimulation
• Experimental Results

Page 5: Top-K Query Evaluation on Probabilistic Data

Example Application

IMDB
• Lots of interesting data about movies (e.g. actors, directors)
• Well maintained and clean
• But no reviews!

On the web there are lots of reviews.

How will I know which movie they are about? Alice needs to do information extraction and object reconciliation.

Is a movie good or bad? Alice wants to do sentiment analysis.

A probabilistic database can help Alice store and query her uncertain data.

Find all years where ‘Anthony Hopkins’ starred in a good movie.

Page 6: Top-K Query Evaluation on Probabilistic Data

Imprecision is out there… Object Reconciliation

Data extracted from reviews:

RID    Title
r124   12 Monkeys
r155   Twelve Monkeys
r175   2 Monkey
r194   Monk

Clean IMDB data:

MID    Title
m232   12 Monkeys
m143   Monkey Love

Output: (RID, MID) pairs

Fellegi-Sunter approach: score (s) each (RID, MID) pair

[Figure: score axis with thresholds t’ and t and regions labeled Match and No Match]

Our approach: convert scores to probabilities

Page 7: Top-K Query Evaluation on Probabilistic Data

Imprecision is out there… Object Reconciliation

RID    Title
r124   12 Monkeys
r155   Twelve Monkeys
r175   2 Monkey
r194   Monk

MID    Title
m232   12 Monkeys
m143   Monkey Love

RID    MID    Prob
r175   m232   0.8
r175   m143   0.2

Fellegi-Sunter approach: score (s) each (RID, MID) pair

[Figure: score axis with thresholds t’ and t and regions labeled Match and No Match]

Page 9: Top-K Query Evaluation on Probabilistic Data

Query Processing Background

RID    MID    Prob
r175   m232   0.8
r175   m143   0.2

Query processing builds an event expression:

• Intensional query processing [FR97]
• Associate an event with each tuple
• Probability that the event is satisfied = query value

Technical point: projection as the last operator implies the result is a DNF.
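
To make the event-expression idea concrete, here is a minimal sketch that computes the probability of a small DNF exactly by enumerating every possible world. It assumes independent tuple events and a monotone DNF (no negated variables). The variable names `x1`, `x2`, `y1` and their probabilities are hypothetical and do not come from the talk.

```python
from itertools import product


def dnf_probability(clauses, probs):
    """Exact P(DNF is true) under independent variables.

    clauses: list of tuples of variable names (each tuple is a conjunction).
    probs:   dict mapping variable name -> probability of being true.
    Enumerates all 2^n worlds, so it is only usable for tiny inputs.
    """
    variables = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        # The DNF holds if at least one clause has all its variables true.
        if any(all(world[v] for v in clause) for clause in clauses):
            weight = 1.0
            for v in variables:
                weight *= probs[v] if world[v] else 1.0 - probs[v]
            total += weight
    return total


# Hypothetical event expression: (x1 AND y1) OR (x2 AND y1)
probs = {"x1": 0.8, "x2": 0.5, "y1": 0.6}
clauses = [("x1", "y1"), ("x2", "y1")]
p = dnf_probability(clauses, probs)
print(round(p, 4))
```

The enumeration is exponential in the number of variables, which is why the talk turns to sampling next.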

Page 10: Top-K Query Evaluation on Probabilistic Data

DNF Sampling at a High Level

• Estimate p(t), the probability that the DNF is satisfied
    • Do this for each output tuple t
    • #P-hard [Valiant79], even for conjunctive queries [RDS06, DS04]
• Randomized approximation [LK84]
    • Simulation reduces uncertainty

[Figure: interval on the axis from 0.0 to 1.0 showing the uncertainty about p(t)]
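
One well-known randomized scheme for DNF probability is the Karp-Luby coverage estimator. The sketch below is an illustration under assumptions: independent variables, a monotone DNF, and hypothetical variable names and probabilities. Whether it matches the exact variant used in the talk is not confirmed by the slides.

```python
import random


def clause_probability(clause, probs):
    """Probability that every variable in one conjunctive clause is true."""
    p = 1.0
    for var in clause:
        p *= probs[var]
    return p


def karp_luby_estimate(clauses, probs, n_samples, rng):
    """Coverage-style Monte Carlo estimate of P(DNF is true)."""
    weights = [clause_probability(c, probs) for c in clauses]
    total = sum(weights)
    hits = 0
    for _ in range(n_samples):
        # Pick a clause with probability proportional to its weight.
        i = rng.choices(range(len(clauses)), weights=weights, k=1)[0]
        # Sample a world conditioned on clause i being satisfied.
        world = {var: rng.random() < p for var, p in probs.items()}
        for var in clauses[i]:
            world[var] = True
        # Count the sample only if i is the first satisfied clause,
        # so each satisfying world is counted exactly once.
        first = next(j for j, c in enumerate(clauses)
                     if all(world[v] for v in c))
        hits += first == i
    return total * hits / n_samples


# Hypothetical event expression: (x1 AND y1) OR (x2 AND y1)
probs = {"x1": 0.8, "x2": 0.5, "y1": 0.6}
clauses = [("x1", "y1"), ("x2", "y1")]
estimate = karp_luby_estimate(clauses, probs, 20000, random.Random(0))
print(estimate)
```

More samples tighten the estimate, which is the sense in which each simulation step reduces the uncertainty about p(t).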

Page 11: Top-K Query Evaluation on Probabilistic Data

Naïve Query Processing

Naïve algorithm (PTIME): simulate until all intervals are “epsilon”-small.

[Figure: confidence intervals on the axis from 0.0 to 1.0 for Christopher Walken, Harvey Keitel, Samuel L. Jackson and Bruce Willis, labeled 1, 3, 4, 2]

Can we do better?
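
The naïve strategy can be sketched as follows. This is an illustration, not the talk's implementation: a Bernoulli draw stands in for one simulation step, the interval width comes from a Hoeffding bound (my choice, not stated on the slides), and the tuple names and probabilities are hypothetical.

```python
import math
import random


def hoeffding_half_width(n, delta):
    """Half-width of a two-sided Hoeffding interval after n Bernoulli draws."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def naive_top_k(true_probs, k, epsilon, delta, rng):
    """Simulate every tuple equally until each interval is at most epsilon wide."""
    counts = {t: 0 for t in true_probs}
    hits = {t: 0 for t in true_probs}
    steps = 0
    while True:
        for t, p in true_probs.items():
            hits[t] += rng.random() < p  # stand-in for one simulation step
            counts[t] += 1
            steps += 1
        n = steps // len(true_probs)  # every tuple has the same count
        if 2.0 * hoeffding_half_width(n, delta) <= epsilon:
            break
    estimates = {t: hits[t] / counts[t] for t in true_probs}
    ranked = sorted(estimates, key=estimates.get, reverse=True)
    return ranked[:k], steps


# Hypothetical output tuples and their (normally unknown) probabilities.
probs = {"t1": 0.9, "t2": 0.7, "t3": 0.3, "t4": 0.1}
top2, steps = naive_top_k(probs, k=2, epsilon=0.1, delta=0.05,
                          rng=random.Random(0))
print(top2, steps)
```

Every tuple receives the same number of steps here, including tuples that are clearly outside the Top-K early on. That wasted effort is what the question above is pointing at.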

Page 13: Top-K Query Evaluation on Probabilistic Data

A Better Method: Multisimulation

• Separate the Top-K with few simulations
    • Concentrate on intervals in the Top-K
    • Asymptotically, confidence intervals are nested
• Compare against OPT
    • OPT “knows” which intervals to simulate

[Figure: confidence intervals on the axis from 0.0 to 1.0 for Christopher Walken, Harvey Keitel, Samuel L. Jackson and Bruce Willis, labeled 1, 3, 4, 2]

Page 14: Top-K Query Evaluation on Probabilistic Data

The Critical Region

The critical region is the interval (k-th highest min, (k+1)-st highest max).

[Figure: intervals on the axis from 0.0 to 1.0 with the critical region marked for k = 2]
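
The definition translates directly into a few lines of code. In this sketch each interval is a hypothetical `(min, max)` pair, and the check that the region is empty once its lower end reaches its upper end is my reading of the definition rather than something stated on the slide.

```python
def critical_region(intervals, k):
    """Return (k-th highest lower bound, (k+1)-st highest upper bound).

    intervals: list of (lo, hi) pairs; requires 1 <= k < len(intervals).
    """
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]


# Hypothetical confidence intervals for four output tuples, k = 2.
intervals = [(0.6, 0.9), (0.5, 0.8), (0.2, 0.55), (0.1, 0.3)]
region = critical_region(intervals, k=2)
# If the lower end reaches the upper end, k intervals lie entirely
# above the remaining ones and the Top-K set is separated.
separated = region[0] >= region[1]
print(region, separated)
```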

Page 15: Top-K Query Evaluation on Probabilistic Data

Three Simple Rules: Rule 1

• Pick a “Double Crosser”
• OPT must pick this too

[Figure: intervals on the axis from 0.0 to 1.0]

Page 16: Top-K Query Evaluation on Probabilistic Data

Three Simple Rules: Rule 2

• If all crossers are lower crossers (or all are upper crossers), pick the maximal one
• OPT must pick this too

[Figure: intervals on the axis from 0.0 to 1.0]

Page 17: Top-K Query Evaluation on Probabilistic Data

Three Simple Rules: Rule 3

• Pick an upper and a lower crosser
• OPT may only pick 1 of these two

[Figure: intervals on the axis from 0.0 to 1.0]
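
The three rules can be sketched as a selection function. The slides do not spell out the definitions, so the following are assumptions: with critical region (c, d), a double crosser strictly contains the region, a lower crosser straddles only c, and an upper crosser straddles only d. Reading "maximal" as widest, highest top or lowest bottom, and the fallback for the case where no interval crosses, are also my choices.

```python
def critical_region(intervals, k):
    """(k-th highest lower bound, (k+1)-st highest upper bound)."""
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]


def classify(interval, c, d):
    """Assumed crosser definitions relative to the region (c, d)."""
    lo, hi = interval
    if lo < c and d < hi:
        return "double"
    if lo < c < hi:
        return "lower"  # not a double crosser, so hi <= d
    if lo < d < hi:
        return "upper"  # not a double crosser, so lo >= c
    return None


def choose(intervals, k):
    """Indices of the intervals to simulate next; [] once Top-K is separated."""
    c, d = critical_region(intervals, k)
    if c >= d:
        return []
    kinds = [classify(iv, c, d) for iv in intervals]
    doubles = [i for i, t in enumerate(kinds) if t == "double"]
    lowers = [i for i, t in enumerate(kinds) if t == "lower"]
    uppers = [i for i, t in enumerate(kinds) if t == "upper"]

    def width(i):
        return intervals[i][1] - intervals[i][0]

    if doubles:  # Rule 1
        return [max(doubles, key=width)]
    top = max(uppers, key=lambda i: intervals[i][1]) if uppers else None
    bottom = min(lowers, key=lambda i: intervals[i][0]) if lowers else None
    if uppers and lowers:  # Rule 3
        return [top, bottom]
    if uppers:  # Rule 2, upper crossers only
        return [top]
    if lowers:  # Rule 2, lower crossers only
        return [bottom]
    # Fallback (assumption): widest interval overlapping the region,
    # e.g. at the start when every interval is still [0, 1].
    inside = [i for i, (lo, hi) in enumerate(intervals)
              if lo < d and hi > c and hi > lo]
    return [max(inside, key=width)] if inside else []


# Hypothetical intervals, k = 2.
rule3 = choose([(0.6, 0.9), (0.5, 0.8), (0.2, 0.55), (0.1, 0.3)], 2)
rule1 = choose([(0.6, 0.9), (0.4, 0.8), (0.5, 0.7), (0.1, 0.3)], 2)
print(rule3, rule1)
```

In the first call the region is (0.5, 0.55) and no interval contains it, so one upper and one lower crosser are returned. In the second the region is (0.5, 0.7) and the interval (0.4, 0.8) contains it, so Rule 1 applies.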

Page 18: Top-K Query Evaluation on Probabilistic Data

Multisimulation is a 2-Approx

Thm: Multisimulation performs at most twice as many simulations as OPT, and no deterministic algorithm can do better on every instance.

Extensions
• Top-K Set (shown)
• Anytime (produce from 1 to k)
• Rank (produce the top k, ranked)
• All (rank all intervals)

Page 20: Top-K Query Evaluation on Probabilistic Data

Experiment Details: Uncertain tuples

Table           # Tuples
StringMatch     339k
ActorMatch      6,758k
DirectorMatch   18k

Table           # Tuples
Reviews         292k

Page 21: Top-K Query Evaluation on Probabilistic Data

Running Time

Page 22: Top-K Query Evaluation on Probabilistic Data

Running Time

“Find all years in which Anthony Hopkins was in a highly rated movie” (SS)

• Small number of tuples output (33)
• Small DNFs per output (avg. 20.4, max. 63)

Page 23: Top-K Query Evaluation on Probabilistic Data

Running Time

“Find all directors who have a highly rated drama but low rated comedy” (LL)

• Large number of tuples output (1415)
• Large DNFs per output (avg. 234.8, max. 9088)

Page 24: Top-K Query Evaluation on Probabilistic Data

Conclusions

Mystiq is a general purpose probabilistic database

Multisimulation and Logical Optimization key to performance on large data sets

Advert: Demo on my laptop

Page 25: Top-K Query Evaluation on Probabilistic Data

Running Time

“Find all actors in Pulp Fiction who appeared in two very bad movies in the five years before appearing in Pulp Fiction” (SL)

• Small number of tuples output (33)
• Large DNFs per output (avg. 117.7, max. 685)

Page 26: Top-K Query Evaluation on Probabilistic Data

Running Time

“Find all directors in the 80s who had a highly rated movie” (LS)

• Large number of tuples output (3259)
• Small DNFs per output (avg. 3.03, max. 30)
