System Aspects of Probabilistic Data Management Magdalena Balazinska, Christopher Ré and Dan Suciu University of Washington


Page 1: System Aspects of Probabilistic Data Management

System Aspects of Probabilistic Data Management

Magdalena Balazinska, Christopher Ré and Dan Suciu

University of Washington

Page 2: System Aspects of Probabilistic Data Management

One slide overview of motivation

• Data are uncertain in many applications
  – Business: Dedup, Info. Extraction
  – Data from physical-world: RFID

Probabilistic DBs (pDBs) manage uncertainty

Integrate, Query, and Build Applications on uncertain data

Value: Higher recall, without loss of precision

DB Niche: Community that knows scale

Page 3: System Aspects of Probabilistic Data Management

Overview of tutorial

• Part I: Basic Query Processing (Today)
  – Two Scenarios for pDBs
  – A Basic Query & Data Model
  – Basic Query Processing Techniques

• Highlights:
  1. The intuition behind safe plans and how to compile them
  2. Process any SELECT-FROM-WHERE (SFW) query
  3. Process top-k queries
  4. Aggregation: Top-k + measures, OLAP, HAVING

Page 4: System Aspects of Probabilistic Data Management

Overview of tutorial

• Part II: Advanced Techniques (Tomorrow)
  – Correlations
  – Advanced Representation & QP
  – Discussion and Open Problems

• Highlights:
  1. Lineage and View Processing (GBs of data)
  2. Events on Correlated Streams (GBs of Streams)
  3. Sophisticated Factor Evaluation (Highly Correlated)
  4. Continuous DBs

Page 5: System Aspects of Probabilistic Data Management

Hasn’t this been solved? (an analogy to keep in mind)

                 AI                        Databases
Deterministic    Theorem prover            Query processing
Probabilistic    Probabilistic inference   [this talk]

Impact: Fortune 500 companies rely on DBs, but how many have theorem provers?

SCALE

Page 6: System Aspects of Probabilistic Data Management

Ancillary Material

• pDBs have a long history
  – Cavallo & Pittarelli ’87
  – ProbView [Lakshmanan et al. ’97]
  – Many active projects today: Mystiq, Lahar, Trio, MayBMS, Maryland, Orion, MCDB, Wisconsin, IBM, BayesStore, UMass, Waterloo, SFU and more

• Many important topics omitted
  – Query languages
  – XML

Page 7: System Aspects of Probabilistic Data Management

Overview of tutorial

• Part I: Basic Query Processing (Today)
  – Two Scenarios for pDBs
  – A Basic Query & Data Model
  – Basic Query Processing Techniques

• Highlights:
  1. The intuition behind safe plans and how to compile them
  2. Process any SELECT-FROM-WHERE (SFW) query
  3. Process top-k queries
  4. Aggregation: Top-k + measures, OLAP, HAVING

Page 8: System Aspects of Probabilistic Data Management

Example 1: Querying RFID

[Figure: diagram with locations labeled A, B, C, D, E]

• Apps: UbiComp, Diary, Social Applications, ...
• In general, event queries [Cayuga, SASE]

Joe entered office 422 at t=8

Query: “Alert when Joe enters 422”, i.e. Joe is outside 422, then inside 422

[R, Letchner, B&S ’07] [http://rfid.cs.washington.edu]

Page 9: System Aspects of Probabilistic Data Management

Challenges: Tracking Joe’s Location

[Figure: 6th floor of the CS building]

Blue ring is Joe’s location

Antennas

[RFID Ecosystem @ UW]

Page 10: System Aspects of Probabilistic Data Management

6th Floor in CS building

Challenges: Tracking Joe’s Location

Blue ring is Joe’s location

Antennas

Two problems:
1. Missed readings
2. Granularity mismatch

• Model-based view (probabilistic)
  – [Deshpande et al. ’04, Kanagal & Deshpande ’08]

[Ré et al. ’08, Kanagal & Deshpande ’08]

Page 11: System Aspects of Probabilistic Data Management

Probabilities via particle filter

Each orange particle is a guess of Joe’s location

Blue ring is ground truth

Antennas

Particles guess many locations per timestep, so the data are uncertain

[Figure: 6th floor of the CS building]

[Doucet et al. ’01]

Page 12: System Aspects of Probabilistic Data Management

Probabilities via particle filter

[Figure: 6th floor of the CS building]

Query the particle filter output via At, a model-based view

At(tag, loc)

Tag   t   Loc     P
Joe   7   422     0.4
          Hall3   0.4
          Hall4   0.2
Joe   8   422     0.6
          Hall3   0.2
          Hall4   0.2
Sue   7   …       …

“Joe entered 422 at t=8 with probability 0.36”
(Joe is outside 422 at t=7 with probability 1 - 0.4 = 0.6 and inside 422 at t=8 with probability 0.6; treating the two timesteps as independent gives 0.6 × 0.6 = 0.36.)

Shameless ad: Markov correlations on Day 2

[Ré et al. ’08] [Kanagal & Deshpande ’08]
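The 0.36 figure can be reproduced with a few lines of Python. This is only a sketch: the dictionary layout and the function name `prob_entered` are illustrative, and it assumes the two timesteps are independent (the slide defers Markov correlations to Day 2).

```python
# Marginal location distributions for Joe, copied from the At(tag, loc) view.
at = {
    7: {"422": 0.4, "Hall3": 0.4, "Hall4": 0.2},
    8: {"422": 0.6, "Hall3": 0.2, "Hall4": 0.2},
}

def prob_entered(at, room, t):
    """Pr[outside `room` at t-1 and inside `room` at t].

    Assumption (not from the slides): timesteps are independent.
    """
    p_outside_before = 1.0 - at[t - 1].get(room, 0.0)
    p_inside_now = at[t].get(room, 0.0)
    return p_outside_before * p_inside_now

print(round(prob_entered(at, "422", 8), 2))
```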

Page 13: System Aspects of Probabilistic Data Management

Example 2: Alice Looks for Movies

Alice: “I’d like to know which movies are really good…”

IMDB:
• Lots of data!
• Well maintained and clean
• But no reviews!

[R, Dalvi & S ’07]

Page 14: System Aspects of Probabilistic Data Management

IMDB

On the web there are lots of reviews…
• Which movie is the review about?
• …is the review positive or negative?
• …should I trust the reviewer?

Alice needs:
• Information Extraction
• Fuzzy joins
• Sentiment analysis
• Social networks

Forced to deal with uncertainty

Page 15: System Aspects of Probabilistic Data Management

A probabilistic database can help Alice store and query her uncertain data

Example queries over IMDB:
• Find actors in Pulp Fiction who appeared in two bad movies five years earlier
• Find years when ‘Anthony Hopkins’ starred in a good movie

Alice’s workflow:
1. Download reviews
2. Information Extraction
3. Fuzzy Joins
4. Query pDB

IE FJ pDB

Page 16: System Aspects of Probabilistic Data Management

Alice needs Information Extraction

Input text: ...52 A Goregaon West Mumbai ...

Addressp

ID   House-No   Street          City          P
1    52         Goregaon West   Mumbai        0.1
1    52-A       Goregaon West   Mumbai        0.4
1    52         Goregaon        West Mumbai   0.2
1    52-A       Goregaon        West Mumbai   0.2
2    ...        ...             ...           ...

Here probabilities are meaningful

[Gupta & Sarawagi ’2006]

IE FJ pDB

Page 17: System Aspects of Probabilistic Data Management

Queries on IE

Find people living in ‘West Mumbai’

SELECT DISTINCT x.name
FROM Person x, Addressp y
WHERE x.ID = y.ID and y.city = ‘West Mumbai’

IE FJ pDB

Addressp

ID   House-No   Street          City          P
1    52         Goregaon West   Mumbai        0.1
1    52-A       Goregaon West   Mumbai        0.4
1    52         Goregaon        West Mumbai   0.2
1    52-A       Goregaon        West Mumbai   0.2

Answer:

By    P
Joe   0.4

If we kept only the most likely extraction, the query would return the empty set
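A small sketch of why the answer is 0.4 and why keeping only the best extraction loses it. The rows follow the Addressp table under the assumption that the last two alternatives place "West" in the City field, and the `persons` mapping (ID 1 is Joe) is inferred from the answer shown; both are assumptions, as are the function names.

```python
# Alternative extractions for one address; they are mutually exclusive.
addressp = [
    (1, "52",   "Goregaon West", "Mumbai",      0.1),
    (1, "52-A", "Goregaon West", "Mumbai",      0.4),
    (1, "52",   "Goregaon",      "West Mumbai", 0.2),
    (1, "52-A", "Goregaon",      "West Mumbai", 0.2),
]
persons = {1: "Joe"}  # assumed content of the Person table

def people_in(city):
    """pDB answer: add up the probabilities of the disjoint alternatives that match."""
    answer = {}
    for pid, _house, _street, c, p in addressp:
        if c == city:
            answer[persons[pid]] = answer.get(persons[pid], 0.0) + p
    return answer

def people_in_most_likely_only(city):
    """Baseline: keep only the most likely extraction per ID, then query."""
    best = {}
    for row in addressp:
        if row[0] not in best or row[4] > best[row[0]][4]:
            best[row[0]] = row
    return {persons[pid] for pid, _h, _s, c, _p in best.values() if c == city}

print(people_in("West Mumbai"))
print(people_in_most_likely_only("West Mumbai"))
```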

Page 18: System Aspects of Probabilistic Data Management

Queries on IE

Find people living in ‘West Mumbai’

SELECT DISTINCT x.name
FROM Person x, Addressp y
WHERE x.ID = y.ID and y.city = ‘West Mumbai’

Today: keep only the most likely extraction: low recall.
pDBs keep all extractions: higher recall.

Find people of the same age, living in the same city

SELECT DISTINCT x.name, u.name
FROM Person x, Addressp y, Person u, Addressp v
WHERE x.ID = y.ID and y.city = v.city and u.ID = v.ID

IE FJ pDB

Page 19: System Aspects of Probabilistic Data Management

Alice needs Fuzzy Joins

IMDB

Title                Year
Twelve Monkeys       1995
Monkey Love 1997     1997
Monkey Love 1935     1935
Monkey Love Planet   2005

Reviews

Review        By    Rating
12 Monkeys    Joe   4
Monkey Boy    Jim   2
Monkey Love   Joe   2

Titles don’t match

IE FJ pDB

Page 20: System Aspects of Probabilistic Data Management

Result of a Fuzzy Join

TitleReviewMatchp

Movie                Review        P
Twelve Monkeys       12 Monkeys    0.7
Monkey Love 1997     12 Monkeys    0.45
Monkey Love 1935     Monkey Love   0.82
Monkey Love 1935     Monkey Boy    0.68
Monkey Love Planet   Monkey Love   0.8

Higher scores, more likely to match

[Gravano et al. ’01, Arasu ’06]

IE FJ pDB

Page 21: System Aspects of Probabilistic Data Management

Queries over Fuzzy Joins

IMDB

Title            Year
Twelve Monkeys   1995
Monkey Love 97   1997
Monkey Love 35   1935
Monkey Love PL   2005

Reviews

Review        By    Rating
12 Monkeys    Joe   4
Monkey Boy    Jim   2
Monkey Love   Joe   2

TitleReviewMatchp

Movie                Review        P
Twelve Monkeys       12 Monkeys    0.7
Monkey Love 97       12 Monkeys    0.45
Monkey Love 35       Monkey Love   0.82
Monkey Love 35       Monkey Boy    0.68
Monkey Love Planet   Monkey Love   0.8

Who reviewed movies made in 1935?

SELECT DISTINCT z.By
FROM IMDB x, TitleReviewMatchp y, Amazon z
WHERE x.title=y.title and x.year=1935 and y.review=z.review

Answer (ranked!):

By     P
Joe    0.73
Fred   0.68
Jim    0.43
. . .  0.12

Find movies reviewed by Jim and Joe

SELECT DISTINCT x.Title
FROM IMDB x, TitleReviewMatchp y1, Amazon z1, TitleReviewMatchp y2, Amazon z2
WHERE . . . z1.By=‘Joe’ . . . z2.By=‘Jim’ . . .

Answer:

Title        P
Gone with…   0.73
Amadeus      0.68
. . .        0.43

IE FJ pDB

Page 22: System Aspects of Probabilistic Data Management

Application Summary

• pDBs can manage the outputs of great techniques
  – RFID: Particle Filters, HMMs
  – Alice needs: Fuzzy Joins, IE, Sentiment Analysis
• Value over standard RDBMSs: recall
• To keep precision high, need ranking (by probability)

Major Theme: Get high quality efficiently!

Page 23: System Aspects of Probabilistic Data Management

Overview of tutorial

• Part I: Basic Query Processing
  – Two Scenarios for pDBs
  – A Basic Query & Data Model
  – Basic Query Processing Techniques

Page 24: System Aspects of Probabilistic Data Management

Simple Probabilistic DB (pDB)

HasObjectp

Object     Time   Person   P
Laptop77   9:07   John     0.62
                  Jim      0.34
Book302    9:18   Mary     0.45
                  John     0.33
                  Fred     0.11

Object and Time are the keys, Person is the non-key attribute, P is the probability.

What does it mean?

[Barbara et al. ’92]

Page 25: System Aspects of Probabilistic Data Management

Possible Worlds Semantics

PDB: HasObjectp

Object     Time   Person   P
Laptop77   9:07   John     p1
                  Jim      p2
Book302    9:18   Mary     p3
                  John     p4
                  Fred     p5

Possible worlds: each world is an ordinary HasObject(Object, Time, Person) instance that picks at most one Person for each (Object, Time) key.

Laptop77 9:07   Book302 9:18   Probability
John            Mary           p1p3
John            John           p1p4
John            Fred
Jim             Mary
Jim             John
Jim             Fred
John            (absent)       p1(1-p3-p4-p5)
Jim             (absent)
(absent)        Mary
(absent)        John
(absent)        Fred
(absent)        (absent)

The probabilities of the remaining worlds follow the same pattern.

Distribution over possible worlds

[Fagin, Halpern, Megiddo ’90]
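The possible-worlds distribution can be enumerated directly for a table this small. The sketch below uses the numeric probabilities from the HasObjectp slide in place of p1 to p5; the dictionary layout and the name `possible_worlds` are illustrative choices, not part of the tutorial.

```python
from itertools import product

# One block per (Object, Time) key; alternatives within a block are disjoint.
has_object_p = {
    ("Laptop77", "9:07"): {"John": 0.62, "Jim": 0.34},
    ("Book302", "9:18"): {"Mary": 0.45, "John": 0.33, "Fred": 0.11},
}

def possible_worlds(pdb):
    """Yield (world, probability); blocks are assumed independent of each other."""
    keys = list(pdb)
    choices = []
    for key in keys:
        alternatives = list(pdb[key].items())
        # Leftover mass: the key has no tuple in this world.
        alternatives.append((None, 1.0 - sum(pdb[key].values())))
        choices.append(alternatives)
    for combo in product(*choices):
        world = {k: person for k, (person, _) in zip(keys, combo) if person is not None}
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

worlds = list(possible_worlds(has_object_p))
print(len(worlds), round(sum(p for _, p in worlds), 6))
```

The world in which John has the laptop and Mary has the book gets probability p1p3 = 0.62 × 0.45, matching the first entry on the slide.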

Page 26: System Aspects of Probabilistic Data Management

26

Two Approaches to Queries

• Standard queries, probabilistic answers (this tutorial)
  – Query: “find all movies with rating > 4”
  – Answers: list of tuples with probabilities

• Queries with explicit probabilities (MayBMS [Koch ’08])
  – Query: find all Movie-review matches with probability in [0.3, 0.8]
  – Answer: …

Page 27: System Aspects of Probabilistic Data Management

Possible Worlds Query Semantics

PDB: HasObjectp

Object     Time   Person   P
Laptop77   9:07   John     p1
                  Jim      p2
Book302    9:18   Mary     p3
                  John     p4
                  Fred     p5

Query on HasObject: “John has laptop77 and doesn’t have book302”

The query is true in exactly the possible worlds where Laptop77 is with John and Book302 is with Mary, with Fred, or with nobody. Summing their probabilities:

p1p3 + p1p5 + p1(1-p3-p4-p5) = p1(1-p4)

QP Goal: Compute cleverly, directly
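As a sanity check, the sum over possible worlds can be compared with the direct formula p1(1-p4). The sketch below plugs in the numeric probabilities from the HasObjectp slide; the helper names are made up for illustration.

```python
from itertools import product

laptop = {"John": 0.62, "Jim": 0.34}                # p1, p2
book = {"Mary": 0.45, "John": 0.33, "Fred": 0.11}   # p3, p4, p5

def alternatives(block):
    """Owners of one object plus the 'nobody' outcome (None) with the leftover mass."""
    return list(block.items()) + [(None, 1.0 - sum(block.values()))]

def query_probability(pred):
    """Sum the probabilities of the possible worlds in which pred holds."""
    total = 0.0
    for (l_owner, pl), (b_owner, pb) in product(alternatives(laptop), alternatives(book)):
        if pred(l_owner, b_owner):
            total += pl * pb
    return total

# "John has laptop77 and doesn't have book302"
by_worlds = query_probability(lambda l, b: l == "John" and b != "John")
direct = laptop["John"] * (1.0 - book["John"])      # p1 (1 - p4)
print(round(by_worlds, 4), round(direct, 4))
```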

Page 28: System Aspects of Probabilistic Data Management

Overview of Part I

• Part I: Basic Query Processing (TODAY)
  – Motivating Applications
  – A Simple Data Model (Representation)
  – Basic Query Processing Techniques

Page 29: System Aspects of Probabilistic Data Management

Basic Query Processing Outline

• SELECT-FROM-WHERE Queries (a natural start: the workhorse RDBMS queries)
  – Compiling Safe Queries
  – Unsafe Queries (Sampling)
  – Top-K

• Aggregation Queries + Probabilities (we believe these are very important for applications)
  – Top-K + Measures
  – OLAP Queries
  – HAVING Queries

Page 30: System Aspects of Probabilistic Data Management

Extensional Query Evaluation

Goal: Make relational ops compute probabilities

Operator       Input tuples            Output tuple
Selection σ    (v, p)                  (v, p)
JOIN           (v1, p1), (v2, p2)      (v1 v2, p1 p2)
Projection Π   (v, p1), (v, p2), …     (v, 1-(1-p1)(1-p2)…)

Projection removes duplicates; its output probability means “not all are false”.

Why? It’s SQL-scale and SQL-fast

[Fuhr & Roellke ’97, Dalvi & S ’04]
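The three extensional operators can be sketched over lists of (tuple, probability) pairs. This is an illustrative Python rendering, not the tutorial's implementation; the function names and the numeric probabilities in the example relation are hypothetical, and every operator assumes its inputs are independent.

```python
from collections import defaultdict

def select(rel, pred):
    """Selection keeps the probability of every qualifying tuple unchanged."""
    return [(v, p) for v, p in rel if pred(v)]

def join(left, right, on):
    """Independent join: the output probability is the product p1 * p2."""
    return [(v1 + v2, p1 * p2) for v1, p1 in left for v2, p2 in right if on(v1, v2)]

def project(rel, cols):
    """Independent project: duplicates are merged with 1 - (1-p1)(1-p2)..."""
    none_true = defaultdict(lambda: 1.0)
    for v, p in rel:
        key = tuple(v[c] for c in cols)
        none_true[key] *= 1.0 - p
    return [(key, 1.0 - q) for key, q in none_true.items()]

# Hypothetical relation of (person, loc) tuples with made-up probabilities.
home_office = [(("Bob", "SEA"), 0.5), (("Joe", "NYC"), 0.3),
               (("Jon", "SEA"), 0.2), (("Jeff", "SEA"), 0.1)]
by_loc = dict(project(home_office, cols=[1]))
print(by_loc)
```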

Page 31: System Aspects of Probabilistic Data Management

Extensional Plan to SQL

HomeOffice

Person   Loc   p
Bob      SEA   p1
Joe      NYC   p2
Jon      SEA   p3
Jeff     SEA   p4

Query:

SELECT DISTINCT loc
FROM HomeOffice

Plan: Π-{person} (NB: removes the person attribute)

Translation:

SELECT loc, 1 - PRODUCT(1-p) as p
FROM HomeOffice
GROUP BY loc

Result:

Loc   P
SEA   1-(1-p1)(1-p3)(1-p4)
NYC   p2

Important point: Extensional evaluation is SQL, so it is SQL-fast

So pDBs are just SQL, but…

[Fuhr & Roellke ’97, Dalvi & S ’04]
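The translated query can be tried out in SQLite from Python. SQLite has no built-in PRODUCT aggregate, so this sketch registers one with `sqlite3`'s `create_aggregate`; the `Product` class and the numeric probabilities standing in for p1 to p4 are assumptions made for illustration.

```python
import sqlite3

class Product:
    """User-defined aggregate: multiplies all of its inputs together."""
    def __init__(self):
        self.value = 1.0
    def step(self, x):
        self.value *= x
    def finalize(self):
        return self.value

conn = sqlite3.connect(":memory:")
conn.create_aggregate("PRODUCT", 1, Product)
conn.execute("CREATE TABLE HomeOffice (person TEXT, loc TEXT, p REAL)")
conn.executemany(
    "INSERT INTO HomeOffice VALUES (?, ?, ?)",
    # Hypothetical values for p1..p4.
    [("Bob", "SEA", 0.5), ("Joe", "NYC", 0.3), ("Jon", "SEA", 0.2), ("Jeff", "SEA", 0.1)],
)
rows = conn.execute(
    "SELECT loc, 1 - PRODUCT(1 - p) AS p FROM HomeOffice GROUP BY loc ORDER BY loc"
).fetchall()
result = dict(rows)
print(result)
```

In engines without user-defined aggregates, a common workaround is to rewrite the product as an exponential of a sum of logarithms, which needs extra care when some p equals 1.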

Page 32: System Aspects of Probabilistic Data Management

SELECT DISTINCT x.City
FROM Personp x, Purchasep y
WHERE x.Name = y.Cust and y.Product = ‘Gadget’

Inputs: Personp contains (Jon, Sea) with probability p1; Purchasep contains three matching tuples for Jon with probabilities q1, q2, q3.

Plan 1: JOIN, then Π. Wrong!

  After JOIN:   Jon Sea p1q1
                Jon Sea p1q2
                Jon Sea p1q3
  After Π:      Sea   1-(1-p1q1)(1-p1q2)(1-p1q3)

  The joined tuples all share the same Personp tuple, so they are not independent!

Plan 2: Π on Purchasep, then JOIN. Correct

  After Π:      Jon   1-(1-q1)(1-q2)(1-q3)
  After JOIN:   Sea   p1(1-(1-q1)(1-q2)(1-q3))

The result depends on the plan!!!

[Dalvi & S ’04]
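A quick numeric check makes the difference between the two plans concrete. The sketch below picks hypothetical values for p1 and q1..q3, evaluates both plans, and compares them with a brute-force sum over all possible worlds of the four input tuples.

```python
from itertools import product

p1 = 0.5                 # hypothetical Pr[(Jon, Sea) in Personp]
q = [0.5, 0.5, 0.5]      # hypothetical Pr of Jon's three Gadget purchases

# Plan 1: join first, then project as if the joined tuples were independent.
none_joined = 1.0
for qi in q:
    none_joined *= 1.0 - p1 * qi
plan_join_then_project = 1.0 - none_joined

# Plan 2: project the purchases first, then join with the person tuple.
none_bought = 1.0
for qi in q:
    none_bought *= 1.0 - qi
plan_project_then_join = p1 * (1.0 - none_bought)

# Ground truth: enumerate all worlds (each tuple is present or absent).
truth = 0.0
for person_in, *bought in product([True, False], repeat=1 + len(q)):
    weight = p1 if person_in else 1.0 - p1
    for b, qi in zip(bought, q):
        weight *= qi if b else 1.0 - qi
    if person_in and any(bought):
        truth += weight

print(plan_join_then_project, plan_project_then_join, truth)
```

With these values the first plan overestimates the answer, while the second agrees with the enumeration.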

Page 33: System Aspects of Probabilistic Data Management

Safe Plans

• A plan that correctly computes probabilities is called a safe plan

• Query compilation = finding this condition
• Q: When are projected tuples independent?

Intuition: A plan is safe if it only multiplies independent probabilities.

[Dalvi & S ’04]

Page 34: System Aspects of Probabilistic Data Management

A Definition of Independence

Query q is independent on variable x if q{x ← ‘a’} and q{x ← ‘b’} are independent events for any distinct constants a, b, i.e. no tuple is used by both qa and qb.

If x is shared in all subgoals of q, and q has no self-joins, then q is independent on x.

Example:
q = R(x,y), S(x,y), T(z,x)
q{x ← ‘a’} = R(‘a’,y), S(‘a’,y), T(z,‘a’)
q{x ← ‘b’} = R(‘b’,y), S(‘b’,y), T(z,‘b’)

Safe plans: reduce the problem of evaluating q to evaluating q{x ← a} for some a.

Fundamental judgment for large-scale QP (GB, TB)

[Dalvi & S ’04] [R, Dalvi, S ’06] [R&S ’07a] [R&S ’07b] [R, Letchner, B&S ’08]

Page 35: System Aspects of Probabilistic Data Management

Compiling Safe Plans (Top-Down)

Assuming no self-joins and tuple independence. Example coming…

Compile[Query q] returns a plan:
1. If q is a single subgoal R with no variables, then return R
2. If there exists x s.t. q is independent on x, then return Π-{x}( Compile[ q{x ← FreshConst()} ] )
3. Else if q = q1 q2 so that q1 and q2 do not share variables, then return Join(Compile[q1], Compile[q2])
4. Else return “No Safe Plan”

[Dalvi & S ’04]
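The four steps can be sketched in Python for Boolean conjunctive queries. The query encoding (a list of `(relation, args)` atoms with variables as plain strings), the `Const` class, and the tuple-shaped plans are illustrative assumptions; as on the slide, the sketch assumes no self-joins and tuple independence, and it tests "q is independent on x" by checking that x occurs in every subgoal.

```python
from dataclasses import dataclass
from itertools import count

@dataclass(frozen=True)
class Const:
    name: str

_fresh = count(1)

def atom_vars(atom):
    _rel, args = atom
    return {a for a in args if not isinstance(a, Const)}

def substitute(query, var, const):
    return [(rel, tuple(const if a == var else a for a in args)) for rel, args in query]

def split_components(query):
    """Split q into (q1, q2) sharing no variables, or return None if q is connected."""
    comp, rest = [query[0]], list(query[1:])
    seen = set(atom_vars(query[0]))
    changed = True
    while changed:
        changed = False
        for atom in list(rest):
            if atom_vars(atom) & seen:
                comp.append(atom)
                rest.remove(atom)
                seen |= atom_vars(atom)
                changed = True
    return (comp, rest) if rest else None

def compile_plan(query):
    # Step 1: a single subgoal with no variables is a base relation lookup.
    if len(query) == 1 and not atom_vars(query[0]):
        return query[0][0]
    # Step 2: a variable shared by all subgoals can be projected out.
    shared = set.intersection(*(atom_vars(a) for a in query))
    if shared:
        x = sorted(shared)[0]
        fresh = Const(f"c{next(_fresh)}")
        return ("project_out", x, compile_plan(substitute(query, x, fresh)))
    # Step 3: sub-queries that share no variables are joined.
    parts = split_components(query)
    if parts:
        q1, q2 = parts
        return ("join", compile_plan(q1), compile_plan(q2))
    # Step 4
    raise ValueError("No Safe Plan")

print(compile_plan([("R", ("x",)), ("S", ("x", "y"))]))
```

For R(x), S(x,y) this yields a projection on x over a join of R with a projection on y over S, while R(x), S(x,y), T(y) raises "No Safe Plan".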

Page 36: System Aspects of Probabilistic Data Management

Compiling Safe Plans (Top-Down)

Assuming no self-joins and tuple independence.

Compile[Query q] returns a plan:
1. If q is a single subgoal R with no variables, then return R
2. If there exists x s.t. q is independent on x, then return Π-{x}( Compile[ q{x ← FreshConst()} ] )
3. Else if q = q1 q2 so that q1 and q2 do not share variables, then return Join(Compile[q1], Compile[q2])
4. Else return “No Safe Plan”

Example:

Compile[ R(x), S(x,y) ]
  Compile[ R(‘a’), S(‘a’,y) ]                  (step 2, Π-{x})
    Compile[ R(‘a’) ], Compile[ S(‘a’,y) ]     (step 3, JOIN)
      Compile[ S(‘a’,‘b’) ]                    (step 2, Π-{y})

Resulting plan: Π-{x}( JOIN( R, Π-{y}(S) ) )

A safe plan!

[Dalvi & S ’04]

Page 37: System Aspects of Probabilistic Data Management

Compiling Safe Plans (Top-Down)

Compile[ R(x), S(x,y), T(y) ]: No Safe Plan!

Does our algorithm miss some plans?

Compile[Query q] returns a plan:
1. If q is a single subgoal R with no variables, then return R
2. If there exists x s.t. q is independent on x, then return Π-{x}( Compile[ q{x ← FreshConst()} ] )
3. Else if q = q1 q2 so that q1 and q2 do not share variables, then return Join(Compile[q1], Compile[q2])
4. Else return “No Safe Plan”

Assuming no self-joins and tuple independence.

Page 38: System Aspects of Probabilistic Data Management

Thm: The algorithm is complete

Qbad :- R(x), S(x,y), T(y)
The data complexity of Qbad is #P-complete

Theorem. The following are equivalent:
• Q has PTIME data complexity
• Q admits an extensional plan (and one finds it in PTIME)
• Q does not have Qbad as a subquery

Bottom line: If there is a plan, we find it. If we don’t find a plan, the query is provably hard

NB: the algorithm never looks at the data, so this is query compilation

[Dalvi & S ’04]

Page 39: System Aspects of Probabilistic Data Management

Basic Query Processing Techniques

• SELECT-FROM-WHERE Queries
  – Compiling Safe Queries
  – Unsafe Queries (Sampling)
  – Top-K

• Aggregation Queries + Probabilities
  – Top-K + Measures
  – OLAP Queries
  – HAVING Queries

Page 40: System Aspects of Probabilistic Data Management

Intensional Query Evaluation

Goal: Make relational ops compute a Boolean expression f

Operator       Input tuples            Output tuple
Selection σ    (v, f)                  (v, f)
JOIN           (v1, f1), (v2, f2)      (v1 v2, f1 ˄ f2)
Projection Π   (v, f1), (v, f2), …     (v, f1 ˅ f2 ˅ …)

• Tuples = variables in the expression
• f is a small DNF
• Pr[q] is reduced to Pr[f is SAT]
• NB: f is also known as lineage

Idea: Approximate Pr[f is SAT]

[Fuhr & Roellke ’97, Graedel et al. ’98, Dalvi & S ’04]

Page 41: System Aspects of Probabilistic Data Management

Monte Carlo Simulation

E = X1X2 ˅ X1X3 ˅ X2X3

Naïve:

Cnt = 0
repeat N times
    randomly choose X1, X2, X3 in {0,1}
    if E(X1, X2, X3) = 1 then Cnt = Cnt + 1
P = Cnt / N
return P    /* ≈ Pr(E) */

(0/1)-Estimator Theorem. If $N \ge \frac{1}{\Pr(E)} \cdot \frac{4 \ln(2/\delta)}{\varepsilon^2}$ then $\Pr\left[\,\left|P / \Pr(E) - 1\right| > \varepsilon\,\right] < \delta$.

Good: Works for any E (not just DNF)

[Karp, Luby & Madras ’89]

May be very big (Pr(E) very small)

Bad: Many samples (N) until get a sat assignment

sample

Estimate Pr[E] = 1/6
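The naive estimator above translates almost line for line into Python. In this sketch each variable is true with probability 0.5, matching "randomly choose X1, X2, X3 in {0,1}"; the function names, the seed, and the sample size are arbitrary choices.

```python
import random

def E(x1, x2, x3):
    """E = X1X2 v X1X3 v X2X3"""
    return (x1 and x2) or (x1 and x3) or (x2 and x3)

def naive_monte_carlo(n, rng, p=0.5):
    """Fraction of n random assignments that satisfy E."""
    cnt = 0
    for _ in range(n):
        x1, x2, x3 = (rng.random() < p for _ in range(3))
        if E(x1, x2, x3):
            cnt += 1
    return cnt / n

estimate = naive_monte_carlo(20_000, random.Random(0))
print(estimate)
```

For uniform variables the exact value is 0.5 (at least two of the three variables must be true), so the estimate should land close to that. When Pr(E) is tiny, most samples are wasted, which is the weakness noted on the slide.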

Page 42: System Aspects of Probabilistic Data Management

Monte Carlo Simulation

E = X1X2 ˅ X1X3 ˅ X2X3

Improved:

1. Pick a monomial (randomly) and satisfy it
2. Pick the other variables randomly
3. Count the overlap

Key idea: Estimate the overlap of satisfying assignments. Samples are drawn only from the satisfying assignments of the monomials X1X2, X1X3, X2X3. A sample that lies in 2 of these sets contributes ½. NB: because E is a DNF, every sample still satisfies E.

Luby-Karp Theorem. If $N \ge \frac{4 m \ln(2/\delta)}{\varepsilon^2}$, where m is the number of monomials, then $\Pr\left[\,\left|P / \Pr(E) - 1\right| > \varepsilon\,\right] < \delta$.

Better now! Bottom line: if E comes from an SFW query, this is an efficient technique

[Karp, Luby & Madras ’89]
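The three steps can be sketched as a Karp-Luby style estimator for a DNF with positive literals and independent variables. The representation (monomials as sets of variable names, a `prob` dictionary) and the function name are assumptions for illustration; the monomial is picked with probability proportional to its own probability, which is the standard choice.

```python
import random

def karp_luby(monomials, prob, n_samples, rng):
    """Estimate Pr[E] for E = OR of monomials; each monomial is a set of variables."""
    weights = []
    for m in monomials:
        w = 1.0
        for v in m:
            w *= prob[v]
        weights.append(w)          # Pr[monomial is true]
    total = sum(weights)
    variables = sorted(set().union(*monomials))
    acc = 0.0
    for _ in range(n_samples):
        # 1. Pick a monomial (proportionally to its probability) and satisfy it.
        i = rng.choices(range(len(monomials)), weights=weights)[0]
        # 2. Pick the other variables randomly.
        world = {v: (v in monomials[i]) or (rng.random() < prob[v]) for v in variables}
        # 3. Count the overlap: a sample lying in k monomial sets contributes 1/k.
        k = sum(all(world[v] for v in m) for m in monomials)
        acc += 1.0 / k
    return total * acc / n_samples

monomials = [{"X1", "X2"}, {"X1", "X3"}, {"X2", "X3"}]
prob = {"X1": 0.5, "X2": 0.5, "X3": 0.5}
estimate = karp_luby(monomials, prob, 20_000, random.Random(0))
print(estimate)
```

Every sample satisfies E by construction, so no samples are wasted even when Pr(E) is very small.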

Page 43: System Aspects of Probabilistic Data Management

Basic Query Processing Techniques

• SELECT-FROM-WHERE Queries
  – Compiling Safe Queries
  – Unsafe Queries (Sampling)
  – Top-K

• Aggregation Queries + Probabilities
  – Top-K + Measures
  – OLAP Queries
  – HAVING Queries

Page 44: System Aspects of Probabilistic Data Management

Motivation for Top-K for SFW queries

• LK is fast in theory…

Find the top actor in Pulp Fiction who appeared in two bad movies five years earlier

[Figure: four “confidence intervals”, numbered 1 to 4, on a probability axis from 0.0 to 1.0, for Christopher Walken, Harvey Keitel, Samuel L. Jackson and Bruce Willis. Each interval contains the true probability.]

Naïve: simulate until all intervals are small

Can we do better?

[R, Dalvi & S ’07]

Page 45: System Aspects of Probabilistic Data Management

A Better Method: Multisimulation

• Separate the top-k with few simulations
  – Concentrate on intervals in the top-k
  – Asymptotically, confidence intervals are nested
• Compare against OPT, which “knows” which intervals to simulate

[Figure: four confidence intervals, numbered 1 to 4, on a probability axis from 0.0 to 1.0, for Christopher Walken, Harvey Keitel, Samuel L. Jackson and Bruce Willis]

[R, Dalvi & S ’07]

Page 46: System Aspects of Probabilistic Data Management

46

Key Idea: Critical Region

• The critical region is the interval (kth-highest min, (k+1)st-highest max)
  – Example: k = 2

[Figure: confidence intervals on a 0.0 to 1.0 axis with the critical region marked]

[R, Dalvi & S ’07]
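A small sketch of the critical region computation, assuming each candidate answer carries a confidence interval `(low, high)`. The function names and the example intervals are hypothetical; the sketch treats the top-k as separated once the region is empty, i.e. its lower end is at or above its upper end.

```python
def critical_region(intervals, k):
    """Return (kth-highest lower bound, (k+1)st-highest upper bound)."""
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]

def top_k_separated(intervals, k):
    """True when k intervals lie entirely above all the remaining ones."""
    c, d = critical_region(intervals, k)
    return c >= d

overlapping = [(0.3, 0.9), (0.6, 0.8), (0.1, 0.7), (0.2, 0.4)]
separated = [(0.7, 0.9), (0.6, 0.8), (0.1, 0.5), (0.2, 0.4)]
print(critical_region(overlapping, 2), top_k_separated(overlapping, 2))
print(critical_region(separated, 2), top_k_separated(separated, 2))
```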

Page 47: System Aspects of Probabilistic Data Management

47

Key Idea: Critical Region

• The critical region is the interval (kth-highest min, (k+1)st-highest max)
  – Example: k = 2

[Figure: confidence intervals on a 0.0 to 1.0 axis; the critical region is now empty]

Separated the top 2

[R, Dalvi & S ’07]

Page 48: System Aspects of Probabilistic Data Management

48

Three Simple Rules: Rule 1

• Pick a “double crosser” (an interval that spans the whole critical region)
  – OPT must pick this too

[Figure: confidence intervals on a 0.0 to 1.0 axis]

Page 49: System Aspects of Probabilistic Data Management

49

Three Simple Rules: Rule 2

• If all crossers are lower crossers, or all are upper crossers, pick the maximal one
  – OPT must pick this too

[Figure: confidence intervals on a 0.0 to 1.0 axis]

Page 50: System Aspects of Probabilistic Data Management

50

Three Simple Rules: Rule 3

• Pick an upper and a lower crosser
  – OPT may only pick 1 of these two

[Figure: confidence intervals on a 0.0 to 1.0 axis]

Page 51: System Aspects of Probabilistic Data Management

51

Multisimulation Performance

• Thm: Multisimulation performs at most twice as many simulations as OPT
  – And no deterministic algorithm can do better on every instance.
• Practice: very slow without low-level optimization
  – Still slow with current techniques (slow compared to SQL, not compared to inference).
• Open question!

[R, Dalvi & S ’07]

Page 52: System Aspects of Probabilistic Data Management

Basic Query Processing Outline

• SELECT-FROM-WHERE Queries
  – Compiling Safe Queries
  – Unsafe Queries (Sampling)
  – Top-K

• Aggregation Queries + Probabilities
  – Top-K + Measures
  – OLAP Queries
  – HAVING Queries

Page 53: System Aspects of Probabilistic Data Management

3 Semantics for Top-K + Measures

• The worst speeder? 2 speeders?
• Combine prob + measure
• All 3 semantics:
  1. Create a single score
  2. Return tuples ranked by score
  They differ in the definition of the score.

License Plate   Speed   P
A-123           200     0.2
                50      0.8
B-456           75      0.9
                70      0.1
C-789           74      1

A-123: speed is either 200 or 50

[Soliman et al. ’07] [Zhang & Chomicki ’08]

Page 54: System Aspects of Probabilistic Data Management

Semantic 1: Expectation

• The worst speeder? 2 speeders?
• Expectation
  – Score = expected speed

License Plate   Speed   Conf
A-123           200     0.2
                50      0.8
B-456           75      0.9
                70      0.1
C-789           74      1

License Plate   E[Speed]
A-123           80        (= 200 × 0.2 + 50 × 0.8)
B-456           74.5
C-789           74

Top1 = {A-123}
Top2 = {A-123, B-456}

Linear apx, so fast to compute!
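The expectation scores are a one-line computation per license plate. A minimal sketch, with the table encoded as a dictionary of (speed, probability) alternatives (the encoding and names are illustrative):

```python
readings = {
    "A-123": [(200, 0.2), (50, 0.8)],
    "B-456": [(75, 0.9), (70, 0.1)],
    "C-789": [(74, 1.0)],
}

def expected_speed(alternatives):
    """Score = sum over alternatives of speed * probability."""
    return sum(speed * p for speed, p in alternatives)

scores = {plate: expected_speed(alts) for plate, alts in readings.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```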

Page 55: System Aspects of Probabilistic Data Management

Semantic 2: U-kRanks

• The worst speeder? 2 speeders?
• U-kRank
  – Score(t) = Pr[t at rank k]

License Plate   Speed   Conf
A-123           200     0.2
                50      0.8
B-456           75      0.9
                70      0.1
C-789           74      1

License Plate   Rank 1                 Rank 2
A-123           0.2                    0.0
B-456           0.72 (= 0.8 * 0.9)     0.14
C-789           0.08                   0.496

Top1 = {B-456}
Top2 = {B-456, C-789}

NB: Soliman et al. consider correlations

[Soliman et al. ’07]

Page 56: System Aspects of Probabilistic Data Management

Semantic 3: Global-Top-K

• The worst speeder? 2 speeders?
• Global-Top-K
  – Score(t) = Pr[t in top-k]

License Plate   Speed   Conf
A-123           200     0.2
                50      0.8
B-456           75      0.9
                70      0.1
C-789           74      1

License Plate   Top-1   Top-2
A-123           0.2     0.2
B-456           0.72    0.98
C-789           0.08    0.8

Top1 = {B-456}
Top2 = {B-456, C-789}

[Zhang & Chomicki ’08]
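Both rank-based scores can be computed by brute force over possible worlds. The sketch below assumes the three license plates are independent of one another and that each plate takes exactly one of its listed speeds; the slides do not state this, so the assumption is mine. Under it, the enumeration reproduces the Top-1 column (0.2, 0.72, 0.08) and the 0.98 for B-456 in the top 2; the other entries depend on the assumed model.

```python
from itertools import product

readings = {
    "A-123": [(200, 0.2), (50, 0.8)],
    "B-456": [(75, 0.9), (70, 0.1)],
    "C-789": [(74, 1.0)],
}

def rank_probabilities(readings, k):
    """at_rank[plate][r] = Pr[plate is at rank r+1], for r < k."""
    plates = list(readings)
    at_rank = {plate: [0.0] * k for plate in plates}
    for combo in product(*(readings[plate] for plate in plates)):
        weight = 1.0
        for _, p in combo:
            weight *= p
        # Rank the plates by speed in this world, fastest first.
        order = sorted(zip(plates, combo), key=lambda item: item[1][0], reverse=True)
        for rank, (plate, _) in enumerate(order[:k]):
            at_rank[plate][rank] += weight
    return at_rank

at_rank = rank_probabilities(readings, 2)                    # U-kRanks scores
in_top2 = {plate: sum(r) for plate, r in at_rank.items()}    # Global-Top-K scores, k = 2
print(at_rank)
print(in_top2)
```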

Page 57: System Aspects of Probabilistic Data Management

Comparing the semantics

• Z&C’s three properties for top-k
  – Exact k: If the cardinality of the DB is large enough, then the top-k has exactly k distinct values
  – Faithful: If the probability and the score of t are both higher than those of u, then u in top-k implies t in top-k
  – Stability: Raising the score or probability of a tuple in the top-k will not remove it from the top-k

THM [Z&C ’08]: Global-top-k has these properties.

Expectation also has these properties

[Zhang & Chomicki ’08]

Page 58: System Aspects of Probabilistic Data Management

Basic Query Processing Outline

• SELECT-FROM-WHERE Queries
  – Compiling Safe Queries
  – Unsafe Queries (Sampling)
  – Top-K

• Aggregation Queries + Probabilities
  – Top-K + Measures
  – OLAP Queries
  – HAVING Queries

Page 59: System Aspects of Probabilistic Data Management

Motivation for OLAP

• Customer Relationship Management app
• Data is dirty:
  – Extracted/classified from text (e.g. Color, Brake)
  – Attributes are non-leaf/ambiguous (e.g. EAST)
• Do we need probabilities?

Auto    Loc    Cost   Color          Brake?
F-150   NY     $200   R:1, B:0       0.8
F-150   EAST   $140   R:0.5, B:0.5   1.0
Truck   MA     $500   R:1, B:0       0.9

Sources of uncertainty:
• Is it a brake repair?
• East = NY? East = MA?

[Burdick et al. ’05]

Page 60: System Aspects of Probabilistic Data Management

OLAP Data & Query Model

Auto    Loc    Cost   Color          Brake?
F-150   NY     $200   R:1, B:0       0.8
F-150   EAST   $140   R:0.5, B:0.5   1.0
Truck   MA     $500   R:1, B:0       0.9

[Figure: grid of locations (NY, MA, grouped under EAST) by autos (F-150, RAM, grouped under TRUCKS) showing the cells that tuples T1, T2, T3 may fall in. The size of a region is not significant.]

Query regions:
• “Cost of F-150 brake repairs in NY”
• “Cost of F-150 brake repairs in EAST”

[Burdick et al. ’05]

Page 61: System Aspects of Probabilistic Data Management

3 Semantics for OLAP

Auto    Loc    Cost   Color          Brake?
F-150   NY     $200   R:1, B:0       0.8
F-150   EAST   $140   R:0.5, B:0.5   1.0
Truck   MA     $500   R:1, B:0       0.9

[Figure: grid of locations (NY, MA, grouped under EAST) by autos (F-150, RAM, grouped under TRUCKS) showing the cells that tuples T1, T2, T3 may fall in. The size of a region is not significant.]

Sem 1: None. If a tuple has any uncertainty, ignore it.

Not faithful: color uncertainty breaks the report!

[Burdick et al. ’05]

Page 62: System Aspects of Probabilistic Data Management

3 Semantics for OLAP

Auto    Loc    Cost   Color          Brake?
F-150   NY     $200   R:1, B:0       0.8
F-150   EAST   $140   R:0.5, B:0.5   1.0
Truck   MA     $500   R:1, B:0       0.9

[Figure: grid of locations (NY, MA, grouped under EAST) by autos (F-150, RAM, grouped under TRUCKS) showing the cells that tuples T1, T2, T3 may fall in. The size of a region is not significant.]

Sem 2: Contains. A tuple counts only if it is contained in the query’s region.

Not consistent: NY + MA ≠ EAST, i.e. Blue + Yellow ≠ Green (t2 is not in either).

[Burdick et al. ’05]

Page 63: System Aspects of Probabilistic Data Management

3 Semantics for OLAP

Auto    Loc    Cost   Color          Brake?
F-150   NY     $200   R:1, B:0       0.8
F-150   EAST   $140   R:0.5, B:0.5   1.0
Truck   MA     $500   R:1, B:0       0.9

[Figure: grid of locations (NY, MA, grouped under EAST) by autos (F-150, RAM, grouped under TRUCKS) showing the cells that tuples T1, T2, T3 may fall in. The size of a region is not significant.]

Sem 3: Overlaps. A tuple has a probability in each region.

Motivation for the pDB approach. Consistent for SUM.

[Burdick et al. ’05]

Page 64: System Aspects of Probabilistic Data Management

OLAP Algorithms

• Answer semantics: expectations
  – SUM: $E[\mathrm{Sum}(A)] = \sum_{t} t[A] \cdot \Pr[t \in Q]$, where $\Pr[t \in Q]$ is the probability that tuple t contributes to Q
  – AVG: $E[\mathrm{AVG}(A)] \approx \frac{E[\mathrm{Sum}(A)]}{E[\mathrm{Count}(A)]}$, a good approximation when COUNT is big [Jayram et al. ’07]

Faithful, consistent and efficient!

Important, well-studied problem: I/O optimizations, constraints [Burdick et al. ’06, ’07]

Difficult to implement!

[Burdick et al. ’05]
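The two expectation formulas can be sketched as follows. The costs come from the OLAP example table, but the contribution probabilities Pr[t in Q] are hypothetical: the slides do not say how the non-leaf value "Truck" is allocated, so the 0.45 for T3 (0.9 for the brake attribute times an assumed 0.5 chance of being an F-150) is an assumption.

```python
# (cost, Pr[tuple contributes to the query "F-150 brake repairs in EAST"])
tuples = [
    (200.0, 0.8),    # T1: F-150, NY, Brake? = 0.8
    (140.0, 1.0),    # T2: F-150, EAST, Brake? = 1.0
    (500.0, 0.45),   # T3: Truck, MA; 0.9 * 0.5, the 0.5 is an assumed allocation
]

def expected_sum(tuples):
    """E[Sum(A)] = sum over t of t[A] * Pr[t in Q], by linearity of expectation."""
    return sum(cost * p for cost, p in tuples)

def expected_count(tuples):
    return sum(p for _, p in tuples)

def approx_expected_avg(tuples):
    """E[AVG(A)] approximated by E[Sum(A)] / E[Count(A)]."""
    return expected_sum(tuples) / expected_count(tuples)

print(expected_sum(tuples), expected_count(tuples), approx_expected_avg(tuples))
```

The ratio is only an approximation of the true expected average; per the slide it is good when COUNT is big, which is not the case for a three-tuple toy example.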

Page 65: System Aspects of Probabilistic Data Management

Motivation for HAVING

Profit

Item      Forecaster   Amount   P
Widget    Alice        $-99k    0.99
          Bob          $100M    0.01
Whatsit   Alice        $1M      1

Expectation style [OLAP style]:

SELECT SUM(Amount)
FROM Profit
WHERE item=‘Widget’

Ans: -99k × 0.99 + 100M × 0.01 ≈ 900K

HAVING style:

SELECT item
FROM Profit
WHERE item=‘Widget’
GROUP BY item
HAVING SUM(Amount) > 0

Ans: 0.01

[R&S ’07]
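The contrast between the two answers is easy to reproduce. This sketch assumes that the two Widget forecasts are mutually exclusive alternatives (exactly one of them is right), which is how the probabilities 0.99 and 0.01 read; the variable names are illustrative.

```python
# Disjoint forecasts for Widget: (amount in dollars, probability).
widget = [(-99_000, 0.99), (100_000_000, 0.01)]

# Expectation style: E[SUM(Amount)]
expected_sum = sum(amount * p for amount, p in widget)

# HAVING style: Pr[SUM(Amount) > 0]
prob_positive = sum(p for amount, p in widget if amount > 0)

print(expected_sum, prob_positive)
```

The expected profit is about 900K even though a positive profit has only a 1% chance, which is why the HAVING-style answer can be the more informative one.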

Page 66: System Aspects of Probabilistic Data Management

Summary of HAVING results

• Safety uses the independence test
  – Twist: safety depends on the aggregate
  – If the plan is safe, then so is COUNT, MIN, MAX
    • Not true for SUM and AVG!
• Theoretical algorithms
  – Require innovation to make them efficient in SQL
    • Native operators, sort-based algorithms, etc.

[R&S ’07]

Page 67: System Aspects of Probabilistic Data Management

Top-K & Aggregation Summary

• Diverse semantics driven by applications
  – Top-K: U-kRanks and Global-top-k
  – OLAP & HAVING
  – Skylines too! [Pei et al. ’08]
• Lots of interest in the community
  – Conjecture: Aggregation and top-k are more important for probabilistic databases than for RDBMSs
    • A tuple carries less information
    • Many probabilistic tuples are not as valuable as 1 correct tuple

Page 68: System Aspects of Probabilistic Data Management

Take-home messages of Day 1

• pDBs are used in diverse application domains
  – RFID, Information Extraction, Sentiment Analysis
  – Value: higher recall, without loss of precision
• The fundamentals of QP in pDBs
  – Compile a safe query to SQL
  – Evaluate an unsafe plan (Monte Carlo)
  – Top-K semantics for pDBs
  – OLAP on pDBs

Page 69: System Aspects of Probabilistic Data Management

Advertisement for Day Two

• Applications
  – RFID with movies, smoothed data
• Advanced representations
  – Lineage, Markov models, graphical models, world sets, continuous functions
• Advanced QP
  – Lazy evaluation in Trio, probabilistic automata, probabilistic inference, sampling techniques

And more!

All sales final. Offer not valid in Alaska, or where prohibited by law.

Page 70: System Aspects of Probabilistic Data Management

Thank you
