System Aspects of Probabilistic Data Management


Magdalena Balazinska, Christopher Ré and Dan Suciu

University of Washington

One slide overview of motivation

• Data are uncertain in many applications
  – Business: deduplication, information extraction
  – Data from the physical world: RFID


Probabilistic DBs (pDBs) manage uncertainty

Integrate, Query, and Build Applications on uncertain data

Value: Higher recall, without loss of precision

DB Niche: Community that knows scale

Overview of tutorial

• Part I: Basic Query Processing (Today)
  – Two Scenarios for pDBs
  – A Basic Query & Data Model
  – Basic Query Processing Techniques

• Highlights:
  1. The intuition behind safe plans and how to compile them
  2. Process any SELECT-FROM-WHERE (SFW) query
  3. Process top-k queries
  4. Aggregation: Top-k + measures, OLAP, HAVING


• Part II: Advanced Techniques (Tomorrow)
  – Correlations
  – Advanced Representation & QP
  – Discussion and Open Problems

• Highlights:
  1. Lineage and View Processing (GBs of data)
  2. Events on Correlated Streams (GBs of Streams)
  3. Sophisticated Factor Evaluation (Highly Correlated)
  4. Continuous DBs

Hasn’t this been solved? (an analogy to keep in mind)

               AI                       Databases
Deterministic  Theorem prover           Query processing
Probabilistic  Probabilistic inference  [this talk]

Impact: Fortune 500 companies rely on DBs, but how many have theorem provers?

SCALE

Ancillary Material

• pDBs have a long history
  – Cavallo&Pitarelli ’87
  – ProbView [Lakshmanan et al’97]
  – Many active projects today: Mystiq, Lahar, Trio, MayBMS, Maryland, Orion, MCDB, Wisconsin, IBM, BayesStore, UMass, Waterloo, SFU and more

• Many important topics omitted
  – Query languages
  – XML



Example 1: Querying RFID

[Figure: diagram with labels A, B, C, D, E]

• Apps: UbiComp, Diary, Social Applications, ...
• In general, event queries [Cayuga, Sase]

Joe entered office 422 at t=8

Query: “Alert when Joe enters 422”

i.e. Joe is outside 422, then inside 422

[R,Letchner,B&S’07] [http://rfid.cs.washington.edu]

Challenges: Tracking Joe’s Location


6th Floor in CS building

Blue ring is Joe’s Location

Antennas

[RFID Ecosystem @ UW]


Challenges: Tracking Joe’s Location


Blue ring is Joe’s Location

Antennas

Two Problems:
  1. Missed Readings
  2. Granularity Mismatch

• Model Based View (Probabilistic)
  – [Deshpande et al 04, Kanagal & Deshpande’08]

[Re et al ‘08, Kanagal & Deshpande’08]

Probabilities via particle filter


Each orange particle is a guess of Joe’s location

Blue ring is ground truth

Antennas

Particles guess many locations per timestep, so data are uncertain


[Doucet et al’01]

Probabilities via particle filter

[R et al ’08] [Kanagal & Deshpande’08]

Tag  t  Loc    P
Joe  7  422    0.4
        Hall3  0.4
        Hall4  0.2
Joe  8  422    0.6
        Hall3  0.2
        Hall4  0.2
Sue  7  …      …

“Joe entered 422 at t=8 with probability 0.36”
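The 0.36 on this slide can be reproduced from the table as Pr[not in 422 at t=7] × Pr[in 422 at t=8]. The sketch below does that arithmetic; it assumes the two timesteps are independent (the Markov correlations deferred to Day 2 would change this), and the names `at` and `prob_entered` are illustrative only.

```python
# Marginals from the table above: Pr[Joe is at loc at time t].
at = {
    7: {"422": 0.4, "Hall3": 0.4, "Hall4": 0.2},
    8: {"422": 0.6, "Hall3": 0.2, "Hall4": 0.2},
}

def prob_entered(marginals, room, t):
    """Pr[outside room at t-1 and inside room at t], ASSUMING independent timesteps."""
    outside_before = 1.0 - marginals[t - 1][room]
    inside_now = marginals[t][room]
    return outside_before * inside_now

p = prob_entered(at, "422", 8)
print(round(p, 2))
```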

Shameless Ad: Markov Correlations on Day 2

Query Particle Filter output via At, a model based view

At(tag,loc)


IMDB

IMDB:

• Lots of data !

• Well maintained and clean

• But no reviews!

Example 2: Alice Looks for Movies

I’d like to know which movies are really good…

[R,Dalvi&S’07]

IMDB

On the web there are lots of reviews…

Which movie is the review about?

…is the review positive or negative?

…should I trust the reviewer?

Alice needs:
• Information Extraction
• Fuzzy joins
• Sentiment analysis
• Social networks

Forced to deal with uncertainty

Find actors in Pulp Fiction who appeared in two bad movies five years earlier

Find years when ‘Anthony Hopkins’ starred in a good movie

IMDB

A probabilistic database can help Alice store and query her uncertain data

Alice’s workflow:
  1. Download reviews
  2. Information Extraction
  3. Fuzzy Joins
  4. Query pDB

Workflow stages: IE → FJ → pDB

Alice needs Information Extraction

Input text: ...52 A Goregaon West Mumbai ...

Addressp
ID  House-No  Street    City         P
1   52        Goregaon  West Mumbai  0.1
1   52-A      Goregaon  West Mumbai  0.4
…
2   …         …         …            …
2   …

Here probabilities are meaningful

[Gupta&Sarawagi’2006]

Queries on IE

SELECT DISTINCT x.name
FROM Person x, Addressp y
WHERE x.ID = y.ID and y.city = 'West Mumbai'

Find people living in ‘West Mumbai’

By   P
Joe  0.4

If we kept only the most likely extraction, the query would return the empty set

Queries on IE



Find people living in ‘West Mumbai’

Today: keep only the most likely extraction: low recall.
pDBs keep all extractions: higher recall.

SELECT DISTINCT x.name, u.name
FROM Person x, Addressp y, Person u, Addressp v
WHERE x.ID = y.ID and y.city = v.city and u.ID = v.ID

Find people of the same age, living in the same city

Alice needs Fuzzy Joins

IMDB
Title               Year
Twelve Monkeys      1995
Monkey Love 1997    1997
Monkey Love 1935    1935
Monkey Love Planet  2005

Reviews
Review       By   Rating
12 Monkeys   Joe  4
Monkey Boy   Jim  2
Monkey Love  Joe  2

Titles don’t match

Result of a Fuzzy Join

TitleReviewMatchp
Movie               Review       P
Twelve Monkeys      12 Monkeys   0.7
Monkey Love 1997    12 Monkeys   0.45
Monkey Love 1935    Monkey Love  0.82
Monkey Love 1935    Monkey Boy   0.68
Monkey Love Planet  Monkey Love  0.8

Higher scores, more likely to match

[Gravano et al’01,Arasu’06]

Queries over Fuzzy Joins

IMDB
Title           Year
Twelve Monkeys  1995
Monkey Love 97  1997
Monkey Love 35  1935
Monkey Love PL  2005

Reviews
Review       By   Rating
12 Monkeys   Joe  4
Monkey Boy   Jim  2
Monkey Love  Joe  2

TitleReviewMatchp
Movie               Review       P
Twelve Monkeys      12 Monkeys   0.7
Monkey Love 97      12 Monkeys   0.45
Monkey Love 35      Monkey Love  0.82
Monkey Love 35      Monkey Boy   0.68
Monkey Love Planet  Monkey Love  0.8

Who reviewed movies made in 1935?

SELECT DISTINCT z.By
FROM IMDB x, TitleReviewMatchp y, Amazon z
WHERE x.title = y.title and x.year = 1935 and y.review = z.review

Answer (ranked!):
By     P
Joe    0.73
Fred   0.68
Jim    0.43
. . .  0.12

Find movies reviewed by Jim and Joe

SELECT DISTINCT x.Title
FROM IMDB x, TitleReviewMatchp y1, Amazon z1, TitleReviewMatchp y2, Amazon z2
WHERE . . . z1.By = 'Joe' . . . z2.By = 'Jim' . . .

Answer:
Title       P
Gone with…  0.73
Amadeus     0.68
. . .       0.43

Application Summary

• pDBs can manage the outputs of great techniques
• Value over a standard RDBMS: recall
• To keep precision high, need ranking (by probability)

Major Theme: Get high quality efficiently!

RFID: Particle Filters, HMMs

Alice needs:
• Fuzzy Joins
• IE
• Sentiment Analysis


• Part I: Basic Query Processing
  – Two Scenarios for pDBs
  – A Basic Query & Data Model
  – Basic Query Processing Techniques

Simple Probabilistic DB (pDB)

HasObjectp
Object    Time  Person  P
Laptop77  9:07  John    0.62
                Jim     0.34
Book302   9:18  Mary    0.45
                John    0.33
                Fred    0.11

Keys: Object, Time. Non-keys: Person. Probability: P.

What does it mean?

[Barbara et al. ‘92]

Possible Worlds Semantics


HasObjectp (pDB)
Object    Time  Person  P
Laptop77  9:07  John    p1
                Jim     p2
Book302   9:18  Mary    p3
                John    p4
                Fred    p5

Possible worlds (instances of HasObject), as sets of (Object, Time, Person) tuples:
  {(Laptop77, 9:07, John), (Book302, 9:18, Mary)}    probability p1p3
  {(Laptop77, 9:07, John), (Book302, 9:18, John)}    probability p1p4
  {(Laptop77, 9:07, John), (Book302, 9:18, Fred)}
  {(Laptop77, 9:07, Jim), (Book302, 9:18, Mary)}
  {(Laptop77, 9:07, Jim), (Book302, 9:18, John)}
  {(Laptop77, 9:07, Jim), (Book302, 9:18, Fred)}
  {(Laptop77, 9:07, John)}                           probability p1(1 - p3 - p4 - p5)
  {(Laptop77, 9:07, Jim)}
  {(Book302, 9:18, Mary)}
  {(Book302, 9:18, John)}
  {(Book302, 9:18, Fred)}
  {}

Distribution over possible worlds

[Fagin,Halpern,Megiddo’90]
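As a rough illustration of the possible-worlds semantics, the sketch below enumerates every world of the HasObjectp table using the concrete probabilities from the earlier slide. It assumes the two (Object, Time) keys are independent and that the alternatives under one key are mutually exclusive, with the leftover mass meaning "no tuple"; the function name `possible_worlds` is illustrative only.

```python
from itertools import product

# One block per key (Object, Time); alternatives inside a block are disjoint.
has_object = {
    ("Laptop77", "9:07"): {"John": 0.62, "Jim": 0.34},
    ("Book302", "9:18"): {"Mary": 0.45, "John": 0.33, "Fred": 0.11},
}

def possible_worlds(table):
    """Yield (world, probability); a world maps each present key to one person."""
    keys = list(table)
    choices = []
    for key in keys:
        alts = list(table[key].items())
        absent = 1.0 - sum(p for _, p in alts)  # leftover mass: key has no tuple
        choices.append(alts + [(None, absent)])
    for combo in product(*choices):
        world = {k: person for k, (person, _) in zip(keys, combo) if person is not None}
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

worlds = list(possible_worlds(has_object))
for world, prob in worlds:
    print(world, round(prob, 4))
```

Enumeration is exponential in the number of keys, which is why the rest of the tutorial looks for ways to compute query probabilities without materializing the worlds.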

Two Approaches to Queries

• Standard queries, probabilistic answers (this tutorial)
  – Query: “find all movies with rating > 4”
  – Answers: list of tuples with probabilities

• Queries with explicit probabilities (MayBMS [Koch ’08])
  – Query: find all Movie-review matches with probability in [0.3, 0.8]
  – Answer: …

Possible Worlds Query Semantics

HasObjectp (pDB)
Object    Time  Person  P
Laptop77  9:07  John    p1
                Jim     p2
Book302   9:18  Mary    p3
                John    p4
                Fred    p5

Query: “John has laptop77 and doesn’t have book302”

Worlds (instances of HasObject) that satisfy the query:
  {(Laptop77, 9:07, John), (Book302, 9:18, Mary)}    p1p3
  {(Laptop77, 9:07, John), (Book302, 9:18, Fred)}    p1p5
  {(Laptop77, 9:07, John)}                           p1(1 - p3 - p4 - p5)

Sum = p1(1 - p4)

QP Goal: Compute cleverly, directly
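The identity p1p3 + p1p5 + p1(1 - p3 - p4 - p5) = p1(1 - p4) can be checked numerically. The sketch below sums over worlds and compares against the direct formula, using the concrete probabilities from the HasObjectp slide; it assumes independence across keys and disjoint alternatives within a key, and the helper names are illustrative.

```python
from itertools import product

laptop = {"John": 0.62, "Jim": 0.34}               # alternatives for (Laptop77, 9:07)
book = {"Mary": 0.45, "John": 0.33, "Fred": 0.11}  # alternatives for (Book302, 9:18)

def with_absent(block):
    """Add the 'no tuple' outcome carrying the leftover probability mass."""
    return list(block.items()) + [(None, 1.0 - sum(block.values()))]

def query_probability(predicate):
    """Sum the probabilities of the worlds in which predicate holds."""
    total = 0.0
    for (lp, p_l), (bp, p_b) in product(with_absent(laptop), with_absent(book)):
        if predicate(lp, bp):
            total += p_l * p_b
    return total

# "John has laptop77 and doesn't have book302"
by_worlds = query_probability(lambda lp, bp: lp == "John" and bp != "John")
direct = laptop["John"] * (1.0 - book["John"])  # p1 * (1 - p4)
print(round(by_worlds, 4), round(direct, 4))
```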

Overview of Part I

• Part I: Basic Query Processing (TODAY)
  – Motivating Applications
  – A Simple Data Model (Representation)
  – Basic Query Processing Techniques

Basic Query Processing Outline

• SELECT-FROM-WHERE Queries (natural start, workhorse RDBMS queries)
  – Compiling Safe Queries
  – Unsafe Queries (Sampling)
  – Top-K

• Aggregation Queries + Probabilities (we believe these are very important for applications)
  – Top-K + Measures
  – OLAP Queries
  – HAVING Queries

Extensional Query Evaluation

Goal: Make relational ops compute probabilities

σ (selection):   (v, p)  →  (v, p)
JOIN:            (v1, p1), (v2, p2)  →  (v1 v2, p1 p2)
Π (projection):  (v, p1), (v, p2), …  →  (v, 1-(1-p1)(1-p2)…)
                 Removes duplicates; the result is “not all are false”

Why? It’s SQL-scale and SQL-fast

[Fuhr&Roellke’97, Dalvi & S ‘04]

Extensional Plan to SQL

HomeOffice
Person  Loc  P
Bob     SEA  p1
Joe     NYC  p2
Jon     SEA  p3
Jeff    SEA  p4

SELECT DISTINCT loc
FROM HomeOffice

Plan: Π-{person} (NB: removes the person attribute)

Loc  P
SEA  1-(1-p1)(1-p3)(1-p4)
NYC  p2

Translation:

SELECT loc, 1 - PRODUCT(1-p) AS p
FROM HomeOffice
GROUP BY loc

Important point: Extensional evaluation is SQL, so it is SQL-fast

[Fuhr&Roellke’97, Dalvi & S ‘04]

So pDBs are just SQL, but…
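The translated query relies on a PRODUCT aggregate, which SQLite does not ship as a built-in. As a hedged sketch, the aggregate 1 - PRODUCT(1-p) can be registered as a user-defined aggregate through Python's `sqlite3` module; the aggregate name `por` and the numeric probabilities standing in for p1..p4 are assumptions made for illustration.

```python
import sqlite3

class OneMinusProduct:
    """Aggregate computing 1 - PRODUCT(1 - p) over independent tuple probabilities."""
    def __init__(self):
        self.all_false = 1.0
    def step(self, p):
        self.all_false *= 1.0 - p
    def finalize(self):
        return 1.0 - self.all_false

conn = sqlite3.connect(":memory:")
conn.create_aggregate("por", 1, OneMinusProduct)  # hypothetical aggregate name
conn.execute("CREATE TABLE HomeOffice (person TEXT, loc TEXT, p REAL)")
conn.executemany(
    "INSERT INTO HomeOffice VALUES (?, ?, ?)",
    # Assumed values for p1..p4; the slide leaves them symbolic.
    [("Bob", "SEA", 0.5), ("Joe", "NYC", 0.3), ("Jon", "SEA", 0.2), ("Jeff", "SEA", 0.1)],
)
rows = dict(conn.execute("SELECT loc, por(p) FROM HomeOffice GROUP BY loc"))
print(rows)
```

With these values SEA evaluates to 1 - 0.5 * 0.8 * 0.9 and NYC to 0.3, matching the formula on the slide.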

Personp:    (Jon, Sea) with probability p1
Purchasep:  (Jon, …) with probabilities q1, q2, q3

SELECT DISTINCT x.City
FROM Personp x, Purchasep y
WHERE x.Name = y.Cust and y.Product = 'Gadget'

Plan 1: JOIN first, then Π
  JOIN:  (Jon, Sea) p1q1;  (Jon, Sea) p1q2;  (Jon, Sea) p1q3
  Π:     Sea  1-(1-p1q1)(1-p1q2)(1-p1q3)
  Wrong! The projected tuples are not independent (they all share p1).

Plan 2: Π on Purchasep first, then JOIN
  Π:     Jon  1-(1-q1)(1-q2)(1-q3)
  JOIN:  Sea  p1(1-(1-q1)(1-q2)(1-q3))
  Correct

Depends on plan !!!

[Dalvi&S’04]
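To see the difference between the two plans numerically, the sketch below evaluates both and compares them with the ground truth obtained by enumerating all possible worlds. The values of p1 and q1..q3 are assumed for illustration; the slide keeps them symbolic.

```python
from itertools import product

p1 = 0.5              # Pr[Person(Jon, Sea)], assumed value
qs = [0.5, 0.5, 0.5]  # Pr of Jon's three Gadget purchases, assumed values

def independent_or(probs):
    """1 - prod(1 - p): correct only when the inputs are independent events."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none

# Plan 1: join first, then project. The joined tuples all share p1.
join_then_project = independent_or([p1 * q for q in qs])

# Plan 2: project Purchase onto Cust first, then join with Person.
project_then_join = p1 * independent_or(qs)

# Ground truth by enumerating all 2^4 possible worlds.
truth = 0.0
for person, *purchases in product([0, 1], repeat=4):
    w = p1 if person else 1 - p1
    for present, q in zip(purchases, qs):
        w *= q if present else 1 - q
    if person and any(purchases):
        truth += w

print(join_then_project, project_then_join, truth)
```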

Safe Plans

• A plan that correctly computes probabilities is called a safe plan

• Query Compilation = finding this condition
• Q: When are projected tuples independent?

Intuition: A plan is safe if it only multiplies independent probabilities.



A Definition of Independence

Query q is independent on variable x if q{x ← ‘a’} and q{x ← ‘b’} are independent events for any distinct constants a, b

Intuition: no tuple is used by both q{x ← ‘a’} and q{x ← ‘b’}.

Fundamental judgment for large scale QP (GB, TB)

[Dalvi&S’04][R,Dalvi,S’06][R&S’07a][R&S’07b][R,Letchner,B&S’08]

Safe Plans: reduce the problem of evaluating q to evaluating q{x ← a} for some a.

If x occurs in all subgoals of q, and q has no self-joins, then q is independent on x.

q = R(x,y), S(x,y), T(z,x)

q{x ← ‘a’} = R(‘a’,y), S(‘a’,y), T(z,‘a’)

q{x ← ‘b’} = R(‘b’,y), S(‘b’,y), T(z,‘b’)

Compiling Safe Plans (Top-Down)

Assuming no self-joins, tuple independence. Example coming…

Compile[Query q] returns a plan
  1. If q is a single subgoal R with no variables then return R
  2. Else if there exists x s.t. q is independent on x then
       return Π-{x}( Compile[ q{x ← FreshConst()} ] )
  3. Else if q = q1 q2 such that q1 and q2 do not share variables then
       return Join(Compile[q1], Compile[q2])
  4. Else return “No Safe Plan”
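The four rules can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: a query is a list of (relation, terms) subgoals, plain strings are variables, independence on x is tested with the sufficient condition "x occurs in every subgoal" (valid without self-joins), and the plan encoding with `project_out` and `join` tags is invented here.

```python
from itertools import count

class Const:
    """A constant term; any plain str term is treated as a variable."""
    def __init__(self, name):
        self.name = name

_fresh = count()

def variables(q):
    return {t for _, terms in q for t in terms if isinstance(t, str)}

def substitute(q, x, c):
    return [(rel, tuple(c if t == x else t for t in terms)) for rel, terms in q]

def components(q):
    """Group subgoals so that different groups share no variables."""
    groups = []
    for goal in q:
        goal_vars = variables([goal])
        linked = [g for g in groups if variables(g) & goal_vars]
        rest = [g for g in groups if not (variables(g) & goal_vars)]
        groups = rest + [[sg for g in linked for sg in g] + [goal]]
    return groups

def compile_plan(q):
    """Return a plan as nested tuples, or None when no safe plan is found."""
    if len(q) == 1 and not variables(q):
        return q[0][0]                                   # rule 1
    for x in sorted(variables(q)):                       # rule 2
        if all(x in terms for _, terms in q):
            sub = compile_plan(substitute(q, x, Const(f"c{next(_fresh)}")))
            return None if sub is None else ("project_out", x, sub)
    parts = components(q)                                # rule 3
    if len(parts) > 1:
        plans = [compile_plan(part) for part in parts]
        return None if None in plans else ("join", *plans)
    return None                                          # rule 4

safe = compile_plan([("R", ("x",)), ("S", ("x", "y"))])
unsafe = compile_plan([("R", ("x",)), ("S", ("x", "y")), ("T", ("y",))])
print(safe, unsafe)
```

Note that the compiler inspects only the query, never the data.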


Compiling Safe Plans (Top-Down): example

Compile[ R(x), S(x,y) ]
  → Π-{x}:  Compile[ R(‘a’), S(‘a’,y) ]
  → JOIN:   Compile[ R(‘a’) ] and Compile[ S(‘a’,y) ]
  → Π-{y}:  Compile[ S(‘a’,‘b’) ]

A safe plan!  Π-{x}( JOIN( R, Π-{y}(S) ) )

[Dalvi&S’04] Assuming no self-joins, tuple indep.

Compiling Safe Plans (Top-Down)

Compile(R(x),S(x,y),T(y)) No Safe Plan!

Does our algorithm miss some plans?


Thm: The algorithm is Complete

Qbad :- R(x), S(x,y), T(y)
Data complexity of Qbad is #P-complete

Theorem. The following are equivalent:
• Q has PTIME data complexity
• Q admits an extensional plan (and one finds it in PTIME)
• Q does not have Qbad as a subquery

Bottom line: If there is a plan, we find it. If we don’t find a plan, it’s provably hard

NB: we never looked at the data, so this is query compilation

Basic Query Processing Techniques




Intensional Query Evaluation

Goal: Make relational ops compute a Boolean expression f

σ (selection):   (v, f)  →  (v, f)
JOIN:            (v1, f1), (v2, f2)  →  (v1 v2, f1 ∧ f2)
Π (projection):  (v, f1), (v, f2), …  →  (v, f1 ∨ f2 ∨ …)

Tuples = variables in expression

f is a small DNF

Pr[q] reduced to Pr[f is SAT].

NB: f is also known as lineage

Idea: Approximate Pr[f is SAT]

[Fuhr&Roellke’97, Graedel et al. ’98, Dalvi & S ‘04]

Monte Carlo Simulation

E = X1X2 ∨ X1X3 ∨ X2X3

Naïve:

  Set Cnt = 0
  repeat N times
      randomly choose X1, X2, X3 in {0,1}
      if E(X1, X2, X3) = 1 then Cnt = Cnt+1
  P = Cnt/N
  return P   /* ≈ Pr(E) */

(0/1)-Estimator Theorem.

If $N \ge \frac{1}{\Pr(E)} \cdot \frac{4 \ln(2/\delta)}{\varepsilon^2}$ then $\Pr\left(\left|P/\Pr[E] - 1\right| > \varepsilon\right) < \delta$

The factor 1/Pr(E) may be very big (when Pr(E) is very small)

Good: Works for any E (not just DNF)

Bad: Many samples (N) until we get a satisfying assignment

[Figure: regions X1X2, X1X3, X2X3 with a sample; estimate Pr[E] = 1/6]

[Karp,Luby&Madras’89]
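The naïve simulation can be written out directly. The sketch below assumes each variable is an independent coin with a given probability (0.5 each, matching "randomly choose in {0,1}"); the function names and the fixed seed are illustrative choices.

```python
import random

def e(x1, x2, x3):
    """E = X1X2 v X1X3 v X2X3 from the slide."""
    return (x1 and x2) or (x1 and x3) or (x2 and x3)

def naive_monte_carlo(formula, probs, n, rng):
    """Estimate Pr[formula] by sampling every variable independently."""
    cnt = 0
    for _ in range(n):
        sample = [rng.random() < p for p in probs]
        if formula(*sample):
            cnt += 1
    return cnt / n

estimate = naive_monte_carlo(e, [0.5, 0.5, 0.5], 20000, random.Random(0))
print(estimate)
```

For uniform bits, 4 of the 8 assignments satisfy E, so the estimate should land near 0.5. When the variable probabilities are tiny, almost no sample satisfies E, which is the weakness noted on the slide.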

Monte Carlo Simulation

E = X1X2 ∨ X1X3 ∨ X2X3

Improved:

Luby-Karp Theorem.

If $N \ge \frac{4 m \ln(2/\delta)}{\varepsilon^2}$, where m is the number of monomials in E, then $\Pr\left(\left|P/\Pr[E] - 1\right| > \varepsilon\right) < \delta$

[Karp,Luby&Madras’89]

Key idea: Estimate the overlap of satisfying assignments

[Figure: regions X1X2, X1X3, X2X3; samples are drawn from inside the regions]

1. Pick a monomial (randomly) and satisfy it
2. Pick the other variables randomly
3. Count the overlap

A sample that lies in 2 sets contributes ½. NB: Because E is a DNF, every sample still satisfies E.

Better now! Bottom line: if E comes from an SFW query, this is an efficient technique
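The three steps above can be sketched as follows. This is one common variant of the Karp-Luby estimator, offered as an assumption about what the slide intends: monomials are picked with probability proportional to Pr(C_i), each sample contributes 1/(number of monomials it satisfies), and the average is scaled by S = Σ Pr(C_i). Monomials are sets of variable indices with positive literals only.

```python
import random

def prob_monomial(mono, probs):
    pr = 1.0
    for v in mono:
        pr *= probs[v]
    return pr

def karp_luby(dnf, probs, n, rng):
    """Estimate Pr[C_1 v ... v C_m] for independent variables."""
    weights = [prob_monomial(m, probs) for m in dnf]
    total = sum(weights)                      # S = sum_i Pr(C_i) >= Pr(E)
    acc = 0.0
    for _ in range(n):
        mono = rng.choices(dnf, weights=weights)[0]           # 1. pick a monomial
        world = [v in mono or rng.random() < probs[v]         # 2. satisfy it, rest random
                 for v in range(len(probs))]
        covered = sum(all(world[v] for v in m) for m in dnf)  # 3. count the overlap
        acc += 1.0 / covered                  # in 2 sets -> contributes 1/2
    return total * acc / n

dnf = [{0, 1}, {0, 2}, {1, 2}]                # X1X2 v X1X3 v X2X3
estimate = karp_luby(dnf, [0.5, 0.5, 0.5], 20000, random.Random(0))
print(estimate)
```

Every sample satisfies at least the chosen monomial, so `covered` is never zero and no sample is wasted.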

Basic Query Processing Techniques




Motivation for Top-K for SFW queries

• LK is fast in theory…


Find the top actor in Pulp Fiction who appeared in two bad movies five years earlier

[Figure: confidence intervals on a 0.0–1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson, and Bruce Willis, numbered 1–4]

“Confidence intervals” contain the true probability

Naïve: simulate until all intervals are small

Can we do better?


A Better Method: Multisimulation

• Separate Top-K with few simulations
  – Concentrate on intervals in Top-K
  – Asymptotically, confidence intervals are nested

• Compare against OPT: “knows” intervals to simulate

[Figure: confidence intervals on a 0.0–1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson, and Bruce Willis, numbered 1–4]

Key Idea: Critical Region

• The critical region is the interval
  – (kth-highest min, (k+1)st-highest max)
  – For k = 2

[Figure: confidence intervals on a 0.0–1.0 axis with the critical region marked]

Separated the top 2
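The definition of the critical region translates into a short computation. In the sketch below the interval endpoints are made-up values, and the test "the top k are separated once the critical region is empty" is stated as an assumption about how the region is used.

```python
def critical_region(intervals, k):
    """Return (c, d): c = k-th highest lower bound, d = (k+1)-st highest upper bound."""
    lows = sorted((lo for lo, _ in intervals.values()), reverse=True)
    highs = sorted((hi for _, hi in intervals.values()), reverse=True)
    return lows[k - 1], highs[k]

def top_k_separated(intervals, k):
    """Assumed stopping test: the critical region is empty."""
    c, d = critical_region(intervals, k)
    return c >= d

# Hypothetical confidence intervals (low, high) for four candidate answers.
overlapping = {"A": (0.7, 0.9), "B": (0.3, 0.8), "C": (0.2, 0.4), "D": (0.1, 0.3)}
separated = {"A": (0.7, 0.9), "B": (0.5, 0.8), "C": (0.2, 0.4), "D": (0.1, 0.3)}

print(critical_region(overlapping, 2), top_k_separated(overlapping, 2))
print(critical_region(separated, 2), top_k_separated(separated, 2))
```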

Three Simple Rules: Rule 1

• Pick a “Double Crosser”
  – OPT must pick this too

Three Simple Rules: Rule 2

• If all are lower crossers (or all are upper crossers), then pick the maximal one
  – OPT must pick this too

Three Simple Rules: Rule 3

• Pick an upper and a lower crosser
  – OPT may only pick 1 of these two

Multisimulation Performance

• Thm: Multisimulation performs at most twice as many simulations as OPT
  – And no deterministic algorithm can do better on every instance.

• Practice: very slow without low-level optimization
  – Still slow with current techniques (slow vs. SQL, not vs. inference).

• Open question!

3 Semantics for Top-K + Measures

• The worst speeder? 2 speeders?
• Combine prob + measure

• All 3 semantics:
  1. Create a single score
  2. Return tuples ranked by score
  They differ in the score definition

License Plate  Speed  P
A-123          200    0.2
               50     0.8
B-456          75     0.9
               70     0.1
C-789          74     1

A-123 is either 200 or 50

[Soliman et al’07][Zhang&Chomicki’08]

Semantic 1: Expectation

• The worst speeder? 2 speeders?
• Expectation
  – Score = Expected Speed

License Plate  Speed  Conf
A-123          200    0.2
               50     0.8
B-456          75     0.9
               70     0.1
C-789          74     1

License Plate  E[Speed]
A-123          80      (= 200 * 0.2 + 50 * 0.8)
B-456          74.5
C-789          74

Top1 = {A-123}

Top2 = {A-123, B-456}

Linear apx, so fast to compute!
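The expectation scores follow directly from the table. The sketch below recomputes them and ranks the plates; the dictionary layout and function name are illustrative.

```python
# (speed, confidence) alternatives per license plate, taken from the table.
speeds = {
    "A-123": [(200, 0.2), (50, 0.8)],
    "B-456": [(75, 0.9), (70, 0.1)],
    "C-789": [(74, 1.0)],
}

def expected_speed(alternatives):
    """Score of one plate under the expectation semantics."""
    return sum(speed * p for speed, p in alternatives)

scores = {plate: expected_speed(alts) for plate, alts in speeds.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```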

Semantic 2: U-kRanks

• The worst speeder? 2 speeders?
• U-kRank
  – Score(t) = Pr[t at rank k]

License Plate  Speed  Conf
A-123          200    0.2
               50     0.8
B-456          75     0.9
               70     0.1
C-789          74     1

License Plate  Rank 1                Rank 2
A-123          0.2                   0.0
B-456          0.72 (= 0.8 * 0.9)    0.14
C-789          0.08                  0.496

Top1 = {B-456}

Top2 = {B-456, C-789}

NB: Soliman et al consider correlations

[Soliman et al’07]

Semantic 3: Global-Top-K

• The worst speeder? 2 speeders?
• Global-Top-K
  – Score(t) = Pr[t in top-k]

License Plate  Speed  Conf
A-123          200    0.2
               50     0.8
B-456          75     0.9
               70     0.1
C-789          74     1

License Plate  Top-1  Top-2
A-123          0.2    0.2
B-456          0.72   0.98
C-789          0.08   0.8

Top1 = {B-456}

Top2 = {B-456, C-789}

[Zhang&Chomicki’08]
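The Global-Top-K score Pr[t in top-k] can be computed on this small example by enumerating possible worlds. The sketch assumes the three plates are independent and that the alternatives of one plate are mutually exclusive; it is a brute-force illustration, not the algorithm of Zhang & Chomicki. For k = 1 it reproduces the Top-1 column.

```python
from itertools import product

speeds = {
    "A-123": [(200, 0.2), (50, 0.8)],
    "B-456": [(75, 0.9), (70, 0.1)],
    "C-789": [(74, 1.0)],
}

def topk_membership(table, k):
    """Pr[plate is among the k fastest], ASSUMING independent plates."""
    plates = list(table)
    score = dict.fromkeys(plates, 0.0)
    for combo in product(*(table[p] for p in plates)):
        prob = 1.0
        for _, p in combo:
            prob *= p
        # Rank plates by the speed they take in this world.
        order = sorted(plates, key=lambda pl: combo[plates.index(pl)][0], reverse=True)
        for pl in order[:k]:
            score[pl] += prob
    return score

top1 = topk_membership(speeds, 1)
print({plate: round(s, 2) for plate, s in top1.items()})
```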

Comparing the semantics

• Z&C’s three properties for top-k

[Zhang&Chomicki’08]

Exact k: If the cardinality of the db is large enough, then the top-k has exactly k distinct values

Faithful: If the probability and score of t are both higher than those of u, then u in top-k implies t in top-k

Stability: Raising the score/probability of a tuple in the top-k will not remove it from the top-k.

THM [Z&C’08]: Global-top-k has these properties.

Expectation also has these properties

Motivation for OLAP

• Customer Relationship Management App

• Data is dirty:
  – Extracted/Classified from text (e.g. Color, Brake)
  – Attributes are non-leaf/ambiguous (e.g. EAST)

• Do we need probabilities?

Auto   Loc   Cost  Color         Brake?
F-150  NY    $200  R:1, B:0      0.8
F-150  EAST  $140  R:0.5, B:0.5  1.0
Truck  MA    $500  R:1, B:0      0.9

Sources of uncertainty:
  – Brake?: Is it a brake repair?
  – Loc: East = NY? East = MA?

[Burdick et al’05]

OLAP Data & Query Model

[Figure: grid over the Auto and Loc dimensions with labels NY, MA, EAST, F-150, RAM, TR, UC, KS and tuples T1, T2, T3; query regions are marked. Size is not significant]

Query Regions:

“Cost of F-150 brake repairs in NY”

“Cost of F-150 brake repairs in EAST”

3 Semantics for OLAP

[Figure: the same Auto × Loc grid with tuples T1, T2, T3. Size is not significant]

Sem 1: None. Any uncertainty, ignore the tuple.

Not faithful: Color uncertainty breaks the report!



Sem 2: Contains. A tuple counts if it is contained in the query’s region.

Not consistent: NY + MA ≠ EAST, i.e. Blue + Yellow ≠ Green (t2 is not in either).



Sem 3: Overlaps. A tuple has a probability in each region.

Motivation for the pDB approach: consistent for Sum

OLAP Algorithms

• Answer semantics: expectations
  – SUM: $E[\mathrm{Sum}(A)] = \sum_{t} t[A] \cdot \Pr[t \in Q]$
    (Pr[t ∈ Q]: tuple t contributes to Q)
  – AVG: $E[\mathrm{AVG}(A)] \approx \frac{E[\mathrm{Sum}(A)]}{E[\mathrm{Count}(A)]}$
    When COUNT is big, this is a good approximation [Jayram et al ‘07]

Important, well-studied problem: I/O optimizations, constraints [Burdick et al’06,07]

Faithful, consistent and efficient!

Difficult to implement!
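The SUM and AVG formulas reduce to weighted sums. The sketch below applies them to a query such as “Cost of F-150 brake repairs in NY”; the membership probabilities are assumptions made for illustration (in particular the 0.5 chance that EAST resolves to NY is not given in the source).

```python
# (cost, Pr[tuple contributes to the query region]); probabilities are assumed.
tuples = {
    "T1": (200.0, 0.8),  # F-150, NY: brake repair with probability 0.8
    "T2": (140.0, 0.5),  # F-150, EAST: ASSUME EAST resolves to NY with probability 0.5
}

def expected_sum(rows):
    """E[Sum(A)] = sum over tuples of t[A] * Pr[t in Q]."""
    return sum(cost * p for cost, p in rows.values())

def expected_count(rows):
    """E[Count] = sum over tuples of Pr[t in Q]."""
    return sum(p for _, p in rows.values())

e_sum = expected_sum(tuples)
e_avg_approx = e_sum / expected_count(tuples)  # ratio-of-expectations approximation
print(e_sum, round(e_avg_approx, 2))
```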

Motivation for HAVING

Profit
Item     Forecaster  Amount  P
Widget   Alice       $-99k   0.99
         Bob         $100M   0.01
Whatsit  Alice       $1M     1

Expectation style [OLAP style]:

SELECT SUM(Amount)
FROM Profit
WHERE item = 'Widget'

Ans: -99k * 0.99 + 100M * 0.01 ≈ 900K

HAVING style:

SELECT item
FROM Profit
WHERE item = 'Widget'
GROUP BY item
HAVING SUM(Amount) > 0

Ans: 0.01

[R&S’07]
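The two answers can be recomputed from the Profit table. The sketch assumes the two Widget forecasts are mutually exclusive alternatives (exactly one of them holds), which is how the 0.01 answer arises.

```python
# Disjoint alternatives (amount, probability) for item 'Widget'.
widget = [(-99_000, 0.99), (100_000_000, 0.01)]

# Expectation style: a single expected profit.
expected_profit = sum(amount * p for amount, p in widget)

# HAVING style: probability that SUM(Amount) > 0 holds in a possible world.
prob_positive = sum(p for amount, p in widget if amount > 0)

print(round(expected_profit), prob_positive)
```

The expected profit is roughly 900K even though the profit is positive in only 1% of the worlds, which is the gap the HAVING-style semantics is meant to expose.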

Summary of HAVING results

• Safety uses the independence test
  – Twist: Safety depends on the aggregate
  – If the plan is “safe”, then so are COUNT, MIN, MAX
  – Not true for SUM and AVG!

• Theoretical algorithms
  – Require innovation to make SQL efficient
    • Native operators, sort-based algorithms, etc.

[R&S’07]

Top-K & Aggregation Summary

• Diverse semantics driven by applications
  – Top K: U-kRanks and Global-top-k
  – OLAP & HAVING
  – Skylines too! [Pei et al ‘08]

• Lots of interest in the community
  – Conjecture: Aggregation and Top-k are more important for probabilistic databases than for RDBMS
    • Tuple carries less information
    • Many prob tuples not as valuable as 1 correct tuple

Take-home messages of Day 1

• pDBs used in diverse application domains
  – RFID, Information Extraction, Sentiment Analysis
  – Value: Higher recall, without loss of precision

• The fundamentals of QP in pDBs
  – Compile a safe query to SQL
  – Evaluate an unsafe plan (Monte Carlo)
  – Top-K Semantics for pDBs
  – OLAP on pDBs

Advertisement for Day Two

• Applications
  – RFID with movies, smoothed data

• Advanced representations
  – Lineage, Markov Models, Graphical Models, World Sets, Continuous Functions

• Advanced QP
  – Lazy Evaluation in Trio, Probabilistic Automata, Probabilistic Inference, Sampling Techniques

And More!

All sales final. Offer not valid in Alaska, or where prohibited by law.

Thank you
