Efficient Query Evaluation on Probabilistic Databases Nilesh Dalvi Dan Suciu Presenter : Amit Goyal Discussion Lead : Jonatan

Efficient Query Evaluation on Probabilistic Databases

Nilesh Dalvi

Dan Suciu

Presenter : Amit Goyal
Discussion Lead : Jonatan

Outline

  • Motivation
  • Query Evaluation:
      • Intensional
      • Extensional
  • Query Optimization
  • Complexity
  • Unsafe Plans
  • Extensions
  • Conclusions

Databases Are Deterministic

  • Databases we see today are deterministic
  • A tuple is either in the query answer or not
  • They don’t deal with uncertainties

Future of Data Management

  • Uncertainties in data
      • Biological data
      • Sensor data (geographical data)
      • Data extracted by various AI and data mining techniques (information extraction)
  • Uncertainties are represented as probabilities
  • Extend data management tools to handle probabilistic data

Example

Review text:

  "I have not used IPOD but Apple products are good"

Facts table:

  Company   Products   Rating
  Apple     IPOD       0.3

Representing Uncertainty

  • Tuple-existence uncertainty
      • All attributes in a tuple are known precisely; the existence of the tuple is uncertain
      • E.g. the previous slide. More later
  • Attribute-value uncertainty
      • Tuples (identified by keys) exist for certain; the values of one or more attributes are uncertain
      • Tomorrow, it may rain (probability is 0.6)

Our Goal For Today

  • Understand how queries can be evaluated efficiently on probabilistic databases
  • For simplicity, we will deal with tuple-level uncertainties only
  • We also assume independence among tuples, i.e. P(t1, t2) = P(t1) * P(t2)

Possible Worlds: Example 1

  Camera   Feature   p
  C21      Lens      p1
  C29      Battery   p2
  C31      Lens      p3

Worlds shown on the slide (each world is a set of Camera/Feature tuples):

  World   Tuples                                        Probability
  I1      ∅                                             (1-p1)(1-p2)(1-p3)
  I2      {(C21, Lens)}                                 p1(1-p2)(1-p3)
  I3      {(C29, Battery)}                              (1-p1)p2(1-p3)
  I4      {(C21, Lens), (C31, Lens)}                    p1(1-p2)p3
  I5      {(C21, Lens), (C29, Battery), (C31, Lens)}    p1p2p3

Total number of worlds: 2^count_tuples

∑ Pr(Ii) = 1

Possible Worlds: Example 2

S
        A     B    Prob.
  s1    ‘m’   1    0.8
  s2    ‘n’   1    0.5

T
        C     D     Prob.
  t1    1     ‘p’   0.6

Possible worlds pwd(D^p):

  World                 Prob.
  D1 = {s1, s2, t1}     0.24
  D2 = {s1, t1}         0.24
  D3 = {s2, t1}         0.06
  D4 = {t1}             0.06
  D5 = {s1, s2}         0.16
  D6 = {s1}             0.16
  D7 = {s2}             0.04
  D8 = ∅                0.04
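The world probabilities in this example follow directly from the tuple-independence assumption. The following is a minimal Python sketch that enumerates all subsets of the three tuples; the names `TUPLES` and `possible_worlds` are illustrative and do not come from the paper.

```python
from itertools import product

# Tuple probabilities from the slide: s1, s2 belong to S and t1 to T.
TUPLES = {"s1": 0.8, "s2": 0.5, "t1": 0.6}

def possible_worlds(tuples):
    """Yield (world, probability) for every subset of independent tuples."""
    names = list(tuples)
    for present in product([True, False], repeat=len(names)):
        world = frozenset(n for n, keep in zip(names, present) if keep)
        prob = 1.0
        for n, keep in zip(names, present):
            # A tuple contributes p if present and 1 - p if absent.
            prob *= tuples[n] if keep else 1.0 - tuples[n]
        yield world, prob

worlds = dict(possible_worlds(TUPLES))
for world, prob in worlds.items():
    print(sorted(world), round(prob, 2))
```

The loop visits 2^3 = 8 worlds, which is why this approach does not scale beyond toy instances.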

Query Evaluation

  • So, let's consider a query:
      Q(D) :- S(A,B), T(C,D), B = C
      (S join T on B = C, project on D)
  • Intuitively:
      • Execute the query on each possible world
      • The final result is a probabilistic relation that represents the end result

Query Evaluation: Example

Query: S join T on B = C, project on D

  World                 Prob.   Result
  D1 = {s1, s2, t1}     0.24    {‘p’}
  D2 = {s1, t1}         0.24    {‘p’}
  D3 = {s2, t1}         0.06    {‘p’}
  D4 = {t1}             0.06    {}
  D5 = {s1, s2}         0.16    {}
  D6 = {s1}             0.16    {}
  D7 = {s2}             0.04    {}
  D8 = ∅                0.04    {}

q^pwd(D^p) =

  Answer   Prob.
  {‘p’}    0.54
  ∅        0.46
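The possible-worlds evaluation can be sketched in a few lines: run the query on every world and add up the world probabilities per distinct answer. This is only an illustration under the tuple-independence assumption; `query` and `answer_distribution` are hypothetical helper names.

```python
from itertools import product

# name -> (row, probability); rows of S are (A, B), rows of T are (C, D).
S = {"s1": (("m", 1), 0.8), "s2": (("n", 1), 0.5)}
T = {"t1": ((1, "p"), 0.6)}

def query(s_rows, t_rows):
    """Q(D) :- S(A,B), T(C,D), B = C on one deterministic world."""
    return frozenset(d for (_a, b) in s_rows for (c, d) in t_rows if b == c)

def answer_distribution(s_rel, t_rel):
    """Sum world probabilities for each distinct query answer."""
    tuples = {**s_rel, **t_rel}
    names = list(tuples)
    dist = {}
    for present in product([True, False], repeat=len(names)):
        prob = 1.0
        for n, keep in zip(names, present):
            p = tuples[n][1]
            prob *= p if keep else 1.0 - p
        kept = {n for n, keep in zip(names, present) if keep}
        s_rows = [s_rel[n][0] for n in kept if n in s_rel]
        t_rows = [t_rel[n][0] for n in kept if n in t_rel]
        answer = query(s_rows, t_rows)
        dist[answer] = dist.get(answer, 0.0) + prob
    return dist

dist = answer_distribution(S, T)
for answer, prob in dist.items():
    print(set(answer) or "{}", round(prob, 2))
```

The answer {‘p’} collects the probabilities of D1, D2 and D3 (0.24 + 0.24 + 0.06), and the empty answer collects the rest.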

Query Evaluation

  • Semantically correct
  • If T has n tuples, there can be as many as 2^n possible worlds
  • Exponential complexity, thus impractical
  • Goal of the paper: evaluate queries efficiently

Intensional Query Evaluation

  • Define the complex event ep(t) for each tuple t
  • For each intermediate tuple, associate an explicit (complex) event expression
  • Compute the actual probabilities at the end
  • For this talk, we will look only at select-project-join queries

Intensional Semantics

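  Operator     Input (tuple, event)       Output (tuple, event)
  Select σ     (v, E)                     (v, E)
  Join ×       (v1, E1), (v2, E2)         (v1 v2, E1 ∧ E2)
  Project Π    (v, E1), (v, E2), …        (v, E1 ∨ E2 ∨ …)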

Theorem (2)

The intensional semantics and the possible worlds semantics on probabilistic databases are equivalent for conjunctive queries.

pwd(q^i(D^p)) = q^pwd(D^p)

Intensional Semantics: Example

S
        A     B    Prob.
  s1    ‘m’   1    0.8
  s2    ‘n’   1    0.5

T
        C     D     Prob.
  t1    1     ‘p’   0.6

S join T on B = C:

  A     B   C   D     E
  ‘m’   1   1   ‘p’   s1 ∧ t1
  ‘n’   1   1   ‘p’   s2 ∧ t1

Project on D:

  D     Rank
  ‘p’   (s1 ∧ t1) ∨ (s2 ∧ t1)

q^rank(D^p) = Pr(q) = (0.8 * 0.6) + (0.5 * 0.6) - (0.8 * 0.5 * 0.6) = 0.48 + 0.30 - 0.24 = 0.54
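The inclusion-exclusion arithmetic above can be cross-checked by brute force. The sketch below evaluates the probability of a DNF event expression by summing over all truth assignments of its base events; it assumes independent base events, and `dnf_probability` is an illustrative name rather than anything defined in the paper.

```python
from itertools import product

PROB = {"s1": 0.8, "s2": 0.5, "t1": 0.6}
# (s1 AND t1) OR (s2 AND t1), one set of base events per conjunction.
EVENT = [{"s1", "t1"}, {"s2", "t1"}]

def dnf_probability(dnf, prob):
    """Exact Pr(dnf) by enumerating assignments (exponential in #events)."""
    variables = sorted(set().union(*dnf))
    total = 0.0
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if any(all(assignment[v] for v in clause) for clause in dnf):
            weight = 1.0
            for v in variables:
                weight *= prob[v] if assignment[v] else 1.0 - prob[v]
            total += weight
    return total

result = dnf_probability(EVENT, PROB)
print(round(result, 2))
```

The enumeration is exponential in the number of base events, which is the cost the extensional approach tries to avoid.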

Intensional Semantics

  • Does not depend on the choice of plan
  • Impractical to use:
      • The event expressions can become very large due to projections
      • For each tuple t, one has to compute Pr(e) for its event e, which is a #P-complete problem
  • Thus very expensive

Extensional Semantics

  • Play with probabilities instead of event expressions
  • Much more efficient
  • Assume tuple independence
  • Not always correct. WHY?

Extensional Semantics

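  Operator     Input (tuple, prob.)       Output (tuple, prob.)
  Select σ     (v, p)                     (v, p)
  Join ×       (v1, p1), (v2, p2)         (v1 v2, p1 p2)
  Project Π    (v, p1), (v, p2), …        (v, 1-(1-p1)(1-p2)…)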

Extensional Query Evaluation: Example

Plan : πD(S joinB=C T)

S
        A     B    Prob.
  s1    ‘m’   1    0.8
  s2    ‘n’   1    0.5

T
        C     D     Prob.
  t1    1     ‘p’   0.6

S join T on B = C:

  A     B   C   D     Prob
  ‘m’   1   1   ‘p’   0.48
  ‘n’   1   1   ‘p’   0.30

Project on D:

  D     Prob
  ‘p’   1 - (1-0.48)*(1-0.30) = 0.636

Wrong?? Because the two tuples in the join are no longer independent!!

Extensional: Alternate Query Plan

Plan : πD(πB(S) joinB=C T)

S
        A     B    Prob.
  s1    ‘m’   1    0.8
  s2    ‘n’   1    0.5

T
        C     D     Prob.
  t1    1     ‘p’   0.6

Project S on B:

  B   Prob
  1   1 - (1-0.8)*(1-0.5) = 0.9

Join with T on B=C:

  B   C   D     Prob
  1   1   ‘p’   0.9 * 0.6 = 0.54

CORRECT!!
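To make the difference between the two plans concrete, here is a small Python sketch of the extensional join and projection rules (multiply on join, 1 - ∏(1 - p) on projection). The helper names `ext_join` and `ext_project` and the row representation are assumptions made for this illustration, not the paper's implementation.

```python
from math import prod

def ext_join(left, right, on):
    """Extensional join: probability of a joined row is the product."""
    return [({**l, **r}, pl * pr)
            for l, pl in left for r, pr in right if on(l, r)]

def ext_project(rows, attrs):
    """Extensional projection: merge duplicates as 1 - prod(1 - p_i)."""
    groups = {}
    for row, p in rows:
        key = tuple(row[a] for a in attrs)
        groups.setdefault(key, []).append(p)
    return [(dict(zip(attrs, key)), 1.0 - prod(1.0 - p for p in ps))
            for key, ps in groups.items()]

S = [({"A": "m", "B": 1}, 0.8), ({"A": "n", "B": 1}, 0.5)]
T = [({"C": 1, "D": "p"}, 0.6)]
b_equals_c = lambda s, t: s["B"] == t["C"]

# Plan 1: project after the join (treats dependent rows as independent).
unsafe = ext_project(ext_join(S, T, b_equals_c), ["D"])
# Plan 2: project S on B first, then join, then project on D.
safe = ext_project(ext_join(ext_project(S, ["B"]), T, b_equals_c), ["D"])

print(round(unsafe[0][1], 3), round(safe[0][1], 3))
```

Under these rules the first plan yields 0.636 and the second yields 0.54, matching the two slides; only the second agrees with the possible-worlds answer.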

Observation

The answer depends on query plan

Notations

  • R = a relation name
  • D = instance of a database schema
  • Γ = set of functional dependencies
  • E = set of all complex events
  • q = query
  • PRels(q) = the probabilistic relation names in q
  • Attr(q) = all attributes in all relations in q
  • Head(q) = the set of attributes that are in the output of the query q

Safe Plan

  • A plan is safe if it produces the correct result
  • Formally, given a schema R^p, Γ^p, a plan P for a query q is safe if P^e(D^p) = q^rank(D^p) for all instances D^p of that schema

Theorem (3)

Consider a database schema where all the probabilistic relations are tuple-independent. Let q, q’ be conjunctive queries that do not share any relation name. Then:

  • σ is always safe
  • × is always safe in q × q’
  • ΠA1,…,Ak is safe in ΠA1,…,Ak(q) iff, for every probabilistic relation R in PRels(q), A1,…,Ak, R.E → Head(q)

Example

Same example, Γp is:

  S.A, S.B → S.E
  T.C, T.D → T.E
  S.E → S.A, S.B
  T.E → T.C, T.D

Query :- S join T on B = C, project on D
Plan : πD(S joinB=C T)

  • Join is safe.
  • We need to check the safeness of project. From theorem 3, we need to check A1,…,Ak, R.E → Head(q):
      T.D, S.E → S.A, S.B, T.C, T.D (pass)
      T.D, T.E → S.A, S.B, T.C, T.D (fails, why?)
  • Where:
      A1,…,Ak is T.D
      R.E is S.E and T.E
      Head(q) is S.A, S.B, T.C, T.D

Example: Alternative Plan

Query :- S join T on B = C, project on D
Plan : πD(πB(S) joinB=C T)

  • Project on B is safe.
  • We need to check the safeness of project on D. From theorem 3, we need to check A1,…,Ak, R.E → Head(q):
      T.D, S.E → S.B, T.C, T.D
      T.D, T.E → S.B, T.C, T.D
  • Where:
      A1,…,Ak is T.D
      R.E is S.E and T.E
      Head(q) is S.B, T.C, T.D

Plan is safe!!
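The dependency checks in the two examples above can be automated with the standard attribute-closure computation. The sketch below is illustrative only: `closure` and `projection_is_safe` are hypothetical names, and it assumes that the join condition B = C contributes the dependencies S.B → T.C and T.C → S.B, which is what makes T.C derivable from S.B in the slide's checks.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under the functional dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

FDS = [
    ({"S.A", "S.B"}, {"S.E"}), ({"S.E"}, {"S.A", "S.B"}),
    ({"T.C", "T.D"}, {"T.E"}), ({"T.E"}, {"T.C", "T.D"}),
    # Assumption: the join condition B = C adds S.B <-> T.C.
    ({"S.B"}, {"T.C"}), ({"T.C"}, {"S.B"}),
]

def projection_is_safe(proj_attrs, event_attrs, head, fds):
    """Theorem 3 test: A1..Ak, R.E -> Head(q) for every event attribute."""
    return all(head <= closure(set(proj_attrs) | {e}, fds)
               for e in event_attrs)

events = ["S.E", "T.E"]
first_plan = projection_is_safe({"T.D"}, events,
                                {"S.A", "S.B", "T.C", "T.D"}, FDS)
alt_plan = projection_is_safe({"T.D"}, events,
                              {"S.B", "T.C", "T.D"}, FDS)
print(first_plan, alt_plan)
```

The first check fails because T.D, T.E never determines S.A; once S.A has been projected away early, as in the alternative plan, both dependencies hold.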

Separation

  • Let q be a conjunctive query. Two relations R1, R2 are called connected if the query contains a join condition R1.A = R2.B and either R1.A or R2.B is not in Head(q). The relations R1, R2 are called separate if they are not connected.
  • Two sets of relations Y1 and Y2 are said to form a separation for query q iff
      • They partition the set Rels(q)
      • Any pair R1, R2 with R1 in Y1 and R2 in Y2 is separate
  • Intuitively, R1 and R2 are separate if
      • The query does not contain a join condition between them, or
      • The query has a join condition R1.A = R2.B and the output of the query contains both R1.A and R2.B

Separation: Example

Query :- S(A,B), T(C,D), B = C
qBC = (S joinB=C T)
Head(qBC) = {B, C, D}

S join T on B = C:

  B   C   D
  1   1   ‘p’
  1   1   ‘p’

Both B and C are present in Head(qBC). Thus S and T are separate for this query.
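The connected/separate test is simple enough to state as code. This is a minimal sketch under the definition given on the Separation slide; the function name `connected` and the "Relation.Attribute" string encoding are assumptions made for the example.

```python
def connected(join_conditions, head, r1, r2):
    """R1 and R2 are connected if some join condition between them has
    at least one side missing from Head(q)."""
    for left, right in join_conditions:
        relations = {left.split(".")[0], right.split(".")[0]}
        if relations == {r1, r2} and not {left, right} <= head:
            return True
    return False

joins = [("S.B", "T.C")]

# Head(qBC) = {B, C, D}: both join attributes are in the head.
in_qbc = connected(joins, {"S.B", "T.C", "T.D"}, "S", "T")
# Head(q) = {D}: the join attributes are projected away.
in_q = connected(joins, {"T.D"}, "S", "T")

print("separate in qBC:", not in_qbc)
print("separate in q:", not in_q)
```

So S and T are separate in qBC but connected in the original query q, whose head contains only D.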

Finding Safe Plan

  • The authors propose the SAFE-PLAN algorithm to find safe plans for a query
      • It tries to postpone all safe projections to late in the query plan
      • When no more safe projections are possible, it tries to perform a join, by splitting q into q1 join q2
      • Since the join is performed last, all attributes of the join condition must be in Head(q), which ensures that the relations involved in the join are separate
  • If a safe plan exists, the algorithm finds it

Finding Safe Plan: Example

Processing :- SAFE-PLAN(πD(S joinB=C T))

  Head(q) = {D}
  Z = {A}
  qA = πA,D(S joinB=C T)
  Head(qA) = {A, D}

Is πHead(q)(qA) a safe operator?

Conditions:
  T.D, S.E → S.A, T.D (safe)
  T.D, T.E → S.A, T.D (unsafe)

Finding Safe Plan: Example

Processing :- SAFE-PLAN(πD(S joinB=C T))

  Head(q) = {D}
  Z = {B}
  qB = πB,D(S joinB=C T)
  Head(qB) = {B, D}

Is πHead(q)(qB) a safe operator?

Conditions:
  T.D, S.E → S.B, T.D (safe)
  T.D, T.E → S.B, T.D (safe)

Return πD(SAFE-PLAN(qB))

Finding Safe Plan: Example

Processing :- πD(SAFE-PLAN(qB))

  Head(qB) = {B, D}
  Z = {A}
  qAB = πA,B,D(S joinB=C T)
  Head(qAB) = {A, B, D}

Is πHead(qB)(qAB) a safe operator?

Conditions:
  T.D, S.E → S.A, S.B, T.D (safe)
  T.D, T.E → S.A, S.B, T.D (unsafe)

Finding Safe Plan: Example

Processing :- πD(SAFE-PLAN(qB))

  Head(qB) = {B, D}
  Z = {C}
  qBC = πB,C,D(S joinB=C T)
  Head(qBC) = {B, C, D}

Is πHead(qB)(qBC) a safe operator?

Conditions:
  T.D, S.E → T.C, S.B, T.D (safe)
  T.D, T.E → T.C, S.B, T.D (safe)

Return πBD(SAFE-PLAN(qBC))

Finding Safe Plan: Example

Processing :- πD(πBD(SAFE-PLAN(qBC)))

  Head(qBC) = {B, C, D}
  Z = {A}
  qABC = πA,B,C,D(S joinB=C T)
  Head(qABC) = {A, B, C, D}

Is πHead(qBC)(qABC) a safe operator?

Conditions:
  T.D, S.E → S.A, T.C, S.B, T.D (safe)
  T.D, T.E → S.A, T.C, S.B, T.D (unsafe)

Finding Safe Plan: Example

Processing :- πD(πBD(SAFE-PLAN(qBC)))

No projection possible!!

  qBC = πB,C,D(S joinB=C T)
  Head(qBC) = {B, C, D}

Split qBC into q1 joinB=C q2, s.t.
  q1(B) :- S(A,B)
  q2(C,D) :- T(C,D)

We know that S and T are separate on query qBC!!

Return SAFE-PLAN(q1) joinB=C SAFE-PLAN(q2)

Finding Safe Plan: Example

Processing :- πD(πBD(SAFE-PLAN(q1) joinB=C SAFE-PLAN(q2)))

  Head(q1) = {B}
  Z = {A}
  qA = S(A,B)
  Head(qA) = {A, B}

Is πHead(q1)(qA) a safe operator?

Conditions:
  S.B, S.E → S.A, S.B (safe)

Return πB(SAFE-PLAN(S(A,B))), i.e. πB(S(A,B))

Finding Safe Plan: Example

  • SAFE-PLAN(q2) = T(C,D)
  • Thus, final result: πD(πBD(πB(S) joinB=C T))
  • πBD is redundant. Can be optimized.
  • The SAFE-PLAN algorithm is sound and complete
  • How can we optimize our query plan?
      • Traditional equivalences do not work in extensional semantics
      • Need to define equivalences for extensional semantics

Query Optimization

  • Select behaves exactly like the traditional select operator
  • Extensional joins are commutative
      R join S ≡ S join R
  • Extensional joins are associative
      R join (S join T) ≡ (R join S) join T
  • Cascading projections
      πA(πA∪B(R)) ≡ πA(R)
  • Pushing a projection below a join
      πA(R join S) => (πA(R)) join (πA(S))
  • Lifting a projection up a join: only when it satisfies the projection condition in theorem 3
      (πA(R)) join S => πA∪Attrs(S)(R join S)

Theorem (10): Let Z1 and Z2 be two safe plans for a query q. Then Z1 ≡ Z2

Complexity Fundamentals

  • PTIME: solvable in polynomial time
  • NP-complete: "Is there a solution?", e.g. checking satisfiability
  • #P-complete: "How many solutions are there?"

Complexity Analysis

  • The data complexity of a query q is the complexity of evaluating q^rank(D^p) as a function of the size of D^p
  • If q has a safe plan, then its data complexity is in PTIME
      • All extensional operators are in PTIME
  • If q does not have a safe plan, then its data complexity is #P-complete
      • i.e. if the SAFE-PLAN algorithm fails to return a plan

Unsafe Plans

  • What if there is no safe plan?
  • The authors propose two solutions:
      • Least Unsafe Plans
      • Monte-Carlo Approximations

Least Unsafe Plans

  • Minimize the error in computing the probabilities
  • Modify the SAFE-PLAN algorithm
      • When splitting a query q into two sub-queries q1 and q2, allow joins between q1 and q2 on attributes not in Head(q), then project out these attributes
      • These projections will be unsafe. Minimize their degree of unsafety
      • Pick q1, q2 to be a minimum cut of the graph (rather than a separation)
      • The problem of finding a minimum cut is in PTIME

Monte-Carlo Approximations

  • Let q’ be the query obtained from q by making it return all the variables in its body
  • Evaluate q’ instead of q, without any probability calculations
  • Group the tuples based on the values of the attributes in Head(q)
  • The complex event expression of a group will be in DNF, i.e. C1 ∨ C2 ∨ … ∨ Cn, where each Ci is a conjunction e1 ∧ e2 ∧ …
  • Back to the same problem!! Evaluating the probability of a boolean expression is #P-complete

Monte-Carlo Approximations

  • Given a DNF formula with N clauses and any ε and δ, the probability can be approximated in time O((N/ε^2) ln(1/δ))
  • The probability of the error being greater than ε is less than δ
  • If N is small, an exact algorithm may be applied in place of simulation
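One standard way to obtain such an approximation is a Karp-Luby-style estimator for DNF probability. The sketch below is an illustration, not the authors' implementation: it assumes independent base events and positive literals only, and the fixed `samples` count stands in for a sample size derived from ε and δ.

```python
import random

def karp_luby(dnf, prob, samples=20000, seed=0):
    """Estimate Pr(C1 or ... or Cn) for conjunctions of independent events."""
    rng = random.Random(seed)
    clauses = [frozenset(c) for c in dnf]
    variables = sorted(set().union(*clauses))
    # Weight of a clause = probability that all of its events are true.
    weights = []
    for clause in clauses:
        w = 1.0
        for v in clause:
            w *= prob[v]
        weights.append(w)
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    acc = 0.0
    for _ in range(samples):
        # Pick a clause proportionally to its weight, then sample a world
        # conditioned on that clause being true.
        (chosen,) = rng.choices(clauses, weights=weights)
        true_vars = {v for v in variables
                     if v in chosen or rng.random() < prob[v]}
        satisfied = sum(1 for c in clauses if c <= true_vars)
        acc += 1.0 / satisfied  # corrects for worlds counted by many clauses
    return total_weight * acc / samples

PROB = {"s1": 0.8, "s2": 0.5, "t1": 0.6}
estimate = karp_luby([{"s1", "t1"}, {"s2", "t1"}], PROB)
print(round(estimate, 2))
```

On the running example the estimate should land close to the exact value 0.54; the 1/satisfied correction is what keeps the estimator unbiased when clauses overlap.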

Extensions

  • Till now:
      • All the events in probabilistic relations are distinct
      • Dealt with select, project, join queries
  • The authors have extended their solutions to non-distinct relations and additional operators

Handling Repeated Events

  • Multiple tuples can share a common event
  • 4 easy steps to handle them:
      1. Normalize the schema: represent the same data in normalized form, s.t. no probabilistic table has repeated events
           T^p :- T1 and T2^p
      2. Translate the original query into the new schema
      3. Find a safe plan
      4. Translate back to the original schema

Handling Repeated Events: Example

  • Consider two prob. relations R(A,B) and S(C,D), s.t. R has all distinct events while S has a distinct event for each value of D
  • Query q(x) :- R(x,y), S(y,z)
  • Step 1: create a new schema. Decompose S into two relations: S1(C, D, EID) and S2(EID)
      q’(x) :- R(x,y), S1(y,z,eid), S2(eid)
  • Using SAFE-PLAN, we get the following plan:
      P’ = πA(R joinB=C (πC,EID(S1) joinEID S2))
  • Substitute back S1 and S2 accordingly

Additional Operators

  • Union, Difference and Groupby operators
  • Covers almost all queries with nested sub-queries, aggregates, group-by and existential/universal quantifiers

Uncertain Predicates

  • q≈ predicate on a deterministic database
  • Syntactic closeness: string matching, e.g. certain ~ uncertain
      • Edit distances, q-grams etc.
  • Semantic closeness: e.g. musical ~ opera
      • TF/IDF, ontologies from WordNet
  • Numeric closeness: e.g. 25 ~ 26
      • Similar numeric values
  • Once distances are defined, they need to be meaningfully converted into probabilities
      • Gaussian, Student-t, normal-gamma
      • Parameters can be learned (ideal case) or can be specified by the user
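As an illustration of the syntactic-closeness case, the following is a minimal sketch of the classic Levenshtein edit distance. The function name `edit_distance` is illustrative; the conversion of a distance into a probability is deliberately left out, because the slide does not fix a particular formula.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

distance = edit_distance("certain", "uncertain")
print(distance)  # 2: insert 'u' and 'n'
```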

Conclusions

  • Extensional semantics can be used to evaluate a certain class of queries in PTIME
  • #P-complete problems can be handled using approximation techniques
  • In practice, many queries (around 80% in the experiments) have safe plans
  • The authors extended their approach to deal with non-distinct relations and additional operators