View
214
Download
0
Embed Size (px)
Citation preview
Efficient Query Evaluation on Probabilistic Databases
Nilesh Dalvi
Dan Suciu
Presenter : Amit GoyalDiscussion Lead : Jonatan
Outline
Motivation Query Evaluation:
Intensional Extensional
Query Optimization Complexity Unsafe Plans Extensions Conclusions
Databases Are Deterministic
Databases we see today are deterministic A tuple is either in the query answer or not They don’t deal with uncertainties
Future of Data Management
Uncertainties in Data Biological Data Sensor Data (Geographical Data) Data extracted from various AI, data mining
techniques (information extraction) Uncertainties are represented as probabilities Extend data management tools to handle
probabilistic data
Example
Review Text
I have not used IPOD but Apple products are good
Facts TableCompany Products Rating
Apple IPOD 0.3
Representing Uncertainty
Tuple-existence uncertainty All attributes in a tuple are known precisely;
existence of the tuple is uncertain E.g. in previous slide. More later
Attribute-value uncertainty Tuples (identified by keys) exist for certain;
attributes (one or more) value are however uncertain
Tomorrow, it may rain (probability is 0.6)
Our Goal For Today
Understand how queries can be evaluated efficiently on Probabilistic Databases For simplicity, we will deal with tuple-level
uncertainties only We also assume independence among tuples. i.e.
P(t1, t2) = P(t1) * P(t2)
Possible Worlds: Example 1Camera Feature p
C21 Lens P1
C29 Battery P2
C31 Lens P3
;LensC21
FeatCam
BattC29
FeatCam
LensC31
LensC21
FeatCam
LensC31
BattC29
LensC21
FeatCam
I1
(1-p1)(1-p2)(1-p3)
I2
p1(1-p2)(1-p3)
I4
p1(1-p2)p3
I3
(1-p1)p2(1-p3)
I5
p1p2p3
Total number of worlds: 2^count_tuples
∑Ii = 1
Possible Worlds: Example 2
A B
‘m’ 1
‘n’ 1
s1
s2
0.8
0.5
C D
1 ‘p’ 0.6t1
S T
World Prob.
D1 = {s1, s2, t1}
D2 = {s1, t1}
D3 = {s2, t1}
D4 = {t1}
D5 = {s1, s2}
D6 = {s1}
D7 = {s2}
D8 = !
0.24
0.24
0.06
0.06
0.16
0.16
0.04
0.04
PossibleWorldspwd(Dp)
Query Evaluation
So, lets consider a query: Q(D) :- S(A,B), T(C,D), B = C S join T on B = C, project on D
Intuitively: Execute the query on each possible world The final result is a probabilistic relation that
represents end result
Query Evaluation: Example
World Prob. Result
D1 = {s1, s2, t1}
D2 = {s1, t1}
D3 = {s2, t1}
D4 = {t1}
D5 = {s1, s2}
D6 = {s1}
D7 = {s2}
D8 = Φ
0.24
0.24
0.06
0.06
0.16
0.16
0.04
0.04
{‘p’}
{‘p’}
{‘p’}
{}
{}
{}
{}
{}
S join T on B = C, project on D
Answer Prob.
{‘p’} 0.54
Φ 0.46
qpwd(Dp) =
Query Evaluation
Semantically correct If T has ‘n’ tuples, there can be as many as
2^n possible worlds. Exponential complexity, thus impractical
Goal of the paper: Evaluate query efficiently
Intensional Query Evaluation
Define the complex event ep(t) for each tuple t
For each intermediate tuple, associate an explicit (complex) event expression
Compute the actual probabilities at the end For this talk, we will look only select, join
project queries
Intensional Semantics
Ev
Ev
X
v2 E1 ˄ E2v1
v
E1v1 E2v2E2v
E1v
… …
E1 V E2 V …
Theorem (2)
The intesional semantics and the possible world semantics on probabilistic databases are equivalent for conjunctive queries.
pwd(qi(Dp)) = qpwd(Dp)
Intensional Semantics: Example
A B
s1 ‘m’ 1 0.8
s2 ‘n’ 1 0.5
C D
t1 1 ‘p’ 0.6
S T
S join T on B = C
A B C D E
‘m’ 1 1 ‘p’ s1 ˄ t1
‘n’ 1 1 ‘p’ s2 ˄ t1
Project on D
D Rank
‘p’ (s1 ˄ t1) V (s2 ˄ t1)
qrank(Dp) = Pr(q) = (0.8 * 0.6) + (0.5 * 0.6) – (0.8 * 0.5 * 0.6)= 0.48 + 0.3 – 0.24 = 0.54
Intensional Semantics
Does not depend on the choice of plan Impractical to use it:
The event expressions can become very large due to projections
For each tuple t, one has to compute Pr(e) for its event e, which is #P-complete problem
Thus very expensive
Extensional Semantics
Play with probabilities instead of event expressions
Much more efficient Assume tuple independence Not always correct. WHY?
Extensional Semantics
pv
pv
x
p1v1
v2 p1 p2v1
p2v2
p2v
p1v
1-(1-p1)(1-p2)…v
Extensional Query Evaluation:Example
A B
s1 ‘m’ 1 0.8
s2 ‘n’ 1 0.5
C D
t1 1 ‘p’ 0.6
ST
S join T on B = C
A B C D Prob
‘m’ 1 1 ‘p’ 0.48
‘n’ 1 1 ‘p’ 0.30
Project on D
D Prob
‘p’ 1 – (1-0.48)*(1-0.30) = 0.636
Wrong?? Because the two tuples in the join are no longer independent!!
Plan : πD(S joinB=C T)
Extensional: Alternate Query Plan
A B
s1 ‘m’ 1 0.8
s2 ‘n’ 1 0.5
C D
t1 1 ‘p’ 0.6
ST
Project S on B
B Prob
1 1 – (1-0.8)*(1-0.5) = 0.9
Join with T on B=C
B C D Prob
1 1 ‘p’ 0.9 * 0.6 = 0.54
CORRECT!!
Plan : πD(πB(S) joinB=C T)
Observation
The answer depends on query plan
Notations
R is a relation name. D = instance of a database schema Γ = set of functional dependencies E = set of all complex events q = query PRels(q) = the probabilistic relation names in q Attr(q) = all attributes in all relations in q Head(q) = the set of attributes that are in output of
the query q
Safe Plan
A plan is safe if it produces the correct result Formally, given a schema Rp, Γp, a plan P for
a query q is safe if Pe(Dp) = qrank(Dp) for all instances Dp of that schema
Theorem (3)
Consider a database schema where all the probabilistic relations are tuple-independent. Let q, q’ be the conjunctive queries that do not share any relation name. Then σ is always safe x is always safe in q x q’ Π is safe iff A1,…Ak, R.E → Head (q)
Example Same example, Γp is :
S.A, S.B → S.E T.C, T.D → T.E S.E → S.A, S.B T.E → T.C, T.D
Query :- S join T on B = C, project on D Plan : πD(S joinB=C T) Join is safe. We need to check the safeness of project. From
theorem 3, we need to check A1,…Ak, R.E → Head (q) T.D, S.E → S.A, S.B, T.C, T.D (pass) T.D, T.E → S.A, S.B, T.C, T.D (fails, why?)
Where A1,…Ak is T.D R.E is S.E and T.E Head (q) is S.A, S.B, T.C, T.D
Example: Alternative Plan Query :- S join T on B = C, project on D Plan : πD(πB(S) joinB=C T) Project on B is safe. We need to check the safeness
of project on D. From theorem 3, we need to check A1,…Ak, R.E → Head (q) T.D, S.E → S.B, T.C, T.D T.D, T.E → S.B, T.C, T.D
Where A1,…Ak is T.D R.E is S.E and T.E Head (q) is S.B, T.C, T.D
Plan is safe!!
Separation Let q be a conjunctive query. Two relations R1, R2 are called
connected if the query contains a join condition R1.A = R2.B and either R1.A or R2.B is not in Head(q). The relations R1, R2 are called separate if they are not connected.
Two sets of relations Y1 and Y2 are said to form a separation for query q iff They partition the set Rels(q) For any pair of R1 and R2 s.t. R1 belongs to Y1 and R2 belongs
to Y2, they are separate Intuitively,
The query does not contains a join condition If the query has join condition, output of query does contains both
R1.A and R2.B
Separation: Example Query :- S(A,B), T(C,D), B = C qBC = (S joinB=C T) Head(qBC) = {B,C,D}
S join T on B = C
B C D
1 1 ‘p’
1 1 ‘p’
Both B and C are present in head(qBC). Thus S and T are separate for this query
Finding Safe Plan
Authors proposed SAFE-PLAN algorithm to find safe plans for a query
Try to postpone all safe projections in the query plan When no more safe projections possible, it tries to
perform a join, by splitting q into q1 join q2 Since we perform join in the last, all attributes of join
condition must be in Head(q), thus making sure that relations involved in join are separate.
If a safe plan exist, the algorithm finds it
Finding Safe Plan: Example
Processing :- SAFE-PLAN(πD(S joinB=C T))
Head(qA) = {A, D}
qA = πD(S joinB=C T))
Z = {A}
Head(q) = {D}
Is πHead(q)(qA) is a safe operator?
Conditions:T.D, S.E → S.A, T.D (safe)T.D, T.E → S.A, T.D (unsafe)
Finding Safe Plan: Example
Processing :- SAFE-PLAN(πD(S joinB=C T))
Head(qB) = {B, D}
qB = πD(S joinB=C T))
Z = {B}
Head(q) = {D}
Is πHead(q)(qB) is a safe operator?
Conditions:T.D, S.E → S.B, T.D (safe)T.D, T.E → S.B, T.D (safe)
Return πD(SAFE-PLAN(qB))
Finding Safe Plan: Example
Processing :- πD(SAFE-PLAN(qB))
Head(qAB) = {A, B, D}
qAB = πD(S joinB=C T))
Z = {A}
Head(qB) = {B, D}
Is πHead(q)(qAB) is a safe operator?
Conditions:T.D, S.E → S.A, S.B, T.D (safe)T.D, T.E → S.A, S.B, T.D (unsafe)
Finding Safe Plan: Example
Processing :- πD(SAFE-PLAN(qB))
Head(qBC) = {B, C, D}
qBC = πD(S joinB=C T))
Z = {C}
Head(qB) = {B, D}
Is πHead(q)(qBC) is a safe operator?
Conditions:T.D, S.E → T.C, S.B, T.D (safe)T.D, T.E → T.C, S.B, T.D (safe)
Return πBD(SAFE-PLAN(qBC))
Finding Safe Plan: Example
Processing :- πD(πBD(SAFE-PLAN(qBC))
Head(qABC) = {A, B, C, D}
qABC = πD(S joinB=C T))
Z = {A}
Head(qBC) = {B, C, D}
Is πHead(q)(qABC) is a safe operator?
Conditions:T.D, S.E → S.A,T.C, S.B, T.D (safe)T.D, T.E → S.A,T.C, S.B, T.D (unsafe)
Finding Safe Plan: Example
Processing :- πD(πBD(SAFE-PLAN(qBC))
No projection possible!!
qBC = πD(S joinB=C T))
Head(qBC) = {B, C, D}
Split qBC into q1 joinB=C q2, s.t.q1(B) :- S(A,B)q2(C,D) :- T(C,D)
We know that S and T are separate on query qBC!!
Return SAFE-PLAN(q1) joinB=C SAFE-PLAN(q2))
Finding Safe Plan: Example
πD(πBD(SAFE-PLAN(q1) joinB=C SAFE-PLAN(q2)))
Head(qA) = {A, B}
qA = S(A,B)
Z = {A}
Head(q1) = {B}
Is πHead(q1)(qA) is a safe operator?
Conditions:S.B, S.E → S.A, S.B (safe)
Return πB(SAFE-PLAN(S(A,B)))i.e. πB(S(A,B))
Finding Safe Plan: Example
SAFE-PLAN(q2) = T(C,D) Thus, final result :
πD(πBD(πB(S) joinB=C T)) πBD is redundant. Can be optimized. SAFE-PLAN algorithm is sound and complete How can we optimize our query plan? Traditional equivalences do not work in extensional
semantics. Need to define extensional semantics equivalences
Query Optimization Select behaves exactly like traditional select operator Extensional joins are commutative
R join S S join R Extensional joins are associative
R join (S join T) (R join S) join T Cascading Projections
πA(πAUB(R)) πA(R) Pushing Projection below a join
πA(R join S) => (πA(R)) join (πA(S)) Lifting Projections Up a Join: only when it satisfies the project condition
in theorem 3(πA(R)) join S => πAUAttrs(S)(R join S)
Theorem (10) : Let Z1 and Z2 be two safe plans for a query q. Then Z1 Z2
Complexity Fundamentals
PTIME : solvable in polynomial time NP complete : Is? Checks satisfiability. #P complete : How many?
Complexity Analysis
The data complexity of a query q is the complexity of evaluating qrank(Dp) as a function of size of Dp
If q has a safe plan, then its data complexity is in PTIME All extensional operators are in PTIME
If q does not has a safe plan, then its data complexity is in #P-complete. i.e. if SAFE-PLAN algorithm fails to return a plan
Unsafe Plans
What if there is no safe plan? The author proposes two solutions Least Unsafe Plans Monte-Carlo Approximations
Least Unsafe Plans
Minimize the error in computing the probabilities Modify SAFE-PLAN algorithm
When splitting a query q in two sub-queries q1 and q2, allow joins b/w q1 and q2 on attributes not in Head(q), then project out these attributes
These projections will be unsafe. Minimize their degree of unsafety
Pick q1, q2 to be a minimum cut of graph (rather than separation)
Problem of finding minimum cut is in PTIME
Monte-Carlo Approximations Let q’ be the query obtained from q by making it
return all the variables in its body. Evaluate q’ instead of q without any probability
calculations Group the tuples based on the values of attributes in
Head(q) Complex event expression of a group will be in
CNF. i.e. Vni=1Ci where each Ci is in DNF. i.e. e1 ˄ e2
˄ … Back to same problem!! Complexity of evaluating the probability of a boolean
expression is in #P-complete
Monte-Carlo Approximations
Given a DNF formula with N clauses and any ε and δ, the probability can be approximated in time O(N/ε2 ln (1/δ))
Probability of the error being greater than ε is less than δ.
If N is small, an exact algorithm may be applied in place of simulation
Extensions
Till now: All the events in probabilistic relations are distinct Dealt with select, project, join queries.
The authors have extended their solutions to non-distinct relations and additional operators
Handling Repeated Events
Multiple tuple can share a common event 4 easy steps to handle them:
Normalize the schema – represents the same data in normalized form, s.t. no probabilistic table has repeated events
TP :- T1 and TP2
Translate original query into new schema Find a safe plan Translate back to original schema
Handling Repeated Events:Example
Consider two prob. Relations: R(A,B) and S(C,D) s.t. R has all distinct events while S has a distinct event for each value of D
Query q(x) :- R(x,y), S(y,z) Step1: create a new schema. Decompose S into two
relations: S1(C, D, EID) and S2(EID) q’(x) :- R(x,y), S1(y,z, eid), S2(eid) Using SAFE-PLAN, we get the following plan
P’ = πA(R joinB=C (πB,EID(S1) joinEID S2)) Substitute back S1 and S2 accordingly
Additional Operators
Union, Difference and Groupby operators Covers almost all queries with nested sub-
queries, aggregates, group-by and existensial/universal quantifiers
Uncertain Predicates q≈ predicate on a deterministic database Syntactic closeness: String Matching. e.g. certain ~
uncertain Edit distances, q-grams etc.
Semantic closeness: e.g. musical ~ opera TF/IDF, ontologies from Wordnet
Numeric closeness: e.g. 25 ~ 26 similar numeric values
Once distances are defined, they need to be meaningfully converted into probabilities gaussian, student-T, normal-gamma parameters can be learned (ideal case) or can be
specified by user
Conclusions
Extensional semantics can be used to evaluate certain class of queries in PTIME
#P-complete problems can be solved using approximations techniques
In practice, many (around 80% as in experiments) queries have safe plans
Extended their approach to deal with non-distinct relations and additional operators