Upload
december
View
30
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Efficient Detection of Empty Result Queries. Gang Luo IBM T.J. Watson Research Center [email protected]. Empty Result Problem. Query returns an empty result set User gets lost about where to look at next Frequently encountered in interactive exploration of massive data sets - PowerPoint PPT Presentation
Citation preview
32nd International Conference on Very Large Data BasesSeptember 12 - 15, 2006 Seoul, Korea
Efficient Detection of Empty Result Queries
Gang Luo IBM T.J. Watson Research Center
2
Empty Result Problem
• Query returns an empty result set• User gets lost about where to look at next• Frequently encountered in interactive
exploration of massive data sets• Our contribution: method for quickly
detecting empty result sets
3
Example Percentages of Empty Result Queries
• In a Customer Relationship Management (CRM) application developed by IBM– 18.07% (3,396 empty result queries in 18,793
queries)
• In a real estate application developed by IBM – 5.75%
• In a digital library application [JCM+00] – 10.53%
• In a bioinformatics application [RCP+98]– 38%
4
Empty Result Queries May Not Finish Execution Quickly
• Consider a query joining two relations– Query execution time is longer than join time, no
matter whether or not query result set is empty
• Even if a query finishes in a few seconds in a lightly loaded RDBMS, it may last longer than one minute in a heavily loaded RDBMS
5
Outline
• Limitations of previous approaches
• Fast detection method for empty result queries
• Some experiments
6
Existing Solutions to the Empty Result Problem
• Explain what leads to the empty result set• Automatically generalize the query so that the
generalized query will return some answers
7
Limitations of Existing Solutions
• Require domain specific knowledge• Only apply to a restricted form of queries• Require an excessive amount of time• Give too many reasons why the result set is
empty• Users cannot reuse each other’s query results
8
Outline
• Limitations of previous approaches
• Fast detection method for empty result queries
• Some experiments
9
Our Solution
• Only consider read-only environment• From previous queries’ execution, remember
the query parts that lead to empty result sets• When a new query Q comes, match it with the
remembered query parts. If such a match exists, report that Q will return an empty result set without executing Q
• Utilize special properties of empty result sets and thus often more powerful than traditional materialized view method
10
Definitions
• Empty result propagating operator: An operator whose output is empty if any input is empty
• Empty result propagating query: A query whose query plan only contains empty result propagating operators (our focus)
• Query part: A sub-tree of a query plan• Atomic query part: An ordered pair (relation
names RN, selection condition SC)– Corresponds to a relational algebra formula: first
product join all relations in RN, then apply SC
– SC is a conjunction of primitive terms, where each primitive term is a comparison
11
Definitions – Cont.
• Cover: Atomic query part P1=(RN1, SC1) covers atomic query part P2=(RN2, SC2) if
– RN1RN2
– Whenever SC2 is true, SC1 is true
• Property: Suppose atomic query part P1 covers atomic query part P2. For a given database, if the output of P1 is empty, the output of P2 is also empty.
12
Given an Empty Result Query• Find the lowest-level query part P whose
output is empty
B (index-scan) B.e<40 B.e=50 [5000]
C (table-scan) [20000]
sort-merge join B.g=C.h [0]
C.f<300 [1000]
[0]
sort [0] sort [1000]
[0]
A (table-scan) [40000]
50<A.a<100 A.b=200 [200] [5000]
hash join A.c=B.d [0]
hash [200] hash [5000]
13
Transforming P into a Simplified Query Part Ps
• Drop all operators (e.g., projection, hash, sort) that have no influence on the emptiness of the output
• Replace each physical join operator with a logical join operator
• Replace each index-scan operator with a table-scan operator followed by a selection operator, where the selection condition is the index-scan condition
14
Transforming P into a Simplified Query Part Ps – Cont.
• Corresponding relational algebra formula– (50<A.a<100 A.b=200 (A)) ⋈A.c=B.d (B.e<40 B.e=50 (B))
B (table-scan) A (table-scan)
50<A.a<100 A.b=200B.e<40 B.e=50
⋈A.c=B.d
15
Breaking Ps into Atomic Query Parts
• Get all selection conditions in the selection/join operators
• Rewrite the conjunction of these selection conditions into a disjunctive normal form (DNF) – Negations on numeric or string attributes are
removed using complementary operators– Interval-based comparison is treated as a single
primitive term • Generate a set of atomic query parts (RN, SC)
– RN: input relations of all table-scan operators in Ps – SC: a term in the DNF
16
Breaking Ps into Atomic Query Parts – Cont.
• Property: The following three assertions are equivalent to each other:– The output of the query part P is empty
– The output of the simplified query part Ps is empty
– The output of each generated atomic query part is empty
(50<A.a<100 (A)) ⋈A.c=B.d (B.e<40 (B))
(A.b=200 (A)) ⋈A.c=B.d (B.e<40 (B))
(50<A.a<100 (A)) ⋈A.c=B.d (B.e=50 (B))
(A.b=200 (A)) ⋈A.c=B.d (B.e=50 (B))
17
Storing the Generated Atomic Query Parts
• For each generated atomic query part Pa
– Insert Pa into a collection Caqp of atomic query parts
– Remove from Caqp all previously stored atomic query parts that are covered by Pa
• See paper for details of the coverage checking algorithm
18
When Getting a New Query Q
• Break Q into a set of atomic query parts
• For each such atomic query part Pa, check whether some atomic query part Ai in Caqp covers Pa
• If such an Ai exists for each Pa, report that Q will return an empty result set without executing Q
19
Outline
• Limitations of previous approaches
• Fast detection method for empty result queries
• Some experiments
20
Setup
• Testing environment– PostgreSQL 7.3.4– Windows XP OS– Dell Inspiron 8500 PC with one 2.2GHz CPU,
512MB memory, one 40GB disk
• TPC-R benchmark• See paper for detection probability analysis
21
Overhead Experiment
• Query Q1: Find the information about certain parts that were sold on certain days
select * from orders o, lineitem lwhere o.orderkey=l.orderkey and
(o.orderdate=d1 or … or o.orderdate=de) and (l.partkey=p1 or … or l.partkey=pf);
22
Overhead Experiment – Cont.
• Query Q2: Find the information about certain parts that were sold to certain customers on certain days
select * from orders o, lineitem l, customer cwhere o.orderkey=l.orderkey and o.custkey=c.custkey and
(o.orderdate=d1 or … or o.orderdate=de) and (l.partkey=p1 or … or l.partkey=pf) and(c.nationkey=n1 or … or c.nationkey=ng);
23
Overhead Experiment – Cont.• The overhead of our method increases with both query
complexity and the number of atomic query parts stored in Caqp • When check fails, the overhead of our method is higher than that
when check succeeds
0
0.002
0.004
0.006
0.008
1000 2000 3000
number of atomic query parts in Caqp
over
head
(sec
ond)
Q1, check succeedsQ1, check failsQ2, check succeedsQ2, check fails
24
Overhead Experiment – Cont.• The overhead of our method is trivial compared to
query execution overhead
0.001
0.01
0.1
1
10
100
1000
1 2 3database size (GB)
exec
utio
n tim
e or
ove
rhea
d (s
econ
d)
execute Q1check Q1execute Q2check Q2
25
Summary
• Provide a fast detection method for empty result queries– Low overhead– High detection probability once enough information
has been accumulated
26
Open Issues
• In the presence of update, correctly preserve as much stored information as possible
• A hybrid method that can combine the advantages of both our method and the existing solutions
• More aggressive storage saving technique