AnHai Doan & Alon Halevy
Department of Computer Science & Engineering
University of Washington
Efficiently Ordering Query Plans for Data Integration
2
Data Integration Challenge
Find Olympus cameras on sale and their reviews
(Brand, Cameras) (Olympus, C-3000)
(Cameras, Reviews) (C-3000, review-article-1)
TARGET.COM
WAL-MART.COM
BESTBUY.COM
EPINIONS.COM
DPREVIEW.COM
CONSUMERREPORTS.ORG
3
Architecture of a Data Integration System
Query Optimizer
Execution Engine
Query Reformulator
Find Olympus cameras on sale and their reviews
logical query plans
physical query execution plans
TARGET.COM EPINIONS.COM
TARGET.COM DPREVIEW.COM
TARGET.COM CONSUMERREPORTS.COM
WAL-MART.COM EPINIONS.COM
BESTBUY.COM CONSUMERREPORTS.COM
Answers = UNION of outputs of all logical query plans
Must execute multiple plans!
4
Ordering Query Plans
Time to & quality of first answers is important!
– executing all plans is expensive or infeasible
– plans tend to vary significantly in their utility
  – coverage, execution time, monetary cost, ...
Solution
– find query plans in decreasing order of utility
– execute best plans first
– abort query execution as soon as
  – satisfactory answer is found, or
  – resource limits have been reached
5
Our Contributions
Formally defined plan-ordering problem
– does not assume any specific utility measure
– models dependencies among plans
Developed three efficient solutions
– GREEDY: exploits utility monotonicity
– iDRIPS: exploits source similarity
– STREAMER: exploits source similarity, plan independence, utility-diminishing returns
– work with a broad range of utility measures
– find the best plans very fast
6
Problem Definition
Utility measure
– plan coverage: number of new answers returned by a plan
– execution time, monetary fee
– plan utility depends on plans previously executed!
Plan-ordering problem
– modify query reformulator so that, given user query and utility measure, it outputs
  – best plan p1
  – next best plan p2, assuming p1 has been executed
  – next best plan p3, assuming p1 & p2 have been executed, ...
– focus on finding first few best plans
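The problem statement above can be illustrated with a brute-force sketch. It is not one of the three algorithms in this talk: it simply re-evaluates every remaining plan after each step, which is what the efficient solutions are meant to avoid. Utility is assumed to be coverage (new answers only), and the answer sets and the function name `order_plans` are hypothetical.

```python
def order_plans(plans, k):
    """Return the first k plans in decreasing order of conditional utility.

    `plans` maps a plan name to the set of answers it returns (hypothetical data).
    Utility is coverage: the number of answers not already returned.
    """
    remaining = dict(plans)
    seen = set()          # answers returned by the plans chosen so far
    ordering = []
    while remaining and len(ordering) < k:
        # Re-evaluate every remaining plan given the plans executed so far.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - seen))
        ordering.append((best, len(remaining[best] - seen)))
        seen |= remaining.pop(best)
    return ordering

plans = {
    "V1V4": {"a", "b", "c"},
    "V1V5": {"b", "c", "d"},
    "V3V6": {"e"},
}
print(order_plans(plans, 3))
```

Taken alone, V1V5 has coverage 3, but once V1V4 has been executed it contributes only one new answer. This is the dependence on previously executed plans that the utility measure has to capture.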
7
Current Query Reformulator: the Bucket Algorithm [Levy et al., VLDB-96]
Collect sources into buckets
– sources in a bucket can return answer to a certain part of query
Take cross product of buckets
– to form logical query plans
Find Olympus cameras on sale and their reviews
Bucket B1        Bucket B2
V1: TARGET       V4: EPINIONS
V2: WAL-MART     V5: DPREVIEW
V3: BESTBUY      V6: CONSUMERREPORT
Plans: V1V4, V1V5, ..., V3V5, V3V6
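The cross-product step can be sketched in a few lines. This is a minimal illustration using the bucket contents from the slide; the function name `logical_plans` is hypothetical, and the sketch omits any check of whether a given combination is actually a valid plan for the query.

```python
from itertools import product

buckets = [
    ["V1", "V2", "V3"],   # B1: TARGET, WAL-MART, BESTBUY
    ["V4", "V5", "V6"],   # B2: EPINIONS, DPREVIEW, CONSUMERREPORT
]

def logical_plans(buckets):
    """Form one candidate plan per combination of one source from each bucket."""
    return ["".join(combo) for combo in product(*buckets)]

plans = logical_plans(buckets)
print(plans)
```

With two buckets of three sources each this yields nine candidate plans. The count grows multiplicatively with the number and size of buckets, which is why executing all plans quickly becomes expensive.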
8
Our GREEDY Algorithm
Properties
– linear run time
– broadly applicable
  – many practical utility measures are monotonic [Yerneni et al., EDBT-98]
Utility monotonicity
– holds if replacing a source by a “better” source yields a better plan
  – e.g., cost(ViVj) = cost(Vi) + cost(Vj)
Finds best plan
– by local comparison of sources
Removes best plan & finds next best plan, ...
[Figure: buckets B1 (V1, V2, V3) and B2 (V4, V5, V6); best plan: V1 V5]
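A minimal sketch of ordering plans under the additive cost from the slide, cost(ViVj) = cost(Vi) + cost(Vj). The best plan is found by picking the cheapest source in each bucket. The next-best plans are enumerated here with a standard best-first search over the sorted buckets, which is an assumption of this sketch and not necessarily the procedure used by GREEDY. All cost values and the name `greedy_order` are hypothetical; costs are chosen so that V1 V5 is the best plan, as in the figure.

```python
import heapq

def greedy_order(buckets, k):
    """Return up to k plans in increasing total cost.

    `buckets` is a list of dicts mapping source name -> cost (hypothetical numbers).
    Assumes the additive cost: cost(ViVj) = cost(Vi) + cost(Vj).
    """
    # Local comparison: rank the sources inside each bucket independently.
    ranked = [sorted(b.items(), key=lambda kv: kv[1]) for b in buckets]

    def cost(idx):
        return sum(ranked[i][j][1] for i, j in enumerate(idx))

    start = (0,) * len(ranked)          # cheapest source from every bucket
    heap = [(cost(start), start)]
    seen = {start}
    out = []
    while heap and len(out) < k:
        c, idx = heapq.heappop(heap)
        out.append((tuple(ranked[i][j][0] for i, j in enumerate(idx)), c))
        # Successors: replace one source by the next-best source in its bucket.
        for i in range(len(idx)):
            nxt = idx[:i] + (idx[i] + 1,) + idx[i + 1:]
            if nxt[i] < len(ranked[i]) and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (cost(nxt), nxt))
    return out

b1 = {"V1": 2, "V2": 3, "V3": 6}
b2 = {"V4": 4, "V5": 1, "V6": 6}
result = greedy_order([b1, b2], 3)
print(result)
```

Because the cost is monotonic, no plan can be cheaper than the one built from the locally best sources, so the first plan needs no search over the cross product.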
9
Source Similarity
Two sources are similar
– if replacing one by the other changes plan utility very little
Large domains often have many similar sources
– similar in monetary fee, access time, coverage, etc.
Key idea
– similar sources can be grouped and treated as a single source
V1: time = 2, fee = 3
V2: time = 3, fee = 4
V4: time = 1, fee = 2
V1 V4: time = 3, fee = 5
V2 V4: time = 4, fee = 6
Abstract source V12: time = [2,3], fee = [3,4]
Abstract plan V12 V4: time = [3,4], fee = [5,6]
utility(V1 V4) = 0.5
utility(V2 V4) = 0.7
utility(V12 V4) = [0.4,0.7]
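The interval arithmetic behind the abstract source and abstract plan can be reproduced directly from the numbers on this slide. The helper names `hull` and `add` are hypothetical. The sketch covers only the time and fee attributes, since the slide does not give the utility function that produces the utility interval.

```python
def hull(*values):
    """Smallest interval containing all the given numbers."""
    return (min(values), max(values))

def add(a, b):
    """Interval addition: [a_lo + b_lo, a_hi + b_hi]."""
    return (a[0] + b[0], a[1] + b[1])

v1 = {"time": 2, "fee": 3}
v2 = {"time": 3, "fee": 4}
v4 = {"time": 1, "fee": 2}

# Abstract source V12: each attribute becomes the interval spanned by V1 and V2.
v12 = {attr: hull(v1[attr], v2[attr]) for attr in v1}
# A concrete source is a degenerate interval [x, x].
v4_iv = {attr: (v4[attr], v4[attr]) for attr in v4}
# Abstract plan V12 V4: add the attribute intervals.
v12_v4 = {attr: add(v12[attr], v4_iv[attr]) for attr in v12}
print(v12, v12_v4)
```

The resulting intervals contain the attribute values of both concrete plans V1 V4 and V2 V4, so any bound derived for the abstract plan also holds for each plan it represents.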
10
Grouping Sources to Find Best Plan: DRIPS Algorithm [Haddawy et al., UAI-95]
[Figure, panel "Source Grouping": in bucket B1, V123 splits into V12 (V1, V2) and V3; in bucket B2, V456 splits into V4 and V56 (V5, V6)]
[Figure, panel "Branch & Bound Search": starts from abstract plan V123 V456 and refines it into V12 V456 and V3 V456, then into V1 V456, V2 V456, V3 V4 and V3 V56; utility labels shown: [0.5, 0.8], [0.4, 0.6], [0.1, 0.3], 0.8, [0.6, 0.7], [0.1, 0.7]]
[Figure, panel "Dominance graph": over plans V3 V4, V3 V56, V1 V456, V2 V456]
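The search over abstract plans can be sketched as a simplified best-first branch and bound. This is written in the spirit of the slide and is not the DRIPS algorithm of Haddawy et al. as published. It assumes utility is the negated total time, so an abstract plan's upper bound uses the lower end of each time interval. The times for V1, V2 and V4 come from the previous slide; the times for V3, V5 and V6 and all function names are hypothetical.

```python
import heapq
from itertools import count

def leaf(name, time):
    """A concrete source: its time interval is a single point."""
    return {"name": name, "lo": time, "hi": time, "kids": []}

def group(name, kids):
    """An abstract source: its time interval spans all of its members."""
    return {"name": name,
            "lo": min(k["lo"] for k in kids),
            "hi": max(k["hi"] for k in kids),
            "kids": kids}

def upper_bound(plan):
    # Assumed utility = -(total time); the best case uses each lower time bound.
    return -sum(s["lo"] for s in plan)

def best_plan(roots):
    tie = count()  # tie-breaker so the heap never compares plans directly
    heap = [(-upper_bound(roots), next(tie), roots)]
    while heap:
        neg_ub, _, plan = heapq.heappop(heap)
        abstract = [i for i, s in enumerate(plan) if s["kids"]]
        if not abstract:
            # Concrete plan whose exact utility is at least every remaining bound.
            return [s["name"] for s in plan], -neg_ub
        i = abstract[0]
        # Refine: replace one abstract source by each of its children.
        for kid in plan[i]["kids"]:
            child = plan[:i] + [kid] + plan[i + 1:]
            heapq.heappush(heap, (-upper_bound(child), next(tie), child))
    return None

b1 = group("V123", [group("V12", [leaf("V1", 2), leaf("V2", 3)]), leaf("V3", 5)])
b2 = group("V456", [leaf("V4", 1), group("V56", [leaf("V5", 4), leaf("V6", 6)])])
print(best_plan([b1, b2]))
```

In this run the abstract plans V3 V456 and V1 V56 are never refined, because their upper bounds are below the utility of the concrete plan that is returned. Grouping pays off when whole groups of plans can be discarded this way.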
11
Extending DRIPS: iDRIPS & STREAMER
iDRIPS (iterative DRIPS)
– applies DRIPS to find best plan
– removes best plan, re-groups sources
– applies DRIPS to find second best plan, ...
Observation
– iDRIPS may re-establish dominance relations many times
Challenge: recycle dominance relations
Solution: STREAMER
– applicable when utility-diminishing returns holds
– exploits plan independence
[Figure: dominance graph over plans V2 V456, V3 V4, V3 V56, V1 V456]
12
The STREAMER Algorithm
[Figure: plans in the first iteration and the second iteration; abstract plans V1 V456, V2 V456, V3 V56 and concrete plans V3 V4, V2 V4, V2 V5, V2 V6, V1 V4, V1 V5, V1 V6]
Annotation: a dominance relation is still true if utility-diminishing returns holds and V3 V4 is independent of V1 V456
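The two conditions named in the annotation can be illustrated with coverage as the utility measure. This is a sketch under assumptions, not the definitions from the talk: diminishing returns is read here as "executing more plans never raises a plan's utility", and independence as "executing one plan leaves the other plan's utility unchanged". The answer sets and the name `marginal` are hypothetical.

```python
def marginal(plan_answers, executed):
    """Coverage utility of a plan given the answers already returned."""
    return len(plan_answers - executed)

# Hypothetical answer sets for three concrete plans.
v3v4 = {"r1", "r2"}
v1v4 = {"r3", "r4", "r5"}
v1v5 = {"r4", "r5", "r6"}

# Diminishing returns: the utility of V1 V5 drops once V1 V4 has been executed.
before = marginal(v1v5, set())
after = marginal(v1v5, v1v4)

# Independence: V3 V4 shares no answers with V1 V4, so its utility is unchanged.
unchanged = marginal(v3v4, set()) == marginal(v3v4, v1v4)
print(before, after, unchanged)
```

Under these two readings, a comparison established in an earlier iteration between an unaffected plan and a plan whose utility can only shrink does not need to be recomputed.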
13
Summary & Experiments
Empirical evaluation of iDRIPS and STREAMER
– seven non-monotonic utility classes
– for five classes: source grouping worked
  – both algorithms found first 100 plans very fast
  – STREAMER outperformed iDRIPS (when it is applicable)
Algorithms   Applicable when                 Evaluation
GREEDY       utility monotonicity            O(nm^2k^2)
iDRIPS       source similarity               empirical
STREAMER     source similarity,              empirical
             utility-diminishing returns,
             plan independence
14
Related Work
Query reformulation algorithms
– BUCKET [Levy et al., VLDB-96], INVERSE-RULE [Duschka & Genesereth, PODS-97], MINICON [Pottinger & Levy, VLDB-00]
– our solutions generalize to all of these
Ordering query plans
– [Levy et al., AAAI-96], [Florescu et al., VLDB-97], [Naumann et al., VLDB-99], [Leser & Naumann, FQAS-00], ...
– only considered in restricted settings
Query optimization
– many works at all levels
– most works optimize cost to get all answers
15
Conclusions
Ordering query plans is important & difficult
Contributions
– formally defined problem
– identified interesting problem properties
  – utility monotonicity
  – source similarity
  – plan independence
  – utility-diminishing returns
– developed 3 solutions: GREEDY, iDRIPS, STREAMER
– solutions can handle a broad range of utility measures
– showed that solutions find best plans very fast