AnHai Doan & Alon Halevy Department of Computer Science & Engineering University of Washington Efficiently Ordering Query Plans for Data Integration

AnHai Doan & Alon Halevy

Department of Computer Science & Engineering

University of Washington

Efficiently Ordering Query Plans for Data Integration

Data Integration Challenge

Find Olympus cameras on sale and their reviews

Sources exporting (Brand, Cameras), e.g., (Olympus, C-3000): TARGET.COM, WAL-MART.COM, BESTBUY.COM

Sources exporting (Cameras, Reviews), e.g., (C-3000, review-article-1): EPINIONS.COM, DPREVIEW.COM, CONSUMERREPORTS.ORG

Architecture of a Data Integration System

Find Olympus cameras on sale and their reviews

Query Reformulator → logical query plans → Query Optimizer → physical query execution plans → Execution Engine

Logical query plans
– TARGET.COM, EPINIONS.COM
– TARGET.COM, DPREVIEW.COM
– TARGET.COM, CONSUMERREPORTS.ORG
– WAL-MART.COM, EPINIONS.COM
– ...
– BESTBUY.COM, CONSUMERREPORTS.ORG

Answers = UNION of outputs of all logical query plans

Must execute multiple plans!

Ordering Query Plans

Time to & quality of first answers are important!
– executing all plans is expensive or infeasible
– plans tend to vary significantly in their utility
    – coverage, execution time, monetary cost, ...

Solution
– find query plans in decreasing order of utility
– execute best plans first
– abort query execution as soon as
    – satisfactory answer is found, or
    – resource limits have been reached
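The execution loop described above can be sketched as follows. This is a minimal illustration, not the system's actual engine: the function name, the `enough_answers` and `max_plans` parameters, and the representation of a plan as a callable returning a set of tuples are all assumptions made for the example.

```python
from typing import Callable, Iterable, Set, Tuple

def run_ordered_plans(
    plans: Iterable[Callable[[], Set[Tuple]]],
    enough_answers: int,   # hypothetical "satisfactory answer" threshold
    max_plans: int,        # hypothetical resource limit
) -> Set[Tuple]:
    """Execute plans in the given (best-first) order and stop early."""
    answers: Set[Tuple] = set()
    for executed, plan in enumerate(plans):
        if executed >= max_plans:            # resource limit reached
            break
        answers |= plan()                    # answers = union of plan outputs
        if len(answers) >= enough_answers:   # satisfactory answer found
            break
    return answers

# Three toy plans, already ordered by decreasing utility.
toy_plans = [
    lambda: {("C-3000", "review-article-1")},
    lambda: {("C-3000", "review-article-2")},
    lambda: {("C-3000", "review-article-3")},
]
first_two = run_ordered_plans(toy_plans, enough_answers=2, max_plans=10)
```

Because the plans arrive in decreasing order of utility, stopping after the second plan loses only the least useful work.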

Our Contributions

Formally defined plan-ordering problem
– does not assume any specific utility measure
– models dependencies among plans

Developed three efficient solutions
– GREEDY: exploits utility monotonicity
– iDRIPS: exploits source similarity
– STREAMER: exploits source similarity, plan independence, utility-diminishing returns
– work with a broad range of utility measures
– find the best plans very fast

Problem Definition

Utility measure
– plan coverage: number of new answers returned by a plan
– execution time, monetary fee
– plan utility depends on plans previously executed!

Plan-ordering problem
– modify query reformulator so that, given user query and utility measure, it outputs
    – best plan p1
    – next best plan p2, assuming p1 has been executed
    – next best plan p3, assuming p1 & p2 have been executed, ...
– focus on finding first few best plans
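To make the dependence on previously executed plans concrete, here is a small sketch that orders plans by coverage. It scans every remaining plan at each step, which is exactly the exhaustive work the algorithms in this talk aim to avoid; it only specifies the output being sought. The plan names and answer sets are made up for illustration.

```python
def order_plans(plan_answers, k):
    """Return the first k plans, each the best given those already executed.

    Utility is coverage: the number of answers a plan returns that no
    previously chosen plan has returned. Ties go to the plan whose name
    sorts first.
    """
    remaining = dict(plan_answers)
    seen = set()
    ordering = []
    while remaining and len(ordering) < k:
        best = max(sorted(remaining), key=lambda p: len(remaining[p] - seen))
        ordering.append(best)
        seen |= remaining.pop(best)
    return ordering

# Hypothetical answer sets (answers are just integers here).
toy = {"V1V4": {1, 2, 3}, "V1V5": {2, 3, 4}, "V3V6": {5, 6}}
first_two = order_plans(toy, 2)   # ["V1V4", "V3V6"]
```

Taken alone, V1V5 returns more answers than V3V6, but once V1V4 has been executed V1V5 contributes only one new answer, so V3V6 becomes the next best plan.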

Current Query Reformulator: the Bucket Algorithm [Levy et al., VLDB-96]

Collect sources into buckets
– sources in a bucket can return answer to a certain part of query

Take cross product of buckets
– to form logical query plans

Find Olympus cameras on sale and their reviews

Bucket B1          Bucket B2
V1: TARGET         V4: EPINIONS
V2: WAL-MART       V5: DPREVIEW
V3: BESTBUY        V6: CONSUMERREPORTS

Logical query plans: V1V4, V1V5, ..., V3V5, V3V6
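The cross-product step can be sketched in a few lines. The bucket contents follow the slide; the function name is an assumption, and the sketch omits the check the bucket algorithm performs on each combination before accepting it as a plan.

```python
from itertools import product

# Buckets as on the slide: one bucket per part of the query.
buckets = {
    "B1": ["V1", "V2", "V3"],   # sources for Olympus cameras on sale
    "B2": ["V4", "V5", "V6"],   # sources for reviews
}

def logical_plans(buckets):
    """Form candidate logical plans: one source drawn from each bucket."""
    return list(product(*buckets.values()))

plans = logical_plans(buckets)   # 3 x 3 = 9 candidate plans
```

The number of candidates is the product of the bucket sizes, which is why executing all of them quickly becomes infeasible as sources are added.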

Our GREEDY Algorithm

Properties
– linear run time
– broadly applicable
    – many practical utility measures are monotonic [Yerneni et al., EDBT-98]

Utility monotonicity
– holds if replacing a source by a “better” source yields a better plan
    – e.g., cost(ViVj) = cost(Vi) + cost(Vj)

Finds best plan
– by local comparison of sources

Removes best plan & finds next best plan, ...

Figure: sources V1, V2, V3 in bucket B1 and V4, V5, V6 in bucket B2; plan V1 V5.
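The following sketch shows how monotonicity lets plans be produced in order without scoring the whole cross product. It is not the GREEDY algorithm from the talk, whose details the slides do not give; it is a standard best-first enumeration, assuming an additive cost where lower is better. The source costs are hypothetical.

```python
import heapq

def plans_by_cost(buckets):
    """Yield (cost, plan) in nondecreasing total cost, where
    cost(plan) is the sum of its source costs (a monotonic measure)."""
    # Local comparison: rank the sources of each bucket by their own cost.
    ranked = [sorted(b.items(), key=lambda kv: kv[1]) for b in buckets]

    def cost(idx):
        return sum(ranked[b][i][1] for b, i in enumerate(idx))

    start = (0,) * len(ranked)          # best source from every bucket
    heap = [(cost(start), start)]
    seen = {start}
    while heap:
        c, idx = heapq.heappop(heap)
        yield c, tuple(ranked[b][i][0] for b, i in enumerate(idx))
        # Successors replace one source by the next-worse one in its bucket.
        for b in range(len(idx)):
            nxt = idx[:b] + (idx[b] + 1,) + idx[b + 1:]
            if nxt[b] < len(ranked[b]) and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (cost(nxt), nxt))

# Hypothetical per-source costs for buckets B1 and B2.
example_buckets = [
    {"V1": 2, "V2": 3, "V3": 5},
    {"V4": 4, "V5": 1, "V6": 6},
]
best_cost, best_plan = next(plans_by_cost(example_buckets))   # (3, ("V1", "V5"))
```

The best plan falls out of sorting each bucket independently; later plans only require looking at neighbors of plans already produced.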

Source Similarity

Two sources are similar
– if replacing one by the other changes plan utility very little

Large domains often have many similar sources
– similar in monetary fee, access time, coverage, etc.

Key idea
– similar sources can be grouped and treated as a single source

Concrete sources and plans
– V1: time = 2, fee = 3
– V2: time = 3, fee = 4
– V4: time = 1, fee = 2
– V1 V4: time = 3, fee = 5; utility(V1 V4) = 0.5
– V2 V4: time = 4, fee = 6; utility(V2 V4) = 0.7

Abstract source
– V12: time = [2,3], fee = [3,4]

Abstract plan
– V12 V4: time = [3,4], fee = [5,6]; utility(V12 V4) = [0.4,0.7]
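The interval values on this slide can be reproduced with simple interval arithmetic. The sketch below assumes, as the slide's numbers suggest, that time and fee add up across the sources of a plan; the function names are illustrative, and the utility function itself is not specified on the slide, so it is left out.

```python
# Source attributes as given on the slide.
SOURCES = {
    "V1": {"time": 2, "fee": 3},
    "V2": {"time": 3, "fee": 4},
    "V4": {"time": 1, "fee": 2},
}

def abstract(names):
    """Group sources into one abstract source: each attribute becomes (min, max)."""
    attrs = SOURCES[names[0]].keys()
    return {
        a: (min(SOURCES[n][a] for n in names), max(SOURCES[n][a] for n in names))
        for a in attrs
    }

def join(x, y):
    """Combine two (abstract) sources into a plan by adding interval endpoints."""
    return {a: (x[a][0] + y[a][0], x[a][1] + y[a][1]) for a in x}

v12 = abstract(["V1", "V2"])          # time (2, 3), fee (3, 4)
v12_v4 = join(v12, abstract(["V4"]))  # time (3, 4), fee (5, 6)
```

Every concrete plan covered by the abstract plan has attributes inside these intervals, so a bound computed once for V12 V4 applies to both V1 V4 and V2 V4.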

Grouping Sources to Find Best Plan: DRIPS Algorithm [Haddawy et al., UAI-95]

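Source Grouping
– V123 splits into V12 and V3; V12 splits into V1 and V2
– V456 splits into V4 and V56; V56 splits into V5 and V6

Branch & Bound Search
– V123 V456 is refined into V12 V456 and V3 V456
– V12 V456 is refined into V1 V456 and V2 V456
– V3 V456 is refined into V3 V4 and V3 V56
– utility values shown in the figure: [0.5, 0.8], [0.4, 0.6], [0.1, 0.3], 0.8, [0.6, 0.7], [0.1, 0.7]

Dominance graph over the sources in buckets B1 and B2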
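The search on this slide can be illustrated with a small branch-and-bound over abstract plans. This is a sketch of the idea only, not DRIPS as published: the per-source costs, the utility (negated total cost), and the halving rule used to split groups are all assumptions chosen to keep the example short.

```python
# Hypothetical per-source costs (lower is better); not taken from the slides.
COST = {"V1": 5, "V2": 7, "V3": 4, "V4": 3, "V5": 6, "V6": 2}

def bounds(plan):
    """Utility interval (lo, hi) of an abstract plan; utility = -(sum of costs)."""
    lo = -sum(max(COST[s] for s in group) for group in plan)
    hi = -sum(min(COST[s] for s in group) for group in plan)
    return lo, hi

def split(group):
    """Split a source group in two (stand-in for similarity-based grouping)."""
    mid = len(group) // 2
    return group[:mid], group[mid:]

def drips_best(buckets):
    """Return a best concrete plan by refining and pruning abstract plans."""
    frontier = [tuple(tuple(b) for b in buckets)]   # one fully abstract plan
    while True:
        # Prune plans dominated by another plan's guaranteed utility.
        best_lo = max(bounds(p)[0] for p in frontier)
        frontier = [p for p in frontier if bounds(p)[1] >= best_lo]
        abstract_plans = [p for p in frontier if any(len(g) > 1 for g in p)]
        if not abstract_plans:
            return max(frontier, key=lambda p: bounds(p)[0])
        # Refine the most promising abstract plan along its first abstract group.
        p = max(abstract_plans, key=lambda q: bounds(q)[1])
        frontier.remove(p)
        i = next(i for i, g in enumerate(p) if len(g) > 1)
        for half in split(p[i]):
            frontier.append(p[:i] + (half,) + p[i + 1:])

best = drips_best([["V1", "V2", "V3"], ["V4", "V5", "V6"]])
```

A plan whose upper bound falls below another plan's lower bound is discarded together with every concrete plan it stands for, so many plans are never scored individually.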

Extending DRIPS: iDRIPS & STREAMER

iDRIPS (iterative DRIPS)
– applies DRIPS to find best plan
– removes best plan, re-groups sources
– applies DRIPS to find second best plan, ...

Observation
– iDRIPS may re-establish dominance relations many times

Challenge: recycle dominance relations

Solution: STREAMER
– applicable when utility-diminishing returns holds
– exploits plan independence

Figure: dominance relations among the plans V1 V456, V2 V456, V3 V4, and V3 V56.

The STREAMER Algorithm

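First Iteration, Second Iteration

Figure: dominance graphs for the two iterations, over the abstract plans V1 V456, V2 V456, and V3 V56 and the concrete plans V3 V4, V1 V4, V1 V5, V1 V6, V2 V4, V2 V5, and V2 V6.

Annotation on a dominance relation: still true if utility-diminishing returns holds and V3 V4 is independent of V1 V456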
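The condition in the annotation can be illustrated with coverage as the utility measure. The slides do not define the two properties, so the definitions in the code are assumptions: a plan is independent of another if executing the other leaves its utility unchanged, and utility-diminishing returns means executing more plans never raises a plan's utility. One reading of the annotation is that a dominance relation survives when the winner is independent of the newly executed plan and the loser can only get worse. Plan names and answer sets are made up.

```python
def gain(plan, executed, answers):
    """Coverage utility: answers of `plan` not already returned by `executed`."""
    seen = set().union(*(answers[p] for p in executed))
    return len(answers[plan] - seen)

def independent(p, q, answers):
    """Assumed definition: executing q does not change the utility of p."""
    return gain(p, [q], answers) == gain(p, [], answers)

def diminishing_returns(p, executed, extra, answers):
    """Assumed definition: executing one more plan never raises p's utility."""
    return gain(p, executed + [extra], answers) <= gain(p, executed, answers)

def dominance_survives(winner, loser, newly_executed, answers):
    """Can 'winner dominates loser' be reused after executing another plan?"""
    return (independent(winner, newly_executed, answers)
            and diminishing_returns(loser, [], newly_executed, answers))

# Hypothetical answer sets for three concrete plans.
ANSWERS = {"V3V4": {1, 2}, "V1V5": {3, 4, 5}, "V3V5": {2, 6}}
reusable = dominance_survives("V1V5", "V3V5", "V3V4", ANSWERS)
```

Here V1V5 beats V3V5 before V3V4 runs (3 new answers against 2) and still beats it afterwards (3 against 1), so the relation need not be recomputed.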

Summary & Experiments

Empirical evaluation of iDRIPS and STREAMER
– seven non-monotonic utility classes
– for five classes: source grouping worked
    – both algorithms found first 100 plans very fast
    – STREAMER outperformed iDRIPS (when it is applicable)

Algorithm   Applicable when                                   Evaluation
GREEDY      utility monotonicity                              O(n m^2 k^2)
iDRIPS      source similarity                                 empirical
STREAMER    source similarity, utility-diminishing returns,   empirical
            plan independence

Related Work

Query reformulation algorithms
– BUCKET [Levy et al., VLDB-96], INVERSE-RULE [Duschka & Genesereth, PODS-97], MINICON [Pottinger & Levy, VLDB-00]
– our solutions generalize to all of these

Ordering query plans
– [Levy et al., AAAI-96], [Florescu et al., VLDB-97], [Naumann et al., VLDB-99], [Leser & Naumann, FQAS-00], ...
– only considered in restricted settings

Query optimization
– many works at all levels
– most works optimize cost to get all answers

Conclusions

Ordering query plans is important & difficult

Contributions
– formally defined problem
– identified interesting problem properties
    – utility monotonicity
    – source similarity
    – plan independence
    – utility-diminishing returns
– developed 3 solutions: GREEDY, iDRIPS, STREAMER
– solutions can handle a broad range of utility measures
– showed that solutions find best plans very fast