
Factorizing Complex Predicates in Queries to Exploit Indexes

Prasanna Ganesan* (Stanford University)

Surajit Chaudhuri (Microsoft Research)

Sunita Sarawagi* (IIT Bombay)

*Work done at Microsoft Research

Motivation

• Complex, redundant WHERE clauses
  – Application-generated decision-support queries

• May result in unsatisfactory plans
  – So many candidate plans, so little time
  – Conversion to normal form doesn’t work
  – Often end up with a table scan

• Goal: Techniques for efficiently identifying access paths for such complex WHERE clauses

Outline

• Why is the problem challenging?
  – CNF or DNF does not avoid redundancy
  – Plan space is large
  – Formal problem statement
  – Challenges

• Factorization
  – Basic factorization: “largest common conjunctive factor”
  – Factorization involving union

• Approximate Factorization

• Experiments

Basic Primitives (Index Intersection)

• SELECT addr FROM consumers WHERE (income>100000) AND (zipcode=94305)

Plan: Seek(income>100000), Seek(zipcode=94305) → Index intersect (∩) → Lookup(addr)

Basic Primitives (Index Union)

• SELECT addr FROM consumers WHERE (income>100000) OR (zipcode=94301)

Plan: Seek(income>100000), Seek(zipcode=94301) → Index union (∪) → Lookup(addr)
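The two primitives can be illustrated with RID sets. This is only a minimal sketch: the toy `consumers` rows and the helper names `seek` and `lookup` are assumptions made for illustration, and a real engine would use index structures rather than a scan inside `seek`.

```python
# Toy table mapping RID -> row. The data and helper names are illustrative assumptions.
consumers = {
    1: {"income": 150000, "zipcode": 94305, "addr": "1 Palm Dr"},
    2: {"income": 90000, "zipcode": 94305, "addr": "2 Oak St"},
    3: {"income": 200000, "zipcode": 94301, "addr": "3 Elm Ave"},
    4: {"income": 50000, "zipcode": 94301, "addr": "4 Pine Rd"},
}


def seek(predicate):
    """Stand-in for an index seek: return the set of RIDs satisfying the predicate."""
    return {rid for rid, row in consumers.items() if predicate(row)}


def lookup(rids, column):
    """Data lookup from a RID list: fetch one column, in RID order."""
    return [consumers[rid][column] for rid in sorted(rids)]


high_income = seek(lambda r: r["income"] > 100000)
zip_94305 = seek(lambda r: r["zipcode"] == 94305)
zip_94301 = seek(lambda r: r["zipcode"] == 94301)

# Index intersection answers the AND query; index union answers the OR query.
and_addrs = lookup(high_income & zip_94305, "addr")
or_addrs = lookup(high_income | zip_94301, "addr")
```

In both plans the base table is touched only for the RIDs that survive the set operation.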


Index Intersection and Union

• (A AND B) OR (A AND C), written AB+AC

Plan for AB+AC: Seek(A), Seek(B), Seek(A), Seek(C) → Data Lookup

• Factored form: A(B+C), which mentions A only once

Index Intersection and Union

• (A OR B) AND (A OR C), written (A+B)(A+C)

Plan for (A+B)(A+C): Seek(A), Seek(B), Seek(A), Seek(C) → Data Lookup

• Factored form: A+BC, which mentions A only once
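Both rewritings are ordinary Boolean identities, which a truth-table check confirms. The sketch below also counts literal occurrences as a rough proxy for the number of index seeks; treating one literal as one seek is an assumption for illustration, not a cost model from the talk.

```python
from itertools import product


def equivalent(f, g, n_vars):
    """True if f and g agree on every assignment of n_vars booleans."""
    return all(f(*v) == g(*v) for v in product([False, True], repeat=n_vars))


# AB + AC == A(B + C)
and_factored = equivalent(
    lambda a, b, c: (a and b) or (a and c),
    lambda a, b, c: a and (b or c),
    3,
)

# (A + B)(A + C) == A + BC
or_factored = equivalent(
    lambda a, b, c: (a or b) and (a or c),
    lambda a, b, c: a or (b and c),
    3,
)


def seek_count(expr):
    """Count literal occurrences, assuming one index seek per occurrence."""
    return sum(ch.isalpha() for ch in expr)


saved = seek_count("AB+AC") - seek_count("A(B+C)")
```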

The Problem

• Given a relation R, find a plan to retrieve π_S(σ_P(R)) using one or more of
  – Table scan
  – Index seeks/scans
  – Index intersections
  – Index unions
  – Data lookup from RID lists

• Focus on single-table selection
  – Naturally extends to arbitrary queries

Challenges

• Understanding the set of all feasible plans
  – Many equivalent rewritings
  – Can we rewrite to retrieve a superset?

• Identifying the “best” plan
  – Different index characteristics
    • Impacts access cost
  – Different selectivities
    • Impacts intersection/union costs as well

Roadmap

[Figure: roadmap of the problem space along three axes. Plan Complexity: Intersection, +One Union, Arbitrary. Expn. Format: DNF, CNF, Arbitrary. Exact vs Approximate.]

Basic Factorization

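[Figure: AND-OR tree over the leaf predicates A, B, C, D, E, F. Each node is annotated with its set of conjunctive factors, computed bottom-up: {A,B,C}, {A,C}, {D,E}, {D,F} at the lowest AND nodes; {A,C} and {D} at the OR nodes; {A,C,D} at the root AND.]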
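The bottom-up annotation can be sketched as a short recursion: an AND node takes the union of its children's factor sets, and an OR node takes their intersection. The tuple-based tree encoding and the function name `factors` are assumptions for illustration, and the tree shape below (a root AND over two ORs, each over two ANDs) is one reading of the figure that reproduces its factor sets.

```python
from functools import reduce


def factors(node):
    """Return the set of predicates that are conjunctive factors of a subtree.

    A node is either a predicate name (str) or a tuple (op, children),
    where op is "AND" or "OR".
    """
    if isinstance(node, str):
        return {node}
    op, children = node
    child_sets = [factors(child) for child in children]
    if op == "AND":
        # Every factor of any child is a factor of the conjunction.
        return set().union(*child_sets)
    # OR: only predicates that are factors of every disjunct survive.
    return reduce(set.intersection, child_sets)


left = ("OR", [("AND", ["A", "B", "C"]), ("AND", ["A", "C"])])
right = ("OR", [("AND", ["D", "E"]), ("AND", ["D", "F"])])
tree = ("AND", [left, right])

root_factors = factors(tree)
```

The root set is the conjunctive factor handed to the optimizer; the rest of the expression is applied as a filter afterwards.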

Basic Factorization (Contd.)

• We now have a conjunctive factor (A AND C AND D)

• Use standard optimizer module to find plan for this factor
  – Table scan, index seek or index intersection
  – Typically a greedy algorithm based on index costs and selectivity

• Evaluate remaining conditions as a filter

Introducing the Union

• If query has conjunctive factors, simple factorization usually suffices

• Many queries don’t have such factors
  – Need to explore index unions

• Consider plans with at most one union
  – No index intersection above it
  – Sufficient for large set of practical queries
  – Limited space allows optimal algorithms

Single-Union Plans (1)

• Assume expression in Disjunctive Normal Form (DNF)

• E.g. E = ABC+ACD+ADG+DGH

• Consider factorizing E as f.Q+R
  – Find intersection plan for f
  – Recursively find single-union plan for R
  – Merge the two plans (re-use R’s union if it exists)

Single-Union Plans (2)

• E = ABC+ACD+ADG+DGH = f.Q+R

• Say f = AC. E = AC(B+D)+ADG+DGH

• Recursively factorize R into DG(A+H)

[Figure: plan diagram with the Q and R parts labelled. Operators shown: Seek(A), Seek(D), Seek(G), Lookup(Filter C(B+D)), Lookup(Filter A+H), Lookup(Filter E).]
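The split E = f.Q+R on a DNF expression can be sketched with sets of literals. The representation (a list of frozensets, one per conjunctive term) and the names `split_on_factor` and `show` are assumptions for illustration; choosing f is left to the caller here.

```python
def split_on_factor(dnf, f):
    """Rewrite a DNF expression E as f.Q + R.

    dnf: list of frozensets, each one conjunctive term.
    f:   frozenset of literals to factor out.
    Q holds the terms that contain f, with f removed (an empty term means TRUE);
    R holds the terms that do not contain f.
    """
    q = [term - f for term in dnf if f <= term]
    r = [term for term in dnf if not f <= term]
    return q, r


def show(dnf):
    """Render a DNF expression in the slides' notation, e.g. 'B+D'."""
    return "+".join("".join(sorted(term)) for term in dnf)


E = [frozenset(t) for t in ("ABC", "ACD", "ADG", "DGH")]

# First split with f = AC, then recursively split the remainder with f = DG.
Q, R = split_on_factor(E, frozenset("AC"))
Q2, R2 = split_on_factor(R, frozenset("DG"))
```

With these choices, E becomes AC(B+D) + DG(A+H), matching the slide.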

Single-Union Plans (3)

• $\mathrm{Cost}(E) \approx \min_{E = f.Q + R} \big(\mathrm{cost}(f.Q) + \mathrm{cost}(R)\big)$
  – Natural dynamic-programming formulation
  – Real equation slightly more complex
  – Cost is exponential

• Use a greedy alternative
  – Choose the f that provides greatest cost reduction without further factorizing R

Other Expression Forms

• Conjunctive Normal Form (CNF)
  – Can just use one term
  – Multiply terms, e.g. (A+B)(A+C) => A+BC
  – Recursive algorithm in paper

• General AND-OR trees
  – Bottom-up algorithm
  – Applies DNF algorithm to OR nodes and CNF algorithm to AND nodes

Approximate Factoring

• Often, predicates are similar but not identical
  – A(X BETWEEN 1 AND 100) + B(X BETWEEN 10 AND 110)
  – Like to exploit similarity of X predicates

• Relax both X predicates to (X BETWEEN 1 AND 110)
  – Resulting query is more general (assuming no NOTty problems, i.e. the relaxed predicates do not sit under a NOT)
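The relaxation on this slide can be checked on a small domain. The sketch assumes closed integer ranges stored as (low, high) tuples; the helper names are illustrative. It confirms that the relaxed, factored query (X BETWEEN 1 AND 110)(A+B) accepts every row the original accepts, and shows one false positive that a later filter has to remove.

```python
def relax(r1, r2):
    """Smallest closed range covering both input ranges."""
    return (min(r1[0], r2[0]), max(r1[1], r2[1]))


def in_range(x, r):
    return r[0] <= x <= r[1]


x1, x2 = (1, 100), (10, 110)
relaxed = relax(x1, x2)


def original(a, b, x):
    # A(X BETWEEN 1 AND 100) + B(X BETWEEN 10 AND 110)
    return (a and in_range(x, x1)) or (b and in_range(x, x2))


def relaxed_query(a, b, x):
    # (X BETWEEN 1 AND 110)(A + B): the relaxed X predicate is now a common factor.
    return in_range(x, relaxed) and (a or b)


# Every row accepted by the original query is accepted by the relaxed one.
is_superset = all(
    relaxed_query(a, b, x)
    for a in (False, True)
    for b in (False, True)
    for x in range(0, 121)
    if original(a, b, x)
)

# A row with A true, B false and X = 105 is a false positive of the relaxation.
false_positive = relaxed_query(True, False, 105) and not original(True, False, 105)
```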

Challenge

• What predicates do we relax?
  – Trade-off between factoring benefit and cost of false positives

• Rule 1 of relaxation is:
  – We do not talk about Fight Club.

• Find “best” set of range predicates to relax for each attribute
  – Then select the “best” attribute

Don’t relax irrelevant predicates

Finding predicates to relax

• Given expression with range predicates involving attribute X.
  – Find which predicates to relax for greatest plan improvement.

• Turns out a greedy algorithm is optimal for many cost functions
  – Proof in paper appendix
  – Useful as a heuristic even otherwise

Key Idea

• Relax a pair of predicates if computed to be beneficial

• Repeat treating the relaxed query as the original query

• Trick is in figuring out when a relaxation is beneficial
  – Original predicates are treated slightly differently from relaxed predicates
  – Details in paper
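The relax-and-repeat loop can be sketched as follows. This is not the paper's algorithm: the benefit test is supplied by the caller, picking the highest-benefit pair in each round is an assumption, and the sketch does not model the different treatment of original and relaxed predicates. The toy benefit function (a fixed saving of 20 minus the width of the gap between the two ranges) is purely hypothetical.

```python
def greedy_relax(ranges, benefit):
    """Repeatedly merge the pair of ranges with the largest positive benefit.

    ranges:  list of (low, high) tuples on one attribute.
    benefit: hypothetical caller-supplied function (r1, r2, merged) -> number;
             a value > 0 means the relaxation is considered beneficial.
    """
    ranges = list(ranges)
    while len(ranges) > 1:
        best = None
        for i in range(len(ranges)):
            for j in range(i + 1, len(ranges)):
                r1, r2 = ranges[i], ranges[j]
                merged = (min(r1[0], r2[0]), max(r1[1], r2[1]))
                gain = benefit(r1, r2, merged)
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, i, j, merged)
        if best is None:
            break  # no remaining pair is worth relaxing
        _, i, j, merged = best
        # Treat the relaxed query as the new original and repeat.
        ranges = [r for k, r in enumerate(ranges) if k not in (i, j)] + [merged]
    return ranges


def toy_benefit(r1, r2, merged):
    """Toy model: fixed saving of 20, minus the uncovered gap between the ranges."""
    gap = max(0, max(r1[0], r2[0]) - min(r1[1], r2[1]))
    return 20 - gap


result = sorted(greedy_relax([(1, 100), (10, 110), (500, 600)], toy_benefit))
```

Under the toy benefit, the two overlapping ranges merge and the distant one is left alone, since relaxing across the gap would admit too many false positives.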

Experiments

• Experiments on SQL Server 2000

• Factorizing done in stand-alone module
  – Did I hear someone say SQL is declarative?

• Queries on UCI Machine Learning and UCI KDD data.
  – Table sizes ~ 1 million rows

• 15 workloads
  – Mostly DNF queries (#terms: 1 to >100)

Reduction in Running Time

[Figure: bar chart of % Reduction in Running Time (y-axis, 0 to 80) per workload: AnnealU, Australian, Balance_Scale, Breast, Breast2, Diabetes, Diabetes2, Hypothyroid, Letter, Mushroom, Pima, Pima2, Shuttle, Soybean_Large, Soybean_Large2. Series: % Reduction (Exact) and % Incremental Reduction (Approx).]

Impact of Factorization

[Figure: bar chart of % of Queries (y-axis, 0 to 50) in each category: Total Changed Plans, Changes from Exact factoring, Changes from Approx. factoring, Index-intersection plans, IIU plans.]

Related Work

• Optimization of complex WHERE clause
  – Convert to CNF/DNF [Selinger79, Dayal87]
  – Using multiple indexes [Mohan90]
    • No factorization
  – Using smarter indexes [Leslie95]

• Factorization a popular idea in other domains
  – Compilers [Reinwald66], VLSI Design [Brayton87]

Conclusion

• Our contributions
  – Using factorization to optimize queries
    • Efficient algorithms requiring no normalization
    • Staged to reduce compile-time overhead
  – Introduced approximate factoring
    • Algorithm for optimal relaxation
  – Integration into overall optimization framework