Query Optimization Strategies

Zachary G. Ives
University of Pennsylvania

CIS 650 – Implementing Data Management Systems

January 31, 2005

Administrivia

Wednesday:
- Some initial suggestions for the project proposal
- Scheduling of the deadline for your midterm report

The next assignment:
- Read the Gray et al. (granularity of locks) and Kung and Robinson (optimistic concurrency control) papers
- Review Kung and Robinson
- Consider how optimistic CC is or isn’t useful in Web-scale data management

Today’s Trivia Question

Query Optimization

Challenge: pick the query execution plan that has minimum cost

Sources of cost:
- Interactions with other work
- Size of intermediate results
- Choices of algorithms, access methods
- Mismatch between I/O, CPU rates
- Data properties – skew, order, placement

The General Model of Optimization

Given an AST of a query:
- Build a logical query plan (tree of query algebraic operations)
- Transform into a “better” logical plan
- Convert into a physical query plan (includes strategies for executing operations)

Strategies

Basically, search over the space of possible plans
- At least of exponential complexity in the number of operators
- Hence, exhaustive search is generally not feasible

What can you do?
- Heuristics only – INGRES, Oracle until the mid-90s
- Randomized, simulated annealing, … – many efforts in the mid-90s
- Heuristics plus cost-based join enumeration – System-R
- Stratified search: heuristics plus cost-based enumeration of joins and a few other operators – Starburst optimizers
- Unified search: full cost-based search – EXODUS, Volcano, Cascades optimizers
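
To make the exponential blow-up concrete, the sketch below counts join orders for n relations. It relies on standard counting results rather than anything specific to the systems named above: n! left-deep orders, and (2(n-1))!/(n-1)! bushy trees when the order of the two join operands matters. The function names are illustrative only.

```python
from math import factorial


def left_deep_plans(n: int) -> int:
    """Number of left-deep join orders over n relations (one per permutation)."""
    return factorial(n)


def bushy_plans(n: int) -> int:
    """Number of bushy join trees over n relations, counting operand order.

    This is n! leaf labelings times the Catalan number C(n-1) of tree shapes,
    which simplifies to (2(n-1))! / (n-1)!.
    """
    return factorial(2 * (n - 1)) // factorial(n - 1)


if __name__ == "__main__":
    # Print how quickly both plan spaces grow with the number of relations.
    for n in (2, 4, 8, 12):
        print(n, left_deep_plans(n), bushy_plans(n))
```

These counts cover join order alone; choosing among physical algorithms and access methods multiplies the space further.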

What Are the Cost Estimating Factors?

Some notion of CPU, disk speeds, page sizes, buffer sizes, …

Cost model for every operator

Some information about tables and data:
- Sizes
- Cardinalities
- Number of unique values (from index)
- Histograms, sketches, …
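
As an illustration of how cardinalities and unique-value counts feed a cost model, here is a minimal sketch of two textbook size estimates. It assumes uniformly distributed values and independent columns; the class and function names are hypothetical and not taken from any particular system.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    distinct_values: int   # number of unique values, e.g. read from an index


@dataclass
class TableStats:
    cardinality: int       # number of tuples in the table
    columns: dict          # column name -> ColumnStats


def equality_selection_size(t: TableStats, column: str) -> float:
    """Estimated result size of `column = constant`, assuming uniform values."""
    return t.cardinality / t.columns[column].distinct_values


def equijoin_size(r: TableStats, r_col: str, s: TableStats, s_col: str) -> float:
    """Estimated size of R join S on r_col = s_col.

    Uses |R| * |S| / max(V(R, r_col), V(S, s_col)), the usual estimate when
    every value of the smaller domain is assumed to appear in the larger one.
    """
    v = max(r.columns[r_col].distinct_values, s.columns[s_col].distinct_values)
    return r.cardinality * s.cardinality / v


if __name__ == "__main__":
    R = TableStats(1000, {"a": ColumnStats(100)})
    S = TableStats(500, {"b": ColumnStats(50)})
    print(equality_selection_size(R, "a"), equijoin_size(R, "a", S, "b"))
```

Histograms and sketches replace the uniformity assumption with a finer-grained picture of the value distribution.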

The System-R Approach: Heuristics with Cost-Based Join Enumeration

Make the following assumptions:
- All predicates are evaluated as early as possible
- All data is projected away as early as possible

Separately consider operations that produce intermediate state or are blocking:
- Joins
- Aggregation
- Sorting
- Correlation with a subquery (join, exists, …)

By choosing a join ordering, we’re automatically choosing where selections and projections are pushed – why is this so?

System-R Architecture

Breaks a query into its blocks, separately optimizes them
- Nested loops join between them, if necessary

Within a block: focuses on joins (only a few kinds) in dynamic programming enumeration
- Principle of optimality: best k-way join includes best (k-1)-way join
- Use simple table statistics when available, based on indices; “magic numbers” where unavailable

Heuristics
- Push “sargable” selects, projects as low as possible
- Cartesian products only after joins
- Left-linear trees only: n·2^(n-1) cost-estimation operations
- Grouping last

Extra “interesting orders” dimension
- Grouping, ordering, join attributes
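
The dynamic programming enumeration can be sketched as follows. This is a simplified illustration, not System-R's actual code: it assumes cost is the sum of intermediate result sizes, ignores interesting orders and physical join methods, and skips any extension that would be a Cartesian product. The input dictionaries and the function name are hypothetical.

```python
from itertools import combinations


def best_left_deep_plan(card, selectivity):
    """Pick a left-deep join order by dynamic programming over subsets.

    card:        {relation name: cardinality}
    selectivity: {frozenset({r, s}): fraction of the cross product kept}
    Returns (cost, output_size, join_order) for the full set of relations.
    """
    rels = sorted(card)
    # best[subset] = (cost so far, estimated output size, join order)
    best = {frozenset([r]): (0.0, float(card[r]), (r,)) for r in rels}
    for k in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, k)):
            for r in subset:              # r is the last relation joined
                rest = subset - {r}
                if rest not in best:
                    continue              # rest needs a Cartesian product
                preds = [selectivity[frozenset({r, s})] for s in rest
                         if frozenset({r, s}) in selectivity]
                if not preds:
                    continue              # joining r would be a Cartesian product
                cost, size, order = best[rest]
                out = size * card[r]
                for f in preds:
                    out *= f
                candidate = (cost + out, out, order + (r,))
                # Principle of optimality: keep only the cheapest plan per subset.
                if subset not in best or candidate[0] < best[subset][0]:
                    best[subset] = candidate
    return best[frozenset(rels)]


if __name__ == "__main__":
    card = {"R": 1000, "S": 100, "T": 10}
    sel = {frozenset({"R", "S"}): 0.01, frozenset({"S", "T"}): 0.1}
    print(best_left_deep_plan(card, sel))
```

The two outer loops visit each (subset, last relation) pair once, at most n·2^(n-1) pairs for n relations. The sketch assumes the join graph is connected; otherwise the final lookup fails because no product-free plan exists.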

Example

Schema: R(a,b), S(b,c), T(a,c), U(c,d)

SELECT d, AVG(c)
FROM R, S, T, U
WHERE R.a = S.b AND S.c = T.c AND R.a = T.a AND T.c = U.c
GROUP BY d
ORDER BY d

In relational algebra: γd/AVG(c)(Πd(σa < 2 (R ⋈ S ⋈ T ⋈ U)))

Why We Need More than System-R

Cross-query-block optimizations
- e.g., push a selection predicate from one block to another

Better statistics

More general kinds of optimizations
- Optimization of aggregation operations with joins
- Different cost and data models, e.g., OO, XML
- Additional joins, e.g., “containment joins”

Can we build an extensible architecture for this?
- Logical, physical, and logical-to-physical transformations
- Enforcers

Alternative search strategies
- Left-deep plans aren’t always optimal
- Perhaps we can prune more efficiently

Optimizer Generators

Idea behind EXODUS and Starburst: build a programming language for writing custom optimizers!

Rule-based or transformational language
- Describes, using a declarative template and conditions, an equivalent expression

EXODUS: compile an optimizer based on a set of rules

Starburst: run a set of rules in the interpreter; a client could customize the rules

Starburst Query Optimizer

Corona query processor vs. Core engine – follows RDS and RSS separation

Part of a very ambitious project
- Hydrogen language was highly orthogonal (unlike SQL) and supported many fancy OO concepts (e.g., inheritance, methods), recursion, special constraints, etc. – much more powerful than SQL at the time
- Some portions of Hydrogen and Corona made their way into DB2, later SQL standards

Two stages (stratified search):
- Query rewrite/query graph model – a SQL-block-level, relational calculus-like representation of queries
- Plan optimization – a System-R-style dynamic programming phase once query rewrite has completed

Starburst QGM

Tries to encode relational calculus-like concepts:
- Predicates between variables within each SQL block body
- Variables can be
  - Distinguished (“set-builder” / “for-each” / “F”)
  - Existential (“∃”)
  - Universal (“∀”)
- Returned values from each block (head)
- Predicates across blocks

Starburst QGM Example

SELECT partno, price, order_qty
FROM quotations Q1
WHERE Q1.partno IN
      (SELECT partno
       FROM inventory Q3
       WHERE Q3.onhand_qty < Q1.order_qty
         AND Q3.type = 'cpu')

Starburst Query Rewrite

Focus: inter-block optimizations
- Pushing predicates across views, pushing projections down
- Magic sets rewritings
- Simplification, transitivity

Implemented through production rules
- Condition-action rules selected by:
  - Sequence
  - Priority
  - Probability distribution
- Search may stop after a “budget”

End product: logical plan(s), chosen via above constraints
- Normally: set is singleton
- Can also CHOOSE among set by invoking the cost-based plan optimizer

Query Rewrite Example

Convert subquery to join:

IF OP1.type = Select ∧ Q2.type = ‘∃’ ∧ (at each evaluation of the existential predicate at most one tuple of T2 satisfies the predicate)
THEN Q2.type = ‘F’

Merge operations:

IF OP1.type = Select ∧ OP2.type = Select ∧ Q2.type = ‘F’ ∧ (NOT (T1.distinct = false ∧ OP2.eliminate_duplicate = true))
THEN merge OP2 into OP1;
     IF OP2.eliminate_duplicate THEN OP1.eliminate_duplicate = true
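
The two rules can be mimicked on a toy in-memory query graph, as in the sketch below. It is an assumption-laden illustration, not Starburst's data structures: `Box`, `Quantifier`, and the `at_most_one_match` flag are hypothetical, predicates are not modeled, the string "E" stands for the existential quantifier type, and `distinct` on a box stands in for T1.distinct.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)          # compare boxes by identity, not by content
class Box:
    type: str                 # e.g. "Select" or "Base"
    distinct: bool = False    # stands in for T.distinct of the box's output table
    eliminate_duplicate: bool = False
    quantifiers: list = field(default_factory=list)


@dataclass(eq=False)
class Quantifier:
    type: str                 # "F" (for-each), "E" (existential), "A" (universal)
    ranges_over: Box
    # Hypothetical flag: the caller has established that at most one tuple
    # of the ranged-over table satisfies the existential predicate.
    at_most_one_match: bool = False


def subquery_to_join(op1: Box, q2: Quantifier) -> bool:
    """Rule 1: turn an existential quantifier into a for-each quantifier."""
    if op1.type == "Select" and q2.type == "E" and q2.at_most_one_match:
        q2.type = "F"
        return True
    return False


def merge_select(op1: Box, q2: Quantifier) -> bool:
    """Rule 2: merge the Select box under q2 into op1 when duplicates allow it."""
    op2 = q2.ranges_over
    if (op1.type == "Select" and op2.type == "Select" and q2.type == "F"
            and not (not op1.distinct and op2.eliminate_duplicate)):
        # op1 now ranges directly over whatever op2 ranged over.
        op1.quantifiers.remove(q2)
        op1.quantifiers.extend(op2.quantifiers)
        if op2.eliminate_duplicate:
            op1.eliminate_duplicate = True
        return True
    return False
```

Applied in sequence to a two-box query, rule 1 first makes the subquery quantifier a for-each quantifier, which is the precondition for rule 2 to collapse the two Select boxes into one.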

Starburst Plan Optimization

Separately optimizes each QGM operation (box)

Grammar of STrategy AlteRnatives (STARs)
- Take high-level operations and turn them into LOw-LEvel Plan OPerations (LOLEPOPs)
  - JOIN, UNION, SCAN, SORT, SHIP, …

Tables and plans have properties
- Relational (tables joined, columns accessed, predicates applied)
- Operational (ordering, site)
- Estimated (cost, cardinality)

GLUE operators: SORT, SHIP

Join enumerator tries alternative join sequences a la System-R
- Can produce bushy trees
- Can have rank/priority with each STAR
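
One way to picture the role of GLUE operators is a helper that compares a plan's operational properties with what a consumer requires and wraps the plan in SORT or SHIP only where they differ. The sketch below is an illustration under stated assumptions, not Starburst's STAR grammar: the `Plan` record, the `glue` function, the fixed costs, and the assumption that SHIP preserves ordering are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    op: str                       # LOLEPOP name, e.g. "SCAN", "SORT", "SHIP"
    inputs: tuple = ()
    order: Optional[str] = None   # operational property: column the output is sorted on
    site: str = "local"           # operational property: where the output resides
    cost: float = 0.0             # estimated property


def glue(plan, required_order=None, required_site=None,
         sort_cost=10.0, ship_cost=5.0):
    """Add SHIP and/or SORT on top of `plan` until the requirements are met.

    The costs are placeholder constants; a real optimizer would derive them
    from cardinality and a cost model.
    """
    if required_site is not None and plan.site != required_site:
        # Assumption: shipping keeps the existing ordering.
        plan = Plan("SHIP", (plan,), plan.order, required_site,
                    plan.cost + ship_cost)
    if required_order is not None and plan.order != required_order:
        plan = Plan("SORT", (plan,), required_order, plan.site,
                    plan.cost + sort_cost)
    return plan


if __name__ == "__main__":
    scan = Plan("SCAN", site="A", cost=100.0)
    print(glue(scan, required_order="b", required_site="B"))
```

A plan that already has the required order and site is returned unchanged, so glue adds cost only when a property is actually missing.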

Starburst Pros and Cons

Pro:
- Stratified search generally works well in practice – DB2 UDB has perhaps the best query optimizer out there
- Interesting model of separating calculus-level and algebra-level optimizations
- Generally provides fast performance

Con:
- Interpreted rules were too slow – and no database user ever customized the engine!
- Difficult to assign priorities to transformations
- Some QGM transformations that were tried were difficult to assess without running many cost-based optimizations
- Rules got out of control

The EXODUS and Volcano Optimizer Generators

Part of a “database toolkit” approach to building systems
- A set of libraries and languages for building databases with custom data models and functionalities

[Figure: optimizer-generator pipeline; labels: rules in “E”, EXODUS, gcc, MyQL, MyDB plan]

EXODUS/Volcano Model

Try to unify the notion of logical-logical transformations and logical-physical transformations
- No stratification as in Starburst – everything is transformations

Challenge: efficient search – need a lot of pruning
- EXODUS: used many heuristics, something called a MESH
- Volcano: branch-and-bound pruning, recursion + memoization

Example Rules

Physical operators:
    %operator 2 join
    %method 2 hash_join loops_join cartesian_product

Logical-logical transformations:
    join (1,2) ->’ join (2,1)

Logical-physical transformations:
    join (1,2) by hash_join (1,2)

Can get quite hairy:
    join 7 (join 8 (1,2), 3) <-> join 8 (1, join 7 (2,3))
    {{
    #ifdef FORWARD
    if (NOT cover_predicate (OPERATOR_7.oper_argument, INPUT_2.oper_property, INPUT_3.oper_property))
        REJECT…

So How Does the Optimizer Work? (EXODUS version)

Needs to enumerate all possible transformations without repeating

Every expression is stored in a “MESH”:
- Basically, an associative lookup for each expression, which can link to other entries in the same MESH
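
The idea of enumerating transformations without repeating can be sketched with two join rules and a hash set playing the role of the associative lookup. This is a simplified stand-in, not the EXODUS MESH: expressions are nested tuples, the rules ignore predicates (so Cartesian products are generated too), entries do not share subexpressions, and all names are illustrative.

```python
def commute(e):
    """join(1, 2) -> join(2, 1)"""
    if isinstance(e, tuple):
        _, left, right = e
        yield ("join", right, left)


def associate(e):
    """join(join(1, 2), 3) -> join(1, join(2, 3))"""
    if isinstance(e, tuple) and isinstance(e[1], tuple):
        _, (_, a, b), c = e
        yield ("join", a, ("join", b, c))


RULES = (commute, associate)


def rewrites(e):
    """Every expression reachable by one rule application anywhere in e."""
    for rule in RULES:
        yield from rule(e)
    if isinstance(e, tuple):
        op, left, right = e
        for new_left in rewrites(left):
            yield (op, new_left, right)
        for new_right in rewrites(right):
            yield (op, left, new_right)


def enumerate_equivalents(expr):
    """Apply the rules until no new expression appears."""
    seen = {expr}        # associative lookup: each expression is stored once
    open_set = [expr]    # expressions whose rewrites are still unexplored
    while open_set:
        e = open_set.pop()
        for new in rewrites(e):
            if new not in seen:
                seen.add(new)
                open_set.append(new)
    return seen


if __name__ == "__main__":
    plans = enumerate_equivalents(("join", ("join", "R", "S"), "T"))
    print(len(plans))
```

Because every rewrite is checked against the lookup before it is queued, the loop terminates even though commutativity alone would otherwise cycle forever.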

Search in EXODUS

Apply a transformation, see if it produces a new node

If so:
- Find cheapest implementation rule
- Also apply all relevant transformation rules, add results to OPEN set
- Propagate revised cost to parents (reanalyze)
- Check parents for new transformation possibilities (rematch)

Heuristics to guide the search in the OPEN set:
- “Promise” – an “expected cost factor” for each transformation rule, based on analysis of averages of the optimizer’s cost model results
- Favor items with high expected payoff over the current cost
- Problem: often need to apply 2 rules to get a benefit; use heuristics
  - Once a full plan is found, optimizer does hill climbing, only applying a limited set of rules

Pros and Cons of EXODUS

Pros:
- Unified model of optimization is powerful, elegant
- Very extensible architecture

Cons:
- Combined logical and physical expressions in the same MESH
  - equivalent logical plans with different physical operators (e.g., merge vs. hash joins) were kept twice
- Physical properties weren’t handled well
  - sort enforcers were seldom applied since they didn’t pay off immediately – had to hack them into sort-merge join
- Hard-coded transformation, then algorithm selection, cost analysis
  - always applied even if not part of the most promising expression
  - applied based on perceived benefit – biased towards larger expressions, which meant repeated re-optimization
- Cost wasn’t very generic a concept

Volcano, Successor to EXODUS (Shouldn’t it be LEVITICUS?)

Re-architected into a top-down, memoized engine
- Depth-first search – allows branch-and-bound pruning

FindBestPlan takes logical expression, physical properties, cost bound:
- If already computed, return
- Else compute set of possible “moves”:
  - Logical-logical rule
  - Compliant logical-physical rule
  - Enforcer
- Insert logical expression into lookup table
- Insert physical op, plan into separate lookup table
- Return best plan and cost

More generic notions of properties and enforcers (e.g., location, SHIP), cost (an ADT)
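
The control flow of FindBestPlan can be sketched for a toy join-ordering problem. This is a loose illustration under many assumptions, not Volcano's implementation: a logical expression is just a set of relations, the only physical property is "sorted or not", logical alternatives are enumerated directly as splits instead of via rules, a single memo table is used, and every cardinality, selectivity, and cost formula is made up.

```python
import math
from itertools import combinations

CARD = {"R": 1000, "S": 100, "T": 10}   # hypothetical base cardinalities
SEL = 0.01                               # hypothetical selectivity of every join
# (name, needs sorted inputs and delivers sorted output, cost factor)
ALGORITHMS = (("hash_join", False, 2.0), ("merge_join", True, 1.0))
memo = {}   # (relations, need_sorted) -> (cost, plan)


def size(rels):
    return math.prod(CARD[r] for r in rels) * SEL ** (len(rels) - 1)


def splits(rels):
    """All ordered (left, right) partitions of a set of relations."""
    items = sorted(rels)
    for k in range(1, len(items)):
        for left in combinations(items, k):
            yield frozenset(left), rels - frozenset(left)


def find_best_plan(rels, need_sorted=False, limit=math.inf):
    """Return (cost, plan) with cost <= limit, or None if there is none."""
    key = (rels, need_sorted)
    if key in memo:                          # already computed: reuse it
        cost, plan = memo[key]
        return (cost, plan) if cost <= limit else None
    best, bound = None, limit

    def offer(cost, plan):
        nonlocal best, bound
        if cost <= bound:                    # new best plan tightens the bound
            best, bound = (cost, plan), cost

    if len(rels) == 1 and not need_sorted:
        offer(size(rels), ("scan", min(rels)))
    for left, right in splits(rels):         # logical alternatives
        for algo, sorted_io, factor in ALGORITHMS:
            if need_sorted and not sorted_io:
                continue                     # does not deliver the property
            local = factor * (size(left) + size(right))
            if local > bound:
                continue                     # branch-and-bound pruning
            l = find_best_plan(left, sorted_io, bound - local)
            if l is None:
                continue
            r = find_best_plan(right, sorted_io, bound - local - l[0])
            if r is not None:
                offer(local + l[0] + r[0], (algo, l[1], r[1]))
    if need_sorted:                          # enforcer move: sort any plan
        sort_cost = size(rels) * math.log2(size(rels) + 1)
        if sort_cost <= bound:
            p = find_best_plan(rels, False, bound - sort_cost)
            if p is not None:
                offer(sort_cost + p[0], ("sort", p[1]))
    if best is not None:
        memo[key] = best                     # only successful searches are cached
    return best
```

Only successful searches are memoized here. A search that fails under a tight bound says nothing about looser bounds, so caching failures would also require recording the bound that was used.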

EXODUS, Revision 3: Cascades

Basically, a cleaner, more object-oriented version of the Volcano engine

Rumor has it that MS SQL Server is currently based on a (simplified and streamlined) version of the Volcano/Cascades optimizer generator

Optimization Evaluation

So, which is best?
- Heuristics plus join-based enumeration (System-R)
- Stratified, calculus-then-algebraic (Starburst)
  - Con: QGM transformations are almost always heuristics-based
  - Pro: very succinct transformations at QGM level
- Unified algebraic (Volcano/Cascades)
  - Con: many more rules need to be applied to get effect of QGM rewrites
  - Pro: unified, fully cost-based model
