View
215
Download
2
Embed Size (px)
Citation preview
Query Optimization Strategies
Zachary G. IvesUniversity of Pennsylvania
CIS 650 – Implementing Data Management Systems
January 31, 2005
2
Administrivia
Wednesday: Some initial suggestions for the project proposal Scheduling of the deadline for your midterm
report
The next assignment: Read Gray et al. (granularity of locks) and Kung
and Robinson (optimistic concurrency control) papers
Review Kung and Robinson Consider how optimistic CC is or isn’t useful in Web-
scale data management
3
Today’s Trivia Question
4
Query Optimization
Challenge: pick the query execution plan that has minimum cost
Sources of cost: Interactions with other work Size of intermediate results Choices of algorithms, access methods Mismatch between I/O, CPU rates Data properties – skew, order, placement
5
The General Model of Optimization
Given an AST of a query: Build a logical query plan
(Tree of query algebraic operations)
Transform into “better” logical plan Convert into a physical query plan
(Includes strategies for executing operations)
6
Strategies
Basically, search, over the space of possible plans At least of exponential complexity in the number of
operators Hence, exhaustive search is generally not feasible
What can you do? Heuristics only – INGRES, Oracle until the mid-90s Randomized, simulated annealing, … – many efforts in the
mid-90s Heuristics plus cost-based join enumeration – System-R Stratified search: heuristics plus cost-based enumeration of
joins and a few other operators – Starburst optimizers Unified search: full cost-based search – EXODUS, Volcano,
Cascades optimizers
7
What Are the Cost Estimating Factors?
Some notion of CPU, disk speeds, page sizes, buffer sizes, …
Cost model for every operator Some information about tables and data
Sizes Cardinalities Number of unique values (from index) Histograms, sketches, …
8
The System-R Approach: Heuristics with Cost-Based Join Enumeration
Make the following assumptions: All predicates are evaluated as early as possible All data is projected away as early as possible
Separately consider operations that produce intermediate state or are blocking: Joins Aggregation Sorting Correlation with a subquery (join, exists, …)
By choosing a join ordering, we’re automatically choosing where selections and projections are pushed – why is this so?
9
System-R Architecture Breaks a query into its blocks, separately optimizes
them Nested loops join between them, if necessary
Within a block: focuses on joins (only a few kinds) in dynamic programming enumeration Principle of optimality: best k-way join includes best (k-1)-way
join Use simple table statistics when available, based on indices;
“magic numbers” where unavailable Heuristics
Push “sargable” selects, projects as low as possible Cartesian products only after joins Left-linear trees only: n2n-1 cost-estimation operations Grouping last
Extra “interesting orders” dimensionGrouping, ordering, join attributes
10
Example
Schema: R(a,b), S(b,c), T(a,c), U(c,d) SELECT d, AVG(c)
FROM R,S,T,UWHERE R.a=S.b AND S.c=T.c AND R.a=T.a AND T.c=U.cGROUP BY dORDER BY d
In relational algebra: γd/AVG(c)(Πd(σa < 2 (R ⋈ S ⋈ T ⋈ U)))
11
Why We Need More than System-R
Cross-query-block optimizations e.g., push a selection predicate from one block to
another Better statistics More general kinds of optimizations
Optimization of aggregation operations with joins Different cost and data models, e.g., OO, XML Additional joins, e.g., “containment joins” Can we build an extensible architecture for this?
Logical, physical, and logical-to-physical transformations Enforcers
Alternative search strategies Left-deep plans aren’t always optimal Perhaps we can prune more efficiently
12
Optimizer Generators
Idea behind EXODUS and Starburst:Build a programming language for writing custom optimizers!
Rule-based or transformational language Describes, using declarative template and conditions,
an equivalent expression
EXODUS: compile an optimizer based on a set of rules
Starburst: run a set of rules in the interpreter; a client could customize the rules
13
Starburst Query Optimizer
Corona query processor vs. Core engine – follows RDS and RSS separation
Part of a very ambitious project Hydrogen language highly orthogonal (unlike SQL), and
supported many fancy OO concepts (e.g., inheritance, methods), recursion special constraints, etc. – much more powerful than SQL at the time
Some portions of Hydrogen and Corona made their way into DB2, later SQL standards
Two stages (stratified search): Query rewrite/query graph model – a SQL-block-level,
relational calculus-like representation of queries Plan optimization – a System-R-style dynamic
programming phase once query rewrite has completed
14
Starburst QGM
Tries to encode relational calculus-like concepts: Predicates between variables within each SQL
block body Variables can be
Distinguished (“set-builder” / “for-each” / “F”) Existential (“∃”) Universal (“∀”)
Returned values from each block (head) Predicates across blocks
15
Starburst QGM Example
SELECT partno, price, order_qtyFROM quotations Q1WHERE Q1.partno IN
(SELECT partnoFROM inventory Q3WHERE Q3.onhand_qty <
Q1.order_qtyAND Q3.type = ‘cpu’
16
Starburst Query Rewrite
Focus: inter-block optimizations Pushing predicates across views, pushing projections down Magic sets rewritings Simplification, transitivity
Implemented through production rules Condition – action rules selected by:
Sequence Priority Probability distribution
Search may stop after a “budget” End product: logical plan(s), chosen via above constraints
Normally: set is singleton Can also CHOOSE among set by invoking the cost-based plan
optimizer
17
Query Rewrite Example Convert subquery to join:
IF OP1.type = Select Æ Q2.type = ‘9’ Æ (at each evaluation of the existential predicate at most one tuple of T2 satisfies the predicate)
THEN Q2.type = ‘F’ Merge operations:
IF OP1.type = Select Æ OP2.type = Select Æ Q2.type = ‘F’ Æ (NOT (T1.distinct = false Æ OP2.eliminate_duplicate = true))
THENmerge OP2 into OP1;IF OP2.eliminate_duplicateTHEN OP1.eliminate_duplicate = true
18
Query Rewrite Example Convert subquery to join:
IF OP1.type = Select Æ Q2.type = ‘9’ Æ (at each evaluation of the existential predicate at most one tuple of T2 satisfies the predicate)
THEN Q2.type = ‘F’ Merge operations:
IF OP1.type = Select Æ OP2.type = Select Æ Q2.type = ‘F’ Æ (NOT (T1.distinct = false Æ OP2.eliminate_duplicate = true))
THENmerge OP2 into OP1;IF OP2.eliminate_duplicateTHEN OP1.eliminate_duplicate = true
19
Starburst Plan Optimization
Separately optimizes each QGM operation (box) Grammar of STrategy AlteRnatives (STARs)
Take high-level operations and turn them into LOw-LEvel Plan OPerations (LOLEPOPs)
JOIN, UNION, SCAN, SORT, SHIP, … Tables and plans have properties
Relational (tables joined, columns accessed, predicates applied)
Operational (ordering, site) Estimated (cost, cardinality)
GLUE operators: SORT, SHIP Join enumerator tries alternative join sequences a la
System-R Can produce bushy trees Can have rank/priority with each STAR
20
Starburst Pros and Cons
Pro: Stratified search generally works well in practice – DB2
UDB has perhaps the best query optimizer out there Interesting model of separating calculus-level and
algebra-level optimizations Generally provides fast performance
Con: Interpreted rules were too slow – and no database user
ever customized the engine! Difficult to assign priorities to transformations Some QGM transformations that were tried were difficult
to assess without running many cost-based optimizations
Rules got out of control
21
The EXODUS and Volcano Optimizer Generators
Part of a “database toolkit” approach to building systems A set of libraries and languages for building
databases with custom data models and functionalities
(rules in “E”)
(EXODUS)
(MyQL) (MyDB plan)
(gcc)
22
EXODUS/Volcano Model
Try to unify the notion of logical-logical transformations and logical-physical transformations No stratification as in Starburst – everything is
transformations Challenge: efficient search – need a lot of
pruning EXODUS: used many heuristics, something called a
MESH Volcano: branch-and-bound pruning, recursion +
memoization
23
Example Rules
Physical operators:%operator 2 join%method 2 hash_join loops_join cartesian_product
Logical-logical transformations:join (1,2) ->’ join(2,1)
Logical-physical transformations:join (1,2) by hash_join (1,2)
Can get quite hairy:join 7 (join 8 (1,2), 3) <-> join 8(1, join 7 (2,3))
{{#ifdef FORWARDif (NOT cover_predicate (OPERATOR_7 oper_argument, INPUT_2 oper_property, INPUT_3 oper_property))
REJECT…
24
So How Does the Optimizer Work?(EXODUS version)
Needs to enumerate all possible transformations without repeating
Every expression is stored in a “MESH”: Basically, an associative lookup for each expression,
which can link to other entries in the same MESH
25
Search in EXODUS
Apply a transformation, see if it produces a new node If so:
Find cheapest implementation rule Also apply all relevant transformation rules, add results to
OPEN set Propagate revised cost to parents (reanalyze) Check parents for new transformation possibilities (rematch)
Heuristics to guide the search in the OPEN set: “Promise” – an “expected cost factor” for each transformation
rule, based on analysis of averages of the optimizer’s cost model results
Favor items with high expected payoff over the current cost Problem: often need to apply 2 rules to get a benefit; use
heuristics Once a full plan is found, optimizer does hill climbing,
only applying a limited set of rules
26
Pros and Cons of EXODUS Pros:
Unified model of optimization is powerful, elegant Very extensible architecture
Cons: Combined logical and physical expressions in the same MESH
equivalent logical plans with different physical operators (e.g., merge vs. hash joins) were kept twice
Physical properties weren’t handled well sort enforcers were seldom applied since they didn’t pay off
immediately – had to hack them into sort-merge join Hard-coded transformation, then algorithm selection, cost
analysis always applied even if not part of the most promising expression applied based on perceived benefit – biased towards larger
expressions, which meant repeated re-optimization Cost wasn’t very generic a concept
27
Volcano, Successor to EXODUS(Shouldn’t it be LEVITICUS?)
Re-architected into a top-down, memoized engine Depth-first search – allows branch-and-bound pruning
FindBestPlan takes logical expression, physical properties, cost bound: If already computed, return Else compute set of possible “moves”:
Logical-logical rule Compliant logical-physical rule Enforcer
Insert logical expression into lookup table Insert physical op, plan into separate lookup table Return best plan and cost
More generic notions of properties and enforcers (e.g., location, SHIP), cost (an ADT)
28
EXODUS, Revision 3: Cascades
Basically, a cleaner, more object-oriented version of the Volcano engine
Rumor has it that MS SQL Server is currently based on a (simplified and streamlined) version of the Volcano/Cascades optimizer generator
29
Optimization Evaluation
So, which is best? Heuristics plus join-based enumeration
(System-R) Stratified, calculus-then-algebraic (Starburst)
Con: QGM transformations are almost always heuristics-based
Pro: very succinct transformations at QGM level
Unified algebraic (Volcano/Cascades) Con: many more rules need to be applied to get
effect of QGM rewrites Pro: unified, fully cost-based model