Parametric Query Generation
Student: Dilys Thomas
Mentor: Nico Bruno
Manager: Surajit Chaudhuri
Problem Statement
Given queries with parametric filters, find values of the parameters so that cardinality constraints are satisfied on a given fixed database.
Constraints: cardinality constraints on the query and its subexpressions.
Parameters: simple filters.
Example
Select * from testR where (testR.v1 between %f and %f) : 100,000
Select * from testS where (testS.v1 <= %f) : 17,000
Select * from testR, testS where (testR.v1 = testS.v0) and (testS.v1 <= %f) and (testR.v0 >= %f) and (testR.v1 between %f and %f) : 30,000
Motivation
Generation of queries to test the optimizer.
The RAGS tool is presently available to syntactically generate random queries and test for errors by majority vote.
Motivation
Needed to test different modules and new algorithms, test the statistics estimator, and compare performance.
Queries are not random: you want them to satisfy some constraints.
Does a solution exist? NP-complete.
For n parametric attributes with joins, even when the database has only O(n) tuples:
Reduction from SUBSET SUM, even for a single constraint.
Model
For a given set of parameters, the cardinality can be found by a function invocation.
Implemented by:
Actually running the query (slow, accurate)
Using optimizer estimates of the cardinality (fast, inaccurate)
Using an intermediate data structure
Objective: minimize the number of cardinality estimation calls.
Understanding the Problem: Simplification
k single-sided <= attribute parameters; single relation and single constraint.
Let n = number of distinct values in each attribute, k = number of attributes.
Simple algorithm of time O(n^k): try all combinations of parameter values.
Can we do better?
1 dimension: yes, binary search.
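The 1-dimension case can be sketched as a binary search over the attribute's sorted distinct values, since the cardinality of a one-sided filter (attr <= v) is monotone in v. This is an illustrative sketch only: `find_threshold`, `sorted_col`, and the in-memory `card` function stand in for the real cardinality-estimation call, which here is cheap but in the deck's setting is an optimizer or executor invocation.

```python
import bisect

def find_threshold(sorted_col, target):
    """Return a parameter value v such that |{t : t.attr <= v}| is as
    close as possible to `target`, using O(log n) cardinality probes.
    `sorted_col` stands in for the attribute's sorted values."""
    def card(v):                       # cardinality of the filter (attr <= v)
        return bisect.bisect_right(sorted_col, v)

    lo, hi = 0, len(sorted_col) - 1
    best = sorted_col[0]
    while lo <= hi:
        mid = (lo + hi) // 2
        v = sorted_col[mid]
        c = card(v)
        if abs(c - target) < abs(card(best) - target):
            best = v                   # closest cardinality seen so far
        if c < target:
            lo = mid + 1               # need a looser filter: move right
        elif c > target:
            hi = mid - 1               # need a tighter filter: move left
        else:
            return v                   # exact match
    return best
```

Each probe halves the candidate range, matching the log n bound on the results slide.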
Results:

Dimension   Upper Bound   Lower Bound
1           log n         log n
2           n             n
k >= 2      n^(k-1)       C(n+k-1, k-1)
2-Dimension Algorithm
Walk-based algorithm. Search for 20:

25 27 35 43 50
10 15 21 27 35
 8 12 14 20 22
 2  5  8 20 22
 1  3  7 10 20
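The walk above is the classic "saddleback" search on a monotone matrix: start at a corner and let each comparison discard a whole row or column, so O(n) probes suffice instead of O(n^2). A minimal sketch over the slide's matrix (rows increase left to right, columns decrease top to bottom):

```python
def saddleback_search(matrix, target):
    """Find `target` in a matrix whose rows increase left-to-right and
    whose columns decrease top-to-bottom.  Start at the top-left; each
    probe discards one row or one column: O(rows + cols) comparisons."""
    r, c = 0, 0
    while r < len(matrix) and c < len(matrix[0]):
        v = matrix[r][c]
        if v == target:
            return r, c
        if v > target:
            r += 1     # rest of row r is even larger: discard the row
        else:
            c += 1     # rest of column c is even smaller: discard the column
    return None

grid = [[25, 27, 35, 43, 50],
        [10, 15, 21, 27, 35],
        [ 8, 12, 14, 20, 22],
        [ 2,  5,  8, 20, 22],
        [ 1,  3,  7, 10, 20]]
```

Searching the grid for 20 walks 25 → 10 → 15 → 21 → 14 → 20, touching at most one cell per row or column eliminated.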
Lower Bound
Incomparable set:

25 27 35 43 50
10 15 21 27 35
 8 12 14 20 22
 2  5  8 20 22
 1  3  7 10 21
For general k.
Upper bound: for k dimensions, recursively make n invocations of the (k-1)-dimension algorithm.
T(k) = n * T(k-1), T(2) = n, hence T(k) = n^(k-1) (multiple-walk algorithm).
Lower bound: x_1 + x_2 + … + x_k = n has C(n+k-1, k-1) solutions.
Optimization Problem: Error Metrics
Single constraint: constraint cardinality C, achieved cardinality D.
RelErr = max(C/D, D/C)
Multiple constraints: combine the errors as the average relative error across all constraints.
Objective: minimize error.
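The error metric above is symmetric in over- and under-shooting and bottoms out at 1.0 for an exact match. A direct transcription:

```python
def rel_err(constraint_card, achieved_card):
    """RelErr = max(C/D, D/C): 1.0 means the constraint is met exactly;
    overshooting by 2x and undershooting by 2x both score 2.0."""
    c, d = float(constraint_card), float(achieved_card)
    return max(c / d, d / c)

def avg_rel_err(pairs):
    """Combine multiple constraints: average per-constraint RelErr
    over (constraint_card, achieved_card) pairs."""
    return sum(rel_err(c, d) for c, d in pairs) / len(pairs)
```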
Simple Walk
STEP = unit change in the current parameter values.
While (can improve with step)
    {make the improving step}
Stepsize = 1 tuple -> converges to a local optimum.
Small stepsize -> slow convergence.
Simple Walk -> Halving Walk
Initialize the parameters (point).
Each stepsize = 1.0 quantile.
for (int i = 0; i < maxhalve; i++)
{
    while (can improve with step)
        {make the improving step}
    // exited above loop -> cannot improve with local step
    Halve all step sizes.
}
Use quantiles to decide steps.
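The loop above can be sketched as coordinate descent with a halving step size. This is a minimal sketch, not the deck's implementation: `error` stands in for a cardinality-based error evaluation (each call would be an optimizer or executor invocation), and steps here are plain numeric deltas rather than quantile moves.

```python
def halving_walk(error, params, step=1.0, max_halve=20):
    """Minimize `error` over a parameter vector: greedily take any
    single-coordinate +/- step that lowers the error, and when no step
    helps, halve the step size and continue."""
    params = list(params)
    best = error(params)
    for _ in range(max_halve):
        improved = True
        while improved:                    # walk until locally optimal
            improved = False
            for i in range(len(params)):
                for delta in (step, -step):
                    cand = params[:]
                    cand[i] += delta
                    e = error(cand)
                    if e < best:           # take any improving step
                        params, best = cand, e
                        improved = True
        step /= 2.0                        # stuck at this resolution: refine
    return params, best
```

As on the slide, a stuck walk is not restarted; it is simply continued at half the resolution, which is what makes this a generalization of binary search.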
Halving Walk
Initializing the parameters [more later].
Steps are made in the quantile domain of the attribute, done by a simple equi-depth wrapper over the histograms provided by SQL Server.
Initial stepsize = 1.0 quantile.
Halving Walk: Steps Considered
For <=, >= parameters:
RIGHT move, LEFT move.
For between parameters:
Apart from a RIGHT move and LEFT move for each endpoint:
LEFT translate, RIGHT translate.
Algorithm Halving-Steps
A generalization of binary search
// but only a heuristic.
Converges to a local optimum.
#Steps per iteration: constant.
Hence much faster convergence.
Initialization
Random
Optimizer estimate
Solving equations:
Power method.
Least-squares error.
Least Squares Initialization
For each parametric attribute Pi, have a variable pi.
For each constraint, build an equation:
Cardinality without parametric filters: C
Constraint cardinality with filters: F
Then filter selectivity S = F/C.
If P1, P2, …, Pk are the parameters in this constraint, write the equation p1 * p2 * … * pk = S (making an independence assumption).
Least Squares Initialization
In log space: a set of linear equations.
May have a single solution, multiple solutions, or none!
Use the solution that minimizes the least-squares error metric.
In log space this amounts to minimizing the sum (L_2) of the relative errors.
Simple and fast initialization.
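The two slides above can be sketched end to end: take logs to turn each product equation p1 * … * pk = S into a linear one, then solve the normal equations. This is an illustrative sketch with hypothetical names (`lsq_init`, constraints given as index lists plus selectivities); the real system would get its selectivities from optimizer estimates.

```python
import math

def lsq_init(constraints, k):
    """Least-squares initialization in log space.
    constraints: list of (param_indices, selectivity S), encoding
    prod_{i in param_indices} p_i = S.  Returns per-parameter
    selectivities p_i minimizing the L2 error of the log equations."""
    # Normal equations (A^T A) x = A^T b, where each row of A is a 0/1
    # indicator of which parameters appear and b = log S.
    ata = [[0.0] * k for _ in range(k)]
    atb = [0.0] * k
    for idx, s in constraints:
        logs = math.log(s)
        for i in idx:
            atb[i] += logs
            for j in idx:
                ata[i][j] += 1.0
    # Tiny Gauss-Jordan elimination with partial pivoting.
    M = [row[:] + [atb[r]] for r, row in enumerate(ata)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        if abs(M[col][col]) < 1e-12:
            continue                      # underdetermined direction
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, k + 1):
                    M[r][c] -= f * M[col][c]
    x = [M[i][k] / M[i][i] if abs(M[i][i]) > 1e-12 else 0.0 for i in range(k)]
    return [math.exp(v) for v in x]       # back from log space
```

Unconstrained parameters come back as selectivity 1.0 (log 0), i.e. a filter that passes everything, which is a reasonable neutral start point.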
Why still INIT step = 1.0 quantile?
Big jumps occur in the algorithm in spite of a good start point:
Optimizer estimates and independence assumptions may not be valid in the presence of correlated columns.
Efficiency: Statistics vs Execution
The optimizer is used for cardinality estimation, but the executor is used to verify the final step taken.
When, for a step, the optimizer (estimates a decrease) and the executor (evaluates an increase) disagree, switch to using only the executor for cardinality estimation.
Good initialization obviates optimizer use.
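The switch-over rule above can be sketched as a small wrapper. `CardinalityOracle`, `estimate`, and `execute` are hypothetical names: `estimate` stands in for an optimizer cardinality estimate and `execute` for actually running the query.

```python
class CardinalityOracle:
    """Use the fast estimator until it disagrees with the executor on
    the direction of a change; from then on use only the executor."""
    def __init__(self, estimate, execute):
        self.estimate = estimate      # fast, possibly inaccurate (optimizer)
        self.execute = execute        # slow, exact (run the query)
        self.trust_estimator = True

    def cardinality(self, params):
        f = self.estimate if self.trust_estimator else self.execute
        return f(params)

    def verify_step(self, old_params, new_params):
        """Check an estimator-approved step against the executor.  If the
        two disagree on the sign of the change, stop trusting the
        estimator.  Returns the executor-measured change."""
        est_delta = self.estimate(new_params) - self.estimate(old_params)
        exe_delta = self.execute(new_params) - self.execute(old_params)
        if est_delta * exe_delta < 0:     # opposite directions: disagree
            self.trust_estimator = False
        return exe_delta
```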
Shortcutting
Traverse the parameters in random order.
Make the first step that decreases the error.
(Compare to the previous approach of trying all steps and making the "best" step, the one that decreases the error most.)
No significant benefit: shortcutting doesn't seem to help. In fact, convergence is sometimes slower.
Experimental Results
Dataset description: tables testR, testS, tesT, tableTA with up to 1M tuples.
They have correlated columns and multiple correlated foreign-key join columns. Columns include different Zipfian(1, 0.5) and Gaussian distributions.
Queries description: queries join over correlated columns and have multiple correlated selectivities.
Query description:
Eg 1: 6 correlated parameters, 1 constraint, single relation.
Eg 2: 3 tables with 6 constraints, including 2-way and 3-way join constraints; filters on correlated columns across joins.
Other queries with constraints over joins and many parameters over correlated attributes.
ERROR vs TIME Graph
[Chart: average relative error (0-80) vs time (0-250 seconds)]
Problem Specifics: Reusing Results
Lots of queries share the same skeleton but differ in their parameters.
Creating indices will help!
Use the DTA for recommendations.
2-10 fold improvement in speed.
Using the DTA for Index Creation
[Chart: average relative error (0-3) vs time (0-1500 seconds), with and without indices]
Interleaving OPT and Exec
Using the optimizer to guide the search gives a 2-10x improvement.
Most of this improvement is also obtained by a good initialization procedure.
Prune Search
Look at only those steps that decrease the error.
If the present query has a larger cardinality than the constraint, only consider steps that make the filters more selective.
30-40% improvement.
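The pruning rule can be sketched as a candidate-step generator that proposes only the error-reducing direction. This is a hypothetical sketch: it assumes each parameter is an upper bound (attr <= p), so decreasing p tightens the filter; the deck's between-parameters would add translate moves.

```python
def candidate_steps(params, achieved, target, step):
    """Yield pruned candidate parameter vectors: if the query already
    returns more tuples than the constraint (achieved > target), propose
    only tightening steps; otherwise only loosening ones.  Assumes every
    parameter is an upper bound, where a smaller value is tighter."""
    delta = -step if achieved > target else step
    for i in range(len(params)):
        cand = params[:]
        cand[i] += delta
        yield cand
```

Compared with trying both directions per parameter, this halves the number of cardinality probes per iteration, consistent with the 30-40% improvement reported above.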
Pruning Search
[Chart: average relative error (0-25) vs time (0-200 seconds), with and without pruning]
Initial Point
Random:
May not converge to the global optimum.
Convergence much slower.
LSE/Power: usually converge to the global optimum, with much faster convergence.
Especially in the 6-parameter query: the search does not converge to the global optimum and gets stuck.
Multiple Start Points
Searches from some start points do not give the global optimum.
In practice, a few start points give the global optimum.
Problem Summary
Create a query for testing a module.
The query is not random but must satisfy some constraints.
It must satisfy cardinality constraints, given the freedom to select some parametric filters.
Algorithm: Summary
Theoretical walk-based algorithm.
Halving search is good in practice.
Use:
good initialization (optimizer/executor mix)
pruning
DTA indices.
Cost: that of 10-100 query executions and optimizer calls.