Parametric Query Generation



Student: Dilys Thomas

Mentor: Nico Bruno

Manager: Surajit Chaudhuri

Problem Statement

Given queries with parametric filters, find values of the parameters so that cardinality constraints are satisfied on a given fixed database.

Constraints: cardinality constraints on the query and its subexpressions.

Parameters: simple filters.

Example

Select * from testR where (testR.v1 between %f and %f) : 100,000

Select * from testS where (testS.v1 <= %f) : 17,000

Select * from testR, testS where (testR.v1 = testS.v0) and (testS.v1 <= %f) and (testR.v0 >= %f) and (testR.v1 between %f and %f) : 30,000

Motivation

Generation of queries to test the optimizer.

The RAGS tool is presently available to syntactically generate random queries and test for errors by a majority vote.

Motivation

Needed to test different modules and new algorithms, test the statistics estimator, and compare performance.

Queries are not random: you want them to satisfy some constraints.

Does a solution exist? NP-complete.

For n parametric attributes with joins, even when the database has only O(n) tuples: reduction from SUBSET SUM, even for a single constraint.
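The SUBSET-SUM flavor of the problem can be illustrated with a toy brute-force search (a hypothetical sketch, not the slides' actual reduction): each distinct attribute value contributes a fixed number of tuples, and choosing which values the parametric filter admits so that the counts sum to the target cardinality is exactly subset sum.

```python
from itertools import combinations

# Hypothetical illustration: each distinct value v contributes freq[v]
# tuples when the parametric filter admits it, and we must hit the
# target cardinality exactly.
freq = {1: 3, 2: 7, 3: 12, 4: 5, 5: 9}   # value -> tuple count
target = 19                               # required cardinality

def find_admitted_values(freq, target):
    """Brute-force search over subsets of admitted values (exponential,
    mirroring the NP-completeness of the exact-cardinality problem)."""
    values = list(freq)
    for r in range(len(values) + 1):
        for subset in combinations(values, r):
            if sum(freq[v] for v in subset) == target:
                return set(subset)
    return None

print(find_admitted_values(freq, target))  # → {2, 3}
```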

Model

For a given set of parameters, the cardinality can be found by a function invocation.

Implemented by:
- Actually running the query (slow, accurate)
- Using optimizer estimates of the cardinality (fast, inaccurate)
- Using an intermediate data structure

Objective: minimize the number of cardinality estimation calls.


Understanding the Problem: Simplification

k single-sided <= attribute parameters; a single relation and a single constraint.

Let n = number of distinct values in each attribute, k = number of attributes.

Simple algorithm of time: O(n^k).

Can we do better? In 1 dimension: yes, binary search.
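The 1-dimensional case can be sketched as follows (an illustrative sketch; function names are not from the slides): with a single "attr <= p" parameter, the cardinality is monotone in p, so binary search over the n distinct values finds a target cardinality in O(log n) cardinality calls.

```python
def binary_search_parameter(distinct_values, cardinality, target):
    """distinct_values: sorted distinct attribute values.
    cardinality(p): number of tuples with attr <= p (monotone in p).
    Returns the smallest p with cardinality(p) >= target, or None."""
    lo, hi = 0, len(distinct_values) - 1
    answer = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if cardinality(distinct_values[mid]) >= target:
            answer = distinct_values[mid]
            hi = mid - 1          # try a smaller parameter value
        else:
            lo = mid + 1          # need a larger parameter value
    return answer

# Usage on a toy column:
column = [1, 1, 2, 3, 3, 3, 5, 8, 8, 9]
vals = sorted(set(column))
card = lambda p: sum(1 for x in column if x <= p)
print(binary_search_parameter(vals, card, 6))  # → 3
```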

Results:

Dimension    Upper Bound    Lower Bound
1            log n          log n
2            n              n
k >= 2       n^(k-1)        n^(k-1)

2-Dimension Algorithm

Walk-based algorithm

Search for 20:

25 27 35 43 50
10 15 21 27 35
 8 12 14 20 22
 2  5  8 20 22
 1  3  7 10 20
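The walk over the matrix above can be sketched as a "staircase" search (a sketch; the function name is illustrative): rows increase left-to-right and columns decrease top-to-bottom, so a single walk from the top-left corner probes O(rows + cols) cells.

```python
def staircase_search(matrix, target):
    """Return (row, col) of target, or None. Rows are ascending
    left-to-right; each column is descending top-to-bottom."""
    r, c = 0, 0
    while r < len(matrix) and c < len(matrix[0]):
        v = matrix[r][c]
        if v == target:
            return (r, c)
        if v > target:
            r += 1   # everything right of (r, c) in this row is larger
        else:
            c += 1   # everything below (r, c) in this column is smaller
    return None

grid = [[25, 27, 35, 43, 50],
        [10, 15, 21, 27, 35],
        [ 8, 12, 14, 20, 22],
        [ 2,  5,  8, 20, 22],
        [ 1,  3,  7, 10, 20]]
print(staircase_search(grid, 20))  # → (2, 3)
```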

Lower Bound

Incomparable set:

25 27 35 43 50
10 15 21 27 35
 8 12 14 20 22
 2  5  8 20 22
 1  3  7 10 21

For general k

Upper bound: for k dimensions, recursively call n invocations of the (k-1)-dimension algorithm.

T(k) = n * T(k-1), T(2) = n. Hence T(k) = n^(k-1) (multiple-walk algorithm).

Lower bound: x_1 + x_2 + … + x_k = n has C(n+k-1, k-1) solutions.

Optimization Problem: Error Metrics

Single constraint: constraint cardinality C, achieved cardinality D.

RelErr = max(C/D, D/C)

Multiple constraints: combine the errors as the average relative error across all constraints.

Objective: minimize the error.
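The error metric above is straightforward to transcribe (the helper names are illustrative): RelErr = max(C/D, D/C) per constraint, averaged over all constraints.

```python
def rel_err(constraint_card, achieved_card):
    """Per-constraint relative error; >= 1.0, with 1.0 an exact match."""
    c, d = float(constraint_card), float(achieved_card)
    return max(c / d, d / c)

def avg_rel_err(pairs):
    """pairs: iterable of (constraint cardinality, achieved cardinality)."""
    pairs = list(pairs)
    return sum(rel_err(c, d) for c, d in pairs) / len(pairs)

print(rel_err(100, 50))                    # → 2.0
print(avg_rel_err([(100, 50), (30, 60)]))  # → 2.0
```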


Simple Walk

STEP = unit change in the current parameter values.

while (can improve with step)
    { make the improving step }

Stepsize = 1 tuple -> converges to a local optimum.
Stepsize small -> slow convergence.

Simple Walk -> Halving Walk

Initialize the parameters (point).
Each stepsize = 1.0 quantile.

for (int i = 0; i < maxhalve; i++) {
    while (can improve with step)
        { make the improving step }
    // exited above loop -> cannot improve with local steps
    halve all step sizes;
}

Use quantiles to decide the steps.
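The halving walk can be sketched in runnable form as follows (a minimal sketch: the toy error function and names are illustrative, not from the slides). Parameters live in the quantile domain [0, 1]; the step size starts at 1.0 quantile and is halved whenever no local step improves the error.

```python
def halving_walk(error, params, max_halve=20):
    """Greedy coordinate walk with geometrically shrinking steps."""
    step = 1.0
    for _ in range(max_halve):
        improved = True
        while improved:
            improved = False
            for i in range(len(params)):
                for delta in (step, -step):
                    trial = list(params)
                    trial[i] = min(1.0, max(0.0, trial[i] + delta))
                    if error(trial) < error(params):
                        params = trial
                        improved = True
        step /= 2.0  # no improving step at this granularity
    return params

# Toy single-constraint problem: selectivity = product of quantiles,
# target selectivity 0.1; error is RelErr = max(C/D, D/C).
target = 0.1
card = lambda p: max(p[0] * p[1], 1e-9)
err = lambda p: max(target / card(p), card(p) / target)
best = halving_walk(err, [1.0, 1.0])
print(err(best))  # close to 1.0, the minimum possible relative error
```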

Halving Walk

Initializing the parameters [more later].

Steps are made in the quantile domain of each attribute, done by a simple equidepth wrapper over the histograms provided by SQL Server.

Initial stepsize = 1.0 quantile.

Halving Walk: Steps Considered

For <= and >= parameters: RIGHT move, LEFT move.

For between parameters: apart from a RIGHT move and LEFT move for each endpoint, a LEFT translate and RIGHT translate.

Algorithm Halving-Steps

A generalization of binary search

// But only a heuristic.

Converges to a local optimum.

#Steps per iteration: constant.

Hence much faster convergence.

Initialization

- Random
- Optimizer estimate
- Solving equations:

  - Power method.

  - Least Squares Error.

Least Squares Initialization

For each parametric attribute Pi, have a variable pi.

For each constraint, build an equation:

Cardinality without the parametric filters: C. Constraint cardinality with the filters: F. Then the filter selectivity S = F/C.

If P1, P2, P3, …, Pk are the parameters in this constraint, write the equation p1 * p2 * … * pk = S (making an independence assumption).

Least Squares Initialization

In log space: a set of linear equations.

May have a single solution, multiple solutions, or none!

Use the solution that minimizes the least-squares error metric.

In log space this amounts to minimizing the sum (L_2) of the relative errors.

Simple and fast initialization.
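The log-space least-squares initialization can be sketched in pure Python as follows (a sketch under the slides' independence assumption; the constraint encoding and helper name are illustrative). Each constraint p_i1 * … * p_ik = S becomes the linear equation x_i1 + … + x_ik = log S, solved via the normal equations.

```python
import math

def least_squares_init(constraints, num_params):
    """constraints: list of (set of parameter indices, selectivity S).
    Returns per-parameter selectivities minimizing the L2 log-space error."""
    # Build A (0/1 incidence matrix) and b (log selectivities).
    A = [[1.0 if j in idxs else 0.0 for j in range(num_params)]
         for idxs, _ in constraints]
    b = [math.log(s) for _, s in constraints]
    n, m = num_params, len(A)
    # Normal equations: M x = v with M = A^T A, v = A^T b.
    M = [[sum(A[r][i] * A[r][j] for r in range(m)) for j in range(n)]
         for i in range(n)]
    v = [sum(A[r][i] * b[r] for r in range(m)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n):
                M[r][c] -= f * M[col][c]
            v[r] -= f * v[col]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (v[i] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return [math.exp(xi) for xi in x]  # back from log space

# Three constraints over three parameters (independence assumption):
sel = least_squares_init([({0, 1}, 0.06), ({1, 2}, 0.12), ({0, 2}, 0.08)], 3)
print([round(s, 3) for s in sel])  # → [0.2, 0.3, 0.4]
```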

Why still INIT step = 1.0 quantile?

Big jumps occur in the algorithm in spite of a good start point: optimizer estimates and independence assumptions may not be valid in the presence of correlated columns.

Efficiency: Statistics vs Execution

The optimizer is used for cardinality estimation, but the executor is used to verify the final step taken.

When, for a step, the optimizer (which estimates a decrease) and the executor (which evaluates an increase) disagree, switch to using only the executor for cardinality estimation.

A good initialization obviates optimizer use.

Shortcutting

Traverse the parameters in random order.

Make the first step that decreases the error.

(Compare to the previous approach of trying all steps and making the "best" step that decreases the error most.)

No significant benefit: shortcutting doesn't seem to help; in fact, it sometimes slows convergence.

Experimental Results

Dataset description: tables testR, testS, testT, tableTA with up to 1M tuples.

They have correlated columns and multiple correlated foreign-key join columns. Columns include different Zipfian(1, 0.5) and Gaussian distributions.

Queries description: queries join over correlated columns and have multiple correlated selectivities.

Query description:

Eg 1: 6 correlated parameters, 1 constraint, a single relation.

Eg 2: 3 tables with 6 constraints, including 2-way and 3-way join constraints; filters on correlated columns across joins.

Other queries with constraints over joins and many parameters over correlated attributes.

Error vs Time Graph

[Figure "Error vs Time": average relative error (0 to 80) vs time (0 to 250 seconds)]

Problem Specifics: Reusing Results

Lots of queries with the same skeleton but different parameters.

Creation of indices will help!

Use the DTA for recommendations.

2-10 fold improvement in speed.

Using the DTA for Index Creation

[Figure "Use of Indices": average relative error (0 to 3) vs time (0 to 1500 seconds), with and without indices]

Interleaving OPT and Exec

Using the optimizer to guide the search gives a 2-10x improvement.

Most of this improvement is also obtained by a good initialization procedure.

Prune Search

Look at only those steps that can decrease the error: if the present query has a larger cardinality than the constraint, only make the filters more selective.

30-40% improvement.
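The pruning rule can be sketched as follows (a sketch with a hypothetical step representation): when the achieved cardinality already exceeds the constraint, only steps that tighten the filters can reduce the error, so the other steps are skipped without spending a cardinality call on them.

```python
def pruned_steps(candidate_steps, achieved, constraint):
    """candidate_steps: list of (step, tightens_filter: bool).
    Keeps only the steps that move cardinality toward the constraint."""
    if achieved > constraint:      # too many tuples -> tighten filters
        return [s for s, tightens in candidate_steps if tightens]
    if achieved < constraint:      # too few tuples  -> loosen filters
        return [s for s, tightens in candidate_steps if not tightens]
    return []                      # constraint met exactly

steps = [("left", True), ("right", False)]
print(pruned_steps(steps, achieved=50_000, constraint=30_000))  # → ['left']
```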

Pruning Search

[Figure "Pruning": average relative error (0 to 25) vs time (0 to 200 seconds), with and without pruning]

Initial Point

Random: may not converge to the global optimum, and convergence is much slower.

LSE/Power: usually converges to the global optimum, with much faster convergence.

Exception: in the 6-parameter query it does not converge to the global optimum; it gets stuck.

Multiple Start Points

Searches from some start points do not give the global optimum, but in practice a few start points give the global optimum.

Problem Summary

Create a query for testing a module.

The query is not random but must satisfy some constraints.

It must satisfy cardinality constraints, given the freedom to select some parametric filters.

Algorithm: Summary

Theoretical walk-based algorithm; halving search is good in practice. Use:
- good initialization (an optimizer/executor mix)
- pruning
- DTA indices

Cost: that of 10-100 query executions and optimizer calls.
