
Parametric Query Generation


Page 1: Parametric Query Generation

Parametric Query Generation

Student: Dilys Thomas

Mentor: Nico Bruno

Manager: Surajit Chaudhuri

Page 2: Parametric Query Generation

Problem Statement

Given queries with parametric filters, find values of the parameters so that cardinality constraints are satisfied on a given fixed database.

Constraints: cardinality constraints on the query and its subexpressions.

Parameters: simple filters.

Page 3: Parametric Query Generation

Example

Each parametric query is followed by its required cardinality.

Select * from testR where (testR.v1 between %f and %f) : 100,000

Select * from testS where (testS.v1 <= %f) : 17,000

Select * from testR, testS where (testR.v1 = testS.v0) and (testS.v1 <= %f) and (testR.v0 >= %f) and (testR.v1 between %f and %f) : 30,000

Page 4: Parametric Query Generation

Motivation

Generation of queries to test the optimizer.

The RAGS tool is currently available to generate random queries syntactically and test for errors by a majority vote.

Page 5: Parametric Query Generation

Motivation

Needed to test different modules and new algorithms, test the statistics estimator, and compare performance.

The queries should not be random; they should satisfy some constraints.

Page 6: Parametric Query Generation

Solution exists? NP-complete.

For n parametric attributes with joins.

The database has only O(n) tuples.

Reduction from SUBSET SUM, even for a single constraint.

Page 7: Parametric Query Generation

Model

For a given set of parameter values, the cardinality can be found by a function invocation.

Implemented by:
  • Actually running the query (slow, accurate)
  • Using optimizer estimates of the cardinality (fast, inaccurate)
  • Using an intermediate data structure

Objective: minimize the number of cardinality estimation calls.

n^k

Page 8: Parametric Query Generation

Understanding the Problem: Simplification

k single-sided (<=) attribute parameters, a single relation, and a single constraint.

Let n = number of distinct values in each attribute, k = number of attributes.

Simple algorithm: try every combination of parameter values, time n^k.

Can we do better? 1 dimension: yes, binary search.
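The one-dimensional case can be sketched as follows. This is a minimal illustration, assuming a single `<=` filter whose cardinality is non-decreasing in the parameter. The names `binary_search_parameter`, `card`, and the toy column `data` are hypothetical and stand in for a real cardinality call.

```python
from bisect import bisect_right

def binary_search_parameter(card, values, target):
    """Return (value, calls): the smallest candidate value whose cardinality
    reaches `target`, and the number of cardinality calls made.

    `values` must be sorted ascending and `card` non-decreasing over it.
    If no value reaches the target, the largest value is returned.
    """
    lo, hi = 0, len(values) - 1
    calls = 0
    while lo < hi:
        mid = (lo + hi) // 2
        calls += 1
        if card(values[mid]) < target:
            lo = mid + 1      # too few tuples: move right
        else:
            hi = mid          # enough tuples: keep mid as a candidate
    return values[lo], calls

# Toy column (an assumption for illustration, not data from the slides).
data = [1, 2, 2, 3, 5, 8, 8, 9]
distinct = sorted(set(data))

def card(v):
    # Number of tuples satisfying "column <= v" on the toy column.
    return bisect_right(data, v)

result = binary_search_parameter(card, distinct, target=5)
print(result)
```

On this toy column the search settles on the value 5 after 3 cardinality calls, in line with the log n bound for one dimension.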

Page 9: Parametric Query Generation

Results:

Dimension    Upper Bound    Lower Bound
1            log n          log n
2            n              n
k >= 2       n^(k-1)        C(n+k-1, k-1)

Page 10: Parametric Query Generation

2 Dimension Algorithm

Walk based algorithm

Search for 20:

25  27  35  43  50
10  15  21  27  35
 8  12  14  20  22
 2   5   8  20  22
 1   3   7  10  20
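A walk of this kind can be sketched as a staircase search, assuming the cardinality is non-decreasing in both parameters. The grid below is the one from the slide, stored bottom row first so that both indices increase with the cardinality. The function name `walk_search` and the callback `card` are hypothetical.

```python
def walk_search(card, n_rows, n_cols, target):
    """Staircase walk over a grid where card(i, j) is non-decreasing in
    both i and j. Returns ((i, j), calls), or (None, calls) if absent."""
    i, j = n_rows - 1, 0          # start at the largest i, smallest j
    calls = 0
    while i >= 0 and j < n_cols:
        value = card(i, j)
        calls += 1
        if value == target:
            return (i, j), calls
        if value > target:
            i -= 1                # too many tuples: tighten parameter i
        else:
            j += 1                # too few tuples: relax parameter j
    return None, calls

# Grid from the slide, bottom row first.
grid = [
    [1, 3, 7, 10, 20],
    [2, 5, 8, 20, 22],
    [8, 12, 14, 20, 22],
    [10, 15, 21, 27, 35],
    [25, 27, 35, 43, 50],
]

found = walk_search(lambda i, j: grid[i][j], 5, 5, target=20)
print(found)
```

Each call discards either a row or a column, so the walk makes at most 2n - 1 cardinality calls on an n x n grid. On the slide's grid it visits 25, 10, 15, 21, 14 and then finds 20.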

Page 11: Parametric Query Generation

Lower Bound

Incomparable set:

25  27  35  43  50
10  15  21  27  35
 8  12  14  20  22
 2   5   8  20  22
 1   3   7  10  21

Page 12: Parametric Query Generation

For general k

Upper bound: for k dimensions, recursively make n invocations of the (k-1)-dimension algorithm.
T(k) = n * T(k-1), T(2) = n, hence T(k) = n^(k-1) (multiple walk algorithm).

Lower bound: x_1 + x_2 + … + x_k = n has C(n+k-1, k-1) solutions.
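The two bounds can be worked out explicitly. The derivation below unrolls the recurrence and, assuming the solutions counted are non-negative integer tuples (the usual stars-and-bars reading), compares the count with the upper bound.

```latex
\begin{align*}
T(k) &= n \cdot T(k-1) = n^{2} \cdot T(k-2) = \dots = n^{k-2} \cdot T(2) = n^{k-1}, \\
\#\{(x_1,\dots,x_k) \in \mathbb{Z}_{\ge 0}^{k} : x_1 + \dots + x_k = n\}
  &= \binom{n+k-1}{k-1}
   = \frac{(n+1)(n+2)\cdots(n+k-1)}{(k-1)!}
   \;\ge\; \frac{n^{k-1}}{(k-1)!}.
\end{align*}
```

If the count is read as the lower bound on the number of calls, as the slide suggests, then for constant k the upper and lower bounds differ by at most a factor of (k-1)!.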

Page 13: Parametric Query Generation

Optimization Problem: Error Metrics

Single constraint: constraint cardinality C, achieved cardinality D.

RelErr = max(C/D, D/C)

Multiple constraints: combining the errors: average relative error across all constraints.

Objective: minimize error.
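The error metric can be written down directly. This is a small sketch: the function names are hypothetical, the treatment of an empty result as infinite error is an assumption, and the achieved cardinalities in the usage lines are made-up numbers for illustration only.

```python
def rel_err(target, achieved):
    """RelErr = max(C/D, D/C). An empty result (D = 0) is treated as
    infinitely bad, which is an assumption of this sketch."""
    if target <= 0 or achieved <= 0:
        return float("inf")
    return max(target / achieved, achieved / target)

def avg_rel_err(targets, achieved):
    """Average relative error across all constraints."""
    errors = [rel_err(c, d) for c, d in zip(targets, achieved)]
    return sum(errors) / len(errors)

# Target cardinalities from the example slide; achieved values are hypothetical.
targets = [100_000, 17_000, 30_000]
achieved = [100_000, 34_000, 15_000]
print(avg_rel_err(targets, achieved))
```

A perfect match gives RelErr = 1, and overshooting or undershooting by the same factor is penalized equally.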


Page 14: Parametric Query Generation

Simple Walk

STEP = unit change in the current parameter values

While (can improve with step)
{Make the improving step}

Stepsize = 1 tuple -> converges to local optima

Stepsize small -> convergence slow

Page 15: Parametric Query Generation

Simple Walk -> Halving Walk

Initialize the parameters (point).
Each stepsize = 1.0 quantile
for (int i = 0; i < maxhalve; i++)
{
    while (can improve with step)
        {Make the improving step}
    // exited above loop -> cannot improve with local steps
    Halve all step sizes.
}

Use quantiles to decide steps.
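The halving loop can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: parameters are quantiles in [0, 1], only LEFT and RIGHT moves are tried, the best improving step is taken each time, and `toy_error` models a single relation with two independent uniform attributes. All names are hypothetical.

```python
def rel_err(target, achieved):
    """Relative error max(C/D, D/C); infinite when nothing is returned."""
    if achieved <= 0:
        return float("inf")
    return max(target / achieved, achieved / target)

def halving_walk(error, start, max_halve=10):
    """Local search over quantiles in [0, 1] with step sizes that halve."""
    point = list(start)
    steps = [1.0] * len(point)           # initial step size: 1.0 quantile
    best = error(point)
    for _ in range(max_halve):
        improved = True
        while improved:                  # walk while some step improves
            improved = False
            best_move = None
            for i, step in enumerate(steps):
                for direction in (-1.0, 1.0):      # LEFT move, RIGHT move
                    cand = list(point)
                    cand[i] = min(1.0, max(0.0, cand[i] + direction * step))
                    err = error(cand)
                    if err < best:
                        best, best_move = err, cand
            if best_move is not None:
                point, improved = best_move, True
        steps = [s / 2 for s in steps]   # cannot improve: halve all steps
    return point, best

# Toy model (assumption): 1000 tuples, two independent uniform attributes,
# so cardinality = 1000 * q0 * q1; the target cardinality is 250.
N_TUPLES, TARGET = 1000, 250

def toy_error(q):
    return rel_err(TARGET, N_TUPLES * q[0] * q[1])

point, err = halving_walk(toy_error, start=[1.0, 1.0])
print(point, err)
```

Starting from the unfiltered query, the walk cannot improve with full-quantile steps, halves the step size, and then reaches the quantiles (0.5, 0.5) with relative error 1.0.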

Page 16: Parametric Query Generation

Halving Walk

Initializing the parameters [more later]

Steps are made in the quantile domain of each attribute, done by a simple equi-depth wrapper over the histograms provided by SQL Server.

Initial stepsize = 1.0 quantile
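To illustrate stepping in the quantile domain, the sketch below maps a quantile to an attribute value through equi-depth bucket boundaries. It builds the boundaries from raw values, which is an assumption made to keep the example self-contained; the slides wrap histograms supplied by SQL Server instead. The function names are hypothetical.

```python
def equidepth_boundaries(values, n_buckets):
    """Bucket boundaries such that each bucket holds roughly the same
    number of values (built from raw data for illustration)."""
    data = sorted(values)
    cuts = [data[(i * len(data)) // n_buckets] for i in range(n_buckets)]
    return cuts + [data[-1]]

def quantile_to_value(boundaries, q):
    """Map a quantile q in [0, 1] to an attribute value by linear
    interpolation inside the bucket that contains q."""
    n = len(boundaries) - 1
    pos = min(max(q, 0.0), 1.0) * n
    bucket = min(int(pos), n - 1)
    frac = pos - bucket
    low, high = boundaries[bucket], boundaries[bucket + 1]
    return low + frac * (high - low)

boundaries = equidepth_boundaries(range(101), n_buckets=4)
print(boundaries)
print(quantile_to_value(boundaries, 0.5))
```

A step of a fixed quantile size then changes the number of qualifying tuples by roughly the same amount wherever it is taken, regardless of how skewed the attribute values are.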

Page 17: Parametric Query Generation

Halving Walk: Steps considered

For <=, >= parameters:

RIGHT move, LEFT move

For between parameters:

In addition to RIGHT move and LEFT move for each parameter: LEFT translate, RIGHT translate.
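The candidate steps can be enumerated as below. This is a sketch assuming parameters live in the quantile domain [0, 1]; a one-sided parameter is represented as a 1-tuple and a between parameter as a (low, high) pair. The representation and the name `moves_for` are hypothetical.

```python
def clamp(q):
    """Keep a quantile inside [0, 1]."""
    return min(1.0, max(0.0, q))

def moves_for(param, step):
    """Candidate moves for one filter.

    One-sided (<= or >=): LEFT move and RIGHT move.
    Between: LEFT/RIGHT move of each endpoint, plus LEFT/RIGHT translate
    of the whole interval.
    """
    if len(param) == 1:
        (q,) = param
        return [(clamp(q - step),), (clamp(q + step),)]
    low, high = param
    candidates = [
        (clamp(low - step), high),                 # LEFT move, lower endpoint
        (clamp(low + step), high),                 # RIGHT move, lower endpoint
        (low, clamp(high - step)),                 # LEFT move, upper endpoint
        (low, clamp(high + step)),                 # RIGHT move, upper endpoint
        (clamp(low - step), clamp(high - step)),   # LEFT translate
        (clamp(low + step), clamp(high + step)),   # RIGHT translate
    ]
    # Drop moves that would make the interval empty.
    return [c for c in candidates if c[0] <= c[1]]

print(moves_for((0.5,), 0.25))
print(moves_for((0.25, 0.5), 0.125))
```

A one-sided parameter contributes two candidates and a between parameter at most six, so the number of steps examined per iteration stays constant per parameter.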

Page 18: Parametric Query Generation

Algorithm Halving-Steps

A generalization of binary search // but only a heuristic.

Converges to local optima.

Number of steps per iteration: constant.

Hence much faster convergence.

Page 19: Parametric Query Generation

Initialization

  • Random
  • Optimizer estimate
  • Solving equations: power method, least square error

Page 20: Parametric Query Generation

Least Squares Initialization

For each parametric attribute P_i, have a variable p_i.

For each constraint, build an equation:

Cardinality without parametric filters: C
Constraint cardinality with filters: F
Then filter selectivity S = F/C
If P_1, P_2, …, P_k are the parameters in this constraint, write the equation p_1 * p_2 * … * p_k = S (making the independence assumption).

Page 21: Parametric Query Generation

Least Squares Initialization

In log space: a set of linear equations.

May have a single solution, multiple solutions, or no solution!

Use the solution that minimizes the least squares error metric.

Because the equations are in log space, this amounts to minimizing the sum of squares (L_2) of the relative errors.

Simple and fast initialization.
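A small worked example may help. The three constraints and their selectivities below are hypothetical and deliberately inconsistent under the independence assumption (0.1 * 0.2 = 0.02, not 0.04), so no exact solution exists and least squares picks a compromise.

```latex
\begin{align*}
\text{Constraints:}\quad & p_1 = 0.1, \qquad p_2 = 0.2, \qquad p_1 p_2 = 0.04 \\
\text{Log space } (x_i = \ln p_i):\quad & x_1 = a, \qquad x_2 = b, \qquad x_1 + x_2 = c,
  \qquad a = \ln 0.1,\; b = \ln 0.2,\; c = \ln 0.04 \\
\text{Minimize:}\quad & (x_1 - a)^2 + (x_2 - b)^2 + (x_1 + x_2 - c)^2 \\
\text{Normal equations:}\quad & 2x_1 + x_2 = a + c, \qquad x_1 + 2x_2 = b + c \\
\text{Solution:}\quad & x_1 = a + \tfrac{r}{3}, \qquad x_2 = b + \tfrac{r}{3}, \qquad r = c - a - b = \ln 2 \\
\text{Back in selectivities:}\quad & p_1 = 0.1 \cdot 2^{1/3} \approx 0.126, \qquad p_2 = 0.2 \cdot 2^{1/3} \approx 0.252
\end{align*}
```

Each of the three constraints ends up off by the same factor, 2^(1/3) (about 1.26): the log-space residuals are r/3, r/3 and -r/3, which is how a least squares fit in log space spreads relative error evenly.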

Page 22: Parametric Query Generation

Why still INIT step = 1.0 quantile?

Big jumps occur in the algorithm in spite of a good start point: optimizer estimates and independence assumptions may not be valid in the presence of correlated columns.

Page 23: Parametric Query Generation

Efficiency: Statistics vs Execution

The optimizer is used for cardinality estimation, but the executor is used to verify the final step taken.

For a step on which the optimizer (estimates a decrease) and the executor (evaluates an increase) disagree, switch to using only the executor for cardinality estimation.

Good initialization obviates optimizer use.

Page 24: Parametric Query Generation

Shortcutting

Traverse the parameters in random order.

Make the first step that decreases the error.

(Compare to the previous approach of trying all steps and making the “best” step, the one that decreases the error most.)

No significant benefit. Shortcutting doesn’t seem to help; in fact, convergence is sometimes slower.

Page 25: Parametric Query Generation

Experimental Results

Dataset description: tables testR, testS, tesT, tableTA with up to 1M tuples.

The tables have correlated columns and multiple correlated foreign key join columns. Columns include different Zipfian (1, 0.5) and Gaussian distributions.

Queries description: queries join over correlated columns and have multiple correlated selectivities.

Page 26: Parametric Query Generation

Query Description:

Eg 1: 6 correlated parameters, 1 constraint, single relation.

Eg 2: 3 tables with 6 constraints, including 2-way and 3-way join constraints. Filters on correlated columns across joins.

Other queries with constraints over joins and many parameters over correlated attributes.

Page 27: Parametric Query Generation

ERROR vs TIME graph

[Figure: Error vs Time. x-axis: Time (seconds), 0 to 250; y-axis: Avg Relative Error, 0 to 80.]

Page 28: Parametric Query Generation

Problem Specifics: Reusing Results

Lots of queries with the same skeleton but different parameters.

Creation of indices will help!

Use DTA for recommendations.

2-10 fold improvement in speed.

Page 29: Parametric Query Generation

Using the DTA for index creation

[Figure: Use of Indices. x-axis: Time (Seconds), 0 to 1500; y-axis: Avg Relative Error, 0 to 3; series: Without Indices, With Indices.]

Page 30: Parametric Query Generation

Interleaving OPT and Exec

Using the optimizer to guide the search gives a 2-10 times improvement.

Most of this improvement is also obtained by a good initialization procedure.

Page 31: Parametric Query Generation

Prune Search

Look at only those steps that decrease the error.

If the present query has a larger cardinality than the constraint, only consider steps that lower the filter selectivity (admit fewer tuples).

30-40% improvement.
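The pruning rule can be sketched for the simplest case: one `<=` parameter in the quantile domain and a single constraint. The function name and arguments are hypothetical, and handling several interacting constraints would need more care than this sketch shows.

```python
def pruned_moves(quantile, step, current_card, target_card):
    """Candidate moves for one `<=` parameter, keeping only the direction
    that can reduce the error for a single constraint."""
    left = max(0.0, quantile - step)    # admits fewer tuples
    right = min(1.0, quantile + step)   # admits more tuples
    if current_card > target_card:
        return [left]                   # too many tuples: only tighten
    if current_card < target_card:
        return [right]                  # too few tuples: only relax
    return []                           # constraint already met

print(pruned_moves(0.5, 0.25, current_card=40_000, target_card=30_000))
print(pruned_moves(0.5, 0.25, current_card=20_000, target_card=30_000))
```

For a one-sided parameter this halves the number of cardinality calls per iteration, since the move in the wrong direction is never evaluated.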

Page 32: Parametric Query Generation

Pruning Search

[Figure: Pruning. x-axis: Time (Seconds), 0 to 200; y-axis: Avg relative error, 0 to 25; series: With Pruning, Without Pruning.]

Page 33: Parametric Query Generation

Initial Point

Random: may not converge to the global optimum. Convergence is much slower.

LSE/Power: usually converge to the global optimum. Much faster convergence.

Especially in the 6-parameter query: does not converge to the global optimum; gets stuck.

Page 34: Parametric Query Generation

Multiple start points

A search from a single start point may not reach the global optimum.

In practice, a few start points give the global optimum.
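A multi-start wrapper can be sketched as below. The local search here is a plain one-dimensional hill climb over integer parameter values, chosen only to keep the example short; the error table is hypothetical and has one local optimum (index 2) and one global optimum (index 7). All names are assumptions of this sketch.

```python
import random

def hill_climb(error, start, lo, hi):
    """Move to the better neighbour until no neighbour improves."""
    x = start
    while True:
        neighbours = [n for n in (x - 1, x + 1) if lo <= n <= hi]
        best = min(neighbours, key=error)
        if error(best) >= error(x):
            return x                      # local optimum reached
        x = best

def multi_start(error, starts, lo, hi):
    """Run the local search from every start point and keep the best."""
    results = [hill_climb(error, s, lo, hi) for s in starts]
    return min(results, key=error)

# Hypothetical error landscape over parameter values 0..9.
errors = [5, 4, 3, 4, 5, 4, 2, 1, 2, 3]

def error(x):
    return errors[x]

print(hill_climb(error, 0, 0, 9))         # stuck in the local optimum
print(multi_start(error, [0, 9], 0, 9))   # a second start finds the global one

# In practice the start points would be drawn at random, for example:
random_starts = random.Random(0).sample(range(10), 3)
```

The cost grows linearly with the number of start points, which is why needing only a few of them matters.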

Page 35: Parametric Query Generation

Problem Summary

Create a query for testing a module.

The query is not random but must satisfy some constraints.

It must satisfy cardinality constraints, given the freedom to select some parametric filters.

Page 36: Parametric Query Generation

Algorithm: Summary

Theoretical walk based algorithm.

Halving search is good in practice.

Use:
  • good initialization (optimizer, executor mix)
  • pruning
  • DTA indices

Cost: that of 10-100 query executions and optimizer calls.