
Evaluating Top-K Selection Queries

Surajit Chaudhuri, Microsoft Research
Luis Gravano, Columbia University



Motivating Example

Find 4-bedroom houses priced at $350,000

Exact matches are often too restrictive

A ranking of the houses closest to the specification is more desirable


Motivating Example (cont.)

Find 4-bedroom houses priced at $350,000

House 1: 5 bedrooms; $400,000; Score=0.9
House 2: 4 bedrooms; $485,000; Score=0.8
House 3: 6 bedrooms; $785,000; Score=0.3


Top-K Queries over Precise Relational Data

Support approximate matches with minimal changes to the relational engine

Initial focus: Selection queries with “equality” conditions


Outline

Definition of top-k queries

Execution alternatives

Mapping of top-k queries to selection queries

Experiments


Top-K Selection Queries

Specify an n-dimensional target point

Define scoring function

Specify k

Answer: k objects with the best score for the target point (i.e., the “top k” objects)


Specifying Top-K Queries using SQL

Select *
From R
Order [k] By Scoring_Function


Scoring Functions Measure Degree of Match

Assume attributes defined over metric space

Score on any one attribute is well defined

How to aggregate scores across attributes?


Scoring Functions

Normalize attribute scores to be in [0,1] range

Combine scores using popular aggregate functions
    Min
    Euclidean
    Sum, Max, …


Some Example Scoring Functions

Let \(q = (q_1, \ldots, q_n)\) be the target point and \(t = (t_1, \ldots, t_n)\) a tuple:

\[ \mathit{Min}(q, t) = \min\{1 - |q_1 - t_1|, \ldots, 1 - |q_n - t_n|\} \]

\[ \mathit{Euclidean}(q, t) = 1 - \sqrt{\frac{(q_1 - t_1)^2}{n} + \cdots + \frac{(q_n - t_n)^2}{n}} \]
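The two definitions above can be sketched in Python as follows. This is an illustration only: the function names `min_score` and `euclidean_score` are chosen here, and the sketch assumes every attribute value has already been normalized to the [0,1] range.

```python
import math
from typing import Sequence


def min_score(q: Sequence[float], t: Sequence[float]) -> float:
    """Min(q, t): the worst per-attribute match, 1 - |q_i - t_i|."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))


def euclidean_score(q: Sequence[float], t: Sequence[float]) -> float:
    """Euclidean(q, t): 1 minus the root of the mean squared difference."""
    n = len(q)
    return 1.0 - math.sqrt(sum((qi - ti) ** 2 for qi, ti in zip(q, t)) / n)
```

Both functions return 1 for a tuple that coincides with the target point and decrease as the tuple moves away from it.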


Executing Top-K Queries

Known techniques require at least one sequential scan (or a functional index)
    Evaluate Scoring_Function for each tuple
    Sort tuples [Carey & Kossmann ‘97; ‘98]

Question: How to avoid sequential scans?

Exploit implicit selectivity of top-k queries
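For reference, the scan-based baseline described above might look like the sketch below: score every tuple, then keep the k best. The in-memory list standing in for relation R, the helper names, and the use of the Min scoring function are assumptions made for illustration.

```python
import heapq
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def min_score(q: Sequence[float], t: Sequence[float]) -> float:
    """Min scoring function over attributes normalized to [0, 1]."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))


def top_k_by_scan(data: List[Point], q: Point, k: int) -> List[Point]:
    """Evaluate the scoring function on every tuple and keep the k best."""
    return heapq.nlargest(k, data, key=lambda t: min_score(q, t))


# Hypothetical two-attribute relation, already normalized.
data = [(0.5, 0.5), (0.9, 0.9), (0.4, 0.6), (0.0, 1.0)]
best = top_k_by_scan(data, (0.5, 0.5), 2)
```

Every tuple is touched once regardless of k, which is the cost the mapping to selection queries tries to avoid.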


Mapping a Top-K Query to a Selection Query

Determine a search score S such that:
    Expected # of tuples with score > S is k
    No false dismissals

Turn the condition that score > S into a range selection condition(s)

Evaluate selection query using existing query processor and access paths


Mapping a Top-K Query to a Selection Query

4 bedrooms; $350,000; k=10

Retrieve all tuples with score > 0.5 (at least k=10 tuples expected)

Analyze scoring function to determine selection range:
    Bedrooms: [3, 5] and Price: [$250K, $450K]


Mapping a Search Score to a Selection Range

For search score \(S\), target point \(q = (q_1, q_2)\), and scoring function Min:

Selection range:
    \(t_1\) IN \([q_1 - (1.0 - S),\ q_1 + (1.0 - S)]\)
    \(t_2\) IN \([q_2 - (1.0 - S),\ q_2 + (1.0 - S)]\)
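The selection range for Min can be computed mechanically from the search score. The sketch below generalizes the two-attribute case above to n attributes; the function names are hypothetical and attributes are assumed to be normalized to [0,1].

```python
from typing import List, Sequence, Tuple


def min_selection_range(q: Sequence[float], s: float) -> List[Tuple[float, float]]:
    """Per-attribute interval [q_i - (1 - S), q_i + (1 - S)] for the Min function."""
    radius = 1.0 - s
    return [(qi - radius, qi + radius) for qi in q]


def in_range(t: Sequence[float], ranges: Sequence[Tuple[float, float]]) -> bool:
    """True if tuple t satisfies every per-attribute range condition."""
    return all(lo <= ti <= hi for ti, (lo, hi) in zip(t, ranges))


ranges = min_selection_range((0.5, 0.5), 0.75)
```

Because Min takes the worst per-attribute match, a tuple scores above S only if every attribute lies within 1 - S of the target, so filtering on these ranges causes no false dismissals.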


Determining a Search Score

Monotonicity: Consider tuple t that is no further from target than t’ on any attribute:
    Score of t should be at least that of t’
    Therefore, score cannot be high “far away” from target
        Sphere for Euclidean
        Box for Min
        …centered at target point

“Tightness” of enclosing range varies with scoring functions
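To make the tightness remark concrete, here is a sketch that derives an enclosing box for the Euclidean function. The derivation is not on the slides; it follows from the Euclidean definition given earlier: Euclidean(q, t) > S holds exactly when the sum of squared differences is below n(1 - S)^2, so each |q_i - t_i| is below sqrt(n)(1 - S). The function names are hypothetical, and clipping to the [0,1] attribute domain is ignored.

```python
import math
from typing import List, Sequence, Tuple


def euclidean_enclosing_box(q: Sequence[float], s: float) -> List[Tuple[float, float]]:
    """Smallest axis-aligned box containing every t with Euclidean(q, t) > s."""
    radius = math.sqrt(len(q)) * (1.0 - s)
    return [(qi - radius, qi + radius) for qi in q]


def ball_to_box_volume_ratio(n: int) -> float:
    """Volume of an n-dimensional ball divided by that of its bounding box."""
    unit_ball_volume = math.pi ** (n / 2) / math.gamma(n / 2 + 1)
    return unit_ball_volume / 2 ** n


box = euclidean_enclosing_box((0.5, 0.5), 0.9)
ratio_2d = ball_to_box_volume_ratio(2)  # pi / 4
ratio_3d = ball_to_box_volume_ratio(3)  # pi / 6
```

For Min the region with score > S is itself a box, so the ratio is 1. For Euclidean the ratio shrinks as n grows, meaning the range query covers more and more space that holds no qualifying tuples.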



The Min Scoring Function


The Euclidean Scoring Function


Comments on Mapping

Search score determines efficiency, not correctness

Issues in efficiency:
    Avoid retrieving too many tuples
    Avoid retrieving fewer than k top tuples (restarts)

How to determine good search scores?
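The sketch below illustrates why the search score affects only efficiency. It filters with the Min selection range, ranks the candidates, and restarts with a lower score when fewer than k tuples come back. The restart policy (lowering the score by a fixed `step`) and all names are assumptions of this sketch; the slides do not specify how a restart picks its new score.

```python
import heapq
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def min_score(q: Sequence[float], t: Sequence[float]) -> float:
    """Min scoring function over attributes normalized to [0, 1]."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))


def top_k_with_restarts(data: List[Point], q: Point, k: int,
                        search_score: float, step: float = 0.1):
    """Return (top-k tuples, number of restarts) using a range filter."""
    s = search_score
    restarts = 0
    while True:
        radius = 1.0 - s
        # Range selection: stands in for the query processor and its indexes.
        candidates = [t for t in data
                      if all(abs(qi - ti) <= radius for qi, ti in zip(q, t))]
        # With k candidates in the box, nothing outside it can beat them.
        if len(candidates) >= k or s <= 0.0:
            top = heapq.nlargest(k, candidates, key=lambda t: min_score(q, t))
            return top, restarts
        s = max(0.0, s - step)  # too few tuples: widen the range and retry
        restarts += 1


data = [(0.5, 0.5), (0.9, 0.9), (0.4, 0.6), (0.0, 1.0)]
tight, tight_restarts = top_k_with_restarts(data, (0.5, 0.5), 2, 0.95)
loose, loose_restarts = top_k_with_restarts(data, (0.5, 0.5), 2, 0.5)
```

Both calls return the same two tuples. The high search score pays with a restart, while the low one retrieves all four tuples in a single pass.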


Determining Search Scores

Find k points in data

Compute their score

Set search score to lowest score

Challenges:
    Determining the initial k points to optimize execution
    Taking original query into account
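A minimal sketch of the three steps above, assuming the Min scoring function and an arbitrary choice of the initial k points (here simply the first k tuples, which is an assumption of this sketch and not a recommendation from the slides):

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def min_score(q: Sequence[float], t: Sequence[float]) -> float:
    """Min scoring function over attributes normalized to [0, 1]."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))


def search_score_from_points(q: Point, points: List[Point]) -> float:
    """Lowest score among the chosen points."""
    return min(min_score(q, p) for p in points)


data = [(0.9, 0.9), (0.5, 0.5), (0.4, 0.6), (0.0, 1.0)]
q, k = (0.5, 0.5), 2
seed_points = data[:k]  # arbitrary pick of k data points
s = search_score_from_points(q, seed_points)
retrieved = [t for t in data if min_score(q, t) >= s]
```

Since the k chosen points themselves score at least s, a query for score >= s always returns at least k tuples. How many extra tuples it returns depends on how well the initial points were chosen, which is the challenge listed above.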


Using Histograms

[Figure: histogram buckets with frequencies 4, 20, 11, and 10, and query point Q]


Picking K Representative “Tuples”

Collapse histogram bucket to a single representative point
    Furthest from Q in bucket (“NoRestarts”)
    Closest to Q in bucket (“Restarts”)

Assign bucket frequency to the single representative point

Include closest representative points until we have k tuples
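The procedure above can be sketched as follows. The `Bucket` layout, the function names, and the bucket boundaries in the example are assumptions made for illustration; only the frequencies 4, 20, 11, and 10 come from the slides. The sketch uses the Min scoring function and treats "closest" as "highest score".

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass(frozen=True)
class Bucket:
    lo: Point   # lower corner of the bucket
    hi: Point   # upper corner of the bucket
    freq: int   # number of tuples the histogram assigns to the bucket


def min_score(q: Sequence[float], t: Sequence[float]) -> float:
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))


def representative(b: Bucket, q: Point, strategy: str) -> Point:
    """Collapse a bucket to its furthest or closest point relative to q."""
    if strategy == "NoRestarts":   # furthest corner, chosen per attribute
        return tuple(lo if abs(qi - lo) >= abs(qi - hi) else hi
                     for qi, lo, hi in zip(q, b.lo, b.hi))
    if strategy == "Restarts":     # closest point: clamp q into the bucket
        return tuple(min(max(qi, lo), hi) for qi, lo, hi in zip(q, b.lo, b.hi))
    raise ValueError(f"unknown strategy: {strategy}")


def search_score(buckets: List[Bucket], q: Point, k: int, strategy: str) -> float:
    """Add representatives, best score first, until they cover k tuples."""
    scored = sorted(((min_score(q, representative(b, q, strategy)), b.freq)
                     for b in buckets), reverse=True)
    covered = 0
    for score, freq in scored:
        covered += freq
        if covered >= k:
            return score
    return 0.0  # histogram holds fewer than k tuples: retrieve everything


# Hypothetical 2x2 grid over [0, 1] x [0, 1].
buckets = [Bucket((0.0, 0.0), (0.5, 0.5), 4), Bucket((0.5, 0.0), (1.0, 0.5), 20),
           Bucket((0.0, 0.5), (0.5, 1.0), 11), Bucket((0.5, 0.5), (1.0, 1.0), 10)]
q = (0.25, 0.25)
pessimistic = search_score(buckets, q, 10, "NoRestarts")
optimistic = search_score(buckets, q, 10, "Restarts")
```

With accurate bucket frequencies, every tuple in an included bucket scores at least as high as that bucket's furthest point, so the NoRestarts score is guaranteed to retrieve k tuples. The Restarts score is higher and retrieves fewer tuples, at the risk of falling short of k.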


Using Histograms: “NoRestarts”

[Figure: histogram buckets with frequencies 4, 20, 11, and 10, and query point Q]


Using Histograms: “Restarts”

[Figure: histogram buckets with frequencies 4, 20, 11, and 10, and query point Q]


Other Strategies for Determining Search Scores

Calculate search score for:
    n = NoRestarts (“pessimistic” extreme)
    r = Restarts (“optimistic” extreme)

Use intermediate scores:
    Inter1 = (2n + r)/3
    Inter2 = (n + 2r)/3

[Figure: search-score axis from 0 to 1 with the NoRestarts and Restarts scores marked]
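As a small worked example of the two intermediate scores, the sketch below evaluates the formulas for hypothetical values n = 0.25 and r = 0.75. The function name and the input values are illustrative only.

```python
from typing import Tuple


def intermediate_scores(n: float, r: float) -> Tuple[float, float]:
    """Inter1 and Inter2: the points one third and two thirds of the way from n to r."""
    inter1 = (2 * n + r) / 3
    inter2 = (n + 2 * r) / 3
    return inter1, inter2


inter1, inter2 = intermediate_scores(0.25, 0.75)
```

Inter1 stays closer to the pessimistic NoRestarts score and Inter2 closer to the optimistic Restarts score, so the four strategies trade retrieved tuples against restart risk in steps.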


Evaluating the Generated Selection Query

Sequential scan

Intersection of a set of indexes, followed by data access
    Special case: index-only access


Indexes and Statistics

Indexes
    n-dim (concatenated-key) B-trees

Statistics
    MaxDiff as base 1-dim histogram
    Multidimensional histograms: AVI, Phased, MHist


Experimental Evaluation

Is mapping to selection queries an effective technique?

Sensitivity of relevant parameters:
    Scoring functions
    Data skew and dimensionality
    Statistics


Data Generation

Characterized by \(Z = \langle z_1, \ldots, z_n \rangle\)

Generate \(N\) tuples by Zipfian distribution \(z_1\)

Group tuples by \(\mathit{attr}_1\)

For a partition with \(\mathit{attr}_1 = a\) with \(N_1\) tuples:
    Generate \(N_1\) values \(w_1, \ldots, w_{N_1}\) using Zipfian distribution \(z_2\)
    Create pairs \((a, w_1), \ldots, (a, w_{N_1})\)

Repeat steps to fill in all attribute values
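One possible reading of the generation steps above is sketched below. Several details are assumptions, since the slides leave them open: the attribute domain is the integers 0 to `domain_size - 1`, the value with rank i has probability proportional to 1/i^z, and for the third and later attributes the tuples are grouped by all attributes filled in so far.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def zipf_values(count: int, z: float, domain_size: int,
                rng: random.Random) -> List[int]:
    """Draw `count` values; the value of rank i has weight 1 / i**z."""
    weights = [1.0 / (rank ** z) for rank in range(1, domain_size + 1)]
    return rng.choices(range(domain_size), weights=weights, k=count)


def generate_zipfian_tuples(n_tuples: int, zs: Sequence[float],
                            domain_size: int, seed: int = 0) -> List[Tuple[int, ...]]:
    """Fill attributes one at a time, partition by partition."""
    rng = random.Random(seed)
    tuples = [(v,) for v in zipf_values(n_tuples, zs[0], domain_size, rng)]
    for z in zs[1:]:
        partitions = Counter(tuples)  # group by the attributes filled so far
        extended = []
        for prefix, size in partitions.items():
            # Within a partition, draw the next attribute from its own Zipfian.
            for w in zipf_values(size, z, domain_size, rng):
                extended.append(prefix + (w,))
        tuples = extended
    return tuples


data = generate_zipfian_tuples(1000, (1.0, 2.0, 0.5), domain_size=50, seed=1)
```

A skew parameter of 0 gives a uniform distribution, and larger values concentrate tuples on a few attribute values.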


Metrics for Comparison

Fraction of data tuples accessed may be compared to:
    Ideal: k
    Worst case: size of data set

% of restarts


Exploring Limits

Intrinsic limitations of range-query approach:
    Enclose actual top-k tuples in tight n-rectangle
    Retrieve all tuples in n-rectangle

Less than 1% of database tuples in n-rectangle (k=10; 100,000 tuples)

Effect of retrieving tuples with score > S using an n-rectangle


Effect of Scoring Functions

Min has little/no gap between target region and enclosing n-rectangle

As k increases, fraction of retrieved tuples grows slowest for Min

Euclidean performs worse
    Less tight n-rectangle


Tuples with Score > S v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)


Effect of Mapping Strategies and Histograms

Multidimensional histograms aid computation of tight search scores

NoRestarts dominates at high data skew


Tuples Retrieved v. Data Skew (PHASED histogram of 5KB; n=3)


Restarts v. Data Skew (PHASED histogram of 5KB; n=3)


Related Work (1)

[Fagin ‘96; ‘98]
    Multimedia attributes with query “subsystem”
    Multiple index scans
    Independence assumption

[Chaudhuri & Gravano ‘96]
    Multimedia attributes with query “subsystem”
    Map top-k queries to “selection” queries
    Independence assumption
    Limited scoring functions


Related Work (2)

[Carey & Kossmann ‘97; ‘98]
    Optimized sorting phase using k

Nearest-neighbor literature

[Donjerkovic & Ramakrishnan ‘99]
    Probabilistic optimization framework
    No multidimensional scoring functions
    Independence assumptions


Summary

Defined mapping of top-k queries to traditional selection queries

Exploit existing database statistics and query processors

Studied effect of scoring functions, data skew, statistics on mapping

Full experimental analysis forthcoming!


Tuples Retrieved v. Histogram Size (Euclidean; n=3; Z21)


Tuples Retrieved v. n (PHASED histogram of 5KB; Z21)


Restarts v. n (PHASED histogram of 5KB; Z21)


Tuples Retrieved v. k (PHASED histogram of 5KB; Z21; n=3)


Restarts v. k (PHASED histogram of 5KB; Z21; n=3)


Restarts v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)


Tuples Retrieved v. Histogram Size (Census Database; PHASED)


Tuples Retrieved v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)


The Sum Scoring Function


The Max Scoring Function