1 Evaluating Top-K Selection Queries Surajit Chaudhuri Microsoft Research Luis Gravano Columbia...

1

Evaluating Top-K Selection Queries

Surajit Chaudhuri, Microsoft Research

Luis Gravano, Columbia University

2

Motivating Example

Find 4-bedroom houses priced at $350,000

Exact matches are often too restrictive

Ranking of houses closest to the specification is more desirable

3

Motivating Example (cont.)

Find 4-bedroom houses priced at $350,000

House 1: 5 bedrooms; $400,000; Score=0.9
House 2: 4 bedrooms; $485,000; Score=0.8
House 3: 6 bedrooms; $785,000; Score=0.3

4

Top-K Queries over Precise Relational Data

Support approximate matches with minimal changes to the relational engine

Initial focus: Selection queries with “equality” conditions

5

Outline

Definition of top-k queries
Execution alternatives
Mapping of top-k queries to selection queries
Experiments

6

Top-K Selection Queries

Specify an n-dimensional target point
Define scoring function
Specify k

Answer: k objects with the best score for the target point (i.e., the “top k” objects)

7

Specifying Top-K Queries using SQL

Select *
From R
Order [k] By Scoring_Function

8

Scoring Functions Measure Degree of Match

Assume attributes defined over metric space

Score on any one attribute is well defined

How to aggregate scores across attributes?

9

Scoring Functions

Normalize attribute scores to be in [0,1] range

Combine scores using popular aggregate functions:
  Min
  Euclidean
  Sum, Max, …

10

Some Example Scoring Functions

Let $q = (q_1, \ldots, q_n)$ be the target point and $t = (t_1, \ldots, t_n)$ a tuple:

$\mathrm{Min}(q, t) = \min\{1 - |q_1 - t_1|, \ldots, 1 - |q_n - t_n|\}$

$\mathrm{Euclidean}(q, t) = 1 - \sqrt{(q_1 - t_1)^2/n + \cdots + (q_n - t_n)^2/n}$
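The two scoring functions on this slide can be sketched in a few lines of Python. This is only an illustration: it assumes attribute values are already normalized to [0,1], and the function names `min_score` and `euclidean_score` are made up here, not taken from the talk.

```python
import math

def min_score(q, t):
    """Min(q, t): the worst per-attribute closeness, 1 - |q_i - t_i|."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def euclidean_score(q, t):
    """Euclidean(q, t): 1 minus the root of the mean squared difference."""
    n = len(q)
    return 1.0 - math.sqrt(sum((qi - ti) ** 2 for qi, ti in zip(q, t)) / n)

# Assumed example values (normalized attributes), not data from the slides.
q = (0.5, 0.5)
t = (0.5, 0.75)
print(min_score(q, t))        # 0.75: the second attribute is 0.25 away
print(euclidean_score(q, t))  # higher than Min, since the first attribute matches exactly
```

Both functions return 1.0 for an exact match, and Min is the stricter of the two because a single distant attribute determines the score.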

11

Executing Top-K Queries

Known techniques require at least one sequential scan (or a functional index):
  Evaluate Scoring_Function for each tuple
  Sort tuples [Carey & Kossman ‘97; ‘98]

Question: How to avoid sequential scans?
  Exploit implicit selectivity of top-k queries
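For contrast with the approach developed next, the scan-based baseline can be sketched as follows. This is a simplified illustration, not the cited implementation: the relation is an in-memory list, `top_k_scan` is a hypothetical name, and a bounded heap stands in for the sort step.

```python
import heapq

def min_score(q, t):
    """Min scoring function, assuming attributes normalized to [0,1]."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def top_k_scan(tuples, q, k, score):
    """Evaluate the scoring function on every tuple (one full scan), keep the k best."""
    return heapq.nlargest(k, tuples, key=lambda t: score(q, t))

# Assumed toy relation; every tuple is touched regardless of k.
data = [(0.1, 0.2), (0.5, 0.5), (0.6, 0.4), (0.9, 0.9)]
best = top_k_scan(data, q=(0.5, 0.5), k=2, score=min_score)
print(best)  # the exact match first, then (0.6, 0.4)
```

The cost is proportional to the size of the relation even when k is tiny, which is the inefficiency the question on this slide points at.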

12

Mapping a Top-K Query to a Selection Query

Determine a search score S such that:
  Expected # of tuples with score > S is k
  No false dismissals

Turn the condition that score > S into range selection condition(s)

Evaluate selection query using existing query processor and access paths

13

Mapping a Top-K Query to a Selection Query

4 bedrooms; $350,000; k=10

Retrieve all tuples with score > 0.5 (at least k=10 tuples expected)

Analyze scoring function to determine selection range:
  Bedrooms: [3, 5] and Price: [$250K, $450K]

14

Mapping a Search Score to a Selection Range

For search score $S$, target point $q = (q_1, q_2)$, and scoring function Min:

Selection range:
  $t_1$ IN $[q_1 - (1.0 - S),\ q_1 + (1.0 - S)]$
  $t_2$ IN $[q_2 - (1.0 - S),\ q_2 + (1.0 - S)]$
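The range computation for Min is mechanical enough to sketch directly. The snippet below is a minimal illustration assuming normalized attributes; `min_selection_range` and `in_range` are hypothetical helper names, and no clipping to the attribute domain is attempted.

```python
def min_selection_range(q, s):
    """Per-attribute ranges [q_i - (1 - S), q_i + (1 - S)] for the Min scoring function."""
    half_width = 1.0 - s
    return [(qi - half_width, qi + half_width) for qi in q]

def in_range(t, ranges):
    """True if every attribute of t falls inside its selection range."""
    return all(lo <= ti <= hi for ti, (lo, hi) in zip(t, ranges))

# Assumed values: target (0.5, 0.5) and search score 0.75.
ranges = min_selection_range(q=(0.5, 0.5), s=0.75)
print(ranges)                        # [(0.25, 0.75), (0.25, 0.75)]
print(in_range((0.3, 0.7), ranges))  # True
print(in_range((0.1, 0.5), ranges))  # False: first attribute is too far from the target
```

In a relational engine the same ranges would become ordinary range predicates on each attribute, which is what lets the existing access paths do the work.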

15

Determining a Search Score

Monotonicity: Consider tuple t that is no further from target than t’ on any attribute:
  Score of t should be at least that of t’

Therefore, Score cannot be high “far away” from target:
  Sphere for Euclidean
  Box for Min
  …centered at target point

“Tightness” of enclosing range varies with scoring functions
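One way to make the tightness remark concrete is to compare the enclosing box for each scoring function. The sketch below is a derivation from the Min and Euclidean formulas given earlier, not something stated in the talk: Euclidean(q, t) > S implies each |q_i - t_i| < sqrt(n)(1 - S), so the enclosing box has that half-width, and the region with score > S fills only a ball-shaped part of it. Function names are hypothetical, and clipping to the [0,1] domain is ignored.

```python
import math

def min_half_width(s):
    """For Min, the region with score >= S is exactly a box of this half-width."""
    return 1.0 - s

def euclidean_half_width(s, n):
    """Half-width of the smallest axis-aligned box around the Euclidean score region."""
    return math.sqrt(n) * (1.0 - s)

def ball_to_box_volume_ratio(n):
    """Fraction of the enclosing box occupied by the inscribed n-dimensional ball."""
    ball = math.pi ** (n / 2) / math.gamma(n / 2 + 1)  # volume of the unit n-ball
    box = 2.0 ** n                                     # volume of the cube with side 2
    return ball / box

for n in (1, 2, 3, 4):
    print(n, euclidean_half_width(0.75, n), ball_to_box_volume_ratio(n))
```

Under these assumptions the Min box contains no wasted volume, while the Euclidean box is wider by a factor of sqrt(n) and a growing share of it lies outside the score region as n increases.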


16

The Min Scoring Function

17

The Euclidean Scoring Function

18

Comments on Mapping

Search score determines efficiency, not correctness

Issues in efficiency:
  Avoid retrieving too many tuples
  Avoid retrieving fewer than k top tuples (restarts)

How to determine good search scores?
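The efficiency-versus-correctness point can be illustrated with a small loop that restarts whenever the range returns fewer than k tuples. This is a sketch under several assumptions: the Min scoring function, attributes normalized to [0,1], an in-memory list instead of an index, and a made-up restart policy (lower the search score by a fixed `step`). None of these details come from the talk.

```python
def min_score(q, t):
    """Min scoring function, assuming attributes normalized to [0,1]."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def top_k_with_restarts(tuples, q, k, s, step=0.25):
    """Range-select with search score s; if fewer than k tuples qualify, lower s and retry."""
    restarts = 0
    while True:
        half_width = 1.0 - s
        # For Min, a tuple lies inside this box exactly when its score is >= s.
        candidates = [t for t in tuples
                      if all(abs(qi - ti) <= half_width for qi, ti in zip(q, t))]
        if len(candidates) >= k or s <= 0.0:
            candidates.sort(key=lambda t: min_score(q, t), reverse=True)
            return candidates[:k], restarts
        s = max(0.0, s - step)  # hypothetical restart policy
        restarts += 1

# Assumed toy data; the initial search score 0.75 is too optimistic for k=3.
data = [(0.5, 0.5), (0.625, 0.5), (0.125, 0.25), (1.0, 0.875)]
result, restarts = top_k_with_restarts(data, q=(0.5, 0.5), k=3, s=0.75)
print(result, restarts)
```

Whatever the initial search score, the returned tuples are the true top k; a poor choice only shows up as extra retrieved tuples or extra restarts.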

19

Determining Search Scores

Find k points in data
Compute their score
Set search score to lowest score

Challenges:
  Determining the initial k points to optimize execution
  Taking original query into account

20

Using Histograms

[Figure: histogram buckets with frequencies 4, 20, 11, and 10, and the query point Q]

21

Picking K Representative “Tuples”

Collapse histogram bucket to a single representative point:
  Furthest from Q in bucket (“NoRestarts”)
  Closest to Q in bucket (“Restarts”)

Assign bucket frequency to the single representative point

Include closest representative points until we have k tuples
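The three steps above can be sketched as follows. This is an illustration under assumptions, not the authors' code: buckets are axis-aligned boxes given as `(low corner, high corner, frequency)`, scoring uses Min over normalized attributes, and the bucket boundaries in the example are invented (only the frequencies 4, 20, 11, 10 echo the figure).

```python
def min_score(q, t):
    """Min scoring function, assuming attributes normalized to [0,1]."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def closest_point(q, lo, hi):
    """Point of the bucket nearest to q: clamp q into the box ("Restarts")."""
    return tuple(min(max(qi, l), h) for qi, l, h in zip(q, lo, hi))

def furthest_point(q, lo, hi):
    """Corner of the bucket furthest from q on every attribute ("NoRestarts")."""
    return tuple(l if abs(qi - l) >= abs(qi - h) else h
                 for qi, l, h in zip(q, lo, hi))

def search_score(buckets, q, k, strategy="NoRestarts"):
    """Collapse buckets to points, take the best-scoring ones until k tuples are covered."""
    pick = furthest_point if strategy == "NoRestarts" else closest_point
    reps = [(min_score(q, pick(q, lo, hi)), freq) for lo, hi, freq in buckets]
    reps.sort(reverse=True)  # best-scoring representative first
    covered = 0
    for score, freq in reps:
        covered += freq
        if covered >= k:
            return score     # lowest score among the included representatives
    return 0.0               # assumption: fewer than k tuples overall, so retrieve everything

# Hypothetical 1-dimensional histogram with four equal-width buckets.
buckets = [((0.0,), (0.25,), 4), ((0.25,), (0.5,), 20),
           ((0.5,), (0.75,), 11), ((0.75,), (1.0,), 10)]
print(search_score(buckets, q=(0.5,), k=35, strategy="NoRestarts"))  # 0.5
print(search_score(buckets, q=(0.5,), k=35, strategy="Restarts"))    # 0.75
```

In this example NoRestarts yields the lower, safer search score and Restarts the higher, more selective one, matching the pessimistic and optimistic roles the two strategies play.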

22

Using Histograms: “NoRestarts”

[Figure: histogram buckets with frequencies 4, 20, 11, and 10, and the query point Q]

23

Using Histograms: “Restarts”

[Figure: histogram buckets with frequencies 4, 20, 11, and 10, and the query point Q]

24

Other Strategies for Determining Search Scores

Calculate search score for:
  $n$ = NoRestarts (“pessimistic” extreme)
  $r$ = Restarts (“optimistic” extreme)

Use intermediate scores:
  $\mathit{Inter}_1 = (2n + r)/3$
  $\mathit{Inter}_2 = (n + 2r)/3$

[Figure: search-score axis from 0 to 1 with the Restarts and NoRestarts scores marked]
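As a quick arithmetic check, the two intermediate scores split the interval between the NoRestarts and Restarts scores into thirds. The values of n and r below are assumed for illustration only, and `intermediate_scores` is a hypothetical helper name.

```python
def intermediate_scores(n, r):
    """Inter1 and Inter2: points one third and two thirds of the way from n to r."""
    inter1 = (2 * n + r) / 3
    inter2 = (n + 2 * r) / 3
    return inter1, inter2

# Assumed example: NoRestarts score 0.5, Restarts score 0.75.
inter1, inter2 = intermediate_scores(n=0.5, r=0.75)
print(inter1, inter2)  # both lie between 0.5 and 0.75, Inter1 closer to NoRestarts
```

Inter1 leans toward the pessimistic extreme and Inter2 toward the optimistic one, trading fewer restarts against fewer retrieved tuples.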

25

Evaluating the Generated Selection Query

Sequential scan
Intersection of a set of indexes, followed by data access
  Special case: index-only access

26

Indexes and Statistics

Indexes
  n-dim (concatenated-key) B-trees

Statistics
  MaxDiff as base 1-dim histogram
  Multidimensional histograms: AVI, Phased, MHist

27

Experimental Evaluation

Is mapping to selection queries an effective technique?

Sensitivity of relevant parameters:
  Scoring functions
  Data skew and dimensionality
  Statistics

28

Data Generation

Characterized by $Z = \langle z_1, \ldots, z_n \rangle$

Generate $N$ tuples by Zipfian distribution $z_1$

Group tuples by $\mathit{attr}_1$

For a partition with $\mathit{attr}_1 = a$ with $N_1$ tuples:
  Generate $N_1$ values $w_1, \ldots, w_{N_1}$ using Zipfian distribution $z_2$
  Create pairs $(a, w_1), \ldots, (a, w_{N_1})$

Repeat steps to fill in all attribute values
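A rough sketch of this generation procedure is given below. Several details are assumptions because the slide does not specify them: values are drawn as random samples from a fixed integer domain of size `domain_size`, the Zipfian weight of rank i is 1/i^z, and "repeat" is read as grouping by all attributes generated so far. The names `zipf_values` and `generate` are hypothetical.

```python
import random
from collections import Counter

def zipf_values(count, domain_size, z, rng):
    """Draw `count` values from {0..domain_size-1} with weight 1/rank^z (z=0 is uniform)."""
    weights = [1.0 / (rank ** z) for rank in range(1, domain_size + 1)]
    return rng.choices(range(domain_size), weights=weights, k=count)

def generate(n_tuples, skews, domain_size=10, seed=0):
    """Build tuples attribute by attribute, one Zipfian skew per attribute."""
    rng = random.Random(seed)
    # First attribute: N values drawn with skew z_1.
    tuples = [(v,) for v in zipf_values(n_tuples, domain_size, skews[0], rng)]
    for z in skews[1:]:
        # Group by the attributes generated so far (assumed reading of "repeat steps").
        partitions = Counter(tuples)
        tuples = []
        for prefix, size in partitions.items():
            # For a partition with N_1 tuples, draw N_1 values for the next attribute.
            for w in zipf_values(size, domain_size, z, rng):
                tuples.append(prefix + (w,))
    return tuples

data = generate(n_tuples=1000, skews=[1.0, 0.5, 2.0], seed=42)
print(len(data), data[:3])
```

Because this sketch samples independently within each partition, it does not reproduce any correlation structure the original generator may have had; it only shows the partition-then-extend shape of the procedure.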

29

Metrics for Comparison

Fraction of data tuples accessed may be compared to:
  Ideal: k
  Worst case: size of data set

% of restarts

30

Exploring Limits

Intrinsic limitations of range-query approach:
  Enclose actual top-k tuples in tight n-rectangle
  Retrieve all tuples in n-rectangle

Less than 1% of database tuples in n-rectangle (k=10; 100,000 tuples)

Effect of retrieving tuples with score > S using an n-rectangle

31

Effect of Scoring Functions

Min has little/no gap between target region and enclosing n-rectangle

As k increases, fraction of retrieved tuples grows slowest for Min

Euclidean performs worse
  Less tight n-rectangle

32

Tuples with Score > S v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)

33

Effect of Mapping Strategies and Histograms

Multidimensional histograms aid computation of tight search scores

NoRestarts dominates at high data skew

34

Tuples Retrieved v. Data Skew (PHASED histogram of 5KB; n=3)

35

Restarts v. Data Skew (PHASED histogram of 5KB; n=3)

36

Related Work (1)

[Fagin ‘96; ‘98]
  Multimedia attributes with query “subsystem”
  Multiple index scans
  Independence assumption

[Chaudhuri & Gravano ‘96]
  Multimedia attributes with query “subsystem”
  Map top-k queries to “selection” queries
  Independence assumption
  Limited scoring functions

37

Related Work (2)

[Carey & Kossman ‘97; ‘98]
  Optimized sorting phase using k

Nearest-neighbor literature

[Donjerkovic & Ramakrishnan ‘99]
  Probabilistic optimization framework
  No multidimensional scoring functions
  Independence assumptions

38

Summary

Defined mapping of top-k queries to traditional selection queries
Exploit existing database statistics and query processors
Studied effect of scoring functions, data skew, statistics on mapping

Full experimental analysis forthcoming!

39

Tuples Retrieved v. Histogram Size (Euclidean; n=3; Z21)

40

Tuples Retrieved v. n (PHASED histogram of 5KB; Z21)

41

Restarts v. n (PHASED histogram of 5KB; Z21)

42

Tuples Retrieved v. k (PHASED histogram of 5KB; Z21; n=3)

43

Restarts v. k (PHASED histogram of 5KB; Z21; n=3)

44

Restarts v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)

45

Tuples Retrieved v. Histogram Size (Census Database; PHASED)

46

Tuples Retrieved v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)

47

The Sum Scoring Function

48

The Max Scoring Function
