1 Evaluating Top-K Selection Queries Surajit Chaudhuri Microsoft Research Luis Gravano Columbia...

Evaluating Top-K Selection Queries

Surajit Chaudhuri, Microsoft Research

Luis Gravano, Columbia University

Motivating Example

Find 4-bedroom houses priced at $350,000

Exact matches are often too restrictive

A ranking of the houses closest to the specification is more desirable

Motivating Example (cont.)

Find 4-bedroom houses priced at $350,000

House 1: 5 bedrooms; $400,000; Score=0.9

House 2: 4 bedrooms; $485,000; Score=0.8

House 3: 6 bedrooms; $785,000; Score=0.3

Top-K Queries over Precise Relational Data

Support approximate matches with minimal changes to the relational engine

Initial focus: Selection queries with “equality” conditions

Outline

Definition of top-k queries

Execution alternatives

Mapping of top-k queries to selection queries

Experiments

Top-K Selection Queries

Specify an n-dimensional target point

Define scoring function

Specify k

Answer: k objects with the best score for the target point (i.e., the “top k” objects)

Specifying Top-K Queries using SQL

Select *
From R
Order [k] By Scoring_Function

Scoring Functions Measure Degree of Match

Assume attributes defined over a metric space

Score on any one attribute is well defined

How to aggregate scores across attributes?

Scoring Functions

Normalize attribute scores to be in the [0,1] range

Combine scores using popular aggregate functions:
  Min
  Euclidean
  Sum, Max, …

Some Example Scoring Functions

Let q = (q_1, …, q_n) be the target point and t = (t_1, …, t_n) a tuple:

Min(q, t) = min{1 - |q_1 - t_1|, …, 1 - |q_n - t_n|}

Euclidean(q, t) = 1 - sqrt((q_1 - t_1)^2/n + … + (q_n - t_n)^2/n)
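The two definitions above can be sketched in a few lines of Python. This is only an illustration, assuming attribute values are already normalized so that each per-attribute distance lies in [0,1]; the function names `min_score` and `euclidean_score` are not from the slides.

```python
import math

def min_score(q, t):
    """Min score: the worst per-attribute match, 1 - |q_i - t_i|."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def euclidean_score(q, t):
    """Euclidean score: 1 - sqrt(sum((q_i - t_i)^2) / n)."""
    n = len(q)
    return 1.0 - math.sqrt(sum((qi - ti) ** 2 for qi, ti in zip(q, t)) / n)

q = (0.5, 0.5)   # target point
t = (0.5, 0.7)   # tuple that differs from the target on one attribute
print(min_score(q, t), euclidean_score(q, t))
```

For this pair, Min is driven entirely by the second attribute, while Euclidean averages the squared distances over both attributes and therefore scores the tuple slightly higher.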

Executing Top-K Queries

Known techniques require at least one sequential scan (or a functional index):
  Evaluate Scoring_Function for each tuple
  Sort tuples [Carey & Kossmann ‘97; ‘98]

Question: How to avoid sequential scans?
  Exploit implicit selectivity of top-k queries
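For reference, the scan-based baseline might look like the sketch below. It is an assumption-laden illustration, not the cited technique: `top_k_scan` is a hypothetical name, the Min score is used as the scoring function, and a bounded heap (`heapq.nlargest`) stands in for the sort step.

```python
import heapq

def min_score(q, t):
    """Min score over normalized attributes."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def top_k_scan(tuples, q, k, score=min_score):
    """Score every tuple in one full pass, then keep the k best."""
    return heapq.nlargest(k, tuples, key=lambda t: score(q, t))

data = [(0.1, 0.1), (0.5, 0.5), (0.6, 0.5), (0.9, 0.9)]
best = top_k_scan(data, q=(0.5, 0.5), k=2)
```

Every tuple is scored regardless of k, which is the cost the range-based mapping tries to avoid.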

Mapping a Top-K Query to a Selection Query

Determine a search score S such that:
  Expected # of tuples with score > S is k
  No false dismissals

Turn the condition that score > S into range selection condition(s)

Evaluate selection query using existing query processor and access paths
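As a rough illustration of handing the work to an existing query processor, the sketch below runs the range selection in SQLite. Everything concrete here is an assumption: a two-attribute table `R(a1, a2)` with normalized values, the Min scoring function (whose range is a box of half-width 1 - S around the target), and the helper name `top_k_via_range`.

```python
import sqlite3

def top_k_via_range(conn, q, S, k):
    """Top-k under the Min score, evaluated as a plain range selection."""
    r = 1.0 - S  # half-width of the selection range for the Min score
    # BETWEEN is inclusive, so tuples with score == S are also retrieved;
    # that is a superset of "score > S" and causes no false dismissals.
    rows = conn.execute(
        "SELECT a1, a2 FROM R "
        "WHERE a1 BETWEEN ? AND ? AND a2 BETWEEN ? AND ?",
        (q[0] - r, q[0] + r, q[1] - r, q[1] + r),
    ).fetchall()

    def score(t):
        return min(1.0 - abs(q[0] - t[0]), 1.0 - abs(q[1] - t[1]))

    rows.sort(key=score, reverse=True)  # rank only the retrieved tuples
    return rows[:k], len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (a1 REAL, a2 REAL)")
conn.execute("CREATE INDEX R_a1_a2 ON R (a1, a2)")  # concatenated-key index
conn.executemany(
    "INSERT INTO R VALUES (?, ?)",
    [(0.1, 0.1), (0.5, 0.5), (0.6, 0.5), (0.9, 0.9), (0.5, 0.75)],
)
top, retrieved = top_k_via_range(conn, q=(0.5, 0.5), S=0.7, k=2)
```

The second return value counts the tuples the range query retrieved; if it is smaller than k, the search score was too high and the query has to be reissued with a lower one.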

Mapping a Top-K Query to a Selection Query

4 bedrooms; $350,000; k=10

Retrieve all tuples with score > 0.5 (at least k=10 tuples expected)

Analyze scoring function to determine selection range:
  Bedrooms: [3, 5] and Price: [$250K, $450K]

Mapping a Search Score to a Selection Range

For search score S, target point q = (q_1, q_2), and scoring function Min:

Selection range:
  t_1 IN [q_1 - (1.0 - S), q_1 + (1.0 - S)]
  t_2 IN [q_2 - (1.0 - S), q_2 + (1.0 - S)]
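The selection range for Min can be computed mechanically. The sketch below does so and then replays the housing example; the normalization constants in `UNITS` are hypothetical, chosen only because they reproduce the Bedrooms and Price ranges quoted earlier for a search score of 0.5.

```python
def min_selection_range(q, S):
    """Per-attribute ranges [q_i - (1 - S), q_i + (1 - S)] for the Min score."""
    r = 1.0 - S
    return [(qi - r, qi + r) for qi in q]

# Hypothetical normalization: one unit of score distance corresponds to
# 2 bedrooms or $200,000. The slides do not state the actual scaling.
UNITS = (2.0, 200_000.0)
target = (4.0, 350_000.0)          # 4 bedrooms, $350,000

q = [v / u for v, u in zip(target, UNITS)]          # normalized target
box = min_selection_range(q, S=0.5)                 # normalized ranges
ranges = [(lo * u, hi * u) for (lo, hi), u in zip(box, UNITS)]
print(ranges)
```

Under that assumed scaling, a search score of 0.5 allows each attribute to deviate by half a normalized unit, that is, by one bedroom or $100,000.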

Determining a Search Score

Monotonicity: Consider tuple t that is no further from target than t’ on any attribute:
  Score of t should be at least that of t’
  Therefore, Score cannot be high “far away” from target
    Sphere for Euclidean
    Box for Min
    …centered at target point

“Tightness” of enclosing range varies with scoring functions
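The slides do not spell out the enclosing range for Euclidean. A short derivation, assuming the Euclidean scoring function as defined earlier and a search score S of at most 1, suggests how a box around the sphere can be obtained:

```latex
\begin{align*}
\mathrm{Euclidean}(q,t) > S
  &\iff \sqrt{\tfrac{1}{n}\sum_{j=1}^{n}(q_j - t_j)^2} < 1 - S \\
  &\iff \sum_{j=1}^{n}(q_j - t_j)^2 < n\,(1 - S)^2 \\
  &\implies |q_i - t_i| < \sqrt{n}\,(1 - S) \quad \text{for every } i .
\end{align*}
```

So the range t_i IN [q_i - sqrt(n)(1.0 - S), q_i + sqrt(n)(1.0 - S)] encloses every tuple with score > S, but the corners of this box also admit tuples with lower scores. For Min, by contrast, the box and the target region coincide, which is one way to read the remark about tightness.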

The Min Scoring Function

The Euclidean Scoring Function

Comments on Mapping

Search score determines efficiency, not correctness

Issues in efficiency:
  Avoid retrieving too many tuples
  Avoid retrieving fewer than k top tuples (restarts)

How to determine good search scores?
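One way to picture why the search score affects only efficiency is a restart loop like the sketch below. It is an illustration under assumptions: attributes are normalized to [0,1], Min is the scoring function, `range_select` stands in for the engine's range query, and lowering S by a fixed `step` is a hypothetical restart policy that the slides do not prescribe.

```python
def min_score(q, t):
    """Min score over normalized attributes."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def range_select(tuples, q, S):
    """Stand-in for the engine's range query: tuples inside the Min box for S."""
    r = 1.0 - S
    return [t for t in tuples if all(abs(qi - ti) <= r for qi, ti in zip(q, t))]

def top_k_with_restarts(tuples, q, k, S, step=0.1):
    """Return the top-k tuples and the number of restarts that were needed."""
    restarts = 0
    while True:
        candidates = range_select(tuples, q, S)
        if len(candidates) >= k or S <= 0.0:
            break                       # enough tuples, or whole domain covered
        S = max(0.0, S - step)          # too few tuples: widen the range, retry
        restarts += 1
    candidates.sort(key=lambda t: min_score(q, t), reverse=True)
    return candidates[:k], restarts

data = [(0.1, 0.1), (0.5, 0.5), (0.6, 0.5), (0.9, 0.9)]
top, restarts = top_k_with_restarts(data, q=(0.5, 0.5), k=3, S=0.95)
```

A search score that is too high costs extra round trips, and one that is too low retrieves more tuples than needed, but in both cases the final answer is the same top-k set.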

Determining Search Scores

Find k points in data

Compute their score

Set search score to lowest score

Challenges:
  Determining the initial k points to optimize execution
  Taking original query into account

Using Histograms

Picking K Representative “Tuples”

Collapse histogram bucket to a single representative point:
  Furthest from Q in bucket (“NoRestarts”)
  Closest to Q in bucket (“Restarts”)

Assign bucket frequency to the single representative point

Include closest representative points until we have k tuples
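The bucket-collapsing procedure can be sketched as follows. The bucket layout `(lo, hi, freq)` with axis-aligned corners, the Min scoring function, and the names `representative` and `search_score` are all assumptions made for illustration; the actual histogram structures may differ.

```python
def min_score(q, t):
    """Min score over normalized attributes."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def representative(bucket, q, strategy):
    """Collapse a bucket (lo, hi, freq) to one point, as seen from target q."""
    lo, hi, _freq = bucket
    if strategy == "NoRestarts":
        # Furthest point: per attribute, the bucket endpoint farther from q_i.
        return tuple(l if abs(qi - l) >= abs(qi - h) else h
                     for qi, l, h in zip(q, lo, hi))
    if strategy == "Restarts":
        # Closest point: q clamped into the bucket.
        return tuple(min(max(qi, l), h) for qi, l, h in zip(q, lo, hi))
    raise ValueError(f"unknown strategy: {strategy}")

def search_score(buckets, q, k, strategy, score=min_score):
    """Take representatives in decreasing score order until k tuples are covered."""
    reps = sorted(((score(q, representative(b, q, strategy)), b[2])
                   for b in buckets), reverse=True)
    covered = 0
    for s, freq in reps:
        covered += freq
        if covered >= k:
            return s        # lowest score among the included representatives
    return 0.0              # fewer than k tuples overall: cover the whole domain

# Hypothetical 2-d histogram with four buckets: (lower corner, upper corner, freq)
buckets = [((0.0, 0.0), (0.5, 0.5), 6), ((0.5, 0.0), (1.0, 0.5), 3),
           ((0.0, 0.5), (0.5, 1.0), 3), ((0.5, 0.5), (1.0, 1.0), 8)]
n = search_score(buckets, q=(0.75, 0.75), k=10, strategy="NoRestarts")
r = search_score(buckets, q=(0.75, 0.75), k=10, strategy="Restarts")
```

With a monotone scoring function, every tuple in a bucket scores at least as high as the furthest point, so the NoRestarts score is safe whenever the bucket frequencies are accurate; the Restarts score is higher and may retrieve fewer than k tuples.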

Using Histograms:“NoRestarts”

Using Histograms:“Restarts”

Other Strategies for Determining Search Scores

Calculate search score for:
  n = NoRestarts (“pessimistic” extreme)
  r = Restarts (“optimistic” extreme)

Use intermediate scores:
  Inter_1 = (2n + r)/3
  Inter_2 = (n + 2r)/3

[Figure: search-score axis from 0 to 1 with the Restarts and NoRestarts scores marked]
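As a small worked example of the two intermediate scores, the sketch below evaluates the formulas for hypothetical values n = 0.25 and r = 0.75; the function name `intermediate_scores` is an assumption.

```python
def intermediate_scores(n, r):
    """Inter_1 = (2n + r)/3 and Inter_2 = (n + 2r)/3."""
    return (2 * n + r) / 3, (n + 2 * r) / 3

n, r = 0.25, 0.75                       # hypothetical NoRestarts / Restarts scores
inter1, inter2 = intermediate_scores(n, r)
print(inter1, inter2)
```

Inter_1 sits one third of the way from n to r and Inter_2 two thirds of the way, so the four strategies trade retrieved tuples against the risk of restarts in even steps.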

Evaluating the Generated Selection Query

Sequential scan

Intersection of a set of indexes, followed by data access
  Special case: index-only access

Indexes and Statistics

Indexes:
  n-dim (concatenated-key) B-trees

Statistics:
  MaxDiff as base 1-dim histogram
  Multidimensional histograms: AVI, Phased, MHist

Experimental Evaluation

Is mapping to selection queries an effective technique?

Sensitivity of relevant parameters:
  Scoring functions
  Data skew and dimensionality
  Statistics

Data Generation

Characterized by Z = <z_1, …, z_n>

Generate N tuples by Zipfian distribution z_1

Group tuples by attr_1

For a partition with attr_1 = a with N_1 tuples:
  Generate N_1 values w_1, ..., w_N1 using Zipfian distribution z_2
  Create pairs (a, w_1), …, (a, w_N1)

Repeat steps to fill in all attribute values
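The generation steps can be sketched recursively. Several details below are assumptions not stated in the slides: a fixed domain of `num_values` evenly spaced values in [0,1], Zipfian weights proportional to 1/rank^z, values drawn by random sampling, and a fresh random assignment of ranks to values in each partition.

```python
import random

def zipf_weights(num_values, z):
    """Zipfian weights: the value of rank i gets weight 1 / i**z (z=0 is uniform)."""
    return [1.0 / (rank ** z) for rank in range(1, num_values + 1)]

def generate(N, Z, num_values=10, seed=0):
    """Generate N tuples with len(Z) attributes, one Zipfian parameter per attribute."""
    rng = random.Random(seed)
    domain = [i / (num_values - 1) for i in range(num_values)]

    def fill(prefix, count, zs):
        if not zs:
            return [prefix] * count          # all attributes filled in
        values = domain[:]
        rng.shuffle(values)                  # assumption: ranks vary per partition
        draws = rng.choices(values, weights=zipf_weights(num_values, zs[0]), k=count)
        out = []
        for a in sorted(set(draws)):         # group by the attribute just drawn
            out.extend(fill(prefix + (a,), draws.count(a), zs[1:]))
        return out

    return fill((), N, list(Z))

data = generate(1000, Z=(1.0, 2.0))
```

Because each partition draws its next attribute independently, larger z values concentrate tuples on a few value combinations, which is the skew the experiments vary.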

Metrics for Comparison

Fraction of data tuples accessed may be compared to:
  Ideal: k
  Worst case: size of data set

% of restarts

Exploring Limits

Intrinsic limitations of range-query approach:
  Enclose actual top-k tuples in tight n-rectangle
  Retrieve all tuples in n-rectangle

Less than 1% of database tuples in n-rectangle (k=10; 100,000 tuples)

Effect of retrieving tuples with score > S using an n-rectangle

Effect of Scoring Functions

Min has little/no gap between target region and enclosing n-rectangle

As k increases, fraction of retrieved tuples grows slowest for Min

Euclidean performs worse:
  Less tight n-rectangle

Tuples with Score > S v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)

Effect of Mapping Strategies and Histograms

Multidimensional histograms aid computation of tight search scores

NoRestarts dominates at high data skew

Tuples Retrieved v. Data Skew (PHASED histogram of 5KB; n=3)

Restarts v. Data Skew (PHASED histogram of 5KB; n=3)

Related Work (1)

[Fagin ‘96; ‘98]
  Multimedia attributes with query “subsystem”
  Multiple index scans
  Independence assumption

[Chaudhuri & Gravano ‘96]
  Multimedia attributes with query “subsystem”
  Map top-k queries to “selection” queries
  Independence assumption
  Limited scoring functions

Related Work (2)

[Carey & Kossmann ‘97; ‘98]
  Optimized sorting phase using k

Nearest-neighbor literature

[Donjerkovic & Ramakrishnan ‘99]
  Probabilistic optimization framework
  No multidimensional scoring functions
  Independence assumptions

Summary

Defined mapping of top-k queries to traditional selection queries

Exploit existing database statistics and query processors

Studied effect of scoring functions, data skew, statistics on mapping

Full experimental analysis forthcoming!

Tuples Retrieved v. Histogram Size (Euclidean; n=3; Z21)

Tuples Retrieved v. n (PHASED histogram of 5KB; Z21)

Restarts v. n (PHASED histogram of 5KB; Z21)

Tuples Retrieved v. k (PHASED histogram of 5KB; Z21; n=3)

Restarts v. k (PHASED histogram of 5KB; Z21; n=3)

Restarts v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)

Tuples Retrieved v. Histogram Size (Census Database; PHASED)

Tuples Retrieved v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)

The Sum Scoring Function

The Max Scoring Function
