Reverse Top- k Queries

Reverse Top-k Queries

Akrivi Vlachou*, Christos Doulkeridis*, Yannis Kotidis#, Kjetil Nørvåg*

*Norwegian University of Science and Technology (NTNU), Trondheim, Norway#Athens University of Economics and Business (AUEB), Greece

2

Outline

Motivation & PreliminariesMonochromatic Reverse Top-k QueriesBichromatic Reverse Top-k Queries

Threshold-based AlgorithmMaterialized Views

Experimental EvaluationConclusions & Future Work

3

Rank-aware Query Processing

Huge amount of available data

Users prefer to retrieve a limited set of k ranked data objects that best match their preferences (top-k queries)

4

Top-k Query

Given a scoring function f(), retrieve the k object that best match the user preferences

Linear scoring function

f w(p) = Σw[i]*p[i]

Weight w[i]: relative importance of attribute i

Definition TOPk(w): Given a

weighting vector w and a positive integer k, find the k data points p with the minimum f(p) scores

Query line of w at point p: defines the score of pQuery space of w defined by point p: number of enclosed points determines the rank of p

5

From the perspective of manufacturers: it is important that a

product is returned in the highest ranked positions for as many user preferences as possible

estimate the impact of a product compared to their competitors products

advertise a product to potential customers

Reversing the Top-k Query

sales representative

customer customer customer customer

Which customers would be interested?

6

Reverse top-k query: Given a potential product q

and a positive integer k, which are the weighting vectors w for which q is in the top-k query result set?

Two different versions Monochromatic: no knowledge of user

preferences Bichromatic: a dataset with user

preferences is given

Reversing the Top-k Query

sales representative

customer customer customer customer

7

Car Database Example

A database containing information about different cars Different users have different preferences Bob prefers a cheap car, and does not care much about the age

the best choice (top-1) for Bob is the car p1 with score 2.5 Tom prefers a newer car rather than a cheap car

the best choice for Tom and Max is the car p2

8

Car Database Example

Query point q=p2, k=1: Bichromatic reverse top-k: {(0.2,0.8), (0.5,0.5)}

advertise product to Tom and Max Monochromatic reverse top-k: line segment w[price]=[1/7,5/6]

estimate the impact of p2 as 69%

Query point q=p3, k=1: empty result set for the bichromatic query

9

Outline




10

Monochromatic Reverse Top-k Query

mRTOPk(q): Given a point q, a positive number k and a dataset S, the result set of the monochromatic reverse top-k query is the locus for which there exists p in TOPk(wi) such that fwi(q) ≤ fwi(p).

The solution space W can be split into a finite set of non-adjacent partitions such that query point q has the same rank for all the weighting vectors.

For the monochromatic case: we focus on the 2-d space

Solution space

2

2

mRTOP1(q)

1

11

Geometric Interpretation d=2, k =1

If q belongs to the convex hull, then there exists exactly one partition in mRTOP1(q)

Weighting vectors that are perpendicular to pq and qr define the line segment

For weighting vectors with smaller and larger slopes than w1, the relative order of p and q changes

Monochromatic reverse top-k, k>1: The solution space may contain

more than 1 partition

14

Outline




15

Bichromatic Reverse Top-k Query

bRTOPk(q): Given a point q, a positive number k and two datasets S and W, where S represents data points and W is a dataset containing different weighting vectors, a weighting vector wi belongs to the result set, if and only if there exists p in TOPk(wi) such that fwi(q) ≤ fwi(p)

Naïve approach: for each weighting vector process the top-k query test if query point q is in the top-k list

16

Threshold-based Algorithm (RTA)

Goal: reduce the number of top-k evaluations by discarding

weighting vectorsThreshold-based Algorithm (RTA):

sort the weighting vectors based on pairwise similarity top-k queries defined by similar vectors, have similar result

setsevaluate the first top-k query, calculate a thresholdFor each weighting vector

possibly prune based on threshold refine threshold

17

Example of RTA Algorithm (k=2)

Evaluate top-2 query for w1

Set threshold based on w2

fw2(q) > threshold discard w2

Refine threshold for w3

W=[ w1, w2, w3 ]

Buffer: p1, p2

w1 q

p4p1

p2

p3

p5p6

p7

p8p9

p10

w2

w3

18

Materialized Views

Threshold-based Algorithm (RTA) reduce the top-k evaluations by discarding some

weighting vectors that are not in the reverse top-k result set

process at least as many top-k evaluations as the cardinality of the result set

Materialized Views find weighting vectors that belong definitely to

the result without top-k evaluation

19

Materialized Views

Grid-based space partitioningcell Ci

lower left corner CiL

upper right corner CiU

We store for each cell Ci the results of reverse top-k queries for corners Ci

L and CiU

w1, w2, w3

20

Materialized Views

Given a point q enclosed in cell Ciall weighting vectors

in RTOPk(CiU) belong

to the result set of qonly weighting

vectors in RTOPk(Ci

L) - RTOPk(CiU)

have to be examined Materialized views can

be generalized for arbitrary k<K values

w1, w2, w3

w1, w2, w3 , w4

21

Outline




22

Experimental Setup

Comparison between Naïve and RTA (varying dimensionality, cardinality, data distribution – real data)

Queries: uniform and k-skyband pointsMetrics:

time I/Osnumber of top-k evaluations

23

RTA vs. Naïve

RTA outperforms naive by 1 to 2 orders of magnitude as dimensionality increases, |RTOPk(q)| decreases leading to

fewer top-k evaluations

uniform distribution of S and uniform weights W|S|=10K, |W|=10K, top-k=10, skyband query points

24

Scalability of RTA Algorithm

naive requires |W| top-k query evaluations |W|=5K, correlated dataset:

RTA needs on 544 out of 5000 top-k evaluations (saves 89.12% of the cost)

the average size of the result set is 459

various distributions (UN, AC, CO) of S and uniform weights W|S|=10K or |W|=10K, d=5, top-k=10, skyband query points

25

Performance of RTA on Real Data

uniform and clustered weights W (|W|=10K) clustered weights lead to fewer top-k evaluations

NBA consists of 17265 tuples, d=5 (number of points scored, rebounds, assists, steals and blocks)

HOUSE consists of 127930 tuples, d=6 (income spent on gas, electricity, water, heating, insurance, and property tax)

26

Outline

Motivation & PreliminariesExample of Reverse Top-k QueriesMonochromatic Reverse Top-k QueriesBichromatic Reverse Top-k Queries



27

We introduced reverse top-k queries geometric interpretation of the solution spaceefficient algorithm for bichromatic reverse top-k

querymaterialized reverse top-k views

Future Work interpretation of solution space for higher

dimensions (monochromatic reverse top-k) improve the performance of the bichromatic reverse

top-k computation

Conclusions and Future Work

28

Thank you!

More information: http://www.idi.ntnu.no/~vlachou/

Related work:

Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis, Kjetil Nørvåg: "Reverse Top-k Queries"

Akrivi Vlachou, Christos Doulkeridis, Kjetil Nørvåg, Yannis Kotidis: "Identifying the Most Influential Data Objects with Reverse Top-k Queries"

Documents

Reverse Top- k Queries