16
1 Monochromatic and Bichromatic Reverse Top-k Queries Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis, and Kjetil Nørv˚ ag Abstract—Nowadays, most applications return to the user a limited set of ranked results based on the individual user’s preferences, which are commonly expressed through top-k queries. From the per- spective of a manufacturer, it is imperative that her products appear in the highest ranked positions for many different user preferences, otherwise the product is not visible to potential customers. In this paper, we define a novel query type, namely the reverse top-k query, that covers this requirement: ”Given a potential product, which are the user preferences that make this product belong to the top-k query result set?”. Reverse top-k queries are essential for manufacturers to assess the impact of their products in the market based on the competition. We formally define reverse top-k queries and introduce two versions of the query, monochromatic and bichromatic. First, we provide a geometric interpretation of the monochromatic reverse top- k query to acquire an intuition of the solution space. Then, we study in detail the case of bichromatic reverse top-k query, and we propose two techniques for query processing, namely an efficient threshold- based algorithm and an algorithm based on materialized reverse top- k views. Our experimental evaluation demonstrates the efficiency of our techniques. Index Terms—reverse top-k query, top-k query, user preferences 1 I NTRODUCTION Recently, the support for rank-aware query processing has attracted much attention in the database research community. Top-k queries retrieve only a ranked set of k objects that best match the user preferences, thus avoiding overwhelming result sets. Since most appli- cations return to the user only a limited set of ranked results based on the individual user’s preferences, it is imperative for a manufacturer that her products appear in the highest ranked positions for many different user preferences, otherwise the product is not visible to potential customers. In this paper, we assume that users express their preferences through linear top-k queries, which are defined by assigning a A. Vlachou, C. Doulkeridis and K. Nørv˚ ag are with the Department of Computer Science, Norwegian University of Science and Technology, Norway. E-mails: {vlachou, cdoulk, noervaag}@idi.ntnu.no Y. Kotidis is with the Department of Informatics, Athens University of Economics and Business, Greece. E-mail: [email protected] weight to each of the scoring attributes, indicating the importance of each attribute to the user. This model is in agreement with the notion of preference [1], [2] and is widely adopted in related work. In this paper, we define a novel query type, namely the reverse top-k query, which can be expressed as follows: ”Given a potential product, which are the user preferences for which this product is in the top-k query result set?”. We formally define re- verse top-k queries and study two versions of the query: monochromatic and bichromatic reverse top-k queries. In the former, there is no knowledge of user preferences and the aim is to estimate the impact of a potential product in the market. In the latter, a dataset with user preferences is given and a reverse top-k query returns those preferences that rank a potential product highly. To the best of our knowledge, this is the first work that addresses this problem. Contributions. First, we introduce and formally define a novel query type called reverse top-k query (Section 3) and present two versions, namely monochromatic and bichromatic. We analyze the ge- ometrical properties for the 2-dimensional case of the monochromatic reverse top-k query and provide an algorithmic solution (Section 4). Furthermore, we dis- cuss the geometric interpretation of the solution space for higher dimensionality. Then, we study in detail the case of bichromatic reverse top-k query. Such a query, if computed in a straightforward manner, requires evaluating a top-k query for each user preference in the database, which is prohibitively expensive even for moderate datasets. Instead, we propose an efficient and progressive threshold-based algorithm, called RTA, for processing bichromatic reverse top-k queries for arbitrary data dimensionality (Section 5). RTA eagerly discards candidate user preferences, without processing the respective top-k queries. In addition, we present an indexing structure based on space partitioning, which materializes reverse top-k views, in order to further improve reverse top-k query pro- cessing (Section 6). We discuss the construction, usage and maintenance of the index based on materialized reverse top-k views. We conduct a thorough exper- imental evaluation that demonstrates the efficiency

Reverse Top K

Embed Size (px)

DESCRIPTION

Reverse top k queries

Citation preview

  • 1Monochromatic and Bichromatic ReverseTop-k Queries

    Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis, and Kjetil Nrvag

    F

    AbstractNowadays, most applications return to the user a limitedset of ranked results based on the individual users preferences,which are commonly expressed through top-k queries. From the per-spective of a manufacturer, it is imperative that her products appearin the highest ranked positions for many different user preferences,otherwise the product is not visible to potential customers. In thispaper, we define a novel query type, namely the reverse top-k query,that covers this requirement: Given a potential product, which arethe user preferences that make this product belong to the top-k queryresult set?. Reverse top-k queries are essential for manufacturersto assess the impact of their products in the market based on thecompetition. We formally define reverse top-k queries and introducetwo versions of the query, monochromatic and bichromatic. First, weprovide a geometric interpretation of the monochromatic reverse top-k query to acquire an intuition of the solution space. Then, we studyin detail the case of bichromatic reverse top-k query, and we proposetwo techniques for query processing, namely an efficient threshold-based algorithm and an algorithm based on materialized reverse top-k views. Our experimental evaluation demonstrates the efficiency ofour techniques.

    Index Termsreverse top-k query, top-k query, user preferences

    1 INTRODUCTIONRecently, the support for rank-aware query processinghas attracted much attention in the database researchcommunity. Top-k queries retrieve only a ranked setof k objects that best match the user preferences, thusavoiding overwhelming result sets. Since most appli-cations return to the user only a limited set of rankedresults based on the individual users preferences, itis imperative for a manufacturer that her productsappear in the highest ranked positions for manydifferent user preferences, otherwise the product isnot visible to potential customers. In this paper, weassume that users express their preferences throughlinear top-k queries, which are defined by assigning a

    A. Vlachou, C. Doulkeridis and K. Nrvag are with the Department ofComputer Science, Norwegian University of Science and Technology,Norway.E-mails: {vlachou, cdoulk, noervaag}@idi.ntnu.no

    Y. Kotidis is with the Department of Informatics, Athens Universityof Economics and Business, Greece.E-mail: [email protected]

    weight to each of the scoring attributes, indicating theimportance of each attribute to the user. This modelis in agreement with the notion of preference [1], [2]and is widely adopted in related work.In this paper, we define a novel query type, namely

    the reverse top-k query, which can be expressed asfollows: Given a potential product, which are theuser preferences for which this product is in thetop-k query result set?. We formally define re-verse top-k queries and study two versions of thequery: monochromatic and bichromatic reverse top-kqueries. In the former, there is no knowledge of userpreferences and the aim is to estimate the impact of apotential product in the market. In the latter, a datasetwith user preferences is given and a reverse top-kquery returns those preferences that rank a potentialproduct highly. To the best of our knowledge, this isthe first work that addresses this problem.Contributions. First, we introduce and formallydefine a novel query type called reverse top-kquery (Section 3) and present two versions, namelymonochromatic and bichromatic. We analyze the ge-ometrical properties for the 2-dimensional case of themonochromatic reverse top-k query and provide analgorithmic solution (Section 4). Furthermore, we dis-cuss the geometric interpretation of the solution spacefor higher dimensionality. Then, we study in detail thecase of bichromatic reverse top-k query. Such a query,if computed in a straightforward manner, requiresevaluating a top-k query for each user preference inthe database, which is prohibitively expensive evenfor moderate datasets. Instead, we propose an efficientand progressive threshold-based algorithm, calledRTA, for processing bichromatic reverse top-k queriesfor arbitrary data dimensionality (Section 5). RTAeagerly discards candidate user preferences, withoutprocessing the respective top-k queries. In addition,we present an indexing structure based on spacepartitioning, which materializes reverse top-k views,in order to further improve reverse top-k query pro-cessing (Section 6). We discuss the construction, usageand maintenance of the index based on materializedreverse top-k views. We conduct a thorough exper-imental evaluation that demonstrates the efficiency

  • 2of our algorithms (Section 7). It is noteworthy thatour algorithms consistently outperform a brute forcealgorithm by 1 to 3 orders of magnitude in terms ofrequired number of top-k evaluations. Finally, Sec-tion 8 reviews the related work and we conclude inSection 9.This paper extends our preliminary work [3] and

    provides a more thorough study of the problem. Fur-thermore, we present new experimental results thatlead to interesting findings and conclusions.

    2 PRELIMINARIESGiven a data space D defined by a set of d dimensions{d1, ..., dd} and a dataset S on D with cardinality |S|,an object p S can be represented as a d-dimensionalpoint p = {p[1], . . . , p[d]} where p[i] is a value ondimension di. We assume that each dimension rep-resents a numerical scoring attribute, therefore thevalues p[i] in any dimension di are numerical non-negative values. Furthermore, without loss of gen-erality, we assume that smaller scoring values arepreferable.Top-k queries are defined based on a scoring func-

    tion f that aggregates the individual scores into anoverall scoring value, that in turn enables the ranking(ordering) of the data points. The most importantand commonly used case of scoring functions is theweighted sum function, also called linear. Each di-mension di has an associated query-dependent weightw[i] indicating dis relative importance for the query.The aggregated score fw(p) for data point p is de-fined as a weighted sum of the individual scores:

    fw(p) =d

    i=1 w[i]p[i], where w[i] 0 (1 i d), jsuch that w[j] > 0. The weights represent the relativeimportance between different dimensions, and with-

    out loss of generality we assume thatd

    i=1 w[i] = 1.The weights indicate the user preferences, becausethey alter the ranking of the data points and thereforethe top-k result set. A linear top-k query can berepresented by a vector w and the result of a top-kquery is a ranked list of the k points with the bestscoring values fw.Definition 1: (Top-k query) Given a positive integer

    k and a user-defined weighting vector w, the resultset TOPk(w) of the top-k query is a set of points suchthat TOPk(w) S, |TOPk(w)| = k and p1, p2 : p1 TOPk(w), p2 S TOPk(w) it holds that fw(p1) fw(p2).In a d-dimensional Euclidean space, there exists

    a one-to-one correspondence between a weightingvector w and a hyperplane which is perpendicularto w. We call the (d-1)-dimensional hyperplane, whichis perpendicular to vector w and contains a pointp as the query plane of w crossing p, and denote itas w(p). All points lying on the query plane w(p),have the same scoring value equal to the score fw(p)of point p. Fig. 1 (left) depicts an example, where

    1

    1

    2 3 4 5 6 7 8 9 10

    23456789

    10

    price

    age

    p1p9

    p2

    p4p7

    p8

    p10

    p3

    p6

    p5

    user preferencesuser w[price] w[age]BobTom

    0.9

    Max

    0.10.80.2

    0.5 0.5

    top-1(score)p1 (2.5)p2 (2.2)p2 (2.5)

    w[price]

    w[age]

    5/61/7

    1

    1

    Bob

    TomMax

    Hw(p2)

    fw(p2)

    w = [0.5,0.5]

    Fig. 1. Example of reverse top-k query.

    the query plane (equivalent to a query line in 2d) isperpendicular to the weighting vector w = [0.5, 0.5].All points pi lying on the query line have a scorevalue fw(pi) = fw(p2) = 2.5. Furthermore, the rankof a point p based on a weighting vector w is equalto the number of the points enclosed in the half-space defined by w(p) that contains the origin of thedata space. Hence, p2 is the top-1 result for the query0.5 x + 0.5 y. In the rest of the paper, we referto this half-space as query space of w defined by p anddenote it as Hw(p).

    3 REVERSING TOP-k QUERIESIn this section, we introduce the concept of the reversetop-k query through an example and point out thedifferences to existing query types.

    3.1 Example of Reverse Top-k QueryGiven a database of products, a reverse top-k queryreturns those users (represented by weighting vec-tors) that rank a potential product highly. Considerfor example a database containing information aboutdifferent cars, depicted in Fig. 1 (left). For each car,the price and the age are recorded and minimumvalues on each dimension are preferable. Differentusers have different preferences about a potential carand Fig. 1 (right) also depicts a set of user preferences.For example, Bob prefers a cheap car, and does notcare much about the age of the car. Therefore, thebest choice (top-1) for Bob is the car p1 which hasthe minimum score (namely 2.5) for the particularweights. On the other hand, Tom prefers a newer carrather than a cheap car. Nevertheless, for both Tomand Max the best choice would be car p2.A reverse top-k query (RTOPk) is defined by a user-

    specified product p and returns the weighting vectorsw for which p is in the top-k result set. There existtwo different versions of the reverse top-k query: themonochromatic, which does not require any knowl-edge of user preferences, and the bichromatic, whichassumes that a dataset of preferences is given. Inour example, the bichromatic reverse top-1 result setof p1 contains the weighting vector (0.9, 0.1) defined

  • 31

    1

    2 3 4 5 6 7 8 9 10

    23456789

    10

    price

    age

    p1p9

    p2

    p4 p7

    p8

    p10

    p3

    p6

    p5

    q

    (a) RNN vs RTOPk.

    1

    1

    2 3 4 5 6 7 8 9 10

    23456789

    10

    price

    age

    p1p9

    p2

    p4 p7

    p8

    p10

    p3

    p6

    p5

    q

    (b) RSKY vs RTOPk.

    Fig. 2. Differences to existing query types.

    by Bob. Notice that for p2, two weighting vectorsbelong to the bichromatic reverse top-1 result set{(0.5, 0.5), (0.2, 0.8)}, namely the preferences of Tomand Max. In fact, all weighting vectors with w[price]in the range of [17 ,

    56 ] belong to the bichromatic reverse

    top-1 result set of p2. This segment of line w[price] +w[age] = 1 is the result set of the monochromaticreverse top-k query for the query point p=p2.Conceptually, the solution space of reverse top-k

    queries is the space defined by the weights w[price]and w[age]. Monochromatic reverse top-k queries re-turn partitions of the solution space and are usefulfor estimating the impact of a product when no userpreferences are given, but the distribution of them isknown. In our example, under assumption of uniformdistribution of user preferences, the impact of p2 inthe market can be estimated as ( 56

    17 )100% = 69%.

    On the other hand, bichromatic reverse top-k querieshave even wider applicability, as they identify usersthat are interested in a particular product, given aknown set of user preferences. For instance, the beststrategy for a profile-based marketing service wouldbe to advertise car p1 to Bob and car p2 to Tom andMax. Notice that an empty result set for a product(i.e., car p3) indicates that it is not interesting for anycustomer based on her preferences. Summarizing, forthe bichromatic version of the reverse top-k query,the result set contains a finite number of weightingvectors, while the monochromatic version identifiesthe partitions of the solution space that satisfy thequery.

    3.2 Differences to Existing Query TypesThe reverse nearest neighbor query (RNN) [4] and thereverse skyline query (RSKY) [5] have been proposedfor supporting decision making. In the following, wepoint out the differences between these query typesand RTOPk queries.Reverse top-k queries differ from reverse nearest

    neighbor queries [4]. Given a query point q themonochromatic RNN query retrieves all data pointswhich have q as nearest neighbor. In the case of thecar database, this is equivalent to finding all carsthat are closer to the query point than to any other

    point of the dataset. For example, in Fig. 2(a), aRNN query returns p2 (assuming Euclidean distance),while the result set of RTOPk query for k = 1 isdefined by the line segment [ 17 ,

    56 ] in the space of

    the weighting vectors. The bichromatic versions ofthese query types are also different. In our runningexample, the bichromatic RTOPk query returns allweighting vectors w W such that w[price] [ 17 ,

    56 ].

    On the other hand, the bichromatic RNN is definedbased on two datasets A, B, and returns all pointsthat belong to A, that are closer to q than any pointof B. Thus, while a monochromatic/bichromatic RNNquery retrieves a set of points that are closer to thequery point than any other data point, the RTOPkquery retrieves a set of weighting vectors or partitionsof the solution space defined by the weights w[i].An alternative definition of RTOPk query is, given aquery point q, find the distance functions (in termsof weighting vectors) for which q belongs to the k-nearest neighbors of the point positioned at the originof the data space.

    The reverse skyline query [5], [6] has been proposedto explore the dominance relationships between prod-ucts relatively to the user preferences. The user prefer-ences are described by a data point that represents theideal (non-existing) product for the user. In contrast,the RTOPk query assumes that the user preferencesare expressed as weights that define the relative im-portance of each dimension. Given a query point, theRSKY query retrieves all data points for which thequery point belongs to their dynamic skyline resultset1. Thus, the RSKY query retrieves the data pointsthat are at least in one dimension more similar (interms of absolute difference of attribute values) to qthan all other data points. For example, in Fig. 2(b),the dynamic skyline of p2 is depicted (gray squarepoints). Since the dynamic skyline query is defined byabsolute differences, q belongs to the dynamic skylineof p2, if and only if only p2 and no other data pointis enclosed in the depicted rectangle. As q belongsto the dynamic skyline, p2 is in the result set of thereverse skyline of q. In the case of bichromatic reverseskyline, two datasets A, B are given, each of themcontaining data points sharing the same attributes (inour example price and age). Then, given a query pointq, the goal is to find all points belonging to A thatare more similar in at least one dimension to q thanany point of B. Both monochromatic and bichromaticRSKY queries differ from RTOPk queries, since in theformer the result set is a set of data points to which qis more similar than all other points in at least onedimension, while in the latter user preferences arereturned that define linear scoring functions for whichq is highly ranked.

    1. A point pi dynamically dominates pi based on point q, if dj D: |pi[j] q[j]| |pi[j] q[j]|, and on at least one dimensiondj D: |pi[j] q[j]| < |pi[j] q[j]|.

  • 44 MONOCHROMATIC RTOPk QUERIESWe commence by providing a formal definition ofmonochromatic reverse top-k for a query point q andan integer k.Definition 2: (Monochromatic Reverse top-k) Given a

    point q and a positive number k, as well as a datasetS, the result set of the monochromatic reverse top-k query (mRTOPk(q)) of point q is the locus

    2, i.e.,a collection of d-dimensional vectors {wi}, for whichp TOPk(wi) such that fwi(q) fwi(p).Let us assume that W denotes the set of all valid

    assignments of w. Fig. 3 shows the data and solutionspace of a 2-dimensional monochromatic reverse top-

    k query. All valid weighting vectors (d

    i=1 w[i] = 1and w[i] [0, 1]) of the reverse top-k query form theline w[1] + w[2] = 1 in the 2-dimensional solutionspace that is defined by the axis w[1] and w[2]. Sincethe number of possible vectors w is infinite, it isnot possible to enumerate all possible assignments ofw W . On the other hand, the solution space W canbe split into a finite set of partitions Wi (

    Wi = W ,

    Wi = ), such that the query point q has the sameranking position for all weighting vectors w Wi.For the 2-dimensional case, each partition Wi is a linesegment of the line w[1] + w[2] = 1. Then, the resultset of the monochromatic reverse top-k is a set ofpartitions Wi of the solution space W :

    mRTOPk(q) = {Wi : wj Wi q TOPk(wj)}For the sake of brevity, in the rest of this paper wedenote a query point q TOPk(wi), instead of p TOPk(wi) such that fwi(q) fwi(p).In this paper, we assume that any Wi,Wj

    mRTOPk(q) are non-adjacent partitions, therefore apartition Wi is the maximum partition of the solutionspace in which the rank of q does not change. Themain topic of this section is to identify the parti-tions that form the result set of a monochromaticreverse top-k query. We first present an algorithmfor computing the mRTOPk(q) in the 2-dimensionalcase. Then, we discuss the case of datasets of higherdimensionality.

    4.1 Monochromatic RTOPk Query for 2dProperties of monochromatic RTOPk query. In thefollowing, we present some useful properties ofRTOPk queries and discuss how the boundaries of thepartitions Wi can be determined. We assume that thean ordering w1, . . . , w|W | of the weighting vectors of|W |, such that a weighting vector wi precedes anothervector wj , if wi[1] < wj [1]. Thus, the weighting vectorswi are ordered based on increasing angle of wi withthe y-axis.Lemma 1: Given two points p and q such that

    fw1(q) fw1(p), there exists at most one weighting

    2. In mathematics, locus is the set of points satisfying a particularcondition, often forming a curve of some sort.

    1

    1

    2 3 4 5 6 7 8 9 10

    23456789

    10

    x

    y

    wpq

    wqr

    p3p2

    r

    p7

    p4

    p1q

    p6

    p5p

    (a) Data space

    mRTOP1(q)

    w[1]

    w[2]

    wpq[2]

    wqr[2]

    wqr [1] w pq[1]

    1

    1

    (b) Solution space.

    Fig. 3. Monochromatic reverse top-k query.

    vector w such that fwi(q) < fwi(p) for wi < w, andfwi(q) > fwi(p) for wi > w.Based on the above lemma, the relative order of p

    and q changes for weighting vectors with smaller andlarger angles than w. If p had a lower rank than qfor vectors with smaller angle than w, then p has ahigher rank for vectors with larger angle than w. Ifthere exists such a weighting vector, then we denoteit as wpq and refer to it as the weighting vector for whichthe relative order of q and p changes.Lemma 2: Given two points p and q, if there exists

    a weighting vector wpq for which the relative order ofp and q changes, then it holds that fwpq (q) = fwpq (p).Equivalently, wpq is the weighting vector that is

    perpendicular to the line segment pq, with wpq[1] =pq

    pq1, where pq =

    q[2]p[2]q[1]p[1] is the slope of line

    segment pq. The above equation is derived by theproperty that wpq pq.The boundaries of any partition Wi are defined by

    weighting vectors wpq for which the relative order ofq and points p S changes (additionally, the first andlast partition are defined by the weighting vectors[0, 1] and [1, 0] respectively). Intuitively, as long asthe relative order between any two points does notchange, the top-k result is not affected and thus therank of q remains the same.Lemma 3: There exists at most one partition Wi,

    such that for all the weighting vectors w Wi it holdsthat q TOP1(w).Since the relative order between q and any data

    point p changes only once, if the rank of p becomeshigher than q, then it cannot change again for the nextvectors. Thus, q cannot be in the top-1 result set forany w > wpq. Therefore, the result set mRTOP1(q)contains at most one partition Wi of W .Example of mRTOPk(q) for k=1. Consider for exam-ple the dataset depicted in Fig. 3(a). Since the onlypoints that belong to the convex hull [7] are p, qand r, we conclude that (1) only these points belongto the top-1 result set for any weighting vector, and(2) there exists at least one weighting vector wi forwhich q TOP1(wi), and based on Lemma 3 exactlyone partition Wi mRTOP1(q). The boundaries ofthe partition Wi are defined by the weighting vectors

  • 51

    1

    2 3 4 5 6 7 8 9 10

    2345678910

    p2

    q

    x

    y

    p1

    lw1(q)

    lw2(q)

    lw3(q)

    (a) Multiple partitions.

    1

    1

    2 3 4 5 6 7 8 9 10

    2345678910

    p4

    q

    x

    y

    p2

    p1

    p5

    p6

    p3

    w4

    w3

    w2

    w1

    (b) mRTOPk(q) algorithm.

    Fig. 4. Examples of mRTOPk(q) queries.

    wpq , wqr for which the relative order between q andp or r changes. All weighting vectors w for whichthe following inequality holds are in the reverse top-1 result set of q: wqr[1] w[1] wpq[1]. The resultset of mRTOP1(q) is a segment (partition) of the linew[1] + w[2] = 1 in the 2-dimensional solution spacedefined by wpq and wqr, as shown in Fig. 3(b).

    Even though the result set mRTOPk for k = 1contains at most one partition, for a reverse top-kquery with k > 1, the result set may contain morethan one partitions Wi. Consider for example thethree data points in Fig. 4(a) and assume we areinterested to compute the mRTOPk(q) for k=2. Querypoint q is in the top-2 result set for both weightingvectors w1 and w3. However, when weighting vectorw2 is considered, with angle between w1 and w3, it isobvious that q no longer belongs to the top-2. Thus, inthis small example, the monochromatic reverse top-kquery would return two partitions Wi.

    Monochromatic RTOPk algorithm. Algorithm 1 de-scribes the monochromatic reverse top-k algorithm forthe 2-dimensional case. Data points that are domi-nated3 by q are always ranked after q for any weight-ing vector w, while points that dominate q are rankedbefore q for any weighting vector w. For example inFig. 4(b), p5 is worse (ranked lower) than q, whereasp6 is better (ranked higher) than q for any w. Pointsof the dataset that are neither dominated by nor dom-inate q are ranked higher than q for some weightingvectors and lower than q for other vectors. Thus, ouralgorithm examines4 only such incomparable points{pi} to q (line 5), because they alter the rank of q. Theboundaries of the partitions of mRTOPk are definedby a subset of the weighting vectors wi = wpiq,therefore we keep them in list W (line 7). Then, weidentify the partitions for which q belongs to the top-k, by processing W , as explained in the followingexample.

    In Fig. 4(b), after the sorting by increasing valueof wi[1] (line 10) the set W

    is {w1, w2, w3, w4} cor-

    3. A point p dominates q (p q) , if di D, p[i] q[i]; and onat least one dimension dj D, p[j] < q[j].4. This is similar to the approach in [2], which is used to compute

    a robust layered index.

    Algorithm 1 Monochromatic RTOPk Algorithm.

    1: Input: S, q2: Output: mRTOPk(q)3: W , R , RES 4: for (pi S) do5: if (q 6 pi and pi 6 q) then

    6: wi[1] piq

    piq1, wi[2] 1 wi[1]

    7: W W {wi}8: end if9: end for10: sort W based on increasing value of wi[1]11: w0 [0, 1], w|W |+1 [1, 0]12: R {p : p lies in Hw0(q)}13: kw |R| //number of points in R14: for (wi W

    ) do15: if (kw k) then16: RES RES {(wi, wi+1)}17: end if18: if (pi+1 R) then19: kw kw 120: else21: kw kw + 122: end if23: end for24: return RES

    responding to the lines p1q, p2q, p3q, p4q respectively.Then, vectors w0 and w5 are added toW

    . For the firstweighting vector w0 all data points that lie in Hw0(q)are retrieved (line 12). Recall that the rank kw of q withrespect to w0 is determined by the number of pointscontained in Hw0(q) (line 13). In our example, the setR is {p4, p6, p1} and therefore the rank of q is 4. Therank of q cannot change before w1. If we assume thatk=3, then for the first partition W0=[w0, w1] the rankof q is higher than k and the partitionW0 can be safelydiscarded. Then, the next partition is W1 = [w1, w2].Since p1 R (line 18), this means that the relativeorder of the points p1 and q changes in W1, andnow the rank of q is 3. Therefore, W1 is added tomRTOP3(q) (line 16). Similarly, we can compute therank of q for all Wi. In our example, W1 is the onlypartition that qualifies for the mRTOP3(q) result set.Notice that adjacent partitions can be easily detectedand merged into one partition.Given a query point q, let I S be the set of

    incomparable points to q. Then, Algorithm 1 producesat most |I| + 1 partitions. Since adjacent partitionsare merged, in worst case every second partitionwill be in the result set of mRTOPk(q). Thus, themaximum number of partitions in mRTOPk(q) is|I|+1

    2

    . If all data points are incomparable to q then

    |I| = |S|, which leads to an upper bound for thenumber of partitions. Notice that the expected numberof incomparable points is much smaller. For example,assuming uniform data distribution and given that0 p[i] 1,p S and 1 i 2, the expectednumber of incomparable points |I| is equal to theaggregate area of the upper left quadrant and lowerright quadrant defined by q multiplied with |S|:

  • 61 1

    1

    wWA

    WB

    WC

    W[2]

    W[3]W[1]

    (a) Partitioning for S1.

    WB WC

    1 1

    1W[2]

    WA

    W[3]W[1]

    w

    (b) Partitioning for S2.

    Fig. 5. Solution space for 3-dimensional data.

    |I| = |S| |S| q[1] q[2] |S| (1 q[1]) (1 q[2])= |S| (q[1] + q[2] 2 q[1] q[2])

    4.2 Higher Dimensional DataIn higher dimensions (d>2), all valid weighting vec-tors of the RTOPk query form a (d-1)-dimensionalhyperplane that contains the points wi[j]=0 j 6= iand wi[j]=1 for j=i and 1 i d. A monochro-matic RTOPk query returns the partitions Wi of thehyperplane, for which the query point q is in theTOPk(w),w Wi. In the following, we provide anexample for finding the partitions for d>2.Let us consider a 3-dimensional dataset S1 con-

    taining only three points A=[1, 0, 0], B=[0, 1, 0] andC=[0, 0, 1]. We denote as WA, WB and WC the par-titions for which A, B and C are the top-1 data pointrespectively. Similar to the 2-dimensional case, theborders of the partitions are defined by the weightingvectors for which the relative order between twopoints changes. To define the borders of the partitionsWA andWB , we need to examine the locus of weightsw for which fw(A)=fw(B). From this equation, wederive that w=[1c2 ,

    1c2 , c] where c [0, 1]. The vec-

    tors w form a line that divides the solution space intotwo partitions.Furthermore, we seek only the weighting vectors

    for which fw(A)=fw(B) 13 . Therefore, theborder between the partitions WA and WB is theline segment defined by the points [1/3, 1/3, 1/3] and[1/2, 1/2, 0]. By repeating the same procedure for theother pairs of points, the result is the partitioningdepicted in Fig. 5(a). Notice that there exists a singleweighting vector w=[1/3, 1/3, 1/3] for which all threedata points have the same score for the dataset S1.Fig. 5(b) depicts the partitions of the solution

    space for another dataset S2 containing the pointsA=[1/2, 0, 1/2], B=[1/2, 1/2, 0] and C=[0, 1/2, 1/2]. Fordataset S2, in order to detect the border between WAand WB , only the constraint changes to c Limit.

    In our experimental evaluation, we also examine astraightforward approach, namely UNIFORM, wherethe algorithm decides to split the cell that has thelargest volume, without using the cost function. Thestopping condition follows the space-bounded strat-egy, i.e., splitting stops when a specified number ofcells are created.

    6.4 Supporting Arbitrary k ValuesIn this section, we generalize our approach to supportreverse top-k queries for arbitrary values of k, usinga common RTOP-Grid. Given an upper limit Kmax,the RTOP-Grid is constructed forKmax and additionalinformation is stored that enables processing queriesfor any k value (k Kmax). For each weighting vectorwz , the rank of the cell corner, i.e., the minimumk for which the corner is in the top-k result set ofwz , is additionally maintained. Thus, the material-ized view can be described as LLi = {(wz, k

    Lz )} and

    LUi = {(wz, kUz )}.

    Algorithm 3 can be adjusted to process reversetop-k queries over a grid constructed for arbitraryk Kmax. First, the cell Ci that encloses q is

    determined. Then, the weighting vectors that arecontained in LLi are examined, while weightingvectors that are not in LLi cannot contribute to thereverse top-k result set of q. For any wz L

    Li , the

    following cases are distinguished (the following codereplaces lines 6-10 of Algorithm 3):

    IF (kLz k) THENIF (wz LUi and kUz k)THEN

    W W {wz}

    ELSEWcand Wcand {wz}

    6.5 UpdatesUpdates that occur either in W or S affect the ma-terialized RTOPk views, therefore they should besupported efficiently. In case of insertion of a newweighting vector wins, we need to progressively ex-amine the corners of the grid, starting from the originof the data space. If a corner CLi (C

    Ui ) does not qualify

    as top-k object for wins, then we can safely discard allcorners dominated by CLi (C

    Ui ). Deletion of an existing

    weighting vector wdel is simple, as it requires removalof wdel from the lists of any corner of the grid.Insertion of a data point pins is more costly, since

    only grid corners that dominate pins are discarded.For the remaining corners, we cannot avoid com-puting the reverse top-k query. However, GRTA canbe used and only weighting vectors that belong tothe materialized views of the cell corner have to beevaluated, since no weighting vectors can be added,but only some of them may be removed from thematerialized view. Similarly a data point pdel thatis removed from the dataset probably influences allnon-dominating cell corners, therefore we need torecompute the materialized views for them, since newweighting vectors may have to be added.

    7 EXPERIMENTAL EVALUATIONIn this section, we present an extensive experimentalevaluation of reverse top-k queries. All algorithms areimplemented in Java and the experiments run on a3GHz Dual Core AMD processor equipped with 2GBRAM. The block size is 8KB.As far as the dataset S is concerned, both real

    and synthetic data collections, namely uniform (UN),correlated (CO) and anticorrelated (AC), are used. Forthe uniform dataset, all attribute values are gener-ated independently using a uniform distribution. Thecorrelated and anticorrelated datasets are generatedas described in [9]. We also use two real datasets:NBA (17265 tuples, d=5) and HOUSE (127930 tuples,d=6) [3]. For the dataset W , two different data dis-tributions are examined, namely uniform (UN) andclustered (CL). For the clustered dataset W , first CW

  • 12

    10

    100

    1000

    10000

    100000

    2 3 4 5

    Tim

    e (m

    sec)

    Dimensionality (d)

    UNACCO

    (a) Average time.

    1

    10 100

    1000 10000

    100000 1e+006 1e+007

    2 3 4 5

    I/Os

    Dimensionality (d)

    UNACCO

    (b) I/Os.

    1

    10

    100

    1000

    10000

    100000

    1e+006

    2 3 4 5

    #Top

    -k E

    valu

    atio

    ns

    Dimensionality (d)

    UNACCO

    (c) Number of top-k evaluations.

    Fig. 8. Performance of RTA for varying d [naive (outer bar) vs. RTA (inner bar)].

    10

    100

    1000

    10000

    100000

    2 3 4 5

    Tim

    e (m

    sec)

    Dimensionality (d)

    UNACCO

    (a) Average time.

    1

    10 100

    1000 10000

    100000 1e+006 1e+007

    2 3 4 5

    I/Os

    Dimensionality (d)

    UNACCO

    (b) I/Os.

    1

    10

    100

    1000

    10000

    100000

    1e+006

    2 3 4 5

    #Top

    -k E

    valu

    atio

    ns

    Dimensionality (d)

    UNACCO

    (c) Number of top-k evaluations.

    Fig. 9. Performance of RTA for varying d for k-skyband queries [naive (outer bar) vs. RTA (inner bar)].

    cluster centroids that belong to the (d-1)-dimensionalhyperplane defined by

    wi = 1 are selected ran-

    domly. Then, each coordinate is generated on the(d-1)-dimensional hyperplane by following a normaldistribution on each axis with variance 2W , and amean equal to the corresponding coordinate of thecentroid.We evaluate the performance of RTA against an

    alternative technique (referred as naive) that evaluatesa top-k query for each weight in the dataset W .In particular, both for RTA and naive, the datasetS is indexed by an R-tree and top-k processing isperformed using a state-of-the-art branch-and-boundalgorithm. Our metrics include: a) the time (wall-clock time), b) the I/Os, and c) the number of top-kevaluations required by each algorithm. We presentaverage values over the 1000 queries in all cases.Notice that we do not measure the I/Os that occurby reading W , since this cost is the same for everymethod.

    7.1 Performance Evaluation of RTAIn Fig. 8, we study the behavior of RTA for increasingdimensionality d, for various distributions (UN, AC,CO) of dataset S and uniform weights W . We use|S|=10K, |W |=10K, top-k=10 and 1000 random queriesthat follow the data distribution. Notice that the y-axisis in logarithmic scale. In the bar charts, each of thethree bars (for a specific dimensionality) representsa dataset: UN, AC, and CO respectively. The totallength of the bar represents the performance of naive,while the inner bar depicts the performance of RTA.Regarding time, RTA is 2 orders of magnitude better

    than naive, in all examined data distributions. Interms of I/Os, again RTA outperforms naive by 1to 3 orders of magnitude, while larger savings areobtained for datasets UN and CO. The reason behindRTAs superiority is clearly demonstrated in Fig. 8(c),where the number of top-k evaluations necessary forcomputing an RTOPk query is shown. The thresholdemployed by RTA reduces significantly the numberof top-k evaluations, saving around 1.5 to 3 ordersof magnitude compared to naive. Notice that naiverequires |W | (=10K) top-k evaluations in all cases.An interesting observation is that only a small

    percentage (around 2%) of the queries actually returnnon-empty result sets. Since the queries are generatedfollowing the data distribution, many queries are notin the top-k result for any weighting vector. Reversetop-k queries with empty result sets are also very in-formative for a product manufacturer, since they indi-cate that the particular product is not popular for anycustomer, compared to their competitors products.On the other hand, RTA processes RTOPk queriesthat have a small or empty result set efficiently byoften requiring only one top-k evaluation, because thethreshold employed eliminates candidate weightingvectors that do not belong to the result set. In contrast,naive does not have this ability and always com-putes |W | top-k queries. In order to generate a morechallenging query workload for RTA, we increase theprobability that a query point belongs to a top-kresult, by selecting random query points from the k-skyband7 of the dataset. Obviously, these query points

    7. A k-skyband query returns the set of points which are domi-nated by at most k-1 other points.

  • 13

    are more likely to produce non-empty reverse top-kresults. This query workload corresponds to queriesabout products that seem popular to customers, andmanufacturers are expected to pose such queries withhigh probability.Fig. 9 depicts the results obtained by using k-

    skyband queries for the same experimental setupdepicted in Fig. 8. Although we witness a smalldeterioration in the results of RTA, our algorithm con-sistently outperforms naive by 1 to 2 orders of mag-nitude. Some interesting observations can be madeby studying Fig. 9(c). First, we notice that the cor-related dataset requires more top-k evaluations. Thereason is that the k-skyband of a correlated datasetcontains few points that are close to the origin of thedata space, and therefore such points are in the top-k for many weighting vectors. Second, we observea decreasing tendency as dimensionality increases,which seems counterintuitive at first. However, this isbecause again the cardinality of bRTOPk(q) decreasesas the dimensionality increases. For the rest of ourexperiments, we use k-skyband queries and we do notshow the results of naive, as it performs consistentlyworse than RTA by few orders of magnitude.Thereafter, we perform a scalability study of RTA in

    Fig. 10. We use as metric the number of top-k evalua-tions, as it is the dominant factor for the performanceof RTA. First, we increase the cardinality of W usingdifferent data distributions of S (Fig. 10(a)). We fix theremaining parameters to |S|=10K, d=5 and top-k=10.We observe that RTA is highly efficient, especiallyfor the costly CO dataset. For instance, for |W |=5K,RTA needs on average 544 top-k evaluations, whilethe average mandatory cost is 459 (this is the numberof queries that cannot be avoided, also equal to theaverage size of the result set). Thus, RTA needs only544 (10.88%) out of 5000 query evaluations (100%),which is just marginally more than the mandatory 459(9.18%), and therefore RTA saves 89.12% of the cost.In Fig. 10(b), we set |W |=10K and gradually increase

    the cardinality of S to 100K. For the CO dataset, weobserve that fewer top-k evaluations are necessarywith increasing |S|. The main reason is that the dataspace has more data points, thus becomes denser,and k-skyband queries lead to result sets with fewerweighting vectors, hence smaller processing cost. InFig. 10(c), we use |S|=10K and |W |=10K, and studyhow the value of k affects the performance of RTA.It is clear that RTA is highly efficient for UN andAC datasets, and its performance is affected onlyfor CO. Higher values of k increase the probabilitythat a query point belongs to top-k for some weight-ing vector, and therefore the average cardinality ofbRTOPk(q) increases, leading to more top-k evalua-tions.We also study the performance of RTA for a clus-

    tered dataset W , using CW=5 clusters of weightingvectors. A clustered dataset W simulates the case

    where user preferences are not independent, but thereexist some groups of common user preferences. Inthis experiment, we use the default setup and varythe dimensionality. Thus, Fig. 11(a) is analogous andalso comparable to Fig. 9(c) which was for a uniformdataset W . The results show that, in the case ofclustered dataset W , RTA performs better than foruniform W for all data distributions, neverthelessthe general trends remain the same as dimensionalityincreases. In Fig. 11(b), we assess RTA using the NBAdataset. The performance of RTA is in accordance withthe case of synthetic data. We try both a uniform andclustered dataset W and the results show again thatfewer top-k evaluations are required for the clustereddataset W . In Fig. 11(c), a similar experiment is con-ducted using the HOUSE dataset.

    7.2 Performance Evaluation of RTOP-GridIn the sequel, we evaluate the performance of RTOP-Grid in Fig. 12. Unless mentioned explicitly, we use|S|=10K, |W |=10K, d=5 and top-k=10. First, we pro-vide a comparison of the RTOP-Grid space-boundedstrategy to the UNIFORM approach and to RTA(Fig. 12(a)), for increasing number of cells. RTOP-Grid performs consistently better than UNIFORM,demonstrating the advantages of using the cost-basedsplitting strategy, instead of splitting the cell withthe maximum volume. RTOP-Grid also provides animprovement to RTA, in terms of the required numberof top-k evaluations as expected, and in this setupit achieves a reduction of top-k evaluations between18.5% (100 cells) and 26.3% (1000 cells).In Fig. 12(b), we test the RTOP-Grid guaranteed

    cost strategy versus RTA, with increasing cost bound,for top-k={10, 20}. The chart shows that RTOP-Gridreduces the number of top-k evaluations comparedto RTA by 30%, when the cost bound is set to 100.As expected, when the bound imposed on cost issmaller, RTOP-Grid improves RTA more. Notice thatin most cases the actual number of top-k evaluationsis smaller than the bound set on average cost. Thisis because the average cost is estimated based on thenumber of weighting vectors in the views, and it doesnot take into account the additional savings in top-kquery evaluations caused by the threshold mechanismof RTA, employed also by RTOP-Grid. In Fig. 12(c),we show the number of cells created by RTOP-Gridfor the same experiment. Clearly, the number of cellsincreases rapidly when the cost bound is set too low.However, similar improvements can be obtained byrelaxing the cost bound, i.e., notice that setting thebound to 200 achieves similar performance to thebound of 100, using much fewer cells. Furthermore,we study the scalability of RTOP-Grid for varying val-ues of |W |, |S| and top-k. Fig. 13(a) shows the resultsobtained by increasing |W |. RTOP-Grid consistentlyoutperforms UNIFORM and improves RTA. Then, in

  • 14

    0

    200

    400

    600

    800

    1000

    1200

    5000 10000 15000

    #Top

    -k E

    valu

    atio

    ns

    Cardinality |W|

    RTA-UNRTA-ACRTA-CO

    (a) Scalability with |W |.

    0 100 200 300 400 500 600 700 800 900

    1000

    10000 50000 100000

    #Top

    -k E

    valu

    atio

    ns

    Cardinality |S|

    RTA-UNRTA-ACRTA-CO

    (b) Scalability with |S|.

    0

    500

    1000

    1500

    2000

    10 20 30 40 50

    #Top

    -k E

    valu

    atio

    ns

    top-k

    RTA-UNRTA-ACRTA-CO

    (c) Scalability with top-k.

    Fig. 10. Scalability study of RTA.

    0

    500

    1000

    1500

    2000

    2500

    3000

    2 3 4 5

    #Top

    -k E

    valu

    atio

    ns

    Dimensionality (d)

    RTA-UNRTA-ACRTA-CO

    (a) Clustered dataset W .

    0

    100

    200

    300

    400

    500

    600

    10 20 30 40 50

    #Top

    -k E

    valu

    atio

    ns

    top-k

    RTA/UNRTA/CL

    (b) Real dataset (NBA).

    0 100 200 300 400 500 600 700 800

    10 20 30 40 50

    #Top

    -k E

    valu

    atio

    ns

    top-k

    RTA/UNRTA/CL

    (c) Real dataset (HOUSE).

    Fig. 11. Performance of RTA for clustered weights W and for real data (NBA and HOUSE).

    0

    50

    100

    150

    200

    100 500 1000

    #Top

    -k E

    valu

    atio

    ns

    Number of Cells

    RTOP-GridUNIFORM

    RTA

    (a) Space-bounded strategy.

    0

    50

    100

    150

    200

    250

    300

    100 200 300 400 500

    # To

    p-k

    Eval

    uatio

    ns

    Cost bound

    RTOP-Grid, top-10RTA, top-10

    RTOP-Grid, top-20RTA, top-20

    (b) Guaranteed cost strategy.

    0

    2000

    4000

    6000

    8000

    10000

    100 200 300 400 500

    Num

    ber o

    f Con

    stru

    cted

    Cel

    ls

    Cost bound

    top-10top-20

    (c) Number of constructed cells.

    Fig. 12. Performance evaluation of the strategies of RTOP-Grid.

    Fig. 13(b), we set |W |=10K and increase |S|. Onceagain, the gain of RTOP-Grid over RTA is sustainedin all setups. Finally, in Fig. 13(c), the chart showshow the cost is affected by increasing k. RTOP-Gridperforms better than RTA and UNIFORM for all kvalues and the benefit increases with k.

    7.3 Monochromatic vs. Bichromatic AlgorithmIn Fig. 14, we set d=2 and study the comparative per-formance of RTA against the monochromatic RTOPkalgorithm. To this end, we apply the monochromaticalgorithm to identify the partitions of W that belongto the solution set, and then retrieve the subset ofweighting vectors W that belong to these partitions.The data distribution for both S and W is uniform,denoted as UN/UN. The default values used are|S|=10K, |W |=10K, and top-k=10.In Fig. 14(a), we measure the time required by

    each algorithm employing a logarithmic scale. Weobserve that RTA scales better with |S|, compared tothe monochromatic algorithm. In Fig. 14(b), we vary

    the cardinality of W , from 5K to 15K vectors. It turnsout that the monochromatic reverse top-k algorithmscales better with |W |, because it is immune to theactual cardinality of W , as it only needs to identifythe partitions of the space that provide solutions tothe query. The number of weighting vectors doesnot affect significantly the time required to identifythe partitions. However, RTA needs to examine (andpotentially process) more top-k queries as the cardi-nality ofW increases, therefore its total time increases.Similarly, in Fig. 14(c), RTA requires more time tocompute the result for increasing values of top-k.Again, the monochromatic reverse top-k algorithm ispractically unaffected by the increased values of k.

    All in all, RTA is more sensitive to higher valuesof |W | and top-k, thus the monochromatic RTOPk al-gorithm scales better. In contrast, the monochromaticalgorithm is sensitive to the cardinality of S, thereforeRTA performs better for increased values of |S|.

  • 15

    50

    100

    150

    200

    5000 10000 15000

    #Top

    -k E

    valu

    atio

    ns

    Cardinality |W|

    RTOP-GridUNIFORM

    RTA

    (a) Scalability with |W |.

    20 40 60 80

    100 120 140 160 180 200

    10000 50000 100000

    #Top

    -k E

    valu

    atio

    ns

    Cardinality |S|

    RTOP-GridUNIFORM

    RTA

    (b) Scalability with |S|.

    0 100 200 300 400 500 600 700

    10 20 30 40 50

    #Top

    -k E

    valu

    atio

    ns

    top-k

    RTOP-GridUNIFORM

    RTA

    (c) Scalability with top-k.

    Fig. 13. Scalability study of RTOP-Grid for the space-bounded strategy.

    1

    10

    100

    1000

    10000

    100000

    10000 50000 100000

    Tim

    e (m

    sec)

    Cardinality |S|

    MONO-UN/UNRTA-UN/UN

    (a) Scalability with |S|.

    0 200 400 600 800

    1000 1200 1400

    5000 10000 15000

    Tim

    e (m

    sec)

    Cardinality |W|

    MONO-UN/UNRTA-UN/UN

    (b) Scalability with |W |.

    0

    500

    1000

    1500

    2000

    2500

    3000

    10 20 30 40 50

    Tim

    e (m

    sec)

    top-k

    MONO-UN/UNRTA-UN/UN

    (c) Scalability with top-k.

    Fig. 14. Performance of monochromatic algorithm vs. RTA.

    2

    4

    6

    8

    10

    12

    2 3 4 5

    eva

    luat

    ed k

    Dimensionality (d)

    RTAINC

    (a) Evaluated value of k.

    1

    20000

    40000

    60000

    80000

    2 3 4 5

    #Top

    -k E

    valu

    atio

    ns

    Dimensionality (d)

    RTAINC

    (b) Number of top-k evaluations.

    Fig. 15. Incremental threshold refinement vs. RTA.

    1

    10

    100

    1000

    10000

    100000

    2 3 4 5

    #Top

    -k E

    valu

    atio

    ns

    Dimensionality (d)

    RTA-NoSort/UNRTA-Simple/UN

    RTA/UN

    Fig. 16. Effect of sorting W .

    7.4 RTA with Incremental Threshold Refinement

    In the next experiment (Fig. 15), we study the vari-ant of RTA algorithm that incrementally refines thethreshold during each top-k evaluation, as discussedin Section 5.4. We denote this variant of RTA as INC.We use the default setup of uniform data distributionfor S and W , |S|=10K, |W |=10K, top-k=10 and wevary the dimensionality from 2 to 5. In Fig. 15(a),the average value of k used for the top-k queries isshown. RTA always issues top-k queries with k=10.In contrast, INC results in retrieving in each top-kevaluation significantly fewer objects than k. On theother hand, this leads to an increased number of top-k query evaluations performed by INC (Fig. 15(b)).By retrieving fewer than k data objects, the bufferis partially updated, leading to an inaccurate thresh-old, which in turn triggers more top-k evaluations.Concluding, INC requires a higher number of top-kquery evaluations, however retrieving fewer than kdata objects on average.

    7.5 Effect of Sorted Weighting VectorsIn Fig. 16, we examine the improvement of the per-formance of RTA caused by the sorting of W basedon pairwise similarity. We compare RTA against aversion that does not employ sorting (RTA-NoSort)and accesses the vectors in a random order. In ad-dition, we test the performance of sorting based onsimilarity to a predetermined vector (RTA-Simple),namely the diagonal vector of the space. We evaluateall approaches with weighting vectors that followa uniform data distribution. The results show thatdepending on the dimensionality RTA requires upto one order of magnitude fewer top-k evaluationsthan RTA-NoSort. Furthermore, the performance ofRTA using pairwise similarity is clearly much betterthan RTA-Simple, while the latter is only slightlybetter than RTA-NoSort. This experiment verifies thatour proposed sorting based on pairwise similarityis appropriate for the RTA algorithm. Nevertheless,all variants perform much better than naive, whichevaluates |W |=10K top-k queries.

  • 16

    8 RELATED WORKAs reverse top-k queries are inherently related totop-k query processing, we summarize some repre-sentative work here. One family of algorithms arethose based on preprocessing techniques. Onion [7]precomputes and stores the convex hulls of datapoints in layers. Then, the evaluation of a linear top-k query is accomplished by processing the layersinwards, starting from the outmost hull. Prefer [1]uses materialized views of top-k result sets, accord-ing to arbitrary scoring functions. Onion and Pre-fer are mostly appropriate for static data, due tothe high cost of preprocessing. Efficient maintenanceof materialized views for top-k queries is discussedin [10]. The robust index [2] is a sequential indexingapproach that improves the performance of Onion [7]and Prefer [1]. The main idea is that a tuple shouldbe placed at the deepest layer possible, to reducethe probability of accessing it at query processingtime, without compromising the correctness of theresult. Later, in [11], the dominant graph is proposedas a structure that captures dominance relationshipsbetween points. Another family of algorithms focuseson computing the top-k queries over multiple sources,where each source provides a ranking of a subset ofattributes only. Fagin et al. [12] introduce TA and NRAalgorithms. Variations of them have been proposedleading to various threshold-based algorithms [13],[14], [15], [16].Reverse nearest neighbor (RNN) queries were orig-

    inally proposed in [4]. An RNN query finds theset of points that have the query point as theirnearest neighbor. Recently, reverse furthest neighborqueries [17] are introduced, that are similar to RNNqueries. The reverse skyline query [5], [6] identifiescustomers that would be interested in a productbased on the dominance of the competitors products.DADA [18] aims to help manufactures position theirproducts in the market, based on three types of dom-inance relationship analysis queries. Creating com-petitive products has been recently studied in [19].Nevertheless in these approaches, user preferencesare expressed as data points that represent preferableproducts, whereas reverse top-k queries examine userpreferences in terms of weighting vectors. Miah etal. [20] study a different problem, again from theperspective of manufacturers. The authors proposean algorithm that selects the subset of attributes thatincreases the visibility of a new product. Finding themost influential products with reverse top-k querieshas been studied in [21].

    9 CONCLUSIONSIn this paper, we introduce the reverse top-k querywhich retrieves all weighting vectors for which thequery point belongs to the top-k result set. The pro-posed query type is important for market analysis

    and for estimating the impact of a product based onthe user preferences and the competitors products. Westudy two versions of reverse top-k queries, namelymonochromatic and bichromatic. For the monochro-matic reverse top-k query only a dataset of productsis given, while for the bichromatic reverse top-k aset of weighting vectors is also available. The exper-imental evaluation demonstrates the efficiency of theproposed algorithms.

    REFERENCES[1] V. Hristidis, N. Koudas, and Y. Papakonstantinou, Prefer: A

    system for the efficient execution of multi-parametric rankedqueries, in SIGMOD, 2001, pp. 259270.

    [2] D. Xin, C. Chen, and J. Han, Towards robust indexing forranked queries, in VLDB, 2006, pp. 235246.

    [3] A. Vlachou, C. Doulkeridis, Y. Kotidis, and K. Nrvag, Re-verse top-k queries, in ICDE, 2010, pp. 365376.

    [4] F. Korn and S. Muthukrishnan, Influence sets based onreverse nearest neighbor queries, in SIGMOD, 2000, pp. 201212.

    [5] E. Dellis and B. Seeger, Efficient computation of reverseskyline queries, in VLDB, 2007, pp. 291302.

    [6] X. Lian and L. Chen, Monochromatic and bichromatic reverseskyline search over uncertain databases, in SIGMOD, 2008,pp. 213226.

    [7] Y.-C. Chang, L. D. Bergman, V. Castelli, C.-S. Li, M.-L. Lo,and J. R. Smith, The onion technique: Indexing for linearoptimization queries, in SIGMOD, 2000, pp. 391402.

    [8] D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis II, An anal-ysis of several heuristics for the traveling salesman problem,SIAM Journal on Computing, vol. 6, no. 3, pp. 563581, 1977.

    [9] S. Borzsonyi, D. Kossmann, and K. Stocker, The skylineoperator. in ICDE, 2001, pp. 421430.

    [10] K. Yi, H. Yu, J. Yang, G. Xia, and Y. Chen, Efficient mainte-nance of materialized top-k views, in ICDE, 2003, pp. 189200.

    [11] L. Zou and L. Chen, Dominant graph: An efficient indexingstructure to answer top-k queries, in ICDE, 2008, pp. 536545.

    [12] R. Fagin, A. Lotem, and M. Naor, Optimal aggregation algo-rithms for middleware. in PODS, 2001, pp. 102113.

    [13] R. Akbarinia, E. Pacitti, and P. Valduriez, Best position algo-rithms for top-k queries, in VLDB, 2007, pp. 495506.

    [14] S. Chaudhuri and L. Gravano, Evaluating top-k selectionqueries. in VLDB, 1999, pp. 397410.

    [15] U. Guntzer, W.-T. Balke, and W. Kieling, Optimizing multi-feature queries for image databases. in VLDB, 2000, pp. 419428.

    [16] A. Marian, N. Bruno, and L. Gravano, Evaluating top-kqueries over web-accessible databases. ACM Transactions onDatabase Systems, vol. 29, no. 2, pp. 319362, 2004.

    [17] B. Yao, F. Li, and P. Kumar, Reverse furthest neighbors inspatial databases, in ICDE, 2009.

    [18] C. Li, B. C. Ooi, A. K. H. Tung, and S. Wang, Dada: a datacube for dominant relationship analysis, in SIGMOD, 2006,pp. 659670.

    [19] Q. Wan, R. C.-W. Wong, I. F. Ilyas, M. T. Ozsu, and Y. Peng,Creating competitive products, PVLDB, vol. 2, no. 1, pp.898909, 2009.

    [20] M. Miah, G. Das, V. Hristidis, and H. Mannila, Standing outin a crowd: Selecting attributes for maximum visibility, inICDE, 2008, pp. 356365.

    [21] A. Vlachou, C. Doulkeridis, K. Nrvag, and Y. Kotidis, Iden-tifying the Most Influential Data Objects with Reverse top-kqueries, in PVLDB, vol. 3, no. 1, pp. 364372, 2010.

    IntroductionPreliminariesReversing Top-k QueriesExample of Reverse Top-k QueryDifferences to Existing Query Types

    Monochromatic RTOPk QueriesMonochromatic RTOPk Query for 2dHigher Dimensional Data

    Bichromatic RTOPk QueriesThreshold-based Algorithm (RTA)Sorted Access to Weighting VectorsRTA ExampleIncremental Threshold Refinement

    Materialized RTOPk ViewsDefinitions and PropertiesGrid-based RTOPk Algorithm (GRTA)RTOP-Grid ConstructionSupporting Arbitrary k ValuesUpdates

    Experimental EvaluationPerformance Evaluation of RTAPerformance Evaluation of RTOP-GridMonochromatic vs. Bichromatic AlgorithmRTA with Incremental Threshold RefinementEffect of Sorted Weighting Vectors

    Related WorkConclusionsReferences