

λ-Diverse Nearest Neighbors Browsing for Multi-dimensional Data

Onur Kucuktunc, Hakan Ferhatosmanoglu

Abstract—Traditional search methods try to obtain the most relevant information and rank it according to the degree of similarity to the queries. Diversity in query results is also preferred by a variety of applications since results very similar to each other cannot capture all aspects of the queried topic. In this work, we focus on the λ-diverse k-nearest neighbor search problem on spatial and multi-dimensional data. Unlike the approach of diversifying query results in a post-processing step, we naturally obtain diverse results with the proposed geometric and index-based methods. We first make an analogy with the concept of natural neighbors and propose a natural neighbor-based method for 2D and 3D data and an incremental browsing algorithm based on Gabriel graphs for higher dimensional spaces. We then introduce a diverse browsing method based on the distance browsing feature of spatial index structures, such as R-trees. The algorithm maintains a priority queue with the mindivdist of the objects, depending on both relevancy and angular diversity, and efficiently prunes non-diverse items and nodes. We experiment with a number of spatial and high-dimensional datasets, including Factual's US points-of-interest dataset of 13M entries. On the experimental setup, the diverse browsing method is shown to be more efficient (regarding disk accesses) than k-NN search on R-trees, and more effective (regarding Maximal Marginal Relevance) than the diverse nearest neighbor search techniques found in the literature.

Index Terms—diversity, diverse nearest neighbor search, angular similarity, natural neighbors, Gabriel graph

1 INTRODUCTION

Most similarity search methods in the literature produce results based on the ranked degree of similarity to the query. However, the results could be unsatisfactory, especially when there is an ambiguity in the query or when the search results include redundantly similar data. It is typically better to answer the query with diverse search results instead of homogeneous results representing similar cases. A reasonable strategy for responding to an ambiguous query is to return a mixture of results covering all aspects of the query. Different users may have different intentions while searching, and at least the initial result should include a diverse set of results to better match a variety of users' expectations. Redundantly repeating search results is another problem of conventional similarity search techniques, particularly for search spaces that include many duplicate data. In this case, similar but homogeneous information will fill up the top results. This situation has been discussed in several application areas, such as recommender systems [33], online shopping [29], and web search [8].

Similar problems also exist when querying and browsing spatial data. In some applications, diversity is preferred over similarity due to the information overload (see Fig. 1). Suppose a criminal is spotted in NYC by a camera at 6th Ave & 33rd St (C0). Back in the police department, the police have access to a number of cameras in the city, but have a limited number of screens (say k = 5) to display the views of different cameras. The cameras are labeled in order of their distance from C0. Instead of returning the closest points-of-interest (POIs), a result set containing close yet diverse results is preferred for such an application.

• O. Kucuktunc is with the Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210. E-mail: [email protected]

• H. Ferhatosmanoglu is with the Department of Computer Engineering, Bilkent University, Ankara, Turkey. E-mail: [email protected] This work was done in part while the author was at The Ohio State University.


Fig. 1. A location-based service application can be adversely affected by the information overload. A conventional similarity search (e.g., k-nearest neighbor search) returns four cameras on W 33rd Street and one from Broadway (blue squares); whereas diverse browsing can capture the spatial distribution around the first camera and provide superior results (red circles).

Diversity in k-NN search is not limited to the spatial domain. A diverse k-NN classifier can be potentially useful for medical diagnosis since it is more likely to unveil minority reports by grouping and eliminating similar cases. Suppose that a number of medical records are labeled with '+' and '-' labels depending on whether a patient has disease D or not, respectively. Given a patient's medical records q, the aim of a k-NN classifier is to classify the patient as D+ or D− by finding the majority class label among the closest k records. Fig. 2 depicts the classification results obtained from the k-NN classifier and the diverse k-NN classifier.


Fig. 2. A classification example with k-NN (left) and diverse k-NN (right) classifiers for k = 5. Although four out of five closest medical data points to q are D−, a diverse perspective unveils minority reports and classifies the patient as D+ with 60% confidence.
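To make the classification example concrete, the following minimal sketch (our own illustration, not part of the paper; diverse_knn stands for any λ-diverse k-NN routine, e.g., one built on Algorithm 5, and is assumed to be given) shows how such a classifier would vote:

from collections import Counter

def diverse_knn_classify(query, records, labels, k, diverse_knn):
    # diverse_knn(query, records, k) is assumed to return the indices of
    # the k lambda-diverse nearest records (hypothetical helper).
    idx = diverse_knn(query, records, k)
    votes = Counter(labels[i] for i in idx)
    label, count = votes.most_common(1)[0]
    return label, count / len(idx)   # e.g., D+ with 3/5 = 60% confidence (Fig. 2)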

The relation between diversity and relevance was investigated before, especially in text retrieval and summarization. Researchers have proposed linear combinations of diversity and relevance [4]. However, maximizing the diversity of a result set is known to be NP-hard [5], [6]. Hence, some studies develop heuristic techniques [18], [31] to optimize the results.

Considering the diversification problem in the spatial domain, it is possible to present an intuitive solution based on clustering. Data can be initially clustered, and then representatives of the clusters around the query point can be returned as search results. Although clustering can be computationally expensive, there are methods to generate those representatives with tree-based approaches [21]. The problem with clustering-based methods is that the initial clusters may be unsatisfactory depending on the settings of the query. Furthermore, if the data needs to be clustered for each query, the method is obviously not scalable. Today, most database management systems support a spatial index (e.g., R-tree or one of its variants). Therefore, diversity can be obtained by taking advantage of the spatial index without the extra cost of clustering.

In this work, we first give a geometric definition of diversity by making an analogy with the concept of natural neighbors. We propose a natural neighbor-based method and an incremental browsing algorithm based on Gabriel graphs for the diverse nearest neighbor search problem. We also introduce a diverse browsing method based on the popular distance browsing feature of R-tree index structures. The method maintains a priority queue with the mindivdist of the objects, depending on both relevancy and diversity, and efficiently prunes non-diverse items and nodes. Providing a measure that captures both relevancy and diversity, we show that pruning internal nodes with respect to their diversity from the items in the result set helps us achieve more diverse results. The contributions of this paper can be summarized as follows:

• We formalize the λ-diverse k-nearest neighbor search problem based on angular similarity and develop measures to evaluate the relevancy and diversity of the retrieved results.

• We propose two geometric diverse browsing approaches for static databases, each of which effectively captures the spatial distribution around a query point and hence gives a diverse set of results.

• Extending the distance browsing feature, we introduce an efficient λ-diverse k-nearest neighbor search algorithm on R-trees, namely diverse browsing, and prove its correctness.

• We conduct experiments on 2D and multi-dimensional datasets to evaluate the performance of the proposed methods.

Key advantages of the proposed geometric and index-based methods are:

• Geometric methods are appropriate for static databases and perform very efficiently once the graphs are built.

• Index-based diverse browsing does not require any change in the index structure; therefore, it can easily be integrated into various databases.

• With effective pruning, diverse browsing performs more efficiently than k-NN search on R-trees and also performs better than the state-of-the-art techniques regarding the MMR metric.

The rest of the paper is organized as follows: related work is discussed in Section 2. The problem formulation is given in Section 3. Geometric approaches are defined in Section 4, and index-based diverse browsing is proposed in Section 5. Experiments are reported in Section 6. Section 7 gives conclusions and future work.

2 RELATED WORK

There are notable works on diverse ranking in the literature. Carbonell and Goldstein [4] describe the Maximal Marginal Relevance (MMR) method for text retrieval and summarization. MMR attempts to find a result set by maximizing the query relevance while also minimizing the similarity between documents in the result set. The proposed method combines relevancy and novelty with a user-defined parameter (λ), which controls the relative importance of relevancy and diversity in the results.

Since the problem of finding diverse results is known to be NP-hard, Jain et al. [15], [18] investigate the k-nearest diverse neighbor search problem and develop two greedy approaches to optimize the results in terms of both relevancy and diversity. Both proposed methods employ the advantages of an available R-tree index. Immediate Greedy (IG) incrementally grows the result set R by including nearest points only if they are diverse enough from the data points already in R. Buffered Greedy (BG) tries to overcome some deficiencies of IG. They use the R-tree index only for getting the query's nearest neighbors in the dataset. Yu et al. [30], [31] address the issue of diversification in recommendation systems and introduce two heuristic algorithms to maximize the diversity under different relevance constraints. They state that maximizing diversity is about finding a balance between relevance and diversity. The proposed Swap algorithm basically tries to swap elements which are less likely to contribute to the set diversity with diverse ones. The Greedy algorithm, similar to IG, includes the next most relevant item in the result set only if that item is diverse with respect to the items already in the result set.

Some other studies attack the diversity problem in various ways. Liu and Jagadish [21] employ the idea of clustering to find a solution to the Many-Answers Problem. They suggest that taking one representative from each cluster results in more diverse results. They propose a tree-based approach for efficiently finding the representatives, even if the search space is constrained at runtime. Halvey et al. [14] compare dissimilarity- and clustering-based diverse re-ranking methods to introduce diversity in video retrieval results.

The notions of diversity and novelty are generally discussed in the context of information retrieval and recommendation systems. Clarke et al. [8] investigate the problems of ambiguity in queries and redundancy in results and propose an evaluation framework. Chen and Karger [7] describe a retrieval method which assigns negative feedback to the documents that are included in the result list in order to maximize diversity. Vee et al. [29] present inverted list algorithms for computing diverse query results in online shopping applications. Ziegler et al. [32], [33] present an algorithmic framework to increase the diversity of a top-k list of recommended products. In order to show its efficiency, they also introduce a new intra-list similarity metric.

There are also works on content diversity over the continuous data of publish/subscribe systems, such as news alerts, RSS feeds, and social network notifications [9], and on the diverse skyline [28]. Greedy heuristics were proposed for these problems; the discussions on how relevance and diversity should be combined, and on how well greedy approaches approximate the optimal solution, are nevertheless quite useful. Interested readers may refer to [22] for further information on diversity.

3 PRELIMINARIES

Before we state our diversity and λ-diverse k-nearest neighbor search definitions, let us first analyze the approach used by KNDN-IG and KNDN-BG [15], [18]. For KNDN-IG the objective is to find a fully diverse set of results R close to the query point q. This means that for all r1, r2 ∈ R,

divdist(r1, r2, V(q)) = Σ_{j=1}^{L} (Wj × δj) > MinDiv,

where V(q) is the set of diversity attributes, L is the number of dimensions to be diversified (in our case L = d), W is the set of weighting factors for the differences (δ1, . . . , δL) sorted in decreasing order, and MinDiv is the minimum diversity distance. The diversity computation in KNDN-IG is simply based on the Gower coefficient [13] with monotonically decaying weights W calculated as

Wj = a^(j−1) × (1 − a) / (1 − a^L),  for 1 ≤ j ≤ L,

where a is the rate of decay. KNDN-BG applies the same technique; however, it also stores the eliminated points. The method then checks whether two of those points (say p′ and p′′) are pairwise diverse and were both eliminated because of the same resulting point ri ∈ R. If it finds such a pair, R is updated as (R \ {ri}) ∪ {p′, p′′}.

The MinDiv setting assumes that the data is in [0, 1]^d space; otherwise, it must be set with knowledge of the density and the range of the data. When a is selected around 0.1, the dimension with the highest difference is overrated. This suggests that KNDN favors points along the axes; hence, it tries to find 2d diverse points in a d-dimensional space. This behavior is problematic since k could be any number. Consequently, the results do not guarantee that KNDN accurately captures the distribution around the query point. Finally, these methods do not allow the user to select how diverse/relevant the results will be, unless the importance of diversity is embedded into either the MinDiv or a parameters.
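For reference, a minimal sketch of this weighted diversity distance (our own illustrative reading of the two formulas above, written with numpy; not the authors' implementation):

import numpy as np

def kndn_divdist(r1, r2, a=0.1):
    # Gower-style diversity distance with geometrically decaying weights:
    # the largest per-dimension difference receives the largest weight W_1.
    delta = np.sort(np.abs(np.asarray(r1, float) - np.asarray(r2, float)))[::-1]
    L = delta.size
    j = np.arange(1, L + 1)
    W = a ** (j - 1) * (1 - a) / (1 - a ** L)   # weights sum to 1
    return float(np.dot(W, delta))

# two points are considered diverse if kndn_divdist(r1, r2) > MinDiv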

Our diversity definition employs the angular similarity between two points with respect to the query point. In other words, a diverse k-nearest neighbor search method should minimize the pairwise angular similarity (i.e., maximize the angular diversity) while also minimizing the overall distance. Angular similarity and diverse k-nearest neighbor search are defined in Definitions 3.1 and 3.2, respectively.

Definition 3.1: Angular similarity. Given a query point q, two points p1 and p2, and an angle θ, the angular similarity (simθ) of p1 with respect to q and another point p2 is

simθ(p1, q, p2) = 1 − ∠p1qp2 / θ,  if ∠p1qp2 < θ;  0, otherwise.    (1)

simθ is 0 if the angle ∠p1qp2 is greater than θ; it becomes 1 if both points lie in the same direction from q.
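A minimal sketch of this definition (our own illustration, assuming Euclidean vectors and numpy):

import numpy as np

def sim_theta(p1, q, p2, theta):
    # Angular similarity of p1 with respect to q and p2 (Eq. 1).
    p1, q, p2 = (np.asarray(x, dtype=float) for x in (p1, q, p2))
    v1, v2 = p1 - q, p2 - q
    cos_ang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    angle = np.arccos(np.clip(cos_ang, -1.0, 1.0))   # angle p1-q-p2 in radians
    return 1.0 - angle / theta if angle < theta else 0.0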

Definition 3.2: λ-diverse k-nearest neighbor search. Given a set of points S, a query point q, and a diversity ratio λ, the λ-diverse k-nearest neighbor search on q retrieves a set of k resulting points R = {p1, . . . , pk} such that

R = argmin_{R ⊆ S, |R| = k}  [ α Σ_{i=1}^{k} dist(q, pi)  +  (2β / (k(k−1))) Σ_{i=1}^{k} Σ_{j=i+1}^{k} simθ(pi, q, pj) ],    (2)


where α = 1 − λ and β = λ. This function minimizes the pairwise angular similarity (depending on λ) and maximizes the relevancy (depending on 1 − λ) of the results.

Note that the relative importance of diversity versus relevance is adjusted with the λ parameter. When λ = 0, the method reduces to k-NN search. For 0.5 ≤ λ ≤ 0.9, the results are expected to be diverse enough without sacrificing relevancy. As a result, Definition 3.2 provides a more flexible and user-tunable setting for the diversification of k-NN queries.
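As an illustration of Eq. 2, the following sketch (our own, reusing the sim_theta sketch above) evaluates the objective for a candidate result set R, where smaller values indicate better λ-diverse answers:

import numpy as np
from itertools import combinations

def diverse_knn_objective(q, R, lam, theta):
    # Eq. 2: alpha * sum of distances + (2*beta / (k*(k-1))) * pairwise sim_theta
    q = np.asarray(q, dtype=float)
    R = [np.asarray(p, dtype=float) for p in R]
    k = len(R)
    alpha, beta = 1.0 - lam, lam
    rel = alpha * sum(np.linalg.norm(p - q) for p in R)
    div = 0.0
    if k > 1:
        div = (2.0 * beta / (k * (k - 1))) * sum(
            sim_theta(pi, q, pj, theta) for pi, pj in combinations(R, 2))
    return rel + div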

The rest of the paper presents the proposed geometric and index-based methods for efficiently solving the λ-diverse k-nearest neighbor search problem. The notation used in the paper is given in Table 1.

TABLE 1
Notation

Symbol       Description
d            # dimensions
S            dataset composed of d-dimensional points
n            |S|, size of the dataset
k            # search results
q            d-dimensional query point
λ            importance of diversity over relevance
R            set of result points
K            set of result points for k-NN search
simθ         angular similarity of two points regarding q
DT(S)        Delaunay triangulation of points in S
GG(S)        Gabriel graph of points in S
k′           # natural neighbors for q in S ∪ {q}
W            weights wi of each natural neighbor
adj[p]       adjacent nodes/points of p in a graph
lGG(k)       # layers required to obtain k Gabriel neighbors
⊛qp          pruning sector from q towards p
→qp          vector from query point q in the direction of p
ǫ            small fraction to relax ⊛qp
θ⊛           central angle of the pruning sector
r⊛           radius of the pruning sector
pi           a point in S
pnn          the nearest neighbor point of q
Bi           an index node in the R-tree
mindist      minimum distance measure [25]
mindivdist   measure that combines mindist and simθ
PQ           priority queue
cts          current timestamp
ts[Node]     timestamp of a node in PQ

4 GEOMETRIC DIVERSE BROWSING

In the spatial domain, diverse k-nearest neighbor search is conceptually similar to the idea of natural neighbors, which is calculated with Voronoi diagrams (VD) and Delaunay triangulations (DT) [2], [19]. In this section, we first present an analogy between diversity and natural neighbors. Based on this analogy, we propose a natural neighbor-based method along with the techniques used to efficiently retrieve the natural neighbors of a query point. Although the discussion is mostly on the 2D space, the method can be extended to work on the 3D space as well. Observing the limitations of this method in higher dimensional spaces, we present another geometric approach, namely the Gabriel graph-based diverse browsing method. More information on the computational geometry concepts is given below.

4.1 Voronoi diagrams, Delaunay triangulations, and Gabriel graphs

The Voronoi diagram, Delaunay triangulation, and Gabriel graph are popular concepts in computational geometry. For readers who are unfamiliar with these concepts, the definitions and their relationships with each other and with other popular subgraphs are given in this section.

Definition 4.1: Voronoi cell. Given a finite set S = {p1, p2, . . . , pn} of d-dimensional points, the Voronoi cell V(p) of a point p ∈ S is the set of all points of R^d (defining a non-empty, open, and convex region) closer to p than to any other point in S:

V(p) = {x ∈ R^d | ∀q ∈ S, dist(x, p) ≤ dist(x, q)}.    (3)

Definition 4.2: Voronoi diagram. The Voronoi diagram VD(S) is the collection of all Voronoi cells of the points in S, forming a cell complex partitioning R^d.

Definition 4.3: Delaunay triangulation. Given a finite set S = {p1, p2, . . . , pn} of d-dimensional points, the Delaunay triangulation of S is a triangulation DT(S) such that no point in S is inside the circumcircle (circum-hypersphere for d > 2) of any simplex in DT(S).

The Delaunay triangulation is the dual graph of the Voronoi diagram for the same set of points S.

Definition 4.4: Gabriel graph. The Gabriel graph [12] consists of the edges e_ij of DT(S) for which the circle (hypersphere for d > 2) with diameter [p_i p_j] contains no other point of S:

GG(S) = {e_ij ∈ DT(S) | ∀p_k ∈ S, |p_k p_i|² + |p_k p_j|² ≥ |p_i p_j|²}.    (4)

Given a finite set S = {p1, p2, . . . , pn} of d-dimensional points, the following subgraph relationships hold:

MST(S) ⊆ RNG(S) ⊆ GG(S) ⊆ DT(S),    (5)

where MST is the minimum spanning tree and RNG is the relative neighborhood graph [11].

4.2 Analogy with Natural Neighbors

The natural neighbors (NatN) of a point p ∈ S are the points in S sharing an edge with p in DT(S). They also correspond to the Voronoi cells that are neighbors of Vp. In the case of a point x ∉ S, its natural neighbors are the points in S whose Voronoi cells would be modified if x were inserted into VD(S). The insertion of x creates a new Voronoi cell V+x that steals volume from the Voronoi cells of its potential natural neighbors (see Fig. 3).

To capture the influence of each NatN, we use natural neighbor weights as in natural neighbor interpolation [27]. Let D be VD(S), and D+ = D ∪ {x}. The Voronoi cell of a point p in D is denoted Vp, and V+p is its cell in D+. The natural neighbor weight of x with respect to a point pi is

wi(x) = Vol(Vpi ∩ V+x) / Vol(V+x),    (6)


Fig. 3. Natural neighbor weights wi of x in 2D (left); DT with and without x (right).

where Vol(Vpi) represents the volume of Vpi, and 0 ≤ wi(x) ≤ 1. The natural neighbor weights are affected by both the distance from x to pi and the spatial distribution of the pi around x.

4.3 Natural Neighbor-based Method

Based on the property of the natural neighbor concept that it captures both the distance to a query point and the spatial distribution around it, we claim that the natural neighbors of a query point q give a diverse set of similarity search results if the natural neighbor weights W are used as ranking measures. The method works as follows: (1) simulate the insertion of q into DT(S), (2) find the natural neighbors {p1, . . . , pk′} of q along with the weights W = {w1, . . . , wk′}, and (3) report the results according to the weights in descending order. Details are provided in the following sections.

4.3.1 Offline Generation of Delaunay Triangulation
In the preprocessing stage, DT(S) is calculated for all the data points in S. Although the weights are calculated with the overlapping areas of these cells, and natural neighbors are defined (and are also easy to understand) in terms of Voronoi cells, performing operations on DT is computationally more efficient.

There are I/O- and memory-efficient methods for building Delaunay triangulations in 2D and 3D [1], [17]. These methods can generate DTs for billions of points. There are also publicly available implementations for 2D [26] and for higher dimensions [3]. We use the Qhull implementation [3] to generate DT(S).

4.3.2 Step 1 - Flip-based Incremental Insertion
We simulate the insertion of q into DT(S) with a flip-based insertion algorithm. It is easy to determine the simplex of DT(S) containing q in linear time by inspecting all triangles.

Let τ be the simplex in DT(S) containing q. All vertices of τ automatically become natural neighbors of q once it is inserted into DT(S). Then, the necessary edge flips are carried out until no further edge needs to be flipped, revealing even more natural neighbors of q.

The number of flips needed to insert q is proportional to the degree of q (the number of incident edges) after its insertion. The average degree of a vertex in a 2D DT is six. This number is proportional to the number of dimensions.

4.3.3 Step 2 - Find NatNs and weights

Vertices {p1, . . . , pk′} adjacent to q in DT(S ∪ {q}) are the natural neighbors of q. The volume of a d-dimensional Voronoi cell is computed by decomposing it into d-simplices and summing their volumes. The volume of a d-simplex τ with vertices (v0, . . . , vd) [19] is computed as

Vol(τ) = (1/d!) |det( v1 − v0  · · ·  vd − v0 )|,    (7)

where each column of the d × d determinant is the difference between the vectors representing two vertices. The weights W are then calculated with Eq. 6.
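A direct transcription of Eq. 7 (our own sketch with numpy):

import numpy as np
from math import factorial

def simplex_volume(vertices):
    # Volume of a d-simplex from its d+1 vertices (Eq. 7).
    V = np.asarray(vertices, dtype=float)
    d = V.shape[1]
    edges = (V[1:] - V[0]).T           # d x d matrix whose columns are edge vectors
    return abs(np.linalg.det(edges)) / factorial(d)

# e.g., simplex_volume([[0, 0], [1, 0], [0, 1]]) == 0.5 (a right triangle)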

4.3.4 Step 3 - Report for Diverse kNN

The NatN-based method naturally returns k′ results as an answer to the query point q. The result set R is ranked according to the weights of each neighbor. If k′ ≥ k, we report the top-k ranked results; otherwise, all k′ points are returned.

An overview of the method is given in Algorithm 1. Note that if the number of natural neighbors is greater than k, the points with smaller weights are eliminated; otherwise, the method may return fewer than k results.

Algorithm 1 NatNDiversitySearch

 1: procedure NATNDIVERSITYSEARCH(q, k, DT)
 2:   DT′ ← INSERT(DT, q)
 3:   W ← {}
 4:   for each point pi in adj[q] do
 5:     wi ← CALCULATEWEIGHT(DT′, pi, q)
 6:     W ← W ∪ {wi}
 7:   end for
 8:   W′ ← SORT(W)
 9:   if |adj[q]| > k then
10:     W′ ← W′[1 : k]
11:   end if
12:   return W′.i
13: end procedure

4.4 Limitations

The drawback of using natural neighbors in diverse k-nearest neighbor search is that there is always a fixed number of natural neighbors of a point, and this number is proportional to the number of dimensions. This can be seen as an advantage as well, since the parameter k is inherently captured by the process, not specified by the user. However, for browsing purposes, one cannot restrict the search to only the natural neighbor results, as the user may demand more search results. The search needs to continue incrementally through the neighbors of neighbors with Voronoi cells. Without any assumptions on the distribution of the data, the average degree of a vertex in a 2D DT is 6 [19]. In this case, diverse k-nearest neighbor search with the NatN-based method in 2D space may not return a result set with k items. As a result, the method is forced to investigate the neighbors of neighbors whose Voronoi cells were not modified by the insertion of q.

For higher dimensional spaces, the average degree of a point in DT grows quickly with d (approximately d^d) [10]. The problem of selecting a subset of elements of this set to obtain a diverse set of k items cannot be trivially solved with a NatN-based approach. Because of these disadvantages, the NatN-based method is more appropriate for low-dimensional data and small k values.

4.5 Gabriel Neighbor-based Method

In high dimensions, DT is intractable in terms of both construction complexity, O(n^⌈d/2⌉), and browsing efficiency. For better scalability and browsing capability in high dimensional spaces, we use Gabriel graphs (GG) instead of DT.

The Gabriel graph contains those edges of DT that intersect their Voronoi faces [23]. Hence, GG can be constructed in O(n log n) time by first constructing DT and VD, and then adding each edge of DT to GG if it intersects its Voronoi face. Without DT and VD, GG can always be constructed by brute force in O(n³) time.

The advantage of working with GG is that both the nearest neighbor graph (NNG) and the minimum spanning tree (MST) are subgraphs of it; therefore, GG still captures the proximity relationships among data points. Furthermore, GG is reasonably sparse and simple: for planar graphs, |GG(S)| ≤ 3n − 8 [23]. For this reason, the Gabriel graph has been found effective in constructing power-efficient topologies for wireless and sensor networks [20].

Our solution to the diverse k-nearest neighbor search problem is to browse GG layer by layer, starting from the point pnn nearest to the query point q. Fig. 4 shows an example of GG layers connected with B-splines. For efficiency, the query point is not inserted into GG(S); rather, the spatial location of q is imitated by its nearest neighbor pnn.


Fig. 4. Incremental browsing of a Gabriel graph. (i) Delaunay triangulation of the points with Gabriel edges highlighted. (ii) For a query point q, diverse results are gathered layer by layer starting with pnn.

After finding pnn, the algorithm iteratively searches the n-degree neighbors of pnn in GG(S), starting with n = 1. GGDIVERSITYSEARCH stops when k or more points are included in R. Note that the resulting points are added layer by layer; therefore, there is a ranking among layers. However, they are not sorted within layers, since there is no concept similar to natural neighbor weights in Gabriel graphs. In addition, |R| ≥ k, meaning that the algorithm may return more than k results. The method is given in Algorithm 2.

Algorithm 2 GGDiversitySearch

 1: procedure GGDIVERSITYSEARCH(q, k, GG, S)
 2:   pnn ← NEARESTNEIGHBOR(S, q)
 3:   R ← {}
 4:   R′ ← {pnn}
 5:   while |R| < k do
 6:     R ← R ∪ R′
 7:     R′′ ← {}
 8:     for each point p in R′ do
 9:       R′′ ← R′′ ∪ adj[p]
10:     end for
11:     R′ ← (R′′ \ R)
12:   end while
13:   return R
14: end procedure

It is possible to return exactly k results by examining the last layer of points added into R. One method is to choose a subset of points from the last layer which optimizes the overall diversity of R. We use a similar approach in the experiments. The problem can be defined as follows:

Let q be a query point and k be the number of results for a diverse k-nearest neighbor search. Suppose R is the set of Gabriel neighbors of q, inserted layer by layer up to the layer lGG(k) − 1, satisfying |R| < k. The problem is to select a set of k − |R| objects from the layer lGG(k) so that the diversity of R′ is maximized. L refers to the last layer to be investigated. Algorithm 3 finds a local maximum for the diversity of the results.

Algorithm 3 GGOptimizeLastLayer

 1: procedure GGOPTIMIZELASTLAYER(q, k, R, L)
 2:   R′ ← R
 3:   L′ ← L
 4:   while |R′| < k do
 5:     o ← argmax_{obj ∈ L′} DIV(R′ ∪ obj)
 6:     R′ ← R′ ∪ {o}
 7:     L′ ← L′ \ {o}
 8:   end while
 9:   return R′
10: end procedure


5 INDEX-BASED DIVERSE BROWSING

Spatial databases mostly come with an index structure, such as the widely used R-tree [25]. A popular method called distance browsing [16] visits the k-nearest neighbors (k-NN) of a point in a spatial database that uses the R-tree index. We introduce diverse browsing for the diverse k-nearest neighbor search over an R-tree index.

The principal idea of diverse browsing is to use the distance browsing method with a pruning mechanism that omits non-diverse data points and minimum bounding rectangles (MBRs). A priority queue is maintained with respect to the mindivdist measure, which is a combination of the mindist [16] and the angular similarity (Eq. 1) of the object obj (either a data point or an R-tree index node). In each iteration, the closest object is investigated (see Fig. 5).

In the following sections, the mindivdist measure and pruning mechanisms (5.1), the correctness of the algorithm (5.2), and their integration into our incremental diverse browsing method (5.3) are explained. Note that the algorithm is given in two parts (Algorithms 4 and 5).

5.1 MinDivDist measure and Pruning

When a point p is added to the result set R, we draw an imaginary sector ⊛qp from the query point q in the direction of p with central angle θ⊛ = 2 × θs and radius r⊛ = rs × |→qp|. Every point in this sector will eventually be pruned. By default, rs = 1 + λ and θs = 2π/(k + ǫ), unless specified otherwise.

We use the term mindivdist as an alternative to the mindist of distance browsing. The mindivdist of the points and MBRs in the priority queue is calculated according to the angular similarity and distance with respect to the elements in R (see Algorithm 4). Note that points with mindivdist closer to 0 are more likely to be included in the result set.


Fig. 5. Example of diverse browsing. (i) Suppose B1 and B2 are two internal nodes of the R-tree index. (ii) When the closest MBR is investigated and the closest point p1 is added to the result set, p2 is pruned because of its high angular and distance similarity to p1. mindivdist[p3] also increases here due to its high angular similarity to p1. (iii) Next, p4 is added to the result set and causes B2 to be pruned, since none of the items in B2 can be diverse.

Algorithm 4 MinDivDist

 1: procedure MINDIVDIST(q, obj, R, λ)
 2:   θs ← 2π/(k + ǫ)
 3:   rs ← 1 + λ
 4:   δ ← mindist(q, obj)
 5:   if obj is a point then
 6:     for each point p in R do
 7:       s[p] ← simθ(obj, q, p)
 8:       if s[p] > 0 and δ < |→qp| × rs then
 9:         return PRUNE(obj)
10:       end if
11:     end for
12:     mindivdist ← λ × max(s) + (1 − λ) × δ
13:   else if obj is a rectangle then
14:     for each point p in R do
15:       s ← min_{y ∈ obj.corners} (simθ(y, q, p))
16:       if ∀y ∈ obj.corners : y in ⊛qp then
17:         return PRUNE(obj)
18:       else if ∃y ∈ obj.corners : y in ⊛qp then
19:         δ ← min(|→qp| × rs, |→qy|)
20:       end if
21:       mdds[p] ← λ × s + (1 − λ) × δ
22:     end for
23:     mindivdist ← max(mdds)
24:   end if
25:   return mindivdist
26: end procedure

Points without enough angular diversity and distance from another point in R are pruned. Similarly, the algorithm prunes an MBR only if none of its corners is diverse enough to be in the result set. An advantage here is that, with a small modification, all the pruned data can be displayed as similar results of the corresponding resulting point, since we keep the information about why each point was pruned.
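The point case of Algorithm 4 can be read as follows (a minimal sketch of our own, reusing the sim_theta sketch from Section 3; eps = 0.5 is an arbitrary illustrative value for ǫ):

import numpy as np

def mindivdist_point(q, obj, R, lam, k, eps=0.5):
    # Lines 5-12 of Algorithm 4: returns None if obj falls in the pruning
    # sector of some result point, otherwise its mindivdist value.
    q, obj = np.asarray(q, dtype=float), np.asarray(obj, dtype=float)
    theta_s = 2.0 * np.pi / (k + eps)
    r_s = 1.0 + lam
    delta = np.linalg.norm(obj - q)                 # mindist of a point
    sims = []
    for p in R:
        p = np.asarray(p, dtype=float)
        s = sim_theta(obj, q, p, theta_s)
        if s > 0.0 and delta < np.linalg.norm(p - q) * r_s:
            return None                             # inside the pruning sector of p
        sims.append(s)
    return lam * max(sims, default=0.0) + (1.0 - lam) * delta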

5.2 Correctness of the Algorithm

The efficiency of the proposed method comes from the diverse browsing of the R-tree structure. As in incremental nearest neighbor search algorithms and distance browsing [16], a min-priority queue (PQ) is maintained after each operation. However, instead of the mindist metric, we use the result of the MINDIVDIST function as the value of each object in PQ. MINDIVDIST gives a non-negative value for a point or an MBR depending on its angular diversity and its distance to the query point, based on both R and λ. In each iteration, the object with the lowest mindivdist (top of PQ) is investigated.

As the mindivdist of each object in PQ depends on the current state of R, some mindivdist values become obsolete after another point is inserted into R. Instead of updating all the objects in PQ (which would be inefficient), we argue for updating only the top of PQ with a timestamp-based approach until an up-to-date object is acquired. The current timestamp (cts) is incremented every time a point is included in the result set. The proposed timestamp-based update of mindivdist in PQ is justified by Lemma 5.1 and Theorem 5.2.

Lemma 5.1: An update operation on an object that is on top of PQ and has an earlier timestamp can either increase the mindivdist of the object or leave it unchanged.

Proof: An object obj is updated only when ts[obj] < cts. Since cts increases when a new item is added to R, there are exactly (cts − ts[obj]) new items in the result set R+ compared to the time when mindivdist[obj] was calculated.

Suppose that the new mindivdist of obj at the current timestamp is mindivdist+[obj]. If obj is a point, the three possible outcomes of the update are:

1) ∃p ∈ R+ such that obj resides in ⊛qp ⇒ PRUNE(obj);
2) ∃p ∈ R+, simθ(obj, q, p) > max_{r∈R} simθ(obj, q, r) ⇒ mindivdist+[obj] > mindivdist[obj];
3) ∀p ∈ R+, simθ(obj, q, p) ≤ max_{r∈R} simθ(obj, q, r) ⇒ mindivdist+[obj] = mindivdist[obj].

On the other hand, if obj is a leaf or internal node, the mindivdist depends on the corners of the MBR:

1) ∃p ∈ R+ such that ∀y ∈ obj.corners, y resides in ⊛qp ⇒ PRUNE(obj);
2) ∃p ∈ R+, ∃y ∈ obj.corners, simθ(y, q, p) > max_{r∈R} simθ(obj, q, r) ⇒ mindivdist+[obj] > mindivdist[obj];
3) ∀p ∈ R+, ∃y ∈ obj.corners with y in ⊛qp, simθ(y, q, p) ≤ max_{r∈R} simθ(obj, q, r) ⇒ mindivdist+[obj] = mindivdist[obj].

We have shown that some updated objects are pruned; if an object is not pruned, its mindivdist either increases or stays the same.

Theorem 5.2: An object on top of PQ with the current timestamp provides a lower bound for the mindivdist of all objects in PQ, even if there are other objects in PQ with earlier timestamps.

Proof: Suppose obj is on top of PQ with the current timestamp, and let obj′ be another object in PQ with an older timestamp (ts[obj′] < cts). Following Lemma 5.1, even if the mindivdist of obj′ is updated, it is either pruned or mindivdist+[obj′] ≥ mindivdist[obj′]. Since obj′ is not on top of PQ, mindivdist[obj′] ≥ mindivdist[obj]; hence mindivdist+[obj′] ≥ mindivdist[obj]. Therefore, mindivdist[obj] is still a lower bound for the mindivdist of the objects in PQ.

5.3 Incremental Diverse Browsing

After extending the distance browsing feature of R-trees with diverse choices, incremental browsing of an R-tree gives diverse results depending on λ. Details of the method are given in Algorithm 5, excluding certain boundary conditions, i.e., when λ = 1 or PQ becomes empty. Note that q is the query point in a d-dimensional search space where the other points are already partitioned with an R-tree index.

Algorithm 5 DiverseKNNSearch

 1: procedure DIVERSEKNNSEARCH(q, k, λ, R-tree)
 2:   R ← {}
 3:   cts ← 0
 4:   PQ ← MINPRIORITYQUEUE()
 5:   ENQUEUE(PQ, 〈R-tree.root, cts, 0〉)
 6:   while not ISEMPTY(PQ) and |R| < k do
 7:     while ts[TOP(PQ)] < cts do
 8:       e ← DEQUEUE(PQ)
 9:       ENQUEUE(PQ, 〈e, cts, MINDIVDIST(e)〉)
10:     end while
11:     e ← DEQUEUE(PQ)
12:     if e is a point then
13:       R ← R ∪ {e}
14:       cts ← cts + 1
15:     else
16:       for each obj in node e do
17:         ENQUEUE(PQ, 〈obj, cts, MINDIVDIST(obj)〉)
18:       end for
19:     end if
20:   end while
21:   return R
22: end procedure

The proposed algorithm has the following properties:

Property 5.3: The diverse k-nearest neighbor results obtained by the diverse browsing method always contain pnn, the nearest neighbor of the query point q.

Proof: Initially R = Ø; therefore the mindivdist of every object obj ∈ PQ is calculated solely depending on its mindist to the query point (see Algorithm 4). The algorithm's behavior is similar to that of the distance browsing method at this stage. When the first point p is dequeued from PQ, it is included in R; p is also the point with the minimum distance to q. Hence, pnn ∈ R.

Property 5.4: Diverse browsing can capture the set Pc which comprises k points uniformly distributed around q at the same distance as pnn.

Proof: The method selects the points in Pc without pruning any of them. We guarantee that the points in Pc are retrieved without any assumptions about their order. In addition, min ∠piqpj = 2π/k for pi, pj ∈ Pc. Therefore 2π/k ≥ 2π/(k + ǫ) = θs; hence, no point in Pc is pruned.

Dissimilarity-based diversification methods (e.g., [18]) do not support this property, since they are likely to prune some points in Pc, especially when k > 6.


6 EXPERIMENTS

We define the evaluation measures in Section 6.1. The real and synthetic datasets used in the experiments are summarized in Section 6.2. The evaluation and discussion of the methods on spatial and high-dimensional datasets are given in Sections 6.3 and 6.4.

6.1 Evaluation Measures

In order to measure how well the methods capture the relevancy and the spatial distribution around the query point, we use the evaluation measures given in Definitions 6.1, 6.2, and 6.3.

Definition 6.1: Angular diversity. Given a query point q and a set of results R, angular diversity measures the spatial diversity around the query point:

DIV(q, R) = 1 − ‖ Σ_{pi∈R} (→qpi / ‖→qpi‖) ‖ / |R|.    (8)

The intuition behind this measure is that each of the points in R tries to influence the overall result in the direction of the point itself. If the result set is fully diverse, the sum of these "forces" will be closer to the center; therefore, the average influence on the query point gives an idea of how diverse the result set is. This measure can easily be applied to higher dimensions since it consists of simple vector additions and a normalization. See Fig. 6 for the angular diversity of the points in Fig. 3.
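Eq. 8 amounts to a vector sum of unit directions; a minimal sketch (our own):

import numpy as np

def angular_diversity(q, R):
    # DIV(q, R) of Eq. 8: 1 means the directions to the results cancel out.
    q = np.asarray(q, dtype=float)
    units = []
    for p in R:
        v = np.asarray(p, dtype=float) - q
        units.append(v / np.linalg.norm(v))
    return 1.0 - np.linalg.norm(np.sum(units, axis=0)) / len(R)

# e.g., three points at mutual angles of 2*pi/3 around q give DIV = 1 (Fig. 6-v)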


Fig. 6. Angular diversity measure. (i) Suppose a diverse 6-nearest neighbor search for q retrieves {p1, . . . , p6}. (ii) The angular diversity of the result set is calculated with the sum of the vectors →qpi on a unit circle/sphere. (iii) When the result set is diverse, the average of these vectors will be close to the center; otherwise, (iv) the average will be close to the circle. (v) For example, the maximum angular diversity (DIV = 1) in 2D for k = 3 is achieved with points that have a pairwise angle of 2π/3.

However, the angular diversity measure alone is not adequate to evaluate the diversity of a result set, since an algorithm can always return a better set of items more distant than the nearest neighbors if the distance factor is omitted.

Definition 6.2: Relevance. Given a query point q, its k-nearest neighbors K, and a set of results R, the relevance measure calculates the normalized average distance of the points in R with respect to the k-nearest neighbors:

REL(q, K, R) = ( Σ_{pj∈K} ‖→qpj‖ ) / ( Σ_{pi∈R} ‖→qpi‖ ).    (9)

The result of REL is in the interval [0, 1]. The magnitude of each vector is calculated in the Euclidean space, although any metric distance measure can be applied to the function as long as it is consistent with the one used in the calculation of k-NN.

Definition 6.3: Diverse-relevance. Given a query point q, its k-nearest neighbors K, a set of results R, and a parameter λ, diverse-relevance measures both the relevancy and the angular diversity of the results:

DIVREL(q, K, R) = λ × DIV(q, R) + (1 − λ) × REL(q, K, R).    (10)

DIVREL is based on the Maximal Marginal Relevance (MMR) method [4]. When λ = 0, the measure evaluates the relevance of the results excluding the diversity. The aim of our methods is to maximize the diverse-relevance of a result set depending on the value of λ.
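For completeness, sketches of Eqs. 9 and 10 (our own, reusing the angular_diversity sketch above):

import numpy as np

def relevance(q, K, R):
    # REL(q, K, R) of Eq. 9: ratio of k-NN distance mass to result distance mass.
    q = np.asarray(q, dtype=float)
    knn_sum = sum(np.linalg.norm(np.asarray(p, float) - q) for p in K)
    res_sum = sum(np.linalg.norm(np.asarray(p, float) - q) for p in R)
    return knn_sum / res_sum

def diverse_relevance(q, K, R, lam):
    # DIVREL of Eq. 10, combining angular diversity (Eq. 8) and relevance (Eq. 9).
    return lam * angular_diversity(q, R) + (1 - lam) * relevance(q, K, R)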

TABLE 2
Properties of datasets

Dataset  d   Card.  Description
ROAD     2   64K    road crossings, Montgomery County, MD
NE       2   124K   postal addresses, northeast of the US
CAL      2   105K   points of interest in California
USPOI    2   5.8M   POIs in the US, Factual data
NORM     6   500K   normal distribution (µ = 0, σ² = 1)
UNI      6   500K   uniform distribution
SKEW     6   500K   skew normal dist. (µ = 0, σ² = 1, α = 1)
HDIM     10  1M     uniform distribution

6.2 Datasets

We conduct our experiments on both real datasets of POIs in 2D and synthetic high-dimensional datasets to evaluate the efficiency and effectiveness of the proposed methods. The properties of our datasets are summarized in Table 2.

Real datasets. We use four real-life datasets in our experiments. ROAD contains the latitudes and longitudes of road crossings in Montgomery County, MD [24]. 10% of the ROAD dataset is randomly selected as query points. The North East (NE) dataset contains postal addresses from three metropolitan areas (New York, Philadelphia, and Boston).1 The CAL dataset consists of points of interest in California.2 To avoid querying outside of the data region,

500 points from both datasets are randomly selected as queries.

Finally, the USPOI dataset is extracted from more than 13M points-of-interest in the US, gathered by Factual.3 Among those, 8.8M have location information and 5.8M are unique. We have randomly chosen 1000 POIs as queries.

Synthetic datasets. We generate four synthetic high-dimensional datasets. NORM is a 6D dataset generated with a normal distribution (µ = 0, σ² = 1). UNI and HDIM are 6D and 10D datasets generated with a uniform distribution. SKEW is generated with a skew normal distribution (µ = 0, σ² = 1, α = 1). For each of these synthetic datasets, 200 query points are produced with the original distribution and parameters.

1. http://www.rtreeportal.org
2. http://www.cs.fsu.edu/~lifeifei/SpatialDataset.htm

Fig. 7. Comparison of the algorithms on ROAD (a,e,i), NE (b,f,j), CAL (c,g,k), and USPOI (d,h,l) datasets. The aim of the methods is to maximize the diverse-relevance (DIVREL) of the results. The angular diversity (DIV) of the geometric approaches is stable because the natural and Gabriel neighbors of a point are fixed.

6.3 Evaluations

Real datasets. We compare the results of the geometric approaches (NatN-based and GG-based) with diverse browsing on the R-tree and with KNDN-IG and KNDN-BG [18] on the ROAD, NE, CAL, and USPOI datasets. In order to be consistent, the results of the NatN-based method are obtained first. Depending on the number of natural neighbors k′ of each query, we run the other algorithms with k = k′. R-trees are built with a page size of 512 bytes (which holds 64 data points) and a fill factor of 0.5. The immediate greedy (IG) and buffered greedy (BG) approaches of the KNDN method are adopted for the spatial domain: the threshold parameter MinDiv is set to 0.1, which is also modified according to the value of λ. Note that when λ = 0, both KNDN and diverse browsing on R-trees reduce to k-NN.

3. http://www.factual.com

Similar results for all real datasets (see Fig. 7) indicate that the geometric methods naturally produce diverse results in terms of the DIV measure. Since the natural and Gabriel neighbors of a point are fixed in a dataset, these methods are the most efficient ones, but only if (1) the purpose of the search is angular diversity, and (2) the spatial database is stable. The GG-based method has an advantage over the NatN-based method in that it enables incremental diverse browsing.

Fig. 8. Comparison of the algorithms on NORM (a,e,i), UNI (b,f,j), SKEW (c,g,k), and HDIM (d,h,l) datasets.

In most cases a spatial index is used to represent certain points of interest, and those databases are updated constantly: roads are built, new restaurants are opened, old buildings are replaced by new ones. Because the geometric methods require a preprocessing stage to build the entire DT or GG, they are not appropriate for dynamic databases. In addition, users may want to adjust how diverse vs. relevant the search results should be. Index-based methods provide such flexibility. We will discuss the advantages and disadvantages of each method in Section 6.4.

If we focus on the index-based search methods, both diverse browsing and KNDN start with DIVREL ≈ 1 when λ = 0, and they adjust their results as the user asks for more diversity. Diverse browsing performs better at producing a more diverse result set for the spatial domain compared to KNDN-IG and KNDN-BG (about 20% improvement for λ = 1, 10% improvement overall). The diverse browsing method also gives a highly diverse-relevant set of results (see Table 3) as the user seeks diversity in the results (about 15% to 25% improvement).

Synthetic datasets. The purpose of experimenting in multi-dimensional space is to show the efficiency of each algorithm and the diverse-relevance of the results. Fig. 8 shows the comparison of the diverse browsing, GG-based, and KNDN methods on the synthetic 6D datasets. The NatN-based method is omitted because it does not scale to high dimensions due to its high average degree (see Section 4.4). In order to measure the efficiency of each index-based method, we record the page accesses when λ = 1, for which the algorithms investigate the highest number of internal nodes.

For queries where relevance is preferred over diversity (i.e., λ < 0.5), diverse browsing and KNDN-BG perform better than the GG-based method since they are both based on the distance browsing feature of R-trees. On the other hand, the Gabriel graph-based method is extremely powerful for diversity-dominant queries (i.e., λ ≥ 0.5) in terms of both computational efficiency and the diverse-relevance of the results. After retrieving the nearest neighbor pnn in the database, it only takes a number of page accesses equal to the number of layers lGG(k) required to obtain k Gabriel neighbors. From our observations, lGG(k) ≤ 2 for k = O(d²). The GG-based method also improves the diverse-relevance of the results by up to 25% when λ ≥ 0.5.

The most challenging dataset we experiment on is the HDIM dataset. Because generating the GG efficiently in high dimensions is not the concern of this paper, we decided to extract only the necessary Gabriel edges for this experiment. Fig. 8-(d,h) suggests that the GG-based method is highly effective for diversity-intended queries, where the index-based methods return similar results for different λ values.


[Fig. 9 (plots omitted): panels (a,b) sweep θs from 0° to 120° on ROAD with k = #NatN and rs = 1+λ, plotting DIV and disk accesses; panels (c,d) sweep rs from 1 to 2 with θs = 2π/(k+ǫ), plotting DIV and disk accesses; curves are shown for λ = 0.5 to 1.0.]

Fig. 9. Effects of the parameters θs and rs on the ROAD dataset. (a) Maximum diversity is achieved when λ = 1 and θs is set to 2π/(k+ǫ). (b) θs also increases the number of disk accesses. (c) Diversity increases with rs for small λ values. (d) rs = 1+λ is shown to be an optimal point between diversity and efficiency.

This is because the Euclidean distance in higher dimensions may not accurately measure similarity. Still, the diverse browsing method produces the same results as KNDN-IG and KNDN-BG for λ ≤ 0.5 with 36% fewer page accesses. For λ > 0.5, diverse browsing is both more effective and more efficient.

Diverse browsing makes fewer disk accesses because it successfully prunes the index nodes that are not diverse with respect to the results found so far; hence, it avoids unnecessary disk accesses. In contrast, both KNDN methods iteratively investigate all nearest neighbors to find the next diverse element.
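As a rough illustration of why the pruning saves accesses, the following object-level sketch consumes candidates in distance order and keeps only those that are angularly at least θs away from every result chosen so far. Node-level (MBR) pruning and the mindivdist ordering of the priority queue are deliberately omitted, so this is a simplification rather than the actual algorithm.

```python
import heapq
import math
import numpy as np

def angle_at(q, a, b):
    """Angle that results a and b span at the query point q."""
    u, v = np.asarray(a) - q, np.asarray(b) - q
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return math.acos(max(-1.0, min(1.0, cos)))

def diverse_browse_points(points, q, k, theta_s):
    """Simplified, object-level sketch of diverse browsing: points are popped in
    increasing distance order (as distance browsing would deliver them) and a
    candidate is kept only if it is at least theta_s away, angularly, from every
    result picked so far."""
    q = np.asarray(q, dtype=float)
    heap = [(float(np.linalg.norm(np.asarray(p) - q)), i, p)
            for i, p in enumerate(points)]
    heapq.heapify(heap)
    picked = []
    while heap and len(picked) < k:
        _, _, p = heapq.heappop(heap)
        if all(angle_at(q, p, r) >= theta_s for r in picked):
            picked.append(p)
    return picked
```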

Effects of the parameters. So far we have investigated the effect of the parameter λ on the diversity of the results and on disk accesses. However, two other parameters can affect the results and may need to be specified manually: the pruning angle θs and the radius rs.

We previously mentioned that the pruning angle parameter θs is empirically set to 2π/(k + ǫ). To see how different θs values change the diversity and efficiency of diverse browsing, we experiment on the ROAD dataset for λ = {0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. Fig. 9-a shows that the maximum DIV is achieved around θs = 60° ∼ 2π/(k + ǫ) for k = 6 in 2D space. On the other hand, increasing this parameter causes more pruning; as a result, the algorithm has to investigate more MBRs. To keep the page accesses low, a smaller θs can be preferred (see Fig. 9-b).
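As a quick sanity check of this empirical setting (assuming a small ǫ, here arbitrarily 0.3):

```python
import math

k, eps = 6, 0.3                    # eps is the small constant from the text; 0.3 is an arbitrary example value
theta_s = 2 * math.pi / (k + eps)  # ≈ 0.997 rad
print(math.degrees(theta_s))       # ≈ 57.1°, close to the 60° peak reported in Fig. 9-a
```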

The rs parameter directly affects the pruning area. The points included in the result set become less relevant, although the diversity may increase in some cases, especially when more diverse results are included instead of more relevant ones for small λ values (see Fig. 9-c). Because the algorithm prunes more MBRs with increasing rs, it has to access more pages to find the result set. The results show that our choice of rs = 1 + λ is an optimal point in the diversity-efficiency tradeoff (see Fig. 9-d).
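One plausible way to read the combined effect of θs and rs = 1 + λ is as a cone-shaped pruning region around each selected result; the predicate below is our assumption for illustration only and may differ from the exact pruning region used by the algorithm.

```python
import math
import numpy as np

def is_pruned(q, cand, selected, lam, theta_s):
    """Assumed pruning test (not the paper's exact definition): a candidate is
    discarded if it lies within angle theta_s of an already selected result p
    and within r_s * dist(q, p) of the query, where r_s = 1 + lambda."""
    q = np.asarray(q, dtype=float)
    r_s = 1.0 + lam
    v = np.asarray(cand) - q
    for p in selected:
        u = np.asarray(p) - q
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
        angle = math.acos(max(-1.0, min(1.0, cos)))
        if angle < theta_s and np.linalg.norm(v) <= r_s * np.linalg.norm(u):
            return True
    return False
```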

6.4 Discussions

The proposed geometric and index-based diverse browsing methods have their own advantages in terms of preprocessing, querying, flexibility, and scalability. A summary of the proposed methods is given in Table 4.

TABLE 4
Comparison of the methods

Method            Results  Ordered        Prep.     Incr.
NatN              k′       by wi          build DT  NO
GG                ≥ k      by layer       build GG  YES
Diverse Browsing  k        by mindivdist  NO        YES

Preprocessing. The advantage of the index-based diverse browsing method is that it does not require any preprocessing and is ready to execute on any spatial database that uses data partitioning, such as an R-tree or R*-tree. On the other hand, geometric methods require building the DT or GG, which can be very costly depending on dimensionality and cardinality. As a result, we suggest index-based diverse browsing for dynamic databases, which are more likely to be based on an index that handles insert, delete, and update operations efficiently, and geometric methods for static databases, where the DT or GG does not need to be recalculated frequently.
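For reference, a brute-force construction of the Gabriel edges from a Delaunay triangulation (the GG is a subgraph of the DT, and an edge survives iff its diametral ball is empty) might look like the sketch below, assuming SciPy's Delaunay wrapper. This is only an illustration of the geometric preprocessing, not an efficient or high-dimensional construction.

```python
from itertools import combinations
import numpy as np
from scipy.spatial import Delaunay

def gabriel_edges(points):
    """Keep a Delaunay edge (a, b) iff no other point lies inside the ball
    whose diameter is the segment ab (brute-force check, for illustration)."""
    pts = np.asarray(points, dtype=float)
    tri = Delaunay(pts)
    edges = {tuple(sorted(e)) for s in tri.simplices for e in combinations(s, 2)}
    gabriel = []
    for a, b in edges:
        mid = (pts[a] + pts[b]) / 2.0
        rad2 = np.sum((pts[a] - pts[b]) ** 2) / 4.0
        d2 = np.sum((pts - mid) ** 2, axis=1)
        d2[[a, b]] = np.inf          # the endpoints themselves do not violate the test
        if np.all(d2 > rad2 - 1e-12):
            gabriel.append((a, b))
    return gabriel
```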

Querying. As mentioned before, the NatN-based method naturally returns a result set with a fixed number of points. If the user does not specify k and the purpose is to find a perfectly balanced diverse and relevant set of results (see the DIVREL graphs at λ ≈ 0.5), the NatN-based method is appropriate. However, diverse browsing is more suitable for diverse k-NN search, which requires exactly k results to be returned. If the query asks for at least k results, the GG-based method can be used as well.

Flexibility. We can investigate this property in two different ways. The first is the flexibility of setting the importance of diversity over relevance. Only the diverse browsing method adjusts itself for various λ values, since the natural and Gabriel neighbors are fixed in a graph. Also note that, among the index-based methods, only diverse browsing can provide diverse results when the user is willing to sacrifice relevancy (see Fig. 10). The second is the flexibility of incremental diverse browsing, where the user demands more search results.


TABLE 3
Diversity (DIV), diverse-relevance (DIVREL), and the number of disk accesses (DA) on the USPOI dataset for λ = {0.5, 0.6, 0.7, 0.8, 0.9, 1.0}.

       KNN                 Natural Neigh.  Gabriel Graph   KNDN-IG             KNDN-BG             Diverse Browsing
λ      DIV   DIVREL  DA    DIV    DIVREL   DIV    DIVREL   DIV   DIVREL  DA    DIV   DIVREL  DA    DIV   DIVREL  DA
0.5    0.524 0.762   9.4   0.763  0.686    0.739  0.723    0.504 0.519   10.9  0.558 0.557   10.4  0.745 0.630   8.5
0.6    0.524 0.715   9.4   0.763  0.701    0.739  0.726    0.497 0.489   11.5  0.564 0.542   10.7  0.788 0.659   8.5
0.7    0.524 0.667   9.4   0.763  0.717    0.739  0.730    0.499 0.479   12.2  0.561 0.534   11.0  0.834 0.707   8.5
0.8    0.524 0.620   9.4   0.763  0.732    0.739  0.733    0.494 0.475   12.9  0.558 0.535   11.3  0.867 0.763   8.5
0.9    0.524 0.572   9.4   0.763  0.748    0.739  0.736    0.481 0.471   13.6  0.563 0.548   11.6  0.881 0.820   8.5
1.0    0.524 0.529   9.4   0.763  0.762    0.739  0.739    0.484 0.482   14.3  0.567 0.566   11.9  0.885 0.878   8.5
AVG    0.524 0.644   9.4   0.763  0.724    0.739  0.731    0.493 0.486   12.6  0.562 0.547   11.1  0.833 0.742   8.5

[Fig. 10 (plot omitted): diversity versus relevance scatter comparing KNDN-IG, KNDN-BG, diverse browsing, Gabriel Graph, KNN, and Natural Neighbor.]

Fig. 10. Relevance vs. diversity for the USPOI dataset. Each point corresponds to the average values of the DIV and REL metrics for a run with a different λ (limλ→0 REL → 1 for all index-based methods).

Both the index-based diverse browsing and GG-based methods enable the retrieval of additional diverse results.

Scalability. For multi-dimensional spaces, the NatN-based method is intractable (see Section 4.4). Since data partitioning methods are shown to be inefficient for high-dimensional data, the GG-based method can be preferred over index-based diverse browsing. Experiments (see Fig. 8-(g,h)) show that the GG-based method is in fact very efficient (∼10K vs. lGG(k) page accesses) and effective (0.6 vs. 0.85 DIVREL for λ = 1) on multi-dimensional datasets.

7 CONCLUSIONS

In this work, we investigate the diversification problem in multi-dimensional nearest neighbor search. Because diverse k-nearest neighbor search is conceptually similar to the idea of natural neighbors, we define diversity by making an analogy with the concept of natural neighbors and propose a natural neighbor-based method. Observing the limitations of the NatN-based method in higher dimensional spaces, we present a Gabriel graph-based method that scales well with dimensionality. We also introduce an index-based diverse browsing method, which maintains a priority queue with the mindivdist of the objects, depending on both relevancy and diversity, and prunes non-diverse items and nodes in order to efficiently obtain the diverse nearest neighbors. To evaluate the diversity of a given result set with respect to a query point, we present a measure that captures both relevancy and angular diversity. We experiment on spatial and multi-dimensional, real and synthetic datasets to observe the efficiency and effectiveness of the proposed methods, and compare them with index-based techniques from the literature.

The results suggest that geometric approaches are suitable for static data, and index-based diverse browsing for dynamic databases. Our index-based diverse browsing method performed more efficiently than k-NN search with distance browsing on R-trees (in terms of the number of disk accesses) and more effectively than other methods found in the literature (in terms of MMR). In addition, the Gabriel graph-based method performed well in high dimensions; this can be investigated further and applied to other research fields where search in multi-dimensional space is required. Since there are numerous application areas of diverse k-nearest neighbor search, we plan to extend our method to work with different types of data and distance metrics.

ACKNOWLEDGMENTS

We would like to thank Factual for sharing the USPOI data, and the reviewers for their constructive comments. This research has been supported in part by NSF IIS-0546713.


Onur Kucuktunc received the BS and MSc degrees from the Department of Computer Engineering, Bilkent University, Turkey, in 2007 and 2009, respectively. During his masters, he worked on content-based video copy detection and visual similarity-based tag suggestion, and participated in TRECVID'08. He is currently pursuing the PhD degree in the Computer Science and Engineering department at The Ohio State University (OSU). His research interests include similarity and diversity search, sentiment analysis and opinion retrieval, and biclustering.

Hakan Ferhatosmanoglu is an Associate Professor at Bilkent University, Turkey. He was with The Ohio State University (OSU) before joining Bilkent. His research interests include high-performance management and mining of multi-dimensional data, scientific databases, and bioinformatics. He received Career Awards from the US Department of Energy and the National Science Foundation, as well as IBM Faculty, OSU Lumley Research, and OSU Large Interdisciplinary Research Awards. He received his PhD in Computer Science from the University of California, Santa Barbara in 2001.