

Fast k-Nearest Neighbour Search via Prioritized DCI

Ke Li 1 Jitendra Malik 1

Abstract

Most exact methods for k-nearest neighbour search suffer from the curse of dimensionality; that is, their query times exhibit exponential dependence on either the ambient or the intrinsic dimensionality. Dynamic Continuous Indexing (DCI) (Li & Malik, 2016) offers a promising way of circumventing the curse and successfully reduces the dependence of query time on intrinsic dimensionality from exponential to sublinear. In this paper, we propose a variant of DCI, which we call Prioritized DCI, and show a remarkable improvement in the dependence of query time on intrinsic dimensionality. In particular, a linear increase in intrinsic dimensionality, or equivalently, an exponential increase in the number of points near a query, can be mostly counteracted with just a linear increase in space. We also demonstrate empirically that Prioritized DCI significantly outperforms prior methods. In particular, relative to Locality-Sensitive Hashing (LSH), Prioritized DCI reduces the number of distance evaluations by a factor of 14 to 116 and the memory consumption by a factor of 21.

1. Introduction

The method of k-nearest neighbours is a fundamental building block of many machine learning algorithms and also has broad applications beyond artificial intelligence, including in statistics, bioinformatics and database systems, e.g. (Biau et al., 2011; Behnam et al., 2013; Eldawy & Mokbel, 2015). Consequently, since the problem of nearest neighbour search was first posed by Minsky & Papert (1969), it has for decades intrigued the artificial intelligence and theoretical computer science communities alike. Unfortunately, the myriad efforts at devising efficient algorithms have encountered a recurring obstacle: the curse of dimensionality, which describes the phenomenon of query time complexity depending exponentially on dimensionality. As a result, even on datasets with moderately high dimensionality, practitioners often have to resort to naïve exhaustive search.

¹ University of California, Berkeley, CA 94720, United States. Correspondence to: Ke Li <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Two notions of dimensionality are commonly considered. The more familiar notion, ambient dimensionality, refers to the dimensionality of the space data points are embedded in. On the other hand, intrinsic dimensionality¹ characterizes the intrinsic properties of the data and measures the rate at which the number of points inside a ball grows as a function of its radius. More precisely, for a dataset with intrinsic dimension d′, any ball of radius r contains at most O(r^{d′}) points. Intuitively, if the data points are uniformly distributed on a manifold, then the intrinsic dimensionality is roughly the dimensionality of the manifold.

Most existing methods suffer from some form of curse of dimensionality. Early methods like k-d trees (Bentley, 1975) and R-trees (Guttman, 1984) have query times that grow exponentially in ambient dimensionality. Later methods (Krauthgamer & Lee, 2004; Beygelzimer et al., 2006; Dasgupta & Freund, 2008) overcame the exponential dependence on ambient dimensionality, but have not been able to escape from an exponential dependence on intrinsic dimensionality. Indeed, since a linear increase in the intrinsic dimensionality results in an exponential increase in the number of points near a query, the problem seems fundamentally hard when intrinsic dimensionality is high.

Recently, Li & Malik (2016) proposed an approach known as Dynamic Continuous Indexing (DCI) that successfully reduces the dependence on intrinsic dimensionality from exponential to sublinear, thereby making high-dimensional nearest neighbour search more practical. The key observation is that the difficulties encountered by many existing methods, including k-d trees and Locality-Sensitive Hashing (LSH) (Indyk & Motwani, 1998), may arise from their reliance on space partitioning, which is a popular divide-and-conquer strategy. It works by partitioning the vector space into discrete cells and maintaining a data structure that keeps track of the points lying in each cell.

¹ The measure of intrinsic dimensionality used throughout this paper is the expansion dimension, also known as the KR-dimension, which is defined as log_2 c, where c is the expansion rate introduced in (Karger & Ruhl, 2002).

arXiv:1703.00440v2 [cs.LG] 20 Jul 2017


At query time, these methods simply look up the contents of the cell containing the query and possibly adjacent cells, and perform brute-force search over the points lying in these cells. While this works well in low-dimensional settings, would it work in high dimensions?

Several limitations of this approach in high-dimensional space are identified in (Li & Malik, 2016). First, because the volume of space grows exponentially in dimensionality, either the number or the volumes of cells must grow exponentially. Second, the discretization of the space essentially limits the "field of view" of the algorithm, as it is unaware of points that lie in adjacent cells. This is especially problematic when the query lies near a cell boundary, as there could be points in adjacent cells that are much closer to the query. Third, as dimensionality increases, surface area grows faster than volume; as a result, points are increasingly likely to lie near cell boundaries. Fourth, when the dataset exhibits varying density across space, choosing a good partitioning is non-trivial. Furthermore, once chosen, the partitioning is fixed and cannot adapt to changes in density arising from updates to the dataset.

In light of these observations, DCI is built on the idea of avoiding partitioning the vector space. Instead, it constructs a number of indices, each of which imposes an ordering of all data points. Each index is constructed so that two points with similar ranks in the associated ordering are nearby along a certain random direction. These indices are then combined to allow for retrieval of points that are close to the query along multiple random directions.

In this paper, we propose a variant of DCI, which assigns a priority to each index that is used to determine which index to process in the upcoming iteration. For this reason, we will refer to this algorithm as Prioritized DCI. This simple change results in a significant improvement in the dependence of query time on intrinsic dimensionality. Specifically, we show a remarkable result: a linear increase in intrinsic dimensionality, which could mean an exponential increase in the number of points near a query, can be mostly counteracted with a corresponding linear increase in the number of indices. In other words, Prioritized DCI can make a dataset with high intrinsic dimensionality seem almost as easy as a dataset with low intrinsic dimensionality, with just a linear increase in space. To our knowledge, no prior exact method has been able to cope with high intrinsic dimensionality; Prioritized DCI represents the first method that can do so.

We also demonstrate empirically that Prioritized DCI significantly outperforms prior methods. In particular, compared to LSH, it achieves a 14- to 116-fold reduction in the number of distance evaluations and a 21-fold reduction in memory usage.

2. Related Work

There is a vast literature on algorithms for nearest neighbour search. They can be divided into two categories: exact algorithms and approximate algorithms. Early exact algorithms are deterministic and store points in tree-based data structures. Examples include k-d trees (Bentley, 1975), R-trees (Guttman, 1984) and X-trees (Berchtold et al., 1996; 1998), which divide the vector space into a hierarchy of half-spaces, hyper-rectangles or Voronoi polygons and keep track of the points that lie in each cell. While their query times are logarithmic in the size of the dataset, they exhibit exponential dependence on the ambient dimensionality. A different method (Meiser, 1993) partitions the space by intersecting multiple hyperplanes. It effectively trades off space for time and achieves polynomial query time in ambient dimensionality at the cost of exponential space complexity in ambient dimensionality.

[Figure 1: curves for Karger & Ruhl / Navigating Net / Cover Tree / Rank Cover Tree, Spill Tree / RP Tree, DCI, and Prioritized DCI (proposed method).]

Figure 1. Visualization of the query time complexities of various exact algorithms as a function of the intrinsic dimensionality d′. Each curve represents an example from a class of similar query time complexities. Algorithms that fall into each particular class are shown next to the corresponding curve.

To avoid poor performance on worst-case configurations of the data, exact randomized algorithms have been proposed. Spill trees (Liu et al., 2004), RP trees (Dasgupta & Freund, 2008) and virtual spill trees (Dasgupta & Sinha, 2015) extend the ideas behind k-d trees by randomizing the orientations of the hyperplanes that partition the space into half-spaces at each node of the tree. While randomization enables them to avoid exponential dependence on the ambient dimensionality, their query times still scale exponentially in the intrinsic dimensionality. Whereas these methods rely on space partitioning, other algorithms (Orchard, 1991; Clarkson, 1999; Karger & Ruhl, 2002) have been proposed that utilize local search strategies. These methods start with a random point and, in each iteration, look in the neighbourhood of the current point to find a new point that is closer to the query than the original. Like space partitioning-based approaches, the query time of (Karger & Ruhl, 2002) scales exponentially in the intrinsic dimensionality. While the query times of (Orchard, 1991; Clarkson, 1999) do not exhibit such undesirable dependence, their space complexities are quadratic in the size of the dataset, making them impractical for large datasets.


Method                                Query Time Complexity

Exact algorithms:
  RP Tree                             O((d′ log d′)^{d′} + log n)
  Spill Tree                          O(d′^{d′} + log n)
  Karger & Ruhl (2002)                O(2^{3d′} log n)
  Navigating Net                      2^{O(d′)} log n
  Cover Tree                          O(2^{12d′} log n)
  Rank Cover Tree                     O(2^{O(d′ log h)} n^{2/h}) for h ≥ 3
  DCI                                 O(d max(log n, n^{1−1/d′}))
  Prioritized DCI (Proposed Method)   O(d max(log n, n^{1−m/d′}) + (m log m) max(log n, n^{1−1/d′})) for m ≥ 1

Approximate algorithms:
  k-d Tree                            O((1/ε)^d log n)
  BBD Tree                            O((6/ε)^d log n)
  LSH                                 ≈ O(d n^{1/(1+ε)^2})

Table 1. Query time complexities of various algorithms for 1-NN search. Ambient dimensionality, intrinsic dimensionality, dataset size and approximation ratio are denoted as d, d′, n and 1 + ε. A visualization of the growth of various time complexities as a function of the intrinsic dimensionality is shown in Figure 1.

A different class of algorithms performs search in a coarse-to-fine manner. Examples include navigating nets (Krauthgamer & Lee, 2004), cover trees (Beygelzimer et al., 2006) and rank cover trees (Houle & Nett, 2015), which maintain sets of subsampled data points at different levels of granularity and descend through the hierarchy of neighbourhoods of decreasing radii around the query. Unfortunately, the query times of these methods again scale exponentially in the intrinsic dimensionality.

Due to the difficulties of devising efficient algorithms for the exact version of the problem, there has been extensive work on approximate algorithms. Under the approximate setting, returning any point whose distance to the query is within a factor of 1 + ε of the distance between the query and the true nearest neighbour is acceptable. Many of the same strategies are employed by approximate algorithms. Methods based on tree-based space partitioning (Arya et al., 1998) and local search (Arya & Mount, 1993) have been developed; like many exact algorithms, their query times also scale exponentially in the ambient dimensionality. Locality-Sensitive Hashing (LSH) (Indyk & Motwani, 1998; Datar et al., 2004; Andoni & Indyk, 2006) partitions the space into regular cells, whose shapes are implicitly defined by the choice of the hash function. It achieves a query time of O(dn^ρ) using O(dn^{1+ρ}) space, where d is the ambient dimensionality, n is the dataset size and ρ ≈ 1/(1 + ε)^2 for large n in Euclidean space, though the dependence on intrinsic dimensionality is not made explicit. In practice, the performance of LSH degrades on datasets with large variations in density, due to the uneven distribution of points across cells. Consequently, various data-dependent hashing schemes have been proposed (Paulevé et al., 2010; Weiss et al., 2009; Andoni & Razenshteyn, 2015); unlike data-independent hashing schemes, however, they do not allow dynamic updates to the dataset. A related approach (Jégou et al., 2011) decomposes the space into mutually orthogonal axis-aligned subspaces and independently partitions each subspace. It has a query time linear in the dataset size and no known guarantee on the probability of correctness under the exact or approximate setting. A different approach (Anagnostopoulos et al., 2015) projects the data to a lower-dimensional space that approximately preserves approximate nearest neighbour relationships and applies other approximate algorithms like BBD trees (Arya et al., 1998) to the projected data. Its query time is also linear in ambient dimensionality and sublinear in the dataset size. Unlike LSH, it uses space linear in the dataset size, at the cost of longer query time than LSH. Unfortunately, its query time is exponential in intrinsic dimensionality.

Our work is most closely related to Dynamic Continuous Indexing (DCI) (Li & Malik, 2016), which is an exact randomized algorithm for Euclidean space whose query time is linear in ambient dimensionality, sublinear in dataset size and sublinear in intrinsic dimensionality, and which uses space linear in the dataset size. Rather than partitioning the vector space, it uses multiple global one-dimensional indices, each of which orders data points along a certain random direction, and combines these indices to find points that are near the query along multiple random directions. The proposed algorithm builds on the ideas introduced by DCI and achieves a significant improvement in the dependence on intrinsic dimensionality.

A summary of the query times of various prior algorithms and the proposed algorithm is presented in Table 1, and their growth as a function of intrinsic dimensionality is illustrated in Figure 1.

3. Prioritized DCI

DCI constructs a data structure consisting of multiple composite indices of data points, each of which in turn consists of a number of simple indices. Each simple index orders data points according to their projections along a particular random direction. Given a query, for every composite index, the algorithm finds points that are near the query in every constituent simple index, which are known as candidate points, and adds them to a set known as the candidate set. The true distances from the query to every candidate point are evaluated and the ones that are among the k closest to the query are returned.

More concretely, each simple index is associated with a random direction and stores the projections of every data point along that direction. They are implemented using standard data structures that maintain one-dimensional ordered sequences of elements, like self-balancing binary search trees (Bayer, 1972; Guibas & Sedgewick, 1978) or skip lists (Pugh, 1990). At query time, the algorithm projects the query along the projection directions associated with each simple index and finds the position where the query would have been inserted in each simple index, which takes logarithmic time. It then iterates over, or visits, data points in each simple index in the order of their distances to the query under projection, which takes constant time per iteration. As it iterates, it keeps track of how many times each data point has been visited across all simple indices of each composite index. If a data point has been visited in every constituent simple index, it is added to the candidate set and is said to have been retrieved from the composite index.

Algorithm 1 Data structure construction procedure

Require: A dataset D of n points p_1, ..., p_n, the number of simple indices m that constitute a composite index and the number of composite indices L
function CONSTRUCT(D, m, L)
    {u_jl}_{j∈[m],l∈[L]} ← mL random unit vectors in R^d
    {T_jl}_{j∈[m],l∈[L]} ← mL empty binary search trees or skip lists
    for j = 1 to m do
        for l = 1 to L do
            for i = 1 to n do
                p̄_ijl ← ⟨p_i, u_jl⟩
                Insert (p̄_ijl, i) into T_jl with p̄_ijl being the key and i being the value
            end for
        end for
    end for
    return {(T_jl, u_jl)}_{j∈[m],l∈[L]}
end function
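To make the construction concrete, here is a minimal Python sketch of Algorithm 1 under simplifying assumptions: sorted NumPy arrays of projections stand in for the self-balancing trees or skip lists (so this version supports queries but not efficient insertion or deletion), and the names construct_index, sorted_keys and order are our own rather than from any reference implementation.

```python
import numpy as np

def construct_index(data, m, L, rng=None):
    """Build m*L simple indices (cf. Algorithm 1): draw m*L random unit
    projection directions and, for each, store the point projections sorted.
    `data` is an (n, d) array of points stored row-wise."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = data.shape
    # m*L random unit vectors in R^d (Gaussian rows, normalized)
    directions = rng.standard_normal((m, L, d))
    directions /= np.linalg.norm(directions, axis=-1, keepdims=True)
    # Projection of every point along every direction: shape (m, L, n)
    projections = np.einsum('mld,nd->mln', directions, data)
    # Each simple index stores point ids sorted by projection value; the
    # sorted array plays the role of the balanced BST / skip list.
    order = np.argsort(projections, axis=-1)
    sorted_keys = np.take_along_axis(projections, order, axis=-1)
    return directions, sorted_keys, order
```

A sorted array gives the same logarithmic-time lookup of the query's insertion position via binary search; a balanced BST or skip list would additionally support the dynamic insertions and deletions listed in Table 2.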

DCI has a number of appealing properties compared to methods based on space partitioning. Because points are visited by rank rather than location in space, DCI performs well on datasets with large variations in data density. It naturally skips over sparse regions of the space and concentrates more on dense regions of the space. Since construction of the data structure does not depend on the dataset, the algorithm supports dynamic updates to the dataset, while being able to automatically adapt to changes in data density. Furthermore, because data points are represented in the indices as continuous values without being discretized, the granularity of discretization does not need to be chosen at construction time. Consequently, the same data structure can support queries at varying desired levels of accuracy, which allows a different speed-vs-accuracy trade-off to be made for each individual query.

Prioritized DCI differs from standard DCI in the order in which points from different simple indices are visited. In standard DCI, the algorithm cycles through all constituent simple indices of a composite index at regular intervals and visits exactly one point from each simple index in each pass. In Prioritized DCI, the algorithm assigns a priority to each constituent simple index; in each iteration, it visits the upcoming point from the simple index with the highest priority and updates that priority at the end of the iteration. The priority of a simple index is set to the negative absolute difference between the query projection and the next data point projection in the index.

Algorithm 2 k-nearest neighbour querying procedure

Require: Query point q in R^d, binary search trees/skip lists and their associated projection vectors {(T_jl, u_jl)}_{j∈[m],l∈[L]}, the number of points to retrieve k_0 and the number of points to visit k_1 in each composite index
function QUERY(q, {(T_jl, u_jl)}_{j,l}, k_0, k_1)
    C_l ← array of size n with entries initialized to 0 ∀l ∈ [L]
    q̄_jl ← ⟨q, u_jl⟩ ∀j ∈ [m], l ∈ [L]
    S_l ← ∅ ∀l ∈ [L]
    P_l ← empty priority queue ∀l ∈ [L]
    for l = 1 to L do
        for j = 1 to m do
            (p̄_jl^(1), h_jl^(1)) ← the node in T_jl whose key is the closest to q̄_jl
            Insert (p̄_jl^(1), h_jl^(1)) with priority −|p̄_jl^(1) − q̄_jl| into P_l
        end for
    end for
    for i′ = 1 to k_1 − 1 do
        for l = 1 to L do
            if |S_l| < k_0 then
                (p̄_jl^(i), h_jl^(i)) ← the node with the highest priority in P_l
                Remove (p̄_jl^(i), h_jl^(i)) from P_l and insert the node in T_jl whose key is the next closest to q̄_jl, denoted (p̄_jl^(i+1), h_jl^(i+1)), with priority −|p̄_jl^(i+1) − q̄_jl| into P_l
                C_l[h_jl^(i)] ← C_l[h_jl^(i)] + 1
                if C_l[h_jl^(i)] = m then
                    S_l ← S_l ∪ {h_jl^(i)}
                end if
            end if
        end for
    end for
    return the k points in ⋃_{l∈[L]} S_l that are the closest in Euclidean distance in R^d to q
end function
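The following Python sketch mirrors Algorithm 2 against the sorted-array indices produced by the construct_index sketch above. It is an illustrative reimplementation under our own naming (visit_order, query), not the authors' code: a generator walks each simple index outward from the query's insertion position, and a heap keyed on projected distance plays the role of the priority queue P_l (a min-heap over distances is equivalent to a max-heap over the negative distances used in the paper).

```python
import heapq
from bisect import bisect_left

import numpy as np

def visit_order(keys, ids, qp):
    """Yield (|key - qp|, point_id) in increasing order of projected distance,
    walking outward from the position where qp would be inserted."""
    lo = bisect_left(keys, qp) - 1
    hi = lo + 1
    while lo >= 0 or hi < len(keys):
        if hi >= len(keys) or (lo >= 0 and qp - keys[lo] <= keys[hi] - qp):
            yield qp - keys[lo], int(ids[lo])
            lo -= 1
        else:
            yield keys[hi] - qp, int(ids[hi])
            hi += 1

def query(q, data, directions, sorted_keys, order, k, k0, k1):
    """Prioritized DCI querying sketch (cf. Algorithm 2)."""
    m, L, n = sorted_keys.shape
    q_proj = np.einsum('mld,d->ml', directions, q)   # query projections
    candidates = set()
    for l in range(L):                               # each composite index
        counts = np.zeros(n, dtype=int)              # C_l
        retrieved = set()                            # S_l
        iters = [visit_order(sorted_keys[j, l], order[j, l], q_proj[j, l])
                 for j in range(m)]
        # P_l: the next unvisited point of each simple index, keyed by
        # projected distance (smallest distance = highest priority).
        heap = []
        for j in range(m):
            dist, i = next(iters[j])
            heap.append((dist, i, j))
        heapq.heapify(heap)
        for _ in range(k1 - 1):
            if len(retrieved) >= k0 or not heap:
                break
            dist, i, j = heapq.heappop(heap)         # highest-priority index j
            counts[i] += 1
            if counts[i] == m:                       # visited in all m simple indices
                retrieved.add(i)
            nxt = next(iters[j], None)               # advance index j and re-queue it
            if nxt is not None:
                heapq.heappush(heap, (nxt[0], nxt[1], j))
        candidates |= retrieved
    candidates = list(candidates)
    true_dists = np.linalg.norm(data[candidates] - q, axis=1)
    return [candidates[i] for i in np.argsort(true_dists)[:k]]
```

In this sketch, k0 and k1 play the same roles as in Algorithm 2 and would typically be chosen by cross-validation, as discussed below.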

Intuitively, this ensures data points are visited in the order of their distances to the query under projection. Because data points are only retrieved from a composite index when they have been visited in all constituent simple indices, data points are retrieved in the order of the maximum of their distances to the query along multiple projection directions. Since distance under projection forms a lower bound on the true distance (for a unit projection direction u, |⟨p − q, u⟩| ≤ ‖p − q‖_2 by Cauchy–Schwarz), the maximum projected distance approaches the true distance as the number of projection directions increases. Hence, in the limit as the number of simple indices approaches infinity, data points are retrieved in the ideal order, that is, the order of their true distances to the query.

Property       Complexity

Construction   O(m(dn + n log n))
Query          O(dk max(log(n/k), (n/k)^{1−m/d′}) + mk log m (max(log(n/k), (n/k)^{1−1/d′})))
Insertion      O(m(d + log n))
Deletion       O(m log n)
Space          O(mn)

Table 2. Time and space complexities of Prioritized DCI.

The construction and querying procedures of Prioritized DCI are presented formally in Algorithms 1 and 2. To ensure the algorithm retrieves the exact k-nearest neighbours with high probability, the analysis in the next section shows that one should choose k_0 ∈ Ω(k max(log(n/k), (n/k)^{1−m/d′})) and k_1 ∈ Ω(mk max(log(n/k), (n/k)^{1−1/d′})), where d′ denotes the intrinsic dimensionality. Because this assumes a worst-case configuration of data points, it may be overly conservative in practice; these parameters may therefore be chosen by cross-validation.

We summarize the time and space complexities of Prioritized DCI in Table 2. Notably, the first term of the query complexity, which dominates when the ambient dimensionality d is large, has a more favourable dependence on the intrinsic dimensionality d′ than the query complexity of standard DCI. In particular, a linear increase in the intrinsic dimensionality, which corresponds to an exponential increase in the expansion rate, can be mitigated by just a linear increase in the number of simple indices m. This suggests that Prioritized DCI can better handle datasets with high intrinsic dimensionality than standard DCI, which is confirmed by empirical evidence later in this paper.

4. Analysis

We analyze the time and space complexities of Prioritized DCI below and derive the stopping condition of the algorithm. Because the algorithm uses standard data structures, analysis of the construction time, insertion time, deletion time and space complexity is straightforward. Hence, this section focuses mostly on analyzing the query time.

In high-dimensional space, query time is dominated by the time spent on evaluating true distances between candidate points and the query. Therefore, we need to find the number of candidate points that must be retrieved to ensure the algorithm succeeds with high probability. To this end, we derive an upper bound on the failure probability for any given number of candidate points. The algorithm fails if sufficiently many distant points are retrieved from each composite index before some of the true k-nearest neighbours. We decompose this event into multiple (dependent) events, each of which is the event that a particular distant point is retrieved before some true k-nearest neighbours. Since points are retrieved in the order of their maximum projected distance, this event happens when the maximum projected distance of the distant point is less than that of a true k-nearest neighbour. We start by finding an upper bound on the probability of this event. To simplify notation, we initially consider displacement vectors from the query to each data point, so that relationships between projected distances of triplets of points translate to relationships between projected lengths of pairs of displacement vectors.

We start by examining the event that a vector under random one-dimensional projection satisfies some geometric constraint. We then find an upper bound on the probability that some combinations of these events occur, which is related to the failure probability of the algorithm.

Lemma 1. Let v^l, v^s ∈ R^d be such that ‖v^l‖_2 > ‖v^s‖_2, and let {u′_j}_{j=1}^M be i.i.d. unit vectors in R^d drawn uniformly at random. Then Pr(max_j |⟨v^l, u′_j⟩| ≤ ‖v^s‖_2) = (1 − (2/π) cos⁻¹(‖v^s‖_2 / ‖v^l‖_2))^M.

Proof. The event {max_j |⟨v^l, u′_j⟩| ≤ ‖v^s‖_2} is equivalent to the event that |⟨v^l, u′_j⟩| ≤ ‖v^s‖_2 ∀j, which is the intersection of the events {|⟨v^l, u′_j⟩| ≤ ‖v^s‖_2}_j. Because the u′_j's are drawn independently, these events are independent.

Let θ_j be the angle between v^l and u′_j, so that ⟨v^l, u′_j⟩ = ‖v^l‖_2 cos θ_j. Since u′_j is drawn uniformly, θ_j is uniformly distributed on [0, 2π]. Hence,

Pr(max_j |⟨v^l, u′_j⟩| ≤ ‖v^s‖_2)
= ∏_{j=1}^M Pr(|⟨v^l, u′_j⟩| ≤ ‖v^s‖_2)
= ∏_{j=1}^M Pr(|cos θ_j| ≤ ‖v^s‖_2 / ‖v^l‖_2)
= ∏_{j=1}^M (2 Pr(θ_j ∈ [cos⁻¹(‖v^s‖_2/‖v^l‖_2), π − cos⁻¹(‖v^s‖_2/‖v^l‖_2)]))
= (1 − (2/π) cos⁻¹(‖v^s‖_2/‖v^l‖_2))^M
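As a quick sanity check on this closed form, one can estimate the probability by simulation; the sketch below is illustrative only, with arbitrary choices of d, M, the norm ratio and the number of trials, and uses the fact that the probability depends only on the ratio ‖v^s‖_2/‖v^l‖_2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, trials = 10, 5, 100_000

v_l = rng.standard_normal(d)                 # the longer vector v^l
ratio = 0.6                                  # assumed ratio ‖v^s‖_2 / ‖v^l‖_2
threshold = ratio * np.linalg.norm(v_l)      # plays the role of ‖v^s‖_2

# Draw M uniform random unit directions per trial and check whether all M
# projections of v^l fall below the threshold.
u = rng.standard_normal((trials, M, d))
u /= np.linalg.norm(u, axis=-1, keepdims=True)
proj = np.abs(u @ v_l)                       # shape (trials, M)
empirical = np.mean(np.all(proj <= threshold, axis=1))

closed_form = (1 - (2 / np.pi) * np.arccos(ratio)) ** M
print(empirical, closed_form)                # the two should agree closely
```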


Lemma 2. For any set of events {E_i}_{i=1}^N, the probability that at least k′ of them happen is at most (1/k′) Σ_{i=1}^N Pr(E_i).

Proof. For any set T ⊆ [N], define E_T to be the intersection of the events indexed by T and the complements of the events not indexed by T, i.e. E_T = (⋂_{i∈T} E_i) ∩ (⋂_{i∉T} E_i^c). Observe that {E_T}_{T⊆[N]} are disjoint and that for any I ⊆ [N], ⋂_{i∈I} E_i = ⋃_{T⊇I} E_T. The event that at least k′ of the E_i's happen is ⋃_{I⊆[N]:|I|=k′} ⋂_{i∈I} E_i, which is equivalent to ⋃_{I⊆[N]:|I|=k′} ⋃_{T⊇I} E_T = ⋃_{T⊆[N]:|T|≥k′} E_T. We will henceforth use 𝒯 to denote {T ⊆ [N] : |T| ≥ k′}. Since 𝒯 is a finite set, we can impose an ordering on its elements and denote the l-th element as T_l. The event can therefore be rewritten as ⋃_{l=1}^{|𝒯|} E_{T_l}.

Define E′_{i,j} to be E_i \ (⋃_{l=j+1}^{|𝒯|} E_{T_l}). We claim that Σ_{i=1}^N Pr(E′_{i,j}) ≥ k′ Σ_{l=1}^{j} Pr(E_{T_l}) for all j ∈ {0, ..., |𝒯|}. We will show this by induction on j.

For j = 0, the claim is vacuously true because probabilities are non-negative. For j > 0, we observe that E′_{i,j} = (E′_{i,j} \ E_{T_j}) ∪ (E′_{i,j} ∩ E_{T_j}) = E′_{i,j−1} ∪ (E′_{i,j} ∩ E_{T_j}) for all i. Since E′_{i,j} \ E_{T_j} and E′_{i,j} ∩ E_{T_j} are disjoint, Pr(E′_{i,j}) = Pr(E′_{i,j−1}) + Pr(E′_{i,j} ∩ E_{T_j}).

Consider the quantity Σ_{i∈T_j} Pr(E′_{i,j}), which is Σ_{i∈T_j} (Pr(E′_{i,j−1}) + Pr(E′_{i,j} ∩ E_{T_j})) by the above observation. For each i ∈ T_j, E_{T_j} ⊆ E_i, and so E_{T_j} \ (⋃_{l=j+1}^{|𝒯|} E_{T_l}) ⊆ E_i \ (⋃_{l=j+1}^{|𝒯|} E_{T_l}) = E′_{i,j}. Because {E_{T_l}}_{l=j}^{|𝒯|} are disjoint, E_{T_j} \ (⋃_{l=j+1}^{|𝒯|} E_{T_l}) = E_{T_j}. Hence, E_{T_j} ⊆ E′_{i,j} and so E′_{i,j} ∩ E_{T_j} = E_{T_j}. Thus, Σ_{i∈T_j} Pr(E′_{i,j}) = |T_j| Pr(E_{T_j}) + Σ_{i∈T_j} Pr(E′_{i,j−1}).

It follows that Σ_{i=1}^N Pr(E′_{i,j}) = |T_j| Pr(E_{T_j}) + Σ_{i∈T_j} Pr(E′_{i,j−1}) + Σ_{i∉T_j} Pr(E′_{i,j}). Because Pr(E′_{i,j}) = Pr(E′_{i,j−1}) + Pr(E′_{i,j} ∩ E_{T_j}) ≥ Pr(E′_{i,j−1}) and |T_j| ≥ k′, we have Σ_{i=1}^N Pr(E′_{i,j}) ≥ k′ Pr(E_{T_j}) + Σ_{i=1}^N Pr(E′_{i,j−1}). By the inductive hypothesis, Σ_{i=1}^N Pr(E′_{i,j−1}) ≥ k′ Σ_{l=1}^{j−1} Pr(E_{T_l}). Therefore, Σ_{i=1}^N Pr(E′_{i,j}) ≥ k′ Σ_{l=1}^{j} Pr(E_{T_l}), which concludes the induction argument.

The lemma is a special case of this claim when j = |𝒯|, since E′_{i,|𝒯|} = E_i and Σ_{l=1}^{|𝒯|} Pr(E_{T_l}) = Pr(⋃_{l=1}^{|𝒯|} E_{T_l}).

Combining the above yields the following theorem, the proof of which is found in the supplementary material.

Theorem 1. Let {v^l_i}_{i=1}^N and {v^s_{i′}}_{i′=1}^{N′} be sets of vectors such that ‖v^l_i‖_2 > ‖v^s_{i′}‖_2 ∀i ∈ [N], i′ ∈ [N′]. Furthermore, let {u′_{ij}}_{i∈[N],j∈[M]} be random uniformly distributed unit vectors such that u′_{i1}, ..., u′_{iM} are independent for any given i. Consider the events {∃v^s_{i′} s.t. max_j |⟨v^l_i, u′_{ij}⟩| ≤ ‖v^s_{i′}‖_2}_{i=1}^N. The probability that at least k′ of these events occur is at most (1/k′) Σ_{i=1}^N (1 − (2/π) cos⁻¹(‖v^s_max‖_2/‖v^l_i‖_2))^M, where ‖v^s_max‖_2 = max_{i′} ‖v^s_{i′}‖_2. Furthermore, if k′ = N, it is at most min_{i∈[N]} (1 − (2/π) cos⁻¹(‖v^s_max‖_2/‖v^l_i‖_2))^M.

We now apply the results above to analyze specific properties of the algorithm. For convenience, instead of working directly with intrinsic dimensionality, we will analyze the query time in terms of a related quantity, global relative sparsity, as defined in (Li & Malik, 2016). We reproduce its definition below for completeness.

Definition 1. Given a dataset D ⊆ R^d, let B_p(r) be the set of points in D that are within a ball of radius r around a point p. A dataset D has global relative sparsity of (τ, γ) if for all r and p ∈ R^d such that |B_p(r)| ≥ τ, |B_p(γr)| ≤ 2|B_p(r)|, where γ ≥ 1.

Global relative sparsity is related to the expansion rate (Karger & Ruhl, 2002) and intrinsic dimensionality in the following way: a dataset with global relative sparsity of (τ, γ) has (τ, 2^{1/log_2 γ})-expansion and intrinsic dimensionality of 1/log_2 γ.
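As an aside, the quantities in Definition 1 can be estimated empirically on a small dataset by brute force; the sketch below is our own construction, not part of the paper. It takes the smallest observed ratio between the radii enclosing 2c and c points around sampled points as a proxy for γ and reports 1/log_2 γ as a rough intrinsic-dimensionality estimate.

```python
import numpy as np

def estimate_intrinsic_dim(data, tau=25, sample=200, rng=None):
    """Rough brute-force estimate of the intrinsic dimensionality 1/log2(gamma),
    where gamma is estimated as the smallest observed radius ratio needed to
    double the number of points enclosed around a sampled point (cf. Definition 1)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    gamma = np.inf
    for i in rng.choice(n, size=min(sample, n), replace=False):
        dists = np.sort(np.linalg.norm(data - data[i], axis=1))
        c = tau
        while 2 * c <= n:
            # radius enclosing 2c points divided by radius enclosing c points
            gamma = min(gamma, dists[2 * c - 1] / max(dists[c - 1], 1e-12))
            c *= 2
    gamma = max(gamma, 1.0 + 1e-12)   # Definition 1 requires gamma >= 1
    return 1.0 / np.log2(gamma)
```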

Below we derive two upper bounds on the probability that some of the true k-nearest neighbours are missing from the set of candidate points retrieved from a given composite index, which are expressed in terms of k_0 and k_1 respectively. These results inform us how k_0 and k_1 should be chosen to ensure the querying procedure returns the correct results with high probability. In the results that follow, we use {p^(i)}_{i=1}^n to denote a re-ordering of the points {p_i}_{i=1}^n so that p^(i) is the i-th closest point to the query q. Proofs are found in the supplementary material.

Lemma 3. Consider points in the order they are retrieved from a composite index that consists of m simple indices. The probability that there are at least n_0 points that are not the true k-nearest neighbours but are retrieved before some of them is at most (1/(n_0 − k)) Σ_{i=2k+1}^n (1 − (2/π) cos⁻¹(‖p^(k) − q‖_2/‖p^(i) − q‖_2))^m.

Lemma 4. Consider point projections in a composite index that consists of m simple indices in the order they are visited. The probability that n_0 point projections that are not of the true k-nearest neighbours are visited before all true k-nearest neighbours have been retrieved is at most (m/(n_0 − mk)) Σ_{i=2k+1}^n (1 − (2/π) cos⁻¹(‖p^(k) − q‖_2/‖p^(i) − q‖_2)).

Lemma 5. On a dataset with global relative sparsity (k, γ), the quantity Σ_{i=2k+1}^n (1 − (2/π) cos⁻¹(‖p^(k) − q‖_2/‖p^(i) − q‖_2))^m is at most O(k max(log(n/k), (n/k)^{1−m log_2 γ})).

Lemma 6. For a dataset with global relative sparsity (k, γ) and a given composite index consisting of m simple indices, there is some k_0 ∈ Ω(k max(log(n/k), (n/k)^{1−m log_2 γ})) such that the probability that the candidate points retrieved from the composite index do not include some of the true k-nearest neighbours is at most some constant α_0 < 1.

Lemma 7. For a dataset with global relative sparsity (k, γ) and a given composite index consisting of m simple indices, there is some k_1 ∈ Ω(mk max(log(n/k), (n/k)^{1−log_2 γ})) such that the probability that the candidate points retrieved from the composite index do not include some of the true k-nearest neighbours is at most some constant α_1 < 1.

Theorem 2. For a dataset with global relative sparsity (k, γ), for any ε > 0, there is some L, k_0 ∈ Ω(k max(log(n/k), (n/k)^{1−m log_2 γ})) and k_1 ∈ Ω(mk max(log(n/k), (n/k)^{1−log_2 γ})) such that the algorithm returns the correct set of k-nearest neighbours with probability of at least 1 − ε.

Now that we have found a choice of k_0 and k_1 that suffices to ensure correctness with high probability, we can derive a bound on the query time that guarantees correctness. We then analyze the time complexity for construction, insertion and deletion, and the space complexity. Proofs of the following are found in the supplementary material.

Theorem 3. For a given number of simple indices m, the algorithm takes O(dk max(log(n/k), (n/k)^{1−m/d′}) + mk log m (max(log(n/k), (n/k)^{1−1/d′}))) time to retrieve the k-nearest neighbours at query time, where d′ denotes the intrinsic dimensionality.

Theorem 4. For a given number of simple indices m, the algorithm takes O(m(dn + n log n)) time to preprocess the data points in D at construction time.

Theorem 5. The algorithm requires O(m(d + log n)) time to insert a new data point and O(m log n) time to delete a data point.

Theorem 6. The algorithm requires O(mn) space in addition to the space used to store the data.

5. Experiments

We compare the performance of Prioritized DCI to that of standard DCI (Li & Malik, 2016), product quantization (Jegou et al., 2011) and LSH (Datar et al., 2004), which is perhaps the algorithm that is most widely used in high-dimensional settings. Because LSH operates under the approximate setting, in which the performance metric of interest is how close the returned points are to the query rather than whether they are the true k-nearest neighbours, all algorithms are evaluated in terms of the time they would need to achieve varying levels of approximation quality.

Evaluation is performed on two datasets, CIFAR-100 (Krizhevsky & Hinton, 2009) and MNIST (LeCun et al., 1998). CIFAR-100 consists of 60,000 colour images of 100 types of objects in natural scenes and MNIST consists of 70,000 grayscale images of handwritten digits. The images in CIFAR-100 have a size of 32 × 32 and three colour channels, and the images in MNIST have a size of 28 × 28 and a single colour channel. We reshape each image into a vector whose entries represent pixel intensities at different locations and colour channels in the image. So, each vector has a dimensionality of 32 × 32 × 3 = 3072 for CIFAR-100 and 28 × 28 = 784 for MNIST. Note that the dimensionalities under consideration are much higher than those typically used to evaluate prior methods.

For the purposes of nearest neighbour search, MNIST is a more challenging dataset than CIFAR-100. This is because images in MNIST are concentrated around a few modes; consequently, data points form dense clusters, leading to higher intrinsic dimensionality. On the other hand, images in CIFAR-100 are more diverse, and so data points are more dispersed in space. Intuitively, it is much harder to find the closest digit to a query among 6999 other digits of the same category that are all plausible near neighbours than to find the most similar natural image among a few other natural images with similar appearance. Later results show that all algorithms need fewer distance evaluations to achieve the same level of approximation quality on CIFAR-100 than on MNIST.

We evaluate the performance of all algorithms using cross-validation, where we randomly choose ten different splits of query vs. data points. Each split consists of 100 points from the dataset that serve as queries, with the remainder designated as data points. We use each algorithm to retrieve the 25 nearest neighbours at varying levels of approximation quality and report the mean performance and standard deviation over all splits.

Approximation quality is measured using the approximation ratio, which is defined to be the ratio of the radius of the ball containing the set of true k-nearest neighbours to the radius of the ball containing the set of approximate k-nearest neighbours returned by the algorithm. The closer the approximation ratio is to 1, the higher the approximation quality. In high dimensions, the time taken to compute true distances between the query and the candidate points dominates query time, so the number of distance evaluations can be used as an implementation-independent proxy for the query time.
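For concreteness, a small helper along these lines (our own naming and convention, not from the paper's evaluation code) computes the ratio of the two ball radii; note that some conventions report the reciprocal so that the value is at least 1.

```python
import numpy as np

def approximation_ratio(q, true_knn, returned, data):
    """Ratio of the radius of the ball containing the true k-nearest neighbours
    to the radius of the ball containing the returned points, as defined above.
    (Its reciprocal, which is >= 1, is an equivalent way to report it.)"""
    r_true = max(np.linalg.norm(data[i] - q) for i in true_knn)
    r_returned = max(np.linalg.norm(data[i] - q) for i in returned)
    return r_true / r_returned
```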

[Figure 2: panels (a)–(c) plot the number of distance evaluations against the approximation ratio for LSH, PQ, DCI and Prioritized DCI; panel (a) uses DCI and Prioritized DCI with m = 25, L = 2 and Prioritized DCI with m = 10, L = 2, while panels (b) and (c) use m = 15, L = 3 and m = 10, L = 2.]

Figure 2. Comparison of the number of distance evaluations needed by different algorithms to achieve varying levels of approximation quality on (a) CIFAR-100 and (b, c) MNIST. Each curve represents the mean over ten folds and the shaded area represents ±1 standard deviation. Lower values are better. (c) Close-up view of the figure in (b).

For LSH, we used 24 hashes per table and 100 tables, which we found to achieve the best approximation quality given the memory constraints. For product quantization, we used a data-independent codebook with 256 entries so that the algorithm supports dynamic updates. For standard DCI, we used the same hyperparameter settings used in (Li & Malik, 2016) (m = 25 and L = 2 on CIFAR-100 and m = 15 and L = 3 on MNIST). For Prioritized DCI, we used two different settings: one that matches the hyperparameter settings of standard DCI, and another that uses less space (m = 10 and L = 2 on both CIFAR-100 and MNIST).

We plot the number of distance evaluations that each algorithm requires to achieve each desired level of approximation ratio in Figure 2. As shown, on CIFAR-100, under the same hyperparameter setting used by standard DCI, Prioritized DCI requires 87.2% to 92.5% fewer distance evaluations than standard DCI, 91.7% to 92.8% fewer distance evaluations than product quantization, and 90.9% to 93.8% fewer distance evaluations than LSH to achieve the same levels of approximation quality, which represents a 14-fold reduction in the number of distance evaluations relative to LSH on average. Under the more space-efficient hyperparameter setting, Prioritized DCI achieves a 6-fold reduction compared to LSH. On MNIST, under the same hyperparameter setting used by standard DCI, Prioritized DCI requires 96.4% to 97.0% fewer distance evaluations than standard DCI, 87.1% to 89.8% fewer distance evaluations than product quantization, and 98.8% to 99.3% fewer distance evaluations than LSH, which represents a 116-fold reduction relative to LSH on average. Under the more space-efficient hyperparameter setting, Prioritized DCI achieves a 32-fold reduction compared to LSH.

We compare the space efficiency of Prioritized DCI to that of standard DCI and LSH. As shown in Figure 3 in the supplementary material, compared to LSH, Prioritized DCI uses 95.5% less space on CIFAR-100 and 95.3% less space on MNIST under the same hyperparameter settings used by standard DCI. This represents a 22-fold reduction in memory consumption on CIFAR-100 and a 21-fold reduction on MNIST. Under the more space-efficient hyperparameter setting, Prioritized DCI uses 98.2% less space on CIFAR-100 and 97.9% less space on MNIST relative to LSH, which represents a 55-fold reduction on CIFAR-100 and a 48-fold reduction on MNIST.

In terms of wall-clock time, our implementation of Prioritized DCI takes 1.18 seconds to construct the data structure and execute 100 queries on MNIST, compared to 104.71 seconds taken by LSH.

6. Conclusion

In this paper, we presented a new exact randomized algorithm for k-nearest neighbour search, which we refer to as Prioritized DCI. We showed that Prioritized DCI achieves a significant improvement in terms of the dependence of query time complexity on intrinsic dimensionality compared to standard DCI. Specifically, Prioritized DCI can to a large extent counteract a linear increase in the intrinsic dimensionality, or equivalently, an exponential increase in the number of points near a query, using just a linear increase in the number of simple indices. Empirical results validated the effectiveness of Prioritized DCI in practice, demonstrating the advantages of Prioritized DCI over prior methods in terms of speed and memory usage.

Acknowledgements. This work was supported by DARPA W911NF-16-1-0552. Ke Li thanks the Natural Sciences and Engineering Research Council of Canada (NSERC) for fellowship support.


References

Anagnostopoulos, Evangelos, Emiris, Ioannis Z, and Psarros, Ioannis. Low-quality dimension reduction and high-dimensional approximate nearest neighbor. In 31st International Symposium on Computational Geometry (SoCG 2015), pp. 436–450, 2015.

Andoni, Alexandr and Indyk, Piotr. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on, pp. 459–468. IEEE, 2006.

Andoni, Alexandr and Razenshteyn, Ilya. Optimal data-dependent hashing for approximate near neighbors. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pp. 793–801. ACM, 2015.

Arya, Sunil and Mount, David M. Approximate nearest neighbor queries in fixed dimensions. In SODA, volume 93, pp. 271–280, 1993.

Arya, Sunil, Mount, David M, Netanyahu, Nathan S, Silverman, Ruth, and Wu, Angela Y. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM), 45(6):891–923, 1998.

Bayer, Rudolf. Symmetric binary B-trees: Data structure and maintenance algorithms. Acta Informatica, 1(4):290–306, 1972.

Behnam, Ehsan, Waterman, Michael S, and Smith, Andrew D. A geometric interpretation for local alignment-free sequence comparison. Journal of Computational Biology, 20(7):471–485, 2013.

Bentley, Jon Louis. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.

Berchtold, Stefan, Keim, Daniel A., and Kriegel, Hans-Peter. The X-tree: An index structure for high-dimensional data. In Very Large Data Bases, pp. 28–39, 1996.

Berchtold, Stefan, Ertl, Bernhard, Keim, Daniel A, Kriegel, H-P, and Seidl, Thomas. Fast nearest neighbor search in high-dimensional space. In Data Engineering, 1998. Proceedings., 14th International Conference on, pp. 209–218. IEEE, 1998.

Beygelzimer, Alina, Kakade, Sham, and Langford, John. Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine Learning, pp. 97–104. ACM, 2006.

Biau, Gérard, Chazal, Frédéric, Cohen-Steiner, David, Devroye, Luc, Rodriguez, Carlos, et al. A weighted k-nearest neighbor density estimate for geometric inference. Electronic Journal of Statistics, 5:204–237, 2011.

Clarkson, Kenneth L. Nearest neighbor queries in metric spaces. Discrete & Computational Geometry, 22(1):63–93, 1999.

Dasgupta, Sanjoy and Freund, Yoav. Random projection trees and low dimensional manifolds. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pp. 537–546. ACM, 2008.

Dasgupta, Sanjoy and Sinha, Kaushik. Randomized partition trees for nearest neighbor search. Algorithmica, 72(1):237–263, 2015.

Datar, Mayur, Immorlica, Nicole, Indyk, Piotr, and Mirrokni, Vahab S. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 253–262. ACM, 2004.

Eldawy, Ahmed and Mokbel, Mohamed F. SpatialHadoop: A MapReduce framework for spatial data. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pp. 1352–1363. IEEE, 2015.

Guibas, Leo J and Sedgewick, Robert. A dichromatic framework for balanced trees. In Foundations of Computer Science, 1978., 19th Annual Symposium on, pp. 8–21. IEEE, 1978.

Guttman, Antonin. R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, pp. 47–57, 1984.

Houle, Michael E and Nett, Michael. Rank-based similarity search: Reducing the dimensional dependence. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 37(1):136–150, 2015.

Indyk, Piotr and Motwani, Rajeev. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613. ACM, 1998.

Jégou, Hervé, Douze, Matthijs, and Schmid, Cordelia. Product quantization for nearest neighbor search. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(1):117–128, 2011.

Karger, David R and Ruhl, Matthias. Finding nearest neighbors in growth-restricted metrics. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pp. 741–750. ACM, 2002.

Krauthgamer, Robert and Lee, James R. Navigating nets: Simple algorithms for proximity search. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 798–807. Society for Industrial and Applied Mathematics, 2004.

Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, Ke and Malik, Jitendra. Fast k-nearest neighbour search via Dynamic Continuous Indexing. In International Conference on Machine Learning, pp. 671–679, 2016.

Liu, Ting, Moore, Andrew W, Yang, Ke, and Gray, Alexander G. An investigation of practical approximate nearest neighbor algorithms. In Advances in Neural Information Processing Systems, pp. 825–832, 2004.

Meiser, Stefan. Point location in arrangements of hyperplanes. Information and Computation, 106(2):286–303, 1993.

Minsky, Marvin and Papert, Seymour. Perceptrons: An Introduction to Computational Geometry. pp. 222, 1969.

Orchard, Michael T. A fast nearest-neighbor search algorithm. In Acoustics, Speech, and Signal Processing, 1991. ICASSP-91., 1991 International Conference on, pp. 2297–2300. IEEE, 1991.

Paulevé, Loïc, Jégou, Hervé, and Amsaleg, Laurent. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters, 31(11):1348–1358, 2010.

Pugh, William. Skip lists: A probabilistic alternative to balanced trees. Communications of the ACM, 33(6):668–676, 1990.

Weiss, Yair, Torralba, Antonio, and Fergus, Rob. Spectral hashing. In Advances in Neural Information Processing Systems, pp. 1753–1760, 2009.


Fast k-Nearest Neighbour Search via Prioritized DCI: Supplementary Material

Ke Li 1 Jitendra Malik 1

7. Analysis

We present proofs that were omitted from the main paper below.

Theorem 1. Let {v^l_i}_{i=1}^N and {v^s_{i′}}_{i′=1}^{N′} be sets of vectors such that ‖v^l_i‖_2 > ‖v^s_{i′}‖_2 ∀i ∈ [N], i′ ∈ [N′]. Furthermore, let {u′_{ij}}_{i∈[N],j∈[M]} be random uniformly distributed unit vectors such that u′_{i1}, ..., u′_{iM} are independent for any given i. Consider the events {∃v^s_{i′} s.t. max_j |⟨v^l_i, u′_{ij}⟩| ≤ ‖v^s_{i′}‖_2}_{i=1}^N. The probability that at least k′ of these events occur is at most (1/k′) Σ_{i=1}^N (1 − (2/π) cos⁻¹(‖v^s_max‖_2/‖v^l_i‖_2))^M, where ‖v^s_max‖_2 = max_{i′} ‖v^s_{i′}‖_2. Furthermore, if k′ = N, it is at most min_{i∈[N]} (1 − (2/π) cos⁻¹(‖v^s_max‖_2/‖v^l_i‖_2))^M.

Proof. The event that ∃v^s_{i′} s.t. max_j |⟨v^l_i, u′_{ij}⟩| ≤ ‖v^s_{i′}‖_2 is equivalent to the event that max_j |⟨v^l_i, u′_{ij}⟩| ≤ max_{i′} ‖v^s_{i′}‖_2 = ‖v^s_max‖_2. Take E_i to be the event that max_j |⟨v^l_i, u′_{ij}⟩| ≤ ‖v^s_max‖_2. By Lemma 1, Pr(E_i) ≤ (1 − (2/π) cos⁻¹(‖v^s_max‖_2/‖v^l_i‖_2))^M. It follows from Lemma 2 that the probability that k′ of the E_i's occur is at most (1/k′) Σ_{i=1}^N Pr(E_i) ≤ (1/k′) Σ_{i=1}^N (1 − (2/π) cos⁻¹(‖v^s_max‖_2/‖v^l_i‖_2))^M. If k′ = N, we use the fact that ⋂_{i′=1}^N E_{i′} ⊆ E_i ∀i, which implies that Pr(⋂_{i′=1}^N E_{i′}) ≤ min_{i∈[N]} Pr(E_i) ≤ min_{i∈[N]} (1 − (2/π) cos⁻¹(‖v^s_max‖_2/‖v^l_i‖_2))^M.

Lemma 3. Consider points in the order they are retrieved from a composite index that consists of m simple indices. The probability that there are at least n_0 points that are not the true k-nearest neighbours but are retrieved before some of them is at most (1/(n_0 − k)) Σ_{i=2k+1}^n (1 − (2/π) cos⁻¹(‖p^(k) − q‖_2/‖p^(i) − q‖_2))^m.

Proof. Points that are not the true k-nearest neighbours but are retrieved before some of them will be referred to as extraneous points and are divided into two categories: reasonable and silly. An extraneous point is reasonable if it is one of the 2k-nearest neighbours, and is silly otherwise. For there to be n_0 extraneous points, there must be n_0 − k silly extraneous points. Therefore, the probability that there are n_0 extraneous points is upper bounded by the probability that there are n_0 − k silly extraneous points.

Since points are retrieved from the composite index in the order of increasing maximum projected distance to the query, for any pair of points p and p′, if p is retrieved before p′, then max_j |⟨p − q, u_jl⟩| ≤ max_j |⟨p′ − q, u_jl⟩|, where {u_jl}_{j=1}^m are the projection directions associated with the constituent simple indices of the composite index.

By Theorem 1, if we take {v^l_i}_{i=1}^N to be {p^(i) − q}_{i=2k+1}^n, {v^s_{i′}}_{i′=1}^{N′} to be {p^(i) − q}_{i=1}^k, M to be m, {u′_{ij}}_{j∈[M]} to be {u_jl}_{j∈[m]} for all i ∈ [N] and k′ to be n_0 − k, we obtain an upper bound for the probability of there being a subset of {p^(i)}_{i=2k+1}^n of size n_0 − k such that for all points p in the subset, max_j |⟨p − q, u_jl⟩| ≤ ‖p′ − q‖_2 for some p′ ∈ {p^(i)}_{i=1}^k. In other words, this is the probability of there being n_0 − k points that are not the 2k-nearest neighbours whose maximum projected distances are no greater than the distance from some k-nearest neighbours to the query, which is at most (1/(n_0 − k)) Σ_{i=2k+1}^n (1 − (2/π) cos⁻¹(‖p^(k) − q‖_2/‖p^(i) − q‖_2))^m.

Since the event that max_j |⟨p − q, u_jl⟩| ≤ max_j |⟨p′ − q, u_jl⟩| is contained in the event that max_j |⟨p − q, u_jl⟩| ≤ ‖p′ − q‖_2 for any p, p′, this is also an upper bound for the probability of there being n_0 − k points that are not the 2k-nearest neighbours whose maximum projected distances do not exceed those of some of the k-nearest neighbours, which by definition is the probability that there are n_0 − k silly extraneous points. Since this probability is no less than the probability that there are n_0 extraneous points, the upper bound also applies to this probability.

Lemma 4. Consider point projections in a composite index that consists of m simple indices in the order they are visited. The probability that n_0 point projections that are not of the true k-nearest neighbours are visited before all true k-nearest neighbours have been retrieved is at most (m/(n_0 − mk)) Σ_{i=2k+1}^n (1 − (2/π) cos⁻¹(‖p^(k) − q‖_2/‖p^(i) − q‖_2)).

Proof. Projections of points that are not the true k-nearest neighbours but are visited before the k-nearest neighbours have all been retrieved will be referred to as extraneous projections and are divided into two categories: reasonable and silly. An extraneous projection is reasonable if it is of one of the 2k-nearest neighbours, and is silly otherwise. For there to be n_0 extraneous projections, there must be n_0 − mk silly extraneous projections, since there could be at most mk reasonable extraneous projections. Therefore, the probability that there are n_0 extraneous projections is upper bounded by the probability that there are n_0 − mk silly extraneous projections.

Since point projections are visited in the order of increasing projected distance to the query, each silly extraneous projection must be closer to the query projection than the maximum projection of some k-nearest neighbour.

By Theorem 1, if we take {v^l_i}_{i=1}^N to be {p^(2k+⌊(i−1)/m⌋+1) − q}_{i=1}^{m(n−2k)}, {v^s_{i′}}_{i′=1}^{N′} to be {p^(⌊(i−1)/m⌋+1) − q}_{i=1}^{mk}, M to be 1, {u′_{i1}}_{i=1}^N to be {u_{(i mod m),l}}_{i=1}^{m(n−2k)} and k′ to be n_0 − mk, we obtain an upper bound for the probability of there being n_0 − mk point projections that are not of the 2k-nearest neighbours whose distances to their respective query projections are no greater than the true distance between the query and some k-nearest neighbour, which is (1/(n_0 − mk)) Σ_{i=2k+1}^n m(1 − (2/π) cos⁻¹(‖p^(k) − q‖_2/‖p^(i) − q‖_2)).

Because maximum projected distances are no more than true distances, this is also an upper bound for the probability of there being n_0 − mk silly extraneous projections. Since this probability is no less than the probability that there are n_0 extraneous projections, the upper bound also applies to this probability.

Lemma 5. On a dataset with global relative sparsity $(k, \gamma)$, the quantity $\sum_{i=2k+1}^{n}\left(1 - \frac{2}{\pi}\cos^{-1}\left(\|p^{(k)} - q\|_2 / \|p^{(i)} - q\|_2\right)\right)^{m}$ is at most $O\!\left(k\max\!\left(\log(n/k), (n/k)^{1 - m\log_2\gamma}\right)\right)$.

Proof. By definition of global relative sparsity, for all $i \geq 2k + 1$, $\|p^{(i)} - q\|_2 > \gamma\|p^{(k)} - q\|_2$. A recursive application of this definition shows that for all $i \geq 2^{i'}k + 1$, $\|p^{(i)} - q\|_2 > \gamma^{i'}\|p^{(k)} - q\|_2$.

Applying the fact that $1 - (2/\pi)\cos^{-1}(x) \leq x$ for all $x \in [0, 1]$ together with the above observation yields:

$$\sum_{i=2k+1}^{n}\left(1 - \frac{2}{\pi}\cos^{-1}\left(\frac{\left\|p^{(k)} - q\right\|_2}{\left\|p^{(i)} - q\right\|_2}\right)\right)^{m} \leq \sum_{i=2k+1}^{n}\left(\frac{\left\|p^{(k)} - q\right\|_2}{\left\|p^{(i)} - q\right\|_2}\right)^{m} < \sum_{i'=1}^{\lceil\log_2(n/k)\rceil - 1} 2^{i'}k\,\gamma^{-i'm},$$

where the last inequality follows from grouping the at most $2^{i'}k$ terms with $2^{i'}k + 1 \leq i \leq 2^{i'+1}k$ and bounding each of them by $\gamma^{-i'm}$.

If $\gamma \geq \sqrt[m]{2}$, this quantity is at most $k\log_2(n/k)$. On the other hand, if $1 \leq \gamma < \sqrt[m]{2}$, this quantity can be simplified to:

$$k\left(\frac{2}{\gamma^m}\right)\left(\left(\frac{2}{\gamma^m}\right)^{\lceil\log_2(n/k)\rceil - 1} - 1\right)\Big/\left(\frac{2}{\gamma^m} - 1\right) = O\!\left(k\left(\frac{2}{\gamma^m}\right)^{\lceil\log_2(n/k)\rceil - 1}\right) = O\!\left(k\left(\frac{n}{k}\right)^{1 - m\log_2\gamma}\right).$$

Therefore, $\sum_{i=2k+1}^{n}\left(\|p^{(k)} - q\|_2 / \|p^{(i)} - q\|_2\right)^{m} \leq O\!\left(k\max\!\left(\log(n/k), (n/k)^{1 - m\log_2\gamma}\right)\right)$, and the quantity in the lemma statement is bounded by the same expression.

Lemma 6. For a dataset with global relative sparsity $(k, \gamma)$ and a given composite index consisting of $m$ simple indices, there is some $k_0 \in \Omega\!\left(k\max\!\left(\log(n/k), (n/k)^{1 - m\log_2\gamma}\right)\right)$ such that the probability that the candidate points retrieved from the composite index do not include some of the true $k$-nearest neighbours is at most some constant $\alpha_0 < 1$.

Proof. We will refer to the true $k$-nearest neighbours that are among the first $k_0$ points retrieved from the composite index as true positives and those that are not as false negatives. Additionally, we will refer to points that are not true $k$-nearest neighbours but are among the first $k_0$ points retrieved as false positives.

When not all the true $k$-nearest neighbours are among the first $k_0$ candidate points, there must be at least one false negative and so there can be at most $k - 1$ true positives. Consequently, there must be at least $k_0 - (k - 1)$ false positives. To find an upper bound on the probability of the existence of $k_0 - (k - 1)$ false positives in terms of global relative sparsity, we apply Lemma 3 with $n_0$ set to $k_0 - (k - 1)$, followed by Lemma 5. We conclude that this probability is at most $\frac{1}{k_0 - 2k + 1}O\!\left(k\max\!\left(\log(n/k), (n/k)^{1 - m\log_2\gamma}\right)\right)$.

Because the event that not all the true $k$-nearest neighbours are among the first $k_0$ candidate points is contained in the event that there are $k_0 - (k - 1)$ false positives, the probability of the former is upper bounded by the same quantity. So, we can choose some $k_0 \in \Omega\!\left(k\max\!\left(\log(n/k), (n/k)^{1 - m\log_2\gamma}\right)\right)$ to make it strictly less than 1.

Lemma 7. For a dataset with global relative sparsity $(k, \gamma)$ and a given composite index consisting of $m$ simple indices, there is some $k_1 \in \Omega\!\left(mk\max\!\left(\log(n/k), (n/k)^{1 - \log_2\gamma}\right)\right)$ such that the probability that the candidate points retrieved from the composite index do not include some of the true $k$-nearest neighbours is at most some constant $\alpha_1 < 1$.

Proof. We will refer to the projections of true $k$-nearest neighbours that are among the first $k_1$ visited point projections as true positives and those that are not as false negatives. Additionally, we will refer to projections of points that are not of the true $k$-nearest neighbours but are among the first $k_1$ visited point projections as false positives.

When a $k$-nearest neighbour is not among the candidate points that have been retrieved, some of its projections must not be among the first $k_1$ visited point projections. So, there must be at least one false negative, implying that there can be at most $mk - 1$ true positives. Consequently, there must be at least $k_1 - (mk - 1)$ false positives. To find an upper bound on the probability of the existence of $k_1 - (mk - 1)$ false positives in terms of global relative sparsity, we apply Lemma 4 with $n_0$ set to $k_1 - (mk - 1)$, followed by Lemma 5. We conclude that this probability is at most $\frac{m}{k_1 - 2mk + 1}O\!\left(k\max\!\left(\log(n/k), (n/k)^{1 - \log_2\gamma}\right)\right)$.

Because the event that some true $k$-nearest neighbour is missing from the candidate points is contained in the event that there are $k_1 - (mk - 1)$ false positives, the probability of the former is upper bounded by the same quantity. So, we can choose some $k_1 \in \Omega\!\left(mk\max\!\left(\log(n/k), (n/k)^{1 - \log_2\gamma}\right)\right)$ to make it strictly less than 1.

Theorem 2. For a dataset with global relative sparsity $(k, \gamma)$ and for any $\epsilon > 0$, there is some $L$, $k_0 \in \Omega\!\left(k\max\!\left(\log(n/k), (n/k)^{1 - m\log_2\gamma}\right)\right)$ and $k_1 \in \Omega\!\left(mk\max\!\left(\log(n/k), (n/k)^{1 - \log_2\gamma}\right)\right)$ such that the algorithm returns the correct set of $k$-nearest neighbours with probability of at least $1 - \epsilon$.

Proof. For a given composite index, by Lemma 6, there is some $k_0 \in \Omega\!\left(k\max\!\left(\log(n/k), (n/k)^{1 - m\log_2\gamma}\right)\right)$ such that the probability that some of the true $k$-nearest neighbours are missed is at most some constant $\alpha_0 < 1$. Likewise, by Lemma 7, there is some $k_1 \in \Omega\!\left(mk\max\!\left(\log(n/k), (n/k)^{1 - \log_2\gamma}\right)\right)$ such that this probability is at most some constant $\alpha_1 < 1$. By choosing such $k_0$ and $k_1$, this probability is therefore at most $\min\{\alpha_0, \alpha_1\} < 1$. For the algorithm to fail, all composite indices must miss some $k$-nearest neighbours. Since each composite index is constructed independently, the algorithm fails with probability of at most $(\min\{\alpha_0, \alpha_1\})^L$, and so must succeed with probability of at least $1 - (\min\{\alpha_0, \alpha_1\})^L$. Since $\min\{\alpha_0, \alpha_1\} < 1$, there is some $L$ that makes $1 - (\min\{\alpha_0, \alpha_1\})^L \geq 1 - \epsilon$.
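To make the choice of $L$ concrete, the following tiny Python snippet (illustrative only; the values of alpha and eps are made up and this helper is not part of the paper's algorithm) computes the smallest $L$ with $(\min\{\alpha_0, \alpha_1\})^L \leq \epsilon$:

```python
import math

def num_composite_indices(alpha: float, eps: float) -> int:
    """Smallest L such that alpha**L <= eps, for 0 < alpha < 1 and 0 < eps < 1."""
    return math.ceil(math.log(eps) / math.log(alpha))

# Example with made-up numbers: alpha = 0.5, eps = 0.01 gives L = 7,
# since 0.5**7 = 0.0078 <= 0.01.
print(num_composite_indices(0.5, 0.01))
```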

Theorem 3. For a given number of simple indices $m$, the algorithm takes $O\!\left(dk\max\!\left(\log(n/k), (n/k)^{1 - m/d'}\right) + mk\log m\,\max\!\left(\log(n/k), (n/k)^{1 - 1/d'}\right)\right)$ time to retrieve the $k$-nearest neighbours at query time, where $d'$ denotes the intrinsic dimensionality.

Proof. Computing projections of the query point along all $u_{jl}$'s takes $O(dm)$ time, since $L$ is a constant. Searching in the binary search trees/skip lists $T_{jl}$'s takes $O(m\log n)$ time. The total number of point projections that are visited is at most $\Theta\!\left(mk\max\!\left(\log(n/k), (n/k)^{1 - \log_2\gamma}\right)\right)$. Because determining the next point to visit requires popping and pushing a priority queue, which takes $O(\log m)$ time, the total time spent on visiting points is $O\!\left(mk\log m\,\max\!\left(\log(n/k), (n/k)^{1 - \log_2\gamma}\right)\right)$. The total number of candidate points retrieved is at most $\Theta\!\left(k\max\!\left(\log(n/k), (n/k)^{1 - m\log_2\gamma}\right)\right)$. Because true distances are computed for every candidate point, the total time spent on distance computation is $O\!\left(dk\max\!\left(\log(n/k), (n/k)^{1 - m\log_2\gamma}\right)\right)$. We can find the $k$ closest points to the query among the candidate points using a selection algorithm like quickselect, which takes $O\!\left(k\max\!\left(\log(n/k), (n/k)^{1 - m\log_2\gamma}\right)\right)$ time on average. Since the time for visiting points and for computing distances dominates, the entire algorithm takes $O\!\left(dk\max\!\left(\log(n/k), (n/k)^{1 - m\log_2\gamma}\right) + mk\log m\,\max\!\left(\log(n/k), (n/k)^{1 - \log_2\gamma}\right)\right)$ time. Substituting $1/d'$ for $\log_2\gamma$ yields the desired expression.
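The Python sketch below illustrates the query-time steps this analysis counts, for a single composite index. It is a simplified illustration rather than the authors' implementation: it assumes `proj[j]` holds the sorted projections of all points onto direction $u_j$ (with `order[j]` the corresponding point ids; a construction step producing these arrays is sketched after Theorem 4), and it treats a point as a candidate once its projection has been visited in all $m$ simple indices. The priority queue holds at most two live entries per simple index, so each pop/push costs $O(\log m)$.

```python
import heapq
import numpy as np

def query_composite_index(q, X, U, proj, order, k, max_visits, max_candidates):
    """Visit point projections in order of increasing projected distance to the
    query, then rank the retrieved candidate points by true distance."""
    n, m = X.shape[0], U.shape[0]
    q_proj = U @ q                                    # O(dm): project the query once
    heap, cursors = [], []
    for j in range(m):
        r = int(np.searchsorted(proj[j], q_proj[j]))  # first projection >= query's
        cursors.append([r - 1, r])
        if r - 1 >= 0:
            heapq.heappush(heap, (q_proj[j] - proj[j][r - 1], j, 0))  # left side
        if r < n:
            heapq.heappush(heap, (proj[j][r] - q_proj[j], j, 1))      # right side
    visits = np.zeros(n, dtype=int)
    candidates = []
    for _ in range(max_visits):
        if not heap or len(candidates) >= max_candidates:
            break
        _, j, side = heapq.heappop(heap)              # O(log m) per visited projection
        pos = cursors[j][side]
        point_id = order[j][pos]
        nxt = pos - 1 if side == 0 else pos + 1       # advance this cursor outwards
        cursors[j][side] = nxt
        if 0 <= nxt < n:
            gap = q_proj[j] - proj[j][nxt] if side == 0 else proj[j][nxt] - q_proj[j]
            heapq.heappush(heap, (gap, j, side))
        visits[point_id] += 1
        if visits[point_id] == m:                     # seen in all m simple indices
            candidates.append(point_id)
    cand = np.asarray(candidates)
    if cand.size <= k:
        return cand
    dists = np.linalg.norm(X[cand] - q, axis=1)       # true distances for candidates only
    return cand[np.argpartition(dists, k - 1)[:k]]    # quickselect-style top-k
```

With $L$ composite indices, a full query would repeat this per index and merge the returned candidates before the final ranking.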

Theorem 4. For a given number of simple indices $m$, the algorithm takes $O(m(dn + n\log n))$ time to preprocess the data points in $D$ at construction time.

Proof. Computing projections of all $n$ points along all $u_{jl}$'s takes $O(dmn)$ time, since $L$ is a constant. Inserting all $n$ points into $mL$ self-balancing binary search trees/skip lists takes $O(mn\log n)$ time.
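A minimal construction sketch for one composite index is shown below (illustrative, not the paper's code): it projects every point onto $m$ random unit directions and keeps each list of projections in sorted order. Plain sorted arrays stand in for the self-balancing binary search trees/skip lists; they support the query-time traversal sketched above, but not $O(\log n)$ insertion.

```python
import numpy as np

def build_composite_index(X, m, seed=0):
    """Project the n x d data matrix X onto m random unit directions and sort
    each coordinate list, mimicking the preprocessing analysed in Theorem 4."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = rng.standard_normal((m, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)     # m random unit directions
    P = X @ U.T                                       # O(dmn): all point projections
    order = [np.argsort(P[:, j]) for j in range(m)]   # O(mn log n): sort per simple index
    proj = [P[order[j], j] for j in range(m)]
    return U, proj, order
```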

Theorem 5. The algorithm requires $O(m(d + \log n))$ time to insert a new data point and $O(m\log n)$ time to delete a data point.

Proof. In order to insert a data point, we need to compute its projection along all $u_{jl}$'s and insert it into each binary search tree or skip list. Computing the projections takes $O(md)$ time and inserting them into the corresponding self-balancing binary search trees or skip lists takes $O(m\log n)$ time. In order to delete a data point, we simply remove its projections from each of the binary search trees or skip lists, which takes $O(m\log n)$ time.
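As an illustration of these dynamic updates (again a sketch under assumptions, not the paper's API), the third-party `sortedcontainers` package can stand in for a self-balancing binary search tree or skip list; its `SortedList.add` and `SortedList.remove` run in roughly logarithmic time:

```python
import numpy as np
from sortedcontainers import SortedList  # third-party stand-in for a BST / skip list

class SimpleIndex:
    def __init__(self, u):
        self.u = u                        # projection direction for this simple index
        self.entries = SortedList()       # (projection value, point id) pairs

    def insert(self, point_id, x):
        # O(d) to project, roughly O(log n) to insert into the sorted structure.
        self.entries.add((float(self.u @ x), point_id))

    def delete(self, point_id, x):
        # Recomputing the projection costs O(d) here; storing projections alongside
        # the points would recover the O(log n) deletion stated in Theorem 5.
        self.entries.remove((float(self.u @ x), point_id))

def insert_point(simple_indices, point_id, x):
    # One projection and one insertion per simple index: O(m(d + log n)) overall.
    for index in simple_indices:
        index.insert(point_id, x)
```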

Theorem 6. The algorithm requires $O(mn)$ space in addition to the space used to store the data.

Proof. The only additional information that needs to be stored is the set of $mL$ binary search trees or skip lists. Since $n$ entries are stored in each binary search tree/skip list, the total additional space required is $O(mn)$.

8. Experiments

Figure 3 shows the memory usage of different algorithms on CIFAR-100 and MNIST.

Figure 3. Memory usage of different algorithms on (a) CIFAR-100 and (b) MNIST. Lower values are better. [Figure: bar charts of memory usage on a 1e8 scale; panel (a) compares LSH, PQ, DCI (m=25, L=2), Prioritized DCI (m=25, L=2) and Prioritized DCI (m=10, L=2); panel (b) compares LSH, PQ, DCI (m=15, L=3), Prioritized DCI (m=15, L=3) and Prioritized DCI (m=10, L=2).]