
High-Dimensional Similarity Search using Data-Sensitive Space Partitioning┼

Sachin Kulkarni¹ and Ratko Orlandic²

¹ Illinois Institute of Technology, Chicago
² University of Illinois at Springfield

Database and Expert Systems Applications 2006

┼ Work supported by the NSF under grant no. IIS-0312266.

CS695, April 13, 2007

Outline

• Problem Definition
• Existing Solutions
• Our Goal
• Design Principle
• GardenHD Clustering and Γ Partitioning
• System Architecture and Processes
• Results
• Conclusions


Problem Definition

• Consider a database of addresses of clubs.

• Typical queries are:
– Find all the clubs within 35 miles of 10 West 31st Street, Chicago.
– Find the 5 nearest clubs.



Problem Definition

• k-Nearest Neighbor (k-NN) Search:
– Given a database with N points and a query point q in some metric space, find the k ≥ 1 points closest to q. [1]

• Applications:
– Computational geometry
– Geographic information systems (GIS)
– Multimedia databases
– Data mining
– Etc.
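The definition above can be written down directly as a brute-force linear scan (a toy sketch, not from the slides; Euclidean distance and the `knn` name are our assumptions). This O(N·D) scan is the baseline that index structures try to beat:

```python
import numpy as np

# Brute-force k-NN: given N points and a query q, return the k >= 1
# points closest to q under the Euclidean metric.
def knn(points, q, k):
    dists = np.linalg.norm(points - q, axis=1)   # distance of every point to q
    idx = np.argsort(dists)[:k]                  # indices of the k closest points
    return idx, dists[idx]

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.2, 0.1], [0.9, 0.9]])
idx, d = knn(points, np.array([0.1, 0.0]), k=2)
print(idx)   # -> [0 2]: (0,0) and (0.2,0.1) are the two nearest
```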


Challenge of k-NN Search

• In high-dimensional feature spaces, indexing structures face the problems of dead space (KDB-trees) or overlap (R-trees).

• Volume and surface area grow exponentially with the number of dimensions.

• Finding k-NN points is costly.

• Traditional access methods perform no better than a sequential scan – the “curse of dimensionality”.
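The distance concentration behind the “curse of dimensionality” can be seen in a short experiment (a toy demo, not from the slides): as dimensionality grows, the nearest and the farthest neighbor of a query point become almost equidistant, so distance-based pruning loses its power.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n=1000):
    """Relative gap between the farthest and nearest neighbor distances."""
    points = rng.random((n, dim))          # uniform points in [0,1]^dim
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# the contrast collapses as the dimensionality grows
for dim in (2, 10, 100):
    print(f"dim={dim:3d}  contrast={distance_contrast(dim):.2f}")
```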


Existing Solutions

• Approximation and dimensionality reduction.

• Exact nearest-neighbor solutions:
– VA-File, A-tree, iDistance
– R-tree, SS-tree, SR-tree

• Significant effort in finding the exact nearest neighbors has yielded limited success.


Goal

• Our goal:
– Scalability with respect to dimensionality
– Acceptable pre-processing (data-loading) time
– Ability to work on incremental loads of data


Our Solution

• Clustering

• Space partitioning

• Indexing


Design Principle

• “Multi-dimensional data must be grouped on storage in a way that minimizes the extensions of storage clusters along all relevant dimensions and achieves high storage utilization.”


What does it Imply?

• Storage organization must maximize the densities of storage clusters

• Reduce their internal empty space

• Improve search performance even before the retrieval process hits persistent storage

• For best results, employ a genuine clustering algorithm


Achieving the Principles

• Data space reduction:
– Detecting dense areas (dense cells) in the space with minimum amounts of empty space.

• Data clustering:
– Detecting the largest areas with the above property, called data clusters.
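A minimal sketch of cell-based dense-area detection (a toy illustration of the general idea only; the actual GardenHD procedure is more sophisticated, and the grid resolution and density threshold here are arbitrary assumptions):

```python
import numpy as np

# A uniform grid with `splits` intervals per dimension over [0,1]^D;
# cells holding at least `min_points` points are reported as dense cells.
def dense_cells(points, splits=4, min_points=10):
    counts = {}
    for p in points:
        # grid-cell coordinates of the point in the splits^D grid
        key = tuple(np.minimum((p * splits).astype(int), splits - 1))
        counts[key] = counts.get(key, 0) + 1
    return {cell: n for cell, n in counts.items() if n >= min_points}

rng = np.random.default_rng(1)
# a tight cluster near the center of [0,1]^3 plus sparse background noise
cluster = 0.5 + 0.02 * rng.standard_normal((200, 3))
noise = rng.random((40, 3))
data = np.clip(np.vstack([cluster, noise]), 0.0, 1.0)
print(dense_cells(data))   # only cells around the cluster center qualify
```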


GardenHD Clustering

• Motivated by the stated principle.

• Efficiently and effectively separates disjoint areas with points.

• Hybrid of cell- and density-based clustering that operates in two phases.

• Recursive space partitioning (Γ partitioning).

• Merging of dense cells.


Γ Partitioning

• G = number of generators, D = number of dimensions
• Number of regions = 1 + (G – 1)·D

• The space partition is compactly represented by a filter (in memory).

[Diagram: a Γ partition of the two-dimensional unit space, showing two subspaces and regions 0–4.]
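The region-count formula can be cross-checked against the pictured example (a small sketch; the function name is ours, not from the paper): if the diagram's regions 0–4 correspond to G = 3 generators in D = 2 dimensions, the formula indeed gives 5 regions.

```python
# Region count of a Γ (Gamma) partition, transcribing the slide's
# formula: G generators in a D-dimensional space produce 1 + (G-1)*D regions.
def gamma_region_count(generators: int, dims: int) -> int:
    return 1 + (generators - 1) * dims

# consistent with the diagram of regions 0-4 (assuming G = 3, D = 2)
print(gamma_region_count(3, 2))   # -> 5
```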


Data-Sensitive Gamma Partition

• DSGP: Data-Sensitive Gamma Partition

[Diagram: a data-sensitive Γ partition with regions 1–4, their effective boundaries, and the KDB-trees built over them.]


System Architecture

• Data Clustering
• “Data-Sensitive” Space Partitioning
• Data Loading / Incremental Data Loading
• Data Retrieval: Region Search and Similarity Search


Basic Processes

• Each region in space is represented by a separate KDB-tree
– KDB-trees perform implicit slicing

• Initial and incremental loading of data
– Dynamic assignment of multi-dimensional data to index pages

• Retrieval
– Region and k-nearest-neighbor search
– Several stages of refinement


Similarity Search - GammaNN

• Nearest neighbor search using GammaNN.

[Diagram: the query point and query hyper-sphere, the region representatives, and the clipped portions of neighboring regions that must also be queried.]
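The idea pictured above can be sketched as follows (an illustrative sketch only, not the actual GammaNN algorithm; regions are approximated here by bounding boxes, and all names are our own): the query's closest region is searched first, and another region is visited only if the current query hyper-sphere, whose radius is the distance to the k-th nearest neighbor found so far, still reaches into it.

```python
import heapq
import numpy as np

def min_dist_to_box(q, lo, hi):
    """Minimum Euclidean distance from point q to the box [lo, hi]."""
    return np.linalg.norm(np.maximum(0.0, np.maximum(lo - q, q - hi)))

def knn_with_region_pruning(regions, q, k):
    """regions: list of (lo, hi, points) triples, one per space region."""
    # visit regions in order of their minimum distance to the query
    order = sorted(regions, key=lambda r: min_dist_to_box(q, r[0], r[1]))
    best = []                       # max-heap of (-distance, point) pairs
    for lo, hi, pts in order:
        if len(best) == k and min_dist_to_box(q, lo, hi) > -best[0][0]:
            continue                # hyper-sphere does not reach this region
        for p in pts:
            d = np.linalg.norm(p - q)
            if len(best) < k:
                heapq.heappush(best, (-d, tuple(p)))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, tuple(p)))
    return sorted(-nd for nd, _ in best)    # the k smallest distances

# two regions: the left and right halves of the unit square
rng = np.random.default_rng(2)
pts = rng.random((60, 2))
regions = [(np.array([0.0, 0.0]), np.array([0.5, 1.0]), pts[pts[:, 0] < 0.5]),
           (np.array([0.5, 0.0]), np.array([1.0, 1.0]), pts[pts[:, 0] >= 0.5])]
q = np.array([0.1, 0.2])
print(knn_with_region_pruning(regions, q, k=3))
```

For a query deep inside one region, the pruning test often rejects every other region, so only one region's points are scanned.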


Region Search

[Diagram: a region search over the four regions of the partitioned space.]


Experimental Setup

• PC with 3.6 GHz CPU, 3 GB RAM, and 280 GB disk.

• Page size was 8 KB.

• Normalized D-dimensional space [0, 1]^D.

• The GammaNN implementations with and without explicit clustering are referred to here as ‘data aware’ and ‘data blind’ algorithms, respectively.

• Comparison with Sequential Scan and VA-File.


Datasets

• Data:
– Synthetic data
• Up to 100 dimensions, 100,000 points.
• Distributed across 11 clusters: one in the center and 10 in random corners of the space.
– Real data
• 54-dimensional, 580,900 points, forest cover type (“covtype”).
• Distributed across 11 different classes.
• UCI Machine Learning Repository.


Metrics

• Pre-processing time
– Time of space partitioning, I/O, and data loading (i.e., the construction of indices plus insertion of data).
– For the VA-File, only the time to generate the vector approximation file.

• Performance
– Average page accesses for k-NN queries.
– Time to process k-NN queries.


Experimental Results

[Bar chart: pre-processing time for the three algorithms (Data Blind, Data Aware, VA-File) on the covtype data (54 dimensions, 580,900 points); y-axis: time in seconds, 0–160.]

Performance Synthetic Data

[Line charts: cumulative time and average page accesses for 100 queries (10-NN, synthetic data) as the number of dimensions grows from 10 to 100, comparing Data Blind, Data Aware, Sequential Scan, and VA-File.]


Performance Real Data

[Bar charts: average page accesses and cumulative time for 100 queries (10-NN, real data) for Sequential Scan, Data Blind, VA-File, and Data Aware.]


Progress with k in k-NN

[Line charts: query time and average page accesses as the number of nearest neighbors k grows (1, 10, 100) on real data, comparing Data Blind, Data Aware, Sequential Scan, and VA-File.]


Incremental Load of Data

[Line charts: cumulative time and average page accesses versus the number of points (200k–500k), comparing Data Aware incremental load against Data Aware full load.]


Conclusions

• Comparison of the data-sensitive and data-blind approach clearly highlights the importance of clustering data on storage for efficient similarity search.

• Our approach can support exact similarity search while accessing only a small fraction of data.

• The algorithm is very efficient at high dimensionalities and performs better than both the sequential scan and the VA-File technique.

• The performance remains good even after incremental loads of data without re-clustering.


Current and Future Work

• Incorporate R-trees or A-trees in place of KDB-trees.

• Provide facility for handling data with missing values.


References

1. Fagin, R., Kumar, R., Sivakumar, D.: Efficient similarity search and classification via rank aggregation. Proc. ACM SIGMOD Conf. (2003) 301-312

2. Orlandic, R., Lukaszuk, J.: Efficient high-dimensional indexing by superimposing space-partitioning schemes. Proc. 8th International Database Engineering & Applications Symposium IDEAS'04 (2004) 257-264

3. Orlandic, R., Lai, Y., Yee, W.G.: Clustering high-dimensional data using an efficient and effective data space reduction. Proc. ACM Conference on Information and Knowledge Management CIKM'05 (2005) 201-208

4. Jagadish, H.V., Ooi, B.C., Tan, K.L., Yu, C., Zhang, R.: iDistance: an adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems 30(2) (2005) 364-395

5. Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. Proc. 24th VLDB Conf. (1998) 194-205

6. Sakurai, Y., Yoshikawa, M., Uemura, S., Kojima, H.: The A-tree: an index structure for high-dimensional spaces using relative approximation. Proc. 26th VLDB Conf. (2000) 516-526


Questions?

[email protected]

http://cs.iit.edu/~egalite