iDistance -- Indexing the Distance: An Efficient Approach to KNN Indexing

C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish. Indexing the distance: an efficient method to KNN processing, VLDB 2001.


Page 1: iDistance -- Indexing the Distance

An Efficient Approach to KNN Indexing

C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish. Indexing the distance: an efficient method to KNN processing, VLDB 2001.

Page 2: Query Requirement

• Similarity queries: similarity range queries and KNN queries

• Similarity range query: Given a query point, find all data points within a given distance r of the query point.

• KNN query: Given a query point, find its K nearest neighbours by distance.

[Figure: a query point with search radius r and its Kth nearest neighbour (Kth NN).]
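The two query types can be illustrated with a linear-scan baseline. This is only a minimal sketch to pin down the definitions, not the indexed method from the slides; the function names `dist`, `range_query`, and `knn_query` are chosen here for illustration.

```python
import heapq
import math

def dist(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_query(points, q, r):
    """All points within distance r of q (linear scan)."""
    return [p for p in points if dist(p, q) <= r]

def knn_query(points, q, k):
    """The k points closest to q, nearest first (linear scan)."""
    return heapq.nsmallest(k, points, key=lambda p: dist(p, q))

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
in_range = range_query(points, (0.0, 0.0), 1.5)
nearest = knn_query(points, (0.0, 0.0), 2)
```

Both functions touch every data point, which is the cost an index such as iDistance aims to avoid.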

Page 3: Other Methods

• SS-tree: R-tree based index structure; uses bounding spheres in internal nodes

• Metric-tree: R-tree based, but uses metric distance and bounding spheres

• VA-file: uses compression via bit strings for sequential filtering of unwanted data points

• Psphere-tree: two-level index structure; uses clusters and duplicates data based on sample queries; intended for approximate KNN

• A-tree: R-tree based, but uses relative bounding boxes

• Problems: hard to integrate into existing DBMSs

Page 4: Basic Definition

• Euclidean distance: dist(p, q) = sqrt((p_1 - q_1)^2 + ... + (p_d - q_d)^2) for d-dimensional points p and q

• Relationship between data points (triangle inequality): dist(p, q) <= dist(p, o) + dist(o, q)

• Theorem 1: Let q be the query object, Oi the reference point for partition i, and p an arbitrary point in partition i. If dist(p, q) <= querydist(q) holds, then it follows that dist(Oi, q) - querydist(q) <= dist(Oi, p) <= dist(Oi, q) + querydist(q).
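Theorem 1 can be derived from the triangle inequality of a metric distance. The following is a short sketch of that argument, writing r for querydist(q) and assuming dist(p, q) <= r:

```latex
\begin{align*}
\mathrm{dist}(O_i, p) &\le \mathrm{dist}(O_i, q) + \mathrm{dist}(q, p)
  \le \mathrm{dist}(O_i, q) + r, \\
\mathrm{dist}(O_i, q) &\le \mathrm{dist}(O_i, p) + \mathrm{dist}(p, q)
  \le \mathrm{dist}(O_i, p) + r
  \;\Longrightarrow\;
  \mathrm{dist}(O_i, q) - r \le \mathrm{dist}(O_i, p).
\end{align*}
```

Together the two lines give the lower and upper bound on dist(Oi, p), so only points whose distance to Oi falls in that interval need to be examined.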

       

Page 5: Basic Concept of iDistance

• Indexing points based on similarity: y = i * c + dist(Si, p)

[Figure: reference/anchor points S1, S2, S3, ..., Sk, Sk+1 laid out along the one-dimensional key axis, with labels d, S1+d and c.]
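The mapping y = i * c + dist(Si, p) can be written out directly. The sketch below uses 0-based partition numbers and hypothetical reference points, and assumes the constant c is larger than any distance from a point to its reference point so that the key ranges of different partitions do not overlap.

```python
import math

def dist(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def idistance_key(i, c, ref_point, p):
    """One-dimensional key y = i * c + dist(Si, p) for point p in partition i."""
    return i * c + dist(ref_point, p)

refs = [(0.0, 0.0), (1.0, 1.0)]  # hypothetical reference points
c = 10.0                         # assumed to exceed every in-partition distance

key_a = idistance_key(0, c, refs[0], (0.3, 0.4))  # partition 0, distance about 0.5
key_b = idistance_key(1, c, refs[1], (1.0, 0.0))  # partition 1, distance 1.0
```

Because keys from partition i fall in [i * c, (i + 1) * c), a single ordered structure can hold all partitions side by side.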

Page 6: iDistance

• Data points are partitioned into clusters/partitions.

• Each partition has a reference point that every data point in the partition refers to.

• Data points are indexed by their similarity (metric distance) to that reference point, using a classical B+-tree.

• Iterative range queries are used in KNN searching.
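The bullets above can be sketched with a sorted list standing in for the B+-tree. This is an illustrative assumption, not the slides' implementation: `build_index` and `key_range` are hypothetical names, each point is assigned to its nearest reference point, and binary search via `bisect` plays the role of a B+-tree range scan.

```python
import bisect
import math

def dist(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def build_index(points, refs, c):
    """Return parallel sorted lists (keys, pts), one entry per data point."""
    entries = []
    for p in points:
        # Partition = index of the nearest reference point (an assumed rule).
        i = min(range(len(refs)), key=lambda j: dist(refs[j], p))
        entries.append((i * c + dist(refs[i], p), p))
    entries.sort()
    return [k for k, _ in entries], [p for _, p in entries]

def key_range(keys, pts, lo, hi):
    """Points whose key lies in [lo, hi], located by binary search."""
    return pts[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]

refs = [(0.0, 0.0), (1.0, 1.0)]
points = [(0.1, 0.0), (0.9, 1.0), (0.0, 0.4), (1.0, 0.5)]
keys, pts = build_index(points, refs, c=10.0)
partition_1 = key_range(keys, pts, 10.0, 20.0)  # everything stored under refs[1]
```

A real deployment would store the same keys in the DBMS's own B+-tree, which is what makes the method easy to integrate.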

Page 7: KNN Searching

• The search region is enlarged until the K nearest neighbours are found.

[Figure: partitions S1, S2, ...; each searched region corresponds to a range in the B+-tree.]

Page 8: KNN Searching

[Figure: query point q with increasing search radius r, intersecting the partitions of reference points S1 and S2. Labels: dist(S1, q), dist(S2, q), Dis_min(S1), Dis_max(S1), Dis_min(S2), Dis_max(S2). The same quantities are shown again along the one-dimensional key axis.]
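For a given radius r, Theorem 1 bounds the part of each partition's key range that has to be scanned. The sketch below assumes that `dis_min` and `dis_max` are the smallest and largest distances from the partition's points to its reference point; the function name `search_interval` is hypothetical.

```python
def search_interval(i, c, d_ref_q, r, dis_min, dis_max):
    """Key interval of partition i that can hold points within r of the query.

    d_ref_q is dist(Si, q). Returns None when the query sphere cannot
    contain any point of the partition.
    """
    lo = max(dis_min, d_ref_q - r)   # lower bound from Theorem 1
    hi = min(dis_max, d_ref_q + r)   # upper bound from Theorem 1
    if lo > hi:
        return None
    return (i * c + lo, i * c + hi)

# Query far from the partition: nothing to scan until r grows.
too_small = search_interval(1, 10.0, d_ref_q=2.0, r=0.5, dis_min=0.0, dis_max=1.0)
# Larger radius: the outer ring of the partition becomes a B+-tree range.
ring = search_interval(1, 10.0, d_ref_q=2.0, r=1.5, dis_min=0.0, dis_max=1.0)
```

As r increases, the interval widens on both sides until it is clipped by the partition's own distance bounds.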

Page 9: KNN Searching

[Figure: example query Q2.]

Page 10: Over Search?

Inefficient situation:

• When K = 3, the query sphere with radius r retrieves 3 NNs.

• Among them, only o1 can be guaranteed to be a true NN. Hence the search continues with an enlarged r until r > dist(q, o3).

[Figure: reference point S, query point q with radius r, data points o1, o2, o3, and dist(S, q).]

Page 11: Stopping Criterion

• Theorem 2: The KNN search algorithm terminates when K NNs are found and the answers are correct.

Case 1: dist(furthest(KNN’), q) < r

Case 2: dist(furthest(KNN’), q) > r

[Figure: query sphere of radius r; in Case 2 the Kth neighbour is still uncertain.]
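The iterative search and its stopping test can be sketched end to end. This is a simplified illustration under stated assumptions, not the authors' algorithm: a sorted list stands in for the B+-tree, each round rescans from scratch instead of extending the previous ranges, and `delta_r` (radius step) and `max_r` (a bound on the data-space diameter) are hypothetical parameters. The loop stops once K candidates are held and the furthest of them lies within r, which corresponds to Case 1.

```python
import bisect
import math

def dist(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def build_index(points, refs, c):
    """Sorted (keys, pts) lists; each point goes to its nearest reference point."""
    entries = []
    for p in points:
        i = min(range(len(refs)), key=lambda j: dist(refs[j], p))
        entries.append((i * c + dist(refs[i], p), p))
    entries.sort()
    return [k for k, _ in entries], [p for _, p in entries]

def knn_search(keys, pts, refs, c, q, k, delta_r, max_r):
    """Enlarge r step by step until the k nearest neighbours are certain."""
    r = delta_r
    while True:
        candidates = []
        for i, ref in enumerate(refs):
            d = dist(ref, q)
            # Theorem 1: relevant points have dist(Si, p) in [d - r, d + r].
            start = bisect.bisect_left(keys, i * c + max(0.0, d - r))
            end = min(bisect.bisect_right(keys, i * c + d + r),
                      bisect.bisect_left(keys, (i + 1) * c))  # stay in partition i
            candidates.extend(pts[start:end])
        best = sorted(candidates, key=lambda p: dist(p, q))[:k]
        # Case 1: the furthest candidate lies inside the sphere, so no
        # unseen point can be closer and the answer is final.
        if len(best) == k and dist(best[-1], q) <= r:
            return best
        if r >= max_r:  # whole space covered: fewer than k points exist
            return best
        r += delta_r    # Case 2: enlarge the search region and retry

points = [(x / 4, y / 4) for x in range(5) for y in range(5)]
refs = [(0.0, 0.0), (1.0, 1.0)]
keys, pts = build_index(points, refs, c=10.0)
result = knn_search(keys, pts, refs, 10.0, (0.3, 0.6), k=3, delta_r=0.1, max_r=2.0)
```

A smaller `delta_r` reads fewer unnecessary points per round but needs more rounds, which is the trade-off behind the "over search" situation described earlier.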

Page 12: Space-based Partitioning: Equal-partitioning

[Figure: two panels, (centroid of hyperplane, closest distance) and (external point, closest distance).]

Page 13: Space-based Partitioning: Equal-partitioning from furthest points

[Figure: two panels, (centroid of hyperplane, furthest distance) and (external point, furthest distance).]

Page 14: Effect of Reference Points on Query Space

• Using an external point to reduce the search area

Page 15: Effect on Query Space

• Using (centroid, furthest distance) can greatly reduce the search area.

[Figure: the area bounded by the arcs is the affected search area.]

Page 16: Data-based Partitioning I

• Using cluster centroids as reference points

[Figure: clustered data in a two-dimensional space with both axes running from 0 to 1.0.]

Page 17: Data-based Partitioning II

• Using edge points as reference points

[Figure: clustered data in a two-dimensional space with both axes running from 0 to 1.0.]

Page 18: Performance Study: Effect of Search Radius

• 100K uniform data set

• Using (external point, furthest distance)

• Effect of search radius on query accuracy

[Figure: three panels, Dimension = 8, Dimension = 16, Dimension = 30.]

Page 19: I/O Cost vs Search Radius

• 10-NN queries on 100K uniform data sets

• Using (external point, furthest distance)

• Effect of search radius on query cost

Page 20: Effect of Reference Points

• 10-NN queries on 100K 30-d uniform data set

• Different reference points

Page 21: Effect of Clustered # of Partitions on Accuracy

• KNN queries on 100K 30-d clustered data set

• Effect of query radius on query accuracy for different numbers of partitions

Page 22: Effect of # of Partitions on I/O and CPU Cost

• 10-NN queries on 100K 30-d clustered data set

• Effect of # of partitions on I/O and CPU costs

Page 23: Effect of Data Sizes

• KNN queries on 100K and 500K 30-d clustered data sets

• Effect of query radius on query accuracy for different data set sizes

Page 24: Effect of Clustered Data Sets

• 10-NN queries on 100K and 500K 30-d clustered data sets

• Effect of query radius on query cost for different data set sizes

Page 25: Effect of Reference Points on Clustered Data Sets

• 10-NN queries on 100K 30-d clustered data set

• Effect of reference points: cluster edge vs cluster centroid

Page 26: iDistance ideal for Approximate KNN?

• 10-NN queries on 100K and 500K 30-d clustered data sets

• Query cost at varying query accuracy for different data set sizes

Page 27: Performance Study -- Compare iMinMax and iDistance

• 10-NN queries on 100K 30-d clustered data sets

• C. Yu, B. C. Ooi, K. L. Tan. Progressive KNN search using B+-trees.

Page 28: iDistance vs A-tree

Page 29: iDistance vs A-tree

Page 30: Summary of iDistance

• iDistance is simple but efficient.

• It is a metric-based index.

• The index can be integrated into existing systems easily.