Feature Based Similarity


Christian Böhm, Bernhard Braunmüller, Florian Krebs, and Hans-Peter Kriegel, University of Munich

Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data

Feature Based Similarity

Simple Similarity Queries

Specify a query object and

• Find similar objects (range query)

• Find the k most similar objects (nearest neighbor query)
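The two query types can be sketched as linear scans over a list of feature vectors. This is only an illustration, assuming Euclidean distance on the feature vectors; the function names `range_query` and `knn_query` are hypothetical and do not come from the slides.

```python
import heapq
import math


def euclidean(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def range_query(data, query, eps):
    """Return all objects whose distance to the query is at most eps."""
    return [p for p in data if euclidean(p, query) <= eps]


def knn_query(data, query, k):
    """Return the k objects closest to the query, nearest first."""
    return heapq.nsmallest(k, data, key=lambda p: euclidean(p, query))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
near = range_query(points, (0.0, 0.0), 1.5)
closest = knn_query(points, (0.0, 0.0), 3)
```

Both functions touch every object once, so they describe what the queries return rather than how an index would answer them efficiently.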

Join Applications: Catalogue Matching

Catalogue matching

• E.g. astronomical catalogues

[Figure: two point sets, R and S]

Join Applications: Clustering

Clustering (e.g. DBSCAN)

Similarity self-join
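For reference, a similarity self-join can be written as a nested-loop baseline that reports every pair of distinct points within distance ε. The sketch below assumes Euclidean distance; `similarity_self_join` is a hypothetical name, and this quadratic baseline is not the algorithm presented in the talk.

```python
import math
from itertools import combinations


def similarity_self_join(points, eps):
    """Return index pairs (i, j), i < j, of points at distance <= eps."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if math.dist(p, q) <= eps
    ]


points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.0, 5.4)]
pairs = similarity_self_join(points, 1.0)
```

A density-based clustering method such as DBSCAN needs exactly this neighborhood information for every point, which is why the self-join can serve as its basic operation.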

Grid partitioning

General idea: Grid approximation where grid line distance = ε

Similar idea in the ε-kdB-tree [Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]

Disadvantage of any grid approach: Number of neighboring grid cells: 3^d - 1
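The neighbor count can be checked with a small sketch. It assumes a grid anchored at the origin with cell side length ε, so that a point's cell is obtained by flooring each coordinate divided by ε; the helper names are hypothetical.

```python
import math
from itertools import product


def grid_cell(point, eps):
    """Integer grid coordinates of the cell containing the point."""
    return tuple(math.floor(x / eps) for x in point)


def neighbor_cells(cell):
    """All cells adjacent to `cell` (sharing a face, edge, or corner)."""
    offsets = product((-1, 0, 1), repeat=len(cell))
    return [
        tuple(c + o for c, o in zip(cell, off))
        for off in offsets
        if any(off)  # skip the all-zero offset, i.e. the cell itself
    ]


cell = grid_cell((0.35, 1.2), 0.5)
count_2d = len(neighbor_cells((0, 0)))
count_8d = len(neighbor_cells((0,) * 8))
```

In two dimensions a cell has 8 neighbors, but in eight dimensions it already has 3^8 - 1 = 6560, which is the growth the slide identifies as the weakness of any grid approach.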

Scalability of the ε-kdB-tree

Assumption: two adjacent ε-stripes fit in main memory. This is unrealistic for large data sets that are

• clustered,

• skewed, and

• high-dimensional

Epsilon Grid Order

ε-Grid-Order Is a Total Strict Order

Strict order:

• Irreflexivity

• Transitivity

• Asymmetry

The ε-grid-order can be used in any sorting algorithm
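The slide that defines the order did not survive in this transcript. The sketch below therefore rests on an assumption: that the ε-grid-order compares two points lexicographically by the coordinates of their grid cells, with cell side length ε and the grid anchored at the origin. The names `ego_less` and `ego_sort` are hypothetical.

```python
import math


def cell_key(point, eps):
    """Grid cell coordinates of a point (assumed grid anchored at origin)."""
    return tuple(math.floor(x / eps) for x in point)


def ego_less(p, q, eps):
    """True if p precedes q: the first differing cell coordinate decides."""
    for a, b in zip(cell_key(p, eps), cell_key(q, eps)):
        if a != b:
            return a < b
    return False  # same cell: neither point precedes the other


def ego_sort(points, eps):
    """Sort with any standard algorithm, using the cell key as sort key."""
    return sorted(points, key=lambda p: cell_key(p, eps))


pts = [(0.6, 0.0), (0.1, 0.9), (0.2, 0.1)]
ordered = ego_sort(pts, 0.5)
```

Under this assumed definition, irreflexivity and asymmetry follow directly from the comparison of integer tuples, and sorting needs nothing beyond a comparison-based sort.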

ε-Interval

Coarse approximation of join mates: used for I/O processing
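One way to picture the ε-interval is sketched below. It assumes that the interval of a point p runs from p - (ε, ..., ε) to p + (ε, ..., ε) in the sort order, and that the order is the lexicographic order of grid cells with side length ε; neither detail is spelled out in this transcript. The helper names are hypothetical.

```python
import bisect
import math


def cell_key(point, eps):
    """Grid cell coordinates of a point (assumed grid anchored at origin)."""
    return tuple(math.floor(x / eps) for x in point)


def eps_interval(p, eps):
    """Assumed interval bounds: p shifted by -eps and +eps in every dimension."""
    lower = tuple(x - eps for x in p)
    upper = tuple(x + eps for x in p)
    return lower, upper


def candidates(sorted_points, p, eps):
    """Points of the sorted file that fall inside the interval of p."""
    keys = [cell_key(q, eps) for q in sorted_points]
    lower, upper = eps_interval(p, eps)
    lo = bisect.bisect_left(keys, cell_key(lower, eps))
    hi = bisect.bisect_right(keys, cell_key(upper, eps))
    return sorted_points[lo:hi]


# Already sorted by cell key for eps = 1.0.
data = [(0.5, 0.5), (0.5, 3.5), (1.5, 0.5), (5.5, 5.5)]
cands = candidates(data, (1.2, 0.4), 1.0)
```

In this example the candidate range contains (0.5, 3.5), which is far more than ε away from the query point. That is what makes the approximation coarse: it never misses a join mate, but it admits false positives that a later distance test must discard.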

I/O Processing for the Self Join

Decompose the sorted file into I/O units

Epsilon Grid Order

CPU Processing

I/O units are further decomposed before joining

• Simple divide-and-conquer: no further sorting

• Decomposition: maximize active dimensions

CPU Processing

Point distance computations: order of dimensions

• Neighboring inactive dimensions

• Unspecified dimensions

• Active dimension

• Aligned inactive dimensions
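The order of dimensions matters because a distance test against ε can stop as soon as the partial sum of squared differences exceeds ε². The sketch below shows only this general early-termination idea, assuming Euclidean distance; `dim_order` is a hypothetical parameter, and how the four classes of dimensions listed above map onto it is not specified in this transcript.

```python
def within_eps(p, q, eps, dim_order):
    """Test whether dist(p, q) <= eps, visiting dimensions in dim_order.

    Returns (result, evaluated), where `evaluated` is the number of
    dimensions inspected before the decision was reached.
    """
    limit = eps * eps
    total = 0.0
    for evaluated, i in enumerate(dim_order, start=1):
        diff = p[i] - q[i]
        total += diff * diff
        if total > limit:
            return False, evaluated  # partial sum already exceeds eps^2
    return True, len(dim_order)


p = (0.0, 0.0, 0.0)
q = (0.1, 0.1, 5.0)
good_order = within_eps(p, q, 1.0, [2, 0, 1])  # large difference first
poor_order = within_eps(p, q, 1.0, [0, 1, 2])  # large difference last
```

With the dimension of largest difference visited first, the pair is rejected after one dimension instead of three, which illustrates why the visiting order is worth choosing deliberately.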

Experimental Results

8-dimensional uniformly distributed vectors

Experimental Results (2)

16-d feature vectors from CAD application

Conclusions

Summary

• High potential for performance gains of the similarity join through page capacity optimization

• Necessary to separately optimize I/O and CPU

Future research potential

• Similarity join for metric index structures

• Approximate similarity join

• Parallel similarity join algorithms
