Similarity Search in High Dimensions via Hashing
Aristides Gionis, Piotr Indyk, Rajeev Motwani
Presented by:
Fatih Uzun
Outline
• Introduction
• Problem Description
• Key Idea
• Experiments and Results
• Conclusions
Introduction
• Similarity Search over High-Dimensional Data
– Image databases, document collections, etc.
• Curse of Dimensionality
– All space-partitioning techniques degrade to linear search in high dimensions
• Exact vs. Approximate Answer
– An approximate answer may be good enough and much faster
– Time-quality trade-off
Problem Description
• ε-Nearest Neighbor Search (ε-NNS)
– Given a set P of points in a normed space, preprocess P so as to efficiently return a point p ∈ P for any given query point q, such that
• dist(q,p) ≤ (1 + ε) · min_{r ∈ P} dist(q,r)
• Generalizes to K-nearest neighbor search (K > 1)
Key Idea
• Locality-Sensitive Hashing (LSH) to get sub-linear dependence on the data size for high-dimensional data
• Preprocessing:
– Hash the data points using several LSH functions, so that the probability of collision is higher for closer objects
Algorithm : Preprocessing
• Input
– Set of N points { p1, …, pN }
– L (number of hash tables)
• Output
– Hash tables Ti, i = 1, 2, …, L
• Foreach i = 1, 2, …, L
– Initialize Ti with a random hash function gi(.)
• Foreach i = 1, 2, …, L
– Foreach j = 1, 2, …, N
• Store point pj in bucket gi(pj) of hash table Ti
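The preprocessing steps above can be sketched in Python. This is a minimal illustration, assuming binary points in Hamming space with bit-sampling hash functions (the setting used in the paper); the names `make_hash_function` and `preprocess` are illustrative, not from the paper.

```python
import random
from collections import defaultdict

def make_hash_function(k, d):
    """One LSH function g(.): sample k random coordinates of a d-bit point
    and concatenate them (bit sampling for Hamming space)."""
    coords = random.sample(range(d), k)
    return lambda p: tuple(p[c] for c in coords)

def preprocess(points, L, k):
    """Build L hash tables; store each point pj in bucket gi(pj) of table Ti."""
    d = len(points[0])
    gs = [make_hash_function(k, d) for _ in range(L)]      # random g1 .. gL
    tables = [defaultdict(list) for _ in range(L)]         # T1 .. TL
    for p in points:
        for g, T in zip(gs, tables):
            T[g(p)].append(p)                              # bucket gi(p) of Ti
    return gs, tables
```

Each table uses a different random g, so near-duplicate points that unluckily separate in one table still have a good chance of colliding in another.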
LSH - Algorithm
[Diagram: each point pi of P is hashed by g1(pi), g2(pi), …, gL(pi) into a bucket of hash tables T1, T2, …, TL]
Algorithm : ε-NNS Query
• Input
– Query point q
– K (number of approximate nearest neighbors)
• Access
– Hash tables Ti, i = 1, 2, …, L
• Output
– Set S of K (or fewer) approximate nearest neighbors
• S ← ∅
• Foreach i = 1, 2, …, L
– S ← S ∪ { points found in bucket gi(q) of hash table Ti }
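The query procedure can be sketched as follows. A self-contained Python illustration, assuming Hamming-space points and hand-built hash tables (the `query` name and the tiny example tables are mine, not the paper's; the paper additionally bounds the number of candidates inspected, which is omitted here):

```python
from collections import defaultdict

def query(q, gs, tables, K):
    """S <- union over i of points in bucket gi(q) of table Ti,
    then keep the K closest candidates by Hamming distance."""
    S = set()
    for g, T in zip(gs, tables):
        S.update(T.get(g(q), []))            # points colliding with q in Ti
    hamming = lambda p: sum(a != b for a, b in zip(p, q))
    return sorted(S, key=hamming)[:K]        # K (or fewer) approx. neighbors

# Tiny hand-built example: two tables hashing on coordinates (0,1) and (2,3).
g1 = lambda p: (p[0], p[1])
g2 = lambda p: (p[2], p[3])
tables = [defaultdict(list), defaultdict(list)]
for p in [(0, 0, 1, 1), (0, 0, 1, 0), (1, 1, 0, 0)]:
    tables[0][g1(p)].append(p)
    tables[1][g2(p)].append(p)
```

Only the buckets that q itself hashes to are touched, which is what makes the query cost sub-linear in the data size.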
LSH - Analysis
• Family H of (r1, r2, p1, p2)-sensitive functions { hi(.) }, with p1 > p2 and r1 < r2:
– if dist(p,q) < r1, then ProbH [h(q) = h(p)] ≥ p1
– if dist(p,q) ≥ r2, then ProbH [h(q) = h(p)] ≤ p2
• LSH functions: gi(.) = ( h1(.), …, hk(.) )
• For a proper choice of k and L, a simpler problem, the (r, ε)-Neighbor problem, and hence the actual problem can be solved
• Query time: O(d · n^(1/(1+ε)))
– d : dimensions, n : data size
Experiments
• Data Sets
– Color images from the COREL Draw library (20,000 points, dimensions up to 64)
– Texture information of aerial photographs (270,000 points, 60 dimensions)
• Evaluation
– Speed, miss ratio, and error (%) for various data sizes, dimensions, and K values
– Compare performance with the SR-Tree (a spatial data structure)
Performance Measures
• Speed
– Number of disk block accesses needed to answer the query (# of hash tables)
• Miss Ratio
– Fraction of cases in which fewer than K points are found for K-NNS
• Error
– Average fractional error in the distance to the point found by LSH, compared to the true nearest-neighbor distance, taken over the entire set of queries
Speed vs. Data Size (Approximate 1-NNS)
[Figure: disk accesses vs. number of database points (0 to 20,000) for LSH with error = 0.2, 0.1, 0.05, 0.02, and for SR-Tree]
Speed vs. Dimension (Approximate 1-NNS)
[Figure: disk accesses vs. dimensions (0 to 80) for LSH with error = 0.2, 0.1, 0.05, 0.02, and for SR-Tree]
Speed vs. Nearest Neighbors (Approximate K-NNS)
[Figure: disk accesses vs. number of nearest neighbors (0 to 120) for LSH with error = 0.2, 0.1, 0.05]
Speed vs. Error
[Figure: disk accesses vs. error (10 to 50 %) for SR-Tree and LSH]
Miss Ratio vs. Data Size (Approximate 1-NNS)
[Figure: miss ratio (0 to 0.25) vs. number of database points (0 to 20,000) for error = 0.1 and 0.05]
Conclusion
• Better query time than spatial data structures
• Scales well to higher dimensions and larger data sizes (sub-linear dependence)
• Predictable running time
• Extra storage overhead
• Inefficient for data whose distances are concentrated around the average
Future Work
• Investigate hybrid data structures obtained by merging tree-based and hash-based structures
• Make use of the structure of the data set to systematically obtain LSH functions
• Explore other applications of LSH-type techniques to data mining
Questions?