NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms
Liang Jin (UC Irvine), Nick Koudas (AT&T Labs Research), Chen Li (UC Irvine)
Supported by NSF CAREER No. IIS-0238586
EDBT 2004
NN (nearest-neighbor) search
KNN: find the k nearest neighbors of an object q.
NN-join: for each object in the 1st dataset D1, find the k nearest neighbors in the 2nd dataset D2.
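As a baseline, the two query types above can be stated as a brute-force scan (a hypothetical sketch with illustrative helper names; the algorithms in this talk avoid the full scan by using an index):

```python
import math

def dist(a, b):
    # Euclidean distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(q, data, k):
    # KNN: the k nearest neighbors of q, found by a full scan
    return sorted(data, key=lambda o: dist(q, o))[:k]

def nn_join(d1, d2, k):
    # NN-join: for each object in d1, its k nearest neighbors in d2
    return {o1: knn(o1, d2, k) for o1 in d1}

d1 = [(0, 1), (6, 5)]
d2 = [(0, 0), (1, 1), (5, 5)]
print(nn_join(d1, d2, 2))
```

Both scans cost O(|D|) distance computations per query object, which is what the tree-based algorithms below improve on.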
Example: image search
Images represented as features (color histogram, texture moments, etc.)
Similarity search using these features: "Find the 10 most similar images for the query image"
Other applications:
Web-page search: "Find the 100 most similar pages for a given page"
GIS: "Find the 5 closest cities to Irvine"
Data cleaning
NN algorithms
Distance measurement:
For point objects, distance is well defined (usually Euclidean; other distances possible)
For arbitrarily-shaped objects, assume we have a distance function between them
Most algorithms assume a high-dimensional tree structure on the datasets (e.g., an R-tree).
Search process (1-NN for example)
Most algorithms traverse the structure (e.g., an R-tree) top down, following a branch-and-bound approach:
Keep a priority queue of nodes (MBRs) to be visited, sorted by the minimum distance between q and each node
Improvement: use MINDIST and MINMAXDIST to reduce the queue size and avoid unnecessary disk I/Os to access MBRs
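The branch-and-bound traversal described above can be sketched as follows, assuming a toy tree of nested MBRs represented as dicts (the node layout and helper names are illustrative, not the paper's code):

```python
import heapq
import math

def mindist(q, mbr):
    # MINDIST: smallest distance from point q to rectangle mbr = (lo, hi)
    lo, hi = mbr
    return math.sqrt(sum(
        (l - x) ** 2 if x < l else (x - h) ** 2 if x > h else 0.0
        for x, l, h in zip(q, lo, hi)))

def nn_search(q, root):
    # Best-first 1-NN: a priority queue of nodes ordered by MINDIST(q, node).
    # A branch is discarded when its MINDIST can no longer beat the best object.
    best, best_d = None, float("inf")
    heap = [(0.0, id(root), root)]      # 0.0 is a safe lower bound for the root
    while heap:
        d, _, node = heapq.heappop(heap)
        if d >= best_d:                 # nothing closer can be in this branch
            break
        if node["leaf"]:
            for p in node["points"]:
                pd = math.dist(q, p)
                if pd < best_d:
                    best, best_d = p, pd
        else:
            for child in node["children"]:
                heapq.heappush(heap, (mindist(q, child["mbr"]), id(child), child))
    return best, best_d

leaf1 = {"leaf": True, "points": [(0, 0), (1, 2)], "mbr": ((0, 0), (1, 2))}
leaf2 = {"leaf": True, "points": [(9, 9)], "mbr": ((9, 9), (9, 9))}
root = {"leaf": False, "children": [leaf1, leaf2]}
print(nn_search((1, 1), root))
```

The heap entries carry `id(node)` as a tie-breaker so equal-distance nodes never trigger dict comparison; the queue growth in this loop is exactly the memory problem the next slide quantifies.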
Problem
Queue size may be large:
60,000 objects, 32-d (image) vectors, 50 NNs: max queue size 15K entries; avg queue size about half (7.5K entries)
If the queue can't fit in memory, more disk I/Os!
Problem worse for k-NN joins:
E.g., a 1500 x 1500 join: max queue size 1.7M entries (>= 1GB memory!), 750 seconds to run
Couldn't scale up to 2000 objects: disk thrashing
Our Solution: Nearest-Neighbor Histogram (NNH)
Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work
NNH: Nearest-Neighbor Histograms
Pivots p1, p2, ..., pm (m: # of pivots); for each pivot, the distances of its nearest neighbors: r1, r2, ...
The pivots are not part of the database
Structure
Nearest-neighbor vector of a pivot p: v(p) = (r1, ..., rT), where each ri is the distance of p's i-th NN and T is the length of each vector
Nearest-neighbor histogram: collection of m pivots with their NN vectors
Outline
Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work
Estimate NN distance for a query object
NNH does not give exact NN information for an object, but we can estimate an upper bound q_est for the k-NN distance of q
Triangle inequality: the distance of q's k-th NN is at most d(q, pi) + H(pi, k), for each 1 <= i <= m (H(pi, k): the k-th entry of pivot pi's NN vector)
Estimate NN distance for a query object (cont'd)
Apply the triangle inequality to all pivots to get an upper-bound estimate of q's k-NN distance:
q_est = min_{1<=i<=m} ( d(q, pi) + H(pi, k) )
Complexity: O(m)
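A minimal sketch of the estimate, assuming H[i][k-1] holds the distance of pivot i's k-th NN (the function name and data layout are illustrative):

```python
import math

def nnh_estimate(q, pivots, H, k):
    # q_est = min over pivots p_i of ( d(q, p_i) + H(p_i, k) ),
    # an upper bound on the distance of q's k-th nearest neighbor
    return min(math.dist(q, p) + H[i][k - 1] for i, p in enumerate(pivots))

pivots = [(0.0, 0.0), (10.0, 10.0)]
H = [[1.0, 2.0], [1.5, 3.0]]   # H[i][k-1]: distance of pivot i's k-th NN
print(nnh_estimate((1.0, 0.0), pivots, H, 2))   # 1.0 + 2.0 = 3.0
```

One pass over the m pivots, matching the O(m) cost on the slide.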
Utilizing estimates in NN search
More pruning: prune an MBR if: q_est <= MINDIST(q, mbr)
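The pruning test might look like this (a sketch; `can_prune` and the MBR representation are assumptions, while MINDIST is the standard point-to-rectangle lower bound):

```python
import math

def mindist(q, lo, hi):
    # MINDIST: smallest possible distance from point q to the MBR [lo, hi]
    return math.sqrt(sum(max(l - x, x - h, 0.0) ** 2
                         for x, l, h in zip(q, lo, hi)))

def can_prune(q, lo, hi, q_est):
    # The MBR cannot contain any of q's k NNs if even its closest
    # point is no closer than the NNH upper bound q_est
    return q_est <= mindist(q, lo, hi)

print(mindist((0.0, 0.0), (3.0, 4.0), (5.0, 6.0)))         # 5.0
print(can_prune((0.0, 0.0), (3.0, 4.0), (5.0, 6.0), 4.0))  # True
```

A pruned MBR is never pushed onto the priority queue, which is how the estimate shrinks memory usage.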
Utilizing estimates in NN join
k-NN join: for each object o1 in D1, find its k nearest neighbors in D2.
Traverse the two trees top down; keep a queue of pairs
Utilizing estimates in NN join (cont'd)
Construct an NNH for D2. For each object o1 in D1, keep its estimated NN radius o1_est using the NNH of D2.
As in a k-NN query, ignore an MBR for o1 if: o1_est <= MINDIST(o1, mbr)
More powerful: prune MBR pairs
For every object o1 in mbr1: o1_est <= d(o1, pi) + H(pi, k) <= MAXDIST(mbr1, pi) + H(pi, k)
Therefore define: mbr1_est = min_{1<=i<=m} ( MAXDIST(mbr1, pi) + H(pi, k) ), an upper bound on the k-NN distance (in D2) of every object in mbr1
Prune MBR pairs (cont'd)
Prune the pair (mbr1, mbr2) if: mbr1_est <= MINDIST(mbr1, mbr2)
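The pair-pruning condition can be sketched with the standard MAXDIST and rectangle-to-rectangle MINDIST bounds (helper names and data layout are assumptions):

```python
import math

def maxdist(p, lo, hi):
    # MAXDIST: distance from point p to the farthest corner of the MBR [lo, hi]
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(p, lo, hi)))

def mindist_mbrs(lo1, hi1, lo2, hi2):
    # MINDIST between two MBRs: per-dimension gap, 0 if they overlap
    return math.sqrt(sum(max(l2 - h1, l1 - h2, 0.0) ** 2
                         for l1, h1, l2, h2 in zip(lo1, hi1, lo2, hi2)))

def mbr_estimate(lo1, hi1, pivots, H, k):
    # mbr1_est = min_i ( MAXDIST(mbr1, p_i) + H(p_i, k) ): an upper bound
    # on the k-NN distance of EVERY object inside mbr1
    return min(maxdist(p, lo1, hi1) + H[i][k - 1] for i, p in enumerate(pivots))

def can_prune_pair(lo1, hi1, lo2, hi2, pivots, H, k):
    # No object in mbr1 can have a k-NN inside mbr2 if even the closest
    # approach of the two rectangles exceeds the bound
    return mbr_estimate(lo1, hi1, pivots, H, k) <= mindist_mbrs(lo1, hi1, lo2, hi2)

print(can_prune_pair((0.0, 0.0), (1.0, 1.0),
                     (10.0, 0.0), (11.0, 1.0),
                     [(0.0, 0.0)], [[1.0]], 1))
```

Pruning whole pairs removes entire subtrees of candidate pairs from the join queue at once, which is why the gains are larger than for single-query pruning.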
Outline
Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work
NNH construction
If we have already selected the m pivots, just run a k-NN query for each of them to construct the NNH
Time: O(m) such queries, run offline
The important part is selecting the pivots: size-constraint construction; error-constraint construction (see paper)
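A sketch of the construction step, with a brute-force scan standing in for the offline k-NN queries (the helper name and layout are illustrative):

```python
import math

def build_nnh(pivots, data, T):
    # For each pivot, record the sorted distances to its T nearest
    # neighbors in the database (a full scan stands in for a k-NN query)
    return [sorted(math.dist(p, o) for o in data)[:T] for p in pivots]

H = build_nnh([(0.0, 0.0)], [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0)], 2)
print(H)   # [[1.0, 2.0]]
```

In practice each inner scan would be replaced by a tree-based k-NN query against the dataset's index.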
Size-constraint NNH construction
The number of pivots m determines: storage size, initial construction cost, and incremental-maintenance cost
Goal: choose the m "best" pivots
Size-constraint NNH construction
Given m (# of pivots), assume: query objects are drawn from the database D, and H(pi, k) doesn't vary too much across pivots, so the estimate d(q, pi) + H(pi, k) is dominated by d(q, pi)
Goal: find pivots p1, p2, ..., pm that minimize the total distance of objects to their closest pivots: sum over q in D of min_{1<=i<=m} d(q, pi)
Clustering problem: many algorithms available; we use k-means for its simplicity and efficiency
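A minimal Lloyd's k-means sketch for pivot selection under the stated objective (illustrative, not the paper's implementation):

```python
import math
import random

def kmeans_pivots(data, m, iters=20, seed=0):
    # Lloyd's k-means: the pivots are cluster centroids, which (locally)
    # minimize the total distance from each object to its nearest pivot
    rng = random.Random(seed)
    pivots = rng.sample(data, m)
    for _ in range(iters):
        clusters = [[] for _ in range(m)]
        for q in data:
            nearest = min(range(m), key=lambda i: math.dist(q, pivots[i]))
            clusters[nearest].append(q)
        # recompute each centroid; keep the old pivot if its cluster is empty
        pivots = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else pivots[i]
                  for i, cl in enumerate(clusters)]
    return pivots

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans_pivots(data, 2)))
```

Centroids land where objects cluster, which is exactly the property the experiments contrast against far-apart pivot selection.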
Incremental maintenance
How to update the NNH when inserting or deleting objects?
Need to "shift" each NN vector; associate a valid length Ei with each NN vector.
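One possible realization of the valid-length idea, under conservative assumptions: an insertion can only tighten a vector, while a deletion makes every entry from the affected position on unreliable. The class and its exact semantics are illustrative, not the paper's scheme:

```python
import bisect
import math

class NNHEntry:
    # One pivot's NN vector with a valid length E:
    # only the first E recorded distances are guaranteed correct.
    def __init__(self, pivot, radii):
        self.pivot = pivot
        self.r = list(radii)      # r[0] <= r[1] <= ... <= r[T-1]
        self.E = len(self.r)

    def insert(self, obj):
        # A new object can only tighten the vector: splice its distance in
        # and drop the largest entry, keeping the vector length at T.
        d = math.dist(self.pivot, obj)
        i = bisect.bisect_left(self.r, d)
        if i < len(self.r):
            self.r.insert(i, d)
            self.r.pop()
            if i <= self.E:
                self.E = min(self.E + 1, len(self.r))

    def delete(self, obj):
        # A deleted object may have been one of the recorded NNs; from its
        # position on, the distances are unreliable, so shrink E.
        d = math.dist(self.pivot, obj)
        i = bisect.bisect_left(self.r, d)
        if i < self.E:
            self.E = i

entry = NNHEntry((0.0, 0.0), [1.0, 2.0, 3.0])
entry.insert((1.5, 0.0))
print(entry.r, entry.E)
entry.delete((0.0, 1.0))
print(entry.E)
```

With these semantics, H(pi, k) stays usable for estimates only while k <= Ei; vectors whose valid length shrinks too far would be rebuilt offline.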
Outline
Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work
Experiments
Datasets:
Corel image database: 60,000 images, each represented by a 32-dimensional float vector
Time-series data from AT&T: similar trends; we report results for the Corel data set
Test bed: PC with 1.5GHz Athlon, 512MB memory, 80GB HD, Windows 2000; GNU C++ in CYGWIN
Goals
Is the pruning using NNH estimates powerful? (KNN queries, NN-join queries)
Is it "cheap" to have such a structure? (storage, initial construction, incremental maintenance)
Improvement in k-NN search
Ran the k-means algorithm to generate 400 pivots for the 60K objects, and constructed the NNH
Performed k-NN queries on 100 randomly selected query objects
Used queue size (max and average) to measure memory usage
Reduced Memory Requirement
Reduced running time
Effects of different # of pivots
Join: Reduced Memory Requirement
Join: Reduced running time
Join: Running time for different data sizes
Cost/Benefit of NNH (for 60,000 32-d float vectors; "~0" means almost zero)

Pivot # (m)                   10    50   100   150   200   250   300   350   400
Construction time (sec)      0.7  3.59   6.6   9.4  11.5  13.7  15.7  17.8  20.4
Storage space (kB)             2    10    20    30    40    50    60    70    80
Incr. maintenance time (ms)   ~0    ~0    ~0    ~0    ~0    ~0    ~0    ~0    ~0
Improved q-size, kNN (%)      40    30    28    24    24    24    23    20    18
Improved q-size, join (%)     45    34    28    26    26    25    24    24    22
Conclusion
NNH: an efficient, effective approach to improving NN-search performance.
Can be easily embedded into current implementations of NN algorithms.
Can be efficiently constructed and maintained.
Offers substantial performance advantages.
Related work
Summary histograms: e.g., [Jagadish et al. VLDB98], [Matias et al. VLDB00]; their objective is to approximate frequency values
NN search algorithms: many algorithms developed; many of them can benefit from NNH
Algorithms based on "pivots/foci/anchors": e.g., Omni [Filho et al. ICDE01], Vantage objects [Vleugels et al. VIIS99], M-trees [Ciaccia et al. VLDB97]. They choose pivots far from each other (to capture the "intrinsic dimensionality"); NNH pivots instead depend on how clustered the objects are. Experiments show the differences.
Work conducted in the Flamingo Project on Data Cleansing at UC Irvine