1
N EIGHBOURHOOD P RESERVING Q UANTISATION FOR LSH S EAN M ORAN ,V ICTOR L AVRENKO ,M ILES O SBORNE SEAN . MORAN @ ED . AC . UK R ESEARCH Q UESTION LSH uses 1 bit per hyperplane. Can we do better with multiple bits? I NTRODUCTION Problem: Fast Nearest Neighbour (NN) search in large datasets. Hashing-based approach: Generate a similarity preserving binary code (fingerprint). Use fingerprint as index into the buckets of a hash table. If collision occurs only compare to items in the same bucket. 110101 010111 111101 H H Query Compute Similarity Nearest Neighbours Query Database Hash Table Content Based IR Image: Imense Ltd Image: Doersch et al. Image: Xu et al. Location Recognition Near duplicate detection 010101 111101 ..... Advantages: Constant query time (with respect to the database size). Compact binary codes are extremely storage efficient. L OCALITY S ENSITIVE H ASHING (LSH) ( I NDYK AND M OTWANI , ’98) Randomized algorithm for approximate nearest neighbour (ANN) search using binary codes. Probabilistic guarantee on retrieval accuracy versus search time. LSH for inner product similarity: Divide input space with L random hyperplanes (e.g. L=2): x y n 2 n 1 h 1 h 2 11 01 00 10 – Projection: Take dot product of data-point (x) with normal (n.x): n 2 n 1 – Quantisation: Threshold and apply sign function: 0 1 t n 2 Vanilla LSH: each hyperplane gives 1 bit of the binary code. N EIGHBOURHOOD PRESERVING QUANTISATION (NPQ) Assigns multiple bits per hyperplane using multiple thresholds. F 1 optimisation using pairwise constraints matrix S : if S ij =1 then points x i , x j with projections y i , y j are true nearest neighbours. TP: # S ij =1 pairs in same region. FP: # S ij =0 pairs in same region. FN: # S ij =1 pairs in different regions. Combine TP, FP, FN using F 1 : F1 score: 1.00 00 01 10 11 t 1 t 2 t 3 n 2 F1 score: 0.44 00 01 10 11 t 1 t 2 t 3 n 2 Interpolate F 1 with a regularisation term Ω(T 1:u ): Z npq = αF 1 + (1 - α)(1 - Ω(T 1:u )) with: Ω(T 1:u )= 1 σ u X a=0 X i:y i r a {y i - μ r a } 2 where: σ = n X i=1 {y i - μ d } 2 d is dimension mean, μ r a is mean of region r a . Random restarts used to optimise Z npq . Time complexity O (N 2 ), where N is # data points in training dataset. E VALUATION PROTOCOL Task: Image retrieval on three image datasets: 22k LabelMe, CIFAR- 10 and 100k TinyImages. Images encoded with GIST descriptors. Projections: LSH, Shift-invariant kernel hashing (SIKH), Itera- tive Quantisation (ITQ), Spectral Hashing (SH) and PCA-Hashing (PCAH). Baselines: Single Bit Quantisation (SBQ), Manhattan Hashing (MQ) (Kong et al., ’12), Double-Bit quantisation (DBQ) (Kong and Li, ’12). Hamming Ranking: how well do we retrieve -NN of queries? Quantify using area under the precision-recall curve (AUPRC). R ESULTS AUPRC across different projection methods at 32 bits: Dataset LabelMe CIFAR TinyImages SBQ MQ DBQ NPQ SBQ MQ DBQ NPQ SBQ MQ DBQ NPQ ITQ 0.277 0.354 0.308 0.408 0.272 0.235 0.222 0.407 0.494 0.428 0.410 0.660 SIKH 0.049 0.072 0.077 0.107 0.042 0.063 0.047 0.090 0.135 0.221 0.182 0.365 LSH 0.156 0.138 0.123 0.184 0.119 0.093 0.066 0.153 0.361 0.340 0.285 0.464 SH 0.080 0.221 0.182 0.250 0.051 0.135 0.111 0.167 0.117 0.237 0.136 0.356 PCAH 0.050 0.191 0.156 0.220 0.036 0.137 0.107 0.153 0.046 0.257 0.295 0.312 NPQ can quantise a wide range of projection functions. NPQ + cheap projection (e.g. LSH) can outperform SBQ + expensive projection (e.g. PCA). NPQ is faster for N< data dimensionality. AUPRC vs. Number of bits for LabelMe, CIFAR and TinyImages: (a) LabelMe (b) CIFAR-10 (c) TinyImages NPQ is an effective quantisation strategy across a wide bit range. F UTURE W ORK Variable bits per hyperplane: refer to our recent ACL’13 paper. Evaluation of NPQ in a hash lookup based retrieval scenario.

Neighbourhood Preserving Quantisation for LSH SIGIR Poster

Embed Size (px)

Citation preview

Page 1: Neighbourhood Preserving Quantisation for LSH SIGIR Poster

NEIGHBOURHOOD PRESERVINGQUANTISATION FOR LSH

SEAN MORAN†, VICTOR LAVRENKO, MILES OSBORNE† [email protected]

RESEARCH QUESTION

• LSH uses 1 bit per hyperplane. Can we do better with multiple bits?

INTRODUCTION

• Problem: Fast Nearest Neighbour (NN) search in large datasets.• Hashing-based approach:

– Generate a similarity preserving binary code (fingerprint).– Use fingerprint as index into the buckets of a hash table.– If collision occurs only compare to items in the same bucket.

110101

010111

111101

H

H

Query ComputeSimilarity

NearestNeighbours

Query

Database

Hash Table

Content Based IR

Image: Imense Ltd

Image: Doersch et al.

Image: Xu et al.

Location Recognition

Near duplicate detection

010101

111101

.....

• Advantages:– Constant query time (with respect to the database size).– Compact binary codes are extremely storage efficient.

LOCALITY SENSITIVE HASHING (LSH) (INDYK AND MOTWANI, ’98)

• Randomized algorithm for approximate nearest neighbour (ANN)search using binary codes.

• Probabilistic guarantee on retrieval accuracy versus search time.• LSH for inner product similarity:

– Divide input space with L random hyperplanes (e.g. L=2):

x

y

n2

n1

h1

h2

11

0100

10

– Projection: Take dot product of data-point (x) with normal (n.x):

n2

n1

– Quantisation: Threshold and apply sign function:

0 1

t

n2

• Vanilla LSH: each hyperplane gives 1 bit of the binary code.

NEIGHBOURHOOD PRESERVING QUANTISATION (NPQ)

• Assigns multiple bits per hyperplane using multiple thresholds.• F1 optimisation using pairwise constraints matrix S: if Sij = 1 then

points xi, xj with projections yi, yj are true nearest neighbours.• TP: # Sij = 1 pairs in same region. FP: # Sij = 0 pairs in same

region. FN: # Sij = 1 pairs in different regions. Combine TP, FP, FNusing F1:

F1 score: 1.00

00 01 10 11

t1 t2 t3

n2

F1 score: 0.44

00 01 10 11

t1 t2 t3

n2

• Interpolate F1 with a regularisation term Ω(T1:u):

Znpq = αF1 + (1− α)(1− Ω(T1:u)) with: Ω(T1:u) =1

σ

uXa=0

Xi:yi∈ra

yi − µra2

where: σ =

nXi=1

yi − µd2 , µd is dimension mean, µra is mean of region ra

.• Random restarts used to optimise Znpq . Time complexity ∼ O(N2),where N is # data points in training dataset.

EVALUATION PROTOCOL

• Task: Image retrieval on three image datasets: 22k LabelMe, CIFAR-10 and 100k TinyImages. Images encoded with GIST descriptors.• Projections: LSH, Shift-invariant kernel hashing (SIKH), Itera-

tive Quantisation (ITQ), Spectral Hashing (SH) and PCA-Hashing(PCAH).• Baselines: Single Bit Quantisation (SBQ), Manhattan Hashing (MQ)

(Kong et al., ’12), Double-Bit quantisation (DBQ) (Kong and Li, ’12).• Hamming Ranking: how well do we retrieve ε−NN of queries?

Quantify using area under the precision-recall curve (AUPRC).

RESULTS

• AUPRC across different projection methods at 32 bits:Dataset LabelMe CIFAR TinyImages

SBQ MQ DBQ NPQ SBQ MQ DBQ NPQ SBQ MQ DBQ NPQITQ 0.277 0.354 0.308 0.408 0.272 0.235 0.222 0.407 0.494 0.428 0.410 0.660SIKH 0.049 0.072 0.077 0.107 0.042 0.063 0.047 0.090 0.135 0.221 0.182 0.365LSH 0.156 0.138 0.123 0.184 0.119 0.093 0.066 0.153 0.361 0.340 0.285 0.464SH 0.080 0.221 0.182 0.250 0.051 0.135 0.111 0.167 0.117 0.237 0.136 0.356PCAH 0.050 0.191 0.156 0.220 0.036 0.137 0.107 0.153 0.046 0.257 0.295 0.312

• NPQ can quantise a wide range of projection functions.• NPQ + cheap projection (e.g. LSH) can outperform SBQ + expensive

projection (e.g. PCA). NPQ is faster for N < data dimensionality.• AUPRC vs. Number of bits for LabelMe, CIFAR and TinyImages:

(a) LabelMe (b) CIFAR-10 (c) TinyImages

• NPQ is an effective quantisation strategy across a wide bit range.

FUTURE WORK

• Variable bits per hyperplane: refer to our recent ACL’13 paper.• Evaluation of NPQ in a hash lookup based retrieval scenario.