Variable Bit Quantisation for LSH

Sean Moran, Victor Lavrenko, Miles Osborne

School of Informatics, University of Edinburgh

ACL Sofia, August 2013


Page 2: ACL Variable Bit Quantisation Talk

Variable Bit Quantisation for LSH

Fast search in large-scale datasets

Locality Sensitive Hashing (LSH)

Variable Bit Quantisation

Evaluation

Summary

Page 3: ACL Variable Bit Quantisation Talk

Fast search in large-scale datasets

- Problem: retrieve the nearest neighbour(s) of a given query item.

[Figure: a query point q with database points x and y; the 1-NN of q is marked.]

- Naïve approach: compare the query to all N database items.
  - Scales linearly, O(N), with the size of the database.
  - Impractical for all but the smallest databases.
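The naïve approach can be sketched as a linear scan. This is a minimal illustration, not code from the talk: the function name `nearest_neighbour`, the use of Euclidean distance and the toy data are all assumptions.

```python
import math

def nearest_neighbour(query, database):
    """Return (index, distance) of the database item closest to `query`.

    Every one of the N items is compared with the query, so the cost is O(N).
    """
    best_index, best_dist = -1, math.inf
    for i, item in enumerate(database):
        d = math.dist(query, item)  # Euclidean distance (assumed metric)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist

# Toy 2-D database (illustrative values only).
database = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
index, dist = nearest_neighbour((0.9, 1.2), database)
```

Each query touches every stored item, which is what makes the scan impractical once N is large.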

Page 4: ACL Variable Bit Quantisation Talk

Fast search in large-scale datasets

[1] M. Norouzi and D. Fleet. Minimal Loss Hashing. ICML ’11.

Page 7: ACL Variable Bit Quantisation Talk

Hashing-based search using binary codes

[Figure, built up over slides 7 to 11: a hash function H maps each database item to a binary code (e.g. 110101, 010111, 010101, 111101) that indexes a hash table. The query is hashed with the same H, similarity is computed between the query and the items it collides with, and the nearest neighbours are returned.]
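A toy sketch of the lookup pipeline in the diagram, under stated assumptions: `toy_hash` (one bit per coordinate, set when the coordinate is non-negative) is a hypothetical stand-in for the hash function H, and the item names and vectors are made up for illustration.

```python
from collections import defaultdict

def toy_hash(vector):
    """Hypothetical stand-in for H: one bit per coordinate (1 if >= 0)."""
    return "".join("1" if x >= 0 else "0" for x in vector)

def build_hash_table(items, hash_fn):
    """Group item ids into buckets keyed by their binary code."""
    table = defaultdict(list)
    for item_id, vector in items.items():
        table[hash_fn(vector)].append(item_id)
    return table

def candidates(table, hash_fn, query):
    """Return the ids stored in the bucket that the query hashes to."""
    return table.get(hash_fn(query), [])

items = {"a": (0.5, -1.0), "b": (0.7, -0.2), "c": (-0.3, 0.9)}
table = build_hash_table(items, toy_hash)
found = candidates(table, toy_hash, (0.6, -0.5))
```

Only the returned candidates need an exact similarity computation, so the query cost does not grow with the number of items in other buckets.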

Page 12: ACL Variable Bit Quantisation Talk

Why transform our data to binary codes?

- Constant O(1) query time with respect to the dataset size.

- Binary code comparison requires few machine instructions (XOR followed by popcount).

- Binary codes are extremely compact, e.g. 16 MB to store 1 million data points encoded with 128 bits each.
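Both claims can be checked with a few lines. The sketch below stores codes as Python integers (an assumption made for brevity) and reuses two of the 6-bit codes shown in the earlier diagram; the helper name `hamming_distance` is illustrative.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """XOR marks the differing bit positions; popcount counts them."""
    return bin(code_a ^ code_b).count("1")

# 110101 XOR 010111 = 100010, which has two set bits.
d = hamming_distance(0b110101, 0b010111)

# Storage for 1 million points at 128 bits (16 bytes) per point.
num_points = 1_000_000
bits_per_point = 128
total_bytes = num_points * bits_per_point // 8  # 16,000,000 bytes = 16 MB
```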

Page 13: ACL Variable Bit Quantisation Talk

Applications

- Content-based IR
- Recommendation
- Near-duplicate detection
- Noun clustering
- Location recognition

Page 15: ACL Variable Bit Quantisation Talk

Computing similarity preserving binary codes

- Locality Sensitive Hashing (LSH):
  - Randomised algorithm for approximate nearest neighbour search
  - Generates similarity-preserving binary codes for a dataset
  - Preserved similarity depends on the selected hash family
  - Probabilistic guarantee on accuracy versus search time
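As one concrete instance, the sketch below assumes the sign-random-projection hash family, which preserves cosine similarity: each bit records which side of a random hyperplane through the origin a point falls on. The function names, the Gaussian sampling of normal vectors and the seed are illustrative choices, not details taken from the talk.

```python
import random

def random_hyperplanes(num_bits, dim, seed=0):
    """Sample one Gaussian normal vector per hash bit (assumed family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def lsh_code(point, hyperplanes):
    """Project onto each normal vector and keep only the sign as a bit."""
    bits = []
    for normal in hyperplanes:
        projection = sum(n * p for n, p in zip(normal, point))
        bits.append(1 if projection >= 0.0 else 0)
    return bits

planes = random_hyperplanes(num_bits=8, dim=3, seed=42)
point = [0.3, -1.2, 0.8]
code = lsh_code(point, planes)
```

Because only the sign of each projection is kept, rescaling a point by a positive constant leaves its code unchanged, which is consistent with a family that depends on angle rather than magnitude.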

Page 16: ACL Variable Bit Quantisation Talk

Locality Sensitive Hashing

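[Figure: points in the x-y plane partitioned by two hyperplanes h1 and h2 with normal vectors n1 and n2. The four resulting regions are labelled 00, 01, 10 and 11, the 2-bit codes formed from h1(p) and h2(p).]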

Page 17: ACL Variable Bit Quantisation Talk

Low Dimensional Projection

[Figure: the data points projected onto the normal vectors n1 and n2.]

Page 18: ACL Variable Bit Quantisation Talk

Single Bit Quantisation (SBQ)

[Figure: the projected dimension n2 split by a single threshold t into two regions, labelled 0 and 1.]

Page 19: ACL Variable Bit Quantisation Talk

Plus many more...

- Kernel methods [1]
- Spectral methods [2] [3]
- Neural networks [4]
- Loss based methods [5]
- All use single bit quantisation (SBQ)...

[1] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. NIPS ’09.
[2] Y. Weiss, A. Torralba and R. Fergus. Spectral Hashing. NIPS ’08.
[3] J. Wang, S. Kumar and S.-F. Chang. Semi-supervised hashing for large-scale search. PAMI ’12.
[4] R. Salakhutdinov and G. Hinton. Semantic Hashing. NIPS ’08.
[5] B. Kulis and T. Darrell. Learning to Hash with Binary Reconstructive Embeddings. NIPS ’09.

Page 20: ACL Variable Bit Quantisation Talk

Problem 1: SBQ leads to high quantisation errors

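- Highest point density typically occurs around zero:

[Figure: histogram of projected values. x-axis: Projected Value, from −1.5 to 1.5; y-axis: Count, from 0 to 6000.]

- Closer points can have a greater Hamming distance than distant points:

[Figure: the projected dimension n2 split by threshold t into regions 0 and 1. One pair of points is marked "Same code" and another pair "Different code".]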
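A small numeric illustration of this problem, assuming a threshold at t = 0 and made-up projected values; the helper `sbq` is a hypothetical name for single bit quantisation.

```python
def sbq(projection, threshold=0.0):
    """Single bit quantisation: 1 on or above the threshold, else 0."""
    return 1 if projection >= threshold else 0

# Two points that are close together but straddle the threshold.
near_a, near_b = -0.05, 0.05
# Two points that are far apart but lie on the same side of it.
far_a, far_b = 0.05, 1.40

hamming_near = sbq(near_a) ^ sbq(near_b)  # codes differ
hamming_far = sbq(far_a) ^ sbq(far_b)     # codes agree
```

The pair separated by 0.10 ends up at Hamming distance 1, while the pair separated by 1.35 ends up at distance 0. Since most projected values fall near zero, many pairs are affected in this way.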

Page 21: ACL Variable Bit Quantisation Talk

Problem 2: some hyperplanes are better than others

[Figure, built up over slides 21 to 23: the partition of the x-y plane by hyperplanes h1 and h2 (normal vectors n1 and n2; region codes 00, 01, 10, 11), shown alongside the data projected onto n1 (Projected Dimension 1) and n2 (Projected Dimension 2).]

Page 25: ACL Variable Bit Quantisation Talk

Our Solution: Variable Bit Quantisation

[Figure: Variable Bit Quantisation takes continuous data as input and outputs a binary string. It has three components: Multiple Bit Encoding, Threshold Optimisation and Variable Bit Allocation.]

Page 27: ACL Variable Bit Quantisation Talk

Multiple Bit Encoding: Natural Binary Code

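Regions along the projected dimension n2, from left to right:

- 1 bit: one threshold t gives regions 0 | 1.
- 2 bits: three thresholds t1, t2, t3 give regions 00 | 01 | 10 | 11.
- 3 bits: seven thresholds t1, ..., t7 give regions 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111.
- etc.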
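The encoding can be sketched as follows: find which region a projected value falls in and write the region index in binary. The function name `encode_nbc` and the threshold values are illustrative assumptions; b bits require 2^b − 1 sorted thresholds.

```python
from bisect import bisect_right

def encode_nbc(projection, thresholds, num_bits):
    """Natural binary code of the region that `projection` falls in.

    `thresholds` must be sorted and contain 2**num_bits - 1 values.
    """
    if len(thresholds) != 2 ** num_bits - 1:
        raise ValueError("b bits need 2**b - 1 thresholds")
    region = bisect_right(thresholds, projection)  # region index 0 .. 2**b - 1
    return format(region, f"0{num_bits}b")

thresholds = [-0.5, 0.0, 0.5]  # t1, t2, t3 (made-up values)
codes = [encode_nbc(v, thresholds, 2) for v in (-0.7, -0.2, 0.3, 0.9)]
```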

Page 28: ACL Variable Bit Quantisation Talk

Multiple Bit Encoding [1]

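[Figure: 2-bit encoding of the projected dimension n2 with thresholds t1, t2, t3 and regions 00, 01, 10, 11. Distances between codes are measured on their decimal values: 11 and 01 give 3 − 1 = 2; 10 and 00 give 2 − 0 = 2.]

[1] W. Kong, W. Li and M. Guo. Manhattan hashing for large-scale image retrieval. SIGIR ’12.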
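The decimal distance on this slide can be compared with the Hamming distance on the same codes. A minimal sketch with illustrative helper names, taking codes as bit strings:

```python
def manhattan_distance(code_a: str, code_b: str) -> int:
    """Distance between the decimal values of two natural binary codes."""
    return abs(int(code_a, 2) - int(code_b, 2))

def hamming_distance(code_a: str, code_b: str) -> int:
    """Number of bit positions in which the two codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))

pairs = [("11", "01"), ("10", "00"), ("01", "10")]
results = [(manhattan_distance(a, b), hamming_distance(a, b)) for a, b in pairs]
```

For the adjacent regions 01 and 10 the decimal distance is 1 but the Hamming distance is 2, while regions two steps apart (11 and 01, or 10 and 00) have Hamming distance 1. The decimal distance follows the ordering of the regions; the Hamming distance on natural binary codes does not.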

Page 30: ACL Variable Bit Quantisation Talk

Threshold Optimisation

- Multiple bits per hyperplane require multiple thresholds.

- F1-based optimisation using a pairwise constraints matrix S:
  - TP: number of S_ij = 1 pairs in the same region.
  - FP: number of S_ij = 0 pairs in the same region.
  - FN: number of S_ij = 1 pairs in different regions.

[Figure: two placements of the thresholds t1, t2, t3 (regions 00, 01, 10, 11) on the projected dimension n2. One placement has F1 score 1.00, the other F1 score 0.44.]
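The objective can be sketched directly from the TP, FP and FN definitions. This shows only how a given threshold placement is scored, not the search over placements; the function name, the toy projections and the neighbour pairs are assumptions, and the second score below is for this toy data, not the 0.44 from the slide.

```python
from bisect import bisect_right
from itertools import combinations

def f1_score(projections, thresholds, S):
    """F1 of a threshold placement against pairwise constraints S."""
    regions = [bisect_right(thresholds, p) for p in projections]
    tp = fp = fn = 0
    for i, j in combinations(range(len(projections)), 2):
        same_region = regions[i] == regions[j]
        if S[i][j] == 1 and same_region:
            tp += 1  # true neighbours kept together
        elif S[i][j] == 0 and same_region:
            fp += 1  # non-neighbours placed together
        elif S[i][j] == 1:
            fn += 1  # true neighbours split apart
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Eight projected points forming four neighbouring pairs (toy data).
projections = [-1.0, -0.9, -0.3, -0.2, 0.2, 0.3, 0.9, 1.0]
S = [[0] * 8 for _ in range(8)]
for i, j in [(0, 1), (2, 3), (4, 5), (6, 7)]:
    S[i][j] = S[j][i] = 1

good = f1_score(projections, [-0.6, 0.0, 0.6], S)     # separates the pairs
bad = f1_score(projections, [-0.95, -0.25, 0.25], S)  # cuts through pairs
```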

Page 32: ACL Variable Bit Quantisation Talk

Variable Bit Allocation

- The F1 score is a measure of the neighbourhood-preserving quality of a hyperplane:

[Figure: two projected dimensions. n1 (F1 score: 0.25) has 0 bits assigned: 0 thresholds do just as well as one or more thresholds. n2 (F1 score: 1.00) has 2 bits assigned: 3 thresholds (t1, t2, t3; regions 00, 01, 10, 11) perfectly preserve the neighbourhood structure.]

- Compute the bit allocation that maximises the cumulative F1 score across all hyperplanes, subject to a bit budget B.

- Bit allocation is solved as a binary integer linear program (BILP).

Page 33: ACL Variable Bit Quantisation Talk

Variable Bit Allocation

$$
\max \; \lVert F \circ Z \rVert
\quad \text{subject to} \quad
\lVert Z_h \rVert = 1, \; h \in \{1, \dots, B\}; \qquad
\lVert Z \circ D \rVert \le B; \qquad
Z \text{ is binary}
$$

- F contains the F1 scores per hyperplane, per bit count
- Z is an indicator matrix specifying the bit allocation
- D is a constraint matrix
- B is the bit budget
- $\lVert \cdot \rVert$ denotes the Frobenius L1 norm
- $\circ$ denotes the Hadamard product

Page 34: ACL Variable Bit Quantisation Talk

Variable Bit Allocation

Example with two hyperplanes (columns h1, h2) and bit counts of 0, 1 or 2 per hyperplane (rows b0, b1, b2):

        F             D           Z
      h1    h2      h1  h2      h1  h2
b0   0.25  0.25      0   0       1   0
b1   0.35  0.50      1   1       0   0
b2   0.40  1.00      2   2       0   1

- Sparse solution possible: lower quality hyperplanes can be discarded.
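The example can be checked by exhaustive enumeration, which stands in here for a BILP solver and is only practical for tiny instances. The F matrix is the one on the slide; the budget B = 2 is an assumption (the slide does not state it) and the function name `allocate_bits` is illustrative.

```python
from itertools import product

# F[b][h]: F1 score of hyperplane h when it is given b bits (from the slide).
F = [[0.25, 0.25],
     [0.35, 0.50],
     [0.40, 1.00]]

def allocate_bits(F, budget):
    """Pick one bit count per hyperplane, maximising total F1 within budget."""
    num_counts, num_planes = len(F), len(F[0])
    best_alloc, best_score = None, float("-inf")
    for alloc in product(range(num_counts), repeat=num_planes):
        if sum(alloc) > budget:  # equivalent to ||Z o D|| <= B with D[b][h] = b
            continue
        score = sum(F[b][h] for h, b in enumerate(alloc))
        if score > best_score:
            best_alloc, best_score = alloc, score
    return best_alloc, best_score

alloc, score = allocate_bits(F, budget=2)  # assumed budget B = 2

# Indicator matrix Z[b][h] = 1 when hyperplane h receives b bits.
Z = [[1 if alloc[h] == b else 0 for h in range(len(F[0]))] for b in range(len(F))]
```

Under this assumed budget the enumeration assigns 0 bits to h1 and 2 bits to h2, which reproduces the Z matrix shown on the slide: the lower quality hyperplane h1 is discarded.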

Page 36: ACL Variable Bit Quantisation Talk

Evaluation Protocol

- Task: text and image retrieval on three standard datasets: CIFAR-10, TDT-2 and Reuters-21578.

- Projections: LSH [1], Shift-invariant kernel hashing (SIKH) [2], Binary LSI (BLSI) [3], Spectral Hashing (SH) [4] and PCA-Hashing (PCAH) [5].

- Baselines: Single Bit Quantisation (SBQ), Manhattan Hashing (MQ) [6], Double-Bit Quantisation (DBQ) [7].

- Hamming Ranking: how well do we retrieve the ε-NN of query points?

[1] P. Indyk and R. Motwani. Approximate nearest neighbors: removing the curse of dimensionality. STOC ’98.
[2] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. NIPS ’09.
[3] R. Salakhutdinov and G. Hinton. Semantic Hashing. NIPS ’08.
[4] Y. Weiss, A. Torralba and R. Fergus. Spectral Hashing. NIPS ’08.
[5] J. Wang, S. Kumar and S.-F. Chang. Semi-supervised hashing for large-scale search. PAMI ’12.
[6] W. Kong, W. Li and M. Guo. Manhattan hashing for large-scale image retrieval. SIGIR ’12.
[7] W. Kong and W. Li. Double Bit Quantisation for Hashing. AAAI ’12.

Page 37: ACL Variable Bit Quantisation Talk

AUPRC across different projection methods

              CIFAR-10 (32 bits)             Reuters-21578 (128 bits)
Projection    SBQ    MQ     DBQ    VBQ       SBQ    MQ     DBQ    VBQ
SIKH          0.042  0.046  0.047  0.161     0.102  0.112  0.087  0.389
LSH           0.119  0.091  0.066  0.207     0.276  0.201  0.175  0.538
BLSI          0.038  0.135  0.111  0.231     0.100  0.030  0.030  0.156
SH            0.051  0.144  0.111  0.202     0.033  0.028  0.030  0.154
PCAH          0.036  0.132  0.107  0.219     0.095  0.034  0.027  0.154

- VBQ is an effective quantisation scheme for both image and text datasets.

- VBQ and a cheap projection (e.g. LSH) can outperform SBQ and an expensive projection (e.g. PCA).

Page 38: ACL Variable Bit Quantisation Talk

AUPRC for LSH across a broad bit range

[Figure: AUPRC (y-axis) against Number of Bits (x-axis) for SBQ, MQ and VBQ. (a) Reuters-21578, 32 to 256 bits. (b) CIFAR-10, 8 to 64 bits.]

- VBQ is effective across a wide range of bits.

- See paper for further results.

Page 40: ACL Variable Bit Quantisation Talk

Summary

- Proposed a general data-driven technique to adaptively assign variable bits per LSH hyperplane

- Hyperplanes better preserving the neighbourhood structure are afforded more bits from the bit budget

- VBQ substantially increased retrieval performance across standard text and image datasets

- Future work: evaluate with an LSH system that uses hash tables for fast retrieval

Page 41: ACL Variable Bit Quantisation Talk

Thank you for your attention

Sean Moran

[email protected]
