LOCALITY SENSITIVE HASHING FOR BIG DATA
Wei Wang, University of New South Wales
The 3rd China-Australia Workshop, 9/2/14
Outline
- NN and c-ANN queries
- Existing LSH methods for large data
- Our approach: SRS [VLDB 2015]
- Conclusions
NN and c-ANN Queries
- A set of points, D = ∪_{i=1..n} {O_i}, in d-dimensional Euclidean space (d is large, e.g., hundreds)
- Given a query point q, find the closest point, O*, in D
- Relaxed version (c-ANN):
  - Return a c-ANN point, i.e., a point whose distance to q is at most c * Dist(O*, q)
  - May return a c-ANN point only with at least constant probability
[Figure: query q in d-dimensional space over D = {a, b, x}; the NN is a, at distance r; any point within c*r of q, such as x, is a valid c-ANN answer. c-ANN is also known as (1+ε)-ANN.]
Motivations
- Theoretical interest
  - Fundamental geometric problem: the "post-office problem"
  - Known to be hard due to high dimensionality
- Many applications
  - Feature vectors: Data Mining, Multimedia DB
  - Applications based on similarity queries, e.g., quantization in coding/compression, recommendation systems, bioinformatics/chemoinformatics, ...
Applications
- SIFT feature matching
- Image retrieval, understanding, scene rendering
- Video retrieval
Challenges
- Curse of dimensionality / concentration of measure
  - Hard to find algorithms sub-linear in n and polynomial in d
  - The approximate version (c-ANN) is not much easier (details on the next slide)
- Large data size
  - 1 KB for a single point with 256 dimensions; 1B points = 1 TB
  - ImageNet (Stanford & Princeton): 14 million images, ~100 feature vectors per image => 1.4B feature vectors
- High dimensionality: e.g., finding similar words in D = 198 million tweets [JMLR 2013]
Existing Solutions
- Exact NN
  - O(d^5 log n) query time, O(n^(2d+δ)) space
  - O(d * n^(1-ε(d))) query time, O(dn) space
  - Linear scan: O(dn/B) I/Os, O(dn) space
- (1+ε)-ANN
  - O(log n + 1/ε^((d-1)/2)) query time, O(n log(1/ε)) space; a probabilistic test removes the exponential dependency on d
  - Fast JLT: O(d log d + ε^(-3) log^2 n) query time, O(n^max(2, ε^(-2))) space
  - LSH-based: Õ(d n^(ρ+o(1))) query time, Õ(n^(1+ρ+o(1)) + nd) space, where ρ = 1/(1+ε) + o_c(1)
- LSH is the best approach using sub-quadratic space
- Linear scan is (practically) the best approach using linear space & time
Approximate NN for Multimedia Retrieval
- Cover-tree
- Spill-tree
- Reduction to NN search with Hamming distance
- Dimensionality reduction (e.g., PCA)
- Quantization-based approaches (e.g., CK-Means)
- ...
These are heuristic methods (i.e., no quality guarantee), or they rely on some "hidden" characteristics of the data.
Locality Sensitive Hashing (LSH)
- Equality search
  - Index: store o in bucket h(o)
  - Query: retrieve every o in bucket h(q), then verify whether o = q
- LSH (a code sketch follows below)
  - ∀h ∈ LSH family: Pr[h(q) = h(o)] ∝ 1/Dist(q, o), where h :: R^d -> Z (technically, dependent on the radius r)
- "Near-by" points (blue) have a higher chance of colliding with q than "far-away" points (red)
[Figure: query q with radii r and c*r; a near point a (blue) and a far point b (red) in d-dimensional space.]
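The deck itself contains no code; the following is a minimal Python sketch of one standard instantiation of such an LSH family for Euclidean distance (the E2LSH-style hash h(o) = ⌊(a·o + b)/w⌋; all names and parameters here are ours, not the speaker's):

```python
import numpy as np

def make_e2lsh_hash(d, w, rng):
    """Sample one h from the 2-stable (Euclidean) LSH family:
    h(o) = floor((a . o + b) / w), with a ~ N(0, I_d), b ~ Uniform[0, w)."""
    a = rng.standard_normal(d)
    b = rng.uniform(0.0, w)
    return lambda o: int(np.floor((a @ o + b) / w))

# Near-by points collide with q far more often than far-away points:
rng = np.random.default_rng(0)
d, w = 64, 4.0
q = rng.standard_normal(d)
near = q + 0.1 * rng.standard_normal(d)    # a "blue" point close to q
far = q + 10.0 * rng.standard_normal(d)    # a "red" point far from q

hashes = [make_e2lsh_hash(d, w, rng) for _ in range(1000)]
print(np.mean([h(q) == h(near) for h in hashes]))   # close to 1
print(np.mean([h(q) == h(far) for h in hashes]))    # close to 0
```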
LSH: Indexing & Query Processing
- Indexing (for a fixed r; this solves the (r, c*r)-NN problem; see the sketch after this list)
  - sig(o) = ⟨h_1(o), h_2(o), ..., h_k(o)⟩; concatenating k hashes controls the false-positive size
  - Store o in bucket sig(o)
  - Iteratively increase r
- Query processing
  - Search with a fixed r: retrieve and "verify" the points in bucket sig(q)
  - Repeat L times (boosting, for constant success probability)
  - Galloping search to find the first good r; this incurs additional cost and gives only a c^2 quality guarantee
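A minimal in-memory sketch of the fixed-r index above, again assuming the E2LSH-style family; k, L, and w are illustrative choices of ours, and the iterative increase of r / galloping search is omitted:

```python
import numpy as np
from collections import defaultdict

class LSHTable:
    """One of L hash tables; the bucket key is the composite
    signature sig(o) = (h_1(o), ..., h_k(o))."""
    def __init__(self, d, k, w, rng):
        self.A = rng.standard_normal((k, d))    # k hash functions at once
        self.b = rng.uniform(0.0, w, size=k)
        self.w = w
        self.buckets = defaultdict(list)

    def sig(self, o):
        return tuple(np.floor((self.A @ o + self.b) / self.w).astype(int))

    def insert(self, idx, o):
        self.buckets[self.sig(o)].append(idx)

    def query(self, q):
        return self.buckets.get(self.sig(q), [])

def lsh_candidates(tables, q):
    """Union of the L probed buckets; each candidate is then 'verified'
    by computing its true distance to q."""
    cand = set()
    for t in tables:
        cand.update(t.query(q))
    return cand

rng = np.random.default_rng(1)
data = rng.standard_normal((10_000, 32))
tables = [LSHTable(d=32, k=8, w=4.0, rng=rng) for _ in range(20)]   # L = 20
for t in tables:
    for i, o in enumerate(data):
        t.insert(i, o)
print(len(lsh_candidates(tables, data[0])))   # candidate set for q = data[0]
```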
Locality Sensitive Hashing (LSH)
- Standard LSH reduces c^2-ANN to many (c^i, c^(i+1))-NN problems
- Superior empirical query performance + theoretical guarantees without assumptions on the data distribution
- Main problem: huge super-linear index size, Õ(log(dist_NN) * n^(1+ρ+o(1))); typically hundreds of times larger than the data itself
Dealing with Large Data /1
Approach 1: a single hash table, but probing more buckets
- Multi-probe LSH [VLDB 07] (Princeton)
  - Probe the ±1, ±2, ... buckets around sig(q)
  - Heuristic method; sensitive to parameter settings
- Entropy LSH [SODA 06] (Stanford)
  - Create multiple perturbed queries q_i from q and probe the buckets sig(q_i)
  - O(n^(2/c)) perturbed queries with O(1) hash tables
Dealing with Large Data /2
Approach 2: distributed LSH
- Simple distributed LSH
  - Use a (key, value) store to implement (distributed) hash tables; key -> machine ID
  - Network traffic: O(n^Θ(1))
- Layered LSH [CIKM 12] (Stanford)
  - Use (h(key), key + value), where h() is itself an LSH
  - Network traffic: O(√(log n)) w.h.p.
  - 13 nodes for 64-dim color histograms of 80 million Tiny Images
Dealing with Large Data /3
Approach 3: parallel LSH
- PLSH [VLDB 13] (MIT + Intel)
  - Partition the data across nodes; broadcast the query and let each node process its data locally
  - Meticulous hardware-oriented optimizations: fit the indexes into main memory at each node; SIMD; cache-friendly bitmaps; 2-level hashing for TLB optimization; other optimizations for streaming data
  - Build predictive performance models for parameter tuning
  - 100 servers (64 GB memory each) for 1B streaming tweets and queries
Dealing with Large Data /4
Approach 4: LSH on external memory, with index size reduction
- LSB-forest [SIGMOD'09, TODS'10]
  - A different reduction from c^2-ANN to (r, c*r)-NN problems
  - O((dn/B)^0.5) query, O((dn/B)^1.5) space
- C2LSH [SIGMOD'12]
  - Does not use composite hash keys; performs fine-granular counting of the number of collisions in m LSH projections
  - O(n log(n)/B) query, O(n log(n)/B) space
- SRS (ours)
  - O(n/B) query, O(n/B) space, with small constants in the O()
Weakness of Ext-Memory LSH Methods
- Existing methods use super-linear space
  - Thousands (or more) of hash tables are needed if one wants rigorous guarantees
  - In practice, people resort to hashing into binary codes (and using Hamming distance) for multimedia retrieval
- Can only handle c = x^2 for integer x ≥ 2, to enable reusing the hash tables (merging buckets)
- Valuable information is lost (due to quantization)
- Updates? (changes to n and c)

Index sizes on the Audio dataset (40 MB): LSB-forest 1500 MB, C2LSH 127 MB, SRS (ours) 2 MB.
SRS: Our Proposed Method
- Solves c-ANN queries with O(n) query time and O(n) space, with constant probability
  - The constants hidden in the O() are very small
  - The early-termination condition is provably effective
- Advantages: small index, rich functionality, simple
- Central idea: reduce a c-ANN query in d dimensions to a kNN query (with filtering) in m dimensions
  - Model the distribution of m "stable random projections"
2-stable Random Projection
- Let D be the 2-stable distribution, i.e., the standard Normal distribution N(0, 1)
- For two i.i.d. random variables A ~ D and B ~ D: x*A + y*B ~ (x^2 + y^2)^(1/2) * D
[Illustration: a vector V projected onto random directions r_1, r_2; ⟨V, r_1⟩ ~ N(0, ǁVǁ).]
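A quick numerical check of the 2-stability property (our own sketch, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(42)
x, y, n = 3.0, 4.0, 1_000_000

# 2-stability: x*A + y*B is distributed as sqrt(x^2 + y^2) * N(0, 1)
A = rng.standard_normal(n)
B = rng.standard_normal(n)
print((x * A + y * B).std())              # ~5.0 = sqrt(3^2 + 4^2)

# Consequence: for r with i.i.d. N(0,1) entries, <v, r> ~ N(0, ||v||)
d = 128
v = rng.standard_normal(d)
r = rng.standard_normal((100_000, d))     # 100k independent projection vectors
print((r @ v).std(), np.linalg.norm(v))   # the two numbers should agree
```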
Dist(O) and ProjDist(O) and Their Relationship
- Apply m 2-stable random projections r_1, ..., r_m: a point O in d dims maps to Proj(O) = (z_1, ..., z_m) in m dims
- With V ≔ O − Q: z_i ≔ ⟨V, r_i⟩ ~ N(0, ǁVǁ)
- z_1^2 + ... + z_m^2 ~ ǁVǁ^2 * χ^2_m, i.e., a scaled Chi-squared distribution with m degrees of freedom
- Ψ_m(x): the cdf of the standard χ^2_m distribution
[Figure: Q and O at distance Dist(O) in d dims; Proj(Q) and Proj(O) at distance ProjDist(O) in m dims.]
- Think of the z_i as "samples" (see the empirical check below)
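A sketch that checks this relationship empirically, i.e., that ProjDist^2(O)/Dist^2(O) follows χ^2_m (assuming SciPy is available; the variable names are ours):

```python
import numpy as np
from scipy.stats import chi2, kstest

rng = np.random.default_rng(7)
d, m = 256, 6
v = rng.standard_normal(d)            # v = O - Q, so Dist(O) = ||v||

# Draw many independent sets of m 2-stable projections and record
# ProjDist^2(O) / Dist^2(O); it should follow the chi^2_m distribution.
samples = []
for _ in range(2000):
    R = rng.standard_normal((m, d))   # m Gaussian projection vectors
    z = R @ v                         # the m projected coordinates ("samples")
    samples.append((z @ z) / (v @ v))

print(kstest(samples, chi2(df=m).cdf))  # a large p-value: consistent with chi^2_m
```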
LSH-like Property
- Intuitive idea: if Dist(O1) ≪ Dist(O2), then ProjDist(O1) < ProjDist(O2) with high probability
- But the converse is NOT true
  - With few projections, the NN object in the projected space is most likely not the NN object in the original space, as many far-away objects are projected before the NN/c-ANN objects
  - But we can bound the expected number of such cases (say, by T)!
- Solution: perform incremental k-NN search in the projected space until T objects have been accessed, plus an early-termination test
Indexing
Finding the minimum m
- Input: n, c, and T ≔ the maximum number of points the algorithm may access
- Output: m, the number of 2-stable random projections, and T' ≤ T, a better bound on T
- m = O(n/T); we use T = O(n), so m = O(1) suffices for a linear-space index

Building the index (see the sketch below)
- Generate m 2-stable random projections => n projected points in an m-dimensional space
- Index the projections using any index that supports incremental kNN search, e.g., an R-tree
- Space cost: O(m * n) = O(n)
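A minimal sketch of this indexing step; an in-memory k-d tree stands in here for the external-memory R-tree that SRS actually uses:

```python
import numpy as np
from scipy.spatial import cKDTree

def build_srs_index(data, m, rng):
    """Project the (n x d) dataset to m dimensions with 2-stable (Gaussian)
    random projections and index the projections for kNN search.
    Space cost: O(m * n) floats, i.e., O(n) for constant m."""
    n, d = data.shape
    R = rng.standard_normal((d, m))   # the m 2-stable random projections
    proj = data @ R                   # n projected points in m dims
    return R, proj, cKDTree(proj)     # k-d tree as a stand-in for the R-tree
```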
SRS-αβ(T, c, pτ)
Algorithm (a code sketch follows below):
1. Compute proj(Q)
2. Incremental kNN search from proj(Q), for k = 1 to T   // stopping condition α
   - Compute Dist(O_k); maintain O_min = argmin_{1≤i≤k} Dist(O_i)
   - If the early-termination test (c, pτ) is TRUE, BREAK   // stopping condition β
3. Return O_min

Early-termination test: Ψ_m( c^2 * ProjDist^2(O_k) / Dist^2(O_min) ) > pτ

Example parameters: c = 4, d = 256, m = 6, T = 0.00242n, B = 1024, pτ = 0.18
=> index size = 0.0059n, query cost = 0.0084n, success probability = 0.13

Main Theorem: SRS-αβ returns a c-NN point with probability pτ − f(m, c), with O(n) I/O cost.
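A sketch of this query loop, reusing build_srs_index from the sketch above. cKDTree exposes no true incremental-kNN iterator, so condition α is approximated here by fetching the first T projected neighbours in one call:

```python
import numpy as np
from scipy.stats import chi2

def srs_alpha_beta(q, data, R, tree, T, c, p_tau):
    """Sketch of SRS-αβ: kNN search in the projected space, with stopping
    condition α (at most T points accessed) and condition β (the
    early-termination test from the slide)."""
    m = R.shape[1]
    proj_q = q @ R                                   # proj(Q) in m dims
    o_min, d_min = None, np.inf
    proj_dists, idxs = tree.query(proj_q, k=T)       # condition α: first T points
    for proj_dist, i in zip(np.atleast_1d(proj_dists), np.atleast_1d(idxs)):
        dist = np.linalg.norm(data[i] - q)           # verify in the original d dims
        if dist < d_min:
            o_min, d_min = int(i), dist
        # condition β: stop once Psi_m(c^2 * ProjDist^2(O_k) / Dist^2(O_min)) > p_tau
        if chi2(df=m).cdf(c**2 * proj_dist**2 / d_min**2) > p_tau:
            break
    return o_min, d_min

rng = np.random.default_rng(0)
data = rng.standard_normal((100_000, 256))
R, proj, tree = build_srs_index(data, m=6, rng=rng)
q = rng.standard_normal(256)
T = int(0.00242 * len(data))                         # T from the example parameters
print(srs_alpha_beta(q, data, R, tree, T=T, c=4, p_tau=0.18))
```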
Variations of SRS-αβ(T, c, pτ)
The same algorithm skeleton as above, with one of the two stopping conditions disabled or the parameters changed:
1. SRS-α (condition α only): better quality; query cost is O(T)
2. SRS-β (condition β only): best quality; query cost bounded by O(n); handles c = 1
3. SRS-αβ(T, c', pτ) for a different c': better quality; query cost bounded by O(T)
All variants have success probability at least pτ.
Other Results
- Uniformly optimal method for L2
- Easily extended to support top-k c-ANN queries (k > 1)
  - No previously known guarantee on the correctness of the returned results
  - We guarantee correctness with probability at least pτ whenever SRS-αβ stops due to the early-termination condition: ≈100% in practice (97% in theory)
Experiment Setup
- Algorithms: LSB-forest [SIGMOD'09, TODS'10], C2LSH [SIGMOD'12], SRS-* [VLDB'15]
- Data: see the next slide
- Measures: index size, query cost, result quality, success probability
Datasets
[Chart comparing sizes across the datasets/methods: 37 GB, 122 GB, 820 GB, 12.5 PB.]
Tiny Image Dataset (8M pts, 384 dims)
- Fastest: SRS-αβ; slowest: C2LSH. Quality is the other way around.
- SRS-α has quality comparable to C2LSH, yet at much lower cost.
- SRS-* dominates LSB-forest.
Approximate Nearest Neighbor
Empirically better than the theoretical guarantee:
- With 15% of the I/Os of a linear scan, returns the NN with probability 71%
- With 62% of the I/Os of a linear scan, returns the NN with probability 99.7%
Large Dataset (1 Billion, 128 dims)

Works on a single PC (4 GB memory, 500 GB hard disk).
Theoretical Guarantees Are Important
Success probability in a hard setting: ≥ 0.78 for SRS vs. ≤ 0.29 for the competitors.
Summary
- Index size is the bottleneck for traditional LSH-based methods
  - Limits the data size one node can support
  - Needs many nodes with large memory to scale to billions of data points
- SRS reduces c-ANN queries in arbitrarily high-dimensional space to kNN queries in a low-dimensional space
  - Our index size is approximately m/d of the size of the data file
  - Easily generalized to a parallel implementation
- Ongoing work
  - Scaling to even larger datasets using data partitioning
  - Accommodating rapid updates
  - Other theoretical work on LSH
Q&A

Similarity Query Processing Project Homepage: http://www.cse.unsw.edu.au/~weiw/project/simjoin.html
Backup Slides: Analysis
Stopping Condition α
- "Near" point: the NN point; its distance ≕ r
- "Far" points: points whose distance is > c * r
- Then, for any κ > 0 and any point o (both facts hold because ProjDist^2(o)/Dist^2(o) ~ χ^2_m):
  - Pr[ProjDist(o) ≤ κ*r | o is a near point] ≥ Ψ_m(κ^2)
  - Pr[ProjDist(o) ≤ κ*r | o is a far point] ≤ Ψ_m(κ^2 / c^2)
- Hence
  - P1 ≔ Pr[the NN point is projected before κ*r] ≥ Ψ_m(κ^2)
  - P2 ≔ Pr[fewer than T far points are projected before κ*r] ≥ 1 − Ψ_m(κ^2 / c^2) * (n/T), by Markov's inequality
- Choose κ such that P1 + P2 − 1 > 0 (a numerical sketch follows below); feasible due to the good concentration bounds for χ^2_m
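A numerical sketch of picking κ from these bounds (the grid and the helper name are ours; the parameters come from the earlier example slide):

```python
from scipy.stats import chi2

def guarantee(kappa, m, c, n_over_T):
    """P1 + P2 - 1 for stopping condition alpha:
    P1 = Psi_m(kappa^2): the NN point is projected before kappa * r
    P2 = 1 - Psi_m(kappa^2 / c^2) * (n/T): fewer than T far points come first."""
    p1 = chi2(df=m).cdf(kappa ** 2)
    p2 = 1.0 - chi2(df=m).cdf(kappa ** 2 / c ** 2) * n_over_T
    return p1 + p2 - 1.0

# Scan a grid of kappa values for the example parameters c = 4, m = 6, T = 0.00242n:
best = max((guarantee(k / 10, m=6, c=4, n_over_T=1 / 0.00242), k / 10)
           for k in range(1, 100))
print(best)   # any positive value certifies a constant success probability
```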
Choosing κ
[Plot: probability (y-axis) of seeing an object at a given projected distance (x-axis), for c = 4: the density for an object at distance 1 (blue) peaks near the mode m − 2 = 4; for an object at distance 4 (red) it peaks near 4 * c^2 = 64; κ*r is chosen between the two.]
ProjDist(O_T): Case I

[Figure: two axes. Distance to q in d dims: the NN point at r, fewer than n "far" points beyond c*r_0. Distance to Π(q) in m dims: fewer than T "far" points projected before κ*r_0, with ProjDist(O_T) marked.]
- O_min = the NN point
- We consider only the cases where both conditions hold (re. near and far points), which happens with probability ≥ P1 + P2 − 1
ProjDist(O_T): Case II

[Figure: the same setting as Case I, with ProjDist(O_T) marked differently.]
- O_min = a c-NN point
- We consider only the cases where both conditions hold (re. near and far points), which happens with probability ≥ P1 + P2 − 1
Early-termination Condition (β)
- We omit the proof here; it also relies on the fact that the squared sum of the m projected distances follows a scaled χ^2_m distribution
- Key to handling the case where c = 1
  - Returns the NN point with a guaranteed probability; impossible for LSH-based methods
- Guarantees the correctness of the top-k c-ANN points returned when stopped by this condition
  - No such guarantee by any previous method