LOCALITY SENSITIVE HASHING FOR BIG DATA Wei Wang, University of New South Wales 1 9/2/14 The 3rd China-Australia Workshop


Page 1

LOCALITY SENSITIVE HASHING FOR BIG DATA Wei Wang, University of New South Wales


Page 2

Outline

- NN and c-ANN
- Existing LSH methods for large data
- Our approach: SRS [VLDB 2015]
- Conclusions


Page 3

NN and c-ANN Queries

- A set of points, D = ∪i=1..n {Oi}, in d-dimensional Euclidean space (d is large, e.g., hundreds)
- Given a query point q, find the closest point, O*, in D
- Relaxed version:
  - Return a c-ANN point, i.e., one whose distance to q is at most c·Dist(O*, q)
  - May return a c-ANN point with at least constant probability


[Figure: d-dimensional space with D = {a, b, x}; radii r and c·r around query q; NN = a, c-ANN = x; aka (1+ε)-ANN]
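As a concrete illustration of the definitions above, here is a minimal brute-force sketch; the helper names `dist`, `nn`, and `is_c_ann` are ours, not from the talk:

```python
import math

def dist(p, q):
    # Euclidean distance in d-dimensional space
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nn(D, q):
    # exact nearest neighbor O* of q in D
    return min(D, key=lambda o: dist(o, q))

def is_c_ann(o, D, q, c):
    # o is a c-ANN of q iff Dist(o, q) <= c * Dist(O*, q)
    return dist(o, q) <= c * dist(nn(D, q), q)
```

With D = {a, b, x} as in the figure, any point within c times the NN distance passes the test, which is exactly the relaxation the slide describes.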


Page 4

Motivations

- Theoretical interest
  - Fundamental geometric problem: the "post-office problem"
  - Known to be hard due to high dimensionality
- Many applications
  - Feature vectors: data mining, multimedia DB
  - Applications based on similarity queries, e.g.:
    - Quantization in coding/compression
    - Recommendation systems
    - Bioinformatics/chemoinformatics
    - …


Page 5

Applications

- SIFT feature matching
- Image retrieval, understanding, scene rendering
- Video retrieval


Page 6

Challenges

- Curse of dimensionality / concentration of measure
  - Hard to find algorithms sub-linear in n and polynomial in d
  - The approximate version (c-ANN) is not much easier
  - More details on the next slide
- Large data size
  - 1 KB per point at 256 dims ⇒ 1B points = 1 TB
  - ImageNet (Stanford & Princeton): 14 million images, ~100 feature vectors per image ⇒ 1.4B feature vectors
- High dimensionality
  - Finding similar words in D = 198 million tweets [JMLR 2013]


Page 7

Existing Solutions

- NN:
  - O(d^5·log n) query time, O(n^(2d+δ)) space
  - O(d·n^(1−ε(d))) query time, O(dn) space
  - Linear scan: O(dn/B) I/Os, O(dn) space
- (1+ε)-ANN:
  - O(log n + 1/ε^((d−1)/2)) query time, O(n·log(1/ε)) space
  - A probabilistic test removes the exponential dependency on d
  - Fast JLT: O(d·log d + ε^(−3)·log² n) query time, O(n^max(2, ε^(−2))) space
  - LSH-based: Õ(d·n^(ρ+o(1))) query time, Õ(n^(1+ρ+o(1)) + nd) space, where ρ = 1/(1+ε) + o_c(1)


LSH is the best approach using sub-quadratic space

Linear scan is (practically) the best approach using linear space & time


Page 8

Approximate NN for Multimedia Retrieval


- Cover-tree
- Spill-tree
- Reduce to NN search with Hamming distance
- Dimensionality reduction (e.g., PCA)
- Quantization-based approaches (e.g., CK-Means)
- …


They are heuristic methods (i.e., no quality guarantee) or rely on some “hidden” characteristics of the data

Page 9

Locality Sensitive Hashing (LSH)

- Equality search
  - Index: store o into bucket h(o)
  - Query: retrieve every o in bucket h(q); verify whether o = q
- LSH
  - ∀h ∈ LSH family: Pr[h(q) = h(o)] ∝ 1/Dist(q, o)
  - h: ℝ^d → ℤ (technically, dependent on r)
- "Near-by" points (blue) have a higher chance of colliding with q than "far-away" points (red)
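The hash family h: ℝ^d → ℤ that depends on r can be sketched with the standard 2-stable construction h(o) = ⌊(a·o + b)/r⌋; the function name and parameters below are illustrative, not from the talk:

```python
import math
import random

def make_lsh(d, r, rng):
    """One hash from the standard 2-stable LSH family:
    h(o) = floor((a . o + b) / r), with a ~ N(0,1)^d and b ~ U[0, r).
    Closer points agree on h(.) more often, giving the
    Pr[h(q) = h(o)] ~ 1/Dist(q, o) behavior on the slide."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, r)
    def h(o):
        return math.floor((sum(ai * oi for ai, oi in zip(a, o)) + b) / r)
    return h
```

Sampling many such hashes and counting collisions for a near and a far point shows the monotone collision-probability property.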


[Figure: query q with radii r and c·r around points a and b in d-dimensional space]

Page 10

LSH: Indexing & Query Processing

- Indexing
  - For a fixed r:
    - sig(o) = ⟨h1(o), h2(o), …, hk(o)⟩
    - Store o into bucket sig(o)
  - Iteratively increase r
- Query processing
  - Search with a fixed r:
    - Retrieve and "verify" points in the bucket sig(q)
    - Repeat this L times (boosting)
  - Galloping search to find the first good r
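A toy version of the indexing and query steps above for one fixed r, assuming the standard 2-stable hash family (all names and parameter choices are ours):

```python
import math
import random
from collections import defaultdict

def make_sig(d, k, r, rng):
    # sig(o) = <h1(o), ..., hk(o)> with each hi from the 2-stable family
    hashes = []
    for _ in range(k):
        a = [rng.gauss(0.0, 1.0) for _ in range(d)]
        b = rng.uniform(0.0, r)
        hashes.append((a, b))
    def sig(o):
        return tuple(
            math.floor((sum(ai * oi for ai, oi in zip(a, o)) + b) / r)
            for a, b in hashes)
    return sig

def build_index(points, d, k, L, r, rng):
    # L independent tables: the "repeat L times" boosting step
    tables = []
    for _ in range(L):
        sig = make_sig(d, k, r, rng)
        buckets = defaultdict(list)
        for o in points:
            buckets[sig(o)].append(o)
        tables.append((sig, buckets))
    return tables

def query(tables, q, c, r):
    # retrieve and verify points in bucket sig(q) of every table
    out = []
    for sig, buckets in tables:
        for o in buckets.get(sig(q), []):
            if math.dist(o, q) <= c * r and o not in out:
                out.append(o)
    return out
```

A real implementation would also iterate over increasing r; this sketch shows only the fixed-r round that the slide reduces to.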

[Figure: query q with radii r and c·r around points a and b in d-dimensional space]

Annotations: the composite signature controls the false-positive size; repeating L times gives a constant success probability; a fixed r yields the (r, c·r)-NN problem; iterating over r incurs additional cost and only a c² quality guarantee.

Page 11

Locality Sensitive Hashing (LSH)

- Standard LSH
  - c²-ANN ⇒ many R(c^i, c^(i+1))-NN problems
- Superior empirical query performance + theoretical guarantees without assumptions on data distributions
- Main problem:
  - Huge super-linear index size: Õ(log(distNN)·n^(1+ρ+o(1)))
  - Typically hundreds of times larger than the data size


Page 12

Dealing with Large Data /1

- Approach 1: a single hash table, but probing more buckets
  - Multi-probe [VLDB 07] (Princeton)
    - Probe the ±1, ±2, … buckets around sig(q)
    - Heuristic method; sensitive to parameter settings
  - Entropy LSH [SODA 06] (Stanford)
    - Create multiple perturbed queries qi from q
    - Probe the buckets sig(qi)
    - O(n^(2/c)) perturbed queries with O(1) hash tables


Page 13

Dealing with Large Data /2

- Approach 2: distributed LSH
  - Simple distributed LSH
    - Use a (key, value) store to implement (distributed) hash tables
    - Key ⇒ machine ID
    - Network traffic: O(n^Θ(1))
  - Layered LSH [CIKM 12] (Stanford)
    - Use (h(key), key + value), where h() is an LSH
    - Network traffic: O(log(n)^0.5) w.h.p.
    - 13 nodes for 64-dim color histograms of 80 million Tiny Images


Page 14

Dealing with Large Data /3

- Approach 3: parallel LSH
  - PLSH [VLDB 13] (MIT + Intel)
    - Partition data across nodes; broadcast the query, and each node processes its data locally
    - Meticulous hardware-oriented optimizations
      - Fit the indexes into main memory at each node
      - SIMD, cache-friendly bitmaps, 2-level hashing for TLB optimization, other optimizations for streaming data
    - Build predictive performance models for parameter tuning
    - 100 servers (64 GB memory each) for 1B streaming tweets and queries


Page 15

Dealing with Large Data /4

- Approach 4: LSH on external memory with index-size reduction
  - LSB-forest [SIGMOD'09, TODS'10]
    - A different reduction from c²-ANN to R(c^i, c^(i+1))-NN problems
  - C2LSH [SIGMOD'12]
    - Does not use composite hash keys
    - Performs fine-granular counting of the number of collisions in m LSH projections

LSB-forest: O((dn/B)^0.5) query, O((dn/B)^1.5) space
C2LSH: O(n·log(n)/B) query, O(n·log(n)/B) space
SRS (ours): O(n/B) query, O(n/B) space — with small constants in the O(·)

Page 16

Weakness of Ext-Memory LSH Methods

- Existing methods use super-linear space
  - Thousands (or more) of hash tables are needed if rigorous
  - People resort to hashing into binary codes (and using Hamming distance) for multimedia retrieval
- Can only handle c = x², for integer x ≥ 2
  - To enable reusing the hash tables (merging buckets)
- Valuable information is lost (due to quantization)
- Updates? (changes to n and c)

Dataset | LSB-forest | C2LSH | SRS (Ours)
Audio, 40 MB | 1500 MB | 127 MB | 2 MB

Page 17

SRS: Our Proposed Method

- Solves c-ANN queries with O(n) query time and O(n) space, with constant probability
  - The constants hidden in the O(·) are very small
  - The early-termination condition is provably effective
- Advantages: small index, rich functionality, simple
- Central idea:
  - A c-ANN query in d dims ⇒ a kNN query in m dims, with filtering
  - Model the distribution of the m "stable random projections"


Page 18

2-stable Random Projection

- Let D be the distribution used for 2-stable random projection, i.e., the standard normal distribution N(0, 1)
- For two i.i.d. random variables A ~ D and B ~ D: x·A + y·B ~ (x² + y²)^(1/2)·D
- Illustration: [Figure: vector v projected onto r1 and r2; ⟨v, r1⟩ ~ N(0, ‖v‖)]
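The 2-stability property can be checked empirically; this sketch is our own, with arbitrary x and y, and estimates the standard deviation of x·A + y·B:

```python
import random
import statistics

# Empirical check of 2-stability: for i.i.d. A, B ~ N(0, 1),
# x*A + y*B should be distributed as sqrt(x^2 + y^2) * N(0, 1).
rng = random.Random(42)
x, y = 3.0, 4.0
samples = [x * rng.gauss(0.0, 1.0) + y * rng.gauss(0.0, 1.0)
           for _ in range(200_000)]
est_std = statistics.pstdev(samples)  # should be close to sqrt(9 + 16) = 5
```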


Page 19

Dist(O) and ProjDist(O) and Their Relationship


- z1 ≔ ⟨v, r1⟩ ~ N(0, ‖v‖); z2 ≔ ⟨v, r2⟩ ~ N(0, ‖v‖); …
- z1² + ⋯ + zm² ~ ‖v‖²·χ²m, i.e., a scaled chi-squared distribution with m degrees of freedom
- Ψm(x): cdf of the standard χ²m distribution
- m 2-stable random projections map O in d dims to (z1, …, zm) in m dims

[Figure: Q and O with Dist(O) in d dims; Proj(Q) and Proj(O) with ProjDist(O) in m dims]

Think of the zi as "samples"
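A quick empirical check of the relationship above (our own sketch, with arbitrary d and m): the sum of squared projections, normalized by ‖v‖², should follow χ²m, whose mean is m:

```python
import random

# With m Gaussian projection vectors r_i, the statistic
# sum_i <v, r_i>^2 / ||v||^2 follows a chi-squared distribution
# with m degrees of freedom (mean m). d and m are arbitrary here.
rng = random.Random(0)
d, m = 16, 6
v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
norm_sq = sum(c * c for c in v)

def proj_ratio():
    total = 0.0
    for _ in range(m):
        z = sum(c * rng.gauss(0.0, 1.0) for c in v)  # <v, r_i> ~ N(0, ||v||)
        total += z * z
    return total / norm_sq

mean = sum(proj_ratio() for _ in range(20_000)) / 20_000  # ~ E[chi2_m] = m
```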

Page 20

LSH-like Property

- Intuitive idea:
  - If Dist(O1) ≪ Dist(O2), then ProjDist(O1) < ProjDist(O2) with high probability
  - But the converse is NOT true
- With few projections, the NN object in the projected space is most likely not the NN object in the original space, as many far-away objects are projected before the NN/c-NN objects
  - But we can bound the expected number of such cases! (say, T)
- Solution:
  - Perform incremental k-NN search in the projected space until T objects have been accessed
  - Plus an early-termination test


Page 21

Indexing

- Finding the minimum m
  - Input:
    - n, c
    - T ≔ the max # of points the algorithm may access
  - Output:
    - m: the # of 2-stable random projections
    - T' ≤ T: a better bound on T
  - m = O(n/T); we use T = O(n), hence m = O(1), to achieve a linear-size index
- Generate m 2-stable random projections ⇒ n projected points in an m-dimensional space
- Index these projections using any index that supports incremental kNN search, e.g., an R-tree
- Space cost: O(m·n) = O(n)
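A minimal sketch of this indexing step, with a plain list of projected tuples standing in for the R-tree (the function name is ours):

```python
import random

def build_srs_index(points, m, rng):
    """Project each d-dim point with m 2-stable (Gaussian) projections.
    The talk indexes the m-dim points with an R-tree for incremental
    kNN search; a plain list of projected tuples stands in here.
    Space is O(m * n) = O(n) for constant m."""
    d = len(points[0])
    R = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    proj = [tuple(sum(ri * oi for ri, oi in zip(row, o)) for row in R)
            for o in points]
    return R, proj
```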


Page 22

SRS-αβ(T, c, pτ)

- Compute proj(Q)
- Do incremental kNN search from proj(Q), for k = 1 to T    // stopping condition α
  - Compute Dist(Ok)
  - Maintain Omin = argmin1≤i≤k Dist(Oi)
  - If the early-termination test (c, pτ) = TRUE, BREAK    // stopping condition β
- Return Omin

Early-termination test: ψm( c²·ProjDist²(Ok) / Dist²(Omin) ) > pτ

Example setting: c = 4, d = 256, m = 6, T = 0.00242n, B = 1024, pτ = 0.18 ⇒ index = 0.0059n, query = 0.0084n, success probability = 0.13
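A sketch of SRS-αβ under the definitions above, replacing the R-tree's incremental kNN with a sorted scan and implementing ψm via the incomplete-gamma series (all names are ours, and this is an illustration, not the paper's implementation):

```python
import math

def chi2_cdf(x, m):
    # psi_m: cdf of the chi-squared distribution with m degrees of
    # freedom, via the regularized lower incomplete gamma series
    s, t = m / 2.0, x / 2.0
    if t <= 0.0:
        return 0.0
    term = math.exp(s * math.log(t) - t - math.lgamma(s + 1.0))
    total, k = term, 1
    while term > 1e-12 * total:
        term *= t / (s + k)
        total += term
        k += 1
    return total

def srs_alpha_beta(data, proj, q, q_proj, T, c, p_tau, m):
    """Incremental kNN in the m-dim projected space (a sorted linear
    scan stands in for the R-tree), stopping after T points
    (condition alpha) or when
    psi_m(c^2 * ProjDist^2(O_k) / Dist^2(O_min)) > p_tau (condition beta)."""
    order = sorted(range(len(data)),
                   key=lambda i: math.dist(proj[i], q_proj))
    o_min, d_min = None, float("inf")
    for i in order[:T]:
        d = math.dist(data[i], q)            # Dist(O_k), in d dims
        if d < d_min:
            o_min, d_min = i, d
        pd = math.dist(proj[i], q_proj)      # ProjDist(O_k), in m dims
        if chi2_cdf(c * c * pd * pd / (d_min * d_min), m) > p_tau:
            break
    return o_min
```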


Main Theorem: SRS-αβ returns a c-ANN point with probability pτ − f(m, c), with O(n) I/O cost

Page 23

Variations of SRS-αβ(T, c, pτ)

- Compute proj(Q)
- Do incremental kNN search from proj(Q), for k = 1 to T    // stopping condition α
  - Compute Dist(Ok)
  - Maintain Omin = argmin1≤i≤k Dist(Oi)
  - If the early-termination test (c, pτ) = TRUE, BREAK    // stopping condition β
- Return Omin

Variations:
1. SRS-α: better quality; query cost is O(T)
2. SRS-β: best quality; query cost bounded by O(n); handles c = 1
3. SRS-αβ(T, c', pτ): better quality; query cost bounded by O(T)

All with success probability at least pτ

Page 24

Other Results

- A uniformly optimal method for L2
- Easily extended to support top-k c-ANN queries (k > 1)
  - No previously known guarantee on the correctness of returned results
  - We guarantee correctness with probability at least pτ if SRS-αβ stops due to the early-termination condition
    - ≈100% in practice (97% in theory)


Page 25

Experiment Setup

- Algorithms
  - LSB-forest [SIGMOD'09, TODS'10]
  - C2LSH [SIGMOD'12]
  - SRS-* [VLDB'15]
- Data
- Measures
  - Index size, query cost, result quality, success probability


Page 26

Datasets

[Chart: dataset sizes — 12.5 PB, 820 GB, 122 GB, 37 GB]

Page 27

Tiny Image Dataset (8M pts, 384 dims)

- Fastest: SRS-αβ; slowest: C2LSH. Quality is the other way around.
- SRS-α has comparable quality to C2LSH, yet at much lower cost.
- SRS-* dominates LSB-forest.


Page 28

Approximate Nearest Neighbor

  Empirically better than the theoretical guarantee

  With 15% I/Os of linear scan, returns NN with probability 71%

  With 62% I/Os of linear scan, returns NN with probability 99.7%


Page 29

Large Dataset (1 Billion, 128 dims)


Works on a PC (4 GB memory, 500 GB hard disk)

Page 30

Theoretical Guarantees Are Important


Success probability in a hard setting: ≥ 0.78 vs. ≤ 0.29

Page 31

Summary

- Index size is the bottleneck for traditional LSH-based methods
  - Limits the data size one node can support
  - Needs many nodes with large memory to scale to billions of data points
- SRS reduces c-ANN queries in arbitrarily high-dimensional space to kNN queries in a low-dimensional space
  - The index size is approximately m/d of the size of the data file
  - Easily generalized to a parallel implementation
- Ongoing work
  - Scaling to even larger datasets using data partitioning
  - Accommodating rapid updates
  - Other theoretical work on LSH


Page 32

Q&A

Similarity Query Processing Project Homepage: http://www.cse.unsw.edu.au/~weiw/project/simjoin.html


Page 33


Page 34

Backup Slides: Analysis


Page 35

Stopping Condition α

- "near" point: the NN point; denote its distance by r
- "far" points: points whose distance is > c·r
- Then, for any κ > 0 and any o:
  - Pr[ProjDist(o) ≤ κ·r | o is a near point] ≥ ψm(κ²)
  - Pr[ProjDist(o) ≤ κ·r | o is a far point] ≤ ψm(κ²/c²)
  - Both because ProjDist²(o)/Dist²(o) ~ χ²m
- P1: Pr[the NN point is projected before κ·r] ≥ ψm(κ²)
- P2: Pr[# of bad points projected before κ·r is < T] ≥ 1 − (n/T)·ψm(κ²/c²), by Markov's inequality
- Choose κ such that P1 + P2 − 1 > 0
  - Feasible thanks to the good concentration bounds for χ²m
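The choice of κ can be explored numerically; this sketch (ours, with illustrative parameter values) computes the lower bound P1 + P2 − 1, with ψm implemented via the incomplete-gamma series:

```python
import math

def chi2_cdf(x, m):
    # psi_m: chi-squared cdf via the regularized lower incomplete
    # gamma series
    s, t = m / 2.0, x / 2.0
    if t <= 0.0:
        return 0.0
    term = math.exp(s * math.log(t) - t - math.lgamma(s + 1.0))
    total, k = term, 1
    while term > 1e-12 * total:
        term *= t / (s + k)
        total += term
        k += 1
    return total

def success_lower_bound(kappa, m, c, n, T):
    p1 = chi2_cdf(kappa * kappa, m)                            # NN projected early
    p2 = 1.0 - (n / T) * chi2_cdf(kappa * kappa / (c * c), m)  # Markov bound
    return p1 + p2 - 1.0
```

Scanning κ over a grid and keeping the maximizer of this bound is one way to make the "choose κ" step concrete.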


Page 36

Choosing κ

[Plot: probability (y-axis) of seeing an object at a given projected distance (x-axis); blue curve: object at distance 1, red curve: object at distance 4; threshold κ·r marked between the two modes]

- Let c = 4
- Mode = m − 2
- Blue mode: 4; red mode: 4·c² = 64

Page 37

ProjDist(OT): Case I

[Figure: distance to q in d dims (r, c·r0) vs. distance to proj(q) in m dims (κ·r0); less than n "far" points in total, and less than T "far" points projected before ProjDist(OT); Omin = the NN point]

Consider cases where both conditions hold (re. near and far points): this happens with probability at least P1 + P2 − 1.

Page 38

ProjDist(OT): Case II

[Figure: distance to q in d dims (r, c·r0) vs. distance to proj(q) in m dims (κ·r0); less than n "far" points in total, and less than T "far" points projected before ProjDist(OT); Omin = a cNN point]

Consider cases where both conditions hold (re. near and far points): this happens with probability at least P1 + P2 − 1.

Page 39

Early-termination Condition (β)

- We omit the proof here
  - It also relies on the fact that the sum of the m squared projected distances follows a scaled χ²m distribution
- Key to:
  - Handling the case where c = 1
    - Returns the NN point with guaranteed probability
    - Impossible to handle by LSH-based methods
  - Guaranteeing the correctness of the top-k c-ANN points returned when stopped by this condition
    - No such guarantee by any previous method
