LOCALITY SENSITIVE HASHING FOR BIG DATA
Wei Wang, University of New South Wales
The 3rd China-Australia Workshop, 9/2/14
Outline
- NN and c-ANN queries
- Existing LSH methods for large data
- Our approach: SRS [VLDB 2015]
- Conclusions
NN and c-ANN Queries
- A set of points, D = ∪_{i=1..n} {O_i}, in d-dimensional Euclidean space (d is large, e.g., hundreds)
- Given a query point q, find the closest point, O*, in D
- Relaxed version (c-ANN):
  - Return a c-ANN point, i.e., a point whose distance to q is at most c * Dist(O*, q)
  - May return a c-ANN point only with at least constant probability
[Figure: query q in d-dimensional space over D = {a, b, x}; the NN is a, at distance r; any point within c*r of q, such as x, is a valid c-ANN answer. c-ANN is also known as (1+ε)-ANN.]
Motivations
- Theoretical interest
  - Fundamental geometric problem: the "post-office problem"
  - Known to be hard due to high dimensionality
- Many applications
  - Feature vectors: Data Mining, Multimedia DB
  - Applications based on similarity queries, e.g., quantization in coding/compression, recommendation systems, bioinformatics/chemoinformatics, ...
Applications
- SIFT feature matching
- Image retrieval, understanding, scene rendering
- Video retrieval
Challenges
- Curse of dimensionality / concentration of measure
  - Hard to find algorithms sub-linear in n and polynomial in d
  - The approximate version (c-ANN) is not much easier (details on the next slide)
- Large data size
  - 1 KB for a single point with 256 dimensions; 1B points = 1 TB
  - ImageNet (Stanford & Princeton): 14 million images, ~100 feature vectors per image => 1.4B feature vectors
- High dimensionality: e.g., finding similar words in D = 198 million tweets [JMLR 2013]
Existing Solutions
- Exact NN
  - O(d^5 log n) query time, O(n^(2d+δ)) space
  - O(d * n^(1-ε(d))) query time, O(dn) space
  - Linear scan: O(dn/B) I/Os, O(dn) space
- (1+ε)-ANN
  - O(log n + 1/ε^((d-1)/2)) query time, O(n log(1/ε)) space; a probabilistic test removes the exponential dependency on d
  - Fast JLT: O(d log d + ε^(-3) log^2 n) query time, O(n^max(2, ε^(-2))) space
  - LSH-based: Õ(d n^(ρ+o(1))) query time, Õ(n^(1+ρ+o(1)) + nd) space, where ρ = 1/(1+ε) + o_c(1)
- LSH is the best approach using sub-quadratic space
- Linear scan is (practically) the best approach using linear space & time
Approximate NN for Multimedia Retrieval
- Cover-tree
- Spill-tree
- Reduction to NN search with Hamming distance
- Dimensionality reduction (e.g., PCA)
- Quantization-based approaches (e.g., CK-Means)
- ...
These are heuristic methods (i.e., no quality guarantee), or they rely on some "hidden" characteristics of the data.
Locality Sensitive Hashing (LSH)
- Equality search
  - Index: store o in bucket h(o)
  - Query: retrieve every o in bucket h(q), then verify whether o = q
- LSH (a code sketch follows below)
  - ∀h ∈ LSH family: Pr[h(q) = h(o)] ∝ 1/Dist(q, o), where h :: R^d -> Z (technically, dependent on the radius r)
- "Near-by" points (blue) have a higher chance of colliding with q than "far-away" points (red)
[Figure: query q with radii r and c*r; a near point a (blue) and a far point b (red) in d-dimensional space.]
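The deck itself contains no code; the following is a minimal Python sketch of one standard instantiation of such an LSH family for Euclidean distance (the E2LSH-style hash h(o) = ⌊(a·o + b)/w⌋; all names and parameters here are ours, not the speaker's):

```python
import numpy as np

def make_e2lsh_hash(d, w, rng):
    """Sample one h from the 2-stable (Euclidean) LSH family:
    h(o) = floor((a . o + b) / w), with a ~ N(0, I_d), b ~ Uniform[0, w)."""
    a = rng.standard_normal(d)
    b = rng.uniform(0.0, w)
    return lambda o: int(np.floor((a @ o + b) / w))

# Near-by points collide with q far more often than far-away points:
rng = np.random.default_rng(0)
d, w = 64, 4.0
q = rng.standard_normal(d)
near = q + 0.1 * rng.standard_normal(d)    # a "blue" point close to q
far = q + 10.0 * rng.standard_normal(d)    # a "red" point far from q

hashes = [make_e2lsh_hash(d, w, rng) for _ in range(1000)]
print(np.mean([h(q) == h(near) for h in hashes]))   # close to 1
print(np.mean([h(q) == h(far) for h in hashes]))    # close to 0
```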
LSH: Indexing & Query Processing
- Indexing (for a fixed r; this solves the (r, c*r)-NN problem; see the sketch after this list)
  - sig(o) = ⟨h_1(o), h_2(o), ..., h_k(o)⟩; concatenating k hashes controls the false-positive size
  - Store o in bucket sig(o)
  - Iteratively increase r
- Query processing
  - Search with a fixed r: retrieve and "verify" the points in bucket sig(q)
  - Repeat L times (boosting, for constant success probability)
  - Galloping search to find the first good r; this incurs additional cost and gives only a c^2 quality guarantee
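A minimal in-memory sketch of the fixed-r index above, again assuming the E2LSH-style family; k, L, and w are illustrative choices of ours, and the iterative increase of r / galloping search is omitted:

```python
import numpy as np
from collections import defaultdict

class LSHTable:
    """One of L hash tables; the bucket key is the composite
    signature sig(o) = (h_1(o), ..., h_k(o))."""
    def __init__(self, d, k, w, rng):
        self.A = rng.standard_normal((k, d))    # k hash functions at once
        self.b = rng.uniform(0.0, w, size=k)
        self.w = w
        self.buckets = defaultdict(list)

    def sig(self, o):
        return tuple(np.floor((self.A @ o + self.b) / self.w).astype(int))

    def insert(self, idx, o):
        self.buckets[self.sig(o)].append(idx)

    def query(self, q):
        return self.buckets.get(self.sig(q), [])

def lsh_candidates(tables, q):
    """Union of the L probed buckets; each candidate is then 'verified'
    by computing its true distance to q."""
    cand = set()
    for t in tables:
        cand.update(t.query(q))
    return cand

rng = np.random.default_rng(1)
data = rng.standard_normal((10_000, 32))
tables = [LSHTable(d=32, k=8, w=4.0, rng=rng) for _ in range(20)]   # L = 20
for t in tables:
    for i, o in enumerate(data):
        t.insert(i, o)
print(len(lsh_candidates(tables, data[0])))   # candidate set for q = data[0]
```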
Locality Sensitive Hashing (LSH)
- Standard LSH reduces c^2-ANN to many (c^i, c^(i+1))-NN problems
- Superior empirical query performance + theoretical guarantees without assumptions on the data distribution
- Main problem: huge super-linear index size, Õ(log(dist_NN) * n^(1+ρ+o(1))); typically hundreds of times larger than the data itself
Dealing with Large Data /1
Approach 1: a single hash table, but probing more buckets
- Multi-probe LSH [VLDB 07] (Princeton)
  - Probe the ±1, ±2, ... buckets around sig(q)
  - Heuristic method; sensitive to parameter settings
- Entropy LSH [SODA 06] (Stanford)
  - Create multiple perturbed queries q_i from q and probe the buckets sig(q_i)
  - O(n^(2/c)) perturbed queries with O(1) hash tables
Dealing with Large Data /2
Approach 2: distributed LSH
- Simple distributed LSH
  - Use a (key, value) store to implement (distributed) hash tables; key -> machine ID
  - Network traffic: O(n^Θ(1))
- Layered LSH [CIKM 12] (Stanford)
  - Use (h(key), key + value), where h() is itself an LSH
  - Network traffic: O(√(log n)) w.h.p.
  - 13 nodes for 64-dim color histograms of 80 million Tiny Images
Dealing with Large Data /3
Approach 3: parallel LSH
- PLSH [VLDB 13] (MIT + Intel)
  - Partition the data across nodes; broadcast the query and let each node process its data locally
  - Meticulous hardware-oriented optimizations: fit the indexes into main memory at each node; SIMD; cache-friendly bitmaps; 2-level hashing for TLB optimization; other optimizations for streaming data
  - Build predictive performance models for parameter tuning
  - 100 servers (64 GB memory each) for 1B streaming tweets and queries
Dealing with Large Data /4
Approach 4: LSH on external memory, with index size reduction
- LSB-forest [SIGMOD'09, TODS'10]
  - A different reduction from c^2-ANN to (r, c*r)-NN problems
  - O((dn/B)^0.5) query, O((dn/B)^1.5) space
- C2LSH [SIGMOD'12]
  - Does not use composite hash keys; performs fine-granular counting of the number of collisions in m LSH projections
  - O(n log(n)/B) query, O(n log(n)/B) space
- SRS (ours)
  - O(n/B) query, O(n/B) space, with small constants in the O()
Weakness of Ext-Memory LSH Methods
- Existing methods use super-linear space
  - Thousands (or more) of hash tables are needed if one wants rigorous guarantees
  - In practice, people resort to hashing into binary codes (and using Hamming distance) for multimedia retrieval
- Can only handle c = x^2 for integer x ≥ 2, to enable reusing the hash tables (merging buckets)
- Valuable information is lost (due to quantization)
- Updates? (changes to n and c)

Index sizes on the Audio dataset (40 MB): LSB-forest 1500 MB, C2LSH 127 MB, SRS (ours) 2 MB.
SRS: Our Proposed Method
- Solves c-ANN queries with O(n) query time and O(n) space, with constant probability
  - The constants hidden in the O() are very small
  - The early-termination condition is provably effective
- Advantages: small index, rich functionality, simple
- Central idea: reduce a c-ANN query in d dimensions to a kNN query (with filtering) in m dimensions
  - Model the distribution of m "stable random projections"
2-stable Random Projection
- Let D be the 2-stable distribution, i.e., the standard Normal distribution N(0, 1)
- For two i.i.d. random variables A ~ D and B ~ D: x*A + y*B ~ (x^2 + y^2)^(1/2) * D
[Illustration: a vector V projected onto random directions r_1, r_2; ⟨V, r_1⟩ ~ N(0, ǁVǁ).]
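A quick numerical check of the 2-stability property (our own sketch, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(42)
x, y, n = 3.0, 4.0, 1_000_000

# 2-stability: x*A + y*B is distributed as sqrt(x^2 + y^2) * N(0, 1)
A = rng.standard_normal(n)
B = rng.standard_normal(n)
print((x * A + y * B).std())              # ~5.0 = sqrt(3^2 + 4^2)

# Consequence: for r with i.i.d. N(0,1) entries, <v, r> ~ N(0, ||v||)
d = 128
v = rng.standard_normal(d)
r = rng.standard_normal((100_000, d))     # 100k independent projection vectors
print((r @ v).std(), np.linalg.norm(v))   # the two numbers should agree
```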
Dist(O) and ProjDist(O) and Their Relationship
- Apply m 2-stable random projections r_1, ..., r_m: a point O in d dims maps to Proj(O) = (z_1, ..., z_m) in m dims
- With V ≔ O − Q: z_i ≔ ⟨V, r_i⟩ ~ N(0, ǁVǁ)
- z_1^2 + ... + z_m^2 ~ ǁVǁ^2 * χ^2_m, i.e., a scaled Chi-squared distribution with m degrees of freedom
- Ψ_m(x): the cdf of the standard χ^2_m distribution
[Figure: Q and O at distance Dist(O) in d dims; Proj(Q) and Proj(O) at distance ProjDist(O) in m dims.]
- Think of the z_i as "samples" (see the empirical check below)
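A sketch that checks this relationship empirically, i.e., that ProjDist^2(O)/Dist^2(O) follows χ^2_m (assuming SciPy is available; the variable names are ours):

```python
import numpy as np
from scipy.stats import chi2, kstest

rng = np.random.default_rng(7)
d, m = 256, 6
v = rng.standard_normal(d)            # v = O - Q, so Dist(O) = ||v||

# Draw many independent sets of m 2-stable projections and record
# ProjDist^2(O) / Dist^2(O); it should follow the chi^2_m distribution.
samples = []
for _ in range(2000):
    R = rng.standard_normal((m, d))   # m Gaussian projection vectors
    z = R @ v                         # the m projected coordinates ("samples")
    samples.append((z @ z) / (v @ v))

print(kstest(samples, chi2(df=m).cdf))  # a large p-value: consistent with chi^2_m
```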
LSH-like Property
- Intuitive idea: if Dist(O1) ≪ Dist(O2), then ProjDist(O1) < ProjDist(O2) with high probability
- But the converse is NOT true
  - With few projections, the NN object in the projected space is most likely not the NN object in the original space, as many far-away objects are projected before the NN/c-ANN objects
  - But we can bound the expected number of such cases (say, by T)!
- Solution: perform incremental k-NN search in the projected space until T objects have been accessed, plus an early-termination test
Indexing
Finding the minimum m
- Input: n, c, and T ≔ the maximum number of points the algorithm may access
- Output: m, the number of 2-stable random projections, and T' ≤ T, a better bound on T
- m = O(n/T); we use T = O(n), so m = O(1) suffices for a linear-space index

Building the index (see the sketch below)
- Generate m 2-stable random projections => n projected points in an m-dimensional space
- Index the projections using any index that supports incremental kNN search, e.g., an R-tree
- Space cost: O(m * n) = O(n)
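A minimal sketch of this indexing step; an in-memory k-d tree stands in here for the external-memory R-tree that SRS actually uses:

```python
import numpy as np
from scipy.spatial import cKDTree

def build_srs_index(data, m, rng):
    """Project the (n x d) dataset to m dimensions with 2-stable (Gaussian)
    random projections and index the projections for kNN search.
    Space cost: O(m * n) floats, i.e., O(n) for constant m."""
    n, d = data.shape
    R = rng.standard_normal((d, m))   # the m 2-stable random projections
    proj = data @ R                   # n projected points in m dims
    return R, proj, cKDTree(proj)     # k-d tree as a stand-in for the R-tree
```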
SRS-αβ(T, c, pτ)
Algorithm (a code sketch follows below):
1. Compute proj(Q)
2. Incremental kNN search from proj(Q), for k = 1 to T   // stopping condition α
   - Compute Dist(O_k); maintain O_min = argmin_{1≤i≤k} Dist(O_i)
   - If the early-termination test (c, pτ) is TRUE, BREAK   // stopping condition β
3. Return O_min

Early-termination test: Ψ_m( c^2 * ProjDist^2(O_k) / Dist^2(O_min) ) > pτ

Example parameters: c = 4, d = 256, m = 6, T = 0.00242n, B = 1024, pτ = 0.18
=> index size = 0.0059n, query cost = 0.0084n, success probability = 0.13

Main Theorem: SRS-αβ returns a c-NN point with probability pτ − f(m, c), with O(n) I/O cost.
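A sketch of this query loop, reusing build_srs_index from the sketch above. cKDTree exposes no true incremental-kNN iterator, so condition α is approximated here by fetching the first T projected neighbours in one call:

```python
import numpy as np
from scipy.stats import chi2

def srs_alpha_beta(q, data, R, tree, T, c, p_tau):
    """Sketch of SRS-αβ: kNN search in the projected space, with stopping
    condition α (at most T points accessed) and condition β (the
    early-termination test from the slide)."""
    m = R.shape[1]
    proj_q = q @ R                                   # proj(Q) in m dims
    o_min, d_min = None, np.inf
    proj_dists, idxs = tree.query(proj_q, k=T)       # condition α: first T points
    for proj_dist, i in zip(np.atleast_1d(proj_dists), np.atleast_1d(idxs)):
        dist = np.linalg.norm(data[i] - q)           # verify in the original d dims
        if dist < d_min:
            o_min, d_min = int(i), dist
        # condition β: stop once Psi_m(c^2 * ProjDist^2(O_k) / Dist^2(O_min)) > p_tau
        if chi2(df=m).cdf(c**2 * proj_dist**2 / d_min**2) > p_tau:
            break
    return o_min, d_min

rng = np.random.default_rng(0)
data = rng.standard_normal((100_000, 256))
R, proj, tree = build_srs_index(data, m=6, rng=rng)
q = rng.standard_normal(256)
T = int(0.00242 * len(data))                         # T from the example parameters
print(srs_alpha_beta(q, data, R, tree, T=T, c=4, p_tau=0.18))
```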
Variations of SRS-αβ(T, c, pτ)
The same algorithm skeleton as above, with one of the two stopping conditions disabled or the parameters changed:
1. SRS-α (condition α only): better quality; query cost is O(T)
2. SRS-β (condition β only): best quality; query cost bounded by O(n); handles c = 1
3. SRS-αβ(T, c', pτ) for a different c': better quality; query cost bounded by O(T)
All variants have success probability at least pτ.
Other Results
- Uniformly optimal method for L2
- Easily extended to support top-k c-ANN queries (k > 1)
  - No previously known guarantee on the correctness of the returned results
  - We guarantee correctness with probability at least pτ whenever SRS-αβ stops due to the early-termination condition: ≈100% in practice (97% in theory)
Experiment Setup
- Algorithms: LSB-forest [SIGMOD'09, TODS'10], C2LSH [SIGMOD'12], SRS-* [VLDB'15]
- Data: see the next slide
- Measures: index size, query cost, result quality, success probability
Datasets
[Chart comparing sizes across the datasets/methods: 37 GB, 122 GB, 820 GB, 12.5 PB.]
Tiny Image Dataset (8M pts, 384 dims)
- Fastest: SRS-αβ; slowest: C2LSH. Quality is the other way around.
- SRS-α has quality comparable to C2LSH, yet at much lower cost.
- SRS-* dominates LSB-forest.
Approximate Nearest Neighbor
Empirically better than the theoretical guarantee:
- With 15% of the I/Os of a linear scan, returns the NN with probability 71%
- With 62% of the I/Os of a linear scan, returns the NN with probability 99.7%
Large Dataset (1 Billion, 128 dims)

Works on a single PC (4 GB memory, 500 GB hard disk).
Theoretical Guarantees Are Important
Success probability in a hard setting: ≥ 0.78 for SRS vs. ≤ 0.29 for the competitors.
Summary
- Index size is the bottleneck for traditional LSH-based methods
  - Limits the data size one node can support
  - Needs many nodes with large memory to scale to billions of data points
- SRS reduces c-ANN queries in arbitrarily high-dimensional space to kNN queries in a low-dimensional space
  - Our index size is approximately m/d of the size of the data file
  - Easily generalized to a parallel implementation
- Ongoing work
  - Scaling to even larger datasets using data partitioning
  - Accommodating rapid updates
  - Other theoretical work on LSH
Q&A

Similarity Query Processing Project Homepage: http://www.cse.unsw.edu.au/~weiw/project/simjoin.html
Backup Slides: Analysis
Stopping Condition α
- "Near" point: the NN point; its distance ≕ r
- "Far" points: points whose distance is > c * r
- Then, for any κ > 0 and any point o (both facts hold because ProjDist^2(o)/Dist^2(o) ~ χ^2_m):
  - Pr[ProjDist(o) ≤ κ*r | o is a near point] ≥ Ψ_m(κ^2)
  - Pr[ProjDist(o) ≤ κ*r | o is a far point] ≤ Ψ_m(κ^2 / c^2)
- Hence
  - P1 ≔ Pr[the NN point is projected before κ*r] ≥ Ψ_m(κ^2)
  - P2 ≔ Pr[fewer than T far points are projected before κ*r] ≥ 1 − Ψ_m(κ^2 / c^2) * (n/T), by Markov's inequality
- Choose κ such that P1 + P2 − 1 > 0 (a numerical sketch follows below); feasible due to the good concentration bounds for χ^2_m
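A numerical sketch of picking κ from these bounds (the grid and the helper name are ours; the parameters come from the earlier example slide):

```python
from scipy.stats import chi2

def guarantee(kappa, m, c, n_over_T):
    """P1 + P2 - 1 for stopping condition alpha:
    P1 = Psi_m(kappa^2): the NN point is projected before kappa * r
    P2 = 1 - Psi_m(kappa^2 / c^2) * (n/T): fewer than T far points come first."""
    p1 = chi2(df=m).cdf(kappa ** 2)
    p2 = 1.0 - chi2(df=m).cdf(kappa ** 2 / c ** 2) * n_over_T
    return p1 + p2 - 1.0

# Scan a grid of kappa values for the example parameters c = 4, m = 6, T = 0.00242n:
best = max((guarantee(k / 10, m=6, c=4, n_over_T=1 / 0.00242), k / 10)
           for k in range(1, 100))
print(best)   # any positive value certifies a constant success probability
```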
Choosing κ
[Plot: probability (y-axis) of seeing an object at a given projected distance (x-axis), for c = 4: the density for an object at distance 1 (blue) peaks near the mode m − 2 = 4; for an object at distance 4 (red) it peaks near 4 * c^2 = 64; κ*r is chosen between the two.]
ProjDist(O_T): Case I

[Figure: two axes. Distance to q in d dims: the NN point at r, fewer than n "far" points beyond c*r_0. Distance to Π(q) in m dims: fewer than T "far" points projected before κ*r_0, with ProjDist(O_T) marked.]
- O_min = the NN point
- We consider only the cases where both conditions hold (re. near and far points), which happens with probability ≥ P1 + P2 − 1
ProjDist(O_T): Case II

[Figure: the same setting as Case I, with ProjDist(O_T) marked differently.]
- O_min = a c-NN point
- We consider only the cases where both conditions hold (re. near and far points), which happens with probability ≥ P1 + P2 − 1
Early-termination Condition (β)
- We omit the proof here; it also relies on the fact that the squared sum of the m projected distances follows a scaled χ^2_m distribution
- Key to handling the case where c = 1
  - Returns the NN point with a guaranteed probability; impossible for LSH-based methods
- Guarantees the correctness of the top-k c-ANN points returned when stopped by this condition
  - No such guarantee by any previous method