04 LSH Theory


  • Slide 1/45

CS246: Mining Massive Datasets
Jure Leskovec, Stanford University

    http://cs246.stanford.edu

  • Slide 2/45

Goal: Given a large number N (in the millions or billions) of text documents, find pairs that are near duplicates

Applications:

Mirror websites, or approximate mirrors: we don't want to show both in a search

Problems:

Many small pieces of one document can appear out of order in another

Too many documents to compare all pairs

Documents are so large, or so numerous, that they cannot fit in main memory

  • Slide 3/45

1. Shingling: Convert documents to sets of items; a document becomes the set of its k-shingles

2. Minhashing: Convert large sets into short signatures while preserving similarity. We want a hash function such that Pr[h(C1) = h(C2)] = sim(C1, C2). For the Jaccard similarity, Minhash has this property!

3. Locality-sensitive hashing: Focus on pairs of signatures likely to be from similar documents. Split signatures into bands and hash them; documents with similar signatures get hashed into the same buckets: candidate pairs

  • Slide 4/45

[Pipeline figure: Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets and reflect their similarity → Locality-sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity]

  • Slide 5/45

A k-shingle (or k-gram) is a sequence of k tokens that appears in the document

Example: k = 2; D1 = abcab

Set of 2-shingles: C1 = S(D1) = {ab, bc, ca}

Represent a document by the set of hash values of its k-shingles

A natural document similarity measure is then the Jaccard similarity:

Sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
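To make this concrete, here is a minimal Python sketch (mine, not part of the lecture; the second document string is made up):

    # The set of k-shingles (length-k substrings) of a document.
    def shingles(doc: str, k: int = 2) -> set:
        return {doc[i:i + k] for i in range(len(doc) - k + 1)}

    # Jaccard similarity: |C1 ∩ C2| / |C1 ∪ C2|.
    def jaccard(c1: set, c2: set) -> float:
        return len(c1 & c2) / len(c1 | c2)

    C1 = shingles("abcab")    # {'ab', 'bc', 'ca'}, as on the slide
    C2 = shingles("abcd")     # {'ab', 'bc', 'cd'} (illustrative second document)
    print(jaccard(C1, C2))    # 2 shared shingles out of 4 total -> 0.5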

  • Slide 6/45

Prob. h(C1) = h(C2) is the same as Sim(D1, D2): Pr[h(C1) = h(C2)] = Sim(D1, D2)

Similarities of columns and signatures (approximately) match!

Pair:     1-3    2-4    1-2    3-4
Col/Col:  0.75   0.75   0      0
Sig/Sig:  0.67   1.00   0      0

[Figure: Signature matrix M. Three permutations of the rows (5 7 6 3 1 2 4; 4 5 1 6 7 3 2; 3 4 7 2 6 1 5) of the input matrix (shingles × documents; rows 0101, 0101, 1010, 1010, 1010, 1001, 0101) yield the signature rows 1 2 1 2, 1 4 1 2, and 2 1 2 1.]
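A small simulation (my sketch, not course code) of the Minhash property: over many random row permutations, the fraction on which two columns' Minhash values agree approaches their Jaccard similarity.

    import random

    # Minhash of a column under a permutation: the smallest permuted
    # position among the rows in which the column has a 1.
    def minhash(column: set, perm: list) -> int:
        return min(perm[row] for row in column)

    n_rows, n_trials = 7, 10000
    C1, C2 = {0, 3, 6}, {0, 2, 6}    # made-up columns: row indices holding a 1
    agree = 0
    for _ in range(n_trials):
        perm = list(range(n_rows))
        random.shuffle(perm)
        agree += minhash(C1, perm) == minhash(C2, perm)

    print(agree / n_trials)          # close to Jaccard(C1, C2) = 2/4 = 0.5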

  • Slide 7/45

Hash columns of the signature matrix M: similar columns likely hash to the same bucket

Divide matrix M into b bands of r rows each (so M has b·r rows)

Candidate column pairs are those that hash to the same bucket for at least 1 band

[Figure: Matrix M split into b bands of r rows each; each band is hashed to buckets. The example signatures 1 2 1 2, 1 4 1 2, 2 1 2 1 reappear. Inset: probability of sharing at least one bucket vs. similarity, with a threshold s.]
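A banding sketch in Python (the name candidate_pairs and the toy inputs are mine): split each signature into b bands of r rows and report columns that collide in at least one band.

    from collections import defaultdict
    from itertools import combinations

    # signatures: doc id -> list of b*r Minhash values.
    def candidate_pairs(signatures: dict, b: int, r: int) -> set:
        candidates = set()
        for band in range(b):
            buckets = defaultdict(list)   # band slice -> docs hashed there
            for doc, sig in signatures.items():
                buckets[tuple(sig[band * r:(band + 1) * r])].append(doc)
            for docs in buckets.values():
                candidates.update(combinations(sorted(docs), 2))
        return candidates

    # The three signatures from the earlier example: 1212, 1412, 2121.
    sigs = {"d1": [1, 2, 1, 2], "d2": [1, 4, 1, 2], "d3": [2, 1, 2, 1]}
    print(candidate_pairs(sigs, b=2, r=2))   # d1, d2 agree on their second band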

  • Slide 8/45

    The S-curve is where the magic happens

[Figure: probability of sharing at least 1 bucket vs. similarity t of two sets.]

Remember: probability of equal hash values = similarity. This is what 1 band gives you: Pr[h(C1) = h(C2)] = sim(D1, D2).

What we want is a step function with threshold s: no chance if t < s, probability = 1 if t > s.

How do we get a step function? By picking r and b!

  • Slide 9/45

Remember: b bands, r rows per band. Let sim(C1, C2) = t.

Pick some band (r rows):

Prob. that the elements in a single row of columns C1 and C2 are equal = t

Prob. that all rows in a band are equal = t^r

Prob. that some row in a band is not equal = 1 - t^r

Prob. that all bands are not equal = (1 - t^r)^b

Prob. that at least 1 band is equal = 1 - (1 - t^r)^b

[Figure: P(C1, C2 is a candidate pair) = 1 - (1 - t^r)^b, plotted against similarity t.]
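A quick numeric check of the derivation (a sketch; the approximate step location (1/b)^(1/r) is a standard fact, not stated on this slide):

    # Probability that at least one band is equal.
    def p_candidate(t: float, r: int, b: int) -> float:
        return 1 - (1 - t ** r) ** b

    r, b = 5, 10                     # 50 hash functions, as on a later slide
    print((1 / b) ** (1 / r))        # approximate step location, about 0.63
    for t in (0.2, 0.4, 0.6, 0.8):
        print(t, round(p_candidate(t, r, b), 4))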

  • Slide 10/45

[Figure: four S-curves of Prob(candidate pair) vs. similarity, y = 1 - (1 - x^r)^b, for r = 1..10 with b = 1; r = 1 with b = 1..10; r = 5 with b = 1..50; and r = 10 with b = 1..50.]

Given a fixed threshold s, we want to choose r and b such that P(candidate pair) has a step right around s.

  • Slide 11/45

[Pipeline figure, repeated from earlier: Document → the set of strings of length k that appear in the document → Signatures: short vectors that represent the sets and reflect their similarity → Locality-sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity]

  • Slide 12/45

We have used LSH to find similar documents; in reality, columns with high Jaccard similarity in large sparse matrices

e.g., customer/item purchase histories

Can we use LSH for other distance measures?

e.g., Euclidean distance, cosine distance

Let's generalize what we've learned!

  • Slide 13/45

For Minhash signatures, we got a Minhash function for each permutation of rows

An example of a family of hash functions:

A hash function is any function that takes two elements and says whether or not they are "equal"

Shorthand: h(x) = h(y) means "h says x and y are equal"

A family of hash functions is any set of hash functions from which we can pick one at random efficiently

Example: the set of Minhash functions generated from permutations of rows

  • Slide 14/45

Suppose we have a space S of points with a distance measure d

A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:

1. If d(x, y) < d1, then the probability over all h ∈ H that h(x) = h(y) is at least p1

2. If d(x, y) > d2, then the probability over all h ∈ H that h(x) = h(y) is at most p2

  • Slide 15/45

[Figure: Pr[h(x) = h(y)] as a function of d(x, y): high probability (at least p1) for distances below d1, low probability (at most p2) for distances above d2.]

  • Slide 16/45

Let: S = space of all sets, d = Jaccard distance, H = family of Minhash functions for all permutations of rows

Then for any hash function h ∈ H: Pr[h(x) = h(y)] = 1 - d(x, y)

This simply restates the theorem about Minhashing in terms of distances rather than similarities

  • Slide 17/45

Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d

For Jaccard distance, Minhashing gives a (d1, d2, (1 - d1), (1 - d2))-sensitive family for any d1 < d2

If distance < 1/3 (so similarity > 2/3), then the probability that the Minhash values agree is > 2/3

  • Slide 18/45

Can we reproduce the S-curve effect we saw before for any LS family?

The bands technique we learned for signature matrices carries over to this more general setting

Two constructions:

AND construction: like rows in a band

OR construction: like many bands

[Figure: S-curve of the probability of sharing a bucket vs. similarity t.]

  • Slide 19/45

Given family H, construct family H' consisting of r functions from H

For h = [h1, ..., hr] in H', we say h(x) = h(y) if and only if hi(x) = hi(y) for all i, 1 ≤ i ≤ r

Note this corresponds to creating a band of size r

Theorem: If H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, (p1)^r, (p2)^r)-sensitive

Proof: Use the fact that the hi are independent

  • Slide 20/45

Independence of hash functions (HFs) really means that the probability of two HFs both saying "yes" is the product of the probabilities of each saying "yes"

But two Minhash functions (e.g.) could be highly correlated; for example, their permutations might agree in the first one million entries

However, the probabilities in the 4-tuples are over all possible members of H and H'

This is OK for Minhash and others, but it must be part of the LSH-family definition

  • Slide 21/45

Given family H, construct family H' consisting of b functions from H

For h = [h1, ..., hb] in H', h(x) = h(y) if and only if hi(x) = hi(y) for at least one i, 1 ≤ i ≤ b

Theorem: If H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, 1 - (1 - p1)^b, 1 - (1 - p2)^b)-sensitive

Proof: Use the fact that the hi are independent
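The two theorems amount to simple probability transforms; a minimal sketch (the function names are mine):

    # AND construction: all r independent hash functions must agree.
    def p_and(p: float, r: int) -> float:
        return p ** r

    # OR construction: at least one of b independent hash functions agrees.
    def p_or(p: float, b: int) -> float:
        return 1 - (1 - p) ** b

    p1, p2 = 2 / 3, 1 / 3              # the Minhash family from the earlier claim
    print(p_and(p1, 4), p_and(p2, 4))  # both probabilities shrink
    print(p_or(p1, 4), p_or(p2, 4))    # both probabilities grow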

  • Slide 22/45

AND makes all probabilities shrink, but by choosing r correctly we can make the lower probability approach 0 while the higher does not

OR makes all probabilities grow, but by choosing b correctly we can make the upper probability approach 1 while the lower does not

[Figure: two panels of the probability of sharing a bucket vs. the similarity of a pair of items: AND with r = 1..10, b = 1; OR with r = 1, b = 1..10.]

  • Slide 23/45

r-way AND followed by b-way OR construction: exactly what we did with Minhashing

If bands match in all r values, hash to the same bucket; columns that are hashed into at least 1 common bucket become candidates

Take points x and y s.t. Pr[h(x) = h(y)] = p; a single function from H would make (x, y) a candidate pair with probability p

The construction makes (x, y) a candidate pair with probability 1 - (1 - p^r)^b: the S-curve!

Example: Take H and construct H' by the AND construction with r = 4. Then, from H', construct H'' by the OR construction with b = 4

  • Slide 24/45

p     1 - (1 - p^4)^4
.2    .0064
.3    .0320
.4    .0985
.5    .2275
.6    .4260
.7    .6666
.8    .8785
.9    .9860

r = 4, b = 4 transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .8785, .0064)-sensitive family.

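The table above can be reproduced in a few lines (a sketch, with r = b = 4):

    # r-way AND followed by b-way OR: p -> 1 - (1 - p^r)^b.
    def and_or(p: float, r: int = 4, b: int = 4) -> float:
        return 1 - (1 - p ** r) ** b

    for p in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
        print(p, round(and_or(p), 4))   # .2 -> .0064, ..., .8 -> .8785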

  • Slide 25/45

Picking r and b to get the desired performance: 50 hash functions (r = 5, b = 10)

[Figure: Prob(candidate pair) vs. similarity, with a threshold s marked.]

Blue area X: the false negative rate. These are pairs with sim > s, but an X fraction of them won't share a band and thus will never become candidates. This means we will never consider these pairs for the (slow/exact) similarity calculation!

Green area Y: the false positive rate. These are pairs with sim < s that we will nevertheless consider as candidates. This is not too bad: we will consider them for the (slow/exact) similarity computation and then discard them.

  • Slide 26/45

Picking r and b to get the desired performance: 50 hash functions (r · b = 50)

[Figure: S-curves and their thresholds for r = 2, b = 25; r = 5, b = 10; and r = 10, b = 5.]

  • Slide 27/45

Apply a b-way OR construction followed by an r-way AND construction

This transforms probability p into (1 - (1 - p)^b)^r: the same S-curve, mirrored horizontally and vertically

Example: Take H and construct H' by the OR construction with b = 4. Then, from H', construct H'' by the AND construction with r = 4

  • Slide 28/45

p     (1 - (1 - p)^4)^4
.1    .0140
.2    .1215
.3    .3334
.4    .5740
.5    .7725
.6    .9015
.7    .9680
.8    .9936

The example transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9936, .1215)-sensitive family.


  • Slide 29/45

Example: Apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction

This transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9999996, .0008715)-sensitive family

Note this family uses 256 (= 4·4·4·4) of the original hash functions
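Both transforms and their cascade can be checked numerically (a sketch; the function names are mine):

    def and_or(p: float, r: int = 4, b: int = 4) -> float:
        return 1 - (1 - p ** r) ** b       # r-way AND, then b-way OR

    def or_and(p: float, b: int = 4, r: int = 4) -> float:
        return (1 - (1 - p) ** b) ** r     # b-way OR, then r-way AND

    for p in (0.2, 0.8):
        print(p, round(or_and(p), 4), round(and_or(or_and(p)), 7))
    # 0.2 -> 0.1215 -> ~0.0008715 and 0.8 -> 0.9936 -> ~0.9999996, as claimed.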

  • Slide 30/45

Pick any two distances x < y

Start with an (x, y, (1 - x), (1 - y))-sensitive family

Apply constructions to amplify it into an (x, y, p, q)-sensitive family, where p is almost 1 and q is almost 0

The closer to 0 and 1 we get, the more hash functions must be used

  • Slide 31/45

LSH methods for other distance metrics:

Cosine distance: random hyperplanes

Euclidean distance: project on lines

[Pipeline figure: Points → Signatures: short integer signatures that reflect their similarity (design an (x, y, p, q)-sensitive family of hash functions for the particular distance metric; this step depends on the distance function used) → Locality-sensitive Hashing (amplify the family using AND and OR constructions) → Candidate pairs: those pairs of signatures that we need to test for similarity]

  • Slide 32/45

For cosine distance, d(A, B) = θ = arccos(A·B / (‖A‖·‖B‖)), there is a technique called Random Hyperplanes

The technique is similar to Minhashing

Random Hyperplanes is a (d1, d2, (1 - d1/180), (1 - d2/180))-sensitive family for any d1 and d2

Reminder: (d1, d2, p1, p2)-sensitive means:

1. If d(x, y) < d1, then the prob. that h(x) = h(y) is at least p1

2. If d(x, y) > d2, then the prob. that h(x) = h(y) is at most p2

[Figure: vectors A and B with the angle θ between them.]

  • Slide 33/45

Pick a random vector v, which determines a hash function hv with two buckets:

hv(x) = +1 if v·x ≥ 0; hv(x) = -1 if v·x < 0

LS-family H = the set of all functions derived from any such vector

Claim: For points x and y, Pr[h(x) = h(y)] = 1 - d(x, y) / 180
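A simulation sketch of the claim (mine, not from the slides): estimate the angle between two made-up vectors from the fraction of random hyperplanes that do not separate them.

    import random

    # Hash under the hyperplane normal to v: +1 if v·x >= 0, else -1.
    def hv(v, x):
        return 1 if sum(vi * xi for vi, xi in zip(v, x)) >= 0 else -1

    x, y = (1.0, 0.0), (1.0, 1.0)      # points 45 degrees apart
    trials = 10000
    agree = 0
    for _ in range(trials):
        v = [random.gauss(0, 1) for _ in range(2)]
        agree += hv(v, x) == hv(v, y)

    # Invert Pr[h(x) = h(y)] = 1 - d(x, y)/180; the estimate should be near 45.
    print(180 * (1 - agree / trials))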

  • Slide 34/45

[Figure: look in the plane of x and y. A hyperplane normal to v that falls between x and y gives h(x) ≠ h(y) (the red case); one that falls outside the angle gives h(x) = h(y).]

So Prob[red case] = θ / 180, and P[h(x) = h(y)] = 1 - θ/180 = 1 - d(x, y)/180.

  • Slide 35/45


  • Slide 36/45

Pick some number of random vectors, and hash your data for each vector

The result is a signature (sketch) of +1s and -1s for each data point

This can be used for LSH just as we used the Minhash signatures for Jaccard distance

Amplify using AND/OR constructions
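A sketch of such signatures (assumed names, made-up points): hash every point against the same random vectors to get a +1/-1 vector, then compare sketches position by position.

    import random

    # One +1/-1 entry per random vector: the sign of the dot product.
    def sketch(x, vectors):
        return [1 if sum(v * xi for v, xi in zip(vec, x)) >= 0 else -1
                for vec in vectors]

    random.seed(0)
    vectors = [[random.gauss(0, 1) for _ in range(3)] for _ in range(50)]
    s1 = sketch((1.0, 2.0, 0.5), vectors)   # two nearly parallel points
    s2 = sketch((1.1, 1.9, 0.4), vectors)
    print(sum(a == b for a, b in zip(s1, s2)) / len(s1))   # close to 1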

  • Slide 37/45

It is expensive to pick a random vector in M dimensions for large M: we would have to generate M random numbers

A more efficient approach: it suffices to consider only vectors v consisting of +1 and -1 components

Why is this more efficient?

  • Slide 38/45

Simple idea: hash functions correspond to lines

Partition the line into buckets of size a

Hash each point to the bucket containing its projection onto the line

Nearby points are always close; distant points are rarely in the same bucket
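A projection sketch (mine, not from the slides): bucket each point by its projection onto a random unit-length line, with bucket width a.

    import math
    import random

    # Bucket index of x's projection onto the unit vector `line`.
    def bucket(x, line, a):
        return math.floor(sum(l * xi for l, xi in zip(line, x)) / a)

    random.seed(1)
    v = [random.gauss(0, 1) for _ in range(2)]
    norm = math.sqrt(sum(c * c for c in v))
    line = [c / norm for c in v]           # random unit direction

    a = 1.0                                # bucket width
    p1, p2, far = (0.0, 0.0), (0.1, 0.1), (5.0, 5.0)
    print(bucket(p1, line, a), bucket(p2, line, a), bucket(far, line, a))
    # The two nearby points usually share a bucket; the distant one rarely does.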

  • Slide 39/45

Lucky case: points that are close hash to the same bucket; distant points end up in different buckets

Two unlucky cases:

Top: unlucky quantization

Bottom: unlucky projection

[Figure: points projected onto a line partitioned into buckets of size a, illustrating the lucky and unlucky cases.]

  • Slide 40/45


  • Slide 41/45

Bucket width a; a randomly chosen line; points at distance d

If d << a, then the chance that the points fall in the same bucket is high

  • Slide 42/45

[Figure: bucket width a, a randomly chosen line, and points at distance d whose projections onto the line are d·cos θ apart.]

If d >> a, then θ must be close to 90° for there to be any chance the points go to the same bucket.

  • Slide 43/45

If points are at distance d ≥ 2a apart, then they can be in the same bucket only if d·cos θ ≤ a, i.e., cos θ ≤ 1/2

That requires 60° < θ < 90°, which happens with at most 1/3 probability

This yields an (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions for any a

Amplify using AND-OR cascades

  • Slide 44/45

The projection method yields an (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions

For the previous distance measures, we could start with an (x, y, p, q)-sensitive family for any x < y, and drive p and q to 1 and 0 by AND/OR constructions

Here, we seem to need x ≤ y/4: in the calculation on the previous slides we only considered the cases d ≤ a/2 and d ≥ 2a

  • Slide 45/45

But as long as x < y, the probability of points at distance x falling in the same bucket is greater than the probability of points at distance y doing so

Thus, the hash family formed by projecting onto lines is an (x, y, p, q)-sensitive family for some p > q

Then, amplify by AND/OR constructions