04 LSH Theory


  • Slide 1/45

CS246: Mining Massive Datasets
Jure Leskovec, Stanford University

    http://cs246.stanford.edu

  • Slide 2/45

Goal: Given a large number N (in the millions or billions) of text documents, find pairs that are near duplicates

Applications:

Mirror websites, or approximate mirrors: we don't want to show both in a search

Problems:

Many small pieces of one document can appear out of order in another

Too many documents to compare all pairs

Documents are so large, or so numerous, that they cannot fit in main memory

  • Slide 3/45

1. Shingling: Convert documents to sets of items; a document becomes the set of its k-shingles

2. Minhashing: Convert large sets into short signatures while preserving similarity. We want a hash function such that Pr[h(C1) = h(C2)] = sim(C1, C2). For the Jaccard similarity, Minhash has this property!

3. Locality-sensitive hashing: Focus on pairs of signatures likely to be from similar documents. Split signatures into bands and hash them; documents with similar signatures get hashed into the same buckets: candidate pairs

  • Slide 4/45

[Pipeline figure: Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets and reflect their similarity → Locality-sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity]

  • Slide 5/45

A k-shingle (or k-gram) is a sequence of k tokens that appears in the document

Example: k = 2; D1 = abcab

Set of 2-shingles: C1 = S(D1) = {ab, bc, ca}

Represent a document by the set of hash values of its k-shingles

A natural document similarity measure is then the Jaccard similarity:

Sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
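To make this concrete, here is a minimal Python sketch (mine, not part of the lecture; the second document string is made up):

    # The set of k-shingles (length-k substrings) of a document.
    def shingles(doc: str, k: int = 2) -> set:
        return {doc[i:i + k] for i in range(len(doc) - k + 1)}

    # Jaccard similarity: |C1 ∩ C2| / |C1 ∪ C2|.
    def jaccard(c1: set, c2: set) -> float:
        return len(c1 & c2) / len(c1 | c2)

    C1 = shingles("abcab")    # {'ab', 'bc', 'ca'}, as on the slide
    C2 = shingles("abcd")     # {'ab', 'bc', 'cd'} (illustrative second document)
    print(jaccard(C1, C2))    # 2 shared shingles out of 4 total -> 0.5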

  • Slide 6/45

Prob. h(C1) = h(C2) is the same as Sim(D1, D2): Pr[h(C1) = h(C2)] = Sim(D1, D2)

Similarities of columns and signatures (approximately) match!

Pair:     1-3    2-4    1-2    3-4
Col/Col:  0.75   0.75   0      0
Sig/Sig:  0.67   1.00   0      0

[Figure: Signature matrix M. Three permutations of the rows (5 7 6 3 1 2 4; 4 5 1 6 7 3 2; 3 4 7 2 6 1 5) of the input matrix (shingles × documents; rows 0101, 0101, 1010, 1010, 1010, 1001, 0101) yield the signature rows 1 2 1 2, 1 4 1 2, and 2 1 2 1.]
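A small simulation (my sketch, not course code) of the Minhash property: over many random row permutations, the fraction on which two columns' Minhash values agree approaches their Jaccard similarity.

    import random

    # Minhash of a column under a permutation: the smallest permuted
    # position among the rows in which the column has a 1.
    def minhash(column: set, perm: list) -> int:
        return min(perm[row] for row in column)

    n_rows, n_trials = 7, 10000
    C1, C2 = {0, 3, 6}, {0, 2, 6}    # made-up columns: row indices holding a 1
    agree = 0
    for _ in range(n_trials):
        perm = list(range(n_rows))
        random.shuffle(perm)
        agree += minhash(C1, perm) == minhash(C2, perm)

    print(agree / n_trials)          # close to Jaccard(C1, C2) = 2/4 = 0.5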

  • Slide 7/45

Hash columns of the signature matrix M: similar columns likely hash to the same bucket

Divide matrix M into b bands of r rows each (so M has b·r rows)

Candidate column pairs are those that hash to the same bucket for at least 1 band

[Figure: Matrix M split into b bands of r rows each; each band is hashed to buckets. The example signatures 1 2 1 2, 1 4 1 2, 2 1 2 1 reappear. Inset: probability of sharing at least one bucket vs. similarity, with a threshold s.]
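A banding sketch in Python (the name candidate_pairs and the toy inputs are mine): split each signature into b bands of r rows and report columns that collide in at least one band.

    from collections import defaultdict
    from itertools import combinations

    # signatures: doc id -> list of b*r Minhash values.
    def candidate_pairs(signatures: dict, b: int, r: int) -> set:
        candidates = set()
        for band in range(b):
            buckets = defaultdict(list)   # band slice -> docs hashed there
            for doc, sig in signatures.items():
                buckets[tuple(sig[band * r:(band + 1) * r])].append(doc)
            for docs in buckets.values():
                candidates.update(combinations(sorted(docs), 2))
        return candidates

    # The three signatures from the earlier example: 1212, 1412, 2121.
    sigs = {"d1": [1, 2, 1, 2], "d2": [1, 4, 1, 2], "d3": [2, 1, 2, 1]}
    print(candidate_pairs(sigs, b=2, r=2))   # d1, d2 agree on their second band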

  • Slide 8/45

    The S-curve is where the magic happens

[Figure: probability of sharing at least 1 bucket vs. similarity t of two sets.]

Remember: probability of equal hash values = similarity. This is what 1 band gives you: Pr[h(C1) = h(C2)] = sim(D1, D2).

What we want is a step function with threshold s: no chance if t < s, probability = 1 if t > s.

How do we get a step function? By picking r and b!

  • Slide 9/45

Remember: b bands, r rows per band. Let sim(C1, C2) = t.

Pick some band (r rows):

Prob. that the elements in a single row of columns C1 and C2 are equal = t

Prob. that all rows in a band are equal = t^r

Prob. that some row in a band is not equal = 1 - t^r

Prob. that all bands are not equal = (1 - t^r)^b

Prob. that at least 1 band is equal = 1 - (1 - t^r)^b

[Figure: P(C1, C2 is a candidate pair) = 1 - (1 - t^r)^b, plotted against similarity t.]
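A quick numeric check of the derivation (a sketch; the approximate step location (1/b)^(1/r) is a standard fact, not stated on this slide):

    # Probability that at least one band is equal.
    def p_candidate(t: float, r: int, b: int) -> float:
        return 1 - (1 - t ** r) ** b

    r, b = 5, 10                     # 50 hash functions, as on a later slide
    print((1 / b) ** (1 / r))        # approximate step location, about 0.63
    for t in (0.2, 0.4, 0.6, 0.8):
        print(t, round(p_candidate(t, r, b), 4))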

  • Slide 10/45

[Figure: four S-curves of Prob(candidate pair) vs. similarity, y = 1 - (1 - x^r)^b, for r = 1..10 with b = 1; r = 1 with b = 1..10; r = 5 with b = 1..50; and r = 10 with b = 1..50.]

Given a fixed threshold s, we want to choose r and b such that P(candidate pair) has a step right around s.

  • Slide 11/45

[Pipeline figure, repeated from earlier: Document → the set of strings of length k that appear in the document → Signatures: short vectors that represent the sets and reflect their similarity → Locality-sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity]

  • Slide 12/45

We have used LSH to find similar documents; in reality, columns with high Jaccard similarity in large sparse matrices

e.g., customer/item purchase histories

Can we use LSH for other distance measures?

e.g., Euclidean distance, cosine distance

Let's generalize what we've learned!

  • Slide 13/45

For Minhash signatures, we got a Minhash function for each permutation of rows

An example of a family of hash functions:

A hash function is any function that takes two elements and says whether or not they are "equal"

Shorthand: h(x) = h(y) means "h says x and y are equal"

A family of hash functions is any set of hash functions from which we can pick one at random efficiently

Example: the set of Minhash functions generated from permutations of rows

  • Slide 14/45

Suppose we have a space S of points with a distance measure d

A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:

1. If d(x, y) < d1, then the probability over all h ∈ H that h(x) = h(y) is at least p1

2. If d(x, y) > d2, then the probability over all h ∈ H that h(x) = h(y) is at most p2

  • Slide 15/45

[Figure: Pr[h(x) = h(y)] as a function of d(x, y): high probability (at least p1) for distances below d1, low probability (at most p2) for distances above d2.]

  • Slide 16/45

Let: S = space of all sets, d = Jaccard distance, H = family of Minhash functions for all permutations of rows

Then for any hash function h ∈ H: Pr[h(x) = h(y)] = 1 - d(x, y)

This simply restates the theorem about Minhashing in terms of distances rather than similarities

  • Slide 17/45

Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d

For Jaccard distance, Minhashing gives a (d1, d2, (1 - d1), (1 - d2))-sensitive family for any d1 < d2

If distance < 1/3 (so similarity > 2/3), then the probability that the Minhash values agree is > 2/3

  • Slide 18/45

Can we reproduce the S-curve effect we saw before for any LS family?

The bands technique we learned for signature matrices carries over to this more general setting

Two constructions:

AND construction: like rows in a band

OR construction: like many bands

[Figure: S-curve of the probability of sharing a bucket vs. similarity t.]

  • Slide 19/45

Given family H, construct family H' consisting of r functions from H

For h = [h1, ..., hr] in H', we say h(x) = h(y) if and only if hi(x) = hi(y) for all i, 1 ≤ i ≤ r

Note this corresponds to creating a band of size r

Theorem: If H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, (p1)^r, (p2)^r)-sensitive

Proof: Use the fact that the hi are independent

  • Slide 20/45

Independence of hash functions (HFs) really means that the probability of two HFs both saying "yes" is the product of the probabilities of each saying "yes"

But two Minhash functions (e.g.) could be highly correlated; for example, their permutations might agree in the first one million entries

However, the probabilities in the 4-tuples are over all possible members of H and H'

This is OK for Minhash and others, but it must be part of the LSH-family definition

  • Slide 21/45

Given family H, construct family H' consisting of b functions from H

For h = [h1, ..., hb] in H', h(x) = h(y) if and only if hi(x) = hi(y) for at least one i, 1 ≤ i ≤ b

Theorem: If H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, 1 - (1 - p1)^b, 1 - (1 - p2)^b)-sensitive

Proof: Use the fact that the hi are independent
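The two theorems amount to simple probability transforms; a minimal sketch (the function names are mine):

    # AND construction: all r independent hash functions must agree.
    def p_and(p: float, r: int) -> float:
        return p ** r

    # OR construction: at least one of b independent hash functions agrees.
    def p_or(p: float, b: int) -> float:
        return 1 - (1 - p) ** b

    p1, p2 = 2 / 3, 1 / 3              # the Minhash family from the earlier claim
    print(p_and(p1, 4), p_and(p2, 4))  # both probabilities shrink
    print(p_or(p1, 4), p_or(p2, 4))    # both probabilities grow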

  • Slide 22/45

AND makes all probabilities shrink, but by choosing r correctly we can make the lower probability approach 0 while the higher does not

OR makes all probabilities grow, but by choosing b correctly we can make the upper probability approach 1 while the lower does not

[Figure: two panels of the probability of sharing a bucket vs. the similarity of a pair of items: AND with r = 1..10, b = 1; OR with r = 1, b = 1..10.]

  • Slide 23/45

r-way AND followed by b-way OR construction: exactly what we did with Minhashing

If bands match in all r values, hash to the same bucket; columns that are hashed into at least 1 common bucket become candidates

Take points x and y s.t. Pr[h(x) = h(y)] = p; a single function from H would make (x, y) a candidate pair with probability p

The construction makes (x, y) a candidate pair with probability 1 - (1 - p^r)^b: the S-curve!

Example: Take H and construct H' by the AND construction with r = 4. Then, from H', construct H'' by the OR construction with b = 4

  • Slide 24/45

p     1 - (1 - p^4)^4
.2    .0064
.3    .0320
.4    .0985
.5    .2275
.6    .4260
.7    .6666
.8    .8785
.9    .9860

r = 4, b = 4 transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .8785, .0064)-sensitive family.

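The table above can be reproduced in a few lines (a sketch, with r = b = 4):

    # r-way AND followed by b-way OR: p -> 1 - (1 - p^r)^b.
    def and_or(p: float, r: int = 4, b: int = 4) -> float:
        return 1 - (1 - p ** r) ** b

    for p in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
        print(p, round(and_or(p), 4))   # .2 -> .0064, ..., .8 -> .8785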

  • Slide 25/45

Picking r and b to get the desired performance: 50 hash functions (r = 5, b = 10)

[Figure: Prob(candidate pair) vs. similarity, with a threshold s marked.]

Blue area X: the false negative rate. These are pairs with sim > s, but an X fraction of them won't share a band and thus will never become candidates. This means we will never consider these pairs for the (slow/exact) similarity calculation!

Green area Y: the false positive rate. These are pairs with sim < s that we will nevertheless consider as candidates. This is not too bad: we will consider them for the (slow/exact) similarity computation and then discard them.

  • Slide 26/45

Picking r and b to get the desired performance: 50 hash functions (r · b = 50)

[Figure: S-curves and their thresholds for r = 2, b = 25; r = 5, b = 10; and r = 10, b = 5.]

  • Slide 27/45

Apply a b-way OR construction followed by an r-way AND construction

This transforms probability p into (1 - (1 - p)^b)^r: the same S-curve, mirrored horizontally and vertically

Example: Take H and construct H' by the OR construction with b = 4. Then, from H', construct H'' by the AND construction with r = 4

  • Slide 28/45

p     (1 - (1 - p)^4)^4
.1    .0140
.2    .1215
.3    .3334
.4    .5740
.5    .7725
.6    .9015
.7    .9680
.8    .9936

The example transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9936, .1215)-sensitive family.


  • Slide 29/45

Example: Apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction

This transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9999996, .0008715)-sensitive family

Note this family uses 256 (= 4·4·4·4) of the original hash functions
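Both transforms and their cascade can be checked numerically (a sketch; the function names are mine):

    def and_or(p: float, r: int = 4, b: int = 4) -> float:
        return 1 - (1 - p ** r) ** b       # r-way AND, then b-way OR

    def or_and(p: float, b: int = 4, r: int = 4) -> float:
        return (1 - (1 - p) ** b) ** r     # b-way OR, then r-way AND

    for p in (0.2, 0.8):
        print(p, round(or_and(p), 4), round(and_or(or_and(p)), 7))
    # 0.2 -> 0.1215 -> ~0.0008715 and 0.8 -> 0.9936 -> ~0.9999996, as claimed.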

  • Slide 30/45

Pick any two distances x < y

Start with an (x, y, (1 - x), (1 - y))-sensitive family

Apply constructions to amplify it into an (x, y, p, q)-sensitive family, where p is almost 1 and q is almost 0

The closer to 0 and 1 we get, the more hash functions must be used

  • Slide 31/45

LSH methods for other distance metrics:

Cosine distance: random hyperplanes

Euclidean distance: project on lines

[Pipeline figure: Points → Signatures: short integer signatures that reflect their similarity (design an (x, y, p, q)-sensitive family of hash functions for the particular distance metric; this step depends on the distance function used) → Locality-sensitive Hashing (amplify the family using AND and OR constructions) → Candidate pairs: those pairs of signatures that we need to test for similarity]

  • Slide 32/45

For cosine distance, d(A, B) = θ = arccos(A·B / (‖A‖·‖B‖)), there is a technique called Random Hyperplanes

The technique is similar to Minhashing

Random Hyperplanes is a (d1, d2, (1 - d1/180), (1 - d2/180))-sensitive family for any d1 and d2

Reminder: (d1, d2, p1, p2)-sensitive means:

1. If d(x, y) < d1, then the prob. that h(x) = h(y) is at least p1

2. If d(x, y) > d2, then the prob. that h(x) = h(y) is at most p2

[Figure: vectors A and B with the angle θ between them.]

  • Slide 33/45

Pick a random vector v, which determines a hash function hv with two buckets:

hv(x) = +1 if v·x ≥ 0; hv(x) = -1 if v·x < 0

LS-family H = the set of all functions derived from any such vector

Claim: For points x and y, Pr[h(x) = h(y)] = 1 - d(x, y) / 180
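A simulation sketch of the claim (mine, not from the slides): estimate the angle between two made-up vectors from the fraction of random hyperplanes that do not separate them.

    import random

    # Hash under the hyperplane normal to v: +1 if v·x >= 0, else -1.
    def hv(v, x):
        return 1 if sum(vi * xi for vi, xi in zip(v, x)) >= 0 else -1

    x, y = (1.0, 0.0), (1.0, 1.0)      # points 45 degrees apart
    trials = 10000
    agree = 0
    for _ in range(trials):
        v = [random.gauss(0, 1) for _ in range(2)]
        agree += hv(v, x) == hv(v, y)

    # Invert Pr[h(x) = h(y)] = 1 - d(x, y)/180; the estimate should be near 45.
    print(180 * (1 - agree / trials))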

  • Slide 34/45

[Figure: look in the plane of x and y. A hyperplane normal to v that falls between x and y gives h(x) ≠ h(y) (the red case); one that falls outside the angle gives h(x) = h(y).]

So Prob[red case] = θ / 180, and P[h(x) = h(y)] = 1 - θ/180 = 1 - d(x, y)/180.

  • Slide 35/45


  • Slide 36/45

Pick some number of random vectors, and hash your data for each vector

The result is a signature (sketch) of +1s and -1s for each data point

This can be used for LSH just as we used the Minhash signatures for Jaccard distance

Amplify using AND/OR constructions
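A sketch of such signatures (assumed names, made-up points): hash every point against the same random vectors to get a +1/-1 vector, then compare sketches position by position.

    import random

    # One +1/-1 entry per random vector: the sign of the dot product.
    def sketch(x, vectors):
        return [1 if sum(v * xi for v, xi in zip(vec, x)) >= 0 else -1
                for vec in vectors]

    random.seed(0)
    vectors = [[random.gauss(0, 1) for _ in range(3)] for _ in range(50)]
    s1 = sketch((1.0, 2.0, 0.5), vectors)   # two nearly parallel points
    s2 = sketch((1.1, 1.9, 0.4), vectors)
    print(sum(a == b for a, b in zip(s1, s2)) / len(s1))   # close to 1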

  • Slide 37/45

It is expensive to pick a random vector in M dimensions for large M: we would have to generate M random numbers

A more efficient approach: it suffices to consider only vectors v consisting of +1 and -1 components

Why is this more efficient?

  • Slide 38/45

Simple idea: hash functions correspond to lines

Partition the line into buckets of size a

Hash each point to the bucket containing its projection onto the line

Nearby points are always close; distant points are rarely in the same bucket
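A projection sketch (mine, not from the slides): bucket each point by its projection onto a random unit-length line, with bucket width a.

    import math
    import random

    # Bucket index of x's projection onto the unit vector `line`.
    def bucket(x, line, a):
        return math.floor(sum(l * xi for l, xi in zip(line, x)) / a)

    random.seed(1)
    v = [random.gauss(0, 1) for _ in range(2)]
    norm = math.sqrt(sum(c * c for c in v))
    line = [c / norm for c in v]           # random unit direction

    a = 1.0                                # bucket width
    p1, p2, far = (0.0, 0.0), (0.1, 0.1), (5.0, 5.0)
    print(bucket(p1, line, a), bucket(p2, line, a), bucket(far, line, a))
    # The two nearby points usually share a bucket; the distant one rarely does.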

  • Slide 39/45

Lucky case: points that are close hash to the same bucket; distant points end up in different buckets

Two unlucky cases:

Top: unlucky quantization

Bottom: unlucky projection

[Figure: points projected onto a line partitioned into buckets of size a, illustrating the lucky and unlucky cases.]

  • Slide 40/45


  • Slide 41/45

Bucket width a; a randomly chosen line; points at distance d

If d << a, then the chance that the points fall in the same bucket is high

  • Slide 42/45

[Figure: bucket width a, a randomly chosen line, and points at distance d whose projections onto the line are d·cos θ apart.]

If d >> a, then θ must be close to 90° for there to be any chance the points go to the same bucket.

  • Slide 43/45

If points are at distance d ≥ 2a apart, then they can be in the same bucket only if d·cos θ ≤ a, i.e., cos θ ≤ 1/2

That requires 60° < θ < 90°, which happens with at most 1/3 probability

This yields an (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions for any a

Amplify using AND-OR cascades

  • Slide 44/45

The projection method yields an (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions

For the previous distance measures, we could start with an (x, y, p, q)-sensitive family for any x < y, and drive p and q to 1 and 0 by AND/OR constructions

Here, we seem to need x ≤ y/4: in the calculation on the previous slides we only considered the cases d ≤ a/2 and d ≥ 2a

  • Slide 45/45

But as long as x < y, the probability of points at distance x falling in the same bucket is greater than the probability of points at distance y doing so

Thus, the hash family formed by projecting onto lines is an (x, y, p, q)-sensitive family for some p > q

Then, amplify by AND/OR constructions