Large-Scale Similarity Joins With Guarantees

Rasmus Pagh, IT University of Copenhagen

SISAP, October 13, 2015

Scalable Similarity Search

Licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license. The images of book covers, movie posters, and research articles, and the copyright for them, are most likely owned by the publishers. It is believed that the use of low-resolution images qualifies as fair use.


Talk outline

• In theory…

• The (approximate) similarity join problem

• Techniques for candidate set generation

- Locality-sensitive hashing

- Cache-efficiency via recursion

- CoveringLSH: Achieving 100% recall

• In practice…


In theory…


Confession

• I was trained as an algorithm theorist

- Worst case assumptions on data

- Big-O notation

- Papers full of math, rarely experiments

• “In theory there is no difference between theory and practice… But in practice there is!”


Similarity map of CS conferences

Figure 3: Our map of computer science: The map was constructed by embedding the conference graph into a 2-dimensional Euclidean space. Only top-tier conferences (according to Libra) are shown. Note that the map only represents pairwise distances, there is no notion of orientation, i.e. the axes can be chosen arbitrarily.

Source: Kuhn & Wattenhofer. The Theoretic Center of Computer Science. ACM SIGACT News, Vol. 38, No. 4, December 2007, p. 56.

[Annotations on the map: "SISAP (?)", "interaction gap"]


Theory with impact

• Almost always algorithms that are easy to describe, implement, and adapt [Simple description]

• Analysis often simple enough to be taught to undergraduates [Can be taught]

• Solving a real problem [Applicable]


NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]Andrei Broder, Moses Charikar, and Piotr Indyk, recipients of the Paris Kanellakis Theory and Practice Award for algorithms that allow for quickly finding similar entries in large databases, known as locality-sensitive hashing (LSH). […] The Kanellakis Award honors specific theoretical accomplishments that significantly affect the practice of computing.

Main contributions in the years 1997-2002


Now

Soon?


Similarity joins


Similarity join example 1 (record linkage)

Country   Name
USA       IBM
USA       Microsoft
Germany   SAP
China     Baidu

Token       ID
Mircosoft   1
SAP SE      2
I.B.M.      3
baidu.com   4


Similarity join example 2 (classification)

Class      Name     Features
Mammalia   Cat      1100101
Mammalia   Dog      1101101
Reptilia   Snake    1011000
Aves       Parrot   1010111

ID   Image     Features
1    (image)   0101101
2    (image)   1001000
3    (image)   1101101
4    (image)   1101110

Images by Bodlina and Marek Szczepanek


Similarity join example 3 (recommendation)

User           Follows
@rasmuspagh1   {@ERC_SSS, @NateSilver538, @NatureNews, @sapinker, @techreview, …}
@Reza_Zadeh    {@medialab, @techreview, @AndrewYNg, …}
…              …
@simMachines   {@neiltyson, @NatureNews, @RichardDawkins, …}


Algorithmic problem

• Given a tolerance r compute:

$$Q \bowtie_r S = \{(q, x) \in Q \times S \mid \|q - x\| \le r\}$$

where $\|q - x\|$ is the distance measure.

• This talk: Consider n vectors in $\{0,1\}^d$ and Hamming distance.

Example: q = 1100101 and x = 1101101 give $\|q - x\| = 1$.
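The definition above can be sketched as a brute-force join that checks every pair. This is only an illustration of the problem statement, not a method from the talk: the function names `hamming` and `similarity_join` are made up here, and the vectors are the feature strings from the classification example.

```python
from itertools import product

def hamming(q: str, x: str) -> int:
    """Number of positions where two equal-length bit strings differ."""
    if len(q) != len(x):
        raise ValueError("vectors must have the same dimension d")
    return sum(a != b for a, b in zip(q, x))

def similarity_join(Q, S, r):
    """All pairs (q, x) in Q x S with Hamming distance at most r (brute force)."""
    return [(q, x) for q, x in product(Q, S) if hamming(q, x) <= r]

# Feature vectors from the classification example (hypothetical use as test data)
S = ["1100101", "1101101", "1011000", "1010111"]  # Cat, Dog, Snake, Parrot
Q = ["0101101", "1001000", "1101101", "1101110"]  # unlabeled images 1 to 4

print(hamming("1100101", "1101101"))  # 1
for q, x in similarity_join(Q, S, r=1):
    print(q, x)
```

The sketch compares all |Q|·|S| pairs, which is the cost that candidate set generation tries to avoid.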


Variants

• kNN similarity join

• Batched similarity search


2015 Many names…


Similarity join in a picture

[Figure: illustration of the join $Q \bowtie_r S$]


Similarity join in a picture in high dimensional space (d=80)

[Figure: binary vectors with d = 80 bits each, shown as rows of 0/1 values]

Why is this hard?

Because of the CURSE of dimensionality!

• [Williams ’04], [Alman & Williams ’15]: Hamming similarity search in time $n^{0.99} 2^{o(d)}$ ⟹ k-SAT with n variables can be solved in time $c^n$, $c < 2$

• Strong ETH states that this is not possible


c-approximate similarity join

Candidate set $C \supseteq Q \bowtie_r S$

False positive pairs: filter away in time $|Q \bowtie_{cr} S|$
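The filtering step can be sketched as one distance check per candidate pair. This is a minimal illustration under assumptions: the candidate list `C` below is hand-picked for the example (it is the set of pairs within distance 2 among the classification example vectors), and `filter_candidates` is a hypothetical name.

```python
def hamming(q: str, x: str) -> int:
    """Number of positions where two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(q, x))

def filter_candidates(candidates, r):
    """Drop false positives: keep only candidate pairs within distance r."""
    return [(q, x) for q, x in candidates if hamming(q, x) <= r]

# Hypothetical candidate set: contains every pair within distance r = 1,
# plus two false positives at distance 2.
C = [
    ("0101101", "1100101"),  # distance 2, false positive
    ("0101101", "1101101"),  # distance 1
    ("1001000", "1011000"),  # distance 1
    ("1101101", "1100101"),  # distance 1
    ("1101101", "1101101"),  # distance 0
    ("1101110", "1101101"),  # distance 2, false positive
]

result = filter_candidates(C, r=1)
print(len(C), "candidates,", len(result), "pairs in the join")
```

The work done is one distance computation per candidate, so the filtering cost grows with the size of the candidate set rather than with |Q|·|S|.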


Effect of approximation

[Figure: fraction of all pairs within a given Hamming distance, MNIST data set (unary encoding). The radii r and cr are marked on the distance axis, with the corresponding fractions for $Q \bowtie_r S$ and $Q \bowtie_{cr} S$; the gap between them is the additional fraction of pairs that may become candidates. The underlying plots show the CDF of pairwise distances for (a) the MNIST dataset (L1 distance) and (b) the Enron email dataset (Jaccard distance).]

In the rest of the talk: assume $|Q \bowtie_{cr} S|$ is not much larger than $|Q \bowtie_r S|$.
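On a small data set, the assumption can be checked directly by counting pairs within distance r and within distance cr. The sketch below does this by brute force on the feature vectors from the classification example; `join_size` is a hypothetical helper, and r = 1, c = 2 are arbitrary choices for illustration.

```python
def hamming(q: str, x: str) -> int:
    """Number of positions where two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(q, x))

def join_size(Q, S, radius):
    """Number of pairs (q, x) in Q x S within the given Hamming distance."""
    return sum(1 for q in Q for x in S if hamming(q, x) <= radius)

S = ["1100101", "1101101", "1011000", "1010111"]
Q = ["0101101", "1001000", "1101101", "1101110"]

r, c = 1, 2  # illustrative values only
exact = join_size(Q, S, r)
relaxed = join_size(Q, S, c * r)
print(exact, relaxed, relaxed / exact)  # 4 6 1.5
```

A ratio close to 1 means that allowing distances up to cr adds few extra pairs to filter away.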


Candidate set generation

- Locality-sensitive hashing (LSH)


Locality-sensitive hashing

[Figure: the binary data vectors, shown as rows of 0/1 values]

Idea: Consider projection onto a random subset of dimensions, each chosen with probability p

[Indyk & Motwani ’98]
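The projection idea can be sketched as follows. Two vectors at Hamming distance t agree on the projection exactly when none of their t differing dimensions is chosen, which happens with probability (1 − p)^t, so close pairs collide more often than distant ones. The function names, the use of bit strings, the bucketing by projected string, and the `repetitions` and `seed` parameters are assumptions made for this illustration, not details from the talk.

```python
import random
from collections import defaultdict

def sample_dimensions(d, p, rng):
    """Choose each of the d dimensions independently with probability p."""
    return [i for i in range(d) if rng.random() < p]

def project(x, dims):
    """Restrict the bit string x to the chosen dimensions."""
    return "".join(x[i] for i in dims)

def lsh_candidates(Q, S, p, repetitions, seed=0):
    """Pairs (q, x) that agree on at least one of the random projections."""
    rng = random.Random(seed)
    d = len(S[0])
    candidates = set()
    for _ in range(repetitions):
        dims = sample_dimensions(d, p, rng)
        # Bucket the vectors of S by their projection
        buckets = defaultdict(list)
        for x in S:
            buckets[project(x, dims)].append(x)
        # Every vector of S in the same bucket as q becomes a candidate
        for q in Q:
            for x in buckets.get(project(q, dims), []):
                candidates.add((q, x))
    return candidates

S = ["1100101", "1101101", "1011000", "1010111"]
Q = ["0101101", "1001000", "1101101", "1101110"]
print(sorted(lsh_candidates(Q, S, p=0.3, repetitions=5)))
```

The candidates still need filtering, and a pair within distance r can be missed in every repetition, so this sketch on its own gives no guarantee of 100% recall.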


0 1 0 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 1 0 1 1

Idea: Consider projection onto a random subset of dimensions, each chosen with probability p.

Hash function: $h_i(x) = x \wedge a_i$, where $a_i$ is a random bit mask that selects the projected dimensions.

Candidate match: vectors x and q with $h_i(x) = h_i(q)$ for some i.

Repeat enough times that vectors at distance r produce at least one collision.

[Indyk & Motwani ’98]
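The projection idea above can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: vectors are assumed to be stored as Python integers used as bit sets, and the names `random_mask`, `lsh_hash` and `is_candidate` are hypothetical.

```python
import random


def random_mask(d, p, rng):
    """Return a d-bit mask a in which each bit is 1 independently with probability p."""
    a = 0
    for j in range(d):
        if rng.random() < p:
            a |= 1 << j
    return a


def lsh_hash(x, a):
    """h(x) = x AND a: keep only the sampled dimensions of x (assumed int bit set)."""
    return x & a


def is_candidate(x, q, masks):
    """True if x and q collide under at least one of the masks a_1, ..., a_t."""
    return any(lsh_hash(x, a) == lsh_hash(q, a) for a in masks)


if __name__ == "__main__":
    rng = random.Random(42)          # fixed seed only to make the demo repeatable
    masks = [random_mask(16, 0.25, rng) for _ in range(8)]
    x = 0b1011001110001111
    q = x ^ 0b0000000000000100       # differs from x in a single position
    print(is_candidate(x, q, masks))
```

A pair is only a candidate: it still has to be verified by computing the actual distance, since distant vectors can collide as well.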

Probability of collision?

Collision probability for the i-th hash table:

$$\Pr[x \wedge a_i = q \wedge a_i] = (1-p)^{\|x-q\|} \approx e^{-p\|x-q\|}$$

With $p = \ln(n)/(cr)$, the probability is ≈ $1/n$ at distance $cr$ and ≈ $1/n^{1/c}$ at distance $r$.

Compare $x \wedge a_i \overset{?}{=} q \wedge a_i$ for $i = 1, 2, 3, \dots, t$.

$t = n^{1/c}$ repetitions ensure constant success probability.

Possibility of false negatives: Saying ‘No’ when ‘Yes’ is required.

[Indyk & Motwani ’98]
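To make the parameter choice concrete, the following sketch evaluates the formulas above numerically. The function names and the example values of n, r and c are assumptions chosen for illustration; note that the approximation $(1-p)^k \approx e^{-pk}$ is only good when p is small, that is, when cr is large compared to ln(n).

```python
import math


def sampling_probability(n, r, c):
    """p = ln(n) / (c * r), the per-dimension sampling probability."""
    p = math.log(n) / (c * r)
    if not 0 < p < 1:
        raise ValueError("need 0 < ln(n) < c*r for p to be a valid probability")
    return p


def collision_probability(p, dist):
    """Exact probability (1 - p)^dist that no differing position is sampled."""
    return (1 - p) ** dist


def success_probability(p, r, t):
    """Probability that a pair at distance r collides in at least one of t tables."""
    miss = 1 - collision_probability(p, r)
    return 1 - miss ** t


if __name__ == "__main__":
    n, r, c = 10 ** 6, 1000, 2       # illustrative values only
    p = sampling_probability(n, r, c)
    t = math.ceil(n ** (1 / c))      # t = n^(1/c) repetitions
    print("p =", p)
    print("collision probability at distance cr:", collision_probability(p, c * r))
    print("collision probability at distance r: ", collision_probability(p, r))
    print("success probability after t tables:  ", success_probability(p, r, t))
```

With t = n^{1/c} tables, a pair at distance r is missed with probability roughly 1/e, which is the constant failure probability behind the false negatives mentioned above.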

Summary of analysis

Analysis (GIM ’99): Each bit of $a_i$ is 1 with probability p.

Locality-sensitive hashing

Number of operations is $O(dn^{1+1/c})$, expected;
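One way to see where a bound of this form can come from is sketched below. It is a rough accounting, not the talk's own derivation, and rests on stated assumptions: $t = n^{1/c}$ repetitions, $O(d)$ time to hash or compare a vector, and every pair at distance at least $cr$ colliding with probability at most $1/n$ per repetition. Pairs closer than $cr$, which an approximate join may report, are not counted.

```latex
\begin{align*}
\text{hashing cost} &= t \cdot n \cdot O(d) = O\!\left(d\,n^{1+1/c}\right),\\
\mathbb{E}[\text{far collisions per repetition}] &\le \binom{n}{2}\cdot\frac{1}{n} < \frac{n}{2},\\
\mathbb{E}[\text{verification cost}] &\le t \cdot \frac{n}{2} \cdot O(d) = O\!\left(d\,n^{1+1/c}\right).
\end{align*}
```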

Newer theory developments*

| Reference | Exponent, search time | Comment |
|---|---|---|
| Linear search | 1 | |
| Indyk & Motwani, STOC ’98 | 1/c | Worst case upper bound |
| O’Donnell, Wu & Zhou, ITCS ’10 | 1/c | Data indep. LSH, worst case lower bound |
| Dubiner, Trans. Inf. Theory ’10 | 1/(2c-1) | Random data upper bound |
| Andoni & Razenshteyn, STOC ’15 | 1/(2c-1) | Data dep. LSH, worst case upper bound and lower bound |
| Kapralov, PODS ’15 | 4/(c+1) | Linear space upper bound |
| Valiant, FOCS ’12 | $\frac{1}{4-\omega} + O(\cdot)$, with an $O$-term in $1-1/c$ and $\log d$ | Batched search [random data] |
| Karppa, Kaski & Kohonen, SODA ’16 | $(2\omega-3)/3$ | Batched search [c>1, random data, ω > 2.25] |
| Alman & Williams, FOCS ’15 | $1 - \tilde{\Omega}(\log(n)/d)$ | Batched search, c=1 |

* Focus on subquadratic space; lower order terms ignored.

ω < 2.38 (matrix multiplication exponent)

Can LSH work for large data?

• Quote from a database paper: “LSH needs large memory space and long processing time to achieve good performance when searching a massive dataset”.

For similarity join, space is linear: Process one hash function at a time.

• Other issues:

- LSH parameters are pessimistic; chosen to work for worst-case data set.
- Poor use of internal memory: A simple nested loop join often has better I/O complexity.
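The remark that a similarity join can process one hash function at a time can be sketched as follows. This is a simplified illustration under assumptions, not the talk's algorithm: vectors are Python integers used as bit sets, distance is Hamming distance, and `lsh_self_join` and its parameters are hypothetical names. Only one hash table exists at any moment, so the working space beyond the input is linear in n; the `result` set is kept here only to remove duplicate reports across repetitions.

```python
import random
from collections import defaultdict
from itertools import combinations


def hamming(x, y):
    """Hamming distance between two bit sets stored as ints."""
    return bin(x ^ y).count("1")


def lsh_self_join(vectors, d, p, t, r, seed=0):
    """Report index pairs (i, j), i < j, at Hamming distance <= r that collide
    in at least one of t repetitions. Pairs that never collide are missed."""
    rng = random.Random(seed)
    result = set()
    for _ in range(t):
        # Draw a fresh mask a_i: each of the d bits is 1 with probability p.
        a = 0
        for j in range(d):
            if rng.random() < p:
                a |= 1 << j
        # Build the single hash table for this repetition.
        buckets = defaultdict(list)
        for i, x in enumerate(vectors):
            buckets[x & a].append(i)
        # Verify every candidate pair inside each bucket.
        for ids in buckets.values():
            for i, j in combinations(ids, 2):
                if hamming(vectors[i], vectors[j]) <= r:
                    result.add((i, j))
        # The table is discarded here before the next repetition starts.
    return result


if __name__ == "__main__":
    data = [0b00000000, 0b00000001, 0b11111111, 0b11111110]
    print(sorted(lsh_self_join(data, d=8, p=0.3, t=10, r=1)))
```

Because every reported pair is verified, the sketch has no false positives; false negatives remain possible, as noted earlier in the talk.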

On pessimism

• LSH chosen to ensure few collisions at distance cr, even in worst-case scenarios.

Price paid: Also small collision probability at distance r, so need many repetitions.
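To make the price concrete, here is a small sketch of the textbook LSH parameter choice. It is an assumption about the standard analysis, not something spelled out on the slide. Write p1 for the collision probability of one hash bit at distance r and p2 for the collision probability at distance cr; the function name `lsh_parameters` is hypothetical. Concatenating k hashes until far pairs collide with probability at most 1/n also shrinks the near-pair collision probability to p1^k, and roughly 1/p1^k repetitions are needed to compensate.

```python
import math


def lsh_parameters(n, p1, p2):
    """Return (k, near_collision, repetitions) for the worst-case choice.

    k is the smallest integer with p2**k <= 1/n, so that a far pair
    collides with probability at most 1/n in one repetition.
    """
    k = math.ceil(math.log(n) / math.log(1.0 / p2))
    near_collision = p1 ** k              # collision probability at distance r
    repetitions = math.ceil(1.0 / near_collision)
    return k, near_collision, repetitions


# Illustrative numbers only: n = 10^6 points, p1 = 0.9, p2 = 0.8.
k, near, reps = lsh_parameters(1_000_000, 0.9, 0.8)
print(k, near, reps)
```

With these illustrative numbers k is 62, so a near pair collides in a single repetition with probability well below 1%, which is why hundreds of repetitions are needed.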

Candidate set generation
- Cache-efficiency via recursion

I/O model

• Data is stored on an external storage device, with B vectors per storage block

• Internal memory can hold M vectors

• Count the number of block transfers between internal memory and external storage (I/Os)

[Figure: processor P and internal memory M exchanging blocks with external disk D via block I/O. Figure courtesy of Lars Arge]
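As a rough illustration of counting I/Os in this model, the sketch below tallies block transfers for a simple block nested loop join, the baseline mentioned earlier in the talk. The function names and the choice to reserve one block of memory for scanning S are assumptions made for the example.

```python
import math


def scan_ios(n, B):
    """I/Os needed to read n vectors stored B per block."""
    return math.ceil(n / B)


def nested_loop_join_ios(n_q, n_s, M, B):
    """Block transfers for a block nested loop join of Q and S.

    Q is read once in chunks of M - B vectors (one block of memory is
    kept free for streaming S); S is scanned once per chunk of Q.
    Assumes M > B.
    """
    chunk_size = M - B
    chunks = math.ceil(n_q / chunk_size)
    return scan_ios(n_q, B) + chunks * scan_ios(n_s, B)


# 1000 vectors on each side, memory for 110 vectors, 10 vectors per block.
print(nested_loop_join_ios(1000, 1000, M=110, B=10))  # 1100
```

For n vectors on each side this is about n²/(MB) I/Os, the figure an LSH-based join has to beat.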

Cautious LSH

Joint work with Pham, Silvestri, and Stöckel

[Figure: the data set drawn as a matrix of binary vectors]

Idea: Apply a weak LSH with constant collision probability.

Cautious LSH

Joint work with Pham, Silvestri, and Stöckel

[Figure: the binary data matrix split into two recursive subproblems, Q0 ⋈_r S0 and Q1 ⋈_r S1]

Subproblems of size M or less require no further I/Os
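One way to picture the recursion, as a sketch only and not the authors' algorithm: use a single sampled bit as the weak LSH, split Q and S by that bit into two subproblems, and stop recursing once a subproblem fits in memory. The vectors are assumed to be tuples of 0/1 values; `cautious_join` and the `max_depth` safety cutoff are hypothetical names introduced for the example.

```python
import random


def hamming(a, b):
    """Number of coordinates in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def brute_force(Q, S, r):
    """In-memory join: all pairs within Hamming distance r."""
    return {(q, x) for q in Q for x in S if hamming(q, x) <= r}


def cautious_join(Q, S, r, M, rng, depth=0, max_depth=32):
    """Recursive join; subproblems of size <= M are solved in memory."""
    if not Q or not S:
        return set()
    if len(Q) + len(S) <= M or depth >= max_depth:
        return brute_force(Q, S, r)
    # Weak LSH: one random coordinate. A pair at distance r in d
    # dimensions stays together with probability 1 - r/d.
    c = rng.randrange(len(Q[0]))
    out = set()
    for bit in (0, 1):
        Q_b = [q for q in Q if q[c] == bit]   # subproblem Q0 or Q1
        S_b = [x for x in S if x[c] == bit]   # subproblem S0 or S1
        out |= cautious_join(Q_b, S_b, r, M, rng, depth + 1, max_depth)
    return out
```

A single pass like this loses every pair that gets separated at some level; the examples on the following slides deal with that by recursing more than once per node.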

Example, cautious LSH

• Assume weak LSH collision prob. 0.9.

• Prob. that q and x collide i times is $0.9^i$.

• If depth is log n, success probability is ≈ $n^{-0.152}$.

[Figure: recursion tree with root Q ⋈_r S, children Q0 ⋈_r S0 and Q1 ⋈_r S1, and grandchildren Q00 ⋈_r S00, Q01 ⋈_r S01, Q10 ⋈_r S10, Q11 ⋈_r S11; the pair q, x is marked]
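The exponent can be checked with a one-line calculation, assuming the logarithm in "depth is log n" is taken base 2:

```latex
\[
0.9^{\log_2 n}
  = \bigl(2^{\log_2 0.9}\bigr)^{\log_2 n}
  = n^{\log_2 0.9}
  \approx n^{-0.152},
\qquad \text{since } \log_2 0.9 = \frac{\ln 0.9}{\ln 2} \approx -0.152 .
\]
```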

Example 2, cautious LSH

• Assume weak LSH collision prob. 0.5.

• Prob. that q and x collide i times is $0.5^i$.

• Idea: Recurse twice at each node.

[Figure: recursion tree with root Q ⋈_r S, children Q0 ⋈_r S0 and Q1 ⋈_r S1, and grandchildren Q00 ⋈_r S00, Q01 ⋈_r S01, Q10 ⋈_r S10, Q11 ⋈_r S11; the pair q, x is marked]

#subproblems containing q and x: 1 at each level, expected!
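The expectation follows from linearity of expectation, under the assumption that each of the two recursive calls at a node draws its own independent hash function. At level i there are then $2^i$ chains of hash functions, and the pair survives any one chain with probability $0.5^i$:

```latex
\[
\mathbb{E}\bigl[\#\{\text{subproblems at level } i \text{ containing both } q \text{ and } x\}\bigr]
  = 2^{i} \cdot 0.5^{i}
  = 1 .
\]
```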

Aside: Don’t be fooled by great expectations

What is the expected TNT equivalent of asteroid impacts during this talk?

Branching processes

• Abstract setting:

- One pair (q,x) at root problem.

- Generate t subproblems such that (q,x) is “reproduced” in each with probability ≥ 1/t.

• What is the probability that (q,x) is extinct at recursive level i?

- From theory of branching processes: Ω(1/√i)
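The abstract setting is easy to simulate. The sketch below is an illustration under two simplifying assumptions that are not stated on the slide: the reproduction probability is exactly 1/t, and the t subproblems reproduce the pair independently. It estimates how often (q,x) is still present in at least one subproblem at level i; the function names are hypothetical.

```python
import random


def survives(t, levels, rng):
    """Simulate one run; True if (q,x) is still present after 'levels' levels."""
    copies = 1                      # one pair (q,x) at the root problem
    for _ in range(levels):
        # Each current copy spawns t subproblems and is reproduced
        # in each of them independently with probability 1/t.
        copies = sum(1 for _ in range(copies * t) if rng.random() < 1.0 / t)
        if copies == 0:
            return False            # the pair is extinct
    return True


def survival_probability(t, levels, trials=2000, seed=0):
    """Monte Carlo estimate of the probability that the pair is not extinct."""
    rng = random.Random(seed)
    return sum(survives(t, levels, rng) for _ in range(trials)) / trials


for i in (1, 2, 4, 8, 16):
    print(i, survival_probability(2, i))
```

The expected number of copies stays at 1 on every level, yet the estimated survival frequency keeps dropping as i grows, which is the point of the "great expectations" aside.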

Page 95: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

I/O complexity, sketch

38

recursion tree

Page 96: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

I/O complexity, sketch

38

recursion tree

n

Size

of a

ll su

bpro

blem

s

Page 97: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

I/O complexity, sketch

38

recursion tree

n

nt

Size

of a

ll su

bpro

blem

s

Page 98: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

I/O complexity, sketch

38

recursion tree

n

nt

nt2

nt3

Size

of a

ll su

bpro

blem

s

Page 99: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

I/O complexity, sketch

38

recursion tree

n

nt

nt2

nt3

Size

of a

ll su

bpro

blem

s

Expe

cted

ave

rage

#p

oint

s at d

ista

nce

> cr

.n

Page 100: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

I/O complexity, sketch

38

recursion tree

n

nt

nt2

nt3

Size

of a

ll su

bpro

blem

s

Expe

cted

ave

rage

#p

oint

s at d

ista

nce

> cr

.nn/tc

Page 101: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

I/O complexity, sketch

38

recursion tree

n

nt

nt2

nt3

Size

of a

ll su

bpro

blem

s

Expe

cted

ave

rage

#p

oint

s at d

ista

nce

> cr

.nn/tc

n/t2c

n/t3c

⠇ ⠇

Page 102: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

I/O complexity, sketch

38

recursion tree

n

nt

nt2

nt3

Size

of a

ll su

bpro

blem

s

Expe

cted

ave

rage

#p

oint

s at d

ista

nce

> cr

.nn/tc

n/t2c

n/t3c

⠇ ⠇In-memory computation

Page 103: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

I/O complexity, sketch

38

recursion tree

n

nt

nt2

nt3

Size

of a

ll su

bpro

blem

s

Expe

cted

ave

rage

#p

oint

s at d

ista

nce

> cr

.nn/tc

n/t2c

n/t3c

⠇ ⠇In-memory computation

log

tc(n/M

)

Page 104: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

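I/O complexity, sketch

[Figure: recursion tree, annotated per level]

• Size of all subproblems: $n$, $nt$, $nt^2$, $nt^3$, …

• Expected average #points at distance > cr: $n$, $n/t^c$, $n/t^{2c}$, $n/t^{3c}$, …

• Recursion depth: $\log_{t^c}(n/M)$, after which the computation is in-memory

• Slide annotation: "Pessimistic"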
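As a rough illustration of the recursion-tree sketch, the following Python snippet tabulates the two per-level quantities and counts levels until the expected number of far points drops to M. The function name `recursion_levels`, the stopping rule, and the sample parameter values are assumptions made for illustration; they are not taken from the slides.

```python
def recursion_levels(n, t, c, M):
    """Tabulate per-level quantities of the recursion-tree sketch.

    Assumption (hypothetical stopping rule): recursion stops once the
    expected number of far points (distance > cr), n / t^(c*level),
    is at most M.
    """
    rows = []
    level = 0
    far_points = float(n)
    while True:
        total_size = n * t ** level  # size of all subproblems at this level
        rows.append((level, total_size, far_points))
        if far_points <= M:
            break
        far_points /= t ** c  # far points shrink by a factor t^c per level
        level += 1
    return level, rows


# Sample values chosen only to keep the arithmetic exact.
depth, rows = recursion_levels(n=2**20, t=2, c=2, M=2**10)
for level, total_size, far_points in rows:
    print(level, total_size, far_points)
```

With these sample values the loop stops at depth 5, which agrees with $\log_{t^c}(n/M) = \log_4(2^{10}) = 5$.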

I/O complexity

• Simplified: assuming $|Q \bowtie_{cr} S|$ is not much larger than $|Q \bowtie_r S|$,

$$O\left(\left(\frac{n}{M}\right)^{1/c}\left(\frac{n}{B} + \frac{|Q \bowtie_r S|}{MB}\right)\right) \text{ I/Os}$$

- $n/B$: cost of reading input
- $|Q \bowtie_r S|/(MB)$: cost of generating output
- $(n/M)^{1/c}$: overhead

• In general:

$$O\left(\left(\frac{n}{M}\right)^{1/c}\left(\frac{n}{B} + \frac{|Q \bowtie_r S|}{MB}\right) + \frac{|Q \bowtie_{cr} S|}{MB}\right) \text{ I/Os}$$
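To get a feel for the magnitude of the terms in the I/O bound, the expression inside the $O(\cdot)$ can be evaluated numerically. This is only a sketch: the function name `io_bound`, its parameter names, and the sample values are assumptions, and the constant factors hidden by the $O$-notation are ignored.

```python
def io_bound(n, M, B, c, out_r, out_cr=None):
    """Evaluate the expression inside the O(.) of the I/O bound.

    n      : input size
    M, B   : memory size and block size
    c      : approximation factor
    out_r  : |Q join_r S|, number of pairs within distance r
    out_cr : |Q join_cr S|; if None, the simplified bound is used
    """
    overhead = (n / M) ** (1.0 / c)   # (n/M)^(1/c)
    read_input = n / B                # cost of reading input
    write_output = out_r / (M * B)    # cost of generating output
    bound = overhead * (read_input + write_output)
    if out_cr is not None:
        bound += out_cr / (M * B)     # extra term of the general bound
    return bound


# Hypothetical sample values.
print(io_bound(n=2**20, M=2**10, B=2**5, c=2, out_r=2**20))
print(io_bound(n=2**20, M=2**10, B=2**5, c=2, out_r=2**20, out_cr=2**20))
```

For these values the additive $|Q \bowtie_{cr} S|/(MB)$ term is small next to the first term, which is the situation the simplified bound assumes.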

Candidate set generation

- CoveringLSH: Achieving 100% recall

Approximation

Two kinds of approximation:

• Approximate distances (inherent to LSH)

• Allow false positives and negatives (precision and recall below 100%); total recall possible!
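The second kind of approximation can be made concrete by measuring precision and recall of a reported join result against the true result. The snippet below is a minimal sketch; the function name `precision_recall` and the toy pairs are made up for illustration.

```python
def precision_recall(reported, true_pairs):
    """Precision and recall of a reported set of pairs.

    False positives lower precision; false negatives lower recall.
    """
    reported, true_pairs = set(reported), set(true_pairs)
    true_positives = len(reported & true_pairs)
    precision = true_positives / len(reported) if reported else 1.0
    recall = true_positives / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Toy example: one false positive (5, 6), two false negatives (7, 8), (9, 10).
reported = {(1, 2), (3, 4), (5, 6)}
true_pairs = {(1, 2), (3, 4), (7, 8), (9, 10)}
print(precision_recall(reported, true_pairs))
```

Total recall corresponds to a recall of exactly 1.0, meaning that no true pair is missed.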

Page 115: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

Basic correlated LSH

43

1 0 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 0 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 0 1 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 0 1 1 0 0 0 0 1 1 0 1 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 1 0 1 0 1 1 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 0 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 0 1 1 0 1 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 1 1 1 1 1 0 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 1 0 1 1 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 1 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 1 1 0 1 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0 1 1 0 1 0 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 0 1 0 1 1 1 1 0 1 0 1 1 1 0 0 1 1 0 0 1 0 1 1 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 0 0 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 1 0 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 0 0 0 1 1 1 0 0 0 1 1 0 0 1 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 1 1 
0 1 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 1 0 1 1 0 1 1 0 1 1 1 0 1 0 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 0 1 1 1 1 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0 1 1 1 1 0 1 0 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 0 0 1 1 1 1 0 1 1 1 0 0 1 0 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 0 0 0 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 0 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 1 0 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 1 1 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 1 0 1 0 1 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 1 1 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 1 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 1 0 1 1 1 1 0 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 1 1 0 0

0 1 0 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0

partitioning

[Arasu, Ganti & Kaushik ’06]

Page 116: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

Basic correlated LSH

43

1 0 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 0 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 0 1 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 0 1 1 0 0 0 0 1 1 0 1 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 1 0 1 0 1 1 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 0 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 0 1 1 0 1 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 1 1 1 1 1 0 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 1 0 1 1 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 1 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 1 1 0 1 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0 1 1 0 1 0 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 0 1 0 1 1 1 1 0 1 0 1 1 1 0 0 1 1 0 0 1 0 1 1 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 0 0 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 1 0 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 0 0 0 1 1 1 0 0 0 1 1 0 0 1 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 1 1 
0 1 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 1 0 1 1 0 1 1 0 1 1 1 0 1 0 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 0 1 1 1 1 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0 1 1 1 1 0 1 0 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 0 0 1 1 1 1 0 1 1 1 0 0 1 0 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 0 0 0 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 0 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 1 0 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 1 1 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 1 0 1 0 1 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 1 1 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 1 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 1 0 1 1 1 1 0 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 1 1 0 0

0 1 0 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0

h1

partitioning

[Arasu, Ganti & Kaushik ’06]

Page 117: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

Basic correlated LSH

43

1 0 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 0 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 0 1 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 0 1 1 0 0 0 0 1 1 0 1 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 1 0 1 0 1 1 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 0 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 0 1 1 0 1 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 1 1 1 1 1 0 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 1 0 1 1 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 1 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 1 1 0 1 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0 1 1 0 1 0 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 0 1 0 1 1 1 1 0 1 0 1 1 1 0 0 1 1 0 0 1 0 1 1 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 0 0 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 1 0 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 0 0 0 1 1 1 0 0 0 1 1 0 0 1 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 1 1 
0 1 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 1 0 1 1 0 1 1 0 1 1 1 0 1 0 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 0 1 1 1 1 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0 1 1 1 1 0 1 0 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 0 0 1 1 1 1 0 1 1 1 0 0 1 0 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 0 0 0 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 0 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 1 0 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 1 1 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 1 0 1 0 1 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 1 1 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 1 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 1 0 1 1 1 1 0 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 1 1 0 0

0 1 0 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0

h2

partitioning

[Arasu, Ganti & Kaushik ’06]

Page 118: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

Basic correlated LSH

43

1 0 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 0 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 0 1 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 0 1 1 0 0 0 0 1 1 0 1 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 1 0 1 0 1 1 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 0 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 0 1 1 0 1 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 1 1 1 1 1 0 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 1 0 1 1 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 1 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 1 1 0 1 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0 1 1 0 1 0 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 0 1 0 1 1 1 1 0 1 0 1 1 1 0 0 1 1 0 0 1 0 1 1 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 0 0 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 1 0 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 0 0 0 1 1 1 0 0 0 1 1 0 0 1 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 1 1 
0 1 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 1 0 1 1 0 1 1 0 1 1 1 0 1 0 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 0 1 1 1 1 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0 1 1 1 1 0 1 0 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 0 0 1 1 1 1 0 1 1 1 0 0 1 0 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 0 0 0 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 0 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 1 0 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 1 1 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 1 0 1 0 1 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 1 1 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 1 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 1 0 1 1 1 1 0 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 1 1 0 0

0 1 0 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0

h3

partitioning

[Arasu, Ganti & Kaushik ’06]

Page 119: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

Basic correlated LSH

43

1 0 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 0 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 0 1 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 0 1 1 0 0 0 0 1 1 0 1 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 1 0 1 0 1 1 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 0 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 0 1 1 0 1 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 1 1 1 1 1 0 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 1 0 1 1 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 1 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 1 1 0 1 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0 1 1 0 1 0 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 0 1 0 1 1 1 1 0 1 0 1 1 1 0 0 1 1 0 0 1 0 1 1 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 0 0 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 1 0 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 0 0 0 1 1 1 0 0 0 1 1 0 0 1 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 1 1 
0 1 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 1 0 1 1 0 1 1 0 1 1 1 0 1 0 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 0 1 1 1 1 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0 1 1 1 1 0 1 0 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 0 0 1 1 1 1 0 1 1 1 0 0 1 0 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 0 0 0 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 0 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 1 0 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 1 1 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 1 0 1 0 1 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 1 1 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 1 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 1 0 1 1 1 1 0 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 1 1 0 0

0 1 0 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0

h4

partitioning

[Arasu, Ganti & Kaushik ’06]

Page 120: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

Basic correlated LSH

43

1 0 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 1 1 0 1 1 0 0 1 0 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 0 1 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 0 1 1 0 0 0 0 1 1 0 1 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 1 0 1 0 1 1 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 0 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 0 1 1 0 1 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 1 1 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 1 1 1 1 1 0 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 1 0 1 1 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 1 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 1 1 0 1 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0 1 1 0 1 0 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 0 1 0 1 1 1 1 0 1 0 1 1 1 0 0 1 1 0 0 1 0 1 1 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 0 0 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 1 0 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 0 0 0 1 1 1 0 0 0 1 1 0 0 1 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 1 1 
0 1 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 1 0 1 1 0 1 1 0 1 1 1 0 1 0 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 1 0 0 0 0 1 1 1 1 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0 1 1 1 1 0 1 0 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 0 0 1 1 1 1 0 1 1 1 0 0 1 0 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1 1 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 0 0 0 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 0 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 1 0 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 1 1 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 1 1 1 0 1 0 1 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 1 1 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 1 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 1 0 1 1 1 1 0 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 1 1 0 0

0 1 0 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 0

For Hamming distance ≤ 3, a collision is guaranteed!

partitioning

[Arasu, Ganti & Kaushik ’06]
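The guarantee above is a pigeonhole argument: if the vectors are cut into four blocks (an assumption inferred from the labels h12 ... h34 on the later slides) and at most three bits differ, at least one block is untouched. A minimal sketch, with hypothetical helper names `blocks`, `collides`, `flip` and `guarantee_holds`:

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def blocks(x: Sequence[int], parts: int = 4) -> List[Tuple[int, ...]]:
    """Split x into `parts` contiguous blocks of (nearly) equal length."""
    d = len(x)
    bounds = [i * d // parts for i in range(parts + 1)]
    return [tuple(x[bounds[i]:bounds[i + 1]]) for i in range(parts)]


def collides(x: Sequence[int], y: Sequence[int], parts: int = 4) -> bool:
    """True if x and y agree on at least one whole block (same block index)."""
    return any(bx == by for bx, by in zip(blocks(x, parts), blocks(y, parts)))


def flip(x: Sequence[int], positions: Sequence[int]) -> List[int]:
    """Return a copy of x with the given bit positions flipped."""
    y = list(x)
    for p in positions:
        y[p] ^= 1
    return y


def guarantee_holds(x: Sequence[int], max_dist: int = 3, parts: int = 4) -> bool:
    """Exhaustively check every way of flipping up to max_dist bits of x."""
    return all(
        collides(x, flip(x, c), parts)
        for k in range(max_dist + 1)
        for c in combinations(range(len(x)), k)
    )


x = [0, 1] * 8                                # hypothetical 16-bit vector
print(guarantee_holds(x))                     # every vector within distance 3
print(collides(x, flip(x, [0, 5, 10, 15])))   # distance 4, one flip per block
```

The last call shows the limit of the scheme: four differences placed one per block leave no block intact.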


[Arasu et al. ’06]: To bound the probability of collision for distance > 3, randomly permute the dimensions.
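Why the permutation helps can be illustrated with a far pair whose differences all sit in one block. Without a permutation the other blocks always collide; with a random permutation the differences spread out and collisions become rare. The sketch below is a simulation under assumed parameters (64 bits, 4 blocks, 16 differing bits); the function names are hypothetical.

```python
import random
from typing import List, Sequence


def block_collision(x: Sequence[int], y: Sequence[int], parts: int = 4) -> bool:
    """True if x and y agree on at least one of `parts` contiguous blocks."""
    d = len(x)
    bounds = [i * d // parts for i in range(parts + 1)]
    return any(
        list(x[bounds[i]:bounds[i + 1]]) == list(y[bounds[i]:bounds[i + 1]])
        for i in range(parts)
    )


def permuted(x: Sequence[int], perm: Sequence[int]) -> List[int]:
    """Reorder the dimensions of x; Hamming distances are unchanged."""
    return [x[j] for j in perm]


def collision_rate(x: Sequence[int], y: Sequence[int],
                   trials: int = 2000, seed: int = 0) -> float:
    """Estimate the collision probability over random permutations."""
    rng = random.Random(seed)
    perm = list(range(len(x)))
    hits = 0
    for _ in range(trials):
        rng.shuffle(perm)
        hits += block_collision(permuted(x, perm), permuted(y, perm))
    return hits / trials


d = 64
x = [0] * d
y = [1] * 16 + [0] * 48        # distance 16, all differences in the first block

print(block_collision(x, y))   # no permutation: blocks 2-4 are identical
print(collision_rate(x, y))    # random permutation: collisions are rare
```

Because a permutation preserves Hamming distance, the guarantee for distance ≤ 3 is unaffected.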


Basic correlated LSH

[Figure: the same binary vectors; the enumeration step highlights pairs of blocks in turn (h12, h13, h14, ...).]

enumeration

For Hamming distance ≤ 2, a collision is guaranteed in at least one of h12, h13, h14, h23, h24, h34.

[Arasu, Ganti & Kaushik ’06]
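The enumeration step can be sketched as follows, assuming four blocks and one hash function per pair of blocks, so that the key (i, j) below plays the role of h_ij with 0-based indices. With at most two differing bits, at least two blocks are untouched, so the hash built from those two blocks collides. The names `enum_hashes`, `collides` and `flip` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Block = Tuple[int, ...]


def enum_hashes(x: Sequence[int], parts: int = 4,
                keep: int = 2) -> Dict[Tuple[int, ...], Tuple[Block, ...]]:
    """One hash value per choice of `keep` blocks out of `parts`."""
    d = len(x)
    bounds = [i * d // parts for i in range(parts + 1)]
    blocks = [tuple(x[bounds[i]:bounds[i + 1]]) for i in range(parts)]
    return {
        chosen: tuple(blocks[i] for i in chosen)
        for chosen in combinations(range(parts), keep)
    }


def collides(x: Sequence[int], y: Sequence[int]) -> bool:
    """True if some pair-of-blocks hash agrees on x and y."""
    hx, hy = enum_hashes(x), enum_hashes(y)
    return any(hx[key] == hy[key] for key in hx)


def flip(x: Sequence[int], positions: Sequence[int]) -> List[int]:
    """Return a copy of x with the given bit positions flipped."""
    y = list(x)
    for p in positions:
        y[p] ^= 1
    return y


x = [0, 1] * 8                                   # hypothetical 16-bit vector
print(sorted(enum_hashes(x)))                    # the six block pairs
print(all(collides(x, flip(x, c))                # every pair of flipped bits
          for c in combinations(range(16), 2)))
print(collides(x, flip(x, [0, 5, 10])))          # three blocks touched
```

Each hash now covers half of the bits instead of a quarter, which lowers the collision probability for far pairs at the cost of a weaker guarantee (distance 2 rather than 3) and six hash functions instead of four.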


Result in LSH framework

• The bit sampling LSH achieves:

$$\Pr[h(q) = h(x) \mid \|x - q\| = cr] \le \Pr[h(q) = h(x) \mid \|x - q\| = r]^{c}$$

• Bound on PartEnum construction with partitioning + enumeration + permutation:

$$\Pr[h(q) = h(x) \mid \|x - q\| = cr] \le \Pr[h(q) = h(x) \mid \|x - q\| = r]^{0.36c}$$

Need ≈ 3× larger c

A probabilistic argument suggests that it is possible to do much better. But how?
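One way to see where the exponent c for bit sampling comes from, under the assumption (not stated on the slide) that h reads a single uniformly random coordinate out of d and that c ≥ 1:

```latex
% Assumption: h(x) = x_i for a uniformly random coordinate i in {1, ..., d}.
\begin{align*}
\Pr[h(q) = h(x) \mid \|x - q\| = t] &= 1 - \frac{t}{d}, \\
\Pr[h(q) = h(x) \mid \|x - q\| = cr] &= 1 - \frac{cr}{d}
  \le \Bigl(1 - \frac{r}{d}\Bigr)^{c}
  = \Pr[h(q) = h(x) \mid \|x - q\| = r]^{c}
  \quad \text{(Bernoulli's inequality, } c \ge 1\text{)}, \\
0.36\,c' = c \;&\Longrightarrow\; c' = \frac{c}{0.36} \approx 2.8\,c .
\end{align*}
```

Concatenating k independently sampled coordinates raises both sides to the k-th power, so the relation is preserved. The last line is the arithmetic behind "≈ 3× larger c": an exponent of 0.36c matches the bit sampling exponent only when c is scaled by about 2.8.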


Mathematical question

• A (p,r)-covering matrix of dim. d satisfies:

- Has (1-p)d zeros and pd ones in each row (small collision probability);

- for every set of r columns there exists a row with 0s in all of them (collision guarantee).

• Question: How few rows can such a matrix have? (The number of rows is the number of hash functions.)

[Figure: example 7 × 7 covering matrix with p = 4/7, r = 2.]

Related to covering problems in extremal combinatorics, but good constructions have been known only for p = O(1/d).
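The two conditions in the definition can be checked by brute force on small matrices. The sketch below is only a checker, not a construction; `is_covering` is a hypothetical name, and the matrix is assumed to be a list of 0/1 rows of equal length d.

```python
from itertools import combinations
from typing import List, Sequence


def is_covering(matrix: Sequence[Sequence[int]], ones_per_row: int, r: int) -> bool:
    """Check both conditions of a (p, r)-covering matrix, with pd = ones_per_row."""
    d = len(matrix[0])
    # Condition 1: every row has exactly pd ones (hence (1-p)d zeros).
    if any(len(row) != d or sum(row) != ones_per_row for row in matrix):
        return False
    # Condition 2: every set of r columns is all-zero in at least one row.
    return all(
        any(all(row[j] == 0 for j in cols) for row in matrix)
        for cols in combinations(range(d), r)
    )


# Hypothetical toy example with d = 4 and pd = 2.
toy: List[List[int]] = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
print(is_covering(toy, ones_per_row=2, r=1))   # every single column has a zero
print(is_covering(toy, ones_per_row=2, r=2))   # columns 0 and 2 share no zero row
```

The check enumerates all sets of r columns, so it is only practical for small d and r.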


CoveringLSH

• Next: Answer for $d = 2^{r+1} - 1$, $pd = 2^{r} \approx d/2$

Idea: Entry is dot product (mod 2) of row/column ID vectors (Hadamard code)

[Figure: 7 × 7 matrix whose rows and columns are indexed by the nonzero 3-bit binary vectors 001, 010, 011, 100, 101, 110, 111.]

Examples: (0,1,1)·(0,1,0) = 1 and (0,1,1)·(0,1,1) = 0

[P., SODA ’16]

Page 146: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

� � � � � � �� � � � � � �� � � � � � �� � � � � � �� � � � � � �� � � � � � �� � � � � � �

CoveringLSH• Next: Answer for d = 2r+1-1, pd = 2r+1 ≈ d/2

51

Idea: Entry is dot product (mod 2) of row/column ID vectors (Hadamard code)

001

010

011

101

110

111

100

001010011

101110111

100

Inde

x (b

inar

y)

[P., SODA ’16]

Page 147: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

� � � � � � �� � � � � � �� � � � � � �� � � � � � �� � � � � � �� � � � � � �� � � � � � �

CoveringLSH• Next: Answer for d = 2r+1-1, pd = 2r+1 ≈ d/2

51

Lemma:For every set of r vectors in {0,1}r+1 there exists a nonzero vector that is orthogonal to all

Idea: Entry is dot product (mod 2) of row/column ID vectors (Hadamard code)

001

010

011

101

110

111

100

001010011

101110111

100

Inde

x (b

inar

y)

[P., SODA ’16]

Page 148: Large-Scale Similarity Joins With Guarantees · NEW YORK, NY, April 9, 2013—ACM (the Association for Computing Machinery) today announced the winners of six prestigious awards […]

� � � � � � �� � � � � � �� � � � � � �� � � � � � �� � � � � � �� � � � � � �� � � � � � �

CoveringLSH• Next: Answer for d = 2r+1-1, pd = 2r+1 ≈ d/2

51

Lemma:For every set of r vectors in {0,1}r+1 there exists a nonzero vector that is orthogonal to all

Idea: Entry is dot product (mod 2) of row/column ID vectors (Hadamard code)

001

010

011

101

110

111

100

001010011

101110111

100

Inde

x (b

inar

y)

[P., SODA ’16]
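The Hadamard-code idea can be sketched in a few lines: index rows and columns by the nonzero vectors of {0,1}^(r+1) and fill each entry with their dot product mod 2. This is an illustrative sketch only; the names `hadamard_covering` and `covers_all` are assumptions, and ID vectors are encoded as integers 1 to 2^(r+1)-1.

```python
from itertools import combinations

def hadamard_covering(r):
    """Build the (2^(r+1)-1) x (2^(r+1)-1) matrix whose entry (u, v) is <u, v> mod 2."""
    d = 2 ** (r + 1) - 1
    ids = range(1, d + 1)  # nonzero vectors of {0,1}^(r+1), as integers
    # popcount(u & v) mod 2 is the dot product of the bit vectors over GF(2)
    return [[bin(u & v).count("1") % 2 for v in ids] for u in ids]

def covers_all(matrix, r):
    """Check that every set of r columns has a row that is 0 in all of them."""
    d = len(matrix[0])
    return all(
        any(all(row[c] == 0 for c in cols) for row in matrix)
        for cols in combinations(range(d), r)
    )

m = hadamard_covering(2)
print(len(m), sum(m[0]))   # 7 rows, 4 ones in the first row
print(m[2][1], m[2][2])    # (0,1,1)·(0,1,0) = 1 and (0,1,1)·(0,1,1) = 0
print(covers_all(m, 2))    # True
```

The covering property follows from the lemma: r column vectors span a subspace of dimension at most r in an (r+1)-dimensional space, so some nonzero row vector is orthogonal to all of them.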

Optimality?

• Could there be a smaller (4/7,2)-covering?

• We need to “cover” $\binom{7}{2} = 21$ sets of two columns

• Each vector can cover $\binom{3}{2} = 3$ sets of two columns

• So at least 21/3 = 7 vectors are needed: 7 vectors is optimal!

[Figure: 7 × 7 binary matrix]

For r>2: Within factor 2 of optimal
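The counting argument on this slide generalizes directly: divide the number of r-subsets of columns by the number of r-subsets a single row can cover. A small sketch, with the hypothetical helper name `covering_lower_bound`:

```python
from math import comb

def covering_lower_bound(d, zeros_per_row, r):
    """Lower bound on the number of rows: ceil(C(d, r) / C(zeros_per_row, r))."""
    total = comb(d, r)                # sets of r columns that must be covered
    per_row = comb(zeros_per_row, r)  # sets a single row can cover
    return -(-total // per_row)       # ceiling division on integers

print(covering_lower_bound(7, 3, 2))   # 7, matching the 7-row construction
print(covering_lower_bound(15, 7, 3))  # 13, versus 15 rows in the construction for r = 3
```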

Optimality?

• Could there be a smaller (1/2,r)-covering?

• We need to cover $\binom{d}{r}$ sets of r columns

• Each vector can cover $\binom{d/2}{r}$ sets of r columns

• $\binom{d}{r} / \binom{d/2}{r} > 2^r$: Within factor 2 of optimal!

[Figure: binary matrix with $2^{r+1} - 1$ rows and $2^{r+1} - 1$ columns]
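The ratio bound can be written out term by term. The derivation below is a sketch that makes one assumption explicit: the slide's "d/2" is read as the exact number of zeros per row, z = (d-1)/2 = 2^r - 1, for d = 2^{r+1} - 1.

```latex
% z = (d-1)/2 = 2^r - 1 zeros per row, d = 2^{r+1} - 1 columns
\[
\frac{\binom{d}{r}}{\binom{z}{r}}
  = \prod_{i=0}^{r-1} \frac{d-i}{z-i}
  = \prod_{i=0}^{r-1} \frac{2(d-i)}{d-1-2i}
  > 2^{r},
\]
% because d - i > d - 1 - 2i > 0 for 0 <= i < r.
```

Any covering therefore needs more than 2^r rows, while the construction uses 2^{r+1} - 1 < 2 · 2^r rows, which is where the factor 2 comes from.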

Use in similarity search

• Combine with random permutation trick: Each bit sampled w. prob. 1/2 in each hash value.

• “Sweet spot” is for r = log(n)/c:

- $2^{r+1} = 2 n^{1/c}$ hash functions

- Collision probability 1/n at distance cr = log(n), so number of “far” collisions insignificant

• Matches bound of Indyk and Motwani.
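The sweet-spot arithmetic can be checked numerically. The sketch below assumes base-2 logarithms and models the collision probability at Hamming distance k as 2^(-k), i.e. each differing bit is independently sampled with probability 1/2; the function name `sweet_spot` is hypothetical.

```python
import math

def sweet_spot(n, c):
    """Return (r, number of hash functions, collision probability at distance c*r)."""
    r = math.log2(n) / c                  # radius at the sweet spot
    num_hashes = 2 ** (r + 1)             # equals 2 * n**(1/c)
    far_collision_prob = 0.5 ** (c * r)   # equals 1/n under the sampling model above
    return r, num_hashes, far_collision_prob

r, num_hashes, prob = sweet_spot(2 ** 24, 3)
print(r, num_hashes, prob * 2 ** 24)  # 8.0 512.0 1.0
```

With n = 2^24 and c = 3 this gives radius 8 and 512 hash functions, and the expected number of far collisions per hash function is about n · (1/n) = 1.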

Smaller radius?

• Can map vectors from $\{0,1\}^d$ to $\{0,1\}^{td}$, increasing all distances by an integer factor t.

• Try to “hit” sweet spot tr = log(n)/c

- Details: arXiv:1507.03225 [cs.DS]

[Figure: $q^t$ and $x^t$, each formed by concatenating t copies of a 7-bit vector (1100101 and 1101101)]
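One simple mapping with this property is concatenating t copies of each vector, which multiplies every Hamming distance by exactly t. A minimal sketch using the two bit strings from the slide; the helper names are assumptions.

```python
def replicate(x, t):
    """Map a bit string in {0,1}^d to {0,1}^(t*d) by concatenating t copies."""
    return x * t

def hamming(a, b):
    """Number of positions in which two equal-length strings differ."""
    return sum(u != v for u, v in zip(a, b))

x, q = "1100101", "1101101"
print(hamming(x, q))                              # 1
print(hamming(replicate(x, 5), replicate(q, 5)))  # 5
```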

Example (r = 8, c = 3)

[Figure: plot of time versus n (size of set), both axes logarithmic, n from 10^6 to 10^18 and time from 1 to 10^20. Title: Similarity search, radius 8, approx. factor 3. Curves: Linear search; Exhaustive search in Hamming ball; Classical LSH, error prob. 1/n; Classical LSH, error prob. 1%; CoveringLSH (1 partition)]

Larger radius?

• Partition to reduce to pd < d/2 sampled bits:

- Iterate over 1/(2p) parts

• Distance in some part will be ≤ 2pr

- Use CoveringLSH with radius 2pr on each part

- Details: arXiv:1507.03225 [cs.DS]
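The claim that some part has small distance is a pigeonhole argument: the per-part distances sum to the total distance, so the smallest is at most the average. The sketch below uses contiguous parts for simplicity, which is an assumption; see the arXiv paper for the partition actually used.

```python
def part_distances(x, y, k):
    """Split coordinates into k contiguous parts and return the Hamming distance in each."""
    d = len(x)
    bounds = [i * d // k for i in range(k + 1)]
    return [
        sum(a != b for a, b in zip(x[lo:hi], y[lo:hi]))
        for lo, hi in zip(bounds, bounds[1:])
    ]

x = [0] * 12
y = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]  # distance 3 from x
dists = part_distances(x, y, 4)
print(dists)                     # [2, 0, 1, 0]
print(min(dists) <= sum(dists) / 4)  # True: some part has distance at most r/k
```

With k = 1/(2p) parts and total distance at most r, the smallest part distance is at most r/k = 2pr.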

[Figure: vector split into Part 1, Part 2, …, Part 1/(2p)]

Example (r = 256, c = 2)

[Figure: plot of time versus n (size of set), both axes logarithmic, n from 10^6 to 10^18 and time from 10^5 to 10^21. Title: Similarity search, radius 256, approx. factor 2. Curves: Linear search; Classical LSH, error prob. 1/n; Classical LSH, error prob. 1%; CoveringLSH (legend reads “1 partition” in one copy of the plot and “multi-partition” in the other)]

Euclidean space - shifted grids

In same cell ⇒ (d+1)-approximate near neighbor

Time d+1. Approx. factor d+1.

[Figure: shifted grids, labelled (d+1)r]
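The slide does not spell out the grid construction, so the following sketch rests on an assumption: d+1 grids with cell side (d+1)r, where grid j is shifted by j·r in every coordinate. Under that assumption, two points that differ by at most r in each coordinate share a cell in at least one grid, because each coordinate can separate them in at most one of the d+1 grids. The function name `shared_grids` is hypothetical.

```python
def shared_grids(x, y, r):
    """Return the indices j of the shifted grids in which x and y fall in the same cell."""
    d = len(x)
    side = (d + 1) * r  # cell side length, as labelled on the slide
    return [
        j
        for j in range(d + 1)
        # grid j is shifted by j*r along every axis (assumed construction)
        if all((a - j * r) // side == (b - j * r) // side for a, b in zip(x, y))
    ]

print(shared_grids((3, 6), (3, 5), 1))  # [1, 2]: grid 0 separates the points, grids 1 and 2 do not
```

Looking up a query in each of the d+1 grids gives the "time d+1" on the slide; the approximation factor reflects how far apart two points in the same cell can be.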

JL dimension reduction

Euclidean vector x → random linear mapping → Projection

Length concentrated around $\|x\|_2$

Lengths can increase ⇒ false negatives possible
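A Johnson-Lindenstrauss style projection can be sketched with a random Gaussian matrix scaled by 1/sqrt(k). This is one standard choice, assumed here for illustration; the talk does not say which random linear mapping is meant, and the helper names are hypothetical.

```python
import math
import random

def make_projection(d, k, rng):
    """Random k x d matrix with N(0, 1/k) entries."""
    scale = 1.0 / math.sqrt(k)
    return [[scale * rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]

def apply(matrix, x):
    """Matrix-vector product."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(300)]
proj = apply(make_projection(300, 100, rng), x)
print(norm(proj) / norm(x))  # close to 1, but may be above or below it
```

Because the projected length can exceed the original, a pair within distance r may be mapped to a distance above the threshold, which is the source of the false negatives mentioned on the slide.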

Correlated dimension reduction

Euclidean vector x → random rotation → Rotated vector 𝜋(x) → partitioning, scaling by t → Projection 1, Projection 2, …, Projection t

Length concentrated around $\|x\|_2$

Some projection will have length at most $\|x\|_2$
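The guarantee that some projection is no longer than x follows from the rotation preserving length: the squared lengths of the t blocks sum to the squared length of x, so at least one block is at most average. The sketch below makes two assumptions: the rotation is built by Gram-Schmidt on Gaussian vectors, and "scaling by t" is applied to squared lengths, i.e. each block is multiplied by sqrt(t). All names are hypothetical.

```python
import math
import random

def random_rotation(d, rng):
    """Random orthonormal basis of R^d via Gram-Schmidt on Gaussian vectors."""
    basis = []
    while len(basis) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for b in basis:
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        length = math.sqrt(sum(vi * vi for vi in v))
        if length > 1e-9:  # skip numerically degenerate draws
            basis.append([vi / length for vi in v])
    return basis

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def correlated_projections(x, t, rng):
    """Rotate x, split into t equal blocks (t must divide len(x)), scale each by sqrt(t)."""
    d = len(x)
    rotated = [sum(r * xi for r, xi in zip(row, x)) for row in random_rotation(d, rng)]
    size = d // t
    scale = math.sqrt(t)
    return [[scale * v for v in rotated[i * size:(i + 1) * size]] for i in range(t)]

rng = random.Random(1)
x = [rng.uniform(-1, 1) for _ in range(12)]
lengths = [norm(p) for p in correlated_projections(x, 3, rng)]
print(min(lengths) <= norm(x) + 1e-9)  # True: some projection is no longer than x
```

Unlike independent JL projections, the t projections here are correlated through the shared rotation, which is what rules out all of them being too long at once.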

Correlated Euclidean LSH

• $d \cdot n^{\tilde{O}(1/c)}$ hash functions

• Collision guaranteed within distance r

• Collision probability 1/n at distance cr

(joint work with Matthew Skala)

In practice…

[Figure: diagram with the labels “Can be taught”, “Simple description”, “Applicable” and a question mark]

• What are the good use cases for similarity join?

• When is 100% recall of particular value?

Linking new theory to practice?

• Match performance of classical (or data dep.) LSH without false neg.? (SODA ’16)

• Indexing: When is sublinear query time possible with linear space? (PODS ’15)

• Making new LSH constructions truly practical? (NIPS ’15)

• Data dep. LSH that works in theory and in practice? (STOC ’15)

Thank you

To people with whom I have discussed material of this talk: Annalisa De Bonis, Francesco Silvestri, Ilya Razenshteyn, Johan von Tangen Sivertsen, Matthew Skala, Ninh Pham, Riko Jacob, Thomas Dybdahl Ahle, Tobias Christiani, Ugo Vaccaro, and more.

For economic support:

Thank you for your attention!