21
CS 591, Lecture 7 Data Analytics: Theory and Applications Boston University Babis Tsourakakis February 13th, 2017

- CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

CS 591, Lecture 7Data Analytics: Theory and Applications

Boston University

Babis Tsourakakis

February 13th, 2017

Page 2: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filter

• Approximate membership problem

• Highly space-efficient randomized data structure

• Its analysis shows an interesting tradeoff between spaceand error probability

• The Bloom filter principle[Broder and Mitzenmacher, 2004]:

Wherever a list or set is used, and space is at apremium,consider using a Bloomfilter if the effect of falsepositives can be mitigated.

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 2 / 21

Page 3: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filter – Applications

Historically, Bloom filter was developed in the context ofdictionary applications when space resources were scarce.

• Burton H. Bloom introduced Bloom filters (1970) for anapplication related to hyphenation programs.

• Bloom filters were also used in early UNIX spell-checker(space savings were crucial for functionality)

• Avoid weak passwords.

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 3 / 21

Page 4: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filter – Applications

Content Delivery in P2P networks [Byers et al., 2002]

• Peer A has items SA

• Peer B has items SB

• B wants to obtain items in SA − SB

• Communication efficient approach: B send a Bloom filterto A

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 4 / 21

Page 5: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filter – Applications

More important applications in

• Networks

• Distributed Caching

• Databases (e.g., Bloomjoin algorithm)

• . . .

For a great survey on Bloom filters, read[Broder and Mitzenmacher, 2004].

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 5 / 21

Page 6: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filter – Description

1 A vector of m bits

2 k independent hash functions h1, . . . , hk

3 A set S of n keys

4 To store key x , we set A[hi(x)] = 1 for all i ∈ [k]

5 Lookup(x): if A[hi(x)] = 1 for all i ∈ [k], then x ∈ S .

6 No false negatives, but false positives may exist.

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 6 / 21

Page 7: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom filter in Python – Pybloom library

>>> from pybloom import BloomFilter

>>> f = BloomFilter(capacity =1000, err =0.001)

>>> [f.add(x) for x in range (10)]

>>> all([(x in f) for x in range (10)])

True

>>> 10 in f

False

>>> 5 in f

True

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 7 / 21

Page 8: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filters – Example

Demo: Bloom Filters by Example

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 8 / 21

Page 9: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filters – Example

Demo: Bloom Filters by Example

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 9 / 21

Page 10: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filters – Example

Demo: Bloom Filters by Example

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 10 / 21

Page 11: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filters – Example

Demo: Bloom Filters by Example

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 11 / 21

Page 12: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filters – Example

Demo: Bloom Filters by Example

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 12 / 21

Page 13: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filters – Example

Demo: Bloom Filters by Example

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 13 / 21

Page 14: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filters – False positives

• Assumption: hi are close to being independent hashfunctions, probes are uniform

• Claim: Pr [A(i) = 1] = 1− (1− 1m

)kn

• Why?

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 14 / 21

Page 15: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filters – False positives

• How can we upper bound Pr [A(i)] ?

Pr [A(i) = 1] = 1− (1− 1

m)kn ≤ 1− ekn/m.

Where is the mistake?

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 15 / 21

Page 16: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filters – False positives

• The inequality goes the wrong way, since 1− x ≤ e−x .

• We will use instead 1− x ≥ e−x−x2

which holds forx ∈ [0, 0.683].

Pr [A(i) = 1] = 1− (1− 1

m)kn ≤ 1− e−

knm(1+ 1

m)

= 1− e−kα′,

where α = nm

is the load factor, and α′ = (1 + 1m

)α.

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 16 / 21

Page 17: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filters – False positives

• What is the probability of a false positive, i.e., x that wasnever inserted by satisfies A(hi(x)) = 1 for all i ∈ [k]?(∑A(i)

m

)k.

• Here, for simplicity we will assume that the false positiveprobability is

(1− e−kα′)k

• Question: Fix the load factor α, what is the value of kthat minimizes the error probability?

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 17 / 21

Page 18: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filters – False positives

• We will minimize the error probability by taking thederivative wrt x = e−kα

′.

• Solving for k = − 1α′logx

d

dxlog

((1− x)k

)=

d

dxk log(1− x) =

d

dx

(− 1

α′logx log(1− x)

)= − 1

α′

( log 1− x

x− log x

1− x

)= 0.

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 18 / 21

Page 19: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filters – False positives

• By simplifying the above we obtain that

x∗ log x∗ = (1− x∗) log (1− x∗)⇒ x∗ =1

2.

• In retrospect, this outcome is intuitive (why?)

• k = 1α′

log 2 = 1α(1+m−1)

log 2

• For a given false positive rate ε

2−log 2α′ ≤ ε⇒ α ≤ log2 2

(1 + 1m

) log(1ε).

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 19 / 21

Page 20: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

Bloom Filters – Do we need k hash functions?

Double hashing works!

• Instead of using k random hash functions, one can choosetwo sufficiently random hash functions h, h′ and then set

hi(x) = h(x) + ih′(x) mod m.

• This was proved [Kirsch and Mitzenmacher, 2006].

• Dillinger and Manolios had earlier suggested

hi(x) = h(x) + ih′(x) + i2 mod m,

as an effective heuristic.

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 20 / 21

Page 21: - CS 591, Lecture 7 Data Analytics: Theory and ...people.seas.harvard.edu/~babis/cs591_files/BloomFilter.pdfBloom Filter { Applications Historically, Bloom lter was developed in the

references I

Broder, A. and Mitzenmacher, M. (2004).

Network applications of bloom filters: A survey.

Internet mathematics, 1(4):485–509.

Byers, J., Considine, J., Mitzenmacher, M., and Rost, S. (2002).

Informed content delivery across adaptive overlay networks.

ACM SIGCOMM Computer Communication Review, 32(4):47–60.

Kirsch, A. and Mitzenmacher, M. (2006).

Less hashing, same performance: building a better bloom filter.

In European Symposium on Algorithms, pages 456–467. Springer.

Babis Tsourakakis CS 591 Data Analytics, Lecture 7 21 / 21