CS 591, Lecture 7Data Analytics: Theory and Applications
Boston University
Babis Tsourakakis
February 13th, 2017
Bloom Filter
• Approximate membership problem
• Highly space-efficient randomized data structure
• Its analysis shows an interesting tradeoff between space and error probability
• The Bloom filter principle [Broder and Mitzenmacher, 2004]:
Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.
Babis Tsourakakis CS 591 Data Analytics, Lecture 7 2 / 21
Bloom Filter – Applications
Historically, the Bloom filter was developed in the context of dictionary applications, when space resources were scarce.
• Burton H. Bloom introduced Bloom filters (1970) for an application related to hyphenation programs.
• Bloom filters were also used in an early UNIX spell-checker (the space savings were crucial for functionality).
• Avoiding weak passwords (check candidate passwords against a dictionary of known weak ones).
Bloom Filter – Applications
Content Delivery in P2P networks [Byers et al., 2002]
• Peer A has items SA
• Peer B has items SB
• B wants to obtain the items in SA − SB
• Communication-efficient approach: B sends a Bloom filter of SB to A
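The exchange can be sketched with a toy Bloom filter; `make_filter` and `maybe_contains` are hypothetical helpers built on salted SHA-256 digests, and the values of m and k are illustrative, not tuned:

```python
import hashlib

def make_filter(items, m=256, k=4):
    """Build a tiny Bloom filter (a bit list) for a set of items."""
    bits = [0] * m
    for x in items:
        for i in range(k):
            # the i-th hash: salt the key with i, reduce the digest mod m
            h = int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16) % m
            bits[h] = 1
    return bits

def maybe_contains(bits, x, k=4):
    """True if x may be in the set; False means x is definitely absent."""
    m = len(bits)
    return all(
        bits[int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16) % m]
        for i in range(k)
    )

# Protocol sketch: B sends make_filter(S_B) to A; A replies with the
# items of S_A the filter reports as absent -- an approximation of
# S_A - S_B, since a false positive would make A skip an item B needs.
S_A = {"a", "b", "c", "d"}
S_B = {"b", "d"}
f_B = make_filter(S_B)
to_send = {x for x in S_A if not maybe_contains(f_B, x)}
```

Because false positives can only cause omissions, B may need a follow-up round to fetch any items the filter wrongly reported as present, which is the kind of mitigation the Bloom filter principle refers to.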
Bloom Filter – Applications
Further important applications in:
• Networks
• Distributed caching
• Databases (e.g., the Bloomjoin algorithm)
• . . .
For a great survey on Bloom filters, read [Broder and Mitzenmacher, 2004].
Bloom Filter – Description
1. A bit vector A of m bits
2. k independent hash functions h1, . . . , hk
3. A set S of n keys
4. To store key x, we set A[hi(x)] = 1 for all i ∈ [k]
5. Lookup(x): if A[hi(x)] = 1 for all i ∈ [k], report x ∈ S
6. No false negatives, but false positives may exist
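Steps 1–6 can be sketched as a self-contained class. This is illustrative: the k hash functions are simulated by salting SHA-256, which only approximates the independent uniform hashing assumed in the analysis.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter following steps 1-6 above (a sketch)."""

    def __init__(self, m, k):
        self.m = m            # number of bits in the vector A
        self.k = k            # number of hash functions
        self.bits = [0] * m   # the bit vector A

    def _positions(self, key):
        # h_1(x), ..., h_k(x): k salted digests reduced mod m
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}|{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        # store: set A[h_i(x)] = 1 for all i in [k]
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        # lookup: may return True for a key never added (false positive),
        # but never False for a key that was added (no false negatives)
        return all(self.bits[pos] for pos in self._positions(key))
```

For example, after `f = BloomFilter(m=1024, k=5)` and `f.add("x")`, the test `"x" in f` is guaranteed to be True, while `"y" in f` is False except with small probability.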
Bloom filter in Python – Pybloom library
>>> from pybloom import BloomFilter
>>> f = BloomFilter(capacity=1000, error_rate=0.001)
>>> [f.add(x) for x in range(10)]
>>> all([(x in f) for x in range(10)])
True
>>> 10 in f
False
>>> 5 in f
True
Bloom Filters – Example
Demo: Bloom Filters by Example
Bloom Filters – False positives
• Assumption: the hi are close to independent hash functions, and each probe is uniform over the m bits
• Claim: Pr[A(i) = 1] = 1 − (1 − 1/m)^{kn}
• Why? Inserting the n keys makes kn (approximately independent) uniform probes, and bit i remains 0 only if every one of the kn probes misses it, which happens with probability (1 − 1/m)^{kn}.
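The claim is easy to check empirically under the uniform-probe assumption. A minimal Monte Carlo sketch (the parameter values are illustrative):

```python
import random

m, n, k = 1000, 100, 5      # bits, keys, hash functions
trials = 2000

ones = 0
for _ in range(trials):
    A = [0] * m
    for _ in range(n * k):  # kn independent uniform probes
        A[random.randrange(m)] = 1
    ones += A[0]            # inspect one fixed bit

empirical = ones / trials
predicted = 1 - (1 - 1 / m) ** (k * n)  # the claimed Pr[A(i) = 1]
```

With these parameters the predicted value is about 0.39, and the empirical frequency should agree to within sampling error.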
Bloom Filters – False positives
• How can we upper bound Pr[A(i) = 1]?

Pr[A(i) = 1] = 1 − (1 − 1/m)^{kn} ≤ 1 − e^{−kn/m}.

Where is the mistake?
Bloom Filters – False positives
• The inequality goes the wrong way, since 1 − x ≤ e^{−x}.
• We will use instead 1 − x ≥ e^{−x−x²}, which holds for x ∈ [0, 0.683].

Pr[A(i) = 1] = 1 − (1 − 1/m)^{kn} ≤ 1 − e^{−(kn/m)(1 + 1/m)} = 1 − e^{−kα′},

where α = n/m is the load factor, and α′ = (1 + 1/m)α.
Bloom Filters – False positives
• What is the probability of a false positive, i.e., that an x which was never inserted satisfies A(hi(x)) = 1 for all i ∈ [k]? It is (∑i A(i)/m)^k.
• Here, for simplicity, we will assume the false positive probability is (1 − e^{−kα′})^k.
• Question: fixing the load factor α, what value of k minimizes the error probability?
Bloom Filters – False positives
• We will minimize the error probability by taking the derivative with respect to x = e^{−kα′}.
• Solving for k gives k = −(1/α′) log x, so

d/dx log((1 − x)^k) = d/dx [ k log(1 − x) ] = d/dx [ −(1/α′) log x · log(1 − x) ]
= −(1/α′) ( log(1 − x)/x − log x/(1 − x) ) = 0.
Bloom Filters – False positives
• By simplifying the above we obtain

x* log x* = (1 − x*) log(1 − x*) ⇒ x* = 1/2.

• In retrospect this outcome is intuitive: at x* = 1/2 each bit of the array is equally likely to be 0 or 1, so the filter stores the maximum possible information per bit.
• Hence k = (1/α′) log 2 = log 2 / (α(1 + m⁻¹)).
• For a given false positive rate ε:

2^{−(log 2)/α′} ≤ ε ⇒ α ≤ (log 2)² / ((1 + 1/m) log(1/ε)).
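The bound translates directly into a parameter-sizing rule. Below, `bloom_params` is a hypothetical helper that applies α ≤ (log 2)²/log(1/ε), dropping the (1 + 1/m) factor, which is negligible for large m:

```python
import math

def bloom_params(n, eps):
    """Given n keys and a target false positive rate eps, pick the
    number of bits m and the optimal number of hash functions k,
    using alpha = n/m <= (log 2)^2 / log(1/eps) (the (1 + 1/m)
    factor from the exact bound is dropped)."""
    m = math.ceil(n * math.log(1 / eps) / math.log(2) ** 2)
    k = max(1, round((m / n) * math.log(2)))  # k = (1/alpha) log 2
    return m, k

m, k = bloom_params(n=1000, eps=0.01)
# roughly 9.6 bits per key and k = 7 hash functions
```

A useful rule of thumb this recovers: each factor-of-10 improvement in ε costs about 4.8 additional bits per stored key.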
Bloom Filters – Do we need k hash functions?
Double hashing works!
• Instead of using k random hash functions, one can choose two sufficiently random hash functions h, h′ and set

hi(x) = h(x) + i·h′(x) mod m.

• This was proved in [Kirsch and Mitzenmacher, 2006].
• Dillinger and Manolios had earlier suggested

hi(x) = h(x) + i·h′(x) + i² mod m

as an effective heuristic.
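A sketch of the Kirsch–Mitzenmacher scheme, assuming SHA-256 with two different salts stands in for the "sufficiently random" h and h′:

```python
import hashlib

def hashes(x, k, m):
    """Derive k probe positions from only two base hashes via
    h_i(x) = (h(x) + i * h'(x)) mod m."""
    h1 = int(hashlib.sha256(b"1" + x.encode()).hexdigest(), 16) % m  # h(x)
    h2 = int(hashlib.sha256(b"2" + x.encode()).hexdigest(), 16) % m  # h'(x)
    return [(h1 + i * h2) % m for i in range(k)]

positions = hashes("example-key", k=5, m=1024)
```

Only two digests are computed per key regardless of k, which is the practical payoff of the scheme.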
References

Broder, A. and Mitzenmacher, M. (2004). Network applications of Bloom filters: A survey. Internet Mathematics, 1(4):485–509.

Byers, J., Considine, J., Mitzenmacher, M., and Rost, S. (2002). Informed content delivery across adaptive overlay networks. ACM SIGCOMM Computer Communication Review, 32(4):47–60.

Kirsch, A. and Mitzenmacher, M. (2006). Less hashing, same performance: Building a better Bloom filter. In European Symposium on Algorithms, pages 456–467. Springer.