Bloom Filter [email protected] 2011-11-18


DESCRIPTION

A Bloom filter is a data structure that supports very fast membership queries while using very compact storage.


Page 1: Bloom filter


Bloom Filter

[email protected]

2011-11-18

Page 2: Bloom filter

Agenda

• A Membership Query Problem
• What is Bloom Filter
• BloomFilter Math Theory
• Compression
• Application Scenario

Page 3: Bloom filter

Membership Query Problem

Problem Description

Given an element E, query whether it belongs to a big set of elements S.

– As fast as possible

– As small as possible

Page 4: Bloom filter

Membership Query Problem

Some Solutions

hashtable

fast, but a big data structure

bitmap index

Can it be smaller?

Page 5: Bloom filter

Membership Query Problem

Tradeoff Solutions

To obtain speed and size improvements, allow some probability of error.

Bloom Filter

Page 6: Bloom filter

What is Bloom Filter

Support approximate set membership.

Given a set S = {x1,x2,…,xn}, construct a data structure to answer queries of the form “Is y in S?”

Data structure should be:
– Fast (faster than searching through S).
– Small (smaller than an explicit representation).

To obtain speed and size improvements, allow some probability of error.

– False positives: y ∉ S but we report y ∈ S
– False negatives: y ∈ S but we report y ∉ S

Page 7: Bloom filter

What is Bloom Filter

Start with an m-bit array B, filled with 0s.

B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1.

B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

To check if y is in S, check B at Hi(y). All k values must be 1.

B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

Possible to have a false positive: all k values are 1, but y is not in S.

B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

n items, m = cn bits, k hash functions
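The insert and query steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the class name `BloomFilter`, the use of salted SHA-256 to stand in for the k hash functions Hi, and the one-byte-per-bit array are all assumptions made for readability.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array B and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, all 0 at the start

    def _positions(self, item: str):
        # Assumption: H_1..H_k are simulated by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # If H_i(x) = a, set B[a] = 1.
        for a in self._positions(item):
            self.bits[a] = 1

    def __contains__(self, item: str) -> bool:
        # All k probed bits must be 1 for a "yes" answer.
        return all(self.bits[a] for a in self._positions(item))


bf = BloomFilter(m=16, k=3)
for x in ("x1", "x2", "x3"):
    bf.add(x)
print("x1" in bf)  # True: an inserted item is always reported as present
print("y" in bf)   # usually False; True here would be a false positive
```

Because bits are only ever set and never cleared, this construction cannot produce a false negative; the only error it can make is a false positive.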

Page 8: Bloom filter

What is Bloom Filter

False Positive

[Figure: false positive example showing A, hash1, hash2, hash3, B, and the bit column 0 0 1 0 1 0 0 0 0 1 0]

Page 9: Bloom filter

Bloom Filter Math Theory

Pr(specific bit of filter is 0) is

$$ p' = (1 - 1/m)^{kn} \approx e^{-kn/m} = p $$

If $\rho$ is the fraction of 0 bits in the filter, then the false positive probability is

$$ (1 - \rho)^k \approx (1 - p')^k \approx (1 - p)^k = (1 - e^{-k/c})^k $$

Approximations valid as $\rho$ is concentrated around $E[\rho]$.

– Martingale argument suffices.

Find optimal at $k = (\ln 2)\, m/n$ by calculus.

– So optimal fpp is about $(0.6185)^{m/n}$

n items, m = cn bits, k hash functions
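As a quick numerical check of the formulas on this slide, the sketch below evaluates the approximate false positive probability (1 - e^{-k/c})^k and the optimum k = (ln 2) m/n. The function names are illustrative, and c = 8 bits per item is just a sample value.

```python
import math


def false_positive_rate(c: float, k: float) -> float:
    """Approximate fpp (1 - e^{-k/c})^k for m = c*n bits and k hash functions."""
    return (1.0 - math.exp(-k / c)) ** k


def optimal_k(c: float) -> float:
    """Real-valued optimum k = (ln 2) * m/n, with c = m/n."""
    return math.log(2) * c


c = 8.0
k_star = optimal_k(c)
print(k_star)                          # about 5.545 hash functions
print(false_positive_rate(c, k_star))  # fpp at the real-valued optimum
print(0.6185 ** c)                     # the slide's closed form, for comparison
```

At the optimum each bit is 0 with probability 1/2, so the false positive probability is (1/2)^k = (1/2)^{(ln 2) m/n}, and (1/2)^{ln 2} is approximately 0.6185, which is where the constant comes from.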

Page 10: Bloom filter

Bloom Filter Math Theory

[Figure: false positive rate (y-axis, 0 to 0.1) versus number of hash functions (x-axis, 0 to 10) for m/n = 8. Opt k = 8 ln 2 = 5.545...]
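The curve in this plot can be tabulated directly from the approximation (1 - e^{-kn/m})^k. The following sketch assumes m/n = 8 as on the slide and restricts k to integers, since a real filter needs a whole number of hash functions.

```python
import math

c = 8.0  # bits per item, m/n

# Approximate false positive rate for each integer number of hash functions.
rates = {k: (1.0 - math.exp(-k / c)) ** k for k in range(1, 11)}

for k, f in rates.items():
    print(f"k={k:2d}  f={f:.4f}")

best_k = min(rates, key=rates.get)
print("best integer k:", best_k)
```

The minimum is quite flat around the real-valued optimum 8 ln 2, so rounding k down or up costs very little in false positive rate.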

Page 11: Bloom filter

Bloom Filter Compression

Use BF on Network Transmission

A BF sent as a message should be small enough to be transmitted over the network.

Compressing a bit vector is easy.

Arithmetic coding gets close to entropy.

Can Bloom filters be compressed?

Page 12: Bloom filter

Bloom Filter Compression

• Optimize to minimize false positives.

$$ p = \Pr[\text{cell is empty}] = (1 - 1/m)^{kn} \approx e^{-kn/m} $$

$$ f = \Pr[\text{false pos}] = (1 - p)^k \approx (1 - e^{-kn/m})^k $$

$$ k = (\ln 2)\, m/n $$

• At k = m (ln 2)/n, p = 1/2.

• Bloom filter looks like a random string.
– Can’t compress it.
– $H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$
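To see why the optimally tuned filter is incompressible, it helps to evaluate the entropy function at p = 1/2 and at some other fill level. This is a small sketch; the values n = 1000, m = 8000, and the alternative k = 2 are arbitrary sample parameters.

```python
import math


def binary_entropy(p: float) -> float:
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


n, m = 1000, 8000

# At k = (ln 2) m/n, each cell is empty with probability about 1/2.
k_opt = math.log(2) * m / n
p = math.exp(-k_opt * n / m)
print(p, binary_entropy(p))  # p is about 0.5, H(p) is about 1 bit per table bit

# With fewer hash functions the table is sparser and H(p) drops below 1.
p_sparse = math.exp(-2 * n / m)
print(p_sparse, binary_entropy(p_sparse))
```

An entropy of 1 bit per table bit means an ideal compressor cannot shrink the array at all, while any p away from 1/2 leaves room for compression.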

Page 13: Bloom filter

Bloom Filter Compression

With a larger uncompressed size (storage), we can achieve compression.

• Assumption: optimal compressor, z = mH(p).
– H(p) is the entropy function; optimally get H(p) compressed bits per original table bit.
– Arithmetic coding is close to optimal.

• Optimization: Given z bits for the compressed filter and n elements, choose table size m and number of hash functions k to minimize f.

$$ p \approx e^{-kn/m}; \quad f \approx (1 - e^{-kn/m})^k; \quad z = mH(p) $$
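The optimization stated here can be explored numerically. The sketch below assumes an ideal compressor (z = mH(p) exactly), normalizes to n = 1 so that z and m are in bits per item, and uses plain bisection to find the table size m whose compressed size equals z. All function names are illustrative.

```python
import math


def entropy(p: float) -> float:
    """Binary entropy H(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def compressed_size(m: float, k: int, n: float = 1.0) -> float:
    """z = m * H(p) with p = e^{-kn/m} (ideal compressor assumed)."""
    return m * entropy(math.exp(-k * n / m))


def table_size_for(z: float, k: int, n: float = 1.0) -> float:
    """Bisect for a table size m >= z whose compressed size equals z."""
    lo, hi = z, 2.0 * z  # H(p) <= 1, so the table needs at least z bits
    while compressed_size(hi, k, n) < z:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if compressed_size(mid, k, n) < z:
            lo = mid
        else:
            hi = mid
    return hi


def fpp(m: float, k: int, n: float = 1.0) -> float:
    """Approximate false positive rate f = (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


z = 8.0  # transmitted bits per item
for k in range(1, 7):
    m = table_size_for(z, k)
    print(f"k={k}  m/n={m:7.2f}  original f={fpp(z, k):.4f}  compressed f={fpp(m, k):.4f}")
```

For the same number of transmitted bits, the compressed filter uses a larger, sparser table, and in this model its false positive rate is never worse than the uncompressed filter with the same k.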

Page 14: Bloom filter

Bloom Filter Compression

[Figure: false positive rate (y-axis, 0 to 0.1) versus number of hash functions (x-axis, 0 to 10) for z/n = 8, with two curves: Original and Compressed]

Page 15: Bloom filter

Bloom Filter Compression

Conclusion

• At k = m (ln 2)/n, false positives are maximized with a compressed Bloom filter.
– Best case without compression is worst case with compression; compression always helps.
– Side benefit: use fewer hash functions with compression; possible speedup.

Page 16: Bloom filter

Application Scenario

Speed up answers in a key-value-like system

filter (memory) → storage (memory)

key1: filter says no
key2: filter says yes → disk access succeeds
key3: filter says yes → disk access fails (false positive)
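The lookup path in this diagram can be sketched as follows. Everything here is hypothetical: `FilteredStore` is an illustrative name, a plain dict stands in for the disk, and the filter parameters m = 1024, k = 5 are arbitrary.

```python
import hashlib


class BloomFilter:
    """Small in-memory Bloom filter (salted SHA-256 stands in for k hashes)."""

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str) -> None:
        for a in self._positions(key):
            self.bits[a] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[a] for a in self._positions(key))


class FilteredStore:
    """Hypothetical key-value store; a dict stands in for the disk."""

    def __init__(self) -> None:
        self.filter = BloomFilter(m=1024, k=5)
        self.disk = {}
        self.disk_reads = 0

    def put(self, key: str, value: str) -> None:
        self.filter.add(key)
        self.disk[key] = value

    def get(self, key: str):
        if not self.filter.might_contain(key):
            return None            # filter says no: skip the disk entirely
        self.disk_reads += 1       # filter says yes: pay for a disk access
        return self.disk.get(key)  # None here means a false positive


store = FilteredStore()
store.put("key2", "value2")
print(store.get("key2"))                    # found after one disk access
print(store.get("key1"), store.disk_reads)  # a miss is normally answered from memory
```

The benefit comes from the first branch in `get`: most lookups for absent keys never touch the slow storage, and only false positives pay for a wasted access.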

Page 17: Bloom filter

Application Scenario

Web Cache

[Figure: cache1, cache2, cache3, … and a Web Server]

Page 18: Bloom filter

Q&A
