Bloom Filter YHD Search Sharing 2013-04-23

Page 1: Bloom filter

Bloom Filter

YHD Search Sharing 2013-04-23

Page 2: Bloom filter

Outline

● Interview questions
● Bloom Filter
– Data structure
– Probability of false positives
– Set properties
● Application
– Cache sharing : Squid
– Speed up data access : HBase
– ID mapping : Zoie
● Materials

Page 3: Bloom filter

Interview questions

● Crawler
– Billions of web pages
– How to keep track of the URLs already crawled?
● Straggler detection
– You are manning the security desk of a large building
– Everyone checks in or checks out with their ID
– At the end of the day, identify the few stragglers left in the building

Page 4: Bloom filter

Data structure

● Data structure
– Init : a bit array of m bits, all set to 0
– Add an element
  ● Apply k hash functions to get k array positions
  ● Set the bits at all these positions to 1
● Query an element (test whether it is in the set)
– Apply k hash functions to get k array positions
– If any position is 0, the element is definitely not in the set
– If all are 1, the element is probably in the set (false positives are possible)
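The add and query steps above can be sketched in Java as follows. This is a minimal illustration, not code from the slides: the class name `BloomFilterSketch` is hypothetical, and the k positions are derived from two base hashes (double hashing) as an assumption, rather than from k truly independent hash functions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

/** Minimal Bloom filter sketch: an m-bit array and k positions per element. */
final class BloomFilterSketch {
    private final BitSet bits; // the m-bit array, all 0 after construction
    private final int m;       // number of bits
    private final int k;       // number of hash positions per element

    BloomFilterSketch(int m, int k) {
        if (m <= 0 || k <= 0) {
            throw new IllegalArgumentException("m and k must be positive");
        }
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    /** Assumption: the i-th position is (h1 + i * h2) mod m, built from two base hashes. */
    private int position(String element, int i) {
        byte[] data = element.getBytes(StandardCharsets.UTF_8);
        int h1 = Arrays.hashCode(data);
        int h2 = fnv1a(data) | 1; // force an odd step so it is never 0
        return Math.floorMod(h1 + i * h2, m);
    }

    /** 32-bit FNV-1a, used here only as a second base hash. */
    private static int fnv1a(byte[] data) {
        int hash = 0x811C9DC5;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return hash;
    }

    /** Add: set the bits at all k positions to 1. */
    void add(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(position(element, i));
        }
    }

    /** Query: false means definitely absent, true means probably present. */
    boolean mightContain(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(element, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch crawled = new BloomFilterSketch(1 << 16, 4);
        crawled.add("http://example.com/a");
        System.out.println(crawled.mightContain("http://example.com/a")); // true: no false negatives
        System.out.println(crawled.mightContain("http://example.com/b")); // usually false, but may be a false positive
    }
}
```

For the crawler question, a filter like this can answer "already crawled?" without storing the URLs themselves, at the cost of occasionally skipping a page that was never visited.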

Page 5: Bloom filter

Probability of false positives

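● 1 hash function

Bit array A: 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

$p(\mathrm{hash}[x_1] = i) = \frac{1}{m}$

$p(A[i] = 0 \mid \mathrm{hash}[x_1]) = 1 - \frac{1}{m}$

$p(A[i] = 0 \mid \mathrm{hash}[x_1, \dots, x_n]) = \left(1 - \frac{1}{m}\right)^n$

$p(A[i] = 1 \mid \mathrm{hash}[x_1, \dots, x_n]) = 1 - \left(1 - \frac{1}{m}\right)^n$

$p(A[i] = 1) = 1 - \left(1 - \frac{1}{m}\right)^n \simeq 1 - e^{-n/m}$, using $\lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{-x} = e$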

Page 6: Bloom filter

Probability of false positives

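● 1 hash function

Bit array A: 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

$p(A[H(y)] = 1 \mid y \notin S) = \frac{\text{number of 1s}}{m}$ (2/23 in the slide's example)

$p(A[i] = 1) = 1 - e^{-n/m}$

Given $E(\text{number of 1s}) = m \cdot (1 - e^{-n/m})$:

$p(A[H(y)] = 1 \mid y \notin S) = 1 - e^{-n/m}$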

Page 7: Bloom filter

Probability of false positives

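● K hash functions : repeat k times

Bit array A: 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

Recap for 1 hash function:

$p(A[H(y)] = 1 \mid y \notin S) = \frac{\text{number of 1s}}{m}$

$p(A[i] = 1) = 1 - e^{-n/m}$

Given $E(\text{number of 1s}) = m \cdot (1 - e^{-n/m})$:

$p(A[H(y)] = 1 \mid y \notin S) = 1 - e^{-n/m}$

With k hash functions (all k probed bits must be 1):

$p(A[H(y)] = 1 \mid y \notin S) = \left(\frac{\text{number of 1s}}{m}\right)^k$

$p(A[i] = 1) = 1 - \left(1 - \frac{1}{m}\right)^{kn} \simeq 1 - e^{-kn/m}$

$E(\text{number of 1s}) = m \cdot (1 - e^{-kn/m})$

$p(A[H(y)] = 1 \mid y \notin S) = (1 - e^{-kn/m})^k$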
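The closed form for k hash functions is easy to evaluate numerically. The sketch below is illustrative only (the class and method names are hypothetical): it computes both the exponential approximation $(1 - e^{-kn/m})^k$ and the version with $(1 - 1/m)^{kn}$ before the approximation, so the two can be compared for concrete m, n and k.

```java
/** Evaluates the false-positive formulas from the derivation above. */
final class FalsePositiveRateSketch {

    /** Approximation f = (1 - e^{-kn/m})^k, expressed in bits per element m/n. */
    static double approx(double bitsPerElement, int k) {
        return Math.pow(1.0 - Math.exp(-k / bitsPerElement), k);
    }

    /** Same quantity before the exponential approximation: (1 - (1 - 1/m)^{kn})^k. */
    static double beforeApproximation(long m, long n, int k) {
        double pBitStillZero = Math.pow(1.0 - 1.0 / m, (double) k * n);
        return Math.pow(1.0 - pBitStillZero, k);
    }

    public static void main(String[] args) {
        // m/n = 8 and k = 5, chosen here only as an example configuration.
        System.out.printf("approx              = %.4f%n", approx(8.0, 5));
        System.out.printf("beforeApproximation = %.4f%n", beforeApproximation(8000, 1000, 5));
    }
}
```

For realistic values of m the two results agree to several decimal places, which is why the exponential form is normally used.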

Page 8: Bloom filter

Probability of false positives

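● Minimal probability of false positives

$p(A[H(y)] = 1 \mid y \notin S) = (1 - e^{-kn/m})^k$

$f = (1 - e^{-kn/m})^k = e^{k \ln(1 - e^{-kn/m})}$

Minimizing f is equivalent to minimizing g:

$g = k \ln(1 - e^{-kn/m})$

Given $p = e^{-kn/m}$, we have $k = -\frac{m}{n} \ln(p)$, so

$g = -\frac{m}{n} \ln(p) \ln(1 - p)$

g is symmetric in p and 1 - p and is minimal at

$p = \frac{1}{2}$, i.e. $e^{-kn/m} = \frac{1}{2}$, i.e. $k = \ln 2 \cdot \frac{m}{n}$

$\min(f) = \left(\frac{1}{2}\right)^k$

p is the probability that any specific bit is still 0, so the optimum corresponds to a half-full Bloom filter array.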
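The optimum can be checked numerically. The following sketch (hypothetical class and method names, not from the slides) computes $k = \ln 2 \cdot m/n$ and the corresponding minimal rate $(1/2)^k$, and lets the general formula be evaluated at neighbouring values of k for comparison.

```java
/** Numerical check of the optimal number of hash functions. */
final class OptimalKSketch {

    /** Optimal k = ln 2 * (m/n); generally not an integer. */
    static double optimalK(double bitsPerElement) {
        return Math.log(2.0) * bitsPerElement;
    }

    /** Minimal false-positive rate (1/2)^k at the optimal k. */
    static double minimalRate(double bitsPerElement) {
        return Math.pow(0.5, optimalK(bitsPerElement));
    }

    /** General rate (1 - e^{-kn/m})^k, allowing a real-valued k. */
    static double rate(double bitsPerElement, double k) {
        return Math.pow(1.0 - Math.exp(-k / bitsPerElement), k);
    }

    public static void main(String[] args) {
        double bitsPerElement = 8.0; // example value of m/n
        double k = optimalK(bitsPerElement);
        System.out.printf("optimal k      = %.3f%n", k);
        System.out.printf("minimal rate   = %.5f%n", minimalRate(bitsPerElement));
        System.out.printf("rate at k = 5  = %.5f%n", rate(bitsPerElement, 5));
        System.out.printf("rate at k = 6  = %.5f%n", rate(bitsPerElement, 6));
    }
}
```

Since a real filter needs an integer number of hash functions, one would typically evaluate `rate` at the two integers around the optimal k and keep the smaller result.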

Page 9: Bloom filter

Probability of false positives

● Examples, with optimal $k = \ln 2 \cdot \frac{m}{n}$

m/n   optimal k   k=1     k=2      k=3      k=4      k=5
2     1.39        0.393   0.400    -        -        -
3     2.08        0.283   0.237    0.253    -        -
4     2.77        0.221   0.155    0.147    0.160    -
5     3.46        0.181   0.109    0.092    0.092    0.101
6     4.16        0.154   0.0804   0.0609   0.0561   0.0578
7     4.85        0.133   0.0618   0.0423   0.0359   0.0347
8     5.55        0.118   0.0489   0.0306   0.024    0.0217
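The same relations can be read in the other direction to size a filter for a target rate. The worked example below is a sketch under an assumed target of 1% (that target is not from the slides); it combines the minimal rate $(1/2)^k$ with $k = \ln 2 \cdot m/n$.

```latex
\begin{align*}
f_{\min} = \left(\tfrac{1}{2}\right)^{k},\quad k = \ln 2 \cdot \frac{m}{n}
  \;&\Longrightarrow\; \ln f_{\min} = -(\ln 2)^2 \cdot \frac{m}{n}
  \;\Longrightarrow\; \frac{m}{n} = -\frac{\ln f_{\min}}{(\ln 2)^2} \\
f_{\min} = 0.01:\quad
  \frac{m}{n} &= \frac{\ln 100}{(\ln 2)^2} \approx \frac{4.605}{0.4805} \approx 9.59,
  \qquad k \approx 0.693 \cdot 9.59 \approx 6.64
\end{align*}
```

Under these assumptions, roughly 10 bits per element with 6 or 7 hash functions would give a false-positive rate near 1%.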

Page 10: Bloom filter

Set properties

● Union (bitwise OR)
– The result is the same as the Bloom filter created from scratch using the union of the two sets (both filters must use the same m and the same hash functions).
● Intersection (bitwise AND)
– The false-positive probability in the resulting Bloom filter is at most the false-positive probability in one of the constituent Bloom filters, but may be larger than the false-positive probability in the Bloom filter created from scratch using the intersection of the two sets.
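Both properties can be observed with `java.util.BitSet`. The sketch below is illustrative: the class name, the filter size `M`, the count `K` and the integer hash mixing are all assumptions made for the example, and both filters deliberately share the same size and hash functions.

```java
import java.util.BitSet;

/** Union and intersection of two Bloom filters that share M and the hash functions. */
final class BloomSetOpsSketch {
    static final int M = 512; // bits per filter (assumed for the example)
    static final int K = 3;   // hash positions per element (assumed)

    /** Assumption: simple multiplicative mixing plus double hashing. */
    static int position(int element, int i) {
        int h1 = element * 0x9E3779B9;
        h1 ^= (h1 >>> 15);
        int h2 = (element * 0x85EBCA6B) | 1;
        return Math.floorMod(h1 + i * h2, M);
    }

    static BitSet build(int... elements) {
        BitSet bits = new BitSet(M);
        for (int e : elements) {
            for (int i = 0; i < K; i++) {
                bits.set(position(e, i));
            }
        }
        return bits;
    }

    static boolean mightContain(BitSet filter, int element) {
        for (int i = 0; i < K; i++) {
            if (!filter.get(position(element, i))) {
                return false;
            }
        }
        return true;
    }

    static BitSet union(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b); // bitwise OR
        return result;
    }

    static BitSet intersection(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b); // bitwise AND
        return result;
    }

    public static void main(String[] args) {
        BitSet a = build(1, 2, 3);
        BitSet b = build(3, 4, 5);
        // The OR-ed filter is bit-for-bit the filter of the union of the sets.
        System.out.println(union(a, b).equals(build(1, 2, 3, 4, 5))); // true
        // The AND-ed filter still reports every element of the true intersection.
        System.out.println(mightContain(intersection(a, b), 3));      // true
        // It may keep extra bits compared with a filter built from {3} alone.
        System.out.println(intersection(a, b).equals(build(3)));
    }
}
```

The last comparison is the reason the intersection can have a higher false-positive rate than a filter built from scratch: bits set by different elements in each input can survive the AND.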

Page 15: Bloom filter

Squid : Cache Digests

● False positive:
– Proxy A thinks Proxy B has URL U cached. A asks B for the cached U, B responds with "no", and A goes to the actual website.

Page 17: Bloom filter

HBase : HFile format

● (Not including Bloom Filter)

Page 18: Bloom filter

HBase : Query optimization

● Bloom Filter
– Stored as metadata of the HFile
– Used to determine if a given key is in that store file
● Characteristics
– The total KV count (N) is known up front, but the actual count can often be much lower
– HFile.insert (and hence BloomFilter.add) commands are issued in lexicographically increasing key order
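The query optimization can be sketched as "consult the per-file Bloom filter before reading the file". The code below is a simplified illustration and not the HBase API: `StoreFile`, `read`, `get` and the `reads` counter are hypothetical names, and an in-memory `TreeMap` stands in for the on-disk file.

```java
import java.util.BitSet;
import java.util.List;
import java.util.TreeMap;

final class StoreFileLookupSketch {

    /** Hypothetical stand-in for one store file: sorted KVs plus a Bloom filter as metadata. */
    static final class StoreFile {
        private static final int M = 1 << 12; // filter bits (assumed)
        private static final int K = 3;       // hash positions per key (assumed)
        private final TreeMap<String, String> data = new TreeMap<>();
        private final BitSet bloom = new BitSet(M);
        int reads = 0; // counts simulated file reads

        private static int position(String key, int i) {
            int h1 = key.hashCode();
            int h2 = (((h1 >>> 16) | (h1 << 16)) * 0x9E3779B9) | 1;
            return Math.floorMod(h1 + i * h2, M);
        }

        /** Insert a KV and add its key to the Bloom filter. */
        void insert(String key, String value) {
            data.put(key, value);
            for (int i = 0; i < K; i++) {
                bloom.set(position(key, i));
            }
        }

        boolean mightContain(String key) {
            for (int i = 0; i < K; i++) {
                if (!bloom.get(position(key, i))) {
                    return false;
                }
            }
            return true;
        }

        /** Simulated (expensive) read of the file contents. */
        String read(String key) {
            reads++;
            return data.get(key);
        }
    }

    /** Look a key up across store files, skipping files whose filter rules it out. */
    static String get(List<StoreFile> files, String key) {
        for (StoreFile file : files) {
            if (!file.mightContain(key)) {
                continue; // definitely not in this file: no read needed
            }
            String value = file.read(key); // may still miss on a false positive
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```

A false positive here only costs one unnecessary read; a key that is present is never skipped, because the filter has no false negatives.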


Page 19: Bloom filter

Application : Zoie

● Long[] uidArray
– Add element

    // Mix the 64-bit uid down to a 32-bit hash.
    int h = (int) ((uid >>> 32) ^ uid) * MIXER;
    long bits = _filter[h & _mask];        // 64-bit word selected by h & _mask
    bits |= (1L << (h >>> 26));            // first bit: top 6 bits of h
    bits |= (1L << ((h >> 20) & 0x3F));    // second bit: next 6 bits of h
    _filter[h & _mask] = bits;

– Query element

    final int h = (int) ((uid >>> 32) ^ uid) * MIXER;
    final int p = h & _mask;
    // check the filter
    final long bits = _filter[p];
    if ((bits & (1L << (h >>> 26))) == 0
        || (bits & (1L << ((h >> 20) & 0x3F))) == 0)
      return -1; // at least one bit is 0: the uid is definitely absent

Page 20: Bloom filter

Materials

● http://www.cs.jhu.edu/~fabian/courses/CS600.624/slides/bloomslides.pdf
● Google tech talk, bloom filtering
● http://www.slideshare.net/jaxlondon2012/intro-to-hbase-lars-george
● https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf