
Probabilistic Data Structures and Approximate Solutions


Will your decisions change if you learn that the audience of your website isn't 5M users, but rather 5'042'394'953? Unlikely, so why should we always calculate the exact answer at any cost? For this and many similar problems, an approximate solution takes only a fraction of the memory and runtime needed to compute the exact one.


Page 1: Probabilistic Data Structures and Approximate Solutions

Probabilistic Data Structures and Approximate Solutions

by Oleksandr Pryymak, PyData London 2014. IPython notebook with code >>

Page 2: Probabilistic Data Structures and Approximate Solutions

Probabilistic || Approximate: Why?

Often:
● an approximate answer is sufficient
● we need to trade accuracy for scalability or speed
● we need to analyse a stream of data

Catch:
● despite typically achieving good results, there is a chance of bad worst-case behaviour
● use on large datasets (law of large numbers)

Page 3: Probabilistic Data Structures and Approximate Solutions

Code: Approximation

import random
from numpy import average   # assumed: average() comes from numpy/pylab in the original notebook

x = [random.randint(0, 80000) for _ in xrange(10000)]
y = [i >> 8 for i in x]   # trim 8 bits off of the integers
z = x[:500]               # 5% sample (x is uniform)

avx = average(x)
avy = average(y) * 2**8   # add the 8 bits back
avz = average(z)

print avx
print avy, 'error %.06f%%' % (100*abs(avx-avy)/float(avx))
print avz, 'error %.06f%%' % (100*abs(avx-avz)/float(avx))

39547.8816
39420.7744 error 0.321401%
39591.424 error 0.110100%

C. Titus Brown “Awesome Big Data Algorithms”

Page 5: Probabilistic Data Structures and Approximate Solutions

Probabilistic Data Structures

Generally they:
● use less space than the full dataset
● require higher CPU load
● are stream-friendly
● can be parallelized
● have a controlled error rate

Page 6: Probabilistic Data Structures and Approximate Solutions

Hash functions
A one-way function: maps a key of arbitrary length to a message of fixed length:

message = hash(key)

However, collisions are possible:

hash(key1) = hash(key2)

Page 7: Probabilistic Data Structures and Approximate Solutions

Code: Hashing
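The notebook code for this slide isn't in the transcript; a minimal sketch of the idea using Python's built-in hashlib (the truncation width and example keys are illustrative):

import hashlib

def h(key, bits=32):
    # One-way: arbitrary-length key -> fixed-length message (here the low 32 bits of MD5).
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) & ((1 << bits) - 1)

print(h('cataract'))
print(h('periti'))
# Different keys can hash to the same value: with only 32 bits of output,
# collisions become likely after roughly 2**16 random keys (birthday bound).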

Page 8: Probabilistic Data Structures and Approximate Solutions

Hash collisions and performance
● Cryptographic hashes (like bcrypt) are not ideal for our use
● We need a fast algorithm with the lowest number of collisions:

Hash            Lowercase              Random UUID           Numbers
=============   ====================   ===================   =====================
Murmur          145 ns, 6 collis       259 ns, 5 collis      92 ns, 0 collis
FNV-1           184 ns, 1 collis       730 ns, 5 collis      92 ns, 0 collis
DJB2            156 ns, 7 collis       437 ns, 6 collis      93 ns, 0 collis
SDBM            148 ns, 4 collis       484 ns, 6 collis      90 ns, 0 collis
SuperFastHash   164 ns, 85 collis      344 ns, 4 collis      118 ns, 18742 collis
CRC32           250 ns, 2 collis       946 ns, 0 collis      130 ns, 0 collis
LoseLose        338 ns, 215178 collis  -                     -

by Ian Boyd: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

Murmur2 collisions

● cataract collides with periti
● roquette collides with skivie
● shawl collides with stormbound
● dowlases collides with tramontane
● cricketings collides with twanger
● longans collides with whigs

Page 9: Probabilistic Data Structures and Approximate Solutions

Hash randomness visualised as a hashmap
● Great: murmur2 on a sequence of numbers
● Not so great: DJB2 on a sequence of numbers

Page 10: Probabilistic Data Structures and Approximate Solutions

Comparison: Locality Sensitive Hashing (LSH)

Page 12: Probabilistic Data Structures and Approximate Solutions

Membership test: Bloom filter
A Bloom filter is probabilistic, but it only yields false positives.

Hash each item k times to get indices into a bit field of size m (1..m).
At least one 0 means w definitely isn't in the set.
All 1s mean w probably is in the set.

Page 13: Probabilistic Data Structures and Approximate Solutions

Use Bloom filter to serve requests

Page 14: Probabilistic Data Structures and Approximate Solutions

Code: bloom filter
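The original notebook code isn't reproduced in the transcript; a minimal pure-Python sketch of a Bloom filter, deriving k indices from salted MD5 hashes (the class name and the choices of m and k are illustrative):

import hashlib

class BloomFilter(object):
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k      # m bits in the field, k hash functions
        self.bits = [0] * m

    def _indices(self, item):
        # Derive k indices by hashing the item with k different salts.
        for i in range(self.k):
            digest = hashlib.md5(('%d:%s' % (i, item)).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        # Any 0 -> definitely not in the set; all 1s -> probably in the set.
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.add('roquette')
print('roquette' in bf)   # True
print('skivie' in bf)     # False, unless we hit a false positive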

Page 15: Probabilistic Data Structures and Approximate Solutions

Use Bloom filter to store graphs
Graphs only gain nodes because of Bloom filter false positives.

Pell et al., PNAS 2012

Page 16: Probabilistic Data Structures and Approximate Solutions

Counting Distinct Elements
In: an infinite stream of data
Question: how many distinct elements are there?

which is similar to:

In: coin flips
Question: how many times has the coin been flipped?

Page 17: Probabilistic Data Structures and Approximate Solutions

Coin flips: intuition
● Long runs of HEADs in a random series are rare.
● The longer you look, the more likely you are to see a long one.
● Long runs are very rare, and their length is correlated with how many coins you've flipped.

Page 18: Probabilistic Data Structures and Approximate Solutions

Code: Cardinality estimation

Page 19: Probabilistic Data Structures and Approximate Solutions

Cardinality estimation
Basic algorithm:

● n = 0
● For each input item:
  ○ Hash the item into a bit string
  ○ Count trailing zeroes in the bit string
  ○ If this count > n:
    ■ Let n = count
● Estimated cardinality ("count distinct") = 2^n
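A minimal sketch of this basic estimator, assuming a 64-bit integer hash derived from MD5 (the helper names are illustrative):

import hashlib

def hash64(item):
    # 64-bit integer hash derived from MD5; any well-mixed hash would do.
    return int(hashlib.md5(str(item).encode('utf-8')).hexdigest()[:16], 16)

def trailing_zeroes(x):
    if x == 0:
        return 64
    n = 0
    while x & 1 == 0:
        x >>= 1
        n += 1
    return n

def estimate_cardinality(stream):
    n = 0
    for item in stream:
        n = max(n, trailing_zeroes(hash64(item)))
    return 2 ** n

# A single estimator has high variance; HyperLogLog (next slides) averages many of them.
print(estimate_cardinality(range(10000)))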

Page 20: Probabilistic Data Structures and Approximate Solutions

Cardinality estimation: HyperLogLog

Demo by: http://www.aggregateknowledge.com/science/blog/hll.html

Billions of distinct values in 1.5KB of RAM with 2% relative error

HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm

P. Flajolet, É. Fusy, O. Gandouet, F. Meunier; 2007

Page 21: Probabilistic Data Structures and Approximate Solutions

Code: HyperLogLog
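The notebook code isn't in the transcript; a minimal sketch using the third-party hyperloglog package (pip install hyperloglog) as one possible implementation:

import hyperloglog

hll = hyperloglog.HyperLogLog(0.02)   # accept ~2% relative error
for i in range(100000):
    hll.add(str(i))

print(len(hll))   # approximate number of distinct values, within ~2%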

Page 22: Probabilistic Data Structures and Approximate Solutions

Count-min sketch

count(value) = min{ w_1[h_1(value)], ..., w_d[h_d(value)] }

Frequency (histogram) estimation, with a chance of over-counting.
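A minimal pure-Python sketch of a count-min sketch with d rows of w counters, again using salted MD5 hashes (the parameters are illustrative):

import hashlib

class CountMinSketch(object):
    def __init__(self, w=1000, d=4):
        self.w, self.d = w, d
        self.rows = [[0] * w for _ in range(d)]   # d rows of w counters

    def _index(self, row, value):
        digest = hashlib.md5(('%d:%s' % (row, value)).encode('utf-8')).hexdigest()
        return int(digest, 16) % self.w

    def add(self, value, count=1):
        for r in range(self.d):
            self.rows[r][self._index(r, value)] += count

    def count(self, value):
        # Take the minimum over rows: collisions can only inflate a counter, never deflate it.
        return min(self.rows[r][self._index(r, value)] for r in range(self.d))

cms = CountMinSketch()
for word in ['a', 'b', 'a', 'c', 'a']:
    cms.add(word)
print(cms.count('a'))   # 3, or slightly more if collisions occurred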

Page 23: Probabilistic Data Structures and Approximate Solutions

Code: Frequent Itemsets

Page 24: Probabilistic Data Structures and Approximate Solutions

Machine Learning: Feature hashing
High-dimensional machine learning without a feature dictionary.

by Andrew Clegg “Approximate methods for scalable data mining”
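A minimal sketch of the hashing trick using scikit-learn's FeatureHasher (one common implementation, not necessarily the one shown in the talk; the feature names below are made up):

from sklearn.feature_extraction import FeatureHasher

# Hash string-keyed features straight into a fixed-width vector: no dictionary needed.
hasher = FeatureHasher(n_features=16, input_type='dict')
X = hasher.transform([{'user=alice': 1, 'clicks': 3},
                      {'user=bob': 1, 'clicks': 1}])
print(X.toarray())   # 2 x 16 matrix; collisions possible but rare if n_features is large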

Page 25: Probabilistic Data Structures and Approximate Solutions

Locality-sensitive hashing
To approximate nearest neighbours.

by Andrew Clegg “Approximate methods for scalable data mining”
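A minimal sketch of one LSH family, random-hyperplane (cosine) hashing with numpy; the vector dimension and number of hyperplanes are arbitrary choices:

import numpy as np

def lsh_signature(vector, planes):
    # One bit per random hyperplane: which side of the plane the vector falls on.
    return tuple(planes.dot(vector) > 0)

rng = np.random.RandomState(0)
planes = rng.randn(16, 100)       # 16 hyperplanes for 100-dimensional vectors

a = rng.randn(100)
b = a + 0.01 * rng.randn(100)     # a near neighbour of a
c = rng.randn(100)                # an unrelated vector

# Similar vectors tend to share a signature (bucket); dissimilar ones almost never do.
print(lsh_signature(a, planes) == lsh_signature(b, planes))   # very likely True
print(lsh_signature(a, planes) == lsh_signature(c, planes))   # almost certainly False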

Page 26: Probabilistic Data Structures and Approximate Solutions

Probabilistic Databases
● PrDB (University of Maryland)
● Orion (Purdue University)
● MayBMS (Cornell University)
● BlinkDB v0.1 alpha (UC Berkeley and MIT)

Page 27: Probabilistic Data Structures and Approximate Solutions

BlinkDB: queries
Queries with Bounded Errors and Bounded Response Times on Very Large Data

Page 28: Probabilistic Data Structures and Approximate Solutions

BlinkDB: architecture

Page 29: Probabilistic Data Structures and Approximate Solutions

References

Mining of Massive Datasets
by Jure Leskovec, Anand Rajaraman, and Jeff Ullman
http://infolab.stanford.edu/~ullman/mmds.html