Will your decisions change if you know that the audience of your website isn't 5M users, but rather 5'042'394'953? Unlikely. So why should we always compute the exact answer at any cost? For this and many similar problems, an approximate solution takes only a fraction of the memory and runtime needed to compute the exact one.
Probabilistic Data Structures and Approximate Solutions
by Oleksandr Pryymak
PyData London 2014
IPython notebook with code >>
Probabilistic || Approximate: Why?
Often:
● an approximate answer is sufficient
● we need to trade accuracy for scalability or speed
● we need to analyse a stream of data

Catch:
● despite typically achieving good results, there is a chance of bad worst-case behaviour
● use on large datasets (law of large numbers)
Code: Approximation

import random
from numpy import average  # the slides run under pylab, where average is already in scope

x = [random.randint(0, 80000) for _ in xrange(10000)]
y = [i >> 8 for i in x]  # trim 8 bits off each integer
z = x[:500]              # 5% sample (x is uniform)

avx = average(x)
avy = average(y) * 2**8  # add the 8 bits back
avz = average(z)

print avx
print avy, 'error %.06f%%' % (100*abs(avx-avy)/float(avx))
print avz, 'error %.06f%%' % (100*abs(avx-avz)/float(avx))

Output:
39547.8816
39420.7744 error 0.321401%
39591.424 error 0.110100%
C. Titus Brown “Awesome Big Data Algorithms”
Code: Sampling Data
Interview question: Get K samples from an infinite stream
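The standard answer is reservoir sampling (Vitter's Algorithm R). A minimal sketch, not taken from the talk's notebook:

```python
import random

def reservoir_sample(stream, k):
    """Keep k uniformly-chosen samples from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = random.randint(0, i) # j < k with probability k/(i+1)
            if j < k:
                reservoir[j] = item  # replace a random existing sample
    return reservoir

print(reservoir_sample(range(100000), 5))
```

Each item ends up in the reservoir with probability exactly k/n, using O(k) memory regardless of the stream length.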
Probabilistic Data Structures
Generally they:
● use less space than the full dataset
● require higher CPU load
● are stream-friendly
● can be parallelized
● have a controlled error rate
Hash functions
One-way function: maps a key of arbitrary length to a message of fixed length
message = hash(key)
However, collisions are possible:
hash(key1) = hash(key2)
Code: Hashing
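As a stand-in for the notebook's hashing demo (the benchmark below covers Murmur, FNV-1, DJB2 and others, which need third-party bindings such as mmh3), Python's standard hashlib illustrates mapping a key into a fixed range of buckets:

```python
import hashlib

def bucket(key, m):
    """Map a string key to one of m buckets.
    MD5 is an illustrative choice here, not the talk's benchmark winner."""
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % m

print(bucket('cataract', 1024))
print(bucket('periti', 1024))  # different keys may land in the same bucket
```

The same key always maps to the same bucket, and distinct keys can collide, which is exactly the behaviour the structures below must tolerate.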
Hash collisions and performance
● Cryptographic hashes (like bcrypt) are not ideal for our use
● We need a fast algorithm with the lowest number of collisions:
Hash           Lowercase      Random UUID   Numbers
=============  =============  ============  ==============
Murmur         145 ns         259 ns        92 ns
               6 collis       5 collis      0 collis
FNV-1          184 ns         730 ns        92 ns
               1 collis       5 collis      0 collis
DJB2           156 ns         437 ns        93 ns
               7 collis       6 collis      0 collis
SDBM           148 ns         484 ns        90 ns
               4 collis       6 collis      0 collis
SuperFastHash  164 ns         344 ns        118 ns
               85 collis      4 collis      18742 collis
CRC32          250 ns         946 ns        130 ns
               2 collis       0 collis      0 collis
LoseLose       338 ns         -             -
               215178 collis
by Ian Boyd: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
Murmur2 collisions
● cataract collides with periti● roquette collides with skivie● shawl collides with stormbound● dowlases collides with tramontane● cricketings collides with twanger● longans collides with whigs
Hash randomness visualised as a hashmap:
● Great: Murmur2 on a sequence of numbers
● Not so great: DJB2 on a sequence of numbers
Comparison: Locality Sensitive Hashing (LSH)
Image hashes
Kernelized locality-sensitive hashing for scalable image search
B. Kulis, K. Grauman - Computer Vision, 2009 IEEE 12th …, 2009
Abstract: Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space where items can be ...
Membership test: Bloom filter
A Bloom filter is probabilistic, but it only yields false positives, never false negatives.

Hash each item k times to get indices into a bit field of size m (1..m).
At least one 0 means w definitely isn't in the set.
All 1s mean w probably is in the set.
Use Bloom filter to serve requests
Code: bloom filter
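A minimal Bloom filter sketch, not the notebook's implementation; it derives its k indices by salting a single SHA-1 hash, an illustrative choice:

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k       # m bits, k hash functions
        self.bits = [False] * m

    def _indices(self, item):
        # derive k indices from one digest by salting with the hash number
        for i in range(self.k):
            h = hashlib.sha1(('%d:%s' % (i, item)).encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = True

    def __contains__(self, item):
        # all 1s: probably in the set; any 0: definitely not
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter(m=1000, k=3)
bf.add('shawl')
print('shawl' in bf)       # True
print('stormbound' in bf)  # almost certainly False (but could false-positive)
```

Membership uses O(m) bits no matter how many items are inserted; the false-positive rate grows as the bit field fills up.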
Use Bloom filter to store graphs
The graph can only gain spurious nodes, through Bloom filter false positives; it never loses real ones.
Pell et al., PNAS 2012
Counting Distinct Elements
In: an infinite stream of data
Question: how many distinct elements are there?

This is similar to:

In: coin flips
Question: how many times has the coin been flipped?
Coin flips: intuition
● Long runs of HEADs in a random series are rare.
● The longer you look, the more likely you are to see a long one.
● The length of the longest run is therefore correlated with how many coins you've flipped.
Code: Cardinality estimation
Cardinality estimation
Basic algorithm:
● n = 0
● For each input item:
  ○ Hash the item into a bit string
  ○ Count trailing zeroes in the bit string
  ○ If this count > n: let n = count
● Estimated cardinality (“count distinct”) = 2^n
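The steps above can be sketched as follows (MD5 stands in for a fast non-cryptographic hash; a single estimator like this is very coarse, which is what HyperLogLog later improves on):

```python
import hashlib

def trailing_zeros(n):
    """Count trailing zero bits of a positive integer."""
    t = 0
    while n and n & 1 == 0:
        n >>= 1
        t += 1
    return t

def estimate_cardinality(stream):
    """Flajolet-Martin style estimate: 2**(max trailing-zero run seen)."""
    n = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        n = max(n, trailing_zeros(h))
    return 2 ** n

print(estimate_cardinality(range(10000)))  # a rough power-of-two estimate
```

The answer is always a power of two, so the relative error can be large; averaging many such estimators is the next step.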
Cardinality estimation: HyperLogLog
Demo by: http://www.aggregateknowledge.com/science/blog/hll.html
Billions of distinct values in 1.5KB of RAM with 2% relative error
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
P.Flajolet, É.Fusy, O.Gandouet, F.Meunier; 2007
Code: HyperLogLog
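A toy HyperLogLog sketch, assuming an MD5 hash and omitting the paper's small- and large-range corrections; it splits the hash across m registers and combines them with a harmonic mean:

```python
import hashlib

def hyperloglog(stream, b=10):
    """Toy HyperLogLog: m = 2**b registers; not production quality."""
    m = 2 ** b
    registers = [0] * m
    for item in stream:
        x = int(hashlib.md5(str(item).encode()).hexdigest(), 16)  # 128 bits
        j = x & (m - 1)                       # low b bits choose a register
        w = x >> b                            # remaining 128-b bits
        rho = (128 - b) - w.bit_length() + 1  # position of leftmost 1-bit
        registers[j] = max(registers[j], rho)
    alpha = 0.7213 / (1 + 1.079 / m)          # bias correction for large m
    return int(alpha * m * m / sum(2.0 ** -r for r in registers))

print(hyperloglog(range(100000)))  # close to 100000
```

With 2**10 registers of a few bits each, the standard error is about 1.04/sqrt(m), roughly 3% here, matching the "billions of values in ~1.5KB" claim above when m is larger.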
Count-min sketch
count(value) = min{w1[h1(value)], ... wd[hd(value)]}
Frequency histogram estimation with chance of over-counting
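A minimal count-min sketch matching the formula above (d rows of w counters; the row hashes are derived by salting MD5, an illustrative choice, not the notebook's code):

```python
import hashlib

class CountMinSketch:
    def __init__(self, w, d):
        self.w, self.d = w, d                   # d rows of w counters
        self.table = [[0] * w for _ in range(d)]

    def _hash(self, row, value):
        h = hashlib.md5(('%d:%s' % (row, value)).encode()).hexdigest()
        return int(h, 16) % self.w

    def add(self, value):
        for row in range(self.d):
            self.table[row][self._hash(row, value)] += 1

    def count(self, value):
        # min over rows: collisions can over-count, never under-count
        return min(self.table[row][self._hash(row, value)]
                   for row in range(self.d))

cms = CountMinSketch(w=1000, d=4)
for word in ['a', 'b', 'a', 'c', 'a']:
    cms.add(word)
print(cms.count('a'))  # 3, or more if every row had a collision
```

Space is fixed at w*d counters regardless of how many distinct values stream through, at the price of possible over-counting.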
Code: Frequent Itemsets
Machine Learning: Feature hashingHigh-dimensional machine learning without feature dictionary
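A sketch of the hashing trick: tokens are hashed straight into a fixed-size vector, so no vocabulary needs to be stored. The signed variant and the tiny dimension are illustrative choices, not from the slides:

```python
import hashlib

def hash_features(tokens, dim=16):
    """Map tokens into a fixed-size vector without a feature dictionary."""
    vec = [0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % dim
        sign = 1 if (h >> 64) % 2 == 0 else -1  # signed trick: collisions
        vec[idx] += sign                        # tend to cancel in expectation
    return vec

print(hash_features('the quick brown fox the'.split()))
```

Unseen tokens at prediction time need no dictionary update; they simply hash into the same fixed-size space.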
by Andrew Clegg “Approximate methods for scalable data mining”
Locality-sensitive hashing
To approximate nearest neighbours
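One concrete LSH family is random-hyperplane signatures for cosine similarity; the slide doesn't specify a scheme, so this is an illustrative sketch:

```python
import random

def lsh_signature(vec, planes):
    """One bit per hyperplane: which side of it the vector falls on.
    Vectors at a small angle tend to get similar signatures."""
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                 for plane in planes)

random.seed(0)
dim, n_bits = 5, 8
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [1.1, 2.1, 2.9, 4.2, 5.0]    # nearly the same direction as a
c = [-5.0, 4.0, -3.0, 2.0, -1.0]

sig = lambda v: lsh_signature(v, planes)
print(sig(a), sig(b), sig(c))
```

Candidate neighbours are found by comparing short signatures (or bucketing on them) instead of computing exact distances to every point.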
by Andrew Clegg “Approximate methods for scalable data mining”
Probabilistic Databases● PrDB (University of Maryland)
● Orion (Purdue University)
● MayBMS (Cornell University)
● BlinkDB v0.1 alpha (UC Berkeley and MIT)
BlinkDB: queries
Queries with Bounded Errors and Bounded Response Times on Very Large Data
BlinkDB: architecture
References
Mining of Massive Datasets
by Jure Leskovec, Anand Rajaraman, and Jeff Ullman
http://infolab.stanford.edu/~ullman/mmds.html
Summary
● know the data structures● know what you sacrifice● control errors
http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ by Ilya Katsov