Will your decisions change if you know that the audience of your website isn't 5M users, but rather 5'042'394'953? Unlikely. So why should we always compute the exact answer at any cost? For this and many similar problems, an approximate solution takes only a fraction of the memory and runtime needed to compute the exact one.
Probabilistic Data Structures and Approximate Solutions
by Oleksandr Pryymak
PyData London 2014
IPython notebook with code >>
Probabilistic || Approximate: Why?
Often:
● an approximate answer is sufficient
● we need to trade accuracy for scalability or speed
● we need to analyse a stream of data

Catch:
● despite typically achieving good results, there is a chance of bad worst-case behaviour
● use on large datasets (law of large numbers)
Code: Approximation

import random
from numpy import average  # the slides run under pylab, where average is already in scope

x = [random.randint(0, 80000) for _ in xrange(10000)]
y = [i >> 8 for i in x]  # trim 8 bits off each integer
z = x[:500]              # 5% sample (x is uniform)

avx = average(x)
avy = average(y) * 2**8  # add the 8 bits back
avz = average(z)

print avx
print avy, 'error %.06f%%' % (100*abs(avx-avy)/float(avx))
print avz, 'error %.06f%%' % (100*abs(avx-avz)/float(avx))

Output:
39547.8816
39420.7744 error 0.321401%
39591.424 error 0.110100%
C. Titus Brown “Awesome Big Data Algorithms”
Code: Sampling Data
Interview question: Get K samples from an infinite stream
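The standard answer is reservoir sampling (Vitter's Algorithm R). A minimal sketch, not taken from the talk's notebook:

```python
import random

def reservoir_sample(stream, k):
    """Keep k uniformly-chosen samples from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = random.randint(0, i) # j < k with probability k/(i+1)
            if j < k:
                reservoir[j] = item  # replace a random existing sample
    return reservoir

print(reservoir_sample(range(100000), 5))
```

Each item ends up in the reservoir with probability exactly k/n, using O(k) memory regardless of the stream length.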
Probabilistic Data Structures
Generally they:
● use less space than the full dataset
● require higher CPU load
● are stream-friendly
● can be parallelized
● have a controlled error rate
Hash functions
One-way function: maps a key of arbitrary length to a message of fixed length
message = hash(key)
However, collisions are possible:
hash(key1) = hash(key2)
Code: Hashing
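As a stand-in for the notebook's hashing demo (the benchmark below covers Murmur, FNV-1, DJB2 and others, which need third-party bindings such as mmh3), Python's standard hashlib illustrates mapping a key into a fixed range of buckets:

```python
import hashlib

def bucket(key, m):
    """Map a string key to one of m buckets.
    MD5 is an illustrative choice here, not the talk's benchmark winner."""
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % m

print(bucket('cataract', 1024))
print(bucket('periti', 1024))  # different keys may land in the same bucket
```

The same key always maps to the same bucket, and distinct keys can collide, which is exactly the behaviour the structures below must tolerate.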
Hash collisions and performance
● Cryptographic hashes (like bcrypt) are not ideal for our use
● We need a fast algorithm with the lowest number of collisions:
Hash           Lowercase      Random UUID   Numbers
=============  =============  ============  ==============
Murmur         145 ns         259 ns        92 ns
               6 collis       5 collis      0 collis
FNV-1          184 ns         730 ns        92 ns
               1 collis       5 collis      0 collis
DJB2           156 ns         437 ns        93 ns
               7 collis       6 collis      0 collis
SDBM           148 ns         484 ns        90 ns
               4 collis       6 collis      0 collis
SuperFastHash  164 ns         344 ns        118 ns
               85 collis      4 collis      18742 collis
CRC32          250 ns         946 ns        130 ns
               2 collis       0 collis      0 collis
LoseLose       338 ns         -             -
               215178 collis
by Ian Boyd: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
Murmur2 collisions
● cataract collides with periti● roquette collides with skivie● shawl collides with stormbound● dowlases collides with tramontane● cricketings collides with twanger● longans collides with whigs
Hash randomness visualised as a hashmap:
● Great: Murmur2 on a sequence of numbers
● Not so great: DJB2 on a sequence of numbers
Comparison: Locality Sensitive Hashing (LSH)
Image hashes
Kernelized locality-sensitive hashing for scalable image search
B. Kulis, K. Grauman - Computer Vision, 2009 IEEE 12th …, 2009
Abstract: Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space where items can be ...
Membership test: Bloom filter
A Bloom filter is probabilistic, but it only yields false positives, never false negatives.

Hash each item k times to get indices into a bit field of size m (1..m).
At least one 0 means w definitely isn't in the set.
All 1s mean w probably is in the set.
Use Bloom filter to serve requests
Code: bloom filter
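A minimal Bloom filter sketch, not the notebook's implementation; it derives its k indices by salting a single SHA-1 hash, an illustrative choice:

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k       # m bits, k hash functions
        self.bits = [False] * m

    def _indices(self, item):
        # derive k indices from one digest by salting with the hash number
        for i in range(self.k):
            h = hashlib.sha1(('%d:%s' % (i, item)).encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = True

    def __contains__(self, item):
        # all 1s: probably in the set; any 0: definitely not
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter(m=1000, k=3)
bf.add('shawl')
print('shawl' in bf)       # True
print('stormbound' in bf)  # almost certainly False (but could false-positive)
```

Membership uses O(m) bits no matter how many items are inserted; the false-positive rate grows as the bit field fills up.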
Use Bloom filter to store graphs
The graph can only gain spurious nodes, through Bloom filter false positives; it never loses real ones.
Pell et al., PNAS 2012
Counting Distinct Elements
In: an infinite stream of data
Question: how many distinct elements are there?

This is similar to:

In: coin flips
Question: how many times has the coin been flipped?
Coin flips: intuition
● Long runs of HEADs in a random series are rare.
● The longer you look, the more likely you are to see a long one.
● The length of the longest run is therefore correlated with how many coins you've flipped.
Code: Cardinality estimation
Cardinality estimation
Basic algorithm:
● n = 0
● For each input item:
  ○ Hash the item into a bit string
  ○ Count trailing zeroes in the bit string
  ○ If this count > n: let n = count
● Estimated cardinality (“count distinct”) = 2^n
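The steps above can be sketched as follows (MD5 stands in for a fast non-cryptographic hash; a single estimator like this is very coarse, which is what HyperLogLog later improves on):

```python
import hashlib

def trailing_zeros(n):
    """Count trailing zero bits of a positive integer."""
    t = 0
    while n and n & 1 == 0:
        n >>= 1
        t += 1
    return t

def estimate_cardinality(stream):
    """Flajolet-Martin style estimate: 2**(max trailing-zero run seen)."""
    n = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        n = max(n, trailing_zeros(h))
    return 2 ** n

print(estimate_cardinality(range(10000)))  # a rough power-of-two estimate
```

The answer is always a power of two, so the relative error can be large; averaging many such estimators is the next step.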
Cardinality estimation: HyperLogLog
Demo by: http://www.aggregateknowledge.com/science/blog/hll.html
Billions of distinct values in 1.5KB of RAM with 2% relative error
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
P.Flajolet, É.Fusy, O.Gandouet, F.Meunier; 2007
Code: HyperLogLog
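A toy HyperLogLog sketch, assuming an MD5 hash and omitting the paper's small- and large-range corrections; it splits the hash across m registers and combines them with a harmonic mean:

```python
import hashlib

def hyperloglog(stream, b=10):
    """Toy HyperLogLog: m = 2**b registers; not production quality."""
    m = 2 ** b
    registers = [0] * m
    for item in stream:
        x = int(hashlib.md5(str(item).encode()).hexdigest(), 16)  # 128 bits
        j = x & (m - 1)                       # low b bits choose a register
        w = x >> b                            # remaining 128-b bits
        rho = (128 - b) - w.bit_length() + 1  # position of leftmost 1-bit
        registers[j] = max(registers[j], rho)
    alpha = 0.7213 / (1 + 1.079 / m)          # bias correction for large m
    return int(alpha * m * m / sum(2.0 ** -r for r in registers))

print(hyperloglog(range(100000)))  # close to 100000
```

With 2**10 registers of a few bits each, the standard error is about 1.04/sqrt(m), roughly 3% here, matching the "billions of values in ~1.5KB" claim above when m is larger.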
Count-min sketch
count(value) = min{w1[h1(value)], ... wd[hd(value)]}
Frequency histogram estimation with chance of over-counting
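A minimal count-min sketch matching the formula above (d rows of w counters; the row hashes are derived by salting MD5, an illustrative choice, not the notebook's code):

```python
import hashlib

class CountMinSketch:
    def __init__(self, w, d):
        self.w, self.d = w, d                   # d rows of w counters
        self.table = [[0] * w for _ in range(d)]

    def _hash(self, row, value):
        h = hashlib.md5(('%d:%s' % (row, value)).encode()).hexdigest()
        return int(h, 16) % self.w

    def add(self, value):
        for row in range(self.d):
            self.table[row][self._hash(row, value)] += 1

    def count(self, value):
        # min over rows: collisions can over-count, never under-count
        return min(self.table[row][self._hash(row, value)]
                   for row in range(self.d))

cms = CountMinSketch(w=1000, d=4)
for word in ['a', 'b', 'a', 'c', 'a']:
    cms.add(word)
print(cms.count('a'))  # 3, or more if every row had a collision
```

Space is fixed at w*d counters regardless of how many distinct values stream through, at the price of possible over-counting.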
Code: Frequent Itemsets
Machine Learning: Feature hashingHigh-dimensional machine learning without feature dictionary
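A sketch of the hashing trick: tokens are hashed straight into a fixed-size vector, so no vocabulary needs to be stored. The signed variant and the tiny dimension are illustrative choices, not from the slides:

```python
import hashlib

def hash_features(tokens, dim=16):
    """Map tokens into a fixed-size vector without a feature dictionary."""
    vec = [0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % dim
        sign = 1 if (h >> 64) % 2 == 0 else -1  # signed trick: collisions
        vec[idx] += sign                        # tend to cancel in expectation
    return vec

print(hash_features('the quick brown fox the'.split()))
```

Unseen tokens at prediction time need no dictionary update; they simply hash into the same fixed-size space.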
by Andrew Clegg “Approximate methods for scalable data mining”
Locality-sensitive hashing
To approximate nearest neighbours
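One concrete LSH family is random-hyperplane signatures for cosine similarity; the slide doesn't specify a scheme, so this is an illustrative sketch:

```python
import random

def lsh_signature(vec, planes):
    """One bit per hyperplane: which side of it the vector falls on.
    Vectors at a small angle tend to get similar signatures."""
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                 for plane in planes)

random.seed(0)
dim, n_bits = 5, 8
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [1.1, 2.1, 2.9, 4.2, 5.0]    # nearly the same direction as a
c = [-5.0, 4.0, -3.0, 2.0, -1.0]

sig = lambda v: lsh_signature(v, planes)
print(sig(a), sig(b), sig(c))
```

Candidate neighbours are found by comparing short signatures (or bucketing on them) instead of computing exact distances to every point.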
by Andrew Clegg “Approximate methods for scalable data mining”
Probabilistic Databases● PrDB (University of Maryland)
● Orion (Purdue University)
● MayBMS (Cornell University)
● BlinkDB v0.1 alpha (UC Berkeley and MIT)
BlinkDB: queries
Queries with Bounded Errors and Bounded Response Times on Very Large Data
BlinkDB: architecture
References
Mining of Massive Datasets
by Jure Leskovec, Anand Rajaraman, and Jeff Ullman
http://infolab.stanford.edu/~ullman/mmds.html
Summary
● know the data structures● know what you sacrifice● control errors
http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ by Ilya Katsov