An introduction to probabilistic data-structures (and algorithms)

== Grokking Engineering, April 2016 ==

Võ Việt Hùng
[email protected]


Who am I?

● A technical guy who has been working in IT for 15+ years

✔ In many roles: developer, sys-admin, DBA, big data analyst

✔ Large systems: billions of requests per month

● Current: one of the biggest ad networks in Vietnam

● Past

✔ VNG: Zing Ads, Zing Me, ...

✔ Vietnamworks/Navigos Group...

Agenda (1)

● A real-world problem

● Probabilistic data-structures (PDS), what?

● PDS, why?

● Some characteristics of PDS

● Some common PDS

● Membership Query – BloomFilter

● Cardinality Estimation – HyperLogLog

● Frequency Estimation – Count-Min Sketch

● Percentile and Quantile Estimation – t-digest

Agenda (2)

● Some case studies

● What else in the jungle?

● References

● Q&A

A real-world problem (1)

When processing data sets, we often want to do some simple checks (queries) like:

● Does the data set contain a particular element (membership query)?

● How many distinct elements are in the data set (i.e. what is the cardinality of the data set)?

● What are the most frequent elements (i.e. top-k elements)?

● What are the frequencies of the most frequent elements?

● What is the mean/median value of some quantity in the data set?

A real-world problem (2)

The common approach is to use some kind of deterministic data structure like HashSet or Hashtable for such purposes.

Another approach is to load the data into a database and then run SQL queries.

But as the data grows and the demand for fast responses rises, problems appear: memory and CPU limits, slow queries.

Probabilistic data-structures (PDS)

● PDS are a group of data structures that are extremely useful for big data and streaming/realtime applications.

● These data structures use hash functions to randomize and compactly represent a set of items.

● Collisions are ignored, but errors can be kept under a controlled threshold.

Why?

To deal with:

● fast response

● (very) large data that cannot fit in memory

● data (that can be) processed in one pass

● incremental updates (of results)

● no need for 100% correctness, just a controllable approximation

PDS characteristics

(compared with error-free approaches)

● trade accuracy for space and performance

● use less memory

● have constant (and short) query time

● (usually) support union and intersection operations

● can be merged => map-reduce friendly

● can be parallelized and distributed

Some common PDS

● Membership Query

✔ Bloom Filter (BF)

✔ Bloom Filter extensions: counting-BF, scalable-BF, stable-BF, layered-BF, inverse-BF

✔ Cuckoo hashing

● Cardinality Estimation – HyperLogLog (HLL), KMV, LC

● Frequency Estimation – Count-Min Sketch (CMS)

● Percentile and Quantile Estimation – t-digest

● Skip-list

● ….

Membership Query – Bloom Filter

● conceived by Burton Howard Bloom in 1970

● is used to test whether an element is a member of a set

● False-positive matches are possible, but false-negatives are not. In other words, a query returns either "possibly in set" or "definitely not in set"

● Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter).

● The more elements that are added to the set, the larger the probability of false positives.

BloomFilter – algorithm behind

● A BF is effectively a hash table in which collisions are ignored and each added element is hashed by some number k of hash functions.

● There is one major difference: a Bloom filter does NOT store the hashed keys.

● Instead, it has a bit array as its underlying data structure; each key is remembered by flipping on all of the bits the k hash functions map it to.

BloomFilter – Simple implementation
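A minimal Python sketch of such a filter. The class name and the double-hashing trick (deriving k positions from one SHA-256 digest, in the style of Kirsch & Mitzenmacher) are illustrative choices, not any particular library's API:

```python
import hashlib

class BloomFilter:
    """A bit array of m bits; each key sets/checks k derived positions."""

    def __init__(self, m, k):
        self.m = m                          # number of bits
        self.k = k                          # number of hash functions
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Derive k positions from one strong hash via double hashing:
        # g_i(x) = h1(x) + i * h2(x) mod m.
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # force h2 odd
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All k bits set -> "possibly in set"; any clear bit -> "definitely not".
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))
```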

BloomFilter – Properties

● Unlike a standard hash table, a BF of a fixed size can represent a set with an arbitrarily large number of elements

● adding an element never fails due to the data structure "filling up"

● Union and intersection of BFs with the same size and set of hash functions can be implemented with bitwise OR (union) and AND (intersection)

● The union operation on BFs is lossless in the sense that the resulting BF is the same as the BF created from scratch using the union of the two sets.

● The intersect operation satisfies a weaker property: the false-positive probability in the resulting BF is at most the false-positive probability in one of the constituent BFs, but may be larger than the false-positive probability in the BF created from scratch using the intersection of the two sets.

BloomFilter – simple usage
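A usage sketch, assuming the BloomFilter class from the previous slide has been saved as a hypothetical module bloom_filter.py; the sizing (9586 bits, 7 hashes for ~1000 keys at ~1% false positives) follows the rule-of-thumb table two slides ahead:

```python
# Assuming the BloomFilter sketch from the previous slide lives in a
# hypothetical module bloom_filter.py.
from bloom_filter import BloomFilter

bf = BloomFilter(m=9586, k=7)     # sized for ~1000 keys at ~1% false positives

for url in ("http://a.example", "http://b.example"):
    bf.add(url)

print("http://a.example" in bf)    # True: no false negatives, ever
print("http://zzz.example" in bf)  # False with ~99% probability (a True here
                                   # would be a false positive)
```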

BloomFilter – Math behind
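The standard BF analysis, written out in LaTeX: with n inserted keys, m bits and k hash functions,

```latex
% False-positive probability after inserting n keys into m bits with k hashes:
p \approx \left(1 - e^{-kn/m}\right)^{k}

% Optimal number of hash functions for given m and n:
k_{\mathrm{opt}} = \frac{m}{n}\ln 2

% Bits required for a target false-positive rate p:
m = -\,\frac{n \ln p}{(\ln 2)^2}
```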

BloomFilter – rules of thumb

Formulas and rules of thumb

(http://corte.si/posts/code/bloom-filter-rules-of-thumb/)

fp rate    bits per element
50%        1.44
10%        4.79
2%         8.14
1%         9.58
0.1%       14.38
0.01%      19.17
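These numbers follow directly from the sizing formula above: the bits needed per element depend only on the target false-positive rate p,

```latex
\frac{m}{n} \;=\; \frac{-\ln p}{(\ln 2)^2} \;=\; \frac{\log_2(1/p)}{\ln 2} \;\approx\; 1.44\,\log_2\!\left(\frac{1}{p}\right)
```

so a 50% rate costs 1.44 bits per element, 1% costs 9.58, and every extra factor of 10 in accuracy adds about 4.8 bits.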

BloomFilter – size over probability

BloomFilter extension – Counting

● Counting BFs provide a way to implement a delete operation on a BF without recreating the filter afresh.

● In a counting filter the array positions (buckets) are extended from being a single bit to being an n-bit counter.

● When an item is added, the corresponding counters are incremented, and when it’s removed, the counters are decremented.

● A counting BF takes n times more space than a regular BF, and it also has a scalability limit. Because the counting BF table cannot be expanded, the maximal number of keys to be stored simultaneously in the filter must be known in advance. Once the designed capacity of the table is exceeded, the false-positive rate grows rapidly as more keys are inserted.
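A minimal Python sketch of a counting filter (illustrative; unbounded Python ints stand in for the n-bit counters):

```python
import hashlib

class CountingBloomFilter:
    """Counters instead of bits, so elements can also be removed."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Only safe for items that were actually added; otherwise counters
        # belonging to other items get corrupted.
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))
```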

BloomFilter extension – Scalable

● Standard BFs require knowing the size of the data set ahead of time in order to keep the error probability controllable.

● Scalable BFs are useful for cases where the size of the data set isn’t known a priori and memory constraints aren’t of particular concern.

● Scalable BF is essentially an array of BFs. New elements are added to the last filter. When this filter becomes “full” – when it reaches a target fill ratio – a new filter is added with a tightened error probability.
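A sketch of the idea, assuming the BloomFilter class from the earlier slide in a hypothetical module bloom_filter.py. Real implementations trigger growth on the observed fill ratio; this sketch simply counts insertions against each filter's designed capacity:

```python
import math
from bloom_filter import BloomFilter   # hypothetical module (earlier sketch)

class ScalableBloomFilter:
    """A growing list of plain BFs. Adds go to the newest filter; when it is
    "full", a bigger filter with a tighter error budget is appended."""

    def __init__(self, capacity=1000, error=0.01, growth=2, tightening=0.5):
        self.filters, self.counts, self.capacities = [], [], []
        self._next_capacity, self._next_error = capacity, error
        self.growth, self.tightening = growth, tightening
        self._grow()

    def _grow(self):
        n, p = self._next_capacity, self._next_error
        m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # bits needed
        k = max(1, round(-math.log2(p)))                    # hash functions
        self.filters.append(BloomFilter(m, k))
        self.counts.append(0)
        self.capacities.append(n)
        self._next_capacity = n * self.growth       # next filter: bigger,
        self._next_error = p * self.tightening      # but tighter error budget

    def add(self, item):
        if self.counts[-1] >= self.capacities[-1]:
            self._grow()
        self.filters[-1].add(item)
        self.counts[-1] += 1

    def __contains__(self, item):
        # An item may live in any generation, so test all of them.
        return any(item in f for f in self.filters)
```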


BloomFilter extension – Stable

● A Stable BF is a variant of the BF for detecting duplicates in unbounded data streams with limited space (memory). In particular, if the stream is not uniformly distributed, meaning duplicates are likely to be grouped closer together, the rate of false positives becomes immaterial.

● Since there is no way to store the entire history of a stream (which can be infinite), Stable BFs continuously evict stale information to make room for more recent elements.

● Since stale information is evicted, the Stable BF introduces false negatives, which do not appear in traditional Bloom filters. But a tight upper bound of false positive rates is guaranteed.
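A sketch following the Deng & Rafiei scheme (parameter names are illustrative): before each insert, a few random cells are decremented (evicting stale keys), then the key's k cells are set to the counter ceiling:

```python
import hashlib
import random

class StableBloomFilter:
    """Small counters; the fraction of set cells converges to a stable value,
    bounding the false-positive rate at the cost of false negatives."""

    def __init__(self, m, k, max_value=3, evictions=10):
        self.m, self.k = m, k
        self.max_value = max_value     # counter ceiling (e.g. 2 bits -> 3)
        self.evictions = evictions     # P: counters decremented per insert
        self.cells = [0] * m

    def _positions(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for _ in range(self.evictions):        # continuously evict stale info
            j = random.randrange(self.m)
            if self.cells[j] > 0:
                self.cells[j] -= 1
        for pos in self._positions(item):      # then remember this item
            self.cells[pos] = self.max_value

    def __contains__(self, item):
        return all(self.cells[pos] > 0 for pos in self._positions(item))
```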

BloomFilter extension – Layered

● A layered BF consists of multiple BF layers.

● Layered BFs allow keeping track of how many times an item was added to the BF by checking how many layers contain the item.

● With a layered BF a check operation will normally return the deepest layer number the item was found in.
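A sketch, again assuming the BloomFilter class from the earlier slide in a hypothetical module bloom_filter.py:

```python
from bloom_filter import BloomFilter   # hypothetical module (earlier sketch)

class LayeredBloomFilter:
    """A stack of plain BFs: layer i holds keys seen at least i+1 times, so
    the deepest layer containing a key estimates its add count."""

    def __init__(self, layers, m, k):
        self.layers = [BloomFilter(m, k) for _ in range(layers)]

    def add(self, item):
        # Add to the first layer that does not already contain the item;
        # counts saturate at the number of layers.
        for layer in self.layers:
            if item not in layer:
                layer.add(item)
                return

    def count(self, item):
        # Deepest layer the item is found in (0 = never seen).
        depth = 0
        for layer in self.layers:
            if item not in layer:
                break
            depth += 1
        return depth
```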

BloomFilter extension – Inverse

● Inverse BF is an “opposite” of BF. It may report a false negative but can never report a false positive. That is, it may indicate that an item has not been seen when it actually has, but it will never report an item as seen which it hasn’t come across.

● Inverse BF behaves in a similar manner to a fixed-size hash map of m buckets which doesn’t handle conflicts, but it provides lock-free concurrency using an underlying CAS.

● Inverse BF is a nice option for dealing with unbounded streams or large data sets due to its limited memory usage. If duplicates are close together, the rate of false negatives becomes vanishingly small with an adequately sized filter.
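A single-threaded Python sketch of the idea (a real concurrent version would swap slots with an atomic compare-and-swap, as the post in the references describes):

```python
class InverseBloomFilter:
    """A fixed-size array of slots with no collision handling: colliding
    items overwrite each other, so false negatives are possible but false
    positives are not."""

    def __init__(self, size):
        self.slots = [None] * size

    def observe(self, item):
        """Report whether `item` was (probably) seen before, then record it."""
        i = hash(item) % len(self.slots)
        seen = self.slots[i] == item   # True only if this exact item is stored
        self.slots[i] = item           # record it, evicting any colliding item
        return seen
```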

BloomFilter – Applications

● Akamai's web servers use Bloom filters to prevent "one-hit-wonders" from being stored in its disk caches

● Google BigTable, Apache HBase and Apache Cassandra use Bloom filters to reduce the disk lookups for non-existent rows or columns

● Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed

● The Squid Web Proxy Cache uses Bloom filters for cache digests

● Bitcoin uses Bloom filters to speed up wallet synchronization

● The Exim mail transfer agent (MTA) uses Bloom filters in its rate-limit feature

BloomFilter – Alternatives

● Cuckoo hashing: https://en.wikipedia.org/wiki/Cuckoo_hashing

● Roaring bitmaps: http://roaringbitmap.org/

Cardinality Estimation – HyperLogLog

● A streaming algorithm used for estimating the number of distinct elements (the cardinality) of very large data sets.

● A HyperLogLog counter can count one billion distinct items with an accuracy of 2% using only 1.5 KB of memory.

● It is based on the bit-pattern observation that, for a stream of uniformly distributed random numbers, if the maximum number of leading 0 bits seen is k, the cardinality of the stream is very likely about 2^k.

HyperLogLog – simple explanation

● For example, given four bits there exist only 16 possible values. If the highest number of leading zeroes seen in our stream was three (000...), then only 2 of the 16 values (1 in 8) begin that way, so we conclude that the cardinality of our streaming set is about 8.

HyperLogLog – more details

● In the HLL algorithm, a hash function is applied to each element in the original multiset (a set which allows multiple occurrences of its elements), to obtain a multiset of uniformly distributed random numbers with the same cardinality as the original multiset. The cardinality of this randomly distributed set can then be estimated using the algorithm above.

● The simple estimate of cardinality obtained using the algorithm above has the disadvantage of a large variance. In the HyperLogLog algorithm, the variance is minimised by splitting the multiset into numerous subsets, calculating the maximum number of leading zeros in the numbers in each of these subsets, and using a harmonic mean to combine these estimates for each subset into an estimate of the cardinality of the whole set.

HyperLogLog – an implementation
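A compact Python sketch of the algorithm described above. The register count 2^p, the bias constant alpha and the linear-counting fallback follow the published HLL algorithm; this is an illustration, not a production implementation:

```python
import hashlib
import math

class HyperLogLog:
    """2^p registers; each keeps the max leading-zero count (+1) seen."""

    def __init__(self, p=14):                    # 2^14 registers
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias correction, m >= 128

    def add(self, item):
        x = int.from_bytes(
            hashlib.sha256(str(item).encode()).digest()[:8], "big")
        j = x >> (64 - self.p)                     # first p bits pick a register
        w = x & ((1 << (64 - self.p)) - 1)         # remaining 64-p bits
        rho = (64 - self.p) - w.bit_length() + 1   # leading zeros + 1
        self.registers[j] = max(self.registers[j], rho)

    def cardinality(self):
        # Harmonic mean of 2^register over all registers.
        estimate = self.alpha * self.m ** 2 / sum(
            2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:          # small-range correction:
            return self.m * math.log(self.m / zeros)    # linear counting
        return estimate

    def merge(self, other):
        # Union of two HLLs built with the same p: register-wise maximum.
        for j, r in enumerate(other.registers):
            self.registers[j] = max(self.registers[j], r)

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
print(round(hll.cardinality()))   # ~100,000, typically within ~1%
```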

Frequency Est. – Count-Min Sketch

● Count-Min sketches are a family of memory-efficient data structures that allow one to estimate frequency-related properties of a data set: estimate frequencies of particular elements, find the top-k frequent elements, perform range queries (where the goal is to find the sum of frequencies of elements within a range), estimate percentiles.

● It is somewhat similar to a Bloom filter. The main difference is that a Bloom filter represents a set as a bitmap, while a Count-Min sketch represents a multi-set and keeps a frequency distribution summary.

Frequency Est. – Count-Min Sketch

● A Count-Min sketch is a two-dimensional array (d × w) of integer counters. When a value arrives, it is mapped to one position in each of the d rows using d different (and preferably independent) hash functions, and the counter at each of those positions is incremented.

Frequency Est. – Count-Min Sketch

● The estimate of the counts for an item is the minimum value of the counts at the array positions determined by the d hash functions.

● The space used by a Count-Min sketch is the array of w × d counters. By choosing appropriate values for d and w, very small error with high probability can be achieved.

Count-Min Sketch – implementation
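A minimal Python sketch. The `count` parameter allows weighted updates, which Case Study 3 later relies on; the hashing choices are illustrative:

```python
import hashlib

class CountMinSketch:
    """A d x w array of counters; estimate() takes the row-wise minimum."""

    def __init__(self, d=5, w=2000):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _positions(self, item):
        # One column index per row, derived by double hashing.
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.w for i in range(self.d)]

    def add(self, item, count=1):
        for row, pos in enumerate(self._positions(item)):
            self.table[row][pos] += count

    def estimate(self, item):
        # Minimum across rows: collisions can inflate counts, never deflate.
        return min(self.table[row][pos]
                   for row, pos in enumerate(self._positions(item)))
```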

Count-Min Sketch – Properties

● Union can be performed by cell-wise ADD operation

● O(k) query time

● Better accuracy for higher frequency items (heavy-hitters)

● Can only cause over-counting but not under-counting

Count-Min Sketch – Notes

● The accuracy of a Count-Min sketch depends on the ratio between the sketch size and the total number of registered events. This means that the Count-Min technique provides significant memory gains only for skewed data, i.e. data where items have very different probabilities.

● Applicability of Count-Min sketches is not a straightforward question and the best thing that can be recommended is experimental evaluation of each particular case.

● A Count-Min sketch performs well on highly skewed data, but on low or moderately skewed data it is not as efficient because of poor protection from a high number of hash collisions – the Count-Min sketch simply selects the minimal (least distorted) estimator => Count-Mean-Min sketch

Count-Mean-Min Sketch – implementation

● CMM estimates the noise for each hash function as the average value of all counters in the row that corresponds to that function (except the counter that corresponds to the query itself), subtracts it from that row's estimate, and, finally, computes the median of the corrected estimates over all hash functions.
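A sketch of that estimator on top of the CountMinSketch class from the earlier slide (assumed saved as a hypothetical module count_min.py); the caller supplies N, the total number of registered events:

```python
import statistics

from count_min import CountMinSketch   # hypothetical module (earlier sketch)

class CountMeanMinSketch(CountMinSketch):
    def estimate(self, item, n_total):
        """n_total = total of all add() counts registered so far."""
        corrected = []
        for row, pos in enumerate(self._positions(item)):
            counter = self.table[row][pos]
            # Expected collision noise in this row: the other events spread
            # over the other w-1 counters.
            noise = (n_total - counter) / (self.w - 1)
            corrected.append(counter - noise)
        return statistics.median(corrected)
```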

Count-Min Sketch – Top-k problem

Find all elements in the data set with frequencies greater than k percent of the total number of elements in the data set.

● Maintain a standard Count-Min sketch during the scan of the data set and put all elements into it.

● Maintain a heap of top elements, initially empty, and a counter N of the total number of already processed elements.

● For each element in the data set:

✔ Put the element to the sketch

✔ Estimate the frequency of the element using the sketch. If the frequency is greater than the threshold (k·N), put the element into the heap. The heap should be periodically or continuously cleaned up to remove elements that no longer meet the threshold (see the sketch below).

● In general, the top-k problem makes sense only for skewed data, so usage of Count-Min sketches is reasonable in this context.
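A sketch of this procedure (function and parameter names are illustrative; for simplicity, heap entries keep the estimate from push time rather than being refreshed):

```python
import heapq

from count_min import CountMinSketch   # hypothetical module (earlier sketch)

def heavy_hitters(stream, threshold=0.01, d=5, w=2000):
    """Return elements whose estimated frequency exceeds threshold * N."""
    cms = CountMinSketch(d, w)
    heap, in_heap, n = [], set(), 0
    for item in stream:
        cms.add(item)
        n += 1
        est = cms.estimate(item)
        if est >= threshold * n and item not in in_heap:
            heapq.heappush(heap, (est, item))
            in_heap.add(item)
        # Continuously clean up elements that no longer meet the threshold
        # (an evicted element re-enters on its next arrival if it qualifies).
        while heap and heap[0][0] < threshold * n:
            _, evicted = heapq.heappop(heap)
            in_heap.discard(evicted)
    return sorted(((item, cms.estimate(item)) for item in in_heap),
                  key=lambda pair: -pair[1])
```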

Percentile & Quantile Est. – t-digest

● Consider the problem of calculating the median of a data set in a distributed environment ('cause the median of medians is not equal to the overall median; a small demonstration follows this list) => what's needed is an algorithm that can approximate the median while still being space efficient.

● The t-digest is a probabilistic data structure for estimating the median (and, more generally, any percentile) from either distributed data or streaming data.

● Internally, the data structure is a sparse representation of the cumulative distribution function. After ingesting data, the data structure has learned the "interesting" points of the CDF, called centroids.
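A small, self-contained demonstration of the first point on synthetic data: when the partitions are not identically distributed, the median of per-partition medians can be far from the true median.

```python
import random
import statistics

random.seed(42)
partitions = [
    [random.lognormvariate(0, 1) for _ in range(30_000)],   # node 1
    [random.lognormvariate(0, 1) for _ in range(30_000)],   # node 2
    [random.lognormvariate(2, 1) for _ in range(30_000)],   # node 3, skewed
]
combined = [x for part in partitions for x in part]

print(statistics.median(combined))                                   # ~1.8
print(statistics.median(statistics.median(p) for p in partitions))   # ~1.0
```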

Percentile & Quantile Est. – t-digest

● A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly making it useful in map-reduce and parallel streaming applications.

● The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics.

● The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean.

● The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests in spite of the fact that t-digests are more compact when stored on disk.

t-digest – characteristics

● has smaller summaries than Q-digest

● works on doubles as well as integers.

● provides parts-per-million accuracy for extreme quantiles and typically <1000 ppm accuracy for middle quantiles

● is fast

● is very simple

● can be used with map-reduce very easily because digests can be merged
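A usage sketch with the third-party Python `tdigest` package (https://github.com/CamDavidsonPilon/tdigest, `pip install tdigest`); the API shown here (`update`, `percentile`, `+` for merging) is that package's, not part of the standard library, and the workload is illustrative:

```python
import random
from tdigest import TDigest

left, right = TDigest(), TDigest()
for _ in range(50_000):
    left.update(random.random())    # e.g. latencies observed on node 1
    right.update(random.random())   # e.g. latencies observed on node 2

merged = left + right               # digests merge => map-reduce friendly
print(merged.percentile(50))        # ~0.5
print(merged.percentile(99))        # ~0.99
```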

Some remarks

● For some structures like HyperLogLog or Bloom filter, there are simple and practical formulas to determine the parameters of the structure on the basis of expected data volume and required error probability.

● Other structures like Count-(Mean-)Min Sketch have complex dependency on statistical properties of data and experiments are the only reasonable way to understand their applicability to real use cases.

● Data-structures populated by different data sets can often be combined to process complex queries.

● Some types of queries can be supported by using customized versions of the described data-structures/algorithms.

Case Study 1

● There is a system that tracks a huge number of web events, and each event is marked by a number of tags, including the ID of the user the event corresponds to. It is required to report the number of unique users that meet a specified combination of tags (like users from city C that visited site A or site B).

Case Study 1: solution

● Solution 1:

✔ maintain a BF that tracks user IDs for each tag value and a BF that contains user IDs that correspond to the final result.

✔ A user ID from each incoming event is tested against the per-tag filters – does it satisfy the required combination of tags or not.

✔ If the user ID passes this test, it is additionally tested against the additional BF that corresponds to the report itself and, if passed, the final report counter is increased.

● Solution 2: use an HLL for each tag value

Case Study 2

● There is a system that receives events on user visits from different internet sites.

● This system enables analysts to query the number of unique visitors for a specified date range and site.

Case Study 2: solution

● HLL can be used to aggregate information about visitor IDs for each day and site. The per-day HLLs are saved, and a query is processed by merging the daily HLLs for the requested range (for HLL, merging is a register-wise max rather than a bitwise OR; a sketch follows).
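A sketch of the query path, assuming the HyperLogLog class from the earlier slide (with its `merge` method) in a hypothetical module hyperloglog.py, and one stored HLL per (site, day):

```python
from hyperloglog import HyperLogLog   # hypothetical module (earlier sketch)

def unique_visitors(daily_hlls, site, days):
    """Merge the stored per-day HLLs for `site` over `days`, then estimate."""
    merged = HyperLogLog(p=14)         # must match the p of the stored HLLs
    for day in days:
        merged.merge(daily_hlls[site, day])   # register-wise max
    return merged.cardinality()
```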

Case Study 3

● There is a system that tracks traffic by IP address, and it is required to detect the most traffic-intensive addresses.

Case Study 3: solution

● CMS?!!

● The problem is not trivial, because we need to track the total traffic for each address, not the frequency of items.

● Counters in the CMS implementation can be incremented not by 1, but by the absolute amount of traffic for each observation (i.e. the size of the IP packet, if the sketch is updated for each packet).

● In this case, the sketch will track the amount of traffic for each address, and a heap with the most traffic-intensive addresses can be maintained (top-k or heavy hitters), as sketched below.
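A sketch, reusing the CountMinSketch class from the earlier slide (hypothetical module count_min.py) with weighted updates:

```python
from count_min import CountMinSketch   # hypothetical module (earlier sketch)

traffic = CountMinSketch(d=5, w=50_000)

def on_packet(src_ip, size_bytes):
    # Increment by bytes observed, not by 1.
    traffic.add(src_ip, count=size_bytes)

on_packet("203.0.113.7", 1500)
on_packet("203.0.113.7", 40)
print(traffic.estimate("203.0.113.7"))   # >= 1540; CMS never under-counts
```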

Case Study 4

● There is a system that monitors traffic and counts unique visitors for different criteria (visited site, geography, etc.).

● It is required to compute 100 most popular sites using a number of unique visitors as a metric of popularity.

● Popularity should be computed every day on the basis of data for the last 30 days, i.e. every day a one-day partition is added and another is removed from the scope.

Case Study 4: solution

● Create a fresh set of per-site HLL counters every day and maintain each set for 30 days, i.e. 30 sets of counters are active at any moment in time.

Case Study 5

● Count the number of users performing an action (view, click, ...) on site objects (banner, button, ...) 1 time, 2 times, ..., 10+ times.

● Report looks like below

Filter: Object=X
1 time:    98765
2 times:   76543
3 times:   54321
…
9 times:   1234
10+ times: 343

Case Study 5: solution

● Should we use CMS???

● … and why/why NOT???

Case Study 5: solution

● Use scalable layered-BF to track k-times user actions on objects

● Use HLL to count users on each k-times action

What else?

● Libs

✔ Redis: HLL already built in; BF in the upcoming 3.2

✔ https://github.com/twitter/algebird

✔ https://github.com/addthis/stream-lib

✔ https://github.com/tylertreat/BoomFilters

✔ https://github.com/tdunning/t-digest

● More

✔ Linear Counting

✔ MinHash

✔ Top-K

References

● https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

● https://dzone.com/articles/introduction-probabilistic-0

● http://bravenewgeek.com/stream-processing-and-probabilistic-methods/

● https://www.somethingsimilar.com/2012/05/21/the-opposite-of-a-bloom-filter/

● https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest
