Streaming Algorithms
Joe Kelley, Data Engineer
July 2013
CONFIDENTIAL | 2
Leading Provider of Data Science & Engineering for Big Analytics

Accelerating Your Time to Value
• IMAGINE: Strategy and Roadmap
• ILLUMINATE: Training and Education
• IMPLEMENT: Hands-On Data Science and Data Engineering
What is a Streaming Algorithm?
• Operates on a continuous stream of data
• Unknown or infinite size
• Only one pass; options:
  • Store it
  • Lose it
  • Store an approximation
• Limited processing time per item
• Limited total memory
Why use a Streaming Algorithm?
• Compare to typical “Big Data” approach: store everything, analyze later, scale linearly
• Streaming pros:
  • Lower latency
  • Lower storage cost
• Streaming cons:
  • Less flexibility
  • Lower precision (sometimes)
• Answer? Why not both?
General Techniques
1. Tunable approximation
2. Sampling
  • Sliding window
  • Fixed number
  • Fixed percentage
3. Hashing: useful randomness
Example 1: Sampling device error rates
• Stream of (device_id, event, timestamp)
• Scenario:
  • Not enough space to store everything
  • For simple queries, storing 1% is good enough
Example 1: Sampling device error rates
Algorithm:
for each element e:
  with probability 0.01:
    store e
  else:
    throw out e
Can lead to some insidious statistical “bugs”…
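A minimal Python sketch of this element-wise sampling; the in-memory list and the function name are stand-ins, not part of the slides:

```python
import random

def sample_stream(stream, p=0.01, rng=None):
    """Keep each stream element independently with probability p."""
    rng = rng or random.Random()
    stored = []
    for e in stream:
        if rng.random() < p:
            stored.append(e)   # store e
        # else: throw out e
    return stored
```

With p = 0.01, roughly 1% of elements survive, but which ones survive is random per element, which is exactly what invites the bugs below.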
Example 1: Sampling device error rates
Query: How many errors has the average device encountered?

Answer:
SELECT AVG(n) FROM (
  SELECT COUNT(*) AS n
  FROM events
  WHERE event = 'ERROR'
  GROUP BY device_id
)
Simple… but off by up to 100x: each device had only 1% of its events sampled. Can we just multiply by 100? Not quite: devices whose events were all dropped vanish from the GROUP BY entirely, so even the scaled average is biased.
Example 1: Sampling device error rates
Better Algorithm:
for each element e:
  if (hash(e.device_id) mod 100) == 0:
    store e
  else:
    throw out e
Choose what you hash on carefully to match your queries… or keep a separate sample for each way you might want to hash.
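The same algorithm in Python, keyed on device_id. Here zlib.crc32 is an arbitrary stand-in for the hash function, and the event tuples are invented test data:

```python
import zlib

def keep(device_id, percent=10):
    """Deterministically keep all events from ~percent% of devices."""
    return zlib.crc32(device_id.encode()) % 100 < percent

# Toy stream: 500 devices, 3 events each (made-up data)
events = [("dev%d" % i, "ERROR", t) for i in range(500) for t in range(3)]
stored = [e for e in events if keep(e[0])]
# Every device that survives keeps its complete event history,
# so per-device aggregates are exact for the sampled devices
```

Because the decision depends only on device_id, a device is either fully in or fully out, and per-device queries over the sample are unbiased for the devices kept.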
Example 2: Sampling fixed number

Want to sample a fixed count (k), not a fixed percentage.

Algorithm:

Let arr = array of size k
for each element e:
  if arr is not yet full:
    add e to arr
  else:
    with probability p:
      replace a random element of arr with e
    else:
      throw out e

Choice of p is crucial:
• p = constant: prefer more recent elements. Higher p = more recent
• p = k/n: sample uniformly from entire stream
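The uniform case (p = k/n) is classic reservoir sampling; a Python sketch, with function and variable names of my own choosing:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Uniform k-sample from a stream of unknown length (Algorithm R)."""
    rng = rng or random.Random()
    arr = []
    for n, e in enumerate(stream, start=1):
        if len(arr) < k:
            arr.append(e)          # fill the reservoir first
        else:
            # Element n survives with probability p = k/n; drawing one
            # index in [0, n) and accepting if it lands in [0, k) does
            # both the keep/skip coin flip and the slot choice at once.
            j = rng.randrange(n)
            if j < k:
                arr[j] = e
    return arr
```

Note the single randrange(n) draw: combining "keep with probability k/n" and "pick a random slot" into one draw is a standard trick and keeps every seen element equally likely to end up in the reservoir.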
Example 3: Counting unique users
• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over a time period
• Naïve approach:
  • Store all user_id’s in a list/tree/hashtable
  • Millions of users = lots of memory
• Better approach:
  • Store all user_id’s in a database
  • Good, but maybe it’s not fast enough…
• What if an approximate count is ok?
CONFIDENTIAL | 13
Example 3: Counting unique users
• Approximate count is ok
• Flajolet-Martin idea:
  • Hash each user_id into a bit string
  • Count the trailing zeros
  • Remember the maximum number of trailing zeros seen
user_id      H(user_id)   trailing zeros   max(trailing zeros)
john_doe     0111001001   0                0
jane_doe     1011011100   2                2
alan_t       0010111000   3                3
EWDijkstra   1101011110   1                3
jane_doe     1011011100   2                3
CONFIDENTIAL | 14
Example 3: Counting unique users
• Intuition:
  • If we had seen 2 distinct users, we would expect 1 trailing zero
  • If we had seen 4, we would expect 2 trailing zeros
  • If we had seen 2^r, we would expect r trailing zeros
• In general, if the maximum number of trailing zeros seen is r, then 2^r is a reasonable estimate of the number of distinct users
• Want more precision? Use more independent hash functions, and combine the results:
  • Median = only get powers of two
  • Mean = subject to skew
  • Median of means of groups works well in practice
CONFIDENTIAL | 15
Example 3: Counting unique users
Flajolet-Martin, all together:
arr = int[k]
for each item e:
  for i in 0...k-1:
    z = trailing_zeros(hash_i(e))
    if z > arr[i]:
      arr[i] = z

means = group_means(arr)
median = median(means)
return pow(2, median)
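A runnable Python sketch of the whole pipeline. Salting one MD5 digest fakes k "independent" hash functions; that shortcut is my assumption for illustration, not the paper's construction:

```python
import hashlib

def trailing_zeros(x):
    if x == 0:
        return 32                      # cap: hashes here are 32 bits
    return (x & -x).bit_length() - 1

def hash_i(item, i):
    # Salted MD5 standing in for k independent hash functions
    digest = hashlib.md5(("%d:%s" % (i, item)).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def fm_estimate(items, k=32, group=8):
    arr = [0] * k                      # max trailing zeros per hash fn
    for e in items:
        for i in range(k):
            z = trailing_zeros(hash_i(e, i))
            if z > arr[i]:
                arr[i] = z
    # Median of group means, as on the slide
    means = sorted(sum(arr[g:g + group]) / group
                   for g in range(0, k, group))
    m = len(means)
    median = means[m // 2] if m % 2 else (means[m // 2 - 1] + means[m // 2]) / 2
    return 2 ** median
```

Memory is k small integers regardless of how many distinct users flow past, which is the whole point.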
CONFIDENTIAL | 16
Example 3: Counting unique users
Flajolet-Martin in practice
• Devil is in the details
• Tunable precision:
  • More hash functions = more precise
  • See the paper for bounds on precision
• Tunable latency:
  • More hash functions = higher latency
  • Faster hash functions = lower latency
  • Faster hash functions = more possibility of correlation = less precision
Remember: streaming algorithm for quick, imprecise answer. Back-end batch algorithm for slower, exact answer
CONFIDENTIAL | 17
Example 4: Counting Individual Item Frequencies
Want to keep track of how many times each item has appeared in the stream
Many applications:
• How popular is each search term?
• How many times has this hashtag been tweeted?
• Which IP addresses are DDoS’ing me?
Again, two obvious approaches:
• In-memory hashmap of item → count
• Database
But can we be more clever?
CONFIDENTIAL | 18
Example 4: Counting Individual Item Frequencies
Idea:
• Maintain array of counts
• Hash each item, increment array at that index
• To check the count of an item, hash again and check array at that index
• Over-estimates because of hash “collisions”
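A toy version of this single-array idea, with a deliberately tiny array so collisions are visible; Python's built-in hash is just a stand-in for a real hash function:

```python
W = 16                        # deliberately tiny so collisions happen

counts = [0] * W

def h(item):
    return hash(item) % W     # built-in hash, purely illustrative

stream = ["a", "b", "a", "c", "a", "b"]
for item in stream:
    counts[h(item)] += 1

# counts[h("a")] is at least 3 (the true count); any item that
# hashes to the same slot inflates it, so estimates only over-count
```

The one-sided error is the key property: a cell can only be too big, never too small, and Count-Min exploits exactly that.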
CONFIDENTIAL | 19
Example 4: Counting Individual Item Frequencies
Count-Min Sketch algorithm:
• Maintain 2-d array of size w x d
• Choose d different hash functions; each row in the array corresponds to one hash function
• Hash each item with every hash function, increment the appropriate position in each row
• To query an item, hash it d times again, take the minimum value across all rows
CONFIDENTIAL | 20
Example 4: Counting Individual Item Frequencies
Count-Min Sketch, all together:

arr = int[d][w]
for each item e:
  for i in 0...d-1:
    j = hash_i(e) mod w
    arr[i][j]++

def frequency(q):
  min = +infinity
  for i in 0...d-1:
    j = hash_i(q) mod w
    if arr[i][j] < min:
      min = arr[i][j]
  return min
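The same sketch as a small Python class. As before, salted MD5 stands in for d independent hash functions, and w and d are arbitrary illustrative defaults:

```python
import hashlib

class CountMinSketch:
    def __init__(self, w=256, d=4):
        self.w, self.d = w, d
        self.arr = [[0] * w for _ in range(d)]

    def _hash(self, item, i):
        # Salted MD5 standing in for d independent hash functions
        digest = hashlib.md5(("%d:%s" % (i, item)).encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.w

    def add(self, item):
        for i in range(self.d):
            self.arr[i][self._hash(item, i)] += 1

    def frequency(self, item):
        # Minimum across rows: collisions only ever inflate a cell,
        # so the smallest cell is the tightest over-estimate
        return min(self.arr[i][self._hash(item, i)] for i in range(self.d))
```

Total memory is w × d counters no matter how many distinct items appear, and frequency() never under-counts.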
CONFIDENTIAL | 21
Example 4: Counting Individual Item Frequencies
Count-Min Sketch in practice
• Devil is in the details
• Tunable precision:
  • Bigger array = more precise
  • See the paper for bounds on precision
• Tunable latency:
  • More hash functions = higher latency
• Better at estimating more frequent items
• Can subtract out an estimate of collisions
Remember: streaming algorithm for quick, imprecise answer. Back-end batch algorithm for slower, exact answer
Questions?
• Feel free to reach out:
  • www.thinkbiganalytics.com
  • [email protected]
  • www.slideshare.net/jfkelley1
• References:
  • http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
  • http://infolab.stanford.edu/~ullman/mmds.html
We’re hiring! Engineers and Data Scientists