Streaming Algorithms
Joe Kelley, Data Engineer
July 2013
CONFIDENTIAL | 2
Leading Provider of Data Science & Engineering for Big Analytics

Accelerating Your Time to Value
• IMAGINE: Strategy and Roadmap
• ILLUMINATE: Training and Education
• IMPLEMENT: Hands-On Data Science and Data Engineering
What is a Streaming Algorithm?
• Operates on a continuous stream of data
• Unknown or infinite size
• Only one pass; options:
  • Store it
  • Lose it
  • Store an approximation
• Limited processing time per item
• Limited total memory
Why use a Streaming Algorithm?
• Compare to typical “Big Data” approach: store everything, analyze later, scale linearly
• Streaming pros:
  • Lower latency
  • Lower storage cost
• Streaming cons:
  • Less flexibility
  • Lower precision (sometimes)
• Answer? Why not both?
General Techniques
1. Tunable approximation
2. Sampling
  • Sliding window
  • Fixed number
  • Fixed percentage
3. Hashing: useful randomness
Example 1: Sampling device error rates
• Stream of (device_id, event, timestamp)
• Scenario:
  • Not enough space to store everything
  • For simple queries, storing 1% is good enough
Example 1: Sampling device error rates
Algorithm:
for each element e:
  with probability 0.01:
    store e
  else:
    throw out e
Can lead to some insidious statistical “bugs”…
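A minimal Python sketch of this element-wise sampling; the in-memory list and the function name are stand-ins, not part of the slides:

```python
import random

def sample_stream(stream, p=0.01, rng=None):
    """Keep each stream element independently with probability p."""
    rng = rng or random.Random()
    stored = []
    for e in stream:
        if rng.random() < p:
            stored.append(e)   # store e
        # else: throw out e
    return stored
```

With p = 0.01, roughly 1% of elements survive, but which ones survive is random per element, which is exactly what invites the bugs below.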
Example 1: Sampling device error rates
Query: How many errors has the average device encountered?

Answer:
SELECT AVG(n) FROM (
  SELECT COUNT(*) AS n
  FROM events
  WHERE event = 'ERROR'
  GROUP BY device_id
)
Simple… but off by up to 100x: each device had only 1% of its events sampled. Can we just multiply by 100? Not quite: devices whose events were all dropped vanish from the GROUP BY entirely, so even the scaled average is biased.
Example 1: Sampling device error rates
Better Algorithm:
for each element e:
  if (hash(e.device_id) mod 100) == 0:
    store e
  else:
    throw out e
Choose what you hash on carefully to match your queries… or keep a separate sample for each way you might want to hash.
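The same algorithm in Python, keyed on device_id. Here zlib.crc32 is an arbitrary stand-in for the hash function, and the event tuples are invented test data:

```python
import zlib

def keep(device_id, percent=10):
    """Deterministically keep all events from ~percent% of devices."""
    return zlib.crc32(device_id.encode()) % 100 < percent

# Toy stream: 500 devices, 3 events each (made-up data)
events = [("dev%d" % i, "ERROR", t) for i in range(500) for t in range(3)]
stored = [e for e in events if keep(e[0])]
# Every device that survives keeps its complete event history,
# so per-device aggregates are exact for the sampled devices
```

Because the decision depends only on device_id, a device is either fully in or fully out, and per-device queries over the sample are unbiased for the devices kept.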
Example 2: Sampling fixed number

Want to sample a fixed count (k), not a fixed percentage.

Algorithm:

Let arr = array of size k
for each element e:
  if arr is not yet full:
    add e to arr
  else:
    with probability p:
      replace a random element of arr with e
    else:
      throw out e

Choice of p is crucial:
• p = constant: prefer more recent elements. Higher p = more recent
• p = k/n: sample uniformly from entire stream
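The uniform case (p = k/n) is classic reservoir sampling; a Python sketch, with function and variable names of my own choosing:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Uniform k-sample from a stream of unknown length (Algorithm R)."""
    rng = rng or random.Random()
    arr = []
    for n, e in enumerate(stream, start=1):
        if len(arr) < k:
            arr.append(e)          # fill the reservoir first
        else:
            # Element n survives with probability p = k/n; drawing one
            # index in [0, n) and accepting if it lands in [0, k) does
            # both the keep/skip coin flip and the slot choice at once.
            j = rng.randrange(n)
            if j < k:
                arr[j] = e
    return arr
```

Note the single randrange(n) draw: combining "keep with probability k/n" and "pick a random slot" into one draw is a standard trick and keeps every seen element equally likely to end up in the reservoir.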
Example 3: Counting unique users
• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over a time period
• Naïve approach:
  • Store all user_id’s in a list/tree/hashtable
  • Millions of users = lots of memory
• Better approach:
  • Store all user_id’s in a database
  • Good, but maybe it’s not fast enough…
• What if an approximate count is ok?
CONFIDENTIAL | 13
Example 3: Counting unique users
• Approximate count is ok
• Flajolet-Martin idea:
  • Hash each user_id into a bit string
  • Count the trailing zeros
  • Remember the maximum number of trailing zeros seen
user_id      H(user_id)   trailing zeros   max(trailing zeros)
john_doe     0111001001   0                0
jane_doe     1011011100   2                2
alan_t       0010111000   3                3
EWDijkstra   1101011110   1                3
jane_doe     1011011100   2                3
CONFIDENTIAL | 14
Example 3: Counting unique users
• Intuition:
  • If we had seen 2 distinct users, we would expect 1 trailing zero
  • If we had seen 4, we would expect 2 trailing zeros
  • If we had seen 2^r, we would expect r trailing zeros
• In general, if the maximum number of trailing zeros seen is r, then 2^r is a reasonable estimate of the number of distinct users
• Want more precision? Use more independent hash functions, and combine the results:
  • Median = only get powers of two
  • Mean = subject to skew
  • Median of means of groups works well in practice
CONFIDENTIAL | 15
Example 3: Counting unique users
Flajolet-Martin, all together:
arr = int[k]
for each item e:
  for i in 0...k-1:
    z = trailing_zeros(hash_i(e))
    if z > arr[i]:
      arr[i] = z

means = group_means(arr)
median = median(means)
return pow(2, median)
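A runnable Python sketch of the whole pipeline. Salting one MD5 digest fakes k "independent" hash functions; that shortcut is my assumption for illustration, not the paper's construction:

```python
import hashlib

def trailing_zeros(x):
    if x == 0:
        return 32                      # cap: hashes here are 32 bits
    return (x & -x).bit_length() - 1

def hash_i(item, i):
    # Salted MD5 standing in for k independent hash functions
    digest = hashlib.md5(("%d:%s" % (i, item)).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def fm_estimate(items, k=32, group=8):
    arr = [0] * k                      # max trailing zeros per hash fn
    for e in items:
        for i in range(k):
            z = trailing_zeros(hash_i(e, i))
            if z > arr[i]:
                arr[i] = z
    # Median of group means, as on the slide
    means = sorted(sum(arr[g:g + group]) / group
                   for g in range(0, k, group))
    m = len(means)
    median = means[m // 2] if m % 2 else (means[m // 2 - 1] + means[m // 2]) / 2
    return 2 ** median
```

Memory is k small integers regardless of how many distinct users flow past, which is the whole point.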
CONFIDENTIAL | 16
Example 3: Counting unique users
Flajolet-Martin in practice
• Devil is in the details
• Tunable precision:
  • More hash functions = more precise
  • See the paper for bounds on precision
• Tunable latency:
  • More hash functions = higher latency
  • Faster hash functions = lower latency
  • Faster hash functions = more possibility of correlation = less precision
Remember: streaming algorithm for quick, imprecise answer. Back-end batch algorithm for slower, exact answer
CONFIDENTIAL | 17
Example 4: Counting Individual Item Frequencies
Want to keep track of how many times each item has appeared in the stream
Many applications:
• How popular is each search term?
• How many times has this hashtag been tweeted?
• Which IP addresses are DDoS’ing me?
Again, two obvious approaches:
• In-memory hashmap of item → count
• Database
But can we be more clever?
CONFIDENTIAL | 18
Example 4: Counting Individual Item Frequencies
Idea:
• Maintain array of counts
• Hash each item, increment array at that index
• To check the count of an item, hash again and check array at that index
• Over-estimates because of hash “collisions”
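A toy version of this single-array idea, with a deliberately tiny array so collisions are visible; Python's built-in hash is just a stand-in for a real hash function:

```python
W = 16                        # deliberately tiny so collisions happen

counts = [0] * W

def h(item):
    return hash(item) % W     # built-in hash, purely illustrative

stream = ["a", "b", "a", "c", "a", "b"]
for item in stream:
    counts[h(item)] += 1

# counts[h("a")] is at least 3 (the true count); any item that
# hashes to the same slot inflates it, so estimates only over-count
```

The one-sided error is the key property: a cell can only be too big, never too small, and Count-Min exploits exactly that.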
CONFIDENTIAL | 19
Example 4: Counting Individual Item Frequencies
Count-Min Sketch algorithm:
• Maintain 2-d array of size w x d
• Choose d different hash functions; each row in the array corresponds to one hash function
• Hash each item with every hash function, increment the appropriate position in each row
• To query an item, hash it d times again, take the minimum value across all rows
CONFIDENTIAL | 20
Example 4: Counting Individual Item Frequencies
Count-Min Sketch, all together:

arr = int[d][w]
for each item e:
  for i in 0...d-1:
    j = hash_i(e) mod w
    arr[i][j]++

def frequency(q):
  min = +infinity
  for i in 0...d-1:
    j = hash_i(q) mod w
    if arr[i][j] < min:
      min = arr[i][j]
  return min
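The same sketch as a small Python class. As before, salted MD5 stands in for d independent hash functions, and w and d are arbitrary illustrative defaults:

```python
import hashlib

class CountMinSketch:
    def __init__(self, w=256, d=4):
        self.w, self.d = w, d
        self.arr = [[0] * w for _ in range(d)]

    def _hash(self, item, i):
        # Salted MD5 standing in for d independent hash functions
        digest = hashlib.md5(("%d:%s" % (i, item)).encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.w

    def add(self, item):
        for i in range(self.d):
            self.arr[i][self._hash(item, i)] += 1

    def frequency(self, item):
        # Minimum across rows: collisions only ever inflate a cell,
        # so the smallest cell is the tightest over-estimate
        return min(self.arr[i][self._hash(item, i)] for i in range(self.d))
```

Total memory is w × d counters no matter how many distinct items appear, and frequency() never under-counts.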
CONFIDENTIAL | 21
Example 4: Counting Individual Item Frequencies
Count-Min Sketch in practice
• Devil is in the details
• Tunable precision:
  • Bigger array = more precise
  • See the paper for bounds on precision
• Tunable latency:
  • More hash functions = higher latency
• Better at estimating more frequent items
• Can subtract out an estimate of collisions
Remember: streaming algorithm for quick, imprecise answer. Back-end batch algorithm for slower, exact answer
Questions?
• Feel free to reach out:
  • www.thinkbiganalytics.com
  • [email protected]
  • www.slideshare.net/jfkelley1
• References:
  • http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
  • http://infolab.stanford.edu/~ullman/mmds.html
We’re hiring! Engineers and Data Scientists