Finding Frequent Items in Data Streams [Charikar-Chen-Farach-Colton]
Paper report by MH, 2004/12/17
Today
  Synopsis Data Structures
  Sketches and Frequency Moments
  Finding Frequent Items in Data Streams
Synopsis Data Structures
Synopsis data structure: a “lossy” summary of a data stream
  Advantages: fits in memory, easy to communicate
  Disadvantage: lossiness implies approximation error
  Key techniques: randomization and hashing
Random Samples
  Goal: maintain a uniform sample of the item stream
  Sampling semantics?
    Coin flip: select each item with probability p
      easy to maintain
      undesirable: sample size is unbounded
    Fixed-size sample without replacement: our focus today
    Fixed-size sample with replacement: show that it can be generated from the previous sample
    Non-uniform samples [Chaudhuri-Motwani-Narasayya]
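The slides do not say how the fixed-size sample without replacement is maintained. One standard way to do it is reservoir sampling; the sketch below is offered only as an illustration under that assumption, and the names `reservoir_sample`, `k`, and `rng` are hypothetical.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform sample of k items, without replacement, in one pass.

    Assumption: this is classic reservoir sampling, which the slides
    do not name explicitly.
    """
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir.
            sample.append(item)
        else:
            # Item n replaces a random slot with probability k/n.
            j = rng.randrange(n)  # uniform in 0..n-1
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1000), 10, random.Random(0)))
```

Unlike the coin-flip scheme, the sample size here never exceeds k, whatever the stream length.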
Generalized Stream Model
Input element (i, a)
  a copies of domain value i
  increment the i-th dimension of m by a
  a need not be an integer
Data stream: 2, 0, 1, 3, 1, 2, 4, ...
Frequency vector: (m_0, m_1, m_2, m_3, m_4) = (1, 2, 2, 1, 1)
Example
Start: (m_0, m_1, m_2, m_3, m_4) = (1, 2, 2, 1, 1)
On seeing element (i, a) = (2, 2): (m_0, ..., m_4) = (1, 2, 4, 1, 1)
On seeing element (i, a) = (1, -1): (m_0, ..., m_4) = (1, 1, 4, 1, 1)
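The update rule of the generalized stream model can be replayed in a few lines. This is a minimal sketch that reproduces the example stream and the two updates (2, 2) and (1, -1); the helper name `apply_update` is hypothetical.

```python
def apply_update(m, i, a):
    """Apply stream element (i, a): add a to the i-th coordinate of m."""
    m[i] += a
    return m

m = [0] * 5
# A plain stream of values is the special case a = 1.
for i in [2, 0, 1, 3, 1, 2, 4]:
    apply_update(m, i, 1)
print(m)  # [1, 2, 2, 1, 1]

apply_update(m, 2, 2)
print(m)  # [1, 2, 4, 1, 1]

apply_update(m, 1, -1)
print(m)  # [1, 1, 4, 1, 1]
```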
Frequency Moments
Input stream: values from U = {0, 1, ..., N-1}, frequency vector m = (m_0, m_1, ..., m_{N-1})
k-th frequency moment: F_k(m) = Σ_i m_i^k
  F_0: number of distinct values
  F_1: stream size
  F_2: Gini index, self-join size, (squared) Euclidean norm
  F_k for k > 2: measures skew, sometimes useful
  F_∞: maximum frequency
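As a quick check of the definitions, the moments can be computed exactly from a frequency vector when it fits in memory. The sketch below uses a small illustrative vector; the function names are hypothetical, and F_0 is computed under the usual convention that zero entries do not count.

```python
def frequency_moment(m, k):
    """F_k(m) = sum_i m_i^k, with F_0 counting the nonzero entries."""
    if k == 0:
        return sum(1 for x in m if x != 0)
    return sum(x ** k for x in m)

def f_infinity(m):
    """F_infinity(m): the maximum frequency."""
    return max(m)

m = [1, 1, 4, 1, 1]
print(frequency_moment(m, 0))  # 5 distinct values
print(frequency_moment(m, 1))  # 8 = stream size
print(frequency_moment(m, 2))  # 20 = self-join size
print(f_infinity(m))           # 4
```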
Finding Frequent Items in Data Streams
Outline
  Introduction
  Main Idea
  COUNT SKETCH
  Algorithm
  Final result
(This work was done while the author was at Google Inc.)
The Google Problem
  Problem: return a list of the k most frequent items in a stream
  Motivation: search engine queries, network traffic, ...
  Remember: we saw a lower bound recently!
  Solution: the Count-Sketch data structure, which maintains count estimates of high-frequency elements
Introduction (1)
One of the most basic problems on a data stream [HRR98, AMS99] is that of finding the most frequently occurring items in the stream.
We shall assume here that the stream is large enough that memory-intensive solutions, such as sorting the stream or keeping a counter for each distinct element, are infeasible.
This problem comes up in the context of search engines, where the streams in question are streams of queries sent to the search engine and we are interested in finding the most frequent queries handled in some period of time.
Introduction (2)
A wide variety of heuristics for this problem have been proposed, all involving some combination of sampling, hashing, and counting (see [GM99] and Section 2 for a survey).
However, none of these solutions have clean bounds on the amount of space necessary to produce good approximate lists of the most frequent items.
Definitions
Notation
  Assume the elements {1, 2, ..., N} are in order of frequency
  m_i is the frequency of the i-th most frequent element
  m = Σ_i m_i is the number of elements in the stream
Two notions of approximating the frequent-elements problem
  FindCandidateTop
    Input: stream S, int k, int p
    Output: list of p elements containing the top k
  FindApproxTop
    Input: stream S, int k, real ε
    Output: list of k elements, each of frequency m_i > (1-ε) m_k
FindCandidateTop
This problem can be very hard to solve. Suppose, for example, that m_k = m_{p+1} + 1, that is, the k-th most frequent element has almost the same frequency as the (p+1)-st most frequent element. Then it would be almost impossible to find only p elements that are likely to contain the top k elements.
We therefore define the following variant:
FindApproxTop
Main Idea
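Consider a single counter X and a hash function h: {1, 2, ..., N} → {-1, +1}
Input element i: update counter X += Z_i, where Z_i = h(i)
For each r, use X·Z_r as an estimator of m_r
Theorem: E[X·Z_r] = m_r
Proof
  X = Σ_i m_i Z_i
  E[X·Z_r] = E[Σ_i m_i Z_i Z_r] = Σ_i m_i E[Z_i Z_r] = m_r E[Z_r^2] = m_r
  since E[Z_i Z_r] = 0 for i ≠ r (the hash values are pairwise independent) and Z_r^2 = 1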
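The theorem can be checked exactly on a tiny example by averaging the estimator over every possible sign assignment. This is only a sketch: it enumerates fully independent signs, which is a stronger assumption than the pairwise independence the proof needs, and the function name `estimator_mean_and_variance` is hypothetical.

```python
from itertools import product

def estimator_mean_and_variance(m, r):
    """Exact mean and variance of X * Z_r over all sign vectors Z."""
    n = len(m)
    first = second = 0
    for Z in product((-1, 1), repeat=n):
        X = sum(m_i * z_i for m_i, z_i in zip(m, Z))  # the single counter
        est = X * Z[r]                                # estimate of m_r
        first += est
        second += est * est
    mean = first / 2 ** n
    return mean, second / 2 ** n - mean ** 2

m = [1, 1, 4, 1, 1]
print(estimator_mean_and_variance(m, 0))  # (1.0, 19.0)
print(estimator_mean_and_variance(m, 2))  # (4.0, 4.0)
```

The mean equals m_r in both cases, but for the low-frequency element the variance (19) dwarfs the value being estimated (1), because it is driven by the squared frequencies of all the other elements.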
A couple of problems
The variance of every estimate is very large
O(N) elements have estimates that are wrong by more than the variance.
Array of Counters
Idea: t counters c_1, ..., c_t and t hash functions h_1, ..., h_t
We can then take the mean or median of these estimates to achieve an estimate with lower variance.
Problem with “Array of Counters”
Variance: dominated by the highest frequency
Estimates for less-frequent elements like k are corrupted by higher frequencies
Avoiding collisions? Spread out the high-frequency elements: replace each counter with a hashtable of b counters
Count Sketch data structure
Hash functions
  independent hashes h_1, ..., h_t and s_1, ..., s_t
  hashes independent of each other
Data structure: t hashtables of b counters each, X(r, c), with columns c = 1, 2, ..., b
  s_1: i → {1, ..., b}    h_1: i → {+1, -1}
  ...
  s_t: i → {1, ..., b}    h_t: i → {+1, -1}
Configuration and operations
  s_r(i): one of the b counters in the r-th hashtable
  ADD(i): for each r, update X(r, s_r(i)) += h_r(i)
  Estimator(m_i) = median_r { X(r, s_r(i)) · h_r(i) }
  Maintain a heap of the top k elements seen so far
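The ADD and Estimator operations above might be sketched as follows. This is not the authors' implementation: the class name `CountSketch` is hypothetical, and a salted BLAKE2b digest is used as a convenient stand-in for the pairwise-independent hash families s_r (bucket) and h_r (sign) that the analysis assumes.

```python
import hashlib
from statistics import median

class CountSketch:
    def __init__(self, t, b):
        self.t, self.b = t, b
        self.X = [[0] * b for _ in range(t)]  # t hashtables of b counters

    def _bucket_and_sign(self, r, item):
        # Assumption: one salted digest per row supplies both s_r and h_r.
        d = hashlib.blake2b(repr(item).encode(), digest_size=8,
                            salt=r.to_bytes(8, "little")).digest()
        v = int.from_bytes(d, "little")
        return (v >> 1) % self.b, (1 if v & 1 else -1)

    def add(self, item, a=1):
        """ADD(i): X(r, s_r(i)) += a * h_r(i) for every row r."""
        for r in range(self.t):
            c, sign = self._bucket_and_sign(r, item)
            self.X[r][c] += a * sign

    def estimate(self, item):
        """Estimator(m_i): median over rows of X(r, s_r(i)) * h_r(i)."""
        rows = []
        for r in range(self.t):
            c, sign = self._bucket_and_sign(r, item)
            rows.append(self.X[r][c] * sign)
        return median(rows)

cs = CountSketch(t=5, b=64)
for _ in range(1000):
    cs.add("heavy")
for j in range(20):
    cs.add(("light", j))
print(cs.estimate("heavy"))  # close to 1000; off by at most 20 here
```

In this toy run the error for "heavy" cannot exceed the total count of the other elements (20), since only colliding elements can perturb a counter.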
Why we choose the median
We have not eliminated the problem of collisions with high-frequency elements, and these will still spoil some subset of the estimates. The mean is very sensitive to outliers, while the median is sufficiently robust.
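A small numeric illustration of this point follows. The five per-row estimates are made-up values, chosen so that one row has collided with a high-frequency element.

```python
from statistics import mean, median

# Hypothetical per-row estimates of an element whose true count is 10.
# The last row collided with a high-frequency element.
row_estimates = [10, 11, 9, 10, 250]

print(mean(row_estimates))    # 58: dragged far off by the single outlier
print(median(row_estimates))  # 10: unaffected by it
```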
Overall Algorithm
1. Add(i)
2. If i is in the heap, increment its count. Else, add i to the heap if Estimate(m_i) is greater than the smallest estimated count in the heap. In this case, the smallest estimated count should be evicted from the heap.
This algorithm solves FindApproxTop, where our choice of b will depend on ε.
Sketches can be added and subtracted. The algorithm takes space O(tb + k). It remains to bound t and b.
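The two steps above can be sketched end to end as follows. Several choices here are assumptions rather than part of the slides: a dict scanned for its minimum stands in for the heap, the first k distinct items are admitted unconditionally, a salted BLAKE2b digest replaces the pairwise-independent hashes, and the names `CountSketch` and `top_k` are hypothetical.

```python
import hashlib
from statistics import median

class CountSketch:
    def __init__(self, t, b):
        self.t, self.b = t, b
        self.X = [[0] * b for _ in range(t)]

    def _bucket_and_sign(self, r, item):
        d = hashlib.blake2b(repr(item).encode(), digest_size=8,
                            salt=r.to_bytes(8, "little")).digest()
        v = int.from_bytes(d, "little")
        return (v >> 1) % self.b, (1 if v & 1 else -1)

    def add(self, item):
        for r in range(self.t):
            c, sign = self._bucket_and_sign(r, item)
            self.X[r][c] += sign

    def estimate(self, item):
        rows = []
        for r in range(self.t):
            c, sign = self._bucket_and_sign(r, item)
            rows.append(self.X[r][c] * sign)
        return median(rows)

def top_k(stream, k, t=5, b=256):
    sketch = CountSketch(t, b)
    top = {}  # item -> estimated count; plays the role of the heap
    for item in stream:
        sketch.add(item)                      # step 1: Add(i)
        if item in top:
            top[item] += 1                    # step 2: already tracked
            continue
        est = sketch.estimate(item)
        if len(top) < k:
            top[item] = est                   # assumption: fill up to k first
        else:
            smallest = min(top, key=top.get)
            if est > top[smallest]:
                del top[smallest]             # evict the smallest estimate
                top[item] = est
    return sorted(top.items(), key=lambda kv: -kv[1])

queries = ["news"] * 50 + ["weather"] * 30 + [f"rare-{j}" for j in range(40)]
print(top_k(queries, k=2))
```

The space used is the t·b counters of the sketch plus the k tracked items, matching the O(tb + k) figure.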
Final Results (1)
Bounds on t and b
  t = O(log(m/δ)), where the algorithm fails with probability at most δ
  b = O(k + Σ_{i>k} m_i^2 / (ε m_k)^2)
(5 lemmas and 1 theorem are listed at the end)
So…..
Final Results (2)
FindApproxTop: space O([k + (Σ_{i>k} m_i^2) / (ε m_k)^2] log(m/δ))
Zipfian distribution: m_i ∝ 1/i
  gives improved results compared with the Sampling algorithm.
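To get a feel for the bound, the two factors can be evaluated numerically for a given frequency vector. This is a rough sketch that drops the constants hidden by the O-notation; the function names and the Zipfian example parameters are hypothetical.

```python
import math

def rows_term(stream_len, delta):
    """log(m / delta), the factor that governs t (constants omitted)."""
    return math.log(stream_len / delta)

def buckets_term(freqs, k, eps):
    """k + sum_{i>k} m_i^2 / (eps * m_k)^2, the factor that governs b."""
    m = sorted(freqs, reverse=True)
    m_k = m[k - 1]                       # frequency of the k-th element
    tail = sum(x * x for x in m[k:])     # elements ranked below k
    return k + tail / (eps * m_k) ** 2

# Illustrative Zipfian frequencies m_i proportional to 1/i.
zipf = [1000 / i for i in range(1, 1001)]
print(buckets_term(zipf, k=10, eps=0.1))
print(rows_term(sum(zipf), delta=0.05))
```

The bucket term depends only on the tail beyond rank k, which is why skewed distributions need relatively few counters.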
Finding items with largest frequency change
This problem also has a practical motivation in the context of search engine query streams, since the queries whose frequency changes most between two consecutive time periods can indicate which topics people are currently most interested in [Goo].
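The slides do not spell out the frequency-change algorithm, so the sketch below only illustrates the property it relies on: because updates are linear, one table can absorb the first period with weight -1 and the second with weight +1, and its estimates then track the change in frequency. The parameters `T` and `B`, the helper names, and the use of a salted BLAKE2b digest as the hash are all assumptions.

```python
import hashlib
from statistics import median

T, B = 5, 64  # hypothetical number of rows and buckets per row

def bucket_and_sign(r, item):
    d = hashlib.blake2b(repr(item).encode(), digest_size=8,
                        salt=r.to_bytes(8, "little")).digest()
    v = int.from_bytes(d, "little")
    return (v >> 1) % B, (1 if v & 1 else -1)

def update(X, item, a):
    """Add a copies of item to the table (a may be negative)."""
    for r in range(T):
        c, sign = bucket_and_sign(r, item)
        X[r][c] += a * sign

def estimate(X, item):
    rows = []
    for r in range(T):
        c, sign = bucket_and_sign(r, item)
        rows.append(X[r][c] * sign)
    return median(rows)

period_1 = ["news", "weather", "news"]
period_2 = ["news", "weather", "news", "news", "news"]

X = [[0] * B for _ in range(T)]
for q in period_1:
    update(X, q, -1)   # subtract the earlier period
for q in period_2:
    update(X, q, +1)   # add the later period

print(estimate(X, "news"))  # 2: "news" gained two occurrences
```

The contributions of "weather" cancel exactly, so only the query whose frequency changed remains in the table.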
5 Lemmas and 1 Theorem (1)
Let n_q(l) be the number of occurrences of element q up to position l.
Let A_i[q] be the set of elements that hash onto the same bucket in the i-th row as q does (on this slide h_i denotes the bucket hash):
  A_i[q] = { q' : q' ≠ q, h_i[q'] = h_i[q] }