Estimating Entropy for Data Streams
Khanh Do Ba, Dartmouth College
Advisor: S. Muthu Muthukrishnan
Review of Data Streams
Motivation: a huge data stream that needs to be mined for information “efficiently.”
Applications: monitoring IP traffic, mining email and text message streams, etc.
The Mathematical Model
Sequence of integers A = a_1, …, a_m, where each a_i ∈ N = {1, …, n}.
For each v ∈ N, the frequency m_v of v is the number of occurrences of v in A.
Statistics to be estimated are functions on A, but usually just on the m_v’s (e.g. frequency moments).
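For concreteness, a minimal sketch of this model in Python (the stream values are illustrative):

```python
from collections import Counter

# Stream A = a_1, ..., a_m with each a_i in N = {1, ..., n}
A = [3, 1, 3, 2, 3, 1]   # m = 6, n = 3
freq = Counter(A)        # freq[v] = m_v, the # of occurrences of v in A
assert freq[3] == 3 and freq[1] == 2 and freq[2] == 1
```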
What is Entropy?
In physics: measure of disorder in a system.
In math: measure of randomness (or uniformity) of a probability distribution.
Formula:
H = −Σ_{v∈N} Pr[v] log Pr[v]
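For example, the uniform distribution over n values has H = log n (the maximum), while a point mass on a single value has H = 0: entropy grows with uniformity.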
Entropy on Data Streams
For big m, m_v/m → Pr[v]. So the formula becomes:
H = −Σ_{v∈N} (m_v/m) log (m_v/m)
Suffices to compute m (easy) and
μ := Σ_{v∈N} m_v log m_v,
since H = log m − μ/m.
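A quick sanity check of this identity in Python (a minimal sketch; the stream and function names are illustrative, and base-2 logs are fixed throughout):

```python
import math
from collections import Counter

def empirical_entropy(stream):
    """H = -sum over v of (m_v/m) * log2(m_v/m), from exact counts."""
    m = len(stream)
    return -sum((mv / m) * math.log2(mv / m)
                for mv in Counter(stream).values())

def mu(stream):
    """mu = sum over v of m_v * log2(m_v)."""
    return sum(mv * math.log2(mv) for mv in Counter(stream).values())

A = [1, 2, 1, 3, 2, 1, 1, 4]
m = len(A)
# Verify H = log m - mu/m
assert abs(empirical_entropy(A) - (math.log2(m) - mu(A) / m)) < 1e-9
```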
The Goal
Approximation algorithm to estimate μ.
Approximate means to output a number Y such that Pr[|Y − μ| ≥ λμ] ≤ ε, for any user-specified λ, ε > 0.
Restrictions: o(n) space (preferably Õ(1)), and only one pass over the data.
The Algorithm
We want Y to have E[Y] = μ and very small variance, so we find a computable random variable X with E[X] = μ and small variance, and compute it several times.
Y is the median of s_2 RVs Y_i, each of which is the mean of s_1 RVs X_ij = X (independently, identically computed).
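A minimal median-of-means combiner in Python (a sketch: draw_x is an illustrative stand-in for one independently computed copy of X; in the streaming setting all s_1 · s_2 copies are maintained in parallel over the same pass):

```python
import statistics

def median_of_means(draw_x, s1, s2):
    """Median of s2 group means, each averaging s1 independent copies of X.

    Averaging s1 copies shrinks the variance by a factor of s1 (Chebyshev);
    taking the median across s2 groups then drives the failure probability
    down exponentially in s2 (Chernoff).
    """
    group_means = [statistics.mean(draw_x() for _ in range(s1))
                   for _ in range(s2)]
    return statistics.median(group_means)
```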
Computing X
Choose p ∈ {1, …, m} uniformly at random.
Let r = #{q ≥ p | a_q = a_p} (≥ 1).
X = m[r log r − (r − 1) log (r − 1)] (with the convention 0 log 0 = 0).
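One way to compute a copy of X in a single pass is reservoir sampling: keep a uniformly random position p of the prefix seen so far and count occurrences of a_p from p onward. A minimal Python sketch (the slides specify X itself, not this particular maintenance strategy, so treat the class as illustrative):

```python
import math
import random

class BasicEstimator:
    """Maintains one copy of X over a stream in O(log n + log m) bits."""

    def __init__(self):
        self.m = 0         # number of items seen so far
        self.value = None  # a_p for the currently sampled position p
        self.r = 0         # occurrences of a_p at positions q >= p

    def update(self, item):
        self.m += 1
        if random.randrange(self.m) == 0:  # keep new position w.p. 1/m,
            self.value = item              # so p stays uniform over 1..m
            self.r = 1
        elif item == self.value:
            self.r += 1

    def estimate(self):
        # X = m * (r log r - (r - 1) log(r - 1)), with 0 log 0 = 0
        f = lambda k: k * math.log2(k) if k > 0 else 0.0
        return self.m * (f(self.r) - f(self.r - 1))
```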
The Analysis
Easy: E[Y] = E[X] = μ (for each v, the terms r log r − (r − 1) log (r − 1) telescope over its m_v occurrences to m_v log m_v). Hard: showing Var[Y] is small enough. It turns out s_1 = O(log n), s_2 = O(1) works.
Each X is maintained in O(log n + log m) space. Total: O(s_1 s_2 (log n + log m)) = O(log n log m).
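Putting the pieces together with the illustrative classes above, Y is computed from s_1 · s_2 parallel copies, each updated on every stream item:

```python
import statistics

s1, s2 = 32, 5  # illustrative values; the analysis sets s1 = O(log n), s2 = O(1)
copies = [[BasicEstimator() for _ in range(s1)] for _ in range(s2)]
for item in A:
    for group in copies:
        for est in group:
            est.update(item)

Y = statistics.median(
    statistics.mean(est.estimate() for est in group) for group in copies
)
```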
Future Directions
Extension to insert/delete streams.
Applications in: DBMSs where massive secondary storage cannot be scanned quickly enough to answer real-time queries; monitoring open flows through internet routers.
Lower-bound proof showing the algorithm is optimal, or an improved algorithm.