Estimating Entropy for Data Streams
Khanh Do Ba, Dartmouth College
Advisor: S. Muthu Muthukrishnan
Review of Data Streams
Motivation: a huge data stream that needs to be mined for information “efficiently.”
Applications: monitoring IP traffic, mining email and text message streams, etc.
The Mathematical Model
Sequence of integers A = a_1, …, a_m, where each a_i ∈ N = {1, …, n}.
For each v ∈ N, the frequency m_v of v is the number of occurrences of v in A.
Statistics to be estimated are functions on A, but usually just on the m_v’s (e.g. frequency moments).
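For concreteness, a minimal sketch of this model in Python (the stream values are illustrative):

```python
from collections import Counter

# Stream A = a_1, ..., a_m with each a_i in N = {1, ..., n}
A = [3, 1, 3, 2, 3, 1]   # m = 6, n = 3
freq = Counter(A)        # freq[v] = m_v, the # of occurrences of v in A
assert freq[3] == 3 and freq[1] == 2 and freq[2] == 1
```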
What is Entropy?
In physics: measure of disorder in a system.
In math: measure of randomness (or uniformity) of a probability distribution.
Formula:
H = −Σ_{v∈N} Pr[v] log Pr[v]
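For example, the uniform distribution over n values has H = log n (the maximum), while a point mass on a single value has H = 0: entropy grows with uniformity.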
Entropy on Data Streams
For big m, m_v/m → Pr[v]. So the formula becomes:
H = −Σ_{v∈N} (m_v/m) log (m_v/m)
Suffices to compute m (easy) and
μ := Σ_{v∈N} m_v log m_v,
since H = log m − μ/m.
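A quick sanity check of this identity in Python (a minimal sketch; the stream and function names are illustrative, and base-2 logs are fixed throughout):

```python
import math
from collections import Counter

def empirical_entropy(stream):
    """H = -sum over v of (m_v/m) * log2(m_v/m), from exact counts."""
    m = len(stream)
    return -sum((mv / m) * math.log2(mv / m)
                for mv in Counter(stream).values())

def mu(stream):
    """mu = sum over v of m_v * log2(m_v)."""
    return sum(mv * math.log2(mv) for mv in Counter(stream).values())

A = [1, 2, 1, 3, 2, 1, 1, 4]
m = len(A)
# Verify H = log m - mu/m
assert abs(empirical_entropy(A) - (math.log2(m) - mu(A) / m)) < 1e-9
```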
The Goal
Approximation algorithm to estimate μ.
Approximate means to output a number Y such that Pr[|Y − μ| ≥ λμ] ≤ ε, for any user-specified λ, ε > 0.
Restrictions: o(n) space (preferably Õ(1)), and only one pass over the data.
The Algorithm
We want Y to have E[Y] = μ and very small variance, so we find a computable random variable X with E[X] = μ and small variance, and compute it several times.
Y is the median of s_2 RVs Y_i, each of which is the mean of s_1 RVs X_ij = X (independently, identically computed).
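A minimal median-of-means combiner in Python (a sketch: draw_x is an illustrative stand-in for one independently computed copy of X; in the streaming setting all s_1 · s_2 copies are maintained in parallel over the same pass):

```python
import statistics

def median_of_means(draw_x, s1, s2):
    """Median of s2 group means, each averaging s1 independent copies of X.

    Averaging s1 copies shrinks the variance by a factor of s1 (Chebyshev);
    taking the median across s2 groups then drives the failure probability
    down exponentially in s2 (Chernoff).
    """
    group_means = [statistics.mean(draw_x() for _ in range(s1))
                   for _ in range(s2)]
    return statistics.median(group_means)
```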
Computing X
Choose p ∈ {1, …, m} uniformly at random.
Let r = #{q ≥ p | a_q = a_p} (≥ 1).
X = m[r log r − (r − 1) log (r − 1)] (with the convention 0 log 0 = 0).
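One way to compute a copy of X in a single pass is reservoir sampling: keep a uniformly random position p of the prefix seen so far and count occurrences of a_p from p onward. A minimal Python sketch (the slides specify X itself, not this particular maintenance strategy, so treat the class as illustrative):

```python
import math
import random

class BasicEstimator:
    """Maintains one copy of X over a stream in O(log n + log m) bits."""

    def __init__(self):
        self.m = 0         # number of items seen so far
        self.value = None  # a_p for the currently sampled position p
        self.r = 0         # occurrences of a_p at positions q >= p

    def update(self, item):
        self.m += 1
        if random.randrange(self.m) == 0:  # keep new position w.p. 1/m,
            self.value = item              # so p stays uniform over 1..m
            self.r = 1
        elif item == self.value:
            self.r += 1

    def estimate(self):
        # X = m * (r log r - (r - 1) log(r - 1)), with 0 log 0 = 0
        f = lambda k: k * math.log2(k) if k > 0 else 0.0
        return self.m * (f(self.r) - f(self.r - 1))
```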
The Analysis
Easy: E[Y] = E[X] = μ (for each v, the terms r log r − (r − 1) log (r − 1) telescope over its m_v occurrences to m_v log m_v). Hard: showing Var[Y] is small enough. It turns out s_1 = O(log n), s_2 = O(1) works.
Each X is maintained in O(log n + log m) space. Total: O(s_1 s_2 (log n + log m)) = O(log n log m).
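Putting the pieces together with the illustrative classes above, Y is computed from s_1 · s_2 parallel copies, each updated on every stream item:

```python
import statistics

s1, s2 = 32, 5  # illustrative values; the analysis sets s1 = O(log n), s2 = O(1)
copies = [[BasicEstimator() for _ in range(s1)] for _ in range(s2)]
for item in A:
    for group in copies:
        for est in group:
            est.update(item)

Y = statistics.median(
    statistics.mean(est.estimate() for est in group) for group in copies
)
```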
Future Directions
Extension to insert/delete streams.
Applications in: DBMSs where massive secondary storage cannot be scanned quickly enough to answer real-time queries; monitoring open flows through internet routers.
Lower-bound proof showing the algorithm is optimal, or an improved algorithm.