Algorithms for data streams Lecture 2

Foundations of Data Science, 2014

Indian Institute of Science, Navin Goyal

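Estimating $F_2$ using the AMS sketch
• Given a turnstile stream with frequency vector $f = (f_1, \dots, f_n)$, estimate $F_2 = \sum_{i=1}^{n} f_i^2$ within multiplicative error $1 \pm \epsilon$ with probability at least $1 - \delta$
• Obvious solution takes space linear in $n$ (maintain the frequency vector). Can’t do better deterministically
• Randomized algorithm [Alon-Matias-Szegedy ’96]:
• Sample a random vector $x \in \{-1, 1\}^n$ with each coordinate chosen uniformly at random from $\{-1, 1\}$ independently
• $E\big[(\sum_i f_i x_i)^2\big] = \sum_i f_i^2 = F_2$
• So if we could compute $\sum_i f_i x_i$ then we could estimate $F_2$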

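Basic AMS algorithm for $F_2$
• Given a turnstile stream, estimate $F_2$ within multiplicative error $1 \pm \epsilon$ with probability at least $1 - \delta$

Basic AMS estimator:
• Choose a random vector $x \in \{-1, 1\}^n$
• Initialize $z \leftarrow 0$
• Until the end of the stream do
– On arrival of element $(i, c)$ (add $c$ to the frequency $f_i$): $z \leftarrow z + c\,x_i$
• At the end of the stream $z = \sum_i f_i x_i$
• $z^2$ is an estimator of $F_2$
• Problem: requires $\Omega(n)$ space to store $x$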
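The basic estimator can be sketched in a few lines of Python. This is an illustrative sketch only, assuming stream elements are pairs `(i, c)` with items numbered from 0; the function names `basic_ams_estimate` and `exact_f2` are hypothetical and not from the lecture. It stores the full sign vector, which is exactly the space problem noted on the slide.

```python
import random

def basic_ams_estimate(stream, n, rng):
    """Run the basic AMS estimator once and return (z, z*z)."""
    # Fully independent signs: this is the n-entry vector x that costs Omega(n) space.
    x = [rng.choice((-1, 1)) for _ in range(n)]
    z = 0
    for i, c in stream:          # turnstile update: f_i += c
        z += c * x[i]
    return z, z * z

def exact_f2(stream, n):
    """Reference answer computed from the full frequency vector."""
    f = [0] * n
    for i, c in stream:
        f[i] += c
    return sum(v * v for v in f)

stream = [(0, 3), (1, 2), (0, -1), (2, 5), (1, -2)]   # final f = (2, 0, 5)
rng = random.Random(0)
estimates = [basic_ams_estimate(stream, 3, rng)[1] for _ in range(2000)]
print(exact_f2(stream, 3), sum(estimates) / len(estimates))
```

For this stream each run returns either 9 or 49, since $z = \pm 2 \pm 5$; only the average over many runs is close to the true value 29.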

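$z^2$ is a reasonable estimator of $F_2$
• For $z = \sum_i f_i x_i$: $E[z^2] = F_2$ and $\mathrm{Var}[z^2] \le 2F_2^2$ (proof on the board; also in the book)
• Application of Chebyshev: $\Pr\big[|z^2 - F_2| \ge \epsilon F_2\big] \le \mathrm{Var}[z^2] / (\epsilon^2 F_2^2) \le 2/\epsilon^2$
• Can improve by the median of the means estimator: average $O(1/\epsilon^2)$ independent copies of $z^2$ to get a mean $Y_1$, and repeat to get $O(\log(1/\delta))$ independent means $Y_1, Y_2, \dots$
• Output the median of the $Y_j$
• This gives an $(\epsilon, \delta)$-approximation of $F_2$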
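The median of the means step is independent of where the samples come from, so it can be illustrated on a plain list of numbers. The helper name `median_of_means` and the sample values below are made up for illustration; in the AMS setting each sample would be one independent copy of $z^2$.

```python
from statistics import mean, median

def median_of_means(samples, group_size):
    """Split samples into consecutive full groups, average each group,
    and return the median of the group averages."""
    groups = [samples[j:j + group_size]
              for j in range(0, len(samples) - group_size + 1, group_size)]
    if not groups:
        raise ValueError("need at least group_size samples")
    return median(mean(g) for g in groups)

# Three groups of three; the last group contains one wild outlier.
samples = [10, 50, 30, 50, 50, 50, 10, 10, 1000]
print(median_of_means(samples, 3))   # group means are 30, 50, 340
```

Averaging within a group reduces the variance; taking the median across groups keeps one bad group from dominating the output, which a plain average over all samples would not do.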

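The AMS sketch
• How much space does the basic AMS sketch take (without the median of the means trick)?
• The counter $z = \sum_i f_i x_i$ satisfies $|z| = O(m)$ for a stream of length $m$ (assuming the updates are bounded by a constant)
• So $O(\log m)$ space is sufficient?
• No!
• We also need to remember the random vector $x \in \{-1, 1\}^n$
• And this requires $n$ bits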

• What essential property of the random vector did we use?

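• For $E[z^2] = F_2$, we used $E[x_i x_j] = 0$ for all $i \ne j$
• For the bound on $\mathrm{Var}[z^2]$, we used
– $E[x_i x_j x_k x_l] = 0$ for all pairwise distinct $i, j, k, l$
• This is satisfied if the $x_i$ are 4-wise independent: for any pairwise distinct $i, j, k, l$, the random variables $x_i, x_j, x_k, x_l$ are mutually independent
• For our situation, this means that for any pairwise distinct $i, j, k, l$ and any $a, b, c, d \in \{-1, 1\}$ we have $\Pr[x_i = a, x_j = b, x_k = c, x_l = d] = 1/16$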

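Constructing pairwise independent random bit vectors
• Given a uniformly random vector $r \in \{0, 1\}^k$ ($k$ bits of perfect randomness)
• We use $r$ to construct a pairwise independent random vector $y \in \{0, 1\}^{2^k - 1}$ ($2^k - 1$ bits of useful randomness)
• We index $y$ by nonempty subsets of $\{1, \dots, k\}$
• For nonempty $S \subseteq \{1, \dots, k\}$ define $y_S = \bigoplus_{i \in S} r_i$ (the sum modulo 2)

Claim For distinct and nonempty $S, T$, the bits $y_S$ and $y_T$ are independent and uniformly distributed
Proof On the board

• The $y_S$ are not 3-wise independent: for example $y_{\{1,2\}} = y_{\{1\}} \oplus y_{\{2\}}$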
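For small seed lengths the claim can be checked by brute force. The following sketch (function names are hypothetical, and subsets are indexed from 0 rather than 1) enumerates all $2^3$ seeds, which is equivalent to drawing the seed uniformly, and counts how often each joint pattern occurs.

```python
from itertools import combinations, product

def subset_xor_vector(r):
    """Map k seed bits r to the 2**k - 1 bits y_S = XOR of r_i over i in S."""
    k = len(r)
    subsets = [s for size in range(1, k + 1) for s in combinations(range(k), size)]
    return {s: sum(r[i] for i in s) % 2 for s in subsets}

k = 3
tables = [subset_xor_vector(r) for r in product((0, 1), repeat=k)]
subsets = list(tables[0])

def joint_counts(keys):
    """Count how many seeds produce each joint value of the bits indexed by keys."""
    counts = {}
    for y in tables:
        pattern = tuple(y[s] for s in keys)
        counts[pattern] = counts.get(pattern, 0) + 1
    return counts

# Pairwise independence: every one of the 4 patterns occurs for exactly 2 of the 8 seeds.
pairwise_ok = all(
    joint_counts((s, t)) == {p: 2 for p in product((0, 1), repeat=2)}
    for s, t in combinations(subsets, 2)
)
# Three dependent bits: y_{0}, y_{1}, y_{0,1} only ever take 4 of the 8 patterns.
triple = joint_counts(((0,), (1,), (0, 1)))
print(pairwise_ok, len(triple))
```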

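2-wise independent hash function families
• Very useful concept both in theory and practice
• Let $A$ and $B$ be finite sets
• A family $H$ of functions $h: A \to B$ is called 2-wise independent if for any distinct $u_1, u_2 \in A$, and any $v_1, v_2 \in B$, and for $h$ chosen uniformly at random from $H$, we have $\Pr[h(u_1) = v_1 \wedge h(u_2) = v_2] = 1/|B|^2$
(Also called 2-universal family)
• The set of all functions $A \to B$ is 2-universal
• It’s very large: $|B|^{|A|}$ functions, describing one function takes $|A| \log_2 |B|$ bits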

Pairwise independent random vectors and 2-wise independent hash functions

• We say that a random vector $y = (y_1, \dots, y_N)$ is pairwise independent if for any distinct $i, j$ the coordinates $y_i$ and $y_j$ are independent

• A random hash function $h$ from a 2-wise independent family of functions mapping $\{1, \dots, N\} \to B$ gives us a pairwise independent random vector: $y = (h(1), \dots, h(N))$ with $y_i = h(i)$

• Hash function language slightly more convenient in some situations

• A non-streaming example of the utility of 2-wise independence: MAX CUT

Constructing 2-wise independent hash function families

• There are much smaller 2-wise independent families than the family of all functions

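• Suppose the domain and range both have size $p$, a prime number
• For $a, b \in \{0, \dots, p-1\}$ define $h_{a,b}: \{0, \dots, p-1\} \to \{0, \dots, p-1\}$ by $h_{a,b}(u) = (au + b) \bmod p$
• Intuition: Determining a line in the plane requires two distinct points on the line
• This gives a family $H$ of size $p^2$
• $H$ is 2-wise independent
• Need $2\lceil \log_2 p \rceil$ bits to store a function in $H$
• Evaluation of $h_{a,b}$ is constant time on RAM (or certainly ...)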
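For a small prime the 2-wise independence of the linear family can be verified exhaustively. The sketch below uses hypothetical helper names and checks that, for every pair of distinct inputs, the $p^2$ functions hit each of the $p^2$ output pairs exactly once. Running it on a composite modulus shows why primality matters.

```python
from collections import Counter
from itertools import product

def make_hash(a, b, p):
    """h_{a,b}(u) = (a*u + b) mod p, a map from {0..p-1} to itself."""
    return lambda u: (a * u + b) % p

def is_pairwise_independent(p):
    family = [make_hash(a, b, p) for a, b in product(range(p), repeat=2)]
    for u1 in range(p):
        for u2 in range(u1 + 1, p):
            counts = Counter((h(u1), h(u2)) for h in family)
            # p^2 functions over p^2 output pairs: each pair must occur exactly once.
            if len(counts) != p * p or set(counts.values()) != {1}:
                return False
    return True

print(is_pairwise_independent(7))   # prime modulus
print(is_pairwise_independent(6))   # composite modulus, no longer a field
```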

Constructing 2-wise independent hash function families using finite fields

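• More generally, we could take domain and range of size $2^k$ for some positive integer $k$
• $\mathbb{F}_{2^k}$: the finite field with $2^k$ elements
• The elements of $\mathbb{F}_{2^k}$ can be represented as bit vectors of length $k$
• The field provides a way to add and multiply the elements in time polynomial in $k$
• For $a, b \in \mathbb{F}_{2^k}$ (the finite field with $2^k$ elements) define $h_{a,b}: \mathbb{F}_{2^k} \to \mathbb{F}_{2^k}$ by $h_{a,b}(u) = au + b$
• Need $2k$ bits to represent $h_{a,b}$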

2-wise independent hash function families

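• Can achieve domain size $2^k$ and range size $2^l$ with $l \le k$:
– Elements of $\mathbb{F}_{2^k}$ can be represented as $k$-tuples of bits
– Represent $h_{a,b}(u)$ in this way: $h_{a,b}(u) = (w_1, \dots, w_k)$
– And define the new hash function by keeping just the first $l$ coordinates: $h'_{a,b}(u) = (w_1, \dots, w_l)$

Claim Functions $h'_{a,b}$ above form a 2-wise independent hash function family
Proof On the board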
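The truncated construction can be checked by brute force in a tiny field. The sketch below works in $\mathbb{F}_{2^3}$ with elements stored as 3-bit integers and the irreducible polynomial $x^3 + x + 1$; the helper names are hypothetical, and reading "the first $l$ coordinates" as the top $l$ bits is an assumption about the representation (any fixed choice of $l$ bits works the same way).

```python
from collections import Counter
from itertools import product

K = 3
MODULUS = 0b1011  # x^3 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply two elements of GF(2^K), each stored as a K-bit integer."""
    result = 0
    while b:
        if b & 1:
            result ^= a       # addition in GF(2^K) is XOR
        b >>= 1
        a <<= 1
        if a >> K:            # degree reached K: reduce modulo MODULUS
            a ^= MODULUS
    return result

def truncated_hash(a, b, u, l):
    """Compute a*u + b in GF(2^K) and keep the top l of its K bits."""
    return (gf_mul(a, u) ^ b) >> (K - l)

def pairwise_independent(l):
    size = 1 << K
    expected = size * size // (1 << (2 * l))   # functions per output pair
    for u1 in range(size):
        for u2 in range(u1 + 1, size):
            counts = Counter(
                (truncated_hash(a, b, u1, l), truncated_hash(a, b, u2, l))
                for a, b in product(range(size), repeat=2)
            )
            if len(counts) != 1 << (2 * l) or set(counts.values()) != {expected}:
                return False
    return True

print([pairwise_independent(l) for l in (1, 2, 3)])
```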

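$k$-wise independent hash function families
• A family $H$ of functions $h: A \to B$ between finite sets is called $k$-wise independent if for all distinct $u_1, \dots, u_k \in A$, and any $v_1, \dots, v_k \in B$, and for $h$ chosen uniformly at random from $H$, we have $\Pr[h(u_1) = v_1 \wedge \dots \wedge h(u_k) = v_k] = 1/|B|^k$
• The family of all functions $A \to B$ is $k$-wise independent
• There exist much smaller families obtained by generalizing the construction for pairwise independent hash families:
• Take $A = B = \mathbb{F}_q$ with $q = 2^l$ or $q = p$ (a prime number). For a $k$-tuple $a = (a_0, \dots, a_{k-1})$ of field elements define $h_a: \mathbb{F}_q \to \mathbb{F}_q$ by $h_a(u) = a_0 + a_1 u + \dots + a_{k-1} u^{k-1}$
• The above family is a $k$-wise independent family of size $q^k$
• Intuition: A degree $k-1$ polynomial is fully specified by its values at $k$ points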
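The polynomial construction can be verified exhaustively over a small prime field. In the sketch below (helper names are hypothetical) the check mirrors the interpolation intuition: for any $k$ distinct inputs, each of the $p^k$ possible output tuples is produced by exactly one coefficient tuple.

```python
from collections import Counter
from itertools import combinations, product

def poly_hash(coeffs, u, p):
    """Evaluate a_0 + a_1*u + ... + a_{k-1}*u^(k-1) mod p by Horner's rule."""
    value = 0
    for a in reversed(coeffs):
        value = (value * u + a) % p
    return value

def is_k_wise_independent(k, p):
    family = list(product(range(p), repeat=k))       # all p^k coefficient tuples
    for points in combinations(range(p), k):         # every set of k distinct inputs
        counts = Counter(tuple(poly_hash(c, u, p) for u in points) for c in family)
        if len(counts) != p ** k or set(counts.values()) != {1}:
            return False
    return True

print(is_k_wise_independent(3, 5), is_k_wise_independent(4, 5))
```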

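Constructing 4-wise independent random -1/1-vector
• Choose $k$ sufficiently large so that $2^k \ge n$
• Construct a 4-wise independent hash function family $H$ mapping $\mathbb{F}_{2^k} \to \mathbb{F}_{2^k}$ (identify each index $i \in \{1, \dots, n\}$ with a distinct field element)
• Define $g: \mathbb{F}_{2^k} \to \{0, 1\}$ by $g(u) = $ the first bit of $h(u)$
• Functions $g$ form a 4-wise independent family
• To generate a 4-wise independent random vector first choose a random $h \in H$
• The random vector is $(g(1), \dots, g(n))$
• This is a 0/1-vector
• To construct a -1/1-vector map 0 to -1 in the above vector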

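Basic AMS algorithm for $F_2$

Basic AMS estimator with fully independent random vector:
• Choose a uniformly random vector $x \in \{-1, 1\}^n$
• Initialize $z \leftarrow 0$
• On arrival of element $(i, c)$: $z \leftarrow z + c\,x_i$
• Output $z^2$

Basic AMS estimator with 4-wise independent random vector:
• Choose a random $h$ from a 4-wise independent family of hash functions with values in $\{-1, 1\}$
• Initialize $z \leftarrow 0$
• On arrival of element $(i, c)$: $z \leftarrow z + c\,h(i)$
• Output $z^2$

• $h(i)$ can be evaluated in time polynomial in $\log n$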

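Back to the AMS sketch
• Generate the sign vector $x$ using a 4-wise independent family of hash functions from $\{1, \dots, n\}$ to $\{-1, 1\}$
• Requires $O(\log n)$ space
• Total space for the basic AMS sketch: $O(\log n + \log m)$ for a stream of length $m$
• Improve by the median of the means estimator: average $O(1/\epsilon^2)$ independent copies of $z^2$ to get a mean $Y_1$, and repeat to get $O(\log(1/\delta))$ independent means $Y_1, Y_2, \dots$
• Output the median of the $Y_j$
• Total space used: $O\big(\epsilon^{-2} \log(1/\delta) (\log n + \log m)\big)$
• ($(\epsilon, \delta)$-approximation)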
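Putting the pieces together, here is one possible end-to-end sketch of the AMS sketch with 4-wise independent signs and the median of the means. It is illustrative only: the class name `AMSSketch`, the choice of $\mathbb{F}_{2^8}$ (so items must lie in 0..255), the use of the low bit as the kept bit, and the parameter names are all assumptions made for this example. Each counter stores four field coefficients instead of an $n$-bit sign vector.

```python
import random
from statistics import mean, median

K = 8
MODULUS = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply two elements of GF(2^K) stored as K-bit integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> K:
            a ^= MODULUS
    return result

class AMSSketch:
    def __init__(self, groups, group_size, rng):
        self.group_size = group_size
        count = groups * group_size
        # Four GF(2^8) coefficients per counter: a random polynomial of degree at most 3.
        self.coeffs = [[rng.randrange(1 << K) for _ in range(4)] for _ in range(count)]
        self.z = [0] * count

    def sign(self, j, i):
        """Sign of item i (0 <= i < 256) for counter j."""
        value = 0
        for a in reversed(self.coeffs[j]):    # Horner's rule in GF(2^8)
            value = gf_mul(value, i) ^ a
        return 1 if value & 1 else -1         # keep one bit, map 0 to -1

    def update(self, i, c):
        """Turnstile update f_i += c applied to every counter."""
        for j in range(len(self.z)):
            self.z[j] += c * self.sign(j, i)

    def estimate(self):
        squares = [v * v for v in self.z]
        g = self.group_size
        return median(mean(squares[s:s + g]) for s in range(0, len(squares), g))

sketch = AMSSketch(groups=5, group_size=20, rng=random.Random(1))
for i, c in [(7, 4), (7, -1)]:                # single item with final frequency 3
    sketch.update(i, c)
print(sketch.estimate())                      # every counter is +3 or -3
```

With a single distinct item every counter holds $\pm f_i$, so the estimate is exact; with several items the estimate is only approximate, and the group size and number of groups control the error and failure probability.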

AMS sketch is linear

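• The algorithm maintains $z = \sum_i f_i x_i = \langle x, f \rangle$, a linear function of the frequency vector $f$

• Corollary Given two streams $\sigma_1$ and $\sigma_2$, we can get the sketch for their concatenation $\sigma_1 \circ \sigma_2$ from their sketches by adding them: $z(\sigma_1 \circ \sigma_2) = z(\sigma_1) + z(\sigma_2)$

• Geometric interpretation of the AMS sketch: Similar to Johnson-Lindenstrauss projection trick that preserves the length

• Works in the turnstile model because of the linearity of the AMS sketch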
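Linearity is easy to see in code. The sketch below uses hypothetical helper names and a stored sign vector for brevity; it computes the counter for two streams separately and for their concatenation. The one requirement is that both streams are sketched with the same random signs.

```python
import random

def make_signs(n, seed):
    """A shared sign vector x; both streams must be sketched with the same x."""
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(n)]

def sketch(stream, signs):
    """z = sum_i f_i x_i for a turnstile stream of (i, c) updates."""
    return sum(c * signs[i] for i, c in stream)

signs = make_signs(4, seed=0)
stream_a = [(0, 2), (3, 1), (0, -1)]
stream_b = [(1, 4), (3, -1), (2, 2)]

merged = sketch(stream_a, signs) + sketch(stream_b, signs)
direct = sketch(stream_a + stream_b, signs)
print(merged == direct)
```

The same identity is what makes deletions work: an update $(i, -c)$ subtracts exactly what the earlier update $(i, c)$ added.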

Other $F_p$’s
• For $0 < p \le 2$, algorithms with space $\mathrm{poly}(1/\epsilon, \log n)$ [Indyk 2000] and later improvements (nearly tight)

• For $p > 2$ the problem becomes hard: space $\tilde{\Theta}(n^{1 - 2/p})$ (nearly tight)
