Algorithms for data streams: Lecture 2
Foundations of Data Science 2014
Indian Institute of Science
Navin Goyal
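Estimating $F_2$ using the AMS sketch
• Given a turnstile stream with frequency vector $f \in \mathbb{Z}^n$, estimate $F_2 = \sum_{i=1}^{n} f_i^2$ within multiplicative error $1 \pm \epsilon$ with probability at least $1 - \delta$
• Obvious solution takes $O(n)$ space (maintain the frequency vector). Can’t do better deterministically
• Randomized algorithm [Alon-Matias-Szegedy ’96]:
– Sample a random vector $\sigma \in \{-1,1\}^n$ with each coordinate chosen uniformly at random from $\{-1,1\}$ independently
• $E[\langle \sigma, f \rangle^2] = \sum_{i} f_i^2 = F_2$
• So if we could compute $\langle \sigma, f \rangle$ then we could estimate $F_2$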
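Basic AMS algorithm for $F_2$
• Given a turnstile stream, estimate $F_2$ within multiplicative error $1 \pm \epsilon$ with probability at least $1 - \delta$
Basic AMS estimator:
• Choose a random vector $\sigma \in \{-1,1\}^n$
• Initialize $Z \leftarrow 0$
• Until the end of the stream do
– On arrival of element $(i, \Delta)$: $Z \leftarrow Z + \Delta \sigma_i$
• At the end of the stream $Z = \sum_i \sigma_i f_i$
• $Z^2$ is an estimator of $F_2$
• Problem: storing $\sigma$ requires $n$ bits of space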
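The loop above can be sketched in a few lines of Python. This is only an illustration under assumptions not in the slides: items are 0-based indices, updates arrive as `(i, delta)` pairs, and the names `basic_ams`, `stream`, and `seed` are hypothetical. The sign vector is stored explicitly, which is exactly the space problem the last bullet points out.

```python
import random

def basic_ams(stream, n, seed=0):
    """Run the basic AMS estimator on a turnstile stream of (i, delta) updates."""
    rng = random.Random(seed)
    # Fully independent random signs: this explicit n-entry vector is the space bottleneck.
    sigma = [rng.choice((-1, 1)) for _ in range(n)]
    z = 0
    for i, delta in stream:
        z += delta * sigma[i]  # Z <- Z + delta * sigma_i
    return z * z, sigma

# Toy stream whose final frequency vector is f = (2, 4, 0), so F_2 = 20.
stream = [(0, 3), (2, 1), (0, -1), (1, 4), (2, -1)]
estimate, sigma = basic_ams(stream, n=3)
```

A single run returns either 4 or 36 on this stream; only the expectation over the random signs equals 20, which is why the variance reduction discussed next matters.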
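$Z^2$ is a reasonable estimator of $F_2$
• $E[Z^2] = F_2$ and $\mathrm{Var}[Z^2] \le 2 F_2^2$ (proof on the board; also in the book)
• Application of Chebyshev: $\Pr[|Z^2 - F_2| \ge c\sqrt{2}\, F_2] \le 1/c^2$
• Can improve by the median of the means estimator: average $t = O(1/\epsilon^2)$ independent copies of $Z^2$ to get a mean $Y_1$, and repeat $s = O(\log(1/\delta))$ times to get $Y_1, \ldots, Y_s$
• Output median of $Y_1, \ldots, Y_s$
• This gives $(\epsilon, \delta)$-approximation of $F_2$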
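A minimal sketch of the median of the means estimator, assuming independent copies of the basic estimator with explicitly stored, fully independent sign vectors. The function and parameter names (`ams_median_of_means`, `num_means`, `copies_per_mean`) are hypothetical, and choosing the two parameters from the target error and failure probability is left to the caller.

```python
import random
import statistics

def ams_median_of_means(stream, n, num_means, copies_per_mean, seed=0):
    """Median of `num_means` averages, each over `copies_per_mean` copies of Z^2."""
    rng = random.Random(seed)
    total = num_means * copies_per_mean
    # One independent sign vector and one counter per copy of the basic estimator.
    sigmas = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(total)]
    counters = [0] * total
    for i, delta in stream:
        for c in range(total):
            counters[c] += delta * sigmas[c][i]
    squares = [z * z for z in counters]
    means = [
        sum(squares[g * copies_per_mean:(g + 1) * copies_per_mean]) / copies_per_mean
        for g in range(num_means)
    ]
    return statistics.median(means)
```

Averaging reduces the variance of each mean, and the median then drives down the failure probability.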
The AMS sketch
• How much space does the basic AMS sketch take (without the median of the means trick)?
• The counter $Z$ takes $O(\log m)$ bits on a stream of length $m$ (assuming the updates $\Delta$ are bounded by a constant)
• So $O(\log m)$ space is sufficient
• No!
• We also need to remember random vector $\sigma$
• And this requires $n$ bits
• What essential property of the random vector did we use?
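The AMS sketch
• For $E[Z^2]$, we used $E[\sigma_i \sigma_j] = 0$ for all $i \ne j$
• For $\mathrm{Var}[Z^2]$, we used
– $E[\sigma_i \sigma_j \sigma_k \sigma_l] = 0$ for all pairwise distinct $i, j, k, l$
• This is satisfied if the $\sigma_i$ are 4-wise independent: For any pairwise distinct $i, j, k, l$ random variables $\sigma_i, \sigma_j, \sigma_k, \sigma_l$ are mutually independent
• For our situation, this means that for any $a_1, a_2, a_3, a_4 \in \{-1,1\}$ we have $\Pr[\sigma_i = a_1, \sigma_j = a_2, \sigma_k = a_3, \sigma_l = a_4] = 1/16$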
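To see where these two properties enter, here is a sketch of the standard calculation, writing $Z = \sum_i \sigma_i f_i$ for the random sign vector $\sigma$ and frequency vector $f$ (this notation is assumed here; the slides defer the proof to the board). Since $\sigma_i^2 = 1$, every term with an index appearing an odd number of times vanishes under 4-wise independence.

```latex
\begin{aligned}
E[Z^2] &= \sum_{i,j} f_i f_j \, E[\sigma_i \sigma_j] = \sum_i f_i^2 = F_2, \\
E[Z^4] &= \sum_{i,j,k,l} f_i f_j f_k f_l \, E[\sigma_i \sigma_j \sigma_k \sigma_l]
        = \sum_i f_i^4 + 3 \sum_{i \ne j} f_i^2 f_j^2, \\
\mathrm{Var}[Z^2] &= E[Z^4] - F_2^2
        = 2 \sum_{i \ne j} f_i^2 f_j^2 \;\le\; 2 F_2^2 .
\end{aligned}
```

The factor 3 counts the three ways to split four index positions into two equal pairs; the sums over $i \ne j$ range over ordered pairs.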
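Constructing pairwise independent random bit vectors
• Given a uniformly random vector $b \in \{0,1\}^d$ ($d$ bits of perfect randomness)
• We use $b$ to construct a pairwise independent random vector $r \in \{0,1\}^{2^d - 1}$ ($2^d - 1$ bits of useful randomness)
• We index $r$ by nonempty subsets $S$ of $\{1, \ldots, d\}$
• For nonempty $S \subseteq \{1, \ldots, d\}$ define $r_S = \bigoplus_{i \in S} b_i$
Claim For distinct and nonempty $S, T$, the bits $r_S$ and $r_T$ are independent and uniformly distributed
Proof On the board
• The $r_S$ are not 3-wise independent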
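The subset-XOR construction is small enough to check exhaustively. The sketch below assumes subsets are encoded as bitmasks over 0-based seed positions; the helper names `subset_xor_vector` and `is_pairwise_independent` are hypothetical.

```python
from itertools import combinations, product

def subset_xor_vector(seed_bits):
    """Map d seed bits to 2^d - 1 bits, one per nonempty subset S (encoded as a bitmask)."""
    d = len(seed_bits)
    out = {}
    for mask in range(1, 2 ** d):
        bit = 0
        for i in range(d):
            if (mask >> i) & 1:
                bit ^= seed_bits[i]  # r_S is the XOR of the seed bits indexed by S
        out[mask] = bit
    return out

def is_pairwise_independent(d):
    """Enumerate all 2^d seeds and check every pair of distinct nonempty subsets."""
    vectors = [subset_xor_vector(b) for b in product((0, 1), repeat=d)]
    for s, t in combinations(range(1, 2 ** d), 2):
        counts = {}
        for r in vectors:
            key = (r[s], r[t])
            counts[key] = counts.get(key, 0) + 1
        # Independent and uniform: each of the 4 outcomes occurs for exactly 1/4 of the seeds.
        if len(counts) != 4 or any(c != len(vectors) // 4 for c in counts.values()):
            return False
    return True
```

The failure of 3-wise independence is visible directly: for the subsets {0}, {1}, and {0, 1} (masks 1, 2, 3), the third bit is always the XOR of the first two.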
2-wise independent hash function families
• Very useful concept both in theory and practice
• Let $A$ and $B$ be finite sets
• A family $H$ of functions $A \to B$ is called 2-wise independent if for any distinct $x, y \in A$, and any $a, b \in B$, and for $h$ chosen uniformly at random from $H$, we have $\Pr[h(x) = a \wedge h(y) = b] = 1/|B|^2$
(Also called 2-universal family)
• The set of all functions $A \to B$ is 2-universal
• It’s very large: $|B|^{|A|}$ functions, describing one function takes $|A| \log |B|$ bits
Pairwise independent random vectors and 2-wise independent hash functions
• We say that random vector $r = (r_1, \ldots, r_n)$ is pairwise independent if for any distinct $i, j$ we have $r_i$ and $r_j$ are independent
• A random hash function $h$ from a 2-wise independent hash function family of functions mapping $\{1, \ldots, n\} \to B$ gives us a pairwise independent random vector: $(r_1, \ldots, r_n)$ with $r_i = h(i)$
• Hash function language slightly more convenient in some situations
• A non-streaming example of the utility of 2-wise independence: MAX CUT
Constructing 2-wise independent hash function families
• There are much smaller 2-wise independent families than the family of all functions
• Suppose $p$ is a prime number
• For $a, b \in \mathbb{Z}_p$ define $h_{a,b} : \mathbb{Z}_p \to \mathbb{Z}_p$ by $h_{a,b}(x) = ax + b \bmod p$
• Intuition: Determining a line in the plane requires two distinct points on the line
• This gives a family $H$ of size $p^2$
• $H$ is 2-wise independent
• Need $2 \lceil \log p \rceil$ bits to store a function in $H$
• Evaluation of $h_{a,b}$ is constant time on RAM (or certainly
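For a small prime the 2-wise independence of the linear family can be verified by brute force. In the sketch below the helper names `make_linear_hash` and `is_two_wise_independent` are hypothetical; the check uses the fact that a family of size $p^2$ is 2-wise independent exactly when, for each pair of distinct inputs, every pair of outputs is produced by exactly one function.

```python
from itertools import combinations, product

def make_linear_hash(a, b, p):
    """Return h_{a,b}(x) = (a*x + b) mod p."""
    return lambda x: (a * x + b) % p

def is_two_wise_independent(p):
    """Exhaustively check the family {h_{a,b}} over all a, b in Z_p."""
    family = [make_linear_hash(a, b, p) for a, b in product(range(p), repeat=2)]
    for x, y in combinations(range(p), 2):
        counts = {}
        for h in family:
            key = (h(x), h(y))
            counts[key] = counts.get(key, 0) + 1
        # Each of the p^2 output pairs must be hit by exactly one of the p^2 functions.
        if len(counts) != p * p or any(c != 1 for c in counts.values()):
            return False
    return True
```

Running the same check with a composite modulus such as 4 fails, which illustrates why the construction needs a field.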
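Constructing 2-wise independent hash function families using finite fields
• More generally, we could take $q = 2^\ell$ for some positive integer $\ell$
• $\mathbb{F}_q$: the finite field with $q$ elements
• The elements of $\mathbb{F}_{2^\ell}$ can be represented as bitvectors of length $\ell$
• The field provides a way to add and multiply the elements in time polynomial in $\ell$
• For $a, b \in \mathbb{F}_q$ (the finite field with $q$ elements) define $h_{a,b} : \mathbb{F}_q \to \mathbb{F}_q$ by $h_{a,b}(x) = ax + b$
• Need $2\ell$ bits to represent $h_{a,b}$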
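2-wise independent hash function families
• Can achieve $A = \mathbb{F}_{2^\ell}$ and $B = \{0,1\}$:
– Elements of $\mathbb{F}_{2^\ell}$ can be represented as $\ell$-tuples of bits
– Represent $h_{a,b}(x)$ in this way: $h_{a,b}(x) = (y_1, \ldots, y_\ell)$
– And define the new hash function $g_{a,b}$ by keeping just the first coordinate: $g_{a,b}(x) = y_1$
Claim Functions $g_{a,b}$ above form a 2-wise independent hash function family
Proof On the board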
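$k$-wise independent hash function families
• A family $H$ of functions $A \to B$ is called $k$-wise independent if for all distinct $x_1, \ldots, x_k \in A$, and any $a_1, \ldots, a_k \in B$, and for $h$ chosen uniformly at random from $H$, we have $\Pr[h(x_1) = a_1 \wedge \cdots \wedge h(x_k) = a_k] = 1/|B|^k$
• The family of all functions $A \to B$ is $k$-wise independent
• There exist much smaller families obtained by generalizing the construction for pairwise independent hash families:
• $F = \mathbb{F}_{2^\ell}$ or $\mathbb{Z}_p$ ($p$ a prime number). For a $k$-tuple $a = (a_0, \ldots, a_{k-1}) \in F^k$ define $h_a : F \to F$ by $h_a(x) = \sum_{i=0}^{k-1} a_i x^i$
• The above family is a $k$-wise independent family of size $|F|^k$
• Intuition: A degree $k-1$ polynomial is fully specified by its values at $k$ points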
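The polynomial construction can likewise be checked exhaustively over a small prime field. This sketch assumes coefficients are given lowest degree first and that $k \le p$; the names `poly_hash` and `is_k_wise_independent` are hypothetical.

```python
from itertools import combinations, product

def poly_hash(coeffs, p):
    """Return h(x) = sum_i coeffs[i] * x^i mod p, evaluated with Horner's rule."""
    def h(x):
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * x + c) % p
        return acc
    return h

def is_k_wise_independent(k, p):
    """Check all polynomials of degree < k over Z_p against every set of k distinct points."""
    family = [poly_hash(c, p) for c in product(range(p), repeat=k)]
    for points in combinations(range(p), k):
        seen = {tuple(h(x) for x in points) for h in family}
        # p^k functions producing p^k distinct value tuples means each tuple occurs exactly once.
        if len(seen) != p ** k:
            return False
    return True
```

This is the interpolation intuition from the last bullet in executable form: the values at $k$ distinct points determine the $k$ coefficients uniquely.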
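Constructing 4-wise independent random -1/1-vector
• Choose $\ell$ sufficiently large so that $2^\ell \ge n$
• Construct a 4-wise independent hash function family $H$ mapping $\mathbb{F}_{2^\ell} \to \mathbb{F}_{2^\ell}$
• Define $g_h$ by $g_h(x) =$ first coordinate of $h(x)$
• Functions $g_h$ form a 4-wise independent family
• To generate a 4-wise independent random vector first choose a random $h \in H$
• The random vector is $(g_h(1), \ldots, g_h(n))$, with $1, \ldots, n$ identified with distinct field elements
• This is a $\{0,1\}$-vector
• To construct a $\{-1,1\}$-vector map $0$ to $-1$ in the above vector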
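Basic AMS algorithm for $F_2$
Basic AMS estimator with fully independent random vector:
• Choose a random vector $\sigma \in \{-1,1\}^n$
• Initialize $Z \leftarrow 0$
• Until the end of the stream do
– On arrival of element $(i, \Delta)$: $Z \leftarrow Z + \Delta \sigma_i$
Basic AMS estimator with 4-wise independent random vector:
• Choose a random vector $a = (a_0, a_1, a_2, a_3)$ of hash function coefficients, which defines a 4-wise independent $\sigma$
• Initialize $Z \leftarrow 0$
• Until the end of the stream do
– On arrival of element $(i, \Delta)$: $Z \leftarrow Z + \Delta \sigma_i$, computing $\sigma_i$ from $a$
• $\sigma_i$ can be evaluated in time polynomial in $\log n$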
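A minimal sketch of the 4-wise independent variant, under several assumptions that are not from the slides: the field is fixed to GF(2^8) with the reduction polynomial $x^8 + x^4 + x^3 + x + 1$, so item identifiers must lie in 0..255; the low-order bit of the hash value plays the role of the "first coordinate"; and the names `gf_mul`, `sign`, and `ams_four_wise` are hypothetical. Only the four coefficients are stored, and each sign is recomputed on demand.

```python
import random

FIELD_BITS = 8
MODULUS = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Multiply in GF(2^8): carry-less product reduced modulo MODULUS."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> FIELD_BITS:
            a ^= MODULUS  # reduce as soon as the degree reaches 8
    return result

def sign(coeffs, i):
    """sigma_i in {-1, +1}: low bit of the degree-3 polynomial evaluated at field element i."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, i) ^ c  # Horner's rule; addition in GF(2^8) is XOR
    return 1 if acc & 1 else -1

def ams_four_wise(stream, seed=0):
    """Basic AMS estimator storing only 4 field elements instead of n sign bits."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(1 << FIELD_BITS) for _ in range(4)]
    z = 0
    for i, delta in stream:
        z += delta * sign(coeffs, i)
    return z * z, coeffs
```

For larger universes one would pick a larger field with a suitable irreducible polynomial; the structure of the code stays the same.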
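Back to the AMS sketch
• Generate $\sigma$ using a 4-wise independent family of hash functions from $\{1, \ldots, n\}$ to $\{-1,1\}$
• Requires $O(\log n)$ space
• Total space for the basic AMS sketch: $O(\log n)$ bits for the hash function plus $O(\log m)$ bits for the counter $Z$ on a stream of length $m$
• Improve by the median of the means estimator: average $t = O(1/\epsilon^2)$ independent copies of $Z^2$ to get a mean $Y_1$, and repeat $s = O(\log(1/\delta))$ times to get $Y_1, \ldots, Y_s$
• Output median of $Y_1, \ldots, Y_s$
• Total space used $O(\epsilon^{-2} \log(1/\delta) (\log n + \log m))$
• ($(\epsilon, \delta)$-approximation)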
AMS sketch is linear
• The algorithm maintains $Z = \sum_i \sigma_i f_i = \langle \sigma, f \rangle$, which is a linear function of $f$
• Corollary Given two streams $S_1$ and $S_2$, we can get the sketch for their concatenation $S_1 \circ S_2$ from their sketches by adding them: $Z(S_1 \circ S_2) = Z(S_1) + Z(S_2)$
• Geometric interpretation of the AMS sketch: Similar to Johnson-Lindenstrauss projection trick that preserves the length of $f$
• Works in the turnstile model because of the linearity of the AMS sketch
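Linearity is easy to see in code. The sketch below uses an explicitly stored, fully independent sign vector for brevity (an assumption made only to keep the example short), and the names `make_signs` and `sketch` are hypothetical. Both streams must be sketched with the same signs for the sum to be meaningful.

```python
import random

def make_signs(n, seed):
    """Shared random sign vector; both sketches must use the same one."""
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(n)]

def sketch(stream, sigma):
    """The AMS counter Z = sum of delta * sigma_i over the updates (i, delta)."""
    return sum(delta * sigma[i] for i, delta in stream)

sigma = make_signs(4, seed=1)
stream_a = [(0, 2), (3, 5)]
stream_b = [(0, -2), (1, 7), (3, 1)]

# Adding the two counters gives the counter of the concatenated stream.
merged = sketch(stream_a, sigma) + sketch(stream_b, sigma)
```

Negative updates, as in `stream_b`, need no special handling, which is the turnstile property mentioned in the last bullet.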
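Other $F_p$’s
• For $0 < p \le 2$, algorithms with $O(\epsilon^{-2}\, \mathrm{polylog}(n))$ space [Indyk 2000] and later improvements (nearly tight)
• For $p > 2$ the problem becomes hard: $\tilde{\Theta}(n^{1 - 2/p})$ space (nearly tight)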