
Distinct items:
• Given a stream of items drawn from a fixed universe, count the number of distinct items (so we are in the cash register model)
• Example: 3 5 7 4 3 4 3 4 7 5 9
• 5 distinct elements: 3 4 5 7 9 (we only want the count of distinct elements, and not the set of distinct elements)
• In terms of frequency moments estimation, this is the problem of estimating F_0
• The easy deterministic solutions: a bit vector over the universe, or a stored set whose size grows with the number of distinct elements
• A deterministic exact solution requires linear space in the worst case
• How about deterministic approximate solutions? And exact randomized ones?
• Can we do better with randomization and approximation?
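To make the contrast in the last bullets concrete, here is a minimal Python sketch, not taken from the slides. `count_distinct_exact` is the easy deterministic solution that stores the set. `fm_estimate` is a randomized estimator in the spirit of Flajolet-Martin: it tracks the maximum number of trailing zero bits of a hashed item and takes the median over independent copies. The names, the linear hash modulo `P`, the `2 ** (z + 0.5)` scaling, and the default of 31 copies are all assumptions chosen for illustration, and items are assumed to be non-negative integers smaller than `P`.

```python
import random
import statistics

P = (1 << 61) - 1  # Mersenne prime used as the hash modulus (illustrative choice)


def count_distinct_exact(stream):
    """Deterministic exact count: stores every distinct item seen."""
    seen = set()
    for item in stream:
        seen.add(item)
    return len(seen)


def trailing_zeros(v):
    """Number of trailing zero bits of v (capped at 61 for v == 0)."""
    if v == 0:
        return 61
    return (v & -v).bit_length() - 1


def fm_estimate(stream, copies=31, seed=0):
    """Constant-factor estimate of the distinct count (assumed variant).

    Each copy keeps only the largest trailing-zero count of h(item),
    where h(x) = (a*x + b) mod P for random a, b. The median over the
    copies boosts the success probability.
    """
    rng = random.Random(seed)
    hashes = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(copies)]
    max_zeros = [0] * copies
    for item in stream:
        for j, (a, b) in enumerate(hashes):
            z = trailing_zeros((a * item + b) % P)
            if z > max_zeros[j]:
                max_zeros[j] = z
    return statistics.median(2 ** (z + 0.5) for z in max_zeros)


example = [3, 5, 7, 4, 3, 4, 3, 4, 7, 5, 9]
print(count_distinct_exact(example))  # 5
print(fm_estimate(example))           # a rough estimate, not exact
```

The exact version uses memory proportional to the number of distinct items, while the estimator keeps one small counter per copy. The price is that the estimate is only accurate up to a constant factor, and taking the median tightens the success probability rather than the accuracy.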


• Counting distinct elements (Flajolet-Martin 1985)
• Counting distinct elements
• Union bound
• Counting distinct elements
• Boosting the success probability
• Why doesn't this give a (1 ± ε) approximation?
• Analyzing the BJKST estimator
• Chebyshev
• Analyzing the BJKST estimator
• Counting distinct elements (strict turnstile model)
• Algorithm for the decision version of counting distinct elements
• Decision version of counting distinct elements (analysis idea)
• Full algorithm
• Counting distinct elements
• Document sketching
• Problem: duplicate or near-duplicate identification in a collection of documents
• How to measure the similarity between documents?
• A reasonable (?) candidate: edit distance
• Computationally expensive
• Another measure: resemblance due to [Broder 97]

A document here can be thought of simply as a text file. We are given a collection of files and want to group them into classes so that the documents in a class are very (syntactically) similar to each other. The question is what we mean by similarity. Of course, this depends on the application. One application is finding duplicate documents on the web. For this application, a good definition of similarity would be the edit distance between the documents. But while edit distance may capture our intuitive notion of similarity well, it is computationally expensive to compute between two documents. Broder proposed another measure, resemblance [Broder 97].

• Resemblance of documents [Broder 97]
• Estimating resemblance: Set vs multiset. Need to check this last statement. How does the variance go down with k?
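As a rough illustration of estimating resemblance, the Python sketch below treats each document as its set of w-word shingles, computes the exact resemblance |A ∩ B| / |A ∪ B|, and estimates it with k min-hashes: the fraction of hash functions on which the two documents have the same minimum hash value. The function names, the shingle width `w=2`, and the use of salted BLAKE2b digests as the family of hash functions are assumptions made for this example, not details from the slides. Both shingle sets are assumed to be non-empty.

```python
import hashlib


def shingles(text, w=2):
    """Return the set of w-word shingles (contiguous word windows) of a text."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """Exact resemblance of two shingle sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(shingle, i):
    """i-th hash function: a 64-bit BLAKE2b digest salted with i (illustrative)."""
    data = " ".join(shingle).encode("utf-8")
    digest = hashlib.blake2b(
        data, digest_size=8, salt=i.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, k):
    """For each of k hash functions, keep the minimum hash over the set."""
    return [min(_hash(s, i) for s in shingle_set) for i in range(k)]


def estimate_resemblance(a, b, k=200):
    """Fraction of the k hash functions on which the two minima agree."""
    sig_a = minhash_signature(a, k)
    sig_b = minhash_signature(b, k)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / k


doc_a = shingles("a b c d")
doc_b = shingles("a b c e")
print(resemblance(doc_a, doc_b))  # 0.5 (2 shared shingles out of 4)
print(estimate_resemblance(doc_a, doc_b, k=400))  # close to 0.5, not exact
```

On the question of how the variance goes down with k: if the k hash functions behave like independent random functions, each agreement indicator is a Bernoulli variable whose mean is the resemblance r, so the variance of the average is r(1 - r)/k. This holds under that idealized assumption only.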