How to Summarize the Universe: Dynamic Maintenance of Quantiles
By: Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin J. Strauss

Quantiles
  • Median, quartiles, …
  • The general case: Ф-quantiles, the elements of rank $k \phi N$ for $k = 1, 2, \ldots, 1/\phi$
  • Uses
      • Statistics
      • Estimating result set size
      • Partitioning
      • …

Computing static quantiles
  • Blum, Floyd, Pratt, Rivest & Tarjan
      • Find the i’th element
      • Comparison based
      • Similar to QuickSort
      • O(n) worst-case time
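The selection idea on this slide can be sketched in Python as follows. This is a minimal illustration of the median-of-medians pivot rule, not code from the talk; the function name `select` and the 0-based rank convention are assumptions made for the example.

```python
def select(items, i):
    """Return the i-th smallest element (0-based) using median-of-medians pivots."""
    items = list(items)
    if not 0 <= i < len(items):
        raise IndexError("rank out of range")
    while True:
        if len(items) <= 5:
            return sorted(items)[i]
        # Median of every group of five, then (recursively) the median of those medians.
        groups = [sorted(items[g:g + 5]) for g in range(0, len(items), 5)]
        medians = [group[len(group) // 2] for group in groups]
        pivot = select(medians, len(medians) // 2)
        lower = [x for x in items if x < pivot]
        equal_count = sum(1 for x in items if x == pivot)
        if i < len(lower):
            items = lower
        elif i < len(lower) + equal_count:
            return pivot
        else:
            # Discard everything up to and including the pivot and adjust the rank.
            i -= len(lower) + equal_count
            items = [x for x in items if x > pivot]
```

The pivot rule is what gives the worst-case linear bound; the list copies here favor readability over constant factors.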

Problems with massive data sets
  • O(n) time – not good enough…
  • O(n) space – usually not affordable
  • Dynamic environment
      • Cancellations are especially troublesome
      • Usually recomputed periodically
      • May be very inaccurate until recomputed
  • Some kind of approximation is the only choice!

Common approaches
  • Deterministically chosen sample
  • Randomization – probability of failure
  • Maintaining a backing sample
  • Wavelets
  • Most of the above approaches work well for the incremental (insert-only) case, but deletions may cause inaccuracy.

GK – Greenwald-Khanna (’01)
  • Fill the available memory with values.
  • Maintain rank ranges on the values in memory.
  • When a new value is inserted, kick a value out of memory.
  • Insert-only algorithm
  • Can be extended to support deletes (“GK2”).
      • Maintain two instances – one for insertions and one for deletions.

Maintenance of Equi-Depth Histograms (using a backing sample)
  • Gibbons, Matias, Poosala – ’97
  • Scan the dataset and choose values for the sample using the “reservoir” method.
  • Treat insertions as a “continuous” scan.
  • When a deletion from the sample is necessary, rescan only if the number of items drops below a specified minimum.
  • Works well for a mostly-insertions environment.
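The “reservoir” method mentioned above can be sketched as below. This is the textbook single-pass reservoir sampler, offered as an illustration rather than the backing-sample code of Gibbons, Matias and Poosala; the name `reservoir_sample` and the optional `rng` parameter are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from a stream in one pass."""
    rng = rng or random.Random()
    sample = []
    for seen, item in enumerate(stream, start=1):
        if len(sample) < k:
            # The first k items fill the reservoir directly.
            sample.append(item)
        else:
            # The new item survives with probability k / seen and evicts a uniform victim.
            slot = rng.randrange(seen)
            if slot < k:
                sample[slot] = item
    return sample
```

Insertions fit this scheme naturally because each new item is just the next element of the scan; a deletion that hits the sample has no such cheap counterpart, which is the weakness the slide points out.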

The authors’ main result
  • The RSS algorithm
      • RSS – Random Subset Sum
      • Space – polylogarithmic in universe size
      • Time – proportional to the space used
      • A priori guarantee of accuracy within a user-specified error ε, with a user-specified probability of failure δ.

Some formalism…
  • The universe: U = {0, …, |U|-1}
  • Number of tuples in data set: ||A|| = N
  • Data set can be thought of as an array: A[i] – number of tuples with value i
  • Our goal for computing Ф-quantiles – find a $j_k$ such that:

    $(k\phi - \epsilon) N \le \sum_{i \le j_k} A[i] \le (k\phi + \epsilon) N$

Some assumptions
  • The universe’s size is known
      • Later we’ll throw that assumption away
  • Update = Delete + Insert

Computing quantiles
  • Let’s say A[i] is known for every i.
      • Easy to maintain through updates
  • Summing up array items?
      • Not a very good complexity: a scan costs up to O(|U|) per quantile.
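To make the cost concrete, here is a small sketch of the naive approach: keep the full count array A and scan it for each quantile. The function name `naive_quantile` and the 1-based `rank` argument are assumptions made for this example.

```python
def naive_quantile(A, rank):
    """Return the smallest value j with A[0] + ... + A[j] >= rank (rank is 1-based)."""
    running = 0
    for j, count in enumerate(A):
        running += count
        if running >= rank:
            return j
    raise ValueError("rank exceeds the number of items")


# Updates are trivial: an insert of value v is A[v] += 1, a delete is A[v] -= 1.
A = [0, 2, 0, 1, 3, 0, 0, 2]          # 8 items over the universe {0, ..., 7}
median_before = naive_quantile(A, 4)  # rank 4 of 8
A[4] -= 3                             # delete the three items with value 4
median_after = naive_quantile(A, 3)   # rank 3 of the remaining 5
```

Updates cost O(1), but every query walks the array, and the array itself takes O(|U|) space.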

Computing quantiles (cont.)
  • We need a method of reducing summation overhead.
  • We should be able to compute any sum of consecutive items in A in logarithmic time.
  • The solution: keeping precomputed sums of intervals.

Dyadic intervals - defined
  • Atomic dyadic interval – a single point.
  • $I_{j,k} = [\,k \cdot 2^{\log|U| - j},\ (k+1) \cdot 2^{\log|U| - j} - 1\,]$
  • j – resolution level
  • Example (|U| = 8):

    Value:    0      1      2      3      4      5      6      7
    Level 3:  I(3,0) I(3,1) I(3,2) I(3,3) I(3,4) I(3,5) I(3,6) I(3,7)
    Level 2:  I(2,0)        I(2,1)        I(2,2)        I(2,3)
    Level 1:  I(1,0)                      I(1,1)
    Level 0:  I(0,0)
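The definition can be checked with a few lines of Python. This is a sketch that assumes |U| is a power of two; the function name `dyadic_interval` is made up for the example.

```python
def dyadic_interval(j, k, universe_size):
    """Return the inclusive endpoints (lo, hi) of I(j, k) for a power-of-two universe."""
    levels = universe_size.bit_length() - 1      # log2(|U|)
    if universe_size != 1 << levels:
        raise ValueError("universe size must be a power of two")
    if not 0 <= j <= levels or not 0 <= k < (1 << j):
        raise ValueError("no such dyadic interval")
    width = 1 << (levels - j)                    # 2^(log|U| - j) points per interval
    return k * width, (k + 1) * width - 1
```

For the |U| = 8 example, `dyadic_interval(2, 2, 8)` gives `(4, 5)` and `dyadic_interval(3, 6, 8)` gives the single point `(6, 6)`.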

Computing an arbitrary interval
  • Let’s say we have sums for all dyadic intervals as in the example.
  • We want to compute A[0,6], the number of items with values 0 through 6.
  • A[0,6] = I(1,0) + I(2,2) + I(3,6)

    Value:    0      1      2      3      4      5      6      7
    Level 3:  I(3,0) I(3,1) I(3,2) I(3,3) I(3,4) I(3,5) I(3,6) I(3,7)
    Level 2:  I(2,0)        I(2,1)        I(2,2)        I(2,3)
    Level 1:  I(1,0)                      I(1,1)
    Level 0:  I(0,0)
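The decomposition used on this slide can be sketched as a greedy left-to-right walk: at each step take the largest aligned dyadic block that starts at the current left end and still fits. The function name `dyadic_cover` and the (j, k) output format are assumptions; |U| is assumed to be a power of two.

```python
def dyadic_cover(lo, hi, universe_size):
    """Split the inclusive range [lo, hi] into disjoint dyadic intervals, as (j, k) pairs."""
    levels = universe_size.bit_length() - 1
    if universe_size != 1 << levels or not 0 <= lo <= hi < universe_size:
        raise ValueError("bad range or universe size")
    pieces = []
    while lo <= hi:
        # Grow the block while it stays aligned at lo and inside [lo, hi].
        width = 1
        while lo % (2 * width) == 0 and lo + 2 * width - 1 <= hi:
            width *= 2
        j = levels - (width.bit_length() - 1)    # resolution level of a block this wide
        pieces.append((j, lo // width))
        lo += width
    return pieces
```

For [0, 6] in a universe of size 8 this returns `[(1, 0), (2, 2), (3, 6)]`, the same three intervals as in the slide.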

Dyadic intervals - observations
  • log(|U|) + 1 resolution levels
  • 2|U| - 1 dyadic intervals altogether
      • O(|U|) space needed to keep them all
  • O(log(|U|)) time needed to compute any arbitrary interval.
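These observations can be illustrated with an exact (non-sketched) store of all dyadic sums. This is only a sketch of the baseline the talk goes on to improve; the class name `DyadicSums` and its methods are assumptions, and |U| is assumed to be a power of two.

```python
class DyadicSums:
    """Exact counts for every dyadic interval of a power-of-two universe."""

    def __init__(self, universe_size):
        self.levels = universe_size.bit_length() - 1
        if universe_size != 1 << self.levels:
            raise ValueError("universe size must be a power of two")
        # sums[j][k] is the number of items whose value lies in I(j, k).
        self.sums = [[0] * (1 << j) for j in range(self.levels + 1)]

    def update(self, value, delta):
        """Insert (delta=+1) or delete (delta=-1): touches one interval per level."""
        for j in range(self.levels + 1):
            self.sums[j][value >> (self.levels - j)] += delta

    def prefix(self, end):
        """Count items with value in [0, end), for 0 <= end <= |U|."""
        total = 0
        for j in range(self.levels + 1):
            k = end >> (self.levels - j)
            # An odd index means the left sibling I(j, k-1) lies entirely below end.
            if k & 1:
                total += self.sums[j][k - 1]
        return total
```

With |U| = 8 the structure holds 1 + 2 + 4 + 8 = 15 sums, and `prefix(7)` adds exactly I(1,0), I(2,2) and I(3,6).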

Computing quantiles (cont.)
  • We can now efficiently compute any arbitrary interval in A.
  • A Ф-quantile for any k can be computed thus:
      • We need a $j_k$ s.t.: $A[0, j_k) < k \phi N \le A[0, j_k + 1)$
      • Use binary search to find it!
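A sketch of that binary search follows. It takes any callable `prefix(e)` that returns the count of items with value in [0, e); both the name `find_quantile` and that calling convention are assumptions for the example, and the target rank is assumed to lie between 1 and N.

```python
def find_quantile(prefix, universe_size, target_rank):
    """Return j with prefix(j) < target_rank <= prefix(j + 1)."""
    lo, hi = 0, universe_size - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid + 1) >= target_rank:
            hi = mid          # the answer is mid or something smaller
        else:
            lo = mid + 1      # too few items up to mid, look to the right
    return lo


A = [0, 2, 0, 1, 3, 0, 0, 2]


def exact_prefix(end):
    # Exact prefix sum over the count array; a sketch would plug in an estimate here.
    return sum(A[:end])
```

With exact prefix sums the search is correct because prefix counts are monotone. Estimated prefix sums need not be monotone, so the returned value is then only approximately the quantile.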

But…
  • Keeping O(|U|) of data presents a real space complexity problem.
  • We need a way of estimating A[i] on demand.
  • … And also of estimating any dyadic interval on demand.

Introducing random sets
  • Let S be a random set of values from U.
  • Each value has a probability of ½ of being in S.
  • Expectation of the number of items in S is ½|U|.

Random subset sums
  • Define ||A_S|| as the number of items in A with values in S.
  • Expectation of ||A_S|| is ½||A|| = ½N.
  • Now consider only subsets S containing a certain value i ($i \in S$):

    $\|A_S\| = A[i] + \|A_{S \setminus \{i\}}\|$

    $E[\|A_S\|] = A[i] + \tfrac{1}{2} \|A_{U \setminus \{i\}}\|$

Random subset sums (cont.)
  • Suppose we keep a number of random sets S, each containing random values from U – each with probability ½.
  • We maintain ||A_S|| for each such set.
      • Easy to maintain during updates.
  • How can we now estimate A[i]?
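Maintaining the subset sums under inserts and deletes can be sketched as follows. The sets are stored explicitly here purely for illustration (the talk replaces them with seeds later); the class name `SubsetSums` and its fields are assumptions.

```python
class SubsetSums:
    """Maintain ||A|| and ||A_S|| for a fixed collection of subsets S of the universe."""

    def __init__(self, subsets):
        self.subsets = [frozenset(s) for s in subsets]
        self.sums = [0] * len(self.subsets)   # sums[idx] is ||A_S|| for subsets[idx]
        self.total = 0                        # ||A||

    def update(self, value, delta):
        """Insert (delta=+1) or delete (delta=-1) one item with the given value."""
        self.total += delta
        for idx, s in enumerate(self.subsets):
            if value in s:
                self.sums[idx] += delta
```

A delete is just an update with a negative delta, so the counters do not depend on the order in which inserts and deletes arrive.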

Random subset sums (cont.)
  • We can estimate A[i] for any i, using a set S that contains i, with: A[i] ≈ 2||A_S|| - ||A||
  • Proof:

    $E(2\|A_S\| - \|A\|) = 2\,(A[i] + \tfrac{1}{2}\|A_{U \setminus \{i\}}\|) - \|A\| = A[i]$

  • The authors prove that repeating the process O(1/ε²) times yields the required accuracy.
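The expectation in the proof can be verified exactly on a tiny universe by enumerating every subset S that contains i, each being equally likely. This is a brute-force check written for illustration, not part of the algorithm; the function name `expected_estimate` is an assumption.

```python
from fractions import Fraction
from itertools import combinations


def expected_estimate(A, i):
    """Average 2*||A_S|| - ||A|| over all subsets S of the universe that contain i."""
    others = [v for v in range(len(A)) if v != i]
    total = sum(A)
    accumulated = 0
    subsets_seen = 0
    for size in range(len(others) + 1):
        for rest in combinations(others, size):
            subset_sum = A[i] + sum(A[v] for v in rest)   # ||A_S|| for S = {i} + rest
            accumulated += 2 * subset_sum - total
            subsets_seen += 1
    return Fraction(accumulated, subsets_seen)
```

Every other value lands in exactly half of the enumerated subsets, so the average comes out to A[i] for any count array A.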

Random subset sums (cont.)
  • We can also estimate any dyadic interval $I_{j,k}$ using the same method.
  • Improvement: We can compute the sums for dyadic intervals from a certain level.
  • We can now estimate any arbitrary interval in the universe…

Space Considerations
  • Keeping a set of expected size ½|U| is still O(|U|).
  • We need a method of “keeping” a set without actually keeping it…
  • The technique: instead of sets, keep random seeds of size O(log|U|) bits and compute whether a given i ∈ S on demand.

Extended Hamming Code
  • Used for generating the random sets.
  • Provides sufficient “randomness”
  • For example:
      • |U| = 8
      • Seed size: log|U| + 1 = 4
      • G(seed, i) = seed × i’th column (mod 2)

    1 0 1 0 1 0 1 0
    1 1 0 0 1 1 0 0
    1 1 1 1 0 0 0 0
    1 1 1 1 1 1 1 1
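The |U| = 8 example can be reproduced directly from the matrix on the slide. In this sketch the seed is a tuple of 4 bits and value i belongs to the set when the mod-2 dot product of the seed with column i is 1; the names `ROWS`, `in_set`, `subset_for` and `seeds_containing` are assumptions made for the example.

```python
from itertools import product

# The four rows of the generator matrix, exactly as shown on the slide.
ROWS = ["10101010", "11001100", "11110000", "11111111"]


def in_set(seed_bits, i):
    """G(seed, i): dot product over GF(2) of the seed with the i-th column."""
    return sum(bit * int(row[i]) for bit, row in zip(seed_bits, ROWS)) % 2


def subset_for(seed_bits):
    """The set S described by a seed, materialized only for inspection."""
    return {i for i in range(len(ROWS[0])) if in_set(seed_bits, i)}


def seeds_containing(i):
    """All 4-bit seeds whose set contains value i."""
    return [s for s in product((0, 1), repeat=len(ROWS)) if in_set(s, i)]
```

For this matrix each value belongs to the sets of exactly 8 of the 16 seeds, and among the seeds whose set contains i, exactly half also contain any other value j. That is the property the estimator 2||A_S|| - ||A|| relies on.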

RSS Algorithm Summary
  • To compute a dyadic interval:
      • Compute 2||A_S|| - ||A|| for sets containing the given dyadic interval.
  • To compute an arbitrary interval:
      • Write it as a disjoint union of dyadic intervals, estimate them and take a median over possible results (simplified).
  • To compute the quantiles:
      • Use binary search and compute the intervals until found.
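One common way to combine many independent estimates, in the spirit of the “take a median” step, is to average them in groups and take the median of the group averages. The sketch below shows only that combining step; the grouping scheme, the name `median_of_means` and the `group_size` parameter are assumptions and may differ from what the authors do.

```python
from statistics import mean, median


def median_of_means(estimates, group_size):
    """Average the estimates in consecutive groups, then return the median of the averages."""
    # Only full groups are used; a trailing partial group is dropped.
    starts = range(0, len(estimates) - group_size + 1, group_size)
    groups = [estimates[g:g + group_size] for g in starts]
    if not groups:
        raise ValueError("need at least group_size estimates")
    return median(mean(group) for group in groups)
```

Averaging shrinks the variance of each group, and the median then discards the occasional group that is still far off.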

Algorithm Complexity
  • Claim: the RSS algorithm’s space complexity (for t quantile queries):

    $O\left(\log^2(|U|) \cdot \log\left(\frac{t \log(|U|)}{\delta}\right) / \epsilon^2\right)$

  • Time complexity for inserts, deletes and computing each quantile on demand is proportional to the space used.

Proof Outline
  • Declare random variables:
      • X_k = 2||A_{I_k}|| if I_k is in S and 0 otherwise
      • X – sum of all X_k’s in a certain set
      • Y – sum of all X’s in a given interval
      • Z – a number of repetitions of X.

Proof Outline (cont.)
  • In a similar fashion to previous slides, show that Y and ||A|| can be used to compute ||A_I||.
  • Compute the variance.
  • Use Chebyshev’s and then Chernoff’s inequalities, together with the computed variance, to achieve the required result.

What If U Is Unknown?
  • In practice, the universe U is not always known.
  • Predict a range [0, u-1] for U.
  • Given an inserted (or updated) value i s.t. i > u-1, add another instance of RSS with range [u, u²-1], and so on…
  • Estimating dyadic intervals can be done in a single instance of RSS.
  • Increased cost factor: log2log(|U|).
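The range-growing scheme can be sketched as a lookup from a value to the RSS instance responsible for it. This assumes the upper bound is squared at every step, so that the ranges are [0, u-1], [u, u²-1], [u², u⁴-1], and so on; the slide only spells out the first two ranges, and the function name `instance_index` is hypothetical.

```python
def instance_index(value, u):
    """Return which RSS instance covers value, with instance 0 covering [0, u-1]."""
    if value < 0 or u < 2:
        raise ValueError("value must be non-negative and u at least 2")
    index = 0
    upper = u                 # exclusive upper bound of the current instance
    while value >= upper:
        upper *= upper        # assumed growth rule: square the bound each time
        index += 1
    return index
```

Under this assumption the number of instances grows like log log of the largest value seen, since each new instance squares the covered range.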

Some RSS Properties
  • RSS may return as a quantile a value which is not really in the dataset.
  • Order of insertions and deletions does not affect result and accuracy.
  • Can be parallelized quite easily (as long as random subsets are pre-agreed).

Experimental Results
  • Experiments
      • Static artificial dataset
      • Dynamic artificial dataset
      • Dynamic real dataset
  • Participants
      • Naïve[l]
      • RSS[l]
      • GK
      • GK2 – an improvement for GK

Static Artificial Dataset
  • |U| = 2^20
  • Compute 15 quantiles at position (1/16)k for k = 1,2,…,15.
  • 3 different distributions
      • Uniform
      • Zipf
      • Normal[m,v]
  • Algorithm used: RSS[7] (11K footprint).

Errors for Zipf data

Errors for Normal[U/2, U/50] Distribution

Dynamic Artificial Dataset
  • Insert N = 104,858 items from uniform dist. D1 = Uni[1,U], U = 2^20.
  • Insert αN more items from uniform dist. D2 = Uni[U/2-U/32, U/2+U/32].
  • Delete all values from the first insertion.
  • Parameter α controls the mass of the second insertion with respect to the first.
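The workload described above can be generated with a short script. This is a sketch of the three phases only; the function name `dynamic_workload`, the ("insert" / "delete", value) tuple format and the use of Python's `random` module are assumptions, not details from the experiments.

```python
import random


def dynamic_workload(n, alpha, universe_size, rng):
    """Build the insert/insert/delete sequence: D1 inserts, D2 inserts, then D1 deletes."""
    first = [rng.randint(1, universe_size) for _ in range(n)]
    lo = universe_size // 2 - universe_size // 32
    hi = universe_size // 2 + universe_size // 32
    second = [rng.randint(lo, hi) for _ in range(int(alpha * n))]
    ops = [("insert", v) for v in first]
    ops += [("insert", v) for v in second]
    # Deleting exactly the first batch leaves only the narrow D2 distribution behind.
    ops += [("delete", v) for v in first]
    return ops
```

After the final phase the surviving data is concentrated in a band of width U/16 around U/2, so a summary that cannot account for the deletions will report quantiles of a distribution that no longer exists.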

Dynamic Artificial Dataset Results

Dynamic Real Dataset
  • Based on true Call Detail Records (CDRs) from AT&T.
  • Dataset used includes 4.42 million CDRs covering a period of 18 hours.
  • Objective: find the median length of current calls.
  • Probe for estimates every 10,000 records.
  • Algorithm used: RSS[6] (4K footprint).

Number of Active Phone Calls Over Time

Error in Computation of Median Over Time

Average Error for Last 50 Snapshots, For Deciles

Conclusions – RSS
  • Algorithm for maintaining dynamic quantiles.
  • Works well (within a user-defined precision) both for insertions AND deletions.
  • Polylogarithmic (in universe size) space and time complexities.
