How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003


Page 2: Problem Definition

► The universe: $U = \{0, \dots, |U|-1\}$
► Number of records in the data set: $\|A\| = N$
► The data set can be thought of as an array: $A[i]$ is the number of records with value $i$
► $A_S$ is the number of records with values in $S$
► The $\Phi$-quantiles of an ordered sequence of $N$ data items are the values with rank $k\Phi N$, for $k = 1, 2, \dots, 1/\Phi$
► Our goal is computing $\varepsilon$-approximate $\Phi$-quantiles: for each $k$, find a $j_k$ such that

$$(k\Phi - \varepsilon)N \le \sum_{i \le j_k} A[i] \le (k\Phi + \varepsilon)N$$
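The acceptance condition for an ε-approximate quantile can be checked directly on a small array. The following is only a sketch: the function name `is_eps_approx_quantile` and the sample array are made up for illustration and do not come from the slides.

```python
def is_eps_approx_quantile(A, j, k, phi, eps):
    """Return True if j is an acceptable eps-approximate k-th phi-quantile.

    A[i] is the number of records with value i. The test is
    (k*phi - eps) * N <= sum_{i <= j} A[i] <= (k*phi + eps) * N.
    """
    n = sum(A)                 # N = ||A||
    prefix = sum(A[: j + 1])   # records with value <= j
    return (k * phi - eps) * n <= prefix <= (k * phi + eps) * n


# Hypothetical data set: 12 records over a universe of 8 values.
A = [2, 0, 3, 1, 4, 0, 0, 2]
candidates = [j for j in range(len(A)) if is_eps_approx_quantile(A, j, 2, 0.25, 0.1)]
print(candidates)  # values acceptable as the median (k = 2, phi = 1/4)
```

Several values may satisfy the condition at once; any of them is a valid answer.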

Page 3: [Figure: bar chart of A[i] (vertical axis, 0 to 12) against the values 1, 2, 3, 4, …, |U| of the universe U (horizontal axis)]

Page 4: Transactions

► Insert(i): $A[i] \leftarrow A[i] + 1$
► Delete(i): $A[i] \leftarrow A[i] - 1$
► Let $N_t = \sum_i A_t[i]$
► ASSUME: the universe size $|U|$ is known

Page 5: The Main Algorithmic Result

► The RSS algorithm
► Space complexity: $O\left(\log^2(|U|)\,\log(\log(|U|)/\delta)/\varepsilon^2\right)$
► Update: on every transaction, in O(space) time
► Estimation: on demand, in O(space) time
► One pass over the data

Page 6: Dyadic Intervals

► $\log(|U|)+1$ resolution levels $j$
► $2|U|-1$ dyadic intervals

$$I_{j,k} = \left[k \cdot 2^{\log(|U|)-j},\ (k+1) \cdot 2^{\log(|U|)-j} - 1\right], \qquad I_{0,0} = U, \qquad I_{\log(|U|),i} = \{i\}$$

0      1      2      3      4      5      6      7
I(3,0) I(3,1) I(3,2) I(3,3) I(3,4) I(3,5) I(3,6) I(3,7)
I(2,0)        I(2,1)        I(2,2)        I(2,3)
I(1,0)                      I(1,1)
I(0,0)
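The interval formula can be turned into a few lines of code. This is a minimal sketch that assumes |U| is a power of two and passes log(|U|) as `log_u`; the function names are illustrative only.

```python
def dyadic_interval(j, k, log_u):
    """Inclusive endpoints (lo, hi) of I(j, k) in a universe of size 2**log_u."""
    if not (0 <= j <= log_u and 0 <= k < 2 ** j):
        raise ValueError("need 0 <= j <= log_u and 0 <= k < 2**j")
    width = 2 ** (log_u - j)          # number of values covered at level j
    return k * width, (k + 1) * width - 1


def containing_interval(i, j, log_u):
    """Index k of the level-j dyadic interval containing value i (top j bits of i)."""
    return i >> (log_u - j)


# For |U| = 8: the whole universe, a middle interval, and a single point.
print(dyadic_interval(0, 0, 3), dyadic_interval(2, 2, 3), dyadic_interval(3, 6, 3))
```

Counting the pairs (j, k) over all levels gives 1 + 2 + 4 + 8 = 15 = 2|U| − 1 intervals for |U| = 8.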

Page 7: Arbitrary Intervals

► Any interval can be written as a disjoint union of dyadic intervals: at most log(|U|) of them for an interval starting at 0, and at most 2 log(|U|) in general
► For example, A[0,6] = I(1,0) + I(2,2) + I(3,6)
► Intervals starting at 0 will not use the same resolution twice

0      1      2      3      4      5      6      7
I(3,0) I(3,1) I(3,2) I(3,3) I(3,4) I(3,5) I(3,6) I(3,7)
I(2,0)        I(2,1)        I(2,2)        I(2,3)
I(1,0)                      I(1,1)
I(0,0)
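For intervals that start at 0, the decomposition can be read off the binary representation of the interval's length. The sketch below assumes |U| = 2**log_u and uses a hypothetical helper name, `decompose_prefix`; it reproduces the [0,6] example.

```python
def decompose_prefix(x, log_u):
    """Split [0, x] into disjoint dyadic intervals, returned as (j, k) pairs.

    Assumes 0 <= x < 2**log_u. Each set bit of x + 1 contributes one
    interval, so no resolution level is used twice.
    """
    length = x + 1        # number of values in [0, x]
    pieces = []
    start = 0             # first value not yet covered
    for j in range(log_u + 1):
        width = 2 ** (log_u - j)
        if length & width:
            pieces.append((j, start // width))
            start += width
    return pieces


print(decompose_prefix(6, 3))  # [(1, 0), (2, 2), (3, 6)]
```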

Page 8: Computing Quantiles

► Assuming we have the number of records in each dyadic interval, we can efficiently compute the number of records in any arbitrary interval of A
► To compute the $k$-th $\Phi$-quantile for any $k$, we need a $j_k$ such that $A[0,j_k) < k\Phi N \le A[0,j_k+1)$
► Use binary search to find it
► Keeping all intervals is costly: O(|U|) space
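The binary search can be sketched with exact prefix counts standing in for the dyadic-interval counts. In this illustration `prefix_count` simply sums the array; in the scheme described above it would instead add up the counts of the dyadic intervals covering [0, j). The names and the sample data are assumptions.

```python
def prefix_count(A, j):
    """A[0, j): number of records with value strictly less than j."""
    return sum(A[:j])


def find_quantile(A, k, phi):
    """Return j with A[0, j) < k*phi*N <= A[0, j+1), assuming 0 < k*phi <= 1."""
    n = sum(A)
    target = k * phi * n
    lo, hi = 0, len(A) - 1   # invariant: A[0, lo) < target <= A[0, hi + 1)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_count(A, mid + 1) >= target:
            hi = mid         # the answer is at mid or to its left
        else:
            lo = mid + 1     # the answer is strictly to the right of mid
    return lo


A = [2, 0, 3, 1, 4, 0, 0, 2]     # hypothetical data, N = 12
print(find_quantile(A, 1, 0.5))  # median
```

The loop makes about log(|U|) probes, which is the number of checks the later analysis accounts for.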

Page 9: Random Subset Sums

► Consider the case j = log(|U|) (single-point intervals)
► Let S be a subset of U
► Each $u \in U$ is in S with probability p = ½
► E(|S|) = ½|U|
► Define: $A_S = \sum_{i \in S} A[i]$
► $E(\|A_S\|) = \tfrac12 \|A\| = \tfrac12 N$

Page 10: Estimating A[i]

$$E[A_S \mid i \in S] = A[i] + \tfrac12 A_{U \setminus \{i\}}$$

$$\begin{aligned}
E(2A_S - \|A\| \mid i \in S) &= 2\,E[A_S \mid i \in S] - \|A\| \\
&= 2\left(A[i] + \tfrac12 A_{U \setminus \{i\}}\right) - \|A\| \\
&= A[i] + A[i] + A_{U \setminus \{i\}} - \|A\| \\
&= A[i]
\end{aligned}$$
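The identity E(2A_S − ||A|| | i ∈ S) = A[i] can be verified exactly on a tiny universe by enumerating every subset S that contains i. This is a brute-force check for illustration, assuming truly uniform random subsets; it is not part of the algorithm, and the function name is made up.

```python
from fractions import Fraction
from itertools import product


def conditional_expectation(A, i):
    """Exact E[2*A_S - ||A|| | i in S] when every other element joins S with p = 1/2."""
    total = sum(A)
    others = [u for u in range(len(A)) if u != i]
    acc = Fraction(0)
    count = 0
    # Enumerate all 2**(|U|-1) equally likely subsets that contain i.
    for bits in product([0, 1], repeat=len(others)):
        a_s = A[i] + sum(A[u] for u, b in zip(others, bits) if b)
        acc += 2 * a_s - total
        count += 1
    return acc / count


A = [3, 1, 4, 1, 5]   # hypothetical counts
print([conditional_expectation(A, i) == A[i] for i in range(len(A))])
```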

Page 11: Improvement

► Instead of keeping random sets of point dyadic intervals only, keep random sets at all resolutions
► We need a method of keeping a random set of j-resolution dyadic intervals (keeping it explicitly is O(|U|))
► Instead of keeping the sets, keep a small representation of them

Page 12: Pseudorandom Set Generator

► We need to keep a small representation of a random set S ($\forall i \in U$: $i \in S$ with p = ½)
► Given a seed of size log(|U|)+1
► Represent a set S of size O(|U|)
► Quickly test whether $i \in S$ or not
► Use the extended Hamming code

Page 13: Extended Hamming Code

► Given a seed, tells whether $i \in S$
► For example: |U| = 8, seed size: log|U|+1 = 4
► G(seed, i) = seed × (i-th column) mod 2
► Efficient to compute
► 3-wise independent

$$\begin{pmatrix} 1 & 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix}
1&1&1&1&1&1&1&1\\
0&0&0&0&1&1&1&1\\
0&0&1&1&0&0&1&1\\
0&1&0&1&0&1&0&1
\end{pmatrix}
\sim \{0, 2, 5, 7\}$$
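The membership test G(seed, i) amounts to a parity computation. The sketch below assumes one particular encoding that the slide does not spell out: the seed is an integer of log(|U|)+1 bits whose most significant bit multiplies the all-ones row, and column i is a 1 followed by the binary digits of i.

```python
def in_set(seed, i, log_u):
    """G(seed, i): inner product mod 2 of the seed with column i of the code matrix.

    Column i is taken to be (1, bits of i), most significant bit first,
    so it equals the integer (1 << log_u) | i.
    """
    column = (1 << log_u) | i
    return bin(seed & column).count("1") % 2 == 1


# Seed bits (1, 1, 0, 1) for |U| = 8.
S = [i for i in range(8) if in_set(0b1101, i, 3)]
print(S)  # [0, 2, 5, 7]
```

Only the seed is stored; each membership query costs a single AND and a bit count.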

Page 14: The Data Structure

► For each resolution level j, keep num_copies random subsets S of all dyadic intervals in that level (we only keep the seed that represents each subset)
► For each subset, keep $A_S = \sum_{i \in S} A[i]$
► Maintain N = ||A||
► This gives $S_1, \dots, S_{\text{num\_copies}}$ per level, where

$$\text{num\_copies} = 24 \log(\log(|U|)/\delta)\,\log(|U|)/\varepsilon^2$$

Page 15: Upon Transactions

► Insert(i) / Delete(i): for each resolution level j
  ► Locate the single $I_{j,k}$ into which i falls (the high-order binary bits of i)
  ► Determine all $S_\ell$ containing $I_{j,k}$
  ► For each such $S_\ell$, increase/decrease $\|A_{S_\ell}\|$ by 1
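The update steps above can be sketched as a small class. Everything here is illustrative: the class name `RSSSketch`, the use of Python's `random` module to draw seeds, and the bit encoding of the extended Hamming code are assumptions, and `num_copies` is passed in rather than derived from ε and δ.

```python
import random


def in_set(seed, k, bits):
    """Parity of seed AND column k, where column k is (1, the `bits` binary digits of k)."""
    return bin(seed & ((1 << bits) | k)).count("1") % 2 == 1


class RSSSketch:
    def __init__(self, log_u, num_copies, rng_seed=0):
        rng = random.Random(rng_seed)
        self.log_u = log_u
        self.n = 0  # N = ||A||
        # One (j + 1)-bit seed and one running sum ||A_S|| per copy and level.
        self.seeds = [[rng.getrandbits(j + 1) for _ in range(num_copies)]
                      for j in range(log_u + 1)]
        self.sums = [[0] * num_copies for _ in range(log_u + 1)]

    def update(self, i, delta):
        """Insert(i) when delta = +1, Delete(i) when delta = -1."""
        self.n += delta
        for j in range(self.log_u + 1):
            k = i >> (self.log_u - j)          # interval I(j, k) containing i
            for copy, seed in enumerate(self.seeds[j]):
                if in_set(seed, k, j):         # does this subset contain I(j, k)?
                    self.sums[j][copy] += delta


sketch = RSSSketch(log_u=3, num_copies=4)
sketch.update(5, +1)
sketch.update(5, -1)
print(sketch.n, sketch.sums)  # back to the empty state
```

Because each update only adds or subtracts 1 from fixed counters, the final state does not depend on the order of insertions and deletions.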

Page 16: Estimating Quantiles: Dyadic Intervals

► Given a dyadic interval $I = I_{j,k}$
► There are num_copies sets of resolution j:

$$\text{num\_copies} = \underbrace{3 \log(\log(|U|)/\delta)}_{G} \cdot \underbrace{8 \log(|U|)/\varepsilon^2}_{E}$$

► Quickly test each $S_\ell$ to check whether $I \in S_\ell$ and, if so, estimate $A_I$ by $2A_{S_\ell} - \|A\|$
► Group all estimates into G groups of E elements
► For each group g, calculate the average $A_{g,j,k}$ of all its estimates

Page 17: Estimating Quantiles: Arbitrary Intervals

► Given an interval I (for quantiles, one starting at 0), write it as a disjoint union of at most log(|U|) dyadic intervals $I_{j,k}$
► Form G groups and, for each group g, sum the averages $A_{g,j,k}$ over all the $I_{j,k}$ comprising I
► Take the median of the G group sums as the final estimate of $A_I$
► It's more convenient to treat the result as an overestimate: $|A_I| \le |\tilde{A}_I| \le |A_I| + \varepsilon N$

Page 18: [Figure: estimating an interval made of 3 dyadic intervals, with E = 4 elements per group and G = 3 groups. Estimates are SUMmed across the dyadic intervals, AVERAGEd within each group, and the MEDIAN of the group averages is the interval's estimate.]
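The sum, average, median pipeline can be sketched as a single function. As a simplifying assumption, every one of the G·E copies is taken to supply an estimate for every dyadic interval; the function name and the sample numbers are invented for illustration.

```python
from statistics import mean, median


def estimate_interval(per_interval, G, E):
    """Combine per-copy estimates into one interval estimate.

    per_interval[d][c] is the estimate of dyadic interval d from copy c,
    with exactly G * E copies per dyadic interval.
    """
    if any(len(row) != G * E for row in per_interval):
        raise ValueError("each dyadic interval needs G * E estimates")
    # SUM: add up the estimates of the dyadic intervals, copy by copy.
    sums = [sum(column) for column in zip(*per_interval)]
    # AVERAGE: mean of the E sums in each of the G groups.
    group_means = [mean(sums[g * E:(g + 1) * E]) for g in range(G)]
    # MEDIAN: the median of the group averages is the final estimate.
    return median(group_means)


# One dyadic interval, G = 3 groups of E = 4 estimates, with one outlier.
print(estimate_interval([[1, 2, 3, 4, 10, 10, 10, 10, 100, 0, 0, 0]], 3, 4))  # 10
```

Averaging reduces the variance within a group, while the median discards groups that an outlier has pulled far from the true value.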

Page 19: Analysis

► Lemma: The algorithm estimates each quantile to within $\varepsilon N$ with probability $p > 1 - \delta$
► Proof: For a fixed resolution level j, let

$$X_k = \begin{cases} 2A_{I_k}, & I_k \in S \\ 0, & \text{otherwise} \end{cases} \qquad X = \sum_k X_k$$

Then:

$$\begin{aligned}
E[X \mid I_{k_0} \in S] &= 2A_{I_{k_0}} + \sum_{k \ne k_0} E[X_k] \\
&= 2A_{I_{k_0}} + \sum_{k \ne k_0} A_{I_k} \\
&= A_{I_{k_0}} + \|A\|
\end{aligned}$$

$$\operatorname{var}[X \mid I_{k_0} \in S] \le \|A\|^2$$

Page 20:

$$E(2A_S - \|A\| \mid I_{j,k} \in S) = A_{I_{j,k}}$$

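Page 21: Analysis (cont.)

Write $I = \bigcup_j I_{j,k_j}$ as a union of $\gamma$ dyadic intervals and let $Y = \sum_j X^{(j)}$, where $X^{(j)}$ is the variable X at resolution level j. Then:

$$E[Y \mid \forall j\ I_{j,k_j} \in S] = A_I + \gamma\|A\|$$

$$A_I = E[Y - \gamma N \mid \forall j\ I_{j,k_j} \in S]$$

$$\operatorname{var}[Y \mid \forall j\ I_{j,k_j} \in S] \le \log|U| \cdot \|A\|^2 = \log|U| \cdot N^2$$

Let Z be the average of $E = 8\log(|U|)/\varepsilon^2$ copies of $Y - \gamma N$. Then:

$$P[|Z - A_I| > \varepsilon N] \le \frac{\operatorname{var}(Z)}{\varepsilon^2 N^2} = \frac{\operatorname{var}(Y)}{E\,\varepsilon^2 N^2} \le \frac{\log|U| \cdot N^2}{\varepsilon^2 N^2 \cdot 8\log|U|/\varepsilon^2} = \frac18$$

$$P[|Z - A_I| \le \varepsilon N] \ge \frac78$$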

Page 22: Analysis (cont.)

► We take $G = 3\log(\log(|U|)/\delta)$ copies of Z and take the median $m_Z$
► By the Chernoff inequality,

$$P[|m_Z - A_I| \le \varepsilon N] \ge 1 - \delta/\log|U|$$

► The binary search looked for a $j_k$ such that

$$\tilde{A}[0,j_k) < k\Phi N \le \tilde{A}[0,j_k+1)$$

$$A[0,j_k) \le \tilde{A}[0,j_k) < k\Phi N \le \tilde{A}[0,j_k+1) \le A[0,j_k+1) + \varepsilon N$$

► We made log|U| checks in the binary search
► The probability that any of them failed is at most log|U| times the failure probability of a single check, i.e. at most $\delta$

Page 23: RSS Properties

► The algorithm may return a quantile value which was not seen in the input
► Changing the order of insertions and deletions doesn't affect the results
► The RSSs are composable: U can be split into many disjoint ranges, given some pre-agreed common random subsets

Page 24: Extension: U Is Unknown

► Predict a range [0, u-1] for U
► Upon insertion of i > u-1, add another instance of RSS with range [u, u^2-1], and so on…
► Because RSS is composable, we only have to join the results upon query
► Increased cost factor: log2 log(|U|)
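One way to read the range-growing rule is that each new instance squares the upper bound of the previous one. The sketch below rests on that assumption: the slide only gives the first two ranges, so the third range [u^2, u^4-1] and the function name `instance_index` are hypothetical.

```python
def instance_index(i, u):
    """Index of the RSS instance whose range contains value i.

    Instance 0 covers [0, u-1], instance 1 covers [u, u**2 - 1], and
    (by assumption) instance 2 covers [u**2, u**4 - 1], and so on.
    Requires u >= 2 and i >= 0.
    """
    index, upper = 0, u
    while i >= upper:
        upper *= upper   # square the bound to open the next range
        index += 1
    return index


print([instance_index(i, 4) for i in (3, 4, 15, 16, 255, 256)])
```

Under this reading the number of instances grows doubly logarithmically in the largest value seen.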

Page 25: Experiments

► What is the median length of all active AT&T calls?
► When a call
  ► starts: add its start timestamp
  ► ends: delete its start timestamp
► 4 KB used for RSS
► Compared: RSS, GK, GK2

Page 26: [Figure: Number of Active Phone Calls Over Time]

Page 27: [Figure: Error in Computation of Median Over Time]

Page 28: [Figure: Average Error for Last 50 Snapshots, For Deciles]

Page 29: The End