2006/6/6 1 Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream Xuemin Lin, Hongjun Lu, Jian Xu, Jeffrey Xu Yu ICDE2004

Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream


Page 1: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Xuemin Lin, Hongjun Lu, Jian Xu, Jeffrey Xu Yu

ICDE2004

Page 2: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Outline

• Motivation

• Problem definition

• Quantile Sketch

• Sliding window model

• n-of-N model

• Conclusion

Page 3: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Motivation

• Data elements seen early could be outdated, so quantile summaries for the most recently seen data elements are more important.

• Example:
– The top-ranked Web pages among the most recently accessed N pages should produce more accurate predictions than the top-ranked pages among all Web pages accessed so far, since users' interests are changing.

Page 4: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Problem Definitions

• ψ-Quantile: A ψ-quantile ($\psi \in (0,1]$) of an ordered sequence of N data elements is the element with rank $\lceil \psi N \rceil$.

• Quantile Query: Given ψ, find the data element with rank $\lceil \psi N \rceil$ among all elements in the stream.
– Variation: the N most recent elements (sliding window model).

Page 5: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15

12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3

Sorted: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12

N = 16

0.5-quantile returns the element ranked 8 (0.5 × 16), which is 8

0.75-quantile returns the element ranked 12 (0.75 × 16), which is 10
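The exact quantile in this example can be reproduced with a few lines of Python. This is only a sketch of the definition (sort, then pick the element at rank ⌈ψN⌉); the function name `exact_quantile` is an illustrative choice, not something from the slides.

```python
import math

def exact_quantile(elements, psi):
    """Return the element with rank ceil(psi * N); ranks start at 1."""
    ordered = sorted(elements)
    rank = math.ceil(psi * len(ordered))
    return ordered[rank - 1]

# The 16 stream values t0..t15 from the slide.
stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]

print(exact_quantile(stream, 0.5))   # element ranked 8
print(exact_quantile(stream, 0.75))  # element ranked 12
```

Sorting the whole sequence needs space linear in N, which is what the summaries on the later slides are meant to avoid.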

Page 6: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Three Different Models

• Data stream model

– Computing ψ-quantile for all data items seen so far

t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15

12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3

Sorted at t11: 1 6 7 8 9 10 10 10 11 11 11 12

Sorted at t15: 1 2 3 4 5 6 7 8 9 10 10 10 11 11 11 12

0.5-quantile returns 10 at time t11

0.5-quantile returns 8 at time t15

Page 7: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Three Different Models (contd.)

• Sliding window model

– Computing ψ-quantile against the N most recent elements in a data stream seen so far

t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15

12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3

Sorted window at t11 (t0 to t11): 1 6 7 8 9 10 10 10 11 11 11 12

Sorted window at t15 (t4 to t15): 1 2 3 4 5 6 7 8 9 10 11 11

Window size = 12, 0.5-quantile returns 10 at time t11

0.5-quantile returns 6 at time t15

Page 8: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Three Different Models (contd.)

• n-of-N model

– For any n ≤ N, computing ψ-quantile among the n most recent elements in a data stream seen so far

t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15

12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3

Sorted at t11 for n = 8 (t4 to t11): 1 6 7 8 9 10 11 11

Sorted at t15 for n = 4 (t12 to t15): 2 3 4 5

N = 12, 0.5-quantile returns 8 at time t11 for n = 8,

0.5-quantile returns 3 at time t15 for n = 4
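The three models differ only in which suffix of the stream is queried. The following sketch computes exact answers for each model on the slide's 16 values; the function names are illustrative assumptions, and no summary structure is involved yet.

```python
import math

def quantile(elements, psi):
    """Exact psi-quantile: the element with rank ceil(psi * len(elements))."""
    ordered = sorted(elements)
    return ordered[math.ceil(psi * len(ordered)) - 1]

def data_stream_model(stream, t, psi):
    # All items seen up to and including time t.
    return quantile(stream[: t + 1], psi)

def sliding_window_model(stream, t, psi, N):
    # Only the N most recent items seen at time t.
    return quantile(stream[: t + 1][-N:], psi)

def n_of_N_model(stream, t, psi, N, n):
    # Any n <= N most recent items seen at time t.
    if n > N:
        raise ValueError("n must not exceed N")
    return quantile(stream[: t + 1][-n:], psi)

stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]

print(data_stream_model(stream, 11, 0.5), data_stream_model(stream, 15, 0.5))
print(sliding_window_model(stream, 11, 0.5, 12), sliding_window_model(stream, 15, 0.5, 12))
print(n_of_N_model(stream, 11, 0.5, 12, 8), n_of_N_model(stream, 15, 0.5, 12, 4))
```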

Page 9: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

ε-approximate

• A quantile summary for a data sequence of N elements is ε-approximate if, for any given rank r, it returns a value whose rank r' is guaranteed to be within the interval [r − εN, r + εN].

• For the 16-element example, a 0.25-approximate 0.5-quantile returns one of the elements in {4,5,6,7,8,9,10}.

• Example: For a data stream with 100 elements, a 0.5-quantile query with ε = 0.1 returns a value v whose true rank is within [40, 60].
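To make the definition concrete, the sketch below enumerates every value an ε-approximate summary is allowed to return for a given ψ. It works on the raw data, so it only illustrates the guarantee; `acceptable_answers` is a hypothetical helper name.

```python
import math

def acceptable_answers(elements, psi, eps):
    """Distinct values whose rank lies within [r - eps*N, r + eps*N], r = ceil(psi*N)."""
    ordered = sorted(elements)
    N = len(ordered)
    r = math.ceil(psi * N)
    lo = max(1, math.ceil(r - eps * N))   # lowest admissible rank
    hi = min(N, math.floor(r + eps * N))  # highest admissible rank
    return sorted(set(ordered[lo - 1 : hi]))

stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
print(acceptable_answers(stream, 0.5, 0.25))
```

For the 16-element stream this covers ranks 4 to 12, which corresponds to the set listed on the slide.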

Page 10: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Quantile Sketch

• Data structure

– { $(v_i, r_i^-, r_i^+)$ : $1 \le i \le m$ }

– A value $v_i$ is one of the elements seen so far

– $r_i^-$ is the lower bound on the rank of $v_i$

– $r_i^+$ is the upper bound on the rank of $v_i$

– $v_i \le v_{i+1}$, for $1 \le i \le m-1$

– $r_i^- \le r_{i+1}^-$, for $1 \le i \le m-1$

– $r_i^- \le r_i \le r_i^+$, where $r_i$ is the rank of $v_i$

Page 11: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

The Summary Data Structure

• Given $g_i = r_i^- - r_{i-1}^-$ and $\Delta_i = r_i^+ - r_i^-$:

– $r_i^- = \sum_{j \le i} g_j$

– $r_i^+ = \sum_{j \le i} g_j + \Delta_i$

• $v_1$ and $v_m$ always correspond to the minimum and the maximum elements seen so far.

Page 12: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Example

t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15

12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3

Quantile sketch consisting of 6 tuples

{(1,1,1), (2,2,9), (3,3,10), (5,4,10), (10,10,10), (12,16,16)}
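As a sanity check on this example, the sketch below verifies that each tuple's rank bounds bracket a true rank of its value, and converts the tuples to the (v, g, Δ) form of the summary data structure. It assumes $r_0^- = 0$ for the first tuple; the helper names are illustrative.

```python
def true_rank_range(elements, v):
    """Smallest and largest rank occupied by value v in the sorted data."""
    ordered = sorted(elements)
    first = ordered.index(v) + 1
    last = len(ordered) - ordered[::-1].index(v)
    return first, last

def is_valid_sketch(sketch, elements):
    """Each tuple (v, r_lo, r_hi) must bracket some true rank of v."""
    for v, r_lo, r_hi in sketch:
        first, last = true_rank_range(elements, v)
        if last < r_lo or first > r_hi:
            return False
    return True

def to_g_delta(sketch):
    """Convert (v, r_lo, r_hi) tuples to (v, g, delta), taking r_0^- = 0 (assumption)."""
    out, prev_lo = [], 0
    for v, r_lo, r_hi in sketch:
        out.append((v, r_lo - prev_lo, r_hi - r_lo))
        prev_lo = r_lo
    return out

stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
sketch = [(1, 1, 1), (2, 2, 9), (3, 3, 10), (5, 4, 10), (10, 10, 10), (12, 16, 16)]

print(is_valid_sketch(sketch, stream))
print(to_g_delta(sketch))
```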

Page 13: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

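ε-approximate sketch

• Theorem: Suppose a sketch S satisfies

– 1. $r_1^+ \le \epsilon N + 1$,

– 2. $r_m^- \ge (1-\epsilon) N$,

– 3. $r_i^+ \le r_{i-1}^- + 2\epsilon N$ for $2 \le i \le m$.

• Then sketch S is ε-approximate. That is, for each $\psi \in (0,1]$, there is a $(v_i, r_i^-, r_i^+)$ in S such that $\lceil \psi N \rceil - \epsilon N \le r_i^- \le r_i^+ \le \lceil \psi N \rceil + \epsilon N$.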
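The three conditions of the theorem are easy to test mechanically. The sketch below checks them for a list of $(v_i, r_i^-, r_i^+)$ tuples; `satisfies_theorem` is a hypothetical name, and the example input is the 6-tuple sketch used on the neighbouring slides.

```python
def satisfies_theorem(sketch, N, eps):
    """Check the three sufficient conditions for an eps-approximate sketch."""
    # Condition 1: upper rank bound of the first tuple.
    if sketch[0][2] > eps * N + 1:
        return False
    # Condition 2: lower rank bound of the last tuple.
    if sketch[-1][1] < (1 - eps) * N:
        return False
    # Condition 3: r_i^+ <= r_{i-1}^- + 2*eps*N for consecutive tuples.
    for (_, prev_lo, _), (_, _, hi) in zip(sketch, sketch[1:]):
        if hi > prev_lo + 2 * eps * N:
            return False
    return True

sketch = [(1, 1, 1), (2, 2, 9), (3, 3, 10), (5, 4, 10), (10, 10, 10), (12, 16, 16)]
print(satisfies_theorem(sketch, N=16, eps=0.25))
print(satisfies_theorem(sketch, N=16, eps=0.1))
```

With ε = 0.25 the bound 2εN is 8 and every gap fits; with ε = 0.1 the second tuple already violates condition 3.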

Page 14: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Query

t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15

12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3

Quantile sketch consisting of 6 tuples, ε = 0.25

{(1,1,1), (2,2,9), (3,3,10), (5,4,10), (10,10,10), (12,16,16)}

The 0.5-quantile query asks for the $v_i$ of rank r = 8, with εN = 4

Find the first tuple that satisfies $r - \epsilon N = 8 - 4 \le r_i^- \le r_i^+ \le 8 + 4 = r + \epsilon N$, and return $v_i$

(5,4,10) => return 5
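The query rule on this slide amounts to a linear scan over the sketch. A minimal sketch, assuming the tuples are stored in value order as on the slide (`query` is an illustrative name):

```python
import math

def query(sketch, N, psi, eps):
    """Return v_i of the first tuple with r - eps*N <= r_i^- <= r_i^+ <= r + eps*N."""
    r = math.ceil(psi * N)
    for v, r_lo, r_hi in sketch:
        if r - eps * N <= r_lo and r_hi <= r + eps * N:
            return v
    return None  # cannot happen for an eps-approximate sketch

sketch = [(1, 1, 1), (2, 2, 9), (3, 3, 10), (5, 4, 10), (10, 10, 10), (12, 16, 16)]
print(query(sketch, N=16, psi=0.5, eps=0.25))
```

The first three tuples are rejected because their lower bounds fall below 4, so the scan stops at (5,4,10).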

Page 15: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Dilemma

• Memory is bounded

• GK-algorithm space requirement: $O\left(\frac{1}{\epsilon}\log(\epsilon N)\right)$

Page 16: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

One-Pass summary for sliding windows

• Continuously divide a stream into buckets based on the arrival order of data elements

• The capacity of each bucket is $\lfloor \epsilon N / 2 \rfloor$

• For each bucket, we maintain an ε/4-approximate sketch continuously by the GK-algorithm

• Once a bucket is full, its ε/4-approximate sketch is compressed into an ε/2-approximate sketch

• The oldest bucket is expired if the total number of elements currently is N+1

Page 17: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

[Figure: the most recent N elements are divided into buckets of $\lfloor \epsilon N / 2 \rfloor$ elements each. The oldest bucket is the expired bucket, each full bucket holds a compressed ε/2-approximate sketch, and the current bucket is maintained by GK.]

Page 18: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Example

N = 8, ε = 1, $\lfloor \epsilon N / 2 \rfloor$ = 4

[Figure: elements 1 2 3 4 | 5 6 7 8 | 9. The current bucket keeps an ε/4-approximate sketch; when it is full, the sketch is compressed into an ε/2-approximate sketch. When element 9 arrives, the bucket holding 1 2 3 4 expires.]
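The bucket lifecycle in this example (fill, start a new bucket, expire the oldest) can be simulated as follows. This is a simplified sketch: each bucket stores its raw elements instead of a GK sketch, and the compress step is only marked by a comment. The function name `simulate_buckets` is an assumption.

```python
import math
from collections import deque

def simulate_buckets(stream, N, eps):
    """Partition a stream into buckets of floor(eps*N/2) elements, expiring the oldest."""
    capacity = math.floor(eps * N / 2)
    buckets = deque()  # oldest bucket first
    total = 0          # elements currently held in all buckets
    for x in stream:
        if not buckets or len(buckets[-1]) == capacity:
            # The previous bucket is full (the slides compress its sketch here);
            # start a new current bucket.
            buckets.append([])
        buckets[-1].append(x)
        total += 1
        if total == N + 1:
            # Expire the oldest bucket as a whole.
            total -= len(buckets.popleft())
    return [list(b) for b in buckets]

print(simulate_buckets(range(1, 10), N=8, eps=1))  # [[5, 6, 7, 8], [9]]
```

After element 9 only five elements remain, although the window should hold eight (2 to 9).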

Page 19: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Compress

• Compress an ε/2-approximate sketch into an ε-approximate sketch

• Memory space is at most $\lceil 1/\epsilon \rceil + 2$ tuples

Page 20: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Merge

• There are h data streams $D_i$, and each $D_i$ has $N_i$ data elements. Suppose each $S_i$ is an ε-approximate sketch of $D_i$.

• $S_{merge}$ is a sketch of $\bigcup_{i=1}^{h} D_i$

• $|S_{merge}| = \left|\bigcup_{i=1}^{h} S_i\right|$

• Suppose each $S_i$ is an ε-approximate sketch. Then $S_{merge}$ is also an ε-approximate sketch


Page 22: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Another Problem

ε = 1 and N = 8

1, 2, 3, 4 (expired) | 5, 6, 7, 8 | 9 (current)

ε/2-approximate sketches: (5,1,1), (7,3,3) for the bucket 5, 6, 7, 8 and (9,1,1) for the current bucket

The first tuple in $S_{merge}$ is (5, 1, 3.5), but the rank of 5 among the 8 most recent elements is 4. $S_{merge}$ is not an ε/2-approximate sketch

Page 23: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Lift

• To solve the previous problem, we use a "lift" operation to lift the value of $r_i^+$ by $\lfloor \epsilon N / 2 \rfloor$ for each tuple i

• If S is an ε/2-approximate sketch, then $S_{lift}$ is an ε-approximate sketch
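A minimal sketch of the lift operation, assuming it raises each upper rank bound $r_i^+$ by $\lfloor \epsilon N / 2 \rfloor$. The input tuples below are hypothetical exact ranks among the stored elements 5 to 9 for N = 8, ε = 1; they are illustrative values, not the $S_{merge}$ from the slides.

```python
import math

def lift(sketch, N, eps):
    """Raise every upper rank bound by floor(eps*N/2); lower bounds are unchanged."""
    shift = math.floor(eps * N / 2)
    return [(v, r_lo, r_hi + shift) for v, r_lo, r_hi in sketch]

# Hypothetical sketch over the stored elements 5..9 (exact ranks among those five).
stored = [(5, 1, 1), (7, 3, 3), (9, 5, 5)]
lifted = lift(stored, N=8, eps=1)
print(lifted)  # [(5, 1, 5), (7, 3, 7), (9, 5, 9)]

# True ranks in the real window 2..9: value 5 -> 4, 7 -> 6, 9 -> 8.
window = list(range(2, 10))
for v, r_lo, r_hi in lifted:
    true_rank = sorted(window).index(v) + 1
    print(v, r_lo <= true_rank <= r_hi)
```

The up to $\lfloor \epsilon N / 2 \rfloor$ elements lost with the expired bucket can only push true ranks upward, which is why widening the upper bound is enough in this illustration.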

Page 24: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

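Query

Step 1. Merge the local sketches of the buckets (each holding $\lfloor \epsilon N / 2 \rfloor$ elements) and of the current bucket into $S_{merge}$

Step 2. Lift $S_{merge}$ to obtain $S_{lift}$

Step 3. For a given rank $r = \lceil \psi N \rceil$, find the first tuple $(v_i, r_i^-, r_i^+)$ in $S_{lift}$ such that $r - \epsilon N \le r_i^- \le r_i^+ \le r + \epsilon N$, and return $v_i$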

Page 25: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

One-Pass Summary under n-of-N

• EH partitioning technique
– EH maintains at most $1/\lambda + 1$ "i-buckets" for each i

[Figure: EH partitioning of elements e1, …, e10 with λ = 1/2. When the number of 1-buckets reaches 4, 1-buckets are merged into a 2-bucket; when the number of 2-buckets reaches 4, 2-buckets are merged.]

For N elements, the number of buckets in EH is always $O\left(\frac{\log(\lambda N)}{\lambda}\right)$
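The merging rule of the EH partition can be sketched by tracking bucket sizes only. The code below assumes the standard exponential-histogram rule (each new element starts a 1-bucket, and the two oldest i-buckets are merged once their count exceeds the limit) and takes the limit as ⌈1/λ⌉ + 1; expiry of old buckets is left out. `eh_insert` is a hypothetical name.

```python
import math

def eh_insert(buckets, lam):
    """Add one element to an EH given as a list of bucket sizes, oldest first."""
    limit = math.ceil(1 / lam) + 1  # at most this many i-buckets per size i
    buckets.append(1)               # the new element forms a 1-bucket
    size = 1
    while buckets.count(size) > limit:
        # Sizes never increase from oldest to newest, so the two oldest
        # buckets of this size are adjacent; merge them into one 2*size bucket.
        oldest = buckets.index(size)
        buckets[oldest:oldest + 2] = [2 * size]
        size *= 2
    return buckets

buckets = []
for _ in range(10):
    eh_insert(buckets, lam=0.5)
print(buckets)  # [4, 2, 2, 1, 1]
```

With λ = 1/2 the limit is 3, so the fourth 1-bucket triggers a merge, and at the tenth element the fourth 2-bucket triggers a second, cascading merge.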

Page 26: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Sketch Construction

• Use the EH technique to partition a data stream

• Maintain a sketch $S_b$ for each bucket b

• Choose $\lambda = \frac{\epsilon}{\epsilon + 2}$

• Maintain each $S_b$ as an ε/2-approximate sketch

Page 27: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Example

• Construct a sketch $S_b$ for each bucket b to summarize the data elements from the earliest element in b up to now

[Figure: with λ = 1/2, buckets f, e, d, c, b, a (a 4-bucket, 2-buckets and 1-buckets) carry sketches $S_f$, $S_e$, $S_d$, $S_c$, $S_b$, $S_a$; after a new element arrives, a new bucket g with sketch $S_g$ is added.]

Page 28: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

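n-of-N Query

Step 1. [Figure: buckets f, e, d, c, b, a (4-bucket, 2-bucket, 1-bucket) with sketches $S_f$, $S_e$, $S_d$, $S_c$, $S_b$, $S_a$; the n most recent elements are marked against the buckets.]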

Page 29: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

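n-of-N Query

Step 2. Lift $S_e$ by $\lfloor \epsilon N / 2 \rfloor$ to obtain $S_{lift}$

Step 3. For a given rank r, find the first tuple $(v_i, r_i^-, r_i^+)$ in $S_{lift}$ such that $r - \epsilon n \le r_i^- \le r_i^+ \le r + \epsilon n$, and return $v_i$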

Page 30: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Conclusions

• The work presented is among the attempts to develop space-efficient, one-pass, deterministic quantile summary algorithms with performance guarantees under the sliding window model of data streams.