2006/6/6 1
Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream
Xuemin Lin, Hongjun Lu, Jian Xu, Jeffrey Xu Yu
ICDE 2004
Outline
• Motivation
• Problem definition
• Quantile Sketch
• Sliding window model
• n-of-N model
• Conclusion
Motivation
• Data elements seen early could be outdated and quantile summaries for the most recently seen data elements are more important.
• Example:
– The top-ranked Web pages among the N most recently accessed pages should reflect users' current interests more accurately than the top-ranked pages among all pages accessed so far, because users' interests are changing.
Problem Definitions
• ψ-quantile: A ψ-quantile (ψ ∈ (0,1]) of an ordered sequence of N data elements is the element with rank ⌈ψN⌉.
• Quantile Query: Given ψ, find the data element with rank ⌈ψN⌉ among all elements in the stream.
– Variation: the N most recent elements (sliding window model).
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3
sort
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12
N = 16
0.5-quantile returns the element ranked 8 (0.5 × 16), which is 8
0.75-quantile returns the element ranked 12 (0.75 × 16), which is 10
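The example above can be reproduced with a short exact computation. This is a minimal sketch, not part of the original slides: the function name `exact_quantile` is hypothetical, and rounding the rank up with a ceiling is an assumption for ranks that are not whole numbers.

```python
import math

def exact_quantile(elements, psi):
    """Return the element of rank ceil(psi * N) in sorted order (ranks start at 1)."""
    ordered = sorted(elements)
    rank = math.ceil(psi * len(ordered))
    return ordered[rank - 1]

# The 16 elements t0..t15 from the slide.
stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
median = exact_quantile(stream, 0.5)           # rank 8  -> 8
third_quartile = exact_quantile(stream, 0.75)  # rank 12 -> 10
```

Sorting the whole input needs memory proportional to N, which is what the summaries on the following slides are meant to avoid.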
Three Different Models
• Data stream model
– Compute the ψ-quantile over all data items seen so far
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3
Sorted at t11: 1 6 7 8 9 10 10 10 11 11 11 12
Sorted at t15: 1 2 3 4 5 6 7 8 9 10 10 10 11 11 11 12
0.5-quantile returns 10 at time t11
0.5-quantile returns 8 at time t15
Three Different Models (contd.)
• Sliding window model
– Compute the ψ-quantile over the N most recent elements of the data stream seen so far
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3
Sorted window at t11: 1 6 7 8 9 10 10 10 11 11 11 12
Sorted window at t15: 1 2 3 4 5 6 7 8 9 10 11 11
Window size N = 12: 0.5-quantile returns 10 at time t11
0.5-quantile returns 6 at time t15
Three Different Models (contd.)
• n-of-N model
– For any n ≤ N, compute the ψ-quantile over the n most recent elements of the data stream seen so far
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3
Sorted, n = 8 most recent elements at t11: 1 6 7 8 9 10 11 11
Sorted, n = 4 most recent elements at t15: 2 3 4 5
N = 12: 0.5-quantile returns 8 at time t11 for n = 8,
0.5-quantile returns 3 at time t15 for n = 4
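The three models differ only in which suffix of the stream is queried. The following is an illustrative exact computation, with the hypothetical helper `quantile_of_recent` and 0-based time indices chosen here for convenience. It reproduces the numbers on the three slides above.

```python
import math

def quantile_of_recent(stream, t, n, psi):
    """Exact psi-quantile of the n most recent elements at time t (0-based, n >= 1)."""
    seen = stream[: t + 1]
    window = sorted(seen[-n:])  # if n exceeds the elements seen, this is all of them
    return window[math.ceil(psi * len(window)) - 1]

stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]

data_stream_t15 = quantile_of_recent(stream, 15, 16, 0.5)  # all 16 elements -> 8
sliding_t15 = quantile_of_recent(stream, 15, 12, 0.5)      # window N = 12   -> 6
n_of_N_t11 = quantile_of_recent(stream, 11, 8, 0.5)        # n = 8           -> 8
n_of_N_t15 = quantile_of_recent(stream, 15, 4, 0.5)        # n = 4           -> 3
```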
ε-approximate
• A quantile summary for a data sequence of N elements is ε-approximate if, for any given rank r, it returns a value whose rank r' is guaranteed to be within the interval [r − εN, r + εN].
• Example: over the 16 elements used earlier, a 0.25-approximate 0.5-quantile returns one of the elements in {4,5,6,7,8,9,10}.
• Example: for a data stream with 100 elements, a 0.5-quantile query with ε = 0.1 returns a value v whose true rank is within [40,60].
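To make the tolerance concrete, the set of answers an ε-approximate summary may return can be enumerated directly. This is a minimal sketch under the definition above, and `acceptable_answers` is a hypothetical name.

```python
import math

def acceptable_answers(elements, psi, eps):
    """Values whose rank lies in [r - eps*N, r + eps*N], with r = ceil(psi*N)."""
    ordered = sorted(elements)
    n = len(ordered)
    r = math.ceil(psi * n)
    lowest = max(1, math.ceil(r - eps * n))    # clamp to valid ranks 1..N
    highest = min(n, math.floor(r + eps * n))
    return set(ordered[lowest - 1 : highest])

stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
answers = acceptable_answers(stream, 0.5, 0.25)  # ranks 4..12 -> {4, 5, 6, 7, 8, 9, 10}
```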
Quantile Sketch
• Data structure: $\{(v_i, r_i^-, r_i^+) : 1 \le i \le m\}$
– A value $v_i$ is one of the elements seen so far
– $r_i^-$ is the lower bound on the rank of $v_i$
– $r_i^+$ is the upper bound on the rank of $v_i$
– $v_i \le v_{i+1}$, for $1 \le i \le m - 1$
– $r_i^- \le r_{i+1}^-$, for $1 \le i \le m - 1$
– $r_i^- \le r_i \le r_i^+$, where $r_i$ is the rank of $v_i$
The Summary Data Structure
• Given $g_i = r_i^- - r_{i-1}^-$ (with $r_0^- = 0$) and $\Delta_i = r_i^+ - r_i^-$:
– $r_i^- = \sum_{j \le i} g_j$
– $r_i^+ = \sum_{j \le i} g_j + \Delta_i$
• v1 and vm always correspond to the minimum and the maximum elements seen so far.
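The two encodings carry the same information, as a small conversion shows. This is an illustrative sketch only: `to_rank_bounds` is a hypothetical helper, and the `g` and `delta` lists were derived here from the six-tuple sketch used in the example slides, not taken from the source.

```python
from itertools import accumulate

def to_rank_bounds(values, g, delta):
    """Turn (v_i, g_i, Delta_i) into (v_i, r_i^-, r_i^+) using prefix sums of g."""
    lower = list(accumulate(g))  # r_i^- = g_1 + ... + g_i
    return [(v, lo, lo + d) for v, lo, d in zip(values, lower, delta)]

values = [1, 2, 3, 5, 10, 12]
g = [1, 1, 1, 1, 6, 6]       # differences of consecutive lower bounds
delta = [0, 7, 7, 6, 0, 0]   # upper bound minus lower bound
sketch = to_rank_bounds(values, g, delta)
# [(1, 1, 1), (2, 2, 9), (3, 3, 10), (5, 4, 10), (10, 10, 10), (12, 16, 16)]
```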
Example
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3
Quantile sketch consisting of 6 tuples
{(1,1,1), (2,2,9), (3,3,10), (5,4,10), (10,10,10), (12,16,16)}
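One way to sanity-check such a sketch is to compare each tuple against the true ranks in the data. The helper names below are hypothetical. For a repeated value such as 10, the check assumes it is enough that one of its ranks falls inside the stated bounds.

```python
def rank_range(elements, v):
    """Smallest and largest 1-based rank of v in sorted order (they differ for duplicates)."""
    ordered = sorted(elements)
    first = ordered.index(v) + 1
    last = len(ordered) - ordered[::-1].index(v)
    return first, last

def is_valid_sketch(elements, sketch):
    """True if every tuple (v, r_lo, r_hi) brackets at least one true rank of v."""
    for v, r_lo, r_hi in sketch:
        first, last = rank_range(elements, v)
        if last < r_lo or first > r_hi:
            return False
    return True

stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
sketch = [(1, 1, 1), (2, 2, 9), (3, 3, 10), (5, 4, 10), (10, 10, 10), (12, 16, 16)]
valid = is_valid_sketch(stream, sketch)  # True
```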
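ε-approximate sketch
• Theorem: if a sketch S satisfies
– 1. $r_1^+ \le \varepsilon N + 1$,
– 2. $r_m^- \ge (1 - \varepsilon) N$,
– 3. $r_i^+ \le r_{i-1}^- + 2\varepsilon N$ for $2 \le i \le m$,
• then sketch S is ε-approximate. That is, for each ψ ∈ (0,1], there is a $(v_i, r_i^-, r_i^+)$ in S such that $\lceil \psi N \rceil - \varepsilon N \le r_i^- \le r_i^+ \le \lceil \psi N \rceil + \varepsilon N$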
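The three conditions are easy to test mechanically. The sketch below applies them to the six-tuple example from the earlier slide. The function name `satisfies_theorem` is hypothetical.

```python
def satisfies_theorem(sketch, n, eps):
    """Check the three sufficient conditions for an eps-approximate sketch."""
    first_ok = sketch[0][2] <= eps * n + 1        # r_1^+ <= eps*N + 1
    last_ok = sketch[-1][1] >= (1 - eps) * n      # r_m^- >= (1 - eps)*N
    gaps_ok = all(
        cur[2] <= prev[1] + 2 * eps * n           # r_i^+ <= r_{i-1}^- + 2*eps*N
        for prev, cur in zip(sketch, sketch[1:])
    )
    return first_ok and last_ok and gaps_ok

sketch = [(1, 1, 1), (2, 2, 9), (3, 3, 10), (5, 4, 10), (10, 10, 10), (12, 16, 16)]
ok_quarter = satisfies_theorem(sketch, 16, 0.25)  # True
ok_tenth = satisfies_theorem(sketch, 16, 0.1)     # False: (2, 2, 9) violates condition 3
```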
Query
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15
12 10 11 10 1 10 11 9 6 7 8 11 4 5 2 3
Quantile sketch consisting of 6 tuples, ε = 0.25
{(1,1,1), (2,2,9), (3,3,10), (5,4,10), (10,10,10), (12,16,16)}
0.5-quantile: return the $v_i$ of rank r = 8, with εN = 4
Find the first tuple that satisfies $r - \varepsilon N \le r_i^- \le r_i^+ \le r + \varepsilon N$, here $8 - 4 \le r_i^- \le r_i^+ \le 8 + 4$, and return $v_i$
(5,4,10) => return 5
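The lookup rule on this slide amounts to a linear scan over the tuples. This is a minimal sketch, and `query` is a hypothetical name. Returning `None` when no tuple qualifies is a choice made here, since the slides do not cover that case.

```python
import math

def query(sketch, n, eps, psi):
    """Return v_i of the first tuple with r - eps*N <= r_i^- and r_i^+ <= r + eps*N."""
    r = math.ceil(psi * n)
    for v, r_lo, r_hi in sketch:
        if r - eps * n <= r_lo and r_hi <= r + eps * n:
            return v
    return None  # no qualifying tuple (cannot happen for an eps-approximate sketch)

sketch = [(1, 1, 1), (2, 2, 9), (3, 3, 10), (5, 4, 10), (10, 10, 10), (12, 16, 16)]
answer = query(sketch, 16, 0.25, 0.5)  # first match is (5, 4, 10) -> 5
```

The returned 5 has true rank 5, which is within εN = 4 of the requested rank 8.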
Dilemma
• Memory is bounded
• GK-algorithm space requirement: $O\left(\frac{1}{\varepsilon}\log(\varepsilon N)\right)$
One-Pass summary for sliding windows
• Continuously divide a stream into buckets based on the arrival order of data elements
• The capacity of each bucket is $\varepsilon N / 2$
• For each bucket, we maintain an $\varepsilon/4$-approximate sketch continuously by the GK-algorithm
• Once a bucket is full, its $\varepsilon/4$-approximate sketch is compressed into an $\varepsilon/2$-approximate sketch
• The oldest bucket is expired if currently the total number of elements is N+1
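The bucket bookkeeping described above can be sketched independently of the sketches themselves. In this simplified illustration each bucket stores its raw elements instead of a GK sketch, purely to show when buckets fill and expire. The function `feed` and its parameters are hypothetical names.

```python
from collections import deque

def feed(buckets, element, n, capacity):
    """Add one element; buckets is a deque of lists, oldest first, last one is current."""
    if not buckets or len(buckets[-1]) == capacity:
        buckets.append([])   # the previous bucket is full: open a new current bucket
    buckets[-1].append(element)
    if sum(len(b) for b in buckets) == n + 1:
        buckets.popleft()    # total reached N + 1: the oldest bucket expires

# N = 8, eps = 1, so the bucket capacity eps*N/2 is 4.
buckets = deque()
for e in range(1, 10):
    feed(buckets, e, n=8, capacity=4)
state = list(buckets)  # [[5, 6, 7, 8], [9]]
```

After the ninth element the first bucket (1, 2, 3, 4) has expired as a whole, even though 2, 3 and 4 still belong to the most recent 8 elements.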
[Figure: the most recent N elements are divided into buckets of $\varepsilon N / 2$ elements each, from the expired bucket to the current bucket. Each full bucket holds a compressed $\varepsilon/2$-approximate sketch; the current bucket is maintained by GK.]
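Example
N = 8, ε = 1, $\varepsilon N / 2$ = 4
[Figure: elements 1 2 3 4 fill the first bucket; once full, its $\varepsilon/4$-approximate sketch is compressed into an $\varepsilon/2$-approximate sketch. Elements 5 6 7 8 fill the second bucket in the same way. When element 9 arrives in a new current bucket, the first bucket expires.]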
Compress
• Compress an $\varepsilon/2$-approximate sketch into an ε-approximate sketch
• Memory space is at most $\lceil 1/\varepsilon \rceil + 2$ tuples
Merge
• There are h data streams $D_i$, and each $D_i$ has $N_i$ data elements. Suppose each $S_i$ is an ε-approximate sketch of $D_i$.
• $S_{merge}$ is a sketch of $\bigcup_{i=1}^{h} D_i$
• $|S_{merge}| = \left|\bigcup_{i=1}^{h} S_i\right|$
• Then $S_{merge}$ is also an ε-approximate sketch
Another Problem
ε = 1 and N = 8
Expired bucket: 1, 2, 3, 4
Bucket: 5, 6, 7, 8
Current bucket: 9
$\varepsilon/2$-approximate sketches: (5,1,1), (7,3,3) and (9,1,1)
• The first tuple in $S_{merge}$ is (5, 1, 3.5), but the rank of 5 is 4. $S_{merge}$ is not an $\varepsilon/2$-approximate sketch
Lift
• To solve the previous problem, we use a "lift" operation to lift the value of $r_i^+$ by $\varepsilon N / 2$ for each tuple i
• If S is an $\varepsilon/2$-approximate sketch, then $S_{lift}$ is an ε-approximate sketch
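The lift operation itself is a one-line transformation. In the sketch below the input `merged` is a made-up three-tuple sketch for buckets 5..8 and 9 with N = 8 and ε = 1. It is not the merged sketch from the slides, and the name `lift` is used only for illustration.

```python
def lift(sketch, n, eps):
    """Raise every upper rank bound r_i^+ by eps*N/2; values and lower bounds are unchanged."""
    shift = eps * n / 2
    return [(v, r_lo, r_hi + shift) for v, r_lo, r_hi in sketch]

# Hypothetical merged sketch (illustrative values only).
merged = [(5, 1, 1), (7, 3, 3), (9, 5, 5)]
lifted = lift(merged, n=8, eps=1)  # [(5, 1, 5.0), (7, 3, 7.0), (9, 5, 9.0)]
```

Among the most recent 8 elements 2..9, the value 5 has rank 4, which the lifted bounds [1, 5] cover while the unlifted bounds [1, 1] do not.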
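Query
Step1. Merge the local sketches into $S_{merge}$
[Figure: buckets of $\varepsilon N / 2$ elements each, up to the current bucket, whose local sketches are merged into $S_{merge}$.]
Step2. Lift $S_{merge}$ to obtain $S_{lift}$
Step3. For a given rank $r = \lceil \psi N \rceil$, find the first tuple $(v_i, r_i^-, r_i^+)$ in $S_{lift}$ such that $r - \varepsilon N \le r_i^- \le r_i^+ \le r + \varepsilon N$, and return $v_i$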
One-Pass Summary under n-of-N
• EH partitioning technique
– EH maintains at most $\lceil 1/\lambda \rceil + 1$ "i-buckets" for each i
[Figure: EH partitioning of elements e1 to e10 with λ = 1/2. When the number of 1-buckets reaches 4, 1-buckets are merged; when the number of 2-buckets reaches 4, 2-buckets are merged.]
• For N elements, the number of buckets in EH is always $O\left(\frac{\log(\lambda N)}{\lambda}\right)$
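The merging behaviour in the figure can be imitated by tracking bucket sizes only. This is a simplified sketch under two assumptions that the slides do not spell out: the limit per size is ⌈1/λ⌉ + 1, and a merge combines the two oldest buckets of the overflowing size. The function name `eh_insert` is hypothetical.

```python
import math

def eh_insert(sizes, lam):
    """sizes: bucket sizes, oldest first. Add one element, then merge while a size overflows."""
    limit = math.ceil(1 / lam) + 1
    sizes.append(1)                   # each new element starts as its own 1-bucket
    size = 1
    while sizes.count(size) > limit:
        i = sizes.index(size)         # oldest bucket of this size
        sizes[i:i + 2] = [2 * size]   # merge it with the next-oldest one of the same size
        size *= 2                     # the merge may overflow the next size class
    return sizes

sizes = []
for _ in range(10):                   # elements e1 .. e10, lambda = 1/2
    eh_insert(sizes, 0.5)
# sizes is now [4, 2, 2, 1, 1]
```

With λ = 1/2 the limit is 3, so the fourth 1-bucket triggers a merge at e4, and the fourth 2-bucket triggers one at e10, matching the two merge events in the figure.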
Sketch Construction
• Use the EH technique to partition a data stream
• Maintain a sketch $S_b$ for each bucket b
• Choose $\lambda = \frac{\varepsilon}{\varepsilon + 2}$
• Maintain an $\varepsilon/2$-approximate sketch for each $S_b$
Example
• Construct a sketch $S_b$ for each bucket b to summarize the data elements from the earliest element in b up to now
[Figure: with λ = 1/2, buckets f, e, d, c, b, a (4-bucket, 2-bucket and 1-bucket classes) with their sketches Sf, Se, Sd, Sc, Sb, Sa; after a new element arrives, a bucket g with sketch Sg is added.]
n-of-N Query
Step1.
[Figure: buckets f, e, d, c, b, a (4-bucket, 2-bucket, 1-bucket) with their sketches Sf to Sa; the n most recent elements are marked against the buckets.]
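n-of-N Query (contd.)
Step2. Lift $S_e$ by $\varepsilon N / 2$ to obtain $S_{lift}$
Step3. For a given rank r, find the first tuple $(v_i, r_i^-, r_i^+)$ in $S_{lift}$ such that $r - \varepsilon n \le r_i^- \le r_i^+ \le r + \varepsilon n$, and return $v_i$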
Conclusions
• The work presented is among the attempts to develop space-efficient, one-pass, deterministic quantile summary algorithms with performance guarantees under the sliding window model of data streams.