CS 361 Lecture 5
Approximate Quantiles and Histograms
9 Oct 2002
Gurmeet Singh Manku
2
Frequency Related Frequency Related Problems ...Problems ...
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Find all elements with frequency > 0.1%
Top-k most frequent elements
What is the frequency of element 3? What is the total frequency
of elements between 8 and 14?
Find elements that occupy 0.1% of the tail.
Mean + Variance?
Median?
How many elements have non-zero frequency?
Types of Histograms ...
• Equi-Depth Histograms
– Idea: Select buckets such that counts per bucket are equal
[Figure: count for bucket vs. domain values 1 to 20]
• V-Optimal Histograms
– Idea: Select buckets to minimize frequency variance within buckets
[Figure: count for bucket vs. domain values 1 to 20]
minimize $\sum_{B} \sum_{v \in B} \left( f_v - \frac{C_B}{V_B} \right)^2$
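The two bucketing ideas can be sketched in a few lines of Python. This is only an illustration: the function names, the inclusive `(low, high)` bucket ranges over an integer domain, and the reading of the objective as squared deviation of each frequency from its bucket's average frequency are assumptions, not taken from the slides. The equi-depth helper also assumes `num_buckets <= len(data)`.

```python
from collections import Counter

def equi_depth_boundaries(data, num_buckets):
    """Upper boundary value of each bucket so that every bucket holds
    roughly len(data) / num_buckets elements (assumes num_buckets <= len(data))."""
    ordered = sorted(data)
    n = len(ordered)
    # The i-th boundary is the element at 1-based rank i * n / num_buckets.
    return [ordered[(i * n) // num_buckets - 1] for i in range(1, num_buckets + 1)]

def within_bucket_variance(data, buckets):
    """Sum over buckets of the squared deviation of each value's frequency
    from the bucket's average frequency (assumed reading of the objective).
    `buckets` is a list of inclusive (low, high) integer ranges."""
    freq = Counter(data)
    total = 0.0
    for low, high in buckets:
        values = range(low, high + 1)
        avg = sum(freq[v] for v in values) / len(values)
        total += sum((freq[v] - avg) ** 2 for v in values)
    return total
```

For a skewed input such as `[1, 1, 1, 1, 2, 3, 4, 5]`, isolating the heavy value in its own bucket drives the objective to zero, while a single bucket over the whole domain does not.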
Histograms: Applications
• One Dimensional Data
– Database Query Optimization [Selinger78]
• Selectivity estimation
– Parallel Sorting [DNS91] [NowSort97]
• Jim Gray’s sorting benchmark
– [PIH96] [Poo97] introduced a taxonomy, algorithms, etc.
• Multidimensional Data
– OLTP: not much use (independent attribute assumption)
– OLAP & Mining: yeah
Finding The Median ...
• Exact median in main memory: O(n) time [BFPRT 73]
• Exact median in one pass: n/2 memory [Pohl 68]
• Exact median in p passes: O(n^(1/p)) memory [MP 80]
– 2 passes: O(sqrt(n)) memory
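For the in-memory case, a minimal sketch of worst-case linear-time selection with median-of-medians pivoting, in the spirit of [BFPRT 73]. The names `select` and `exact_median` and the choice of the lower median for even-length inputs are illustrative assumptions.

```python
def select(items, rank):
    """Return the element of 0-based `rank` in sorted order, using
    median-of-medians pivoting (worst-case linear time)."""
    items = list(items)
    while True:
        if len(items) <= 5:
            return sorted(items)[rank]
        # Take the median of each group of five, then use the median
        # of those medians as the pivot.
        groups = [items[i:i + 5] for i in range(0, len(items), 5)]
        medians = [sorted(g)[len(g) // 2] for g in groups]
        pivot = select(medians, len(medians) // 2)
        smaller = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        if rank < len(smaller):
            items = smaller
        elif rank < len(smaller) + len(equal):
            return pivot
        else:
            rank -= len(smaller) + len(equal)
            items = [x for x in items if x > pivot]

def exact_median(items):
    """Lower median of a non-empty sequence (assumed convention)."""
    items = list(items)
    return select(items, (len(items) - 1) // 2)
```

This needs the whole input in memory, which is exactly what the one-pass and multi-pass results above try to avoid.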
How about an approximate median?
Approximate Medians & Quantiles
$\phi$-Quantile: element with rank $\phi N$, $0 < \phi < 1$ ($\phi = 0.5$ means Median)
$\epsilon$-Approximate $\phi$-quantile: any element with rank $(\phi \pm \epsilon) N$, $0 < \epsilon < 1$
Typical $\epsilon = 0.01$ (1%)
$\epsilon$-approximate median
Multiple equi-spaced $\epsilon$-approximate quantiles = Equi-depth Histogram
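The definition can be checked directly against a data set. The sketch below is illustrative only: the function name, the 1-based rank convention, and the treatment of duplicates (a value occupies a range of ranks) are assumptions.

```python
def is_approx_quantile(data, candidate, phi, eps):
    """True if `candidate` occurs in `data` at some 1-based rank
    between (phi - eps) * N and (phi + eps) * N."""
    n = len(data)
    lowest_rank = sum(1 for x in data if x < candidate) + 1
    highest_rank = sum(1 for x in data if x <= candidate)
    if highest_rank < lowest_rank:
        return False  # candidate does not occur in data
    return lowest_rank <= (phi + eps) * n and highest_rank >= (phi - eps) * n
```

With N = 100 distinct values and eps = 0.01, only elements whose rank is within one position of rank 50 qualify as approximate medians.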
Plan for Today ...
Greenwald-Khanna Algorithm for arbitrary length stream
Munro-Paterson Algorithm for fixed N
Sampling-based Algorithms for arbitrary length stream
Randomized Algorithm for fixed N
Randomized Algorithm for arbitrary length stream
Generalization
Data distribution assumptions ...
Input sequence of ranks is arbitrary.
e.g., warehouse data
Munro-Paterson Algorithm [MP 80]
[Figure: Munro-Paterson [1980] tree of buffer collapses for b = 4, nodes labeled by level 1 to 4]
b = 4
b buffers, each of size k
Memory = bk
Minimize bk subject to following constraints:
Number of elements in leaves = $k 2^b > N$
Max relative error in rank = $b/2k < \epsilon$
$b \approx \log(\epsilon N)$, $k \approx \frac{1}{\epsilon} \log(\epsilon N)$
Memory = $bk = O(\frac{1}{\epsilon} \log^2(\epsilon N))$
How do we collapse two sorted buffers into one? Merge, then pick alternate elements.
Input: N and $\epsilon$
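The collapse step (merge, then pick alternate elements) and the tree of buffers can be sketched as follows. This is a simplified illustration, not the algorithm as published: it assumes the stream length is exactly k times a power of two, always keeps the second element of each merged pair (the slides do not say which alternate elements to keep), and uses hypothetical function names.

```python
import heapq

def collapse(left, right):
    """Merge two sorted buffers of size k and keep alternate elements,
    giving one sorted buffer of size k. Keeping positions 2, 4, 6, ...
    is an assumption."""
    merged = list(heapq.merge(left, right))
    return merged[1::2]

def munro_paterson_median(stream, k):
    """Approximate median of `stream`, assuming its length is k * 2**j.
    At most one full buffer is kept per level, like a binary counter."""
    levels = {}   # level -> buffer waiting for a partner at that level
    chunk = []
    for x in stream:
        chunk.append(x)
        if len(chunk) == k:
            buf, level = sorted(chunk), 0
            chunk = []
            # Collapse upward while a partner buffer exists at this level.
            while level in levels:
                buf = collapse(levels.pop(level), buf)
                level += 1
            levels[level] = buf
    assert not chunk and len(levels) == 1, "sketch assumes length k * 2**j"
    (root,) = levels.values()
    return root[(len(root) - 1) // 2]
```

Because a collapsed buffer is pushed upward as soon as its partner arrives, the number of buffers alive at any time is bounded by the height of the tree, which matches the "b buffers, each of size k" memory accounting.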
Error Propagation ...
Top-down analysis
Depth d: S S S S S ? ? ? ? L L L L L L (x “?” elements)
Merge of the two depth d+1 buffers: S S S S S S S S S S ? ? ? ? ? ? ? ? ? L L L L L L L L L L L (2x+1 “?” elements)
Depth d+1: S S S S S S ? ? ? L L L L L L
Depth d+1: S S S S ? ? ? ? ? ? L L L L L
Number of “?” elements <= 2x+1
Error Propagation at Depth 0 ...
Depth 0: S S S S S S S M L L L L L L L
Merge of the two depth 1 buffers: S S S S S S S S S S S S S S S M L L L L L L L L L L L L L L
Depth 1: S S S S S S S S S S S L L L L
Depth 1: S S S S M L L L L L L L L L L
Error Propagation at Depth 1 ...
Depth 1: S S S S S S S S S S L L L L L
Merge of the two depth 2 buffers: S S S S S S S S S S S S S S S S S S S S ? L L L L L L L L L
Depth 2: S S S S S S S S S S S S L L L
Depth 2: S S S S S S S S ? L L L L L L
Error propagation at Depth 2 ...
Depth 2: S S S S S S S S ? L L L L L L
Merge of the two depth 3 buffers: S S S S S S S S S S S S S S S S ? ? ? L L L L L L L L L L L
Depth 3: S S S S S S S S ? L L L L L L
Depth 3: S S S S S S S S ? ? L L L L L
Error Propagation level by level
[Figure: Munro-Paterson [1980] tree for b = 4, nodes labeled by depth 0 (root) to 3 (leaves), with depth d = 2 marked]
b = 4
b buffers, each of size k
Memory = bk
Number of elements at depth d = k 2^d
Increase in fractional error in rank is 1/2k per level
Let the sum of “?” elements at depth d be X. Then the fraction of “?” elements at depth d is f = X / (k 2^d).
The sum of “?” elements at depth d+1 is at most 2X + 2^d. Then the fraction of “?” elements at depth d+1 is f' <= (2X + 2^d) / (k 2^(d+1)) = f + 1/2k.
Fractional error in rank at depth 0 is 0. Max depth = b. So, total fractional error is <= b/2k.
Constraint 2: $b/2k < \epsilon$
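The level-by-level argument can be checked numerically by iterating the worst case of the recurrence, X at depth d+1 equal to 2X + 2^d, starting from X = 0 at depth 0. The function name is hypothetical; the recurrence and the fraction X / (k 2^d) are the ones stated above.

```python
def question_mark_fractions(k, max_depth):
    """Iterate X_{d+1} = 2 * X_d + 2**d from X_0 = 0 and return the
    fraction X_d / (k * 2**d) for d = 0 .. max_depth."""
    fractions = []
    x = 0
    for d in range(max_depth + 1):
        fractions.append(x / (k * 2 ** d))
        x = 2 * x + 2 ** d
    return fractions
```

With k = 10 the fractions come out as 0, 1/20, 2/20, 3/20, ..., that is d/2k at depth d, which is the 1/2k increase per level used in Constraint 2.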
Generalized Munro-Paterson [MRL 98]
b = 5
How do we collapse buffers with different weights?
Each buffer has a ‘weight’ associated with it.
Generalized Collapse ...
k = 5
Input buffers:
Weight 2: 31 37 6 12 5
Weight 3: 10 35 8 19 13
Weight 1: 28 15 16 25 27
Each element replicated by the weight of its buffer:
31 31 37 37 6 6 12 12 5 5
10 10 10 35 35 35 8 8 8 19 19 19 13 13 13
28 15 16 25 27
Sorted:
5 5 6 6 8 8 8 10 10 10 12 12 13 13 13 15 16 19 19 19 25 27 28 31 31 35 35 35 37 37
Output buffer (Weight 6): 6 10 15 27 35
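The worked example can be reproduced with a short sketch. The function name and the starting offset are assumptions: the offset below, (w+2)/2 for even total weight w and (w+1)/2 for odd w, was chosen because it reproduces the output buffer shown in the example. For clarity the sketch materializes the replicated elements, which a real implementation would avoid.

```python
def generalized_collapse(buffers, k):
    """`buffers` is a list of (weight, elements) pairs, each with k elements.
    Returns (total_weight, k elements spaced total_weight apart)."""
    total_weight = sum(w for w, _ in buffers)
    # Conceptually replicate each element by its buffer's weight, then sort.
    expanded = sorted(x for w, elems in buffers for x in elems for _ in range(w))
    # Assumed 1-based starting offset, chosen to match the example.
    if total_weight % 2 == 0:
        offset = (total_weight + 2) // 2
    else:
        offset = (total_weight + 1) // 2
    picked = [expanded[offset - 1 + j * total_weight] for j in range(k)]
    return total_weight, picked
```

Calling it on the three buffers of the example (weights 2, 3 and 1, k = 5) picks the elements at positions 4, 10, 16, 22 and 28 of the sorted sequence.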
Analysis of Generalized Munro-Paterson
Munro-Paterson: $O(\frac{1}{\epsilon} \log^2(\epsilon n))$
Generalized Munro-Paterson: $O(\frac{1}{\epsilon} \log^2(\epsilon n))$ - But smaller constant
Reservoir Sampling [Vitter 85]
Input: sequence of length N
Maintain a uniform sample of size s
Approximate median = median of sample
If $s = O(\epsilon^{-2} \log \delta^{-1})$, then with probability at least $1-\delta$, the answer is an $\epsilon$-approximate median
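A minimal sketch of maintaining a uniform sample of size s over a stream, using the simple replace-with-probability-s/i scheme (commonly called Algorithm R). The function names and the optional `rng` parameter are illustrative assumptions; the faster skipping variants from [Vitter 85] are not shown.

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Uniform sample of size s from a stream of unknown length."""
    sample = []
    for i, x in enumerate(stream):
        if i < s:
            sample.append(x)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)  # uniform in 0 .. i
            if j < s:
                sample[j] = x         # keep x with probability s / (i + 1)
    return sample

def median_of_sample(sample):
    """Lower median of the sample, used as the approximate median."""
    ordered = sorted(sample)
    return ordered[(len(ordered) - 1) // 2]
```

Usage: `median_of_sample(reservoir_sample(stream, s))` gives the approximate median; memory is s elements regardless of the stream length.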
“Non-Reservoir” Sampling
Stream: A B D B A B D F A S C D D B A B D F A T X Y D B A X T F A S X Z D B A B D T G H
Choose 1 out of every N/s successive elements
At end of stream, sample size is s
Approximate median = median of sample
If $s = O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$, then with probability at least $1-\delta$, the answer is an $\epsilon$-approximate median
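Choosing 1 out of every N/s successive elements can be sketched as below. The function name is hypothetical, and the sketch assumes N is known in advance and that s divides N evenly.

```python
import random

def one_in_block_sample(stream, n, s, rng=random):
    """Pick one element uniformly at random from each block of n // s
    successive elements (assumes s divides n)."""
    block = n // s
    sample = []
    chosen = None
    for i, x in enumerate(stream):
        pos = i % block
        # Replacing the pick with probability 1 / (pos + 1) makes it
        # uniform over the elements of the block seen so far.
        if rng.randrange(pos + 1) == 0:
            chosen = x
        if pos == block - 1:
            sample.append(chosen)
    return sample
```

Only the current pick and the finished sample are held in memory, so the footprint is s elements, the same as for the reservoir.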
Non-uniform Sampling ...
Stream: A B D B A B D F A S C D D B A B D F A T X Y D B A X T F A S X Z D B A B D T G H ...
s out of s elements, Weight = 1
s out of 2s elements, Weight = 2
s out of 4s elements, Weight = 4
s out of 8s elements, Weight = 8
At end of stream, sample size is O(s log(N/s))
Approximate median = weighted median of sample
If $s = O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$, then with probability at least $1-\delta$, the answer is an $\epsilon$-approximate median
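The doubling scheme can be sketched as follows: the first s elements are kept with weight 1, then one element is picked from each block of 2 for the next 2s elements (weight 2), one from each block of 4 for the next 4s elements (weight 4), and so on. The function names, the handling of a partial trailing block (its pick is weighted by the number of elements seen), and the lower-median convention are assumptions made for this illustration.

```python
import random

def non_uniform_sample(stream, s, rng=random):
    """Return (weight, element) pairs sampled at rates 1, 1/2, 1/4, ..."""
    sample = []
    weight, picks_in_phase = 1, 0
    chosen, seen = None, 0
    for x in stream:
        seen += 1
        if rng.randrange(seen) == 0:   # uniform pick within current block
            chosen = x
        if seen == weight:             # block complete
            sample.append((weight, chosen))
            seen = 0
            picks_in_phase += 1
            if picks_in_phase == s:    # phase complete: halve the rate
                weight, picks_in_phase = weight * 2, 0
    if seen:
        sample.append((seen, chosen))  # partial trailing block (assumption)
    return sample

def weighted_median(sample):
    """Smallest element whose cumulative weight reaches half the total."""
    total = sum(w for w, _ in sample)
    running = 0
    for w, x in sorted(sample, key=lambda pair: pair[1]):
        running += w
        if 2 * running >= total:
            return x
```

The weights always sum to the number of stream elements consumed, so the weighted median of the sample stands in for the median of the stream without knowing N in advance.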
Sampling + Generalized Munro-Paterson [MRL 98]
Stream of unknown length, $\epsilon$ and $\delta$
Reservoir Sampling: Maintain $s = 0.5\,\epsilon^{-2} \log(2\delta^{-1})$ samples.
Compute exact median of samples.
Memory required: $O(\epsilon^{-2} \log \delta^{-1})$
Stream of known length N, $\epsilon$ and $\delta$ (advance knowledge of N)
“1-in-N/s” Sampling: Choose $s = 0.5\,\epsilon^{-2} \log(2\delta^{-1})$ samples.
Generalized Munro-Paterson: Compute $\epsilon$-approximate median of samples. Memory required = $O((1-\alpha)^{-1} \epsilon^{-1} \log^2((1-\alpha)\epsilon s))$
Memory required: $O(\frac{1}{\epsilon} \log^2(\frac{1}{\epsilon} \log \frac{1}{\delta}))$
Output is an $\epsilon$-approximate median with probability at least $1-\delta$.
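To get a feel for the sample sizes involved, the sketch below evaluates a sample size of the form 0.5 ε^-2 log(2/δ), which is the shape of the bound that follows from Hoeffding's inequality when the exact median of a uniform sample is used. The function name and the use of the natural logarithm are assumptions; the slides do not state the base of the logarithm.

```python
import math

def sample_size(eps, delta):
    """s = 0.5 * eps**-2 * ln(2 / delta), rounded up.
    Natural logarithm is assumed."""
    return math.ceil(0.5 * eps ** -2 * math.log(2 / delta))
```

For eps = 0.01 and delta = 0.0001 this gives a sample of just under 50,000 elements, independent of the stream length.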
Unknown-N Algorithm [MRL 99]
Stream of unknown length, $\epsilon$ and $\delta$
Non-uniform Sampling
Modified Deterministic Algorithm For Approximate Medians
Output is an $\epsilon$-approximate median with probability at least $1-\delta$.
Memory required: $O(\frac{1}{\epsilon} \log^2(\frac{1}{\epsilon} \log \frac{1}{\delta}))$
Modified Deterministic Algorithm ...
b buffers, each of size k
Sample input enters the tree at increasing heights:
Height h: 2s elements with W = 1
Height h+1: s elements with W = 2
Height h+2: s elements with W = 4
Height h+3: s elements with W = 8
Height L: s elements with W = 2^(L-h)
L = highest level, h = height of tree
Compute approximate median of weighted samples.
Modified Munro-Paterson Algorithm
b buffers, each of size k
Weighted samples enter the tree at increasing heights:
Height b: 2s elements with W = 1
Height b+1: s elements with W = 2
Height b+2: s elements with W = 4
Height b+3: s elements with W = 8
Height H: s elements with W = 2^(H-b)
H = highest level, b = height of tree
Compute approximate median of weighted samples.
Error Analysis ...
b buffers, each of size k
Weighted samples enter the tree at increasing heights:
Height b: 2s elements with W = 1
Height b+1: s elements with W = 2
Height b+2: s elements with W = 4
Height b+3: s elements with W = 8
Height b+h: s elements with W = 2^(H-b)
b+h = total height, b = height of small tree
Increase in fractional error in rank is 1/2k per level
Total fractional error <= ... <= b/k
Error Analysis contd...
Minimize bk subject to following constraints:
Number of elements in leaves = $k 2^b > s$, where $s = 0.5\,\epsilon^{-2} \log(2\delta^{-1})$
Max fractional error in rank = $b/k < (1-\alpha)\epsilon$
Almost the same as before
$b = O(\log(\epsilon s))$, $k = O(\frac{1}{\epsilon} \log(\epsilon s))$
Memory = $bk = O(\frac{1}{\epsilon} \log^2(\frac{1}{\epsilon} \log \frac{1}{\delta}))$
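Summary of Algorithms ...
• Reservoir Sampling [Vitter 85]
– Probabilistic: $O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$
• Munro-Paterson [MP 80]
– Deterministic: $O(\frac{1}{\epsilon} \log^2(\epsilon n))$
• Generalized Munro-Paterson [MRL 98]
– Deterministic: $O(\frac{1}{\epsilon} \log^2(\epsilon n))$
• Sampling + Generalized MP [MRL98]
– Probabilistic: $O(\frac{1}{\epsilon} \log^2(\frac{1}{\epsilon} \log \frac{1}{\delta}))$
• Non-uniform Sampling + GMP [MRL 99]
– Probabilistic: $O(\frac{1}{\epsilon} \log^2(\frac{1}{\epsilon} \log \frac{1}{\delta}))$
• Greenwald & Khanna [GK 01]
– Deterministic: $O(\frac{1}{\epsilon} \log(\epsilon n))$
Munro-Paterson, Generalized Munro-Paterson, and Sampling + Generalized MP require advance knowledge of n.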
List of papers ...
[Hoeffding63] W Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables”, Amer. Stat. Journal, pp 13-30, 1963.
[MP80] J I Munro and M S Paterson, “Selection and Sorting in Limited Storage”, Theoretical Computer Science, 12:315-323, 1980.
[Vit85] J S Vitter, “Random Sampling with a Reservoir”, ACM Trans. on Math. Software, 11(1):37-57, 1985.
[MRL98] G S Manku, S Rajagopalan and B G Lindsay, “Approximate Medians and other Quantiles in One Pass and with Limited Memory”, ACM SIGMOD 98, pp 426-435, 1998.
[MRL99] G S Manku, S Rajagopalan and B G Lindsay, “Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets”, ACM SIGMOD 99, pp 251-262, 1999.
[GK01] M Greenwald and S Khanna, “Space-Efficient Online Computation of Quantile Summaries”, ACM SIGMOD 2001, pp 58-66, 2001.