bruno-hawkins
outline
• introduction
• background
  – wavelet basics
  – example
• wavelet synopses
  – example
  – error metrics
  – optimal synopses
  – interesting issues
• data streams
  – models
  – streaming wavelet synopses
• epilogue
introduction
• analyzing massive multi-dimensional datasets
  – complex aggregate queries over large parts of the data
  – exploratory nature
  – promptness over accuracy, but with guarantees
  – resort to approximate query processing over precomputed synopses (e.g., histograms, samples, wavelets)
• numerous data management applications need to continuously generate, process and analyze data on-line
  – the data streaming paradigm
  – summarize in real time, using small space and in one pass
  – provide approximate query answers with quality guarantees
• provide useful data summarization
  – need to measure inaccuracy, which is application dependent
wavelet basics
• wavelet decomposition is a mathematical tool for the hierarchical decomposition of functions
  – applications in signal/image processing
• used extensively as a data reduction tool in db scenarios:
  – selectivity estimation for large aggregate queries
  – fast approximate query answers
  – general purpose streaming synopsis
• features
  – efficient: runs in linear time and space (vs. ~N^2 for histograms)
  – high compression ratio, small-B property
  – generalizes to multiple dimensions
example
assume a data vector d of 8 values

  d0  d1  d2  d3  d4  d5  d6  d7
  12  24   4  16   4   8  -3  -7

iteratively perform pair-wise averaging and semi-differencing

  level    averages                semi-differences
  input    12 24 4 16 4 8 -3 -7
  1        18 10 6 -5              -6 -6 -2 2
  2        14 0.5                  4 5.5
  3        7.25                    6.75

• averages are not needed (only the overall average 7.25 is kept)

wavelet tree (a.k.a. error tree)

  c0 = 7.25
  c1 = 6.75
  c2 = 4       c3 = 5.5
  c4 = -6      c5 = -6      c6 = -2      c7 = 2
  d0  d1       d2  d3       d4  d5       d6  d7

• every node contributes positively to the leaves in its left subtree and negatively to the leaves in its right subtree
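The averaging and semi-differencing steps of the example can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `haar_decompose` is hypothetical, and it assumes the input length is a power of two and the unnormalized convention used on the slide, (a+b)/2 and (a-b)/2.

```python
def haar_decompose(data):
    """Return [c0, c1, ..., c_{N-1}] for a vector whose length is a power of two."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(data)
    details = []  # coarser levels are prepended, so the finest level ends up last
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]  # semi-differences
        averages = [(a + b) / 2 for a, b in pairs]       # pair-wise averages
        details = level_details + details
    # Only the overall average survives; the intermediate averages are dropped.
    return averages + details


coeffs = haar_decompose([12, 24, 4, 16, 4, 8, -3, -7])
print(coeffs)  # [7.25, 6.75, 4.0, 5.5, -6.0, -6.0, -2.0, 2.0]
```

The returned list follows the c0..c7 numbering of the error tree: the overall average first, then the semi-differences from the coarsest to the finest level.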
wavelet synopses
• any set of B coefficients constitutes a B-term wavelet synopsis
  – stored as <index,value> pairs
  – implicitly all non-stored coefficients are set to zero
• introduces reconstruction error per point estimate
example: the reconstructed values below are what results from keeping c0, c1 and c3 and setting the other coefficients to zero

  coefficient   c0    c1    c2  c3   c4  c5  c6  c7
  value         7.25  6.75  4   5.5  -6  -6  -2  2

                           d0  d1  d2  d3  d4  d5  d6  d7
  data values d            12  24   4  16   4   8  -3  -7
  reconstructed values d̂   14  14  14  14   6   6  -5  -5

                           e0  e1  e2  e3  e4  e5  e6  e7
  point errors e            2  10  10   2   2   2   2   2

e = |d - d̂|
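Reconstruction from a synopsis can be sketched as the inverse of the decomposition: start from the overall average and, level by level, add each coefficient to its left half and subtract it from its right half. The function name `reconstruct` and the dict representation of the <index,value> pairs are assumptions made for illustration.

```python
def reconstruct(synopsis, n):
    """Rebuild n data values from {index: value}; missing coefficients count as zero."""
    coeffs = [synopsis.get(i, 0.0) for i in range(n)]
    values = [coeffs[0]]  # start from the overall average c0
    width = 1
    while width < n:
        details = coeffs[width:2 * width]  # coefficients of the current level
        expanded = []
        for avg, det in zip(values, details):
            expanded += [avg + det, avg - det]  # left child gets +, right child gets -
        values = expanded
        width *= 2
    return values


# A 3-term synopsis of the running example: keep c0, c1 and c3.
approx = reconstruct({0: 7.25, 1: 6.75, 3: 5.5}, 8)
print(approx)  # [14.0, 14.0, 14.0, 14.0, 6.0, 6.0, -5.0, -5.0]
```

Passing all eight coefficients of the example returns the original data vector exactly.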
measuring accuracy
use some norm to aggregate individual errors
• L2 norm: Σ e_i^2 is the sum squared error (sse)
  – sse = 224
• L∞ norm: max e_i is the maximum absolute error
  – max-abs-error = 10
• generalized to any weighted Lp norm: Σ w_i e_i^p
  – e.g. max-rel-error = max (1/|d_i|) e_i = 10/4 = 250%

                             d0  d1  d2  d3  d4  d5  d6  d7
  vector of data values d    12  24   4  16   4   8  -3  -7
  vector of point errors e    2  10  10   2   2   2   2   2
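The three metrics quoted for the example can be checked with a short sketch. The helper names are hypothetical, and the relative error here assumes no data value is zero (a real system would typically guard against that case).

```python
def point_errors(data, approx):
    """Absolute error of each reconstructed value."""
    return [abs(x - y) for x, y in zip(data, approx)]


def sse(errors):
    """Sum squared error (squared L2 norm of the error vector)."""
    return sum(e * e for e in errors)


def max_abs_error(errors):
    """L-infinity norm of the error vector."""
    return max(errors)


def max_rel_error(data, errors):
    """Weighted L-infinity norm with weights 1/|d_i|; assumes every d_i is nonzero."""
    return max(e / abs(x) for x, e in zip(data, errors))


d = [12, 24, 4, 16, 4, 8, -3, -7]
d_hat = [14, 14, 14, 14, 6, 6, -5, -5]
e = point_errors(d, d_hat)

print(e)                    # [2, 10, 10, 2, 2, 2, 2, 2]
print(sse(e))               # 224
print(max_abs_error(e))     # 10
print(max_rel_error(d, e))  # 2.5, i.e. 250%
```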
optimal synopses
a B-term wavelet synopsis can be optimized for any error metric
• sse optimal synopses are straightforward
  – the wavelet transformation is orthonormal (after normalization), so by Parseval’s theorem the L2 norm is preserved
  – choose the B coefficients with the highest absolute (normalized) value
• optimal synopses for other Lp norms (weighted or not) require superlinear (quadratic) time in N
  – dynamic programming over the wavelet tree
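A minimal sketch of the sse-optimal selection follows. It assumes the averaging/semi-differencing convention of the running example, under which a coefficient is normalized by multiplying it with the square root of the number of leaves below it. The names `support`, `sse_optimal` and `dropped_energy` are hypothetical.

```python
import math


def support(i, n):
    """Number of leaves under coefficient i in an error tree over n values."""
    return n if i == 0 else n >> (i.bit_length() - 1)


def sse_optimal(coeffs, b):
    """Keep the b coefficients with the largest normalized absolute value."""
    n = len(coeffs)
    ranked = sorted(
        range(n),
        key=lambda i: abs(coeffs[i]) * math.sqrt(support(i, n)),
        reverse=True,
    )
    return {i: coeffs[i] for i in sorted(ranked[:b])}


def dropped_energy(coeffs, kept):
    """By Parseval, the sse equals the squared normalized value of what is dropped."""
    n = len(coeffs)
    return sum(c * c * support(i, n) for i, c in enumerate(coeffs) if i not in kept)


coeffs = [7.25, 6.75, 4, 5.5, -6, -6, -2, 2]  # c0..c7 of the running example
synopsis = sse_optimal(coeffs, 3)
print(synopsis)                          # {0: 7.25, 1: 6.75, 3: 5.5}
print(dropped_energy(coeffs, synopsis))  # 224
```

For B = 3 this picks c0, c1 and c3, and the energy of the dropped coefficients matches the sse of 224 computed from the point errors of the example.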
interesting issues
• I/O efficiency issues when dealing with massive multi-dimensional datasets [M. Jahangiri, D. Sacharidis, C. Shahabi ‘05]
  – during transformation try to minimize I/Os
  – efficient maintenance as new data are appended (requires more than just some updating)
• how about optimizing for workloads of range-sum queries?
  – no known results (without using the prefix-sum array)
  – ranges overlap arbitrarily, so no easy dynamic programming formulation exists
working over data streams
• main challenges when data are streaming:
  – stream items are only seen once
  – require small working space
  – process stream items quickly
  – provide an answer quickly with quality guarantees
two models, depending on how a data vector a is rendered
• time series model
  – stream elements are vector values of type (i,a[i]) and appear ordered in i (e.g., time)
• turnstile model
  – stream elements are updates of type (i,±u), each implying a[i] ← a[i] ± u; they need not appear ordered in i
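[figure: time series model. stream of (i,a[i]) items (1,12) (2,24) (3,4); the next item (4,14) appends the value 14 to the data vector a = 12 24 4]
[figure: turnstile model. stream of (i,±u) updates (1,-2) (2,-4) (4,+10); the next update (4,+2) raises the 4th entry of the data vector a = 12 20 8 10 4 13 7 0 from 10 to 12]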
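The difference between the two models can be sketched as two update routines over a plain Python list. The function names are hypothetical, indices are assumed to be 1-based as in the stream items above, and the turnstile vector is assumed to start at all zeros.

```python
def apply_time_series(a, item):
    """Time series model: item (i, a[i]) fixes the next entry; items arrive ordered in i."""
    i, value = item
    if i != len(a) + 1:
        raise ValueError("time series items must arrive ordered in i")
    a.append(value)


def apply_turnstile(a, update):
    """Turnstile model: update (i, ±u) adds ±u to entry i; updates arrive in any order."""
    i, delta = update
    a[i - 1] += delta


ts = []
for item in [(1, 12), (2, 24), (3, 4), (4, 14)]:
    apply_time_series(ts, item)
print(ts)  # [12, 24, 4, 14]

tv = [0] * 8  # assumption: the vector starts empty (all zeros)
for update in [(1, -2), (2, -4), (4, +10), (4, +2)]:
    apply_turnstile(tv, update)
print(tv)  # [-2, -4, 0, 12, 0, 0, 0, 0]
```

In the time series model an entry is final once it has been seen, whereas in the turnstile model any entry may still change, which is what makes synopsis maintenance harder there.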
streaming wavelet synopses
• time series model
  – each arriving item affects only O(log N) coefficients (those on its leaf-to-root path)
  – a large number of coefficients have finalized values
  – can perform bottom-up dynamic programming (but the space required is prohibitive)
  – greedy techniques should be deployed instead
[figure: error tree over d0..d7 with coefficients c0..c7; the arriving item (4,d4) is shown at leaf d4, with signs + + - + marked on the tree]
• turnstile model
  – even optimizing for the sse is hard [G. Cormode, M. Garofalakis, D. Sacharidis ‘06]
  – other error metrics have not been studied
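Which coefficients an arriving item touches can be sketched by walking from its leaf to the root of the error tree. The function name `affected_coefficients` and the implicit heap-style numbering (leaf i sits at position N + i, the parent of position p is p // 2) are assumptions made for illustration; leaf indices are 0-based as in d0..d7.

```python
def affected_coefficients(i, n):
    """Return [(coefficient index, sign)] for leaf i (0-based) of an n-leaf error tree."""
    path = []
    pos = n + i  # leaves occupy positions n..2n-1 of the implicit tree
    while pos > 1:
        sign = +1 if pos % 2 == 0 else -1  # even position = left child = positive contribution
        pos //= 2
        path.append((pos, sign))
    path.append((0, +1))  # the overall average c0 contributes positively to every leaf
    return path


path = affected_coefficients(4, 8)
print(path)  # [(6, 1), (3, 1), (1, -1), (0, 1)]

# Sanity check against the running example: d4 = c0 - c1 + c3 + c6.
coeffs = [7.25, 6.75, 4, 5.5, -6, -6, -2, 2]
print(sum(sign * coeffs[j] for j, sign in path))  # 4.0
```

For N = 8 the path holds log N + 1 = 4 coefficients; every coefficient whose leaves have all arrived no longer changes in the time series model.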
epilogue
wavelet synopses are a highly successful data summarization technique
yet, several problems remain open:
  – optimize for range query workloads
  – greedy (time-series) streaming algorithms
  – other metrics for general (turnstile) streaming data
thank you!
http://www.dblab.ntua.gr/
unrestricted wavelet synopses
the retained coefficients can assume any value, not restricted to their decomposed value (even harder optimization problem!)
quick example: optimize for max-abs-error, d = {2, 10, 12, 8} and B=1
• restricted synopsis: keep the overall average 8 ⇒ m.a.e. = 6
• unrestricted synopsis: keep the overall average but change its value to 7 ⇒ m.a.e. = 5
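The quick example can be verified with a small sketch. With B=1 and only the overall-average coefficient retained, every point is estimated by the same constant, so the unrestricted problem reduces to choosing the constant that minimizes the maximum absolute error, which is the midrange (min + max) / 2. The helper name `max_abs_error` is hypothetical.

```python
def max_abs_error(data, constant):
    """Maximum absolute error when every value is estimated by the same constant."""
    return max(abs(x - constant) for x in data)


d = [2, 10, 12, 8]

restricted = sum(d) / len(d)          # decomposed value of c0: the overall average
unrestricted = (min(d) + max(d)) / 2  # midrange: best single constant for max-abs-error

print(restricted, max_abs_error(d, restricted))      # 8.0 6.0
print(unrestricted, max_abs_error(d, unrestricted))  # 7.0 5.0
```

This covers only the single-coefficient case; with more retained coefficients the choice of values interacts across the tree, which is what makes the general unrestricted problem harder.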