Constructing Optimal Wavelet Synopses

Dimitris Sacharidis, Timos Sellis


outline

• introduction
• background
  – wavelet basics
  – example
• wavelet synopses
  – example
  – error metrics
  – optimal synopses
  – interesting issues
• data streams
  – models
  – streaming wavelet synopses
• epilogue

introduction

• analyzing massive multi-dimensional datasets
  – complex aggregate queries over large parts of the data
  – exploratory nature
  – promptness over accuracy, but with guarantees
  – resort to approximate query processing over precomputed synopses (e.g., histograms, samples, wavelets)
• numerous data management applications must continuously generate, process and analyze data on-line
  – the data streaming paradigm
  – summarize in real time, using small space and in one pass
  – provide approximate query answers with quality guarantees
• provide useful data summarization
  – need to measure inaccuracy, which is application dependent






wavelet basics

• wavelet decomposition is a mathematical tool for the hierarchical decomposition of functions
  – applications in signal/image processing
• used extensively as a data reduction tool in database scenarios:
  – selectivity estimation for large aggregate queries
  – fast approximate query answers
  – general purpose streaming synopsis
• features
  – efficient: runs in linear time and space (vs. ~N^2 for histograms)
  – high compression ratio, small-B property
  – generalizes to multiple dimensions

example

assume a data vector d of 8 values

  d0  d1  d2  d3  d4  d5  d6  d7
  12  24   4  16   4   8  -3  -7

iteratively perform pair-wise averaging and semi-differencing

  averages                        semi-differences
  12  24  4  16  4  8  -3  -7     (data)
  18  10  6  -5                   -6  -6  -2  2
  14  0.5                         4  5.5
  7.25                            6.75

the intermediate averages are not needed: only the overall average and the semi-differences are kept as coefficients

wavelet tree (a.k.a. error tree)

                  c0 = 7.25
                      |
                  c1 = 6.75
                /            \
         c2 = 4               c3 = 5.5
        /      \             /        \
   c4 = -6   c5 = -6    c6 = -2     c7 = 2
   /   \     /   \      /   \       /   \
  d0   d1   d2   d3    d4   d5     d6   d7
  12   24    4   16     4    8     -3   -7

every node contributes positively to the leaves in its left subtree and negatively to the leaves in its right subtree

for example, d2 = c0 + c1 - c2 + c5 = 7.25 + 6.75 - 4 - 6 = 4
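The pair-wise averaging and semi-differencing above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name haar_decompose is hypothetical, the input length is assumed to be a power of two, and coefficients are returned in error-tree order (c0, c1, ..., c7).

```python
def haar_decompose(data):
    """Return un-normalized Haar coefficients in error-tree order.

    Assumption (not from the slides): len(data) is a power of two.
    """
    n = len(data)
    coeffs = [0.0] * n
    averages = list(data)
    while len(averages) > 1:
        half = len(averages) // 2
        pairs = [(averages[2 * i], averages[2 * i + 1]) for i in range(half)]
        # Semi-differences of this level fill indices half .. 2*half - 1.
        for i, (left, right) in enumerate(pairs):
            coeffs[half + i] = (left - right) / 2
        # Pair-wise averages feed the next (coarser) level.
        averages = [(left + right) / 2 for left, right in pairs]
    coeffs[0] = averages[0]  # overall average
    return coeffs


d = [12, 24, 4, 16, 4, 8, -3, -7]
c = haar_decompose(d)
print(c)  # [7.25, 6.75, 4.0, 5.5, -6.0, -6.0, -2.0, 2.0]
```

Each level halves the number of averages, so the total work is N/2 + N/4 + ... + 1 pair operations, which is linear in N.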






wavelet synopses

• any set of B coefficients constitutes a B-term wavelet synopsis
  – stored as <index,value> pairs
  – implicitly, all non-stored coefficients are set to zero
• introduces a reconstruction error per point estimate

example: the reconstruction below results from retaining c0 = 7.25, c1 = 6.75 and c3 = 5.5 (B = 3) and setting all other coefficients to zero

  index                  d0  d1  d2  d3  d4  d5  d6  d7
  data values d          12  24   4  16   4   8  -3  -7
  reconstruction d̂       14  14  14  14   6   6  -5  -5
  point errors e          2  10  10   2   2   2   2   2

e = |d - d̂|, where d̂ is the vector reconstructed from the synopsis
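As a rough sketch of how point estimates are obtained from a synopsis, the following Python walks the error tree top-down, treating every non-stored coefficient as zero. The function name reconstruct and the dict-based <index,value> representation are assumptions made for illustration; the vector length is assumed to be a power of two.

```python
def reconstruct(synopsis, n):
    """Rebuild n values from a {index: value} wavelet synopsis.

    Non-stored coefficients count as zero. Assumes n is a power of two.
    """
    values = [synopsis.get(0, 0.0)]  # start from the overall average c0
    level_start = 1                  # index of the first coefficient of the level
    while len(values) < n:
        expanded = []
        for i, average in enumerate(values):
            detail = synopsis.get(level_start + i, 0.0)
            expanded.append(average + detail)  # left subtree: positive sign
            expanded.append(average - detail)  # right subtree: negative sign
        values = expanded
        level_start *= 2
    return values


d = [12, 24, 4, 16, 4, 8, -3, -7]
synopsis = {0: 7.25, 1: 6.75, 3: 5.5}  # B = 3 retained coefficients
d_hat = reconstruct(synopsis, len(d))
errors = [abs(x - y) for x, y in zip(d, d_hat)]
print(d_hat)   # [14.0, 14.0, 14.0, 14.0, 6.0, 6.0, -5.0, -5.0]
print(errors)  # [2.0, 10.0, 10.0, 2.0, 2.0, 2.0, 2.0, 2.0]
```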

measuring accuracy

use some norm to aggregate individual errors

• L2 norm: Σ e_i^2 is the sum squared error (sse)
  – sse = 224
• L∞ norm: max e_i is the maximum absolute error
  – max-abs-error = 10
• generalized to any weighted Lp norm: Σ w_i e_i^p
  – e.g. max-rel-error = max e_i / |d_i| = 10/4 = 250%

  vector of data values d      12  24   4  16   4   8  -3  -7
  vector of point errors e      2  10  10   2   2   2   2   2
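The three figures quoted above can be checked directly. The snippet below is only a sanity check on the slide's numbers; the helper weighted_lp is a hypothetical name for the generalized aggregate Σ w_i e_i^p, and the relative error assumes no data value is zero.

```python
d = [12, 24, 4, 16, 4, 8, -3, -7]          # data values
d_hat = [14, 14, 14, 14, 6, 6, -5, -5]     # reconstruction from the synopsis
e = [abs(x - y) for x, y in zip(d, d_hat)]  # point errors


def weighted_lp(errors, weights, p):
    """Weighted aggregate sum(w_i * e_i**p), as on the slide."""
    return sum(w * err ** p for w, err in zip(weights, errors))


sse = weighted_lp(e, [1] * len(e), 2)
max_abs_error = max(e)
# Relative error weights each point error by 1/|d_i| (assumes d_i != 0).
max_rel_error = max(err / abs(x) for err, x in zip(e, d))

print(sse)            # 224
print(max_abs_error)  # 10
print(max_rel_error)  # 2.5
```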

optimal synopses

a B-term wavelet synopsis can be optimized for any error metric

• sse-optimal synopses are straightforward
  – the wavelet transformation is orthonormal (after normalization), so by Parseval’s theorem the L2 norm is preserved
  – keep the B coefficients with the largest absolute (normalized) values
• synopses optimal for other (weighted or unweighted) Lp norms require superlinear (quadratic) time in N
  – dynamic programming over the wavelet tree
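The sse-optimal selection can be sketched as follows, under one stated assumption about normalization: for the un-normalized averaging/semi-differencing transform, a coefficient whose subtree covers s data values contributes c^2 * s to the squared L2 norm, so ranking by c^2 * s is equivalent to ranking by absolute normalized value. The names support_size and sse_optimal_synopsis are hypothetical.

```python
def support_size(index, n):
    """Number of data values under coefficient `index` in an n-leaf error tree."""
    if index == 0:
        return n
    level = index.bit_length() - 1  # c1 -> 0, c2..c3 -> 1, c4..c7 -> 2, ...
    return n >> level


def sse_optimal_synopsis(coeffs, b):
    """Keep the b coefficients with the largest energy c^2 * support.

    Returns the synopsis as {index: value} and the resulting sse,
    which equals the total energy of the dropped coefficients.
    """
    n = len(coeffs)

    def energy(i):
        return coeffs[i] ** 2 * support_size(i, n)

    kept = sorted(range(n), key=energy, reverse=True)[:b]
    sse = sum(energy(i) for i in range(n) if i not in kept)
    return {i: coeffs[i] for i in sorted(kept)}, sse


d = [12, 24, 4, 16, 4, 8, -3, -7]
c = [7.25, 6.75, 4, 5.5, -6, -6, -2, 2]  # coefficients of the running example

# Parseval check: energy is preserved by the (normalized) transform.
total_energy = sum(c[i] ** 2 * support_size(i, len(c)) for i in range(len(c)))
print(total_energy, sum(x * x for x in d))  # 1130.0 1130

synopsis, sse = sse_optimal_synopsis(c, 3)
print(synopsis)  # {0: 7.25, 1: 6.75, 3: 5.5}
print(sse)       # 224.0
```

For B = 3 this selects c0, c1 and c3, and the dropped energy of 224 agrees with the sse computed from the point errors of the running example.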

interesting issues

• I/O efficiency issues when dealing with massive multi-dimensional datasets [M. Jahangiri, D. Sacharidis, C. Shahabi ‘05]
  – during transformation, try to minimize I/Os
  – efficient maintenance as new data are appended (requires more than just some updating)
• how about optimizing for workloads of range-sum queries?
  – no known results (without using the prefix-sum array)
  – ranges overlap arbitrarily, so no easy dynamic programming formulation exists






working over data streams

• main challenges when data are streaming:
  – stream items are only seen once
  – require small working space
  – process stream items quickly
  – provide an answer quickly with quality guarantees

two models, depending on how a data vector a is rendered

time series model

stream elements are vector values of type (i, a[i]) and appear ordered in i (e.g., time)

  [figure: a stream of (i, a[i]) items (1,12), (2,24), (3,4), (4,14), each setting the i-th element of the data vector a: 12, 24, 4, 14]

turnstile model

stream elements are updates of type (i, ±u), which imply a[i] ← a[i] ± u and, further, do not appear ordered in i

  [figure: a stream of (i, ±u) updates (4,+2), (1,-2), (2,-4), (4,+10), each adding ±u to the i-th element of the data vector a]
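The difference between the two models can be illustrated with a small sketch. Both function names are hypothetical, indices are 1-based as in the figures, and the turnstile vector is assumed to start from all zeros purely for illustration.

```python
def render_time_series(stream):
    """Time series model: items (i, a[i]) arrive ordered in i, each seen once."""
    a = []
    for expected_i, (i, value) in enumerate(stream, start=1):
        if i != expected_i:
            raise ValueError("time series items must arrive in index order")
        a.append(value)
    return a


def render_turnstile(stream, n):
    """Turnstile model: updates (i, +/-u) arrive in any order; a[i] += +/-u."""
    a = [0] * n  # assumption: the vector starts from zero
    for i, delta in stream:
        a[i - 1] += delta
    return a


print(render_time_series([(1, 12), (2, 24), (3, 4), (4, 14)]))
# [12, 24, 4, 14]
print(render_turnstile([(4, +2), (1, -2), (2, -4), (4, +10)], n=4))
# [-2, -4, 0, 12]
```

In the time series model a value is final once it has been seen, whereas in the turnstile model any element may still change, which is why the latter is the more general setting.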

streaming wavelet synopses

• time series model
  – each arriving item affects only O(log N) coefficients (those on its path to the root)
  – a large number of coefficients have finalized values
  – can perform bottom-up dynamic programming (but the space required is prohibitive)
  – greedy techniques should be deployed instead

  [figure: error tree over d0 ... d7 with coefficients c0 ... c7, marking the coefficients affected by the arriving item (4,d4)]

• turnstile model
  – even optimizing for the sse is hard [G. Cormode, M. Garofalakis, D. Sacharidis ‘06]
  – other error metrics have not been studied
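In either model, one stream element touches only the coefficients on a single leaf-to-root path of the error tree. The sketch below applies an update a[i] += delta directly in the coefficient domain; it illustrates that observation only and is not one of the cited streaming algorithms. The name apply_update, the 0-based index i, and the heap-style layout (c_j has children c_2j and c_2j+1, leaf d_i sits at position N + i) are assumptions.

```python
def apply_update(coeffs, i, delta):
    """Apply a[i] += delta to un-normalized Haar coefficients in place.

    Returns the indices of the touched coefficients (log2(N) + 1 of them).
    Assumes len(coeffs) is a power of two and 0 <= i < len(coeffs).
    """
    n = len(coeffs)
    coeffs[0] += delta / n           # the overall average always changes
    touched = [0]
    child = n + i                    # heap position of leaf d_i
    node = child // 2                # its parent coefficient
    while node >= 1:
        support = n >> (node.bit_length() - 1)  # leaves under this node
        sign = 1 if child % 2 == 0 else -1      # left child: +, right child: -
        coeffs[node] += sign * delta / support
        touched.append(node)
        child, node = node, node // 2
    return touched


# Feed the running example as a stream of (i, d_i) items over a zero vector.
d = [12, 24, 4, 16, 4, 8, -3, -7]
coeffs = [0.0] * len(d)
paths = [apply_update(coeffs, i, value) for i, value in enumerate(d)]
print(coeffs)    # [7.25, 6.75, 4.0, 5.5, -6.0, -6.0, -2.0, 2.0]
print(paths[4])  # [0, 6, 3, 1]
```

The item for d4 touches c0, c6, c3 and c1 only. In the time series model, a coefficient is final once all leaves under it have arrived, which is what makes most coefficients finalized at any point in time.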






epilogue

wavelet synopses are a highly successful data summarization technique

yet, several problems remain open:
  – optimizing for range query workloads
  – greedy (time-series) streaming algorithms
  – other metrics for general (turnstile) streaming data

thank you!

http://www.dblab.ntua.gr/

unrestricted wavelet synopses

the retained coefficients can assume any value, not restricted to their decomposed value (even harder optimization problem!)

quick example: optimize for max-abs-error, d = {2, 10, 12, 8} and B = 1
• restricted synopsis: keep the overall average 8; m.a.e. = 6
• unrestricted synopsis: keep the overall average but change its value to 7; m.a.e. = 5
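The quick example can be verified numerically. In this sketch the unrestricted value is taken as the midrange (min + max) / 2, which is the constant minimizing the maximum absolute error when only the overall average is retained; the helper name max_abs_error is hypothetical.

```python
d = [2, 10, 12, 8]


def max_abs_error(data, value):
    """Max absolute error when every point is estimated by one constant."""
    return max(abs(x - value) for x in data)


restricted_value = sum(d) / len(d)          # decomposed overall average
unrestricted_value = (min(d) + max(d)) / 2  # midrange: best single constant

print(restricted_value, max_abs_error(d, restricted_value))      # 8.0 6.0
print(unrestricted_value, max_abs_error(d, unrestricted_value))  # 7.0 5.0
```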
