Online Query Processing
Joseph M. Hellerstein, UC Berkeley
Road Map
Requisite tech trends and prognostications
CONTROL of the situation: online query processing
 HCI motivations
 Example applications
 Example algorithms
  Reordering and Ripple Joins
Implications for “hot topics”
 Big data, small pipes: P2P, sensor nets
 Continuous queries over data streams
Computers Keep Getting Faster
Moore’s Law: # of transistors per chip doubles every 18 months (1965)
 2x price/performance every 18 months
 About right, 24 periods running
But Data Grows Faster Yet!
Source: J. Porter, Disk/Trend, Inc.http://www.disktrend.com/pdf/portrpkg.pdf
[Chart: petabytes of disk shipped per year, 1988-2000 -- sales growth outpaces Moore’s Law]
Disk Appetite, cont.
Greg Papadopoulos, CTO Sun: Disk sales doubling every 9 months
Translate: time to process all your data doubles every 18 months. MOORE’S LAW INVERTED!
Big challenge (opportunity?) for SW systems research
 Feel free to discount the analysis above significantly
 Even so, traditional “scalability” research won’t help: “ideal” linear scaleup is NOT NEARLY ENOUGH! HW will not take us there.
Data Volume: Prognostications
Today:
 SwipeStream: e.g. Wal-Mart >100 TB data warehouse
 ClickStream: one service generates >100 GB of log per day
 Web: Internet Archive WayBack Machine >100 TB; replicated OS/apps, MP3s, etc.
Tomorrow: sensor feeds galore
 Big data, small pipes: sensor nets, P2P
 Continuous queries over data streams
Road Map
Requisite tech trends and prognostications
CONTROL of the situation: online query processing
 HCI motivations
 Example applications
 Example algorithms
  Reordering and Ripple Joins
Implications for “hot topics”
 P2P query systems
 Sensor nets
 Stream query processing
The Latest Commercial Technology
Drawbacks of Current Systems
Only exact answers are available -- a losing proposition as data volume grows
HCI solution: interactive tools don’t do big jobs
 E.g., spreadsheet programs (64K-row limit)
Systems solution: big jobs aren’t interactive
 No user feedback or control in big DBMS queries: “back to the 60’s”
 Long processing times: fundamental mismatch with preferred modes of HCI
Best solutions to date precompute, e.g. “OLAP”
 Don’t handle ad hoc queries or data sets well
CONTROL: Continuous Output, Navigation and Transformation with Refinement On-Line
“Of all men’s miseries, the bitterest is this: to know so much and have control over nothing” -- Herodotus
Requirements for CONTROL systems:
 Early answers
 Refinement over time
 Interaction and ad-hoc control
“Crystal Ball” vs. Black Box vs. “Lucite Watch”
CONTROL Project, ‘96-’01
Main SW artifacts:
 Online Aggregation in SQL (Postgres, Informix, IBM DB2)
 Potter’s Wheel: scalable spreadsheet for online data cleaning; interactive anomaly detection & transformation
 CARMA: online association rule mining
 2 online data viz prototypes: CLOUDS & Phoebus
Seeded current Telegraph project: adaptive dataflow for querying networked sources
Goals for Online Processing
New “greedy” performance regime
 Maximize 1st derivative of the “mirth index”
 Mirth defined on-the-fly
 Therefore need FEEDBACK and CONTROL
[Graph: % of final answer vs. time -- online delivery vs. traditional batch]
Road Map
Requisite tech trends and prognostications
CONTROL of the situation: online query processing
 HCI motivations
 Example applications
 Example algorithms
  Reordering and Ripple Joins
Implications for “hot topics”
 P2P query systems
 Sensor nets
 Stream query processing
Online Aggregation
SELECT AVG(temp) FROM t GROUP BY site
330K rows in table (synthetic data)
The exact answer: [screenshot]
Courtesy Peter Haas, IBM
Online Aggregation, cont’d
A simple online aggregation interface (after 74 rows)
Online Aggregation, cont’d
After 834 rows:
Example: Online Aggregation
Additional features:
 Speed up
 Slow down
 Terminate
Online Data Visualization
In Tioga DataSplash
Online Browsing & Editing
Potter’s Wheel [VLDB 2001]
Scalable spreadsheet
 A fraction of the data is materialized in the GUI widget at any time
 Scrolling = preference for delivering quantiles to the widget
Interactive data cleaning
 Direct-manipulation interface via visual algebra
 Transformation by example
Online structure and discrepancy detection: MDL meets ADTs
 Minimum Description Length [Rissanen] for modeling data
 Based on a grammar of user-defined Abstract Data Types
 Simple API for programmers to register new ADTs
Scalable Spreadsheets
Visual Transformation Shot
Road Map
Requisite tech trends and prognostications
CONTROL of the situation: online query processing
 HCI motivations
 Example applications
 Example algorithms
  Reordering and Ripple Joins
Implications for “hot topics”
 P2P query systems
 Sensor nets
 Stream query processing
Sampling w/o Replacement
We want i.i.d. samples w/o replacement
 At any time, the input to the query is a sample; the input grows over time
Can pre-sort tables randomly, and start scans from random positions
 Best I/O behavior
 Can re-randomize incrementally in the background
 Note: can do this as a “secondary index”
Techniques for random sampling in DBs well studied
 Both from files and from indexes
 Some tricks here (e.g. acceptance-rejection sampling)
An Aside:Sampling in Commercial DBMSs
“Simulated” Bernoulli sampling
 SQL: SELECT * WHERE RAND() <= 0.01
 Similar capability in SAS
Bernoulli sampling with pre-specified rate
 Informix, Oracle 8i, (DB2)
 Ex: SELECT * FROM T1 SAMPLE ROW(10%), T2
 Ex: SELECT * FROM T1 SAMPLE BLOCK(10%), T2
Not for novices
 Need to pre-specify precision: no feedback/control
 Recall the “multiresolution” patterns from the example
 No estimators provided in current systems
Preferential Data Delivery
Why needed? Examples:
 Speedup/slowdown arrows
 Spreadsheet scrollbars
 Pipeline quasi-sort
Mechanisms:
 Juggle [Raman/Raman/Hellerstein, VLDB ‘99]: general purpose; works with streaming data, etc.
 Index stride [Hellerstein/Haas/Wang, SIGMOD ‘97]: needs an index, high I/O cost, but good for outliers
 Mix adaptively! [Raman/Hellerstein, ICDE ‘03]
Online Reordering
Deliver “interesting” items first; “interesting” determined on the fly
Exploit rate gap between produce and process/consume
[Diagram: disk -> produce -> transmit -> process/consume (join)]
Online Reordering
Deliver “interesting” items first; “interesting” determined on the fly based on user gestures at the interface
Exploit rate gap between produce and process/consume
[Diagram: disk -> produce -> reorder buffer -> transmit -> process/consume (join)]
Policies
Want a “good” permutation of the items t1 … tn.
Quality of Feedback for a prefix t1 t2 … tk:
 QoF(UP(t1), UP(t2), …, UP(tk)), UP = user preference, determined by the application
Goodness of reordering: dQoF/dt
Delivery Priority (DP): at any given time, try to deliver the tuple that improves QoF most.
[Graph: QoF vs. time]
Example: Online Aggregation
Metric: avg weighted confidence interval
 Preference acts as a weight on the confidence interval
 QoF = Σi UPi / ni^(1/2), where ni = number of tuples processed from group i
 DPi = UPi / ni^(3/2)
Metric: explicit proportional rates (NW QoS)
 QoF = -Σi (ni - n·UPi)², where n = Σi ni
 DPi = n·UPi - ni
Explicit ranking (scrollbar)
 QoF = Σi UPi over the tuples delivered so far
 DPi = UPi
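The three policies above can be sketched directly; this is a minimal Python illustration (function names are invented), assuming per-group user preferences `up[i]` = UPi and delivered-tuple counts `n[i]` = ni:

```python
# Illustrative sketch of the delivery-priority policies above
# (function names invented; up[i] = UP_i, n[i] = n_i).
def next_group_avg_ci(up, n):
    """Avg weighted CI policy: deliver from the group with the
    largest DP_i = UP_i / n_i^(3/2)."""
    return max(range(len(up)), key=lambda i: up[i] / (n[i] ** 1.5))

def next_group_rate(up, n):
    """Proportional-rate policy: deliver from the group with the
    largest DP_i = n*UP_i - n_i, where n = sum of the n_i."""
    total = sum(n)
    return max(range(len(up)), key=lambda i: total * up[i] - n[i])
```

Note how the CI policy heavily favors groups with few delivered tuples (the n^(3/2) denominator), while the rate policy tracks the requested proportions.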
Ripple Joins [Haas/Hellerstein, SIGMOD ’99]
Need: online join processing
 Pipelined: produce output while consuming input
 Statistically robust: estimates/CIs as you go
SELECT AVG(R.a * S.b) FROM R, S WHERE R.c = S.c
Solution space
 Simple idea: sample a join
 Better: join samples
Ripple Joins: a family of algorithms for joining increasing samples, optimized to shrink confidence intervals ASAP
 Q: Which input to sample faster? “Aspect ratio” of the ripples?
 A: Adapt to the estimator’s “variance contribution” from the different inputs
[Diagram: growing samples of R and S, joined as they arrive]
Traditional Nested Loops
[Diagram: nested loops sweep all of S for each tuple of R]
SELECT AVG(R.a * S.b) FROM R, S WHERE R.c = S.c
Basic Ripple Join
Simplest version:
 read new tuples s from S and r from R
 join r and s
 join r with old S tuples
 join s with old R tuples
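The four steps above can be sketched as a Python generator, which makes the pipelined behavior concrete: output is produced as input is consumed. This is an illustrative sketch, not the DBMS implementation; it assumes equal-length inputs and one new tuple per input per sampling step (the square, β = 1 case):

```python
# Minimal sketch of the basic (square) ripple join described above.
# Written as a generator so results stream out while input streams in.
def ripple_join(R, S, pred):
    seen_r, seen_s = [], []
    for r, s in zip(R, S):        # one new tuple from each input per step
        if pred(r, s):            # join the two new tuples
            yield (r, s)
        for old_s in seen_s:      # join new r with old S tuples
            if pred(r, old_s):
                yield (r, old_s)
        for old_r in seen_r:      # join new s with old R tuples
            if pred(old_r, s):
                yield (old_r, s)
        seen_r.append(r)
        seen_s.append(s)
```

After n steps the generator has compared exactly the n x n prefix of R x S -- the growing square the diagrams depict.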
Basic Ripple Join
[Diagram: square ripple join sweeping out ever-larger rectangles of R x S]
Better I/O: Block Ripple Join
[Diagram: block ripple join; batch I/Os with block size β = 2]
Adapt: Rectangular Ripple Join
[Diagram: rectangular ripple with aspect ratio βR = 2, βS = 1]
Ripple Joins, cont’d
Really a family of join algorithms:
 Block: minimizes I/O in alternating nested loops
 Hash: symmetric hash tables to avoid scans
 Index: basic index join
Adaptive aspect ratio
 User sets the “frame rate” of the animation at the GUI
 System goal: minibatch optimizations -- minimize CI width subject to the “frame-rate” time constraint
 CIs computable at corners, so a new frame at each corner
Challenge: choose the next minibatch corner
 Sample faster from the relation contributing more variance
 System solves an optimization problem to do this (approximately)
Nice and Tidy! But…
Significant details missing.
Systems issues:
 Make it work in a re-entrant iterator model: RippleJoin.get_next() ?
 Multi-join queries
 “Corner” cases -- e.g. EOT on one input of k
Statistical matters:
 Estimators and CIs for joins of samples -- not textbook statistics!
 The adaptivity issue: choosing the next corner
  Based on cost (I/O model for the algorithm) and benefit (CI formula)
  Approximate a non-linear integer programming problem
Implementation Issues
Basically a “symmetric” nested loops: the “inner” and “outer” roles switch back and forth
Note the asymmetry in the inner loops: it “mitres” the box

for (ever) {
  for (i = 1 to max-1)              // S fixed, loop on R
    if (predicate(R[i], S[max]))
      output(R[i], S[max]);
  for (i = 1 to max)                // R fixed, loop on S
    if (predicate(R[max], S[i]))
      output(R[max], S[i]);
  max = max + 1;                    // advance to the next corner (implicit in the original)
}
Now as an iterator
Most DBMSs implement query operators as iterators: get_next() must re-enter and generate the next result tuple.
State for a square, 2-table ripple join:
 cursors on both inputs (as in nested loops join)
 target -- where’s the next “corner”
 currel -- which relation is currently looping
Fussy complications in general
Non-square aspect ratios
 Not just “wrapping” 1 layer on each side: must pad side i with βi layers correctly
Multiple joins = pipeline of binary joins
 But can’t be done naively: “wrapping” hypercubes, not rectangles
 Given a subtree over i inputs, factor in its β’s: the subtree’s hyperplane is β1 β2 … βi n^(i-1) tuples
Boundary conditions
 Overloaded control messages: EOT may mean end of sampling step, not end of table
 Need to signal the aggregator of arrival at a corner (EOT on all tables)
 Also, EOTable on a relation may change the “mitre-ing” rules!
Empirical Results for Square Ripples (208 sec. query)

Select Online AVG(E.grade)
From enroll E, student S
Where E.sid = S.sid
And S.honors_code IS NULL;

Square Results (cont.)
Statistical Matters
Need robust formulas for:
 running estimates
 running CIs
 adaptive update of the aspect ratio
 efficient maintenance of running statistics

Running Estimators
Example: SELECT SUM(expression) FROM R, S WHERE predicate
Natural estimator after n sampling steps:
 (|R| |S| / (|Rn| |Sn|)) Σ_{(r,s) in Rn x Sn} expression_p(r, s)
Unbiased, consistent; a scaled average -- hence amenable to the CLT
Large-Sample Confidence Intervals
Review: single table R
 # sampling steps: 1 << n << |R|
 Based on the Central Limit Theorem for averages of i.i.d. observations
Complications for multiple tables
 Estimator for AVG is a ratio of averages
 expression_p(r, s) and expression_p(r, s’) are correlated!
Solution: CLT for (ratios of) cross-product averages
 Based on the theory of “convergence in distribution”
Also “conservative” CIs that can be used for very small n
 Based on the Hoeffding inequality
And deterministic CIs for the endgame
General form, for K tables and a 100p% CI:
 CI width = 2 z_p σn / n^(1/2), where σn² = Σ_{k=1..K} d(k) / βk
Where:
 z_p is the point such that the area under the standard normal curve on [-z_p, z_p] is p
 n is the number of sampling steps so far
 d(k) is agg-fn dependent, computed on the tuples of input k
 dn(k) is a consistent estimator obtained by applying d(k) to the samples so far
 α is the block size, a constant; βk is the number of blocks to fetch from relation k per step
Tradeoff between update rate and CI shrinkage: bigger β’s = smaller CIs, slower animation
Aspect Ratio Selection
Cost of block ripple join after n steps: c = β1 β2 … βK n^K + o(n^K)
Cost of hash ripple join after n steps: c = (β1 + β2 + … + βK) n
Non-linear integer programming problem:
 minimize σ²(β1, β2, …, βK) = Σ_{k=1..K} d(k) / βk
 such that
  1. c(β1, β2, …, βK) <= constant [animation speed]
  2. 1 <= βk <= |Rk|/n for 1 <= k <= K
  3. β1, β2, …, βK integer
Aspect Ratio Selection
Adaptive procedure:
 1. Start with β1 = β2 = … = βK = 1
 2. Execute m sampling steps (m >= 1)
 3. Update each d(k) estimate
 4. Solve the optimization problem to get new β’s
 5. Sweep out the new rectangle
 6. Go to 2
Approximate optimization algorithm for the β’s:
 First ignore all constraints and minimize σ²
 Then scale up all β’s so the smallest βi >= 1 (constraint 2)
 Scale down by ever-larger βi’s until the animation goal is met (constraint 1)
 Round down to integers (constraint 3)
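A hedged sketch of the approximate procedure above, assuming the hash-ripple cost model c = (β1 + … + βK)·n, under which σ² = Σ d(k)/βk is minimized (for a fixed total β budget) by βk proportional to sqrt(d(k)). The function name and the `budget` parameter are invented for illustration:

```python
import math

# Sketch of the approximate aspect-ratio optimization above, under the
# hash-ripple cost model (per-step cost = sum of the betas).
def choose_aspect_ratios(d, budget):
    beta = [math.sqrt(dk) for dk in d]        # unconstrained shape: beta_k ~ sqrt(d_k)
    lo = min(beta)
    beta = [b / lo for b in beta]             # scale up so smallest beta_i >= 1
    while sum(beta) > budget:                 # shrink largest betas until goal met
        i = beta.index(max(beta))
        beta[i] = max(1.0, beta[i] - 1.0)
        if all(b <= 1.0 for b in beta):
            break
    return [max(1, int(b)) for b in beta]     # round down to integers
```

So an input contributing 4x the variance of another gets roughly twice its sampling rate, matching the sqrt relationship.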
Empirical Results: aspect ratios
Select Online AVG(D.grade/A.grade)
From enroll D, enroll A
Where D.college = ‘Education’
And A.college = ‘Agriculture’
And A.year = D.year

--> converges to a 1 x 6 aspect ratio
Ripple Joins, cont’d
Prototypes in Postgres, Informix, IBM DB2
Follow-on work:
 Subqueries in Singapore [Tan, et al., VLDB ’99]
 Scaling at Wisconsin/IBM [Luo, et al., SIGMOD ’02]: parallel and out-of-core variants
 Query optimization issues
 Motivated eddies [SIGMOD ’99] and the rest of the Telegraph project [CIDR ’03]
 A number of API and other systems issues: Informix implementation [DMKD 2000]
Road Map
Requisite tech trends and prognostications
CONTROL of the situation: online query processing
 HCI motivations
 Example applications
 Example algorithms
  Reordering and Ripple Joins
Implications for “hot topics”
 P2P query systems
 Sensor nets
 Stream query processing
Two Hot DB Topics
Queries over networks
 Querying sensor nets (e.g. TinyDB)
 Querying in P2P networks (e.g. PIER)
Continuous queries over data streams
 E.g. TelegraphCQ, Aurora, STREAM, NiagaraCQ
Natural settings for these ideas!
Queries over Networks: Big Data, Small Pipes
Given: large data sets
 E.g. disks, traces in a P2P setting
 E.g. the physical world in sensornets
Constraint: small pipes
 P2P != cluster [IPTPS ’03]
 Sensor nets have extremely limited comm per battery
Will have long-running dataflow programs! So…
 Need to provide streaming, approximate answers
 Need to prioritize delivery a la the juggle
 Need to decide adaptively on how to sample the physical world
  “Acquisitional Query Processing” in TinyDB is related to both Ripple Join and the juggle
 Need to consider human factors while the query runs
Continuous Queries over Streams
Hot topic of the moment
 Berkeley’s TelegraphCQ, OGI/Wisconsin’s NiagaraCQ, MIT/Brown/Brandeis’ Aurora, Stanford’s STREAM
Again, a natural setting for these ideas
 Need pipelining operators for streams
 Interest in approximate answers during bursts
 And prioritized delivery
 Need to consider human factors!
DB community is focused on the next query language
 Also need to focus on interactivity!
 Initiation of a continuous query is only half the battle
CONTROL: Lessons Learned
Dream about UIs, work on systems
 User needs drive systems design, for long-running, data-intensive apps especially
Systems and statistics intertwine
 Adaptive systems
 All 3 go together naturally
User desires and behavior: 2 more things to model and predict
 “Performance” metrics need to reflect key user needs
“What unlike things must meet and mate…” -- Art, Herman Melville
Related Work on Online QP
Morgenstein’s PhD, Berkeley ’80
Online association rules
 Ng, et al.’s CAP, SIGMOD ’98
 Hidber’s CARMA, SIGMOD ‘99
Implications for deductive DB semantics
 Monotone aggregation in LDL++, Zaniolo and Wang
Online agg with subqueries
 Tan, et al., VLDB ’99
Dynamic pipeline scheduling
 Urhan/Franklin, VLDB ’01
Pipelining hash joins
 Su/Raschid, Wilschut/Apers, Tukwila, XJoin
 Relation to semi-naive evaluation
Anytime algorithms
 Zilberstein, Russell, et al.
More information?
http://control.cs.berkeley.edu : papers, Potter’s Wheel open source
Backup Slides
Sampling – Design Issues
Granularity of sample
 Instance-level (row-level): high I/O cost
 Block-level (page-level): high variability from clustering
Type of sample
 Often simple random sample (SRS), especially for on-the-fly
 With/without replacement usually not critical
Data structure from which to sample
 Files or relational tables
 Indexes (B+ trees, etc.)
Row-level Sampling Techniques
Maintain file in random order
 Sampling = scan
 Is the file initially in random order? Statistical tests needed: e.g., runs test, Smirnov test
 In DB systems: cluster via a RAND function; must “freshen” the ordering (online reorg)
On-the-fly sampling
 Via an index on a “random” column
 Else get a random page, then a row within the page
  Ex: extent-map sampling
  Problem: variable number of records per page
Acceptance/Rejection Sampling
Accept a row on page i with probability ni / nMAX
Commonly used in other settings
 E.g. sampling from joins
 E.g. sampling from indexes
[Diagram: original pages with variable rows per page vs. modified pages padded to nMAX rows]
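The acceptance/rejection scheme above can be sketched in a few lines of Python (names are illustrative): pick a page uniformly, accept a row from it with probability ni / nMAX, and retry on rejection, which makes every row equally likely regardless of page occupancy:

```python
import random

# Sketch of acceptance/rejection row sampling over variable-occupancy pages.
def sample_row(pages, n_max, rng=random.random):
    while True:
        page = pages[int(rng() * len(pages))]    # uniform page choice
        if rng() < len(page) / n_max:            # accept w.p. n_i / n_max
            return page[int(rng() * len(page))]  # uniform row within the page
```

Without the acceptance test, rows on sparsely filled pages would be oversampled relative to rows on full pages.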
Cost of Row-Level Sampling
• 100,000 pages
• 200 rows/page
Estimation for Aggregates
Point estimates
 Easy: SUM, COUNT, AVERAGE
 Hard: MAX, MIN, quantiles, distinct values
Confidence intervals – a measure of precision
Two cases: single-table and joins
Confidence Intervals
[Chart: CI length vs. sample size (0-500), shrinking toward 0]
The Good and Bad News
Good news: 1/n^(1/2) magic (n chosen on-the-fly)
Bad news: needle-in-a-haystack problem
Running confidence intervals (1)
Confidence parameter p in (0,1) is prespecified. Display a precision parameter εn such that the running aggregate Yn is within εn of the final answer μ with probability approximately p: [Yn - εn, Yn + εn] contains μ with probability approximately equal to p.
Running confidence intervals (2)
Three types to construct from n retrieved records:
 Conservative confidence intervals, based on Hoeffding’s inequality or a recent extension of it, valid for all n >= 1
 Large-sample confidence intervals, based on central limit theorems (CLTs), for n both small and large enough
 Deterministic confidence intervals, which contain μ with probability 1, only for very large n
Running confidence intervals (3)
SELECT AVG(exp) FROM R;
v(i) (1 <= i <= m): the value of exp applied to tuple i
Li: the random index of the i-th tuple retrieved from R
a and b are a priori bounds: a <= v(i) <= b for 1 <= i <= m

Conservative confidence interval equations:
 Yn = (1/n) Σ_{i=1..n} v(Li)
 μ = (1/m) Σ_{i=1..m} v(i)
 P{ |Yn - μ| <= εn } >= 1 - 2 e^(-2 n εn² / (b-a)²)
 εn = ( (b-a)² / (2n) · ln( 2 / (1-p) ) )^(1/2)
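The conservative half-width εn follows directly from inverting Hoeffding’s inequality; a minimal Python sketch (function name invented):

```python
import math

# Conservative running half-width from Hoeffding's inequality:
# eps_n = sqrt( (b-a)^2 / (2n) * ln(2 / (1-p)) )
def hoeffding_halfwidth(n, a, b, p):
    return math.sqrt((b - a) ** 2 / (2 * n) * math.log(2 / (1 - p)))
```

As expected, quadrupling n halves the interval: the 1/n^(1/2) magic again.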
Running confidence intervals (4)
Large-sample confidence interval equations. By central limit theorems (CLTs), Yn approaches a normal distribution with mean μ and variance σ²/n as n, the sample size, increases. σ² can be replaced by the estimator Tn,2(v):
 Tn,2(v) = (1/(n-1)) Σ_{i=1..n} ( v(Li) - Yn )²
 P{ |Yn - μ| <= εn } ≈ 2 Φ( εn n^(1/2) / Tn,2(v)^(1/2) ) - 1
 Φ(z_p) = (1 + p)/2
 εn = z_p ( Tn,2(v) / n )^(1/2)
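The large-sample half-width above is a one-liner given the sample variance Tn,2(v) and the normal quantile z_p; a sketch using the standard library (function name invented):

```python
import math
import statistics

# Large-sample CI half-width: eps_n = z_p * sqrt(T_{n,2}(v) / n),
# where z_p satisfies Phi(z_p) = (1 + p) / 2.
def clt_halfwidth(values, p):
    n = len(values)
    var = statistics.variance(values)               # T_{n,2}(v), sample variance
    z = statistics.NormalDist().inv_cdf((1 + p) / 2)
    return z * math.sqrt(var / n)
```

Note `statistics.variance` uses the n-1 denominator, matching the Tn,2(v) definition above.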
Estimators for SUM, COUNT and AVG
 sum = (|R| |S| / (|Rn| |Sn|)) Σ_{(r,s) in Rn x Sn} exp_p(r, s)
 count = (|R| |S| / (|Rn| |Sn|)) Σ_{(r,s) in Rn x Sn} one_p(r, s)
 avg = sum / count
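The scaled-average structure of these estimators is easy to see in code; a sketch with invented names, where `Rn`/`Sn` are the samples so far and `R_size`/`S_size` the full table cardinalities:

```python
# Running SUM/COUNT estimators: scale the aggregate over the sample
# cross-product by |R||S| / (|Rn||Sn|). AVG is their ratio.
def running_sum(Rn, Sn, R_size, S_size, expr, pred):
    scale = (R_size * S_size) / (len(Rn) * len(Sn))
    return scale * sum(expr(r, s) for r in Rn for s in Sn if pred(r, s))

def running_count(Rn, Sn, R_size, S_size, pred):
    scale = (R_size * S_size) / (len(Rn) * len(Sn))
    return scale * sum(1 for r in Rn for s in Sn if pred(r, s))
```

When the samples equal the full tables the scale factor is 1 and the estimators are exact; before that, each is the sample cross-product aggregate blown up to full size.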
Confidence intervals for ripple joins
Use central limit theorems (CLTs) to compute “large-sample” confidence intervals.
Fix the problems in classic CLTs with a newly defined σ² for different aggregate queries:
 σ² = Σ_{k=1..K} d(k) / βk
Ripple optimization: choosing aspect ratios (1)
Blocking factor α is prespecified; we want to optimize the βk’s -- the aspect-ratio parameters:
 minimize σ² = Σ_{k=1..K} d(k) / βk
 such that
  1. β1 β2 β3 … βK <= K^(-1) c (decided by animation speed)
  2. 1 <= βk <= mk/n for 1 <= k <= K
  3. β1, β2, β3, …, βK integer
Precomputation: Explicit
OLAP data cubes (drill-down hierarchies)
 MOLAP, ROLAP, HOLAP
Semantic hierarchies
 APPROXIMATE (Vrbsky, et al.)
 Query relaxation, e.g. CoBase
 Multiresolution data models (Silberschatz/Reed/Fussell)
More general materialized views
 See Gupta/Mumick’s text
Precomputation: Stat. Summaries
Histograms
 Originally for aggregation queries, many flavors
 Extended to enumeration queries recently
 Multi-dimensional histograms
Parametric estimation
 Wavelets and fractals
 Discrete cosine transform
 Regression
 Curve fitting and splines
 Singular-Value Decomposition (aka LSI, PCA)
Indexes: hierarchical histograms
 Ranking and pseudo-ranking
 Aoki’s use of GiSTs as estimators for ADTs
Data mining
 Clustering, classification, other multidimensional models
Precomputed Samples
Materialized sample views
 Olken’s original work
 Chaudhuri et al.: join samples
 Statistical inferences complicated over “recycled” samples?
Barbará’s quasi-cubes
AQUA “join synopses” on a universal relation
Maintenance issues
 AQUA’s backing samples
Can use fancier/more efficient sampling techniques
 Stratified sampling or AQUA’s “congressional” samples
Haas and Swami AFV statistics
 Combine precomputed “outliers” with on-the-fly samples