79
Online Query Processing Joseph M. Hellerstein UC Berkeley

Online Query Processing Joseph M. Hellerstein UC Berkeley

Embed Size (px)

Citation preview

Page 1: Online Query Processing Joseph M. Hellerstein UC Berkeley

Online Query Processing

Joseph M. HellersteinUC Berkeley

Page 2: Online Query Processing Joseph M. Hellerstein UC Berkeley

Road Map

Requisite tech trends and prognosticationsCONTROL of the situation: online query processing HCI motivations Example applications Example Algorithms

Reordering and Ripple Joins

Implications for “hot topics” Big data, small pipes: P2P, Sensor nets Continuous Queries over data streams

Page 3: Online Query Processing Joseph M. Hellerstein UC Berkeley

Computers Keep Getting Faster

Moore’s Law # of transistors per chip doubles

every 18 months (1965) 2x price/performance every 18 months

About right, 24 periods running

Page 4: Online Query Processing Joseph M. Hellerstein UC Berkeley

But Data Grows Faster Yet!

Source: J. Porter, Disk/Trend, Inc.http://www.disktrend.com/pdf/portrpkg.pdf

0

500

1000

1500

2000

2500

3000

3500

1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000

Year

Petabytes

Sales

Moore'sLaw

Page 5: Online Query Processing Joseph M. Hellerstein UC Berkeley

Disk Appetite, cont.

Greg Papadopoulos, CTO Sun: Disk sales doubling every 9 months

Translate: Time to process all your data doubles every 18 months MOORE’S LAW INVERTED!

Big challenge (opportunity?) for SW systems research Feel free to discount the analysis above significantly Even so, traditional “scalability” research won’t help

“Ideal” linear scaleup is NOT NEARLY ENOUGH! HW will not take us there.

Page 6: Online Query Processing Joseph M. Hellerstein UC Berkeley

Data Volume: Prognostications

Today SwipeStream

E.g. Wal-Mart >100 TB Data Warehouse ClickStream

One service generates >100GB of log per day Web

Internet Archive WayBack Machine: >100 TB Replicated OS/Apps, MP3s, etc.

Tomorrow: Sensor feeds galore Big data, small pipes: Sensor nets, P2P Continuous queries over data streams

Page 7: Online Query Processing Joseph M. Hellerstein UC Berkeley

Road Map

Requisite tech trends and prognosticationsCONTROL of the situation: online query processing HCI motivations Example applications Example Algorithms

Reordering and Ripple Joins

Implications for “hot topics” P2P query systems Sensor nets Stream Query Processing

Page 8: Online Query Processing Joseph M. Hellerstein UC Berkeley

The Latest Commercial Technology

Page 9: Online Query Processing Joseph M. Hellerstein UC Berkeley

Drawbacks of Current Systems

Only exact answers are available A losing proposition as data volume grows

HCI solution: interactive tools don’t do big jobs E.g., spreadsheet programs (64Krow limit)

Systems solution: big jobs aren’t interactive No user feedback or control in big DBMS queries

“back to the 60’s” Long processing times Fundamental mismatch with preferred modes of HCI

Best solutions to date precompute E.g. “OLAP” Don’t handle ad hoc queries or data sets well

Page 10: Online Query Processing Joseph M. Hellerstein UC Berkeley

CONTROLContinuous Output, Navigation and Transformation with Refinement On-Line

“Of all men's miseries, the bitterest is this:to know so much and have control over nothing”-- Herodotus

Requirements for CONTROL systems Early answers Refinement over time Interaction and ad-hoc control

“Crystal Ball” vs. Black Box vs. “Lucite Watch”

Page 11: Online Query Processing Joseph M. Hellerstein UC Berkeley

CONTROL Project, ‘96-’01

Main SW Artifacts Online Aggregation in SQL

Postgres, Informix, IBM DB2 Potter’s Wheel

Scalable Spreadsheet for Online Data Cleaning Interactive anomaly detection & transformation

CARMA online association rule mining 2 Online Data Viz prototypes

CLOUDS & Phoebus

Seeded current Telegraph project Adaptive dataflow for querying networked

sources

Page 12: Online Query Processing Joseph M. Hellerstein UC Berkeley

Goals for Online Processing

New “greedy” performance regime Maximize 1st derivative of the “mirth index” Mirth defined on-the-fly Therefore need FEEDBACK and CONTROL

Time

100%

OnlineTraditional

Page 13: Online Query Processing Joseph M. Hellerstein UC Berkeley

Road Map

Requisite tech trends and prognosticationsCONTROL of the situation: online query processing HCI motivations Example applications Example Algorithms

Reordering and Ripple Joins

Implications for “hot topics” P2P query systems Sensor nets Stream Query Processing

Page 14: Online Query Processing Joseph M. Hellerstein UC Berkeley

Online Aggregation

SELECT AVG(temp) FROM t GROUP BY site330K rows in table (synthetic data)the exact answer:

Courtesy Peter Haas, IBM

Page 15: Online Query Processing Joseph M. Hellerstein UC Berkeley

Online Aggregation, cont’d

A simple online aggregation interface (after 74 rows)

Courtesy Peter Haas, IBM

Page 16: Online Query Processing Joseph M. Hellerstein UC Berkeley

Online Aggregation, cont’d

After 834 rows:

Courtesy Peter Haas, IBM

Page 17: Online Query Processing Joseph M. Hellerstein UC Berkeley

Example: Online Aggregation

AdditionalFeatures:

Speed upSlow downTerminate

Page 18: Online Query Processing Joseph M. Hellerstein UC Berkeley

Online Data Visualization

In Tioga DataSplash

Page 19: Online Query Processing Joseph M. Hellerstein UC Berkeley

Online Browsing & Editing

Potter’s Wheel [VLDB 2001]Scalable spreadsheet A fraction of data is materialized in GUI widget at any

time Scrolling = preference for delivering quantiles to widget

Interactive data cleaning Direct-manipulation interface via visual algebra Transformation by example

Online structure and discrepancy detection MDL meets ADTs

Minimum Description Length [Rissinen] for modeling data Based on a grammar of user-defined Abstract Data Types

Simple API for programmers to register new ADTs

Page 20: Online Query Processing Joseph M. Hellerstein UC Berkeley

Scalable Spreadsheets

Page 21: Online Query Processing Joseph M. Hellerstein UC Berkeley

Visual Transformation Shot

Page 22: Online Query Processing Joseph M. Hellerstein UC Berkeley
Page 23: Online Query Processing Joseph M. Hellerstein UC Berkeley

Road Map

Requisite tech trends and prognosticationsCONTROL of the situation: online query processing HCI motivations Example applications Example Algorithms

Reordering and Ripple Joins

Implications for “hot topics” P2P query systems Sensor nets Stream Query Processing

Page 24: Online Query Processing Joseph M. Hellerstein UC Berkeley

Sampling w/o Replacement

We want i.i.d. samples w/o replacement At any time, the input to the query is a sample Input grows over time

Can pre-sort tables randomly And start scans from random positions

Best I/O behavior Can re-randomize incrementally in the background Note: can do this as a “secondary index”

Techniques for random sampling in DBs well studied Both from files and from indexes Some tricks here (e.g. acceptance-rejection sampling)

Page 25: Online Query Processing Joseph M. Hellerstein UC Berkeley

An Aside:Sampling in Commercial DBMSs

“Simulated” Bernoulli sampling SQL: SELECT * WHERE RAND() <= 0.01 Similar capability in SAS

Bernoulli Sampling with pre-specified rate Informix, Oracle 8i, (DB2) Ex: SELECT * FROM T1 SAMPLE ROW(10%), T2 Ex: SELECT * FROM T1 SAMPLE BLOCK(10%), T2

Not for novices Need to pre-specify precision

no feedback/control recall the “multiresolution” patterns from example

No estimators provided in current systems

Page 26: Online Query Processing Joseph M. Hellerstein UC Berkeley

Preferential Data Delivery

Why needed? Examples. Speedup/slowdown arrows Spreadsheet scrollbars Pipeline quasi-sort

Mechanisms: Juggle [Raman2/Hellerstein, VLDB ‘99]

General purpose: works with streaming data, etc. Index stride [Hellerstein/Haas/Wang, SIGMOD ‘97]

Needs index, high I/O cost, but good for outliers Mix adaptively! [Raman/Hellerstein, ICDE ‘03]

Page 27: Online Query Processing Joseph M. Hellerstein UC Berkeley

Online Reordering

Deliver “interesting” items first “Interesting” determined on the fly

Exploit rate gap between produce and process/consume

process consume

join

transmit

produce

disk

Page 28: Online Query Processing Joseph M. Hellerstein UC Berkeley

Online Reordering

Deliver “interesting” items first “Interesting” determined on the fly based on user gestures

at the interface

Exploit rate gap between produce and process/consume

ST

R

reorder process consume

join

transmit

produce

disk

Page 29: Online Query Processing Joseph M. Hellerstein UC Berkeley

Policies

Want a “good” permutation of items t1…tn to t1…tn

Quality of Feedback for a prefix t1t2

…tk

QoF(UP(t1), UP(t2

), … UP(tk )), UP = user preference

determined by application

Goodness of reordering: dQoF/dt

Delivery Priority (DP): At any given time, try to deliver tuple that improves QoF

most

time

QoF

Page 30: Online Query Processing Joseph M. Hellerstein UC Berkeley

Example: Online Aggregation

Metric: avg weighted confidence interval Preference acts as weight on confidence interval

QOF = i UPi / (ni)1/2

ni = number of tuples processed from group i

DPi = UPi / (ni)3/2

Metric: explicit proportional rates (NW QoS) QOF = i(ni — nUPi)2

where n = i ni

DPi = nUPj — nj

Explicit ranking (scrollbar) QOF = i UPi

DPi = UPi

Page 31: Online Query Processing Joseph M. Hellerstein UC Berkeley

Need: online join processing Pipelined: produce output while consuming input Statistically robust: estimates/CIs as you go

SELECT AVG(R.a * S.b) FROM R, SWHERE R.c = S.c

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx Ripple Joins [Haas/Hellerstein, SIGMOD ’99]

Page 32: Online Query Processing Joseph M. Hellerstein UC Berkeley

Need: online join processing Pipelined: produce output while consuming input Statistically robust: estimates/CIs as you go

SELECT AVG(R.a * S.b) FROM R, SWHERE R.c = S.c

Solution space Simple idea: sample a join

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx Ripple Joins [Haas/Hellerstein, SIGMOD ’99]

S

R

xx

x

x

xx

x

x

x

x x

x

x

x

x

x

Page 33: Online Query Processing Joseph M. Hellerstein UC Berkeley

Need: online join processing Pipelined: produce output while consuming input Statistically robust: estimates/CIs as you go

SELECT AVG(R.a * S.b) FROM R, SWHERE R.c = S.c

Solution space Simple idea: sample a join Better: join samples

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx Ripple Joins [Haas/Hellerstein, SIGMOD ’99]

S

Rxxxxxxxx

xxxxxxxx

xxxxxxxx

xxxxxxxx

S

R

xx

x

x

xx

x

x

x

x x

x

x

x

x

x

Page 34: Online Query Processing Joseph M. Hellerstein UC Berkeley

Need: online join processing Pipelined: produce output while consuming input Statistically robust: estimates/CIs as you go

SELECT AVG(R.a * S.b) FROM R, SWHERE R.c = S.c

Solution space Simple idea: sample a join Better: join samples

Ripple Joins Family of algorithms for joining increasing samples Optimized to shrink confidence intervals ASAP

Q: Which input to sample faster? “Aspect ratio” of ripples? A: Adapt to estimator’s “variance contribution” from different

inputs

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx Ripple Joins [Haas/Hellerstein, SIGMOD ’99]

S

Rxxxxxxxx

xxxxxxxx

xxxxxxxx

xxxxxxxx

S

R

xx

x

x

xx

x

x

x

x x

x

x

x

x

x

Page 35: Online Query Processing Joseph M. Hellerstein UC Berkeley

Traditional Nested Loops

3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 31 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 72 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

S

R

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

SELECT AVG(R.a * S.b)FROM R, SWHERE R.c = S.c

Page 36: Online Query Processing Joseph M. Hellerstein UC Berkeley

Basic Ripple Join

Simplest version read new tuples s from S and r from R join r and s join r with old S tuples join s with old R tuples

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Page 37: Online Query Processing Joseph M. Hellerstein UC Berkeley

Basic Ripple Join

Rxxxx

x

xxxxx

xx

xxxx

S

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Page 38: Online Query Processing Joseph M. Hellerstein UC Berkeley

S

Rxxxx

xxxxxxxx

xxxxxxxxxxxx

xx

xxxx

xxxxxx

Better I/O: Block Ripple Joinxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Batch IOs: = 2

Page 39: Online Query Processing Joseph M. Hellerstein UC Berkeley

S

Rxxxxxxxx

xx

xxxxxxxx

xx

xxxx

xxxxxxxx

Adapt: Rectangular Ripple Join

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

R = 2, S = 1

Page 40: Online Query Processing Joseph M. Hellerstein UC Berkeley

Ripple Joins, cont’d

Really a family of join algorithms: Block: minimizes I/O in alternating nested loops Hash: symmetric hash tables to avoid scans Index: basic index join

Adaptive aspect ratio User sets “frame rate” of animation at GUI System goal: minibatch optimizations

minimize CI width subject to “frame-rate” time constraint CI’s computable at corners, so a new frame each corner

Challenge: choose the next minibatch corner Sample from higher variance-contributing relation faster System solves linear programming problem

to do this (approximately)

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Page 41: Online Query Processing Joseph M. Hellerstein UC Berkeley

Nice and Tidy! But…

Significant details missing Systems issues:

Make it work in a re-entrant iterator model RippleJoin.get_next() ?

Multi-join queries “Corner” cases -- e.g. EOT on one input of k

Statistical matters Estimators and CIs for joins of samples

Not textbook statistics! The adaptivity issue: choosing the next corner

Based on cost (I/O model for algorithm) and benefit (CI formula)

Approximate a non-linear integer programming problem

Page 42: Online Query Processing Joseph M. Hellerstein UC Berkeley

Implementation Issues

Basically a “symmetric” nested loops “inner” and “outer” roles switch back and forth Note asymmetry in inner loops: “mitres” the box

for (ever) { for (i = 1 to max -1) // S fixed, loop on R if (predicate(R[i], S[max])) output(R[i], S[max]); for (i = 1 to max) // R fixed, loop on S if (predicate(R[max], S[i])) output(R[max], S[i]);}

x x x x

x x x xx x x x

x x x xx x x x

x

xx

xx

Page 43: Online Query Processing Joseph M. Hellerstein UC Berkeley

Now as an iterator

Most DBMSs implement query operators as iterators get_next() must re-enter and generate next

result tuple

state for square, 2-table ripple join: cursors on both inputs (as in nested loops join) target -- where’s the next “corner” currel -- which relation is currently looping

x x x x

x x x xx x x x

x x x xx x x x

xx

Page 44: Online Query Processing Joseph M. Hellerstein UC Berkeley

Fussy complications in general

Non-square aspect ratios Not just “wrapping” 1 layer on each side

must pad side i with i layers correctly

Multiple joins = pipeline of binary joins But can’t be done naively

“Wrapping” hypercubes not rectangles Given subtree of size i, factor in ’s and

Subtree’s hyperplane is: 12…ii tuples Boundary conditions

Overload control messages: EOT may mean end of sampling step, not end of table Need to signal aggregator of arrival at corner (EOT on all tables)

Also, EOTable on a relation may change “mitre-ing” rules!

ST

R

Page 45: Online Query Processing Joseph M. Hellerstein UC Berkeley

Empirical Results for Square Ripples (208 sec. Query)

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Select Online AVG(E.grade) From enroll E, student SWhere E.sid = S.sid And S.honors_code IS NULL;

Page 46: Online Query Processing Joseph M. Hellerstein UC Berkeley

Square Results (cont.)xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Page 47: Online Query Processing Joseph M. Hellerstein UC Berkeley

Statistical Matters

Need robust formulas for running estimates running CI’s

Adaptive update of aspect ratioEfficient maintenance of running statistics

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Page 48: Online Query Processing Joseph M. Hellerstein UC Berkeley

ExampleSELECT SUM(expression) FROM R, SWHERE predicateNatural estimator after n sampling steps:

Unbiased, consistent A scaled average -- hence amenable to CLT

(r,s) Rn x Sn

expressionp(r,s)|Rn| |Sn|

|R| |S|

Running Estimatorsxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Page 49: Online Query Processing Joseph M. Hellerstein UC Berkeley

Large-Sample Confidence Intervals

Review: single table R # sampling steps: 1 << n << |R| Based on Central Limit Theorem for averages of i.i.d.

observations

Complications for multiple tables Estimator for AVG is a ratio of averages expressionp(r, s) and expressionp(r, s’) are correlated!

Solution: CLT for (ratios of) cross-product averages

Based on theory of “convergence in distribution” Courtesy Peter Haas, IBM

Also “conservative” CIs that can be used for very small n Based on Hoeffding inequality

And deterministic CIs for the endgame

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Page 50: Online Query Processing Joseph M. Hellerstein UC Berkeley

General form: For K tables and a 100%p CI:

Where: zp is the area under the standard normal curve 0 ± p n is the number of tuples so far

d(k) is agg-fn dependent, computed on tuples of input dn(k) a consistent estimator by applying d(k) to samples so far the block size, a constant i the number of blocks to fetch from relation i

Tradeoff between update rate/CI shrinkage Bigger ’s = smaller CIs, slower animation

CI width =2 zpn1/2

2 = k = 1

Kd(k)

k

Large-Sample Confidence Intervals

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Page 51: Online Query Processing Joseph M. Hellerstein UC Berkeley

Cost of block ripple join after n steps: c = 12…kK-1nK + o(nK)

Cost of hash ripple after n steps: c = (1 + 2 + … + k)n

Non-linear integer programming problem:

minimize 2(1, 2, …, k,

such that1. c (1, 2, …, k, constant [animation speed]

2. 1 k |Rk|/ for 1 k K

3. 1, 2, …, k integer

= k = 1

K

d(k)

k

Aspect Ratio Selectionxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Page 52: Online Query Processing Joseph M. Hellerstein UC Berkeley

Aspect Ratio Selection

Adaptive procedure:1. Start with 1 = 2 = … = K = 12. Execute m sampling steps (m >= 1)3. Update each d(k) estimate 4. Solve optimization problem to get new ’s5. Sweep out new rectangle 6. Go to 2

Approximate optimization algorithm for ’s First ignore all constraints, minimize 2

Then scale up all so smallest i ≥ 1 (constraint 2) Scale down by ever-larger i’s until animation goal met

(constraint 1) Round down to integers (constraint 3)

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

1. c (1, 2, …, k, c

2. 1 k |Rk|/

3. 1, 2, …, k integer

Page 53: Online Query Processing Joseph M. Hellerstein UC Berkeley

Empirical Results: aspect ratios

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Select Online AVG(D.grade/A.grade) From enroll D, enroll AWhere D.college = Education And A.college = Agriculture And A.year = d.year

--> 1x6

Page 54: Online Query Processing Joseph M. Hellerstein UC Berkeley

Ripple Joins, cont’d

Prototypes in Postgres, Informix, IBM DB2Follow-on work Subqueries in Singapore [Tan, et al, VLDB 99] Scaling at Wisconsin/IBM [Luo, et al, SIGMOD 02]

Parallel and out-of-core variants Query optimization issues

Motivated eddies [SIGMOD 99] and the rest of the Telegraph project [CIDR 03]

A number of API and other systems issues Informix implementation [DMKD 2000]

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Page 55: Online Query Processing Joseph M. Hellerstein UC Berkeley

Road Map

Requisite tech trends and prognosticationsCONTROL of the situation: online query processing HCI motivations Example applications Example Algorithms

Reordering and Ripple Joins

Implications for “hot topics” P2P query systems Sensor nets Stream Query Processing

Page 56: Online Query Processing Joseph M. Hellerstein UC Berkeley

Two Hot DB Topics

Queries over networks Querying sensor nets (e.g. TinyDB) Querying in P2P networks (e.g. PIER)

Continuous queries over data streams E.g. TelegraphCQ, Aurora, STREAM,

NiagaraCQ

Natural settings for these ideas!

Page 57: Online Query Processing Joseph M. Hellerstein UC Berkeley

Queries over Networks: Big Data, Small Pipes

Given: Large Data Sets E.g. disks, traces in P2P setting E.g. the physical world in sensornets

Constraint: Small pipes P2P != cluster [IPTPS03] Sensor nets have extremely limited comm per battery

Will have long-running dataflow programs! So… Need to provide streaming, approximate answers Need to prioritize delivery a la juggle Need to decide adaptively on how to sample physical

world “Acquisitional Query Processing” in TinyDB related to both

Ripple Join and Juggle Need to consider human factors while the query runs

Page 58: Online Query Processing Joseph M. Hellerstein UC Berkeley

Continuous Queries over Streams

Hot topic of the moment Berkeley’s TelegraphCQ, OGI/Wisconsin’s

NiagaraCQ, MIT/Brown/Brandeis’ Aurora, Stanford’s STREAM

Again, natural setting for these ideas Need pipelining operators for STREAMS Interest in approximate answers during bursts And prioritized delivery Need to consider human factors!

DB community focused on the next query language Also need to focus on interactivity!

Initiation of a Continuous Query is only half the battle

Page 59: Online Query Processing Joseph M. Hellerstein UC Berkeley

CONTROL: Lessons Learned

Dream about UIs, work on systems User needs drive systems design For long-running, data-intensive apps especially

Systems and statistics intertwine Adaptive systems

All 3 go together naturally User desires and behavior: 2 more things to model,

predict “Performance” metrics need to reflect key user

needs

“What unlike things must meet and mate…”-- Art, Herman Melville

Page 60: Online Query Processing Joseph M. Hellerstein UC Berkeley

Related Work on Online QP

Morgenstein’s PhD, Berkeley ’80Online Association Rules Ng, et al’s CAP, SIGMOD ’98 Hidber’s CARMA, SIGMOD ‘99

Implications for deductive DB semantics Monotone aggregation in LDL++, Zaniolo and Wang

Online agg with subqueries Tan, et al. VLDB ’99

Dynamic Pipeline Scheduling Urhan/Franklin VLDB ’01

Pipelining Hash Joins Su/Raschid, Wilschut/Apers, Tukwila, Xjoin Relation to semi-naive evaluation

Anytime Algorithms Zilberstein, Russell, et al.

Page 61: Online Query Processing Joseph M. Hellerstein UC Berkeley

More information?

http://control.cs.berkeley.edu : Papers Potter’s Wheel open source

[email protected]

Page 62: Online Query Processing Joseph M. Hellerstein UC Berkeley

Backup Slides

Page 63: Online Query Processing Joseph M. Hellerstein UC Berkeley

Sampling – Design Issues

Granularity of sample Instance-level (row-level): high I/O cost Block-level (page-level): high variability from

clustering

Type of sample Often simple random sample (SRS)

Especially for on-the-fly With/without replacement usually not critical

Data structure from which to sample Files or relational tables Indexes (B+ trees, etc)

Page 64: Online Query Processing Joseph M. Hellerstein UC Berkeley

Row-level Sampling Techniques

Maintain file in random order Sampling = scan Is file initially in random order?

Statistical tests needed: e.g., Runs test, Smirnov test In DB systems: cluster via RAND function Must “freshen” ordering (online reorg)

On-the-fly sampling Via index on “random” column Else get random page, then row within page

Ex: extent-map sampling Problem: variable number of records on page

Page 65: Online Query Processing Joseph M. Hellerstein UC Berkeley

Acceptance/Rejection Sampling

Accept row on page i with probability = ni/nMAX

Commonly used in other settings E.g. sampling from joins E.g. sampling from indexes

r r r r r r

r r r r r

r r r r r r

r r r r r r

r r r r r r

r r r r r r

r r r r r r

Original pages Modified pages

Page 66: Online Query Processing Joseph M. Hellerstein UC Berkeley

Cost of Row-Level Sampling

• 100,000 pages

• 200 rows/page

Page 67: Online Query Processing Joseph M. Hellerstein UC Berkeley

Estimation for Aggregates

Point estimates Easy: SUM, COUNT, AVERAGE Hard: MAX, MIN, quantiles, distinct values

Confidence intervals – a measure of precision

Two cases: single-table and joins

Page 68: Online Query Processing Joseph M. Hellerstein UC Berkeley

Confidence Intervals

Page 69: Online Query Processing Joseph M. Hellerstein UC Berkeley

0

0.2

0.4

0.6

0.8

1

0 100 200 300 400 500

CI Length

Sample Size

The Good and Bad News

Good news: 1/n1/2 magic (n chosen on-the-fly)

Bad news: needle-in-a-haystack problem

Page 70: Online Query Processing Joseph M. Hellerstein UC Berkeley

Running confidence intervals (1)

Confidence parameter p(0,1) is prespecifiedDisplay a precision parameter єn such that running aggregate Yn is within єn of the final answer μ with probability approximately equal to p. [Yn- єn,Yn+ єn] contains μ with probability approximately equal to p

Page 71: Online Query Processing Joseph M. Hellerstein UC Berkeley

Running confidence intervals (2)

Three types to construct from n retrieved records: Conservative confidence intervals based on

Hoeffding’s inequality or recent extention of this inequality, for all n>=1

Large-sample confidence intervals based on central limit theorems (CLT’s), for n both small and large enough

Deterministic confidence intervals contain μ with probability 1, only for very large n

Page 72: Online Query Processing Joseph M. Hellerstein UC Berkeley

Running confidence intervals (3)

SELECT AVG(exp) FROM R;v(i) (1 i m): the value of exp when applied to tuple iLi: the random index of the ith tuple retrieved from Ra and b are a priori bounds a v(i) b for 1 i m

Conservative confidence interval equations:

2/1

)/(2

1

1

)1

2ln(

2

1)(

21}|{|

)()/1(

)()/1(

22

⎟⎟⎠

⎞⎜⎜⎝

⎛−

−=

−≥≤−

=

=

−−

=

=

∑∑

pnab

eYP

ivm

LvnY

n

abnn

m

i

n

iin

ε

εμ

με

Page 73: Online Query Processing Joseph M. Hellerstein UC Berkeley

Running confidence intervals (4)Large-sample confidence interval equationsBy central limit theorems (CLT’s) , Ynapproaches a normal distribution with a mean (μ) and a variance 2/n as n, the sample size, increases. 2can be replaced by the estimator Tn,2(v)

2/1

2,2

2/12,

2/12,

2/12,

1

12,

)(

2/)1()(

1)(

2)()(

(}|{|

))(()1()( 2

⎟⎟⎠

⎞⎜⎜⎝

⎛=

+=Φ

−⎟⎟⎠

⎞⎜⎜⎝

⎛Φ≈

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

≤)−

=≤−

−−= ∑=−

nvTz

pz

vTn

vTn

vTYn

PYnP

YLvnvT

npn

p

nnn

n

n

i nin

ε

εεμεμ

Page 74: Online Query Processing Joseph M. Hellerstein UC Berkeley

Estimators for SUM, COUNT and AVG

∑×∈

=nn SRsr

pnn

srSR

SRsum

),(

),(exp||*||

||*||

∑×∈

=nn SRsrnn

oneSR

SRcount

),(||*||

||*||

count

sumavg =

Page 75: Online Query Processing Joseph M. Hellerstein UC Berkeley

Confidence intervals for ripple joins

Use central limit theorems (CLT’s) to compute “large-sample” confidence intervalsFix the problems in classic CLT’s with newly defined 2 for different aggregate queries

∑=

=K

k k

kd

1

2 )(

αβσ

Page 76: Online Query Processing Joseph M. Hellerstein UC Berkeley

Ripple optimization:choosing aspect ratios (1)

Blocking factor is prespecified, we want to optimize k’s – the aspect-ratio parametersminimize ∑

=

=K

k k

kd

1

2 )(

αβσ

such that

12 3 ...K K-1c (decided by animation speed)

1 k mk/ for 1 k K

1,2 ,3 ,...K interger

Page 77: Online Query Processing Joseph M. Hellerstein UC Berkeley

Precomputation: Explicit

OLAP Data Cubes (drill-down hierarchies) MOLAP, ROLAP, HOLAP

Semantic hierarchies APPROXIMATE (Vrbsky, et al.) Query Relaxation, e.g. CoBase Multiresolution Data Models

(Silberschatz/Reed/Fussell)

More general materialized views See Gupta/Mumick’s text

Page 78: Online Query Processing Joseph M. Hellerstein UC Berkeley

Precomputation: Stat. Summaries

Histograms Originally for aggregation queries, many flavors Extended to enumeration queries recently Multi-dimensional histograms

Parametric estimation Wavelets and Fractals Discrete cosine transform Regression Curve fitting and splines Singular-Value Decomposition (aka LSI, PCA)

Indexes: hierarchical histograms Ranking and pseudo-ranking Aoki’s use of GiSTs as estimators for ADTs

Data Mining Clustering, classification, other multidimensional models

Page 79: Online Query Processing Joseph M. Hellerstein UC Berkeley

Precomputed Samples

Materialized sample views Olken’s original work Chaudhuri et al.: join samples Statistical inferences complicated over “recycled”

samples?

Barbará’s quasi-cubesAQUA “join synopses” on universal relationMaintenance issues AQUA’s backing samples

Can use fancier/more efficient sampling techniques Stratified sampling or AQUA’s “congressional” samples

Haas and Swami AFV statistics Combine precomputed “outliers” with on-the-fly samples