Online Query Processing Joseph M. Hellerstein UC Berkeley

Online Query Processing

Joseph M. HellersteinUC Berkeley

Road Map

Requisite tech trends and prognosticationsCONTROL of the situation: online query processing HCI motivations Example applications Example Algorithms

Reordering and Ripple Joins

Implications for “hot topics” Big data, small pipes: P2P, Sensor nets Continuous Queries over data streams

Computers Keep Getting Faster

Moore’s Law # of transistors per chip doubles

every 18 months (1965) 2x price/performance every 18 months

About right, 24 periods running

But Data Grows Faster Yet!

Source: J. Porter, Disk/Trend, Inc.http://www.disktrend.com/pdf/portrpkg.pdf

0

500

1000

1500

2000

2500

3000

3500

1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000

Year

Petabytes

Sales

Moore'sLaw

Disk Appetite, cont.

Greg Papadopoulos, CTO Sun: Disk sales doubling every 9 months

Translate: Time to process all your data doubles every 18 months MOORE’S LAW INVERTED!

Big challenge (opportunity?) for SW systems research Feel free to discount the analysis above significantly Even so, traditional “scalability” research won’t help

“Ideal” linear scaleup is NOT NEARLY ENOUGH! HW will not take us there.

Data Volume: Prognostications

Today SwipeStream

E.g. Wal-Mart >100 TB Data Warehouse ClickStream

One service generates >100GB of log per day Web

Internet Archive WayBack Machine: >100 TB Replicated OS/Apps, MP3s, etc.

Tomorrow: Sensor feeds galore Big data, small pipes: Sensor nets, P2P Continuous queries over data streams

Road Map



Implications for “hot topics” P2P query systems Sensor nets Stream Query Processing

The Latest Commercial Technology

Drawbacks of Current Systems

Only exact answers are available A losing proposition as data volume grows

HCI solution: interactive tools don’t do big jobs E.g., spreadsheet programs (64Krow limit)

Systems solution: big jobs aren’t interactive No user feedback or control in big DBMS queries

“back to the 60’s” Long processing times Fundamental mismatch with preferred modes of HCI

Best solutions to date precompute E.g. “OLAP” Don’t handle ad hoc queries or data sets well

CONTROLContinuous Output, Navigation and Transformation with Refinement On-Line

“Of all men's miseries, the bitterest is this:to know so much and have control over nothing”-- Herodotus

Requirements for CONTROL systems Early answers Refinement over time Interaction and ad-hoc control

“Crystal Ball” vs. Black Box vs. “Lucite Watch”

CONTROL Project, ‘96-’01

Main SW Artifacts Online Aggregation in SQL

Postgres, Informix, IBM DB2 Potter’s Wheel

Scalable Spreadsheet for Online Data Cleaning Interactive anomaly detection & transformation

CARMA online association rule mining 2 Online Data Viz prototypes

CLOUDS & Phoebus

Seeded current Telegraph project Adaptive dataflow for querying networked

sources

Goals for Online Processing

New “greedy” performance regime Maximize 1st derivative of the “mirth index” Mirth defined on-the-fly Therefore need FEEDBACK and CONTROL

Time

100%

OnlineTraditional

Road Map




Online Aggregation

SELECT AVG(temp) FROM t GROUP BY site330K rows in table (synthetic data)the exact answer:

Courtesy Peter Haas, IBM

Online Aggregation, cont’d

A simple online aggregation interface (after 74 rows)


Online Aggregation, cont’d

After 834 rows:


Example: Online Aggregation

AdditionalFeatures:

Speed upSlow downTerminate

Online Data Visualization

In Tioga DataSplash

Online Browsing & Editing

Potter’s Wheel [VLDB 2001]Scalable spreadsheet A fraction of data is materialized in GUI widget at any

time Scrolling = preference for delivering quantiles to widget

Interactive data cleaning Direct-manipulation interface via visual algebra Transformation by example

Online structure and discrepancy detection MDL meets ADTs

Minimum Description Length [Rissinen] for modeling data Based on a grammar of user-defined Abstract Data Types

Simple API for programmers to register new ADTs

Scalable Spreadsheets

Visual Transformation Shot

Road Map




Sampling w/o Replacement

We want i.i.d. samples w/o replacement At any time, the input to the query is a sample Input grows over time

Can pre-sort tables randomly And start scans from random positions

Best I/O behavior Can re-randomize incrementally in the background Note: can do this as a “secondary index”

Techniques for random sampling in DBs well studied Both from files and from indexes Some tricks here (e.g. acceptance-rejection sampling)

An Aside:Sampling in Commercial DBMSs

“Simulated” Bernoulli sampling SQL: SELECT * WHERE RAND() <= 0.01 Similar capability in SAS

Bernoulli Sampling with pre-specified rate Informix, Oracle 8i, (DB2) Ex: SELECT * FROM T1 SAMPLE ROW(10%), T2 Ex: SELECT * FROM T1 SAMPLE BLOCK(10%), T2

Not for novices Need to pre-specify precision

no feedback/control recall the “multiresolution” patterns from example

No estimators provided in current systems

Preferential Data Delivery

Why needed? Examples. Speedup/slowdown arrows Spreadsheet scrollbars Pipeline quasi-sort

Mechanisms: Juggle [Raman2/Hellerstein, VLDB ‘99]

General purpose: works with streaming data, etc. Index stride [Hellerstein/Haas/Wang, SIGMOD ‘97]

Needs index, high I/O cost, but good for outliers Mix adaptively! [Raman/Hellerstein, ICDE ‘03]

Online Reordering

Deliver “interesting” items first “Interesting” determined on the fly

Exploit rate gap between produce and process/consume

process consume

join

transmit

produce

disk

Online Reordering

Deliver “interesting” items first “Interesting” determined on the fly based on user gestures

at the interface

Exploit rate gap between produce and process/consume

ST

R

reorder process consume

join

transmit

produce

disk

Policies

Want a “good” permutation of items t1…tn to t1…tn

Quality of Feedback for a prefix t1t2

…tk

QoF(UP(t1), UP(t2

), … UP(tk )), UP = user preference

determined by application

Goodness of reordering: dQoF/dt

Delivery Priority (DP): At any given time, try to deliver tuple that improves QoF

most

time

QoF

Example: Online Aggregation

Metric: avg weighted confidence interval Preference acts as weight on confidence interval

QOF = i UPi / (ni)1/2

ni = number of tuples processed from group i

DPi = UPi / (ni)3/2

Metric: explicit proportional rates (NW QoS) QOF = i(ni — nUPi)2

where n = i ni

DPi = nUPj — nj

Explicit ranking (scrollbar) QOF = i UPi

DPi = UPi

Need: online join processing Pipelined: produce output while consuming input Statistically robust: estimates/CIs as you go

SELECT AVG(R.a * S.b) FROM R, SWHERE R.c = S.c

xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx Ripple Joins [Haas/Hellerstein, SIGMOD ’99]



Solution space Simple idea: sample a join



S

R

xx

x

x

xx

x

x

x

x x

x

x

x

x

x



Solution space Simple idea: sample a join Better: join samples



S

Rxxxxxxxx

xxxxxxxx

xxxxxxxx

xxxxxxxx

S

R

xx

x

x

xx

x

x

x

x x

x

x

x

x

x



Solution space Simple idea: sample a join Better: join samples

Ripple Joins Family of algorithms for joining increasing samples Optimized to shrink confidence intervals ASAP

Q: Which input to sample faster? “Aspect ratio” of ripples? A: Adapt to estimator’s “variance contribution” from different

inputs



S

Rxxxxxxxx

xxxxxxxx

xxxxxxxx

xxxxxxxx

S

R

xx

x

x

xx

x

x

x

x x

x

x

x

x

x

Traditional Nested Loops

3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 31 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 72 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

S

R


xxxxxxxxxxxx

SELECT AVG(R.a * S.b)FROM R, SWHERE R.c = S.c

Basic Ripple Join

Simplest version read new tuples s from S and r from R join r and s join r with old S tuples join s with old R tuples


xxxxxxxxxxxx

Basic Ripple Join

Rxxxx

x

xxxxx

xx

xxxx

S


xxxxxxxxxxxx

S

Rxxxx

xxxxxxxx

xxxxxxxxxxxx

xx

xxxx

xxxxxx

Better I/O: Block Ripple Joinxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Batch IOs: = 2

S

Rxxxxxxxx

xx

xxxxxxxx

xx

xxxx

xxxxxxxx

Adapt: Rectangular Ripple Join


xxxxxxxxxxxx

R = 2, S = 1

Ripple Joins, cont’d

Really a family of join algorithms: Block: minimizes I/O in alternating nested loops Hash: symmetric hash tables to avoid scans Index: basic index join

Adaptive aspect ratio User sets “frame rate” of animation at GUI System goal: minibatch optimizations

minimize CI width subject to “frame-rate” time constraint CI’s computable at corners, so a new frame each corner

Challenge: choose the next minibatch corner Sample from higher variance-contributing relation faster System solves linear programming problem

to do this (approximately)


xxxxxxxxxxxx

Nice and Tidy! But…

Significant details missing Systems issues:

Make it work in a re-entrant iterator model RippleJoin.get_next() ?

Multi-join queries “Corner” cases -- e.g. EOT on one input of k

Statistical matters Estimators and CIs for joins of samples

Not textbook statistics! The adaptivity issue: choosing the next corner

Based on cost (I/O model for algorithm) and benefit (CI formula)

Approximate a non-linear integer programming problem

Implementation Issues

Basically a “symmetric” nested loops “inner” and “outer” roles switch back and forth Note asymmetry in inner loops: “mitres” the box

for (ever) { for (i = 1 to max -1) // S fixed, loop on R if (predicate(R[i], S[max])) output(R[i], S[max]); for (i = 1 to max) // R fixed, loop on S if (predicate(R[max], S[i])) output(R[max], S[i]);}

x x x x

x x x xx x x x

x x x xx x x x

x

xx

xx

Now as an iterator

Most DBMSs implement query operators as iterators get_next() must re-enter and generate next

result tuple

state for square, 2-table ripple join: cursors on both inputs (as in nested loops join) target -- where’s the next “corner” currel -- which relation is currently looping

x x x x

x x x xx x x x

x x x xx x x x

xx

Fussy complications in general

Non-square aspect ratios Not just “wrapping” 1 layer on each side

must pad side i with i layers correctly

Multiple joins = pipeline of binary joins But can’t be done naively

“Wrapping” hypercubes not rectangles Given subtree of size i, factor in ’s and

Subtree’s hyperplane is: 12…ii tuples Boundary conditions

Overload control messages: EOT may mean end of sampling step, not end of table Need to signal aggregator of arrival at corner (EOT on all tables)

Also, EOTable on a relation may change “mitre-ing” rules!

ST

R

Empirical Results for Square Ripples (208 sec. Query)


xxxxxxxxxxxx

Select Online AVG(E.grade) From enroll E, student SWhere E.sid = S.sid And S.honors_code IS NULL;

Square Results (cont.)xxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Statistical Matters

Need robust formulas for running estimates running CI’s

Adaptive update of aspect ratioEfficient maintenance of running statistics


xxxxxxxxxxxx

ExampleSELECT SUM(expression) FROM R, SWHERE predicateNatural estimator after n sampling steps:

Unbiased, consistent A scaled average -- hence amenable to CLT

(r,s) Rn x Sn

expressionp(r,s)|Rn| |Sn|

|R| |S|

Running Estimatorsxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Large-Sample Confidence Intervals

Review: single table R # sampling steps: 1 << n << |R| Based on Central Limit Theorem for averages of i.i.d.

observations

Complications for multiple tables Estimator for AVG is a ratio of averages expressionp(r, s) and expressionp(r, s’) are correlated!

Solution: CLT for (ratios of) cross-product averages

Based on theory of “convergence in distribution” Courtesy Peter Haas, IBM

Also “conservative” CIs that can be used for very small n Based on Hoeffding inequality

And deterministic CIs for the endgame


xxxxxxxxxxxx

General form: For K tables and a 100%p CI:

Where: zp is the area under the standard normal curve 0 ± p n is the number of tuples so far

d(k) is agg-fn dependent, computed on tuples of input dn(k) a consistent estimator by applying d(k) to samples so far the block size, a constant i the number of blocks to fetch from relation i

Tradeoff between update rate/CI shrinkage Bigger ’s = smaller CIs, slower animation

CI width =2 zpn1/2

2 = k = 1

Kd(k)

k

Large-Sample Confidence Intervals


xxxxxxxxxxxx

Cost of block ripple join after n steps: c = 12…kK-1nK + o(nK)

Cost of hash ripple after n steps: c = (1 + 2 + … + k)n

Non-linear integer programming problem:

minimize 2(1, 2, …, k,

such that1. c (1, 2, …, k, constant [animation speed]

2. 1 k |Rk|/ for 1 k K

3. 1, 2, …, k integer

= k = 1

K

d(k)

k

Aspect Ratio Selectionxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxx

Aspect Ratio Selection

Adaptive procedure:1. Start with 1 = 2 = … = K = 12. Execute m sampling steps (m >= 1)3. Update each d(k) estimate 4. Solve optimization problem to get new ’s5. Sweep out new rectangle 6. Go to 2

Approximate optimization algorithm for ’s First ignore all constraints, minimize 2

Then scale up all so smallest i ≥ 1 (constraint 2) Scale down by ever-larger i’s until animation goal met

(constraint 1) Round down to integers (constraint 3)


xxxxxxxxxxxx

1. c (1, 2, …, k, c

2. 1 k |Rk|/

3. 1, 2, …, k integer

Empirical Results: aspect ratios


xxxxxxxxxxxx

Select Online AVG(D.grade/A.grade) From enroll D, enroll AWhere D.college = Education And A.college = Agriculture And A.year = d.year

--> 1x6

Ripple Joins, cont’d

Prototypes in Postgres, Informix, IBM DB2Follow-on work Subqueries in Singapore [Tan, et al, VLDB 99] Scaling at Wisconsin/IBM [Luo, et al, SIGMOD 02]

Parallel and out-of-core variants Query optimization issues

Motivated eddies [SIGMOD 99] and the rest of the Telegraph project [CIDR 03]

A number of API and other systems issues Informix implementation [DMKD 2000]


xxxxxxxxxxxx

Road Map




Two Hot DB Topics

Queries over networks Querying sensor nets (e.g. TinyDB) Querying in P2P networks (e.g. PIER)

Continuous queries over data streams E.g. TelegraphCQ, Aurora, STREAM,

NiagaraCQ

Natural settings for these ideas!

Queries over Networks: Big Data, Small Pipes

Given: Large Data Sets E.g. disks, traces in P2P setting E.g. the physical world in sensornets

Constraint: Small pipes P2P != cluster [IPTPS03] Sensor nets have extremely limited comm per battery

Will have long-running dataflow programs! So… Need to provide streaming, approximate answers Need to prioritize delivery a la juggle Need to decide adaptively on how to sample physical

world “Acquisitional Query Processing” in TinyDB related to both

Ripple Join and Juggle Need to consider human factors while the query runs

Continuous Queries over Streams

Hot topic of the moment Berkeley’s TelegraphCQ, OGI/Wisconsin’s

NiagaraCQ, MIT/Brown/Brandeis’ Aurora, Stanford’s STREAM

Again, natural setting for these ideas Need pipelining operators for STREAMS Interest in approximate answers during bursts And prioritized delivery Need to consider human factors!

DB community focused on the next query language Also need to focus on interactivity!

Initiation of a Continuous Query is only half the battle

CONTROL: Lessons Learned

Dream about UIs, work on systems User needs drive systems design For long-running, data-intensive apps especially

Systems and statistics intertwine Adaptive systems

All 3 go together naturally User desires and behavior: 2 more things to model,

predict “Performance” metrics need to reflect key user

needs

“What unlike things must meet and mate…”-- Art, Herman Melville

Related Work on Online QP

Morgenstein’s PhD, Berkeley ’80Online Association Rules Ng, et al’s CAP, SIGMOD ’98 Hidber’s CARMA, SIGMOD ‘99

Implications for deductive DB semantics Monotone aggregation in LDL++, Zaniolo and Wang

Online agg with subqueries Tan, et al. VLDB ’99

Dynamic Pipeline Scheduling Urhan/Franklin VLDB ’01

Pipelining Hash Joins Su/Raschid, Wilschut/Apers, Tukwila, Xjoin Relation to semi-naive evaluation

Anytime Algorithms Zilberstein, Russell, et al.

More information?

http://control.cs.berkeley.edu : Papers Potter’s Wheel open source

[email protected]

Backup Slides

Sampling – Design Issues

Granularity of sample Instance-level (row-level): high I/O cost Block-level (page-level): high variability from

clustering

Type of sample Often simple random sample (SRS)

Especially for on-the-fly With/without replacement usually not critical

Data structure from which to sample Files or relational tables Indexes (B+ trees, etc)

Row-level Sampling Techniques

Maintain file in random order Sampling = scan Is file initially in random order?

Statistical tests needed: e.g., Runs test, Smirnov test In DB systems: cluster via RAND function Must “freshen” ordering (online reorg)

On-the-fly sampling Via index on “random” column Else get random page, then row within page

Ex: extent-map sampling Problem: variable number of records on page

Acceptance/Rejection Sampling

Accept row on page i with probability = ni/nMAX

Commonly used in other settings E.g. sampling from joins E.g. sampling from indexes

r r r r r r

r r r r r

r r r r r r

r r r r r r

r r r r r r

r r r r r r

r r r r r r

Original pages Modified pages

Cost of Row-Level Sampling

• 100,000 pages

• 200 rows/page

Estimation for Aggregates

Point estimates Easy: SUM, COUNT, AVERAGE Hard: MAX, MIN, quantiles, distinct values

Confidence intervals – a measure of precision

Two cases: single-table and joins

Confidence Intervals

0

0.2

0.4

0.6

0.8

1

0 100 200 300 400 500

CI Length

Sample Size

The Good and Bad News

Good news: 1/n1/2 magic (n chosen on-the-fly)

Bad news: needle-in-a-haystack problem

Running confidence intervals (1)

Confidence parameter p(0,1) is prespecifiedDisplay a precision parameter єn such that running aggregate Yn is within єn of the final answer μ with probability approximately equal to p. [Yn- єn,Yn+ єn] contains μ with probability approximately equal to p


Three types to construct from n retrieved records: Conservative confidence intervals based on

Hoeffding’s inequality or recent extention of this inequality, for all n>=1

Large-sample confidence intervals based on central limit theorems (CLT’s), for n both small and large enough

Deterministic confidence intervals contain μ with probability 1, only for very large n


SELECT AVG(exp) FROM R;v(i) (1 i m): the value of exp when applied to tuple iLi: the random index of the ith tuple retrieved from Ra and b are a priori bounds a v(i) b for 1 i m

Conservative confidence interval equations:

2/1

)/(2

1

1

)1

2ln(

2

1)(

21}|{|

)()/1(

)()/1(

22

⎟⎟⎠

⎞⎜⎜⎝

⎛−

−=

−≥≤−

=

=

−−

=

=

∑∑

pnab

eYP

ivm

LvnY

n

abnn

m

i

n

iin

ε

εμ

με

Running confidence intervals (4)Large-sample confidence interval equationsBy central limit theorems (CLT’s) , Ynapproaches a normal distribution with a mean (μ) and a variance 2/n as n, the sample size, increases. 2can be replaced by the estimator Tn,2(v)

2/1

2,2

2/12,

2/12,

2/12,

1

12,

)(

2/)1()(

1)(

2)()(

(}|{|

))(()1()( 2

⎟⎟⎠

⎞⎜⎜⎝

⎛=

+=Φ

−⎟⎟⎠

⎞⎜⎜⎝

⎛Φ≈

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

≤)−

=≤−

−−= ∑=−

nvTz

pz

vTn

vTn

vTYn

PYnP

YLvnvT

npn

p

nnn

n

n

i nin

ε

εεμεμ

Estimators for SUM, COUNT and AVG

∑×∈

=nn SRsr

pnn

srSR

SRsum

),(

),(exp||*||

||*||

∑×∈

=nn SRsrnn

oneSR

SRcount

),(||*||

||*||

count

sumavg =

Confidence intervals for ripple joins

Use central limit theorems (CLT’s) to compute “large-sample” confidence intervalsFix the problems in classic CLT’s with newly defined 2 for different aggregate queries

∑=

=K

k k

kd

1

2 )(

αβσ

Ripple optimization:choosing aspect ratios (1)

Blocking factor is prespecified, we want to optimize k’s – the aspect-ratio parametersminimize ∑

=

=K

k k

kd

1

2 )(

αβσ

such that

12 3 ...K K-1c (decided by animation speed)

1 k mk/ for 1 k K

1,2 ,3 ,...K interger

Precomputation: Explicit

OLAP Data Cubes (drill-down hierarchies) MOLAP, ROLAP, HOLAP

Semantic hierarchies APPROXIMATE (Vrbsky, et al.) Query Relaxation, e.g. CoBase Multiresolution Data Models

(Silberschatz/Reed/Fussell)

More general materialized views See Gupta/Mumick’s text

Precomputation: Stat. Summaries

Histograms Originally for aggregation queries, many flavors Extended to enumeration queries recently Multi-dimensional histograms

Parametric estimation Wavelets and Fractals Discrete cosine transform Regression Curve fitting and splines Singular-Value Decomposition (aka LSI, PCA)

Indexes: hierarchical histograms Ranking and pseudo-ranking Aoki’s use of GiSTs as estimators for ADTs

Data Mining Clustering, classification, other multidimensional models

Precomputed Samples

Materialized sample views Olken’s original work Chaudhuri et al.: join samples Statistical inferences complicated over “recycled”

samples?

Barbará’s quasi-cubesAQUA “join synopses” on universal relationMaintenance issues AQUA’s backing samples

Can use fancier/more efficient sampling techniques Stratified sampling or AQUA’s “congressional” samples

Haas and Swami AFV statistics Combine precomputed “outliers” with on-the-fly samples

Documents

Online Query Processing Joseph M. Hellerstein UC Berkeley