
CONTROL Overview


Page 1: CONTROL Overview

CONTROL Overview

CONTROL group: Joe Hellerstein, Ron Avnur, Christian Hidber, Bruce Lo, Chris Olston, Vijayshankar Raman, Tali Roth, Kirk Wylie (UC Berkeley); Peter Haas (IBM Almaden)

Page 2: CONTROL Overview

Context (wild assertions)

• Value from information
  – The pressing problem in CS (?) (!!)

• “Point” querying and data management is a solved problem
  – at least for traditional data (business data, documents)

• “Big picture” analysis still hard

Page 3: CONTROL Overview

Data Analysis c. 1998

• Complex: people using many tools
  – SQL aggregation (decision support systems, OLAP)
  – AI-style WYGIWIGY systems (e.g. data mining, IR)

• Both are black boxes
  – users must iterate to get what they want
  – batch processing (big picture = big wait)

• We are failing important users!
  – decision support is for decision-makers!
  – a black box is the world’s worst UI

Page 4: CONTROL Overview

Black Box Begone!

• Black boxes are bad
  – cannot be observed while running
  – cannot be controlled while running

• These tools can be very slow
  – exacerbates the previous problems

• Thesis:
  – there will always be slow, usually data-intensive, computer programs
  – the fundamental issue: looking into the box...

Page 5: CONTROL Overview

Crystal Balls

• Allow users to observe processing
  – as opposed to “lucite watches”

• Allow users to predict the future

• Ideally, allow users to change the future
  – online control of processing

• The CONTROL project:
  – online delivery, estimation, and control for data-intensive processes

Page 6: CONTROL Overview

Performance Regime for CONTROL

• Online performance:
  – maximize the 1st derivative of the “mirth index”

[Figure: fraction of useful answer vs. time; a CONTROL system delivers useful results almost immediately and refines toward 100%, while a traditional batch system delivers nothing until it completes]

Page 7: CONTROL Overview

Examples

• Online Aggregation
  – Informix Dynamic Server
    • enhanced by UC Berkeley students with CONTROL algorithms
    • lots of algorithmics, many fussy end-to-end system issues [Avnur, Hellerstein, Raman DMKD ’00]
  – IBM has an ongoing project to do this in DB2
  – IBM buys Informix (4/01)

• Online Visualization
  – visual enumeration & aggregation

• Interactive data cleaning & analysis
  – Potter’s Wheel ABC
  – online “enumeration” and discrepancy detection

Page 8: CONTROL Overview

Example: Online Aggregation

SELECT AVG(gpa) FROM students GROUP BY college
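
For intuition, a minimal sketch of how such a query can be answered online (not the Informix implementation; the 95% level, the synthetic data, and all names are illustrative): tuples arrive in random order, and per-group running averages plus CLT-style confidence intervals are reported as processing proceeds.

import math, random
from collections import defaultdict

Z_95 = 1.96  # normal quantile for a ~95% large-sample (CLT) interval

def online_avg(tuples, report_every=2000):
    """tuples: iterable of (group, value) pairs, assumed to arrive in random order."""
    n = defaultdict(int)      # tuples seen per group
    s = defaultdict(float)    # running sum per group
    ss = defaultdict(float)   # running sum of squares per group
    for i, (g, v) in enumerate(tuples, 1):
        n[g] += 1; s[g] += v; ss[g] += v * v
        if i % report_every == 0:
            for g2 in sorted(n):
                mean = s[g2] / n[g2]
                var = max(ss[g2] / n[g2] - mean * mean, 0.0)
                half = Z_95 * math.sqrt(var / n[g2])   # half-width shrinks like sigma/sqrt(n)
                print(f"after {i} tuples  {g2}: AVG(gpa) = {mean:.3f} +/- {half:.3f}")

# usage on synthetic (college, gpa) pairs
data = [(random.choice("ABCDE"), random.gauss(3.0, 0.5)) for _ in range(10_000)]
online_avg(data)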

Page 9: CONTROL Overview

Example: Online Data Visualization

• In Tioga DataSplash

Page 10: CONTROL Overview

Visual Transformation Shot

Page 11: CONTROL Overview
Page 12: CONTROL Overview

Scalable Spreadsheets

Page 13: CONTROL Overview

Decision-Support in DBMSs

• Aggregation queries
  – compute a set of qualifying records
  – partition the set into groups
  – compute aggregation functions on the groups
  – e.g.:

    Select college, AVG(grade)
    From ENROLL
    Group By college;

Page 14: CONTROL Overview

Interactive Decision Support?

• Precomputation
  – the typical “OLAP” approach (a.k.a. Data Cubes)
  – doesn’t scale, no ad hoc analysis
  – blindingly fast when it works

• Sampling
  – makes real people nervous?
  – no ad hoc precision
    • sample is chosen in advance
    • can’t vary statistical requirements
  – per-query granularity only

Page 15: CONTROL Overview

Online Aggregation

• Think “progressive” sampling
  – a la images in a web browser
  – good estimates quickly, improving over time

• Shift in performance goals
  – online mirth index

• Shift in the science
  – UI emphasis drives system design
  – leads to different data delivery, result estimation
  – motivates online control

Page 16: CONTROL Overview

Not everything can be CONTROLed

• “Needle in haystack” scenarios
  – the nemesis of any sampling approach
  – e.g. highly selective queries, MIN, MAX, MEDIAN

• Not useless, though
  – unlike presampling, users can get some info (e.g. max-so-far)

• We advocate a mixed approach
  – explore the big picture with online processing
  – when you drill down to the needles, or want full precision, go batch-style
  – can do both in parallel

Page 17: CONTROL Overview

New Techniques

• Online Reordering
  – gives control of group delivery rates
  – applicable outside the RDBMS setting

• Ripple Join family of join algorithms
  – comes in naïve, block & hash flavors

• Statistical estimators & confidence intervals
  – for single-table & multi-table queries
  – for AVG, SUM, COUNT, STDEV
  – leave it to Peter

• Visual estimators & analysis

Page 18: CONTROL Overview

Online Reordering

[Figure: query dataflows over relations R, S, and T with a reorder operator inserted into the pipeline]

• users perceive data being processed over time
  – prioritize processing for “interesting” tuples
  – interest based on user-specified preferences

• reorder the dataflow so that interesting tuples go first

• encapsulate reordering as a pipelined dataflow operator

Page 19: CONTROL Overview

Context: an application of reordering

• online aggregation
  – for SQL aggregate queries, give gradually improving estimates with confidence intervals
  – allow users to speed up estimate refinement for groups of interest
  – prioritize processing at a per-group granularity

SELECT AVG(gpa) FROM students GROUP BY college

Page 20: CONTROL Overview

Framework for Online Reordering

• want no delay in processing; in general, reordering can only be best-effort

• typically process/consume is slower than produce
  – exploit the throughput difference to reorder

• two aspects
  – mechanism for best-effort reordering
  – reordering policy

[Figure: produce → reorder → process → consume pipeline; reorder permutes the produced stream (e.g. acddbadb...) toward a preferred order (e.g. abcdabc...) according to a user-interest function f(t); process/consume may include a network transfer]

Page 21: CONTROL Overview

Juggle mechanism for reordering

• two threads: prefetch from the input; spool to / enrich from an auxiliary side disk

• juggle data between the buffer and the side disk
  – keep the buffer full of “interesting” items
  – getNext chooses the best item currently in the buffer

• getNext and enrich/spool decisions are based on the reordering policy

• side-disk management
  – hash index, populated in a way that postpones random I/O

[Figure: produce feeds a prefetch thread that fills the buffer; a spool/enrich thread moves items between the buffer and the side disk; process/consume pulls items via getNext]
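
A single-threaded toy sketch of the juggle idea (the real operator uses two threads and a hash-indexed side disk; the in-memory stand-in for the side disk, the UPi / ni^(3/2) policy borrowed from the later QOF slide, and all names are illustrative):

from collections import defaultdict, deque

class Juggle:
    """Toy best-effort reordering between a fast producer and a slower consumer."""
    def __init__(self, source, prefs, buffer_cap=100):
        self.source = iter(source)         # produce side: yields (group, value) pairs
        self.prefs = prefs                 # user preference UP_i per group
        self.buffer = defaultdict(deque)   # small in-memory buffer, bucketed by group
        self.side = defaultdict(deque)     # stand-in for the hash-indexed side disk
        self.in_buffer = 0
        self.cap = buffer_cap
        self.delivered = defaultdict(int)  # n_i: tuples delivered per group

    def _prefetch(self, batch=16):
        # pull a batch from the producer; whatever does not fit in the buffer is spooled
        for _ in range(batch):
            try:
                g, v = next(self.source)
            except StopIteration:
                return
            if self.in_buffer < self.cap:
                self.buffer[g].append(v)
                self.in_buffer += 1
            else:
                self.side[g].append(v)     # spool

    def get_next(self):
        self._prefetch()
        live = [g for g in set(self.buffer) | set(self.side)
                if self.buffer[g] or self.side[g]]
        if not live:
            return None
        # policy: deliver from the group with the largest UP_i / n_i^(3/2)
        g = max(live, key=lambda x: self.prefs.get(x, 1.0) / (self.delivered[x] + 1) ** 1.5)
        if not self.buffer[g]:
            self.buffer[g].append(self.side[g].popleft())   # enrich from the side disk
            self.in_buffer += 1
        v = self.buffer[g].popleft()
        self.in_buffer -= 1
        self.delivered[g] += 1
        return g, v

A consumer loop would call get_next() repeatedly and feed each (group, value) pair into running aggregates like those in the earlier online-AVG sketch.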

Page 22: CONTROL Overview

Reordering policies

• quality of feedback for a prefix t1 t2 … tk:
  QOF(UP(t1), UP(t2), …, UP(tk)), where UP = user preference
  – determined by the application

• goodness of reordering: dQOF/dt

• implication for the juggle mechanism
  – process gets the item from the buffer that increases QOF the most
  – juggle tries to maintain a buffer full of such items

GOAL: a “good” permutation of the items t1…tn

[Figure: QOF vs. time; a good reordering raises QOF faster]

Page 23: CONTROL Overview

QOF in Online Aggregation

• average weighted confidence interval

• preference acts as a weight on the confidence interval
  (Recall from the Central Limit Theorem that the sample mean’s confidence-interval half-width is proportional to σ/√n. Conservative (Hoeffding) confidence intervals also have a √n in the denominator. So…)

  QOF = Σi UPi / √ni ,  where ni = number of tuples processed from group i

  ⇒ process pulls items from the group with the largest UPi / (ni √ni)

  ⇒ desired ratio of group-i tuples in the buffer = UPi^(2/3) / Σj UPj^(2/3)
  – juggle tries to maintain this ratio by enrich/spool
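
A small worked instance of this policy (the preferences and counts below are made up for illustration):

# Hypothetical preferences and progress for three groups
UP = {"A": 1.0, "B": 1.0, "C": 4.0}   # user preferences
n  = {"A": 400, "B": 100, "C": 100}   # tuples processed so far per group

# process pulls next from the group with the largest UP_i / n_i^(3/2)
pull = max(UP, key=lambda g: UP[g] / n[g] ** 1.5)
print(pull)          # "C": high preference, relatively few tuples processed

# desired buffer composition: UP_i^(2/3), normalized over all groups
total = sum(u ** (2 / 3) for u in UP.values())
ratios = {g: round(u ** (2 / 3) / total, 3) for g, u in UP.items()}
print(ratios)        # C gets roughly 2.5x the buffer share of A or B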

Page 24: CONTROL Overview

Other QOF functions

• rate of processing for a group ∝ its preference
  – QOF = Σi (ni − n·UPi)²  (variance from the ideal proportions; n = total tuples processed)

  ⇒ process pulls items from the group with the largest (n·UPi − ni)

  ⇒ desired ratio of group-i tuples in the buffer = UPi
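
A matching worked instance of the proportional-rate policy (again with made-up numbers):

# Hypothetical preferences (summing to 1) and per-group progress
UP = {"A": 0.2, "B": 0.2, "C": 0.6}
n  = {"A": 300, "B": 250, "C": 450}
total = sum(n.values())              # n = 1000 tuples processed so far

# process pulls next from the group farthest behind its ideal share n * UP_i
pull = max(UP, key=lambda g: total * UP[g] - n[g])
print(pull)                          # "C": ideal share is 600, only 450 processed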

Page 25: CONTROL Overview

Results: Reordering in Online Aggregation

• implemented in the Informix UDO server
• experiments with modified TPC-D queries
• questions:
  – how much throughput difference is needed for reordering?
  – can the reorderer handle skewed data?

• one stress test: skew, very small processing cost
  – index-only join
  – 5 order priorities with a Zipf data distribution:

      A    B    C    D    E
      1    1/2  1/3  1/4  1/5

[Figure: query plan — scan of order → juggle → index lookup on lineitem → process/consume]

SELECT AVG(o_totalprice), o_orderpriority
FROM order
WHERE EXISTS (SELECT * FROM lineitem WHERE l_orderkey = o_orderkey)
GROUP BY o_orderpriority

Page 26: CONTROL Overview

Performance results

  preference     A    B    C    D    E
  initial        1    1    1    5    3
  after T1       1    1    3.5  0.5  1

• 3 times faster for interesting groups
• 2% completion-time overhead

[Figures: # tuples processed vs. time and confidence-interval width vs. time, per group (curves labeled A, C, E)]

Page 27: CONTROL Overview

Ripple Joins

• Good confidence intervals for joins of samples
  – vs. samples of joins!
  – requires a “Cross-Product CLT”

• Progressively refining join:
  – ever-larger rectangles in R × S
  – we can update confidence intervals at the “corners”
  – comes in loop, index and hash flavors

• Benefits:
  – sample from both relations simultaneously
  – “animation rate”:
    • a goal for the next “corner” determines an optimization problem based on observations so far
    • old-fashioned systems are one extreme
  – adaptively tune the “aspect ratio” for the next “corner”
    • sample faster from the higher-variance relation
  – intimate relationship between delivery and estimation

[Figure: a traditional join sweeps the R × S rectangle one row at a time; a ripple join grows ever-larger “corners” covering both R and S]

Haas & Hellerstein, SIGMOD 99
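
A toy sketch of the square (“naive”) ripple join with a running scale-up estimate of the join’s size (aspect-ratio tuning, the block/hash variants, and the confidence-interval machinery of the SIGMOD ’99 paper are omitted; the data and predicate are illustrative):

import random

def square_ripple_count(R, S, pred, report_every=100):
    """Estimate |R join S| under predicate pred, refining the estimate as we sample."""
    random.shuffle(R); random.shuffle(S)    # assume random-order access to both inputs
    seen_r, seen_s = [], []
    matches = 0
    for k in range(max(len(R), len(S))):
        if k < len(R):                      # new r tuple joins against every s seen so far
            r = R[k]
            matches += sum(pred(r, s) for s in seen_s)
            seen_r.append(r)
        if k < len(S):                      # new s tuple joins against every r seen (incl. the new one)
            s = S[k]
            matches += sum(pred(r2, s) for r2 in seen_r)
            seen_s.append(s)
        if (k + 1) % report_every == 0:
            # scale the count in the sampled rectangle up to the full cross product
            scale = (len(R) * len(S)) / (len(seen_r) * len(seen_s))
            print(f"corner {k + 1}: estimated join size ~ {matches * scale:,.0f}")

# usage: an equi-join on a shared key
R = [{"k": random.randint(0, 99)} for _ in range(2000)]
S = [{"k": random.randint(0, 99)} for _ in range(2000)]
square_ripple_count(R, S, lambda r, s: r["k"] == s["k"])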

Page 28: CONTROL Overview

Aspect Ratios

• Consider an extreme example:

• In general, to get to the next corner:
  – need a cost model parameterized by relation
    • different for block and hash variants
  – “benefit”: change in the confidence interval
  – an online linear optimization problem

• Arguments about estimates converging quickly, stabilizing…
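
A toy illustration of the cost/benefit trade-off (the surrogate “benefit” below, a CLT-style variance term, and all numbers are assumptions, not the paper’s estimator): given per-relation fetch costs and rough per-relation variance contributions, choose how much of each input to sample before the next corner within a fixed time budget.

# Hypothetical per-tuple fetch costs and variance contributions for R and S
cost = {"R": 1.0, "S": 1.0}
var  = {"R": 9.0, "S": 4.0}          # R is the higher-variance relation
n    = {"R": 100, "S": 100}          # tuples sampled so far
budget = 100.0                        # time allowed before the next corner

# Enumerate candidate splits of the budget and pick the one that most shrinks
# a surrogate interval term: var_R / n_R + var_S / n_S
best = None
for dR in range(0, int(budget // cost["R"]) + 1):
    dS = int((budget - dR * cost["R"]) // cost["S"])
    width = var["R"] / (n["R"] + dR) + var["S"] / (n["S"] + dS)
    if best is None or width < best[0]:
        best = (width, dR, dS)
print(best)   # ~ (0.083, 80, 20): sample the higher-variance relation R faster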

Page 29: CONTROL Overview

Fussy Implementation Details

• How to implement as an iterator? Issues:
  – need cursors on all inputs (as usual)
  – need to maintain aspect ratios
  – need to maintain the current “inner” & its cursor
    • i.e. the relation currently being scanned
  – need to know the current sampling step
    • to know how far to scan the current “inner”
  – need to know the “starter” for the next step
    • determines the length of the scan (see pic) and the end of the sampling step
    • and passes that role along at EOF

Page 30: CONTROL Overview

Ensuring Aspect Ratios

Page 31: CONTROL Overview

Ripple Join Performance

• Too lazy to fetch graphs, but…
  – typical orders-of-magnitude benefit vs. batch…

Page 32: CONTROL Overview

CONTROL Lessons

• Dream about UIs, work on systems
  – user needs drive systems design!

• Systems and statistics intertwine
  – “what unlike things must meet and mate”
    • Art, Herman Melville

• Sloppy, adaptive systems are a promising direction

Page 33: CONTROL Overview

Questions

• Where else do these lessons apply?
  – outside of data analysis and manipulation

• Systems people think a lot about interfaces (APIs)…
  – encapsulation, narrow interfaces…
  – in the CONTROL regime, how do you design these APIs and build systems?

• Ubiquitous computing:
  – is it about portable computing and point access/delivery?
  – or sensors/actuators, dataflow, big-picture queries?

Page 34: CONTROL Overview

More?

• CONTROL: http://control.cs.berkeley.edu
  – overview: IEEE Computer, 8/99

• Telegraph: http://db.cs.berkeley.edu/telegraph

Page 35: CONTROL Overview

Backup slides

• The following slides may be used to answer questions...

Page 36: CONTROL Overview

Sampling

• Much is known here
  – Olken’s thesis
  – the DB sampling literature
  – more recent work by Peter Haas

• Progressive random sampling
  – can use a randomized access method (watch out for duplicates!)
  – can maintain the file in random order
  – can verify statistically that values are independent of the stored order

Page 37: CONTROL Overview

Estimators & Confidence Intervals

• Conservative confidence intervals
  – extensions of Hoeffding’s inequality
  – appropriate early on; give wide intervals

• Large-sample confidence intervals
  – use the Central Limit Theorem
  – appropriate after “a while” (~dozens of tuples)
  – linear memory consumption
  – tight bounds

• Deterministic intervals
  – only useful in “the endgame”
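
A small numeric illustration of how the two interval families compare for a running AVG (the [0, 4] value bounds, the 95% level, and the uniform test data are assumptions; the CONTROL estimators are more refined):

import math, random

def hoeffding_half_width(n, lo, hi, delta=0.05):
    # conservative: holds for any distribution bounded in [lo, hi]
    return (hi - lo) * math.sqrt(math.log(2 / delta) / (2 * n))

def clt_half_width(n, sample_std, z=1.96):
    # large-sample: approximate 95% interval once n reaches a few dozen tuples
    return z * sample_std / math.sqrt(n)

random.seed(0)
values = [random.uniform(0.0, 4.0) for _ in range(10_000)]   # e.g. GPAs in [0, 4]
for n in (10, 100, 1_000, 10_000):
    sample = values[:n]
    mean = sum(sample) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in sample) / n)
    print(f"n={n:>6}  Hoeffding +/- {hoeffding_half_width(n, 0.0, 4.0):.3f}"
          f"   CLT +/- {clt_half_width(n, std):.3f}")

The conservative bound is valid from the first tuple but wide; the CLT interval tightens once a few dozen tuples have been seen, matching the guidance above.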