
CONTROL Overview


Page 1: CONTROL Overview

CONTROL Overview

CONTROL group: Joe Hellerstein, Ron Avnur, Christian Hidber, Bruce Lo, Chris Olston, Vijayshankar Raman, Tali Roth, Kirk Wylie (UC Berkeley); Peter Haas (IBM Almaden)

Page 2: CONTROL Overview

Context (wild assertions)

• Value from information
  – The pressing problem in CS (?) (!!)

• “Point” querying and data management is a solved problem
  – at least for traditional data (business data, documents)

• “Big picture” analysis still hard

Page 3: CONTROL Overview

Data Analysis c. 1998

• Complex: people using many tools
  – SQL aggregation (decision support systems, OLAP)
  – AI-style WYGIWIGY systems (e.g. data mining, IR)

• Both are black boxes
  – users must iterate to get what they want
  – batch processing (big picture = big wait)

• We are failing important users!
  – decision support is for decision-makers!
  – a black box is the world’s worst UI

Page 4: CONTROL Overview

Black Box Begone!

• Black boxes are bad
  – cannot be observed while running
  – cannot be controlled while running

• These tools can be very slow
  – exacerbates the previous problems

• Thesis:
  – there will always be slow, usually data-intensive, computer programs
  – the fundamental issue: looking into the box...

Page 5: CONTROL Overview

Crystal Balls

• Allow users to observe processing
  – as opposed to “lucite watches”

• Allow users to predict the future

• Ideally, allow users to change the future
  – online control of processing

• The CONTROL project:
  – online delivery, estimation, and control for data-intensive processes

Page 6: CONTROL Overview

Performance Regime for CONTROL

• Online performance:
  – maximize the 1st derivative of the “mirth index”

[Figure: fraction of useful answer vs. time; a CONTROL system delivers useful results almost immediately and refines toward 100%, while a traditional batch system delivers nothing until it completes]

Page 7: CONTROL Overview

Examples

• Online Aggregation
  – Informix Dynamic Server
    • enhanced by UC Berkeley students with CONTROL algorithms
    • lots of algorithmics, many fussy end-to-end system issues [Avnur, Hellerstein, Raman DMKD ’00]
  – IBM has an ongoing project to do this in DB2
  – IBM buys Informix (4/01)

• Online Visualization
  – visual enumeration & aggregation

• Interactive data cleaning & analysis
  – Potter’s Wheel ABC
  – online “enumeration” and discrepancy detection

Page 8: CONTROL Overview

Example: Online Aggregation

SELECT AVG(gpa) FROM students GROUP BY college
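
For intuition, a minimal sketch of how such a query can be answered online (not the Informix implementation; the 95% level, the synthetic data, and all names are illustrative): tuples arrive in random order, and per-group running averages plus CLT-style confidence intervals are reported as processing proceeds.

import math, random
from collections import defaultdict

Z_95 = 1.96  # normal quantile for a ~95% large-sample (CLT) interval

def online_avg(tuples, report_every=2000):
    """tuples: iterable of (group, value) pairs, assumed to arrive in random order."""
    n = defaultdict(int)      # tuples seen per group
    s = defaultdict(float)    # running sum per group
    ss = defaultdict(float)   # running sum of squares per group
    for i, (g, v) in enumerate(tuples, 1):
        n[g] += 1; s[g] += v; ss[g] += v * v
        if i % report_every == 0:
            for g2 in sorted(n):
                mean = s[g2] / n[g2]
                var = max(ss[g2] / n[g2] - mean * mean, 0.0)
                half = Z_95 * math.sqrt(var / n[g2])   # half-width shrinks like sigma/sqrt(n)
                print(f"after {i} tuples  {g2}: AVG(gpa) = {mean:.3f} +/- {half:.3f}")

# usage on synthetic (college, gpa) pairs
data = [(random.choice("ABCDE"), random.gauss(3.0, 0.5)) for _ in range(10_000)]
online_avg(data)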

Page 9: CONTROL Overview

Example: Online Data Visualization

• In Tioga DataSplash

Page 10: CONTROL Overview

Visual Transformation Shot

Page 11: CONTROL Overview
Page 12: CONTROL Overview

Scalable Spreadsheets

Page 13: CONTROL Overview

Decision-Support in DBMSs

• Aggregation queries
  – compute a set of qualifying records
  – partition the set into groups
  – compute aggregation functions on the groups
  – e.g.:

    Select college, AVG(grade)
    From ENROLL
    Group By college;

Page 14: CONTROL Overview

Interactive Decision Support?

• Precomputation
  – the typical “OLAP” approach (a.k.a. Data Cubes)
  – doesn’t scale, no ad hoc analysis
  – blindingly fast when it works

• Sampling
  – makes real people nervous?
  – no ad hoc precision
    • sample is chosen in advance
    • can’t vary statistical requirements
  – per-query granularity only

Page 15: CONTROL Overview

Online Aggregation

• Think “progressive” sampling
  – a la images in a web browser
  – good estimates quickly, improving over time

• Shift in performance goals
  – online mirth index

• Shift in the science
  – UI emphasis drives system design
  – leads to different data delivery, result estimation
  – motivates online control

Page 16: CONTROL Overview

Not everything can be CONTROLed

• “Needle in haystack” scenarios
  – the nemesis of any sampling approach
  – e.g. highly selective queries, MIN, MAX, MEDIAN

• Not useless, though
  – unlike presampling, users can get some info (e.g. max-so-far)

• We advocate a mixed approach
  – explore the big picture with online processing
  – when you drill down to the needles, or want full precision, go batch-style
  – can do both in parallel

Page 17: CONTROL Overview

New Techniques

• Online Reordering
  – gives control of group delivery rates
  – applicable outside the RDBMS setting

• Ripple Join family of join algorithms
  – comes in naïve, block & hash flavors

• Statistical estimators & confidence intervals
  – for single-table & multi-table queries
  – for AVG, SUM, COUNT, STDEV
  – leave it to Peter

• Visual estimators & analysis

Page 18: CONTROL Overview

Online Reordering

[Figure: query dataflows over relations R, S, and T with a reorder operator inserted into the pipeline]

• users perceive data being processed over time
  – prioritize processing for “interesting” tuples
  – interest based on user-specified preferences

• reorder the dataflow so that interesting tuples go first

• encapsulate reordering as a pipelined dataflow operator

Page 19: CONTROL Overview

Context: an application of reordering

• online aggregation
  – for SQL aggregate queries, give gradually improving estimates with confidence intervals
  – allow users to speed up estimate refinement for groups of interest
  – prioritize processing at a per-group granularity

SELECT AVG(gpa) FROM students GROUP BY college

Page 20: CONTROL Overview

Framework for Online Reordering

• want no delay in processing; in general, reordering can only be best-effort

• typically process/consume is slower than produce
  – exploit the throughput difference to reorder

• two aspects
  – mechanism for best-effort reordering
  – reordering policy

[Figure: produce → reorder → process → consume pipeline; reorder permutes the produced stream (e.g. acddbadb...) toward a preferred order (e.g. abcdabc...) according to a user-interest function f(t); process/consume may include a network transfer]

Page 21: CONTROL Overview

Juggle mechanism for reordering

• two threads: prefetch from the input; spool to / enrich from an auxiliary side disk

• juggle data between the buffer and the side disk
  – keep the buffer full of “interesting” items
  – getNext chooses the best item currently in the buffer

• getNext and enrich/spool decisions are based on the reordering policy

• side-disk management
  – hash index, populated in a way that postpones random I/O

[Figure: produce feeds a prefetch thread that fills the buffer; a spool/enrich thread moves items between the buffer and the side disk; process/consume pulls items via getNext]
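
A single-threaded toy sketch of the juggle idea (the real operator uses two threads and a hash-indexed side disk; the in-memory stand-in for the side disk, the UPi / ni^(3/2) policy borrowed from the later QOF slide, and all names are illustrative):

from collections import defaultdict, deque

class Juggle:
    """Toy best-effort reordering between a fast producer and a slower consumer."""
    def __init__(self, source, prefs, buffer_cap=100):
        self.source = iter(source)         # produce side: yields (group, value) pairs
        self.prefs = prefs                 # user preference UP_i per group
        self.buffer = defaultdict(deque)   # small in-memory buffer, bucketed by group
        self.side = defaultdict(deque)     # stand-in for the hash-indexed side disk
        self.in_buffer = 0
        self.cap = buffer_cap
        self.delivered = defaultdict(int)  # n_i: tuples delivered per group

    def _prefetch(self, batch=16):
        # pull a batch from the producer; whatever does not fit in the buffer is spooled
        for _ in range(batch):
            try:
                g, v = next(self.source)
            except StopIteration:
                return
            if self.in_buffer < self.cap:
                self.buffer[g].append(v)
                self.in_buffer += 1
            else:
                self.side[g].append(v)     # spool

    def get_next(self):
        self._prefetch()
        live = [g for g in set(self.buffer) | set(self.side)
                if self.buffer[g] or self.side[g]]
        if not live:
            return None
        # policy: deliver from the group with the largest UP_i / n_i^(3/2)
        g = max(live, key=lambda x: self.prefs.get(x, 1.0) / (self.delivered[x] + 1) ** 1.5)
        if not self.buffer[g]:
            self.buffer[g].append(self.side[g].popleft())   # enrich from the side disk
            self.in_buffer += 1
        v = self.buffer[g].popleft()
        self.in_buffer -= 1
        self.delivered[g] += 1
        return g, v

A consumer loop would call get_next() repeatedly and feed each (group, value) pair into running aggregates like those in the earlier online-AVG sketch.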

Page 22: CONTROL Overview

Reordering policies

• quality of feedback for a prefix t1 t2 … tk:
  QOF(UP(t1), UP(t2), …, UP(tk)), where UP = user preference
  – determined by the application

• goodness of reordering: dQOF/dt

• implication for the juggle mechanism
  – process gets the item from the buffer that increases QOF the most
  – juggle tries to maintain a buffer full of such items

GOAL: a “good” permutation of the items t1…tn

[Figure: QOF vs. time; a good reordering raises QOF faster]

Page 23: CONTROL Overview

QOF in Online Aggregation

• average weighted confidence interval

• preference acts as a weight on the confidence interval
  (Recall from the Central Limit Theorem that the sample mean’s confidence-interval half-width is proportional to σ/√n. Conservative (Hoeffding) confidence intervals also have a √n in the denominator. So…)

  QOF = Σi UPi / √ni ,  where ni = number of tuples processed from group i

  ⇒ process pulls items from the group with the largest UPi / (ni √ni)

  ⇒ desired ratio of group-i tuples in the buffer = UPi^(2/3) / Σj UPj^(2/3)
  – juggle tries to maintain this ratio by enrich/spool
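
A small worked instance of this policy (the preferences and counts below are made up for illustration):

# Hypothetical preferences and progress for three groups
UP = {"A": 1.0, "B": 1.0, "C": 4.0}   # user preferences
n  = {"A": 400, "B": 100, "C": 100}   # tuples processed so far per group

# process pulls next from the group with the largest UP_i / n_i^(3/2)
pull = max(UP, key=lambda g: UP[g] / n[g] ** 1.5)
print(pull)          # "C": high preference, relatively few tuples processed

# desired buffer composition: UP_i^(2/3), normalized over all groups
total = sum(u ** (2 / 3) for u in UP.values())
ratios = {g: round(u ** (2 / 3) / total, 3) for g, u in UP.items()}
print(ratios)        # C gets roughly 2.5x the buffer share of A or B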

Page 24: CONTROL Overview

Other QOF functions

• rate of processing for a group ∝ its preference
  – QOF = Σi (ni − n·UPi)²  (variance from the ideal proportions; n = total tuples processed)

  ⇒ process pulls items from the group with the largest (n·UPi − ni)

  ⇒ desired ratio of group-i tuples in the buffer = UPi
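
A matching worked instance of the proportional-rate policy (again with made-up numbers):

# Hypothetical preferences (summing to 1) and per-group progress
UP = {"A": 0.2, "B": 0.2, "C": 0.6}
n  = {"A": 300, "B": 250, "C": 450}
total = sum(n.values())              # n = 1000 tuples processed so far

# process pulls next from the group farthest behind its ideal share n * UP_i
pull = max(UP, key=lambda g: total * UP[g] - n[g])
print(pull)                          # "C": ideal share is 600, only 450 processed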

Page 25: CONTROL Overview

Results: Reordering in Online Aggregation

• implemented in the Informix UDO server
• experiments with modified TPC-D queries
• questions:
  – how much throughput difference is needed for reordering?
  – can the reorderer handle skewed data?

• one stress test: skew, very small processing cost
  – index-only join
  – 5 order priorities with a Zipf data distribution:

      A    B    C    D    E
      1    1/2  1/3  1/4  1/5

[Figure: query plan — scan of order → juggle → index lookup on lineitem → process/consume]

SELECT AVG(o_totalprice), o_orderpriority
FROM order
WHERE EXISTS (SELECT * FROM lineitem WHERE l_orderkey = o_orderkey)
GROUP BY o_orderpriority

Page 26: CONTROL Overview

Performance results

  preference     A    B    C    D    E
  initial        1    1    1    5    3
  after T1       1    1    3.5  0.5  1

• 3 times faster for interesting groups
• 2% completion-time overhead

[Figures: # tuples processed vs. time and confidence-interval width vs. time, per group (curves labeled A, C, E)]

Page 27: CONTROL Overview

Ripple Joins

• Good confidence intervals for joins of samples
  – vs. samples of joins!
  – requires a “Cross-Product CLT”

• Progressively refining join:
  – ever-larger rectangles in R × S
  – we can update confidence intervals at the “corners”
  – comes in loop, index and hash flavors

• Benefits:
  – sample from both relations simultaneously
  – “animation rate”:
    • a goal for the next “corner” determines an optimization problem based on observations so far
    • old-fashioned systems are one extreme
  – adaptively tune the “aspect ratio” for the next “corner”
    • sample faster from the higher-variance relation
  – intimate relationship between delivery and estimation

[Figure: a traditional join sweeps the R × S rectangle one row at a time; a ripple join grows ever-larger “corners” covering both R and S]

Haas & Hellerstein, SIGMOD 99
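
A toy sketch of the square (“naive”) ripple join with a running scale-up estimate of the join’s size (aspect-ratio tuning, the block/hash variants, and the confidence-interval machinery of the SIGMOD ’99 paper are omitted; the data and predicate are illustrative):

import random

def square_ripple_count(R, S, pred, report_every=100):
    """Estimate |R join S| under predicate pred, refining the estimate as we sample."""
    random.shuffle(R); random.shuffle(S)    # assume random-order access to both inputs
    seen_r, seen_s = [], []
    matches = 0
    for k in range(max(len(R), len(S))):
        if k < len(R):                      # new r tuple joins against every s seen so far
            r = R[k]
            matches += sum(pred(r, s) for s in seen_s)
            seen_r.append(r)
        if k < len(S):                      # new s tuple joins against every r seen (incl. the new one)
            s = S[k]
            matches += sum(pred(r2, s) for r2 in seen_r)
            seen_s.append(s)
        if (k + 1) % report_every == 0:
            # scale the count in the sampled rectangle up to the full cross product
            scale = (len(R) * len(S)) / (len(seen_r) * len(seen_s))
            print(f"corner {k + 1}: estimated join size ~ {matches * scale:,.0f}")

# usage: an equi-join on a shared key
R = [{"k": random.randint(0, 99)} for _ in range(2000)]
S = [{"k": random.randint(0, 99)} for _ in range(2000)]
square_ripple_count(R, S, lambda r, s: r["k"] == s["k"])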

Page 28: CONTROL Overview

Aspect Ratios

• Consider an extreme example:

• In general, to get to the next corner:
  – need a cost model parameterized by relation
    • different for block and hash variants
  – “benefit”: change in the confidence interval
  – an online linear optimization problem

• Arguments about estimates converging quickly, stabilizing…
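
A toy illustration of the cost/benefit trade-off (the surrogate “benefit” below, a CLT-style variance term, and all numbers are assumptions, not the paper’s estimator): given per-relation fetch costs and rough per-relation variance contributions, choose how much of each input to sample before the next corner within a fixed time budget.

# Hypothetical per-tuple fetch costs and variance contributions for R and S
cost = {"R": 1.0, "S": 1.0}
var  = {"R": 9.0, "S": 4.0}          # R is the higher-variance relation
n    = {"R": 100, "S": 100}          # tuples sampled so far
budget = 100.0                        # time allowed before the next corner

# Enumerate candidate splits of the budget and pick the one that most shrinks
# a surrogate interval term: var_R / n_R + var_S / n_S
best = None
for dR in range(0, int(budget // cost["R"]) + 1):
    dS = int((budget - dR * cost["R"]) // cost["S"])
    width = var["R"] / (n["R"] + dR) + var["S"] / (n["S"] + dS)
    if best is None or width < best[0]:
        best = (width, dR, dS)
print(best)   # ~ (0.083, 80, 20): sample the higher-variance relation R faster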

Page 29: CONTROL Overview

Fussy Implementation Details

• How to implement as an iterator? Issues:
  – need cursors on all inputs (as usual)
  – need to maintain aspect ratios
  – need to maintain the current “inner” & its cursor
    • i.e. the relation currently being scanned
  – need to know the current sampling step
    • to know how far to scan the current “inner”
  – need to know the “starter” for the next step
    • determines the length of the scan (see pic) and the end of the sampling step
    • and passes that role along at EOF

Page 30: CONTROL Overview

Ensuring Aspect Ratios

Page 31: CONTROL Overview

Ripple Join Performance

• Too lazy to fetch graphs, but…
  – typical orders-of-magnitude benefit vs. batch…

Page 32: CONTROL Overview

CONTROL Lessons

• Dream about UIs, work on systems
  – user needs drive systems design!

• Systems and statistics intertwine
  – “what unlike things must meet and mate”
    • Art, Herman Melville

• Sloppy, adaptive systems are a promising direction

Page 33: CONTROL Overview

Questions

• Where else do these lessons apply?
  – outside of data analysis and manipulation

• Systems people think a lot about interfaces (APIs)…
  – encapsulation, narrow interfaces…
  – in the CONTROL regime, how do you design these APIs and build systems?

• Ubiquitous computing:
  – is it about portable computing and point access/delivery?
  – or sensors/actuators, dataflow, big-picture queries?

Page 34: CONTROL Overview

More?

• CONTROL: http://control.cs.berkeley.edu
  – overview: IEEE Computer, 8/99

• Telegraph: http://db.cs.berkeley.edu/telegraph

Page 35: CONTROL Overview

Backup slides

• The following slides may be used to answer questions...

Page 36: CONTROL Overview

Sampling

• Much is known here
  – Olken’s thesis
  – the DB sampling literature
  – more recent work by Peter Haas

• Progressive random sampling
  – can use a randomized access method (watch out for duplicates!)
  – can maintain the file in random order
  – can verify statistically that values are independent of the stored order

Page 37: CONTROL Overview

Estimators & Confidence Intervals

• Conservative confidence intervals
  – extensions of Hoeffding’s inequality
  – appropriate early on; give wide intervals

• Large-sample confidence intervals
  – use the Central Limit Theorem
  – appropriate after “a while” (~dozens of tuples)
  – linear memory consumption
  – tight bounds

• Deterministic intervals
  – only useful in “the endgame”
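
A small numeric illustration of how the two interval families compare for a running AVG (the [0, 4] value bounds, the 95% level, and the uniform test data are assumptions; the CONTROL estimators are more refined):

import math, random

def hoeffding_half_width(n, lo, hi, delta=0.05):
    # conservative: holds for any distribution bounded in [lo, hi]
    return (hi - lo) * math.sqrt(math.log(2 / delta) / (2 * n))

def clt_half_width(n, sample_std, z=1.96):
    # large-sample: approximate 95% interval once n reaches a few dozen tuples
    return z * sample_std / math.sqrt(n)

random.seed(0)
values = [random.uniform(0.0, 4.0) for _ in range(10_000)]   # e.g. GPAs in [0, 4]
for n in (10, 100, 1_000, 10_000):
    sample = values[:n]
    mean = sum(sample) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in sample) / n)
    print(f"n={n:>6}  Hoeffding +/- {hoeffding_half_width(n, 0.0, 4.0):.3f}"
          f"   CLT +/- {clt_half_width(n, std):.3f}")

The conservative bound is valid from the first tuple but wide; the CLT interval tightens once a few dozen tuples have been seen, matching the guidance above.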