BUILDING A DATABASE SYSTEM FOR ORDER New England Database Seminars April 2002 Alberto Lerner – ENST Paris Dennis Shasha – NYU {lerner,shasha}@cs.nyu.edu

BUILDING A DATABASE SYSTEM FOR ORDER

New England Database Seminars April 2002

Alberto Lerner – ENST Paris

Dennis Shasha – NYU

{lerner,shasha}@cs.nyu.edu

NEDS April 2002 – Lerner and Shasha

Agenda

Motivation

SQL + Order

Transformations

Conclusion

Motivation – The need for ordered data

Some queries rely on order

Examples:

Moving averages

Top N

Rank

“SQL can handle it.” Can it really?

Motivation – Moving Averages: algorithmically linear

Sales(month, total)

SELECT t1.month+1 AS forecastMonth,
       (t1.total + t2.total + t3.total)/3 AS "3MonthMovingAverage"
FROM Sales AS t1, Sales AS t2, Sales AS t3
WHERE t1.month = t2.month - 1
  AND t1.month = t3.month - 2

Can optimizer make a 3-way (in general, n-way) join linear time?

Ref: Data Mining and Statistical Analysis Using SQL, Trueblood and Lovett, Apress, 2001

Motivation – Top N

Employee(Id, salary)

SELECT DISTINCT count(*), t1.salary
FROM Employee AS t1, Employee AS t2
WHERE t1.salary <= t2.salary
GROUP BY t1.salary
HAVING count(*) <= N

How many elements of cross-product have salaries at least as large as t1.salary? Will optimizer see essential sort-count trick?

Ref: SQL for Smarties, Joe Celko, Morgan Kaufmann, 1995

Motivation – Problems Extending SQL with Order

Queries are hard to read

Cost of execution is often non-linear (would not pass basic algorithms course)

Few operators preserve order, so optimization hard.

Agenda

Motivation

SQL + Order

Transformations

Conclusion

SQL + Order – Desirable Features

Express order-dependent predicates and clauses in a readable, clear way

Make optimization opportunities explicit (by getting rid of complex idioms, see above)

Execution in linear (or n log n) time when possible

SQL + Order – Three steps in solution

1. Give SQL a vector-oriented semantics – Database is a set of array-tables “arrables”; variables in the queries do not refer to a single tuple at a time anymore, but to a whole column vector

2. Provide new vector-to-vector functions – Supporting order-based manipulations of column vectors

3. Streaming: new data may need special treatment.

SQL + Order – Moving Averages

Sales(month, total)

SELECT month, avgs(8, total)
FROM Sales ASSUMING ORDER month

Execution (Sales is an arrable):

1. FROM clause – enforces the order in the ASSUMING clause
2. SELECT clause – for each month yields the moving average (window size 8) ending at that month. No 8-way join.

avgs: vector-to-vector function, order-dependent and size-preserving

ASSUMING ORDER month: order to be used on vector-to-vector functions
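To make the "no 8-way join" point concrete, here is a minimal Python sketch of a single-pass moving average. The Python function `avgs`, the sample Sales rows, and the choice to average over a shorter prefix while the window is still filling are assumptions made for illustration; the slides do not specify boundary behaviour.

```python
from collections import deque


def avgs(window, values):
    """Trailing moving average in one pass (linear in len(values))."""
    out = []
    running = 0.0
    recent = deque()  # the values currently inside the window
    for v in values:
        recent.append(v)
        running += v
        if len(recent) > window:
            running -= recent.popleft()  # slide the window forward
        # Assumption: positions before the window fills average what exists.
        out.append(running / len(recent))
    return out


# Hypothetical Sales(month, total) rows, deliberately out of order.
sales = [(3, 30.0), (1, 10.0), (2, 20.0), (4, 40.0)]

ordered = sorted(sales)                 # plays the role of ASSUMING ORDER month
months = [m for m, _ in ordered]
totals = [t for _, t in ordered]
result = list(zip(months, avgs(3, totals)))
print(result)  # [(1, 10.0), (2, 15.0), (3, 20.0), (4, 30.0)]
```

The output has one entry per input row, which is what size-preserving means here, and the cost is one sort plus one scan regardless of the window size.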

SQL + Order – Top N

Employee(ID, salary)

SELECT first(N, salary)
FROM Employee ASSUMING ORDER Salary

first: vector-to-vector function, order-dependent and non size-preserving

Execution:

1. FROM clause – orders arrable by Salary
2. SELECT clause – applies first() to the ‘salary’ vector, yielding first N values of that vector given the order. Could get the top earning IDs by saying first(N, ID).
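The two execution steps can be sketched in Python as a sort followed by a slice. The function name `first` mirrors the slide, but the sample rows are hypothetical, and sorting in descending order is an assumption made so that the first N values are the top earners; the slide does not state the sort direction.

```python
def first(n, vector):
    """Return the first n entries of an already ordered vector."""
    return vector[:n]


# Hypothetical Employee(ID, salary) rows.
employees = [(1, 50), (2, 90), (3, 70), (4, 60)]

# Step 1: order the arrable by salary (descending is an assumption).
ordered = sorted(employees, key=lambda e: e[1], reverse=True)
ids = [e[0] for e in ordered]
salaries = [e[1] for e in ordered]

# Step 2: apply first() to whichever column vector is wanted.
top_salaries = first(2, salaries)
top_ids = first(2, ids)
print(top_salaries, top_ids)  # [90, 70] [2, 3]
```

The work is one n log n sort and a slice, in contrast to the self-join formulation, and the result is shorter than the input, which is why first is not size-preserving.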

SQL + Order – Ranking

SalesReport(salesPerson, territory, total)

SELECT territory, salesPerson, total, rank(total)
FROM SalesReport
WHERE rank(total) < N

rank: vector-to-vector function, non order-dependent and size-preserving

Execution:

1. FROM clause – ASSUMING is NOT needed.
2. rank is applied to the ‘total’ vector and maps each position into an integer.

SQL + Order – Vector-to-Vector Functions

Order-dependent, size-preserving: prev, next, $, [], avgs(*), prds(*), sums(*), deltas(*), ratios(*), reverse

Order-dependent, non size-preserving: drop, first, last

Non order-dependent, size-preserving: rank, tile

Non order-dependent, non size-preserving: min, max, avg, count

SQL + Order – Complex queries: Best spread

In a given day, what would be the maximum difference between a buying and selling point of each security?

Ticks(ID, price, tradeDate, timestamp, …)

SELECT ID, max(price - mins(price))
FROM Ticks ASSUMING ORDER timestamp
WHERE tradeDate = '99/99/99'
GROUP BY ID

Execution:

1. For each security, compute the running minimum vector for price and then subtract from the price vector itself; result is a vector of spreads.
2. Note that max – min would overstate spread.

[Figure labels: max, min, running min, best spread]
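A small worked example shows why the running minimum matters. This is a Python sketch under stated assumptions: the functions `mins` and `best_spread`, the security names, and the tick values are all hypothetical, and the date filter is omitted.

```python
from collections import defaultdict
from itertools import accumulate


def mins(values):
    """Running minimum: entry i is the smallest value seen up to position i."""
    return list(accumulate(values, min))


def best_spread(prices):
    """Largest gain from buying at some tick and selling at a later one."""
    return max(p - m for p, m in zip(prices, mins(prices)))


# Hypothetical Ticks(ID, price, timestamp) rows for one trade date.
ticks = [
    ("IBM", 10, 1), ("HP", 20, 1), ("IBM", 12, 2), ("HP", 18, 2),
    ("IBM", 8, 3), ("HP", 19, 3), ("IBM", 9, 4), ("IBM", 15, 5),
    ("IBM", 5, 6),
]

# ASSUMING ORDER timestamp, then GROUP BY ID: one price vector per security.
prices_by_id = defaultdict(list)
for sec_id, price, _ in sorted(ticks, key=lambda t: t[2]):
    prices_by_id[sec_id].append(price)

spreads = {sec_id: best_spread(p) for sec_id, p in prices_by_id.items()}
print(spreads)  # {'IBM': 7, 'HP': 1}
```

For the IBM vector the plain max minus min is 15 - 5 = 10, but the low of 5 comes after the high of 15, so no trade can realise it; the running minimum yields the achievable 7 (buy at 8, sell at 15).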

SQL + Order – Complex queries: Crossing averages part I

When does the 21-day average cross the 5-month average?

Market(ID, closePrice, tradeDate, …)
TradedStocks(ID, Exchange, …)

INSERT INTO temp FROM
SELECT ID, tradeDate,
       avgs(21 days, closePrice) AS a21,
       avgs(5 months, closePrice) AS a5,
       prev(avgs(21 days, closePrice)) AS pa21,
       prev(avgs(5 months, closePrice)) AS pa5
FROM TradedStocks NATURAL JOIN Market
     ASSUMING ORDER tradeDate
GROUP BY ID

SQL + Order – Complex queries: Crossing averages part I

Execution:

1. FROM clause – order-preserving join
2. GROUP BY clause – groups are defined based on the value of the ID column
3. SELECT clause – functions are applied; non-grouped columns become vector fields so that target cardinality is met. Violates first normal form.

[Figure: before grouping, the ID column and a non-grouped column are two columns with the same cardinality; after grouping, each ID appears once and the non-grouped column becomes a vector field.]

SQL + Order – Complex queries: Crossing averages part II

Get the result from the resulting non first normal form relation temp

SELECT ID, tradeDate
FROM flatten(temp)
WHERE a21 > a5 AND pa21 <= pa5

Execution:

1. FROM clause – flatten transforms temp into a first normal form relation (for row r, every vector field in r MUST have the same cardinality). Could have been placed at end of previous query.
2. Standard query processing after that.
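The flatten step and the crossing test can be sketched in Python as follows. Everything here is an assumption for illustration: the dictionary layout of temp, the hand-written average values, and the convention that `prev` leaves an empty (None) first slot, which the filter then skips.

```python
def prev(vector):
    """Shift a vector one position back; the first slot has no predecessor."""
    return [None] + vector[:-1]


def flatten(nested_rows):
    """Unnest vector fields so each output row holds only scalar values."""
    flat_rows = []
    for row in nested_rows:
        vectors = {k: v for k, v in row.items() if isinstance(v, list)}
        scalars = {k: v for k, v in row.items() if not isinstance(v, list)}
        if not vectors:
            flat_rows.append(dict(row))  # already in first normal form
            continue
        lengths = {len(v) for v in vectors.values()}
        if len(lengths) > 1:
            raise ValueError("vector fields in a row must have the same cardinality")
        for i in range(lengths.pop()):
            flat = dict(scalars)
            for name, vector in vectors.items():
                flat[name] = vector[i]
            flat_rows.append(flat)
    return flat_rows


# Hypothetical contents of temp: one row per ID, other columns are vectors.
a21 = [9.0, 10.0, 11.0, 10.5]
a5 = [10.0, 10.0, 10.5, 10.6]
temp = [{
    "ID": "ACME",
    "tradeDate": [1, 2, 3, 4],
    "a21": a21, "a5": a5,
    "pa21": prev(a21), "pa5": prev(a5),
}]

# WHERE a21 > a5 AND pa21 <= pa5, applied to the flattened rows.
crossings = [
    (r["ID"], r["tradeDate"])
    for r in flatten(temp)
    if r["pa21"] is not None and r["a21"] > r["a5"] and r["pa21"] <= r["pa5"]
]
print(crossings)  # [('ACME', 3)]
```

On day 3 the short average (11.0) is above the long one (10.5) while the previous values were level (10.0 and 10.0), so that is the only crossing in this sample.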

SQL + Order – Related Work: Research

SEQUIN – Seshadri et al.
• Sequences are first-class objects
• Difficult to mix tables and sequences.

SRQL – Ramakrishnan et al.
• Elegant algebra and language
• No work on transformations.

SQL-TS – Sadri et al.
• Language for finding patterns in sequence
• But: Not everything is a pattern!

SQL + Order – Related Work: Products

RISQL – Red Brick
• Some vector-to-vector, order-dependent, size-preserving functions
• Low-hanging fruit approach to language design.

Analytic Functions – Oracle 9i
• Quite complete set of vector-to-vector functions
• But: Can only be used in the select clause; poor optimization (our preliminary study)

KSQL – Kx Systems
• Arrable extension to SQL but syntactically incompatible.
• No cost-based optimization.

Agenda

Motivation

SQL + Order

Transformations

Conclusion

Transformations – Early sorting + order preserving operators

SELECT ts.ID, ts.Exchange, avgs(10, hq.ClosePrice)
FROM TradedStocks AS ts NATURAL JOIN HistoricQuotes AS hq
     ASSUMING ORDER hq.TradeDate
GROUP BY Id

Candidate plans:

(1) Sort then join preserving order
(2) Preserve existing order
(3) Join then sort before grouping
(4) Join then sort after grouping

[Figure: the four plan trees, built from sort, join (op), g-by and avgs operators.]

Transformations – Early sorting + order preserving operators

[Chart: time in milliseconds versus number of traded securities (total of 581 securities and 127062 quotes). Series: sort before op join; existing order; sort after a reg join; sort after reg join and g-by.]

Transformations – UDFs evaluation order

Gene(geneId, seq)

SELECT t1.geneId, t2.geneId, dist(t1.seq, t2.seq)
FROM Gene AS t1, Gene AS t2
WHERE dist(t1.seq, t2.seq) < 5
  AND posA(t1.seq, t2.seq)

posA asks whether the two sequences have nucleotide A in the same position. dist gives the edit distance between two sequences.

[Figure: evaluation plans (1), (2) and (3), built from the posA and dist filters.]

Switch dynamically between (1) and (2) depending on the execution history.

Transformations – UDFs Evaluation Order

[Chart: time in milliseconds (axis from 1 to 1000000 in powers of ten) versus number of sequences (10, 100, 1000). Series: dist then pos; pos then dist.]

Transformations – Order preserving joins

select lineitem.orderid, avgs(10, lineitem.qty), lineitem.lineid
from order, lineitem assuming order lineid
where order.date > 45
  and order.date < 55
  and lineitem.orderid = order.orderid

• Basic strategy 1: restrict based on date. Create hash on order. Run through lineitem, performing the join and pulling out the qty.

• Basic strategy 2: Arrange for lineitem.orderid to be an index into order. Then restrict order based on date giving a bit vector. The bit vector, indexed by lineitem.orderid, gives the relevant lineitem rows. The relevant order rows are then fetched using the surviving lineitem.orderid.

Strategy 2 is often 3-10 times faster.
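Strategy 2 can be sketched in a few lines of Python. The sketch assumes that orderid is a dense, zero-based position into the order table, so plain list indexing stands in for the index; the function name `bitvector_join` and the sample rows are hypothetical.

```python
def bitvector_join(order_dates, lineitems, low, high):
    """Join lineitem to order while keeping lineitem's existing order."""
    # One pass over order: bit i says whether order i passes the date filter.
    keep = [low < d < high for d in order_dates]
    # One pass over lineitem in lineid order; orderid is used as an array index.
    survivors = [row for row in lineitems if keep[row[1]]]
    # Fetch the matching order rows only for the surviving line items.
    return [(lineid, orderid, qty, order_dates[orderid])
            for lineid, orderid, qty in survivors]


# Hypothetical data: order_dates[i] is the date of order i.
order_dates = [40, 50, 52, 60]
# (lineid, orderid, qty), already sorted by lineid.
lineitems = [(0, 1, 5), (1, 0, 7), (2, 2, 3), (3, 1, 4), (4, 3, 9)]

joined = bitvector_join(order_dates, lineitems, 45, 55)
print(joined)  # [(0, 1, 5, 50), (2, 2, 3, 52), (3, 1, 4, 50)]
```

No hash table is built and the lineitem rows come out in lineid order, so a function such as avgs can run on the qty column without a further sort.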

Transformations – Building Blocks

Order optimization
• Simmen et al. ’96 – push-down sorts over joins, and combining and avoiding sorts

Order preserving operators
• KSQL – joins on vector
• Claussen et al. ’00 – OP hash-based join

Push-down aggregating functions
• Chaudhuri and Shim ’94, Yan and Larson ’94 – evaluate aggregation before joins

UDF evaluation
• Hellerstein and Stonebraker ’93 – evaluate UDF according to its ((output/input) – 1)/cost per tuple
• Porto et al. ’00 – take correlation into account
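The ((output/input) – 1)/cost per tuple rule can be illustrated with a short Python sketch that sorts predicates by that value, lowest first, and compares the expected per-tuple cost of both orders. The selectivity and cost figures for posA and dist below are invented for the example, and the cost model assumes the predicates are independent, which is exactly the assumption that taking correlation into account would relax.

```python
def rank(selectivity, cost):
    """((output/input) - 1) / cost per tuple; more negative means apply earlier."""
    return (selectivity - 1.0) / cost


def order_predicates(predicates):
    """Sort (name, selectivity, cost) triples by ascending rank."""
    return sorted(predicates, key=lambda p: rank(p[1], p[2]))


def expected_cost(ordered):
    """Expected cost per input tuple, assuming independent predicates."""
    total = 0.0
    surviving = 1.0  # fraction of tuples that reach the next predicate
    for _, selectivity, cost in ordered:
        total += surviving * cost
        surviving *= selectivity
    return total


# Hypothetical statistics: dist is selective but expensive, posA is cheap.
predicates = [("dist", 0.1, 100.0), ("posA", 0.25, 1.0)]

plan = order_predicates(predicates)
print([name for name, _, _ in plan])  # ['posA', 'dist']
print(expected_cost(plan))            # 26.0
print(expected_cost(predicates))      # about 100.1 for dist first
```

With these made-up numbers the cheap filter goes first even though it is less selective, because it removes tuples at a far lower cost per tuple.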

Agenda

Motivation

SQL + Order

Transformations

Conclusion

Conclusion

Arrable-based approach to ordered databases may be scary – dependency on order, vector-to-vector functions – but it’s expressive and fast.

SQL extension that includes order is possible and reasonably simple.

Optimization possibilities are vast.