Join Synopses for Approximate Query Answering

Join Synopses for Approximate Query Answering

Swarup AchryaPhilip B. Gibbons

Viswanath Poosala Sridhar Ramaswamy

Presented by Bhushan Pachpande

Contents

1. Introduction

2. Need for approximate answers

3. Problem with joins

4. Join synopses

5. Allocation

6. Maintenance of join synopses

7. Experimental Evaluation

Introduction

This paper demonstrates difficulty of providing good approximate

answers to join queries proposes join synopses as the efficient solution for this

problem presents strategy for allocating available space for join

synopses provides efficient algorithm for maintaining join

synopses in presence of updates to the base relations

Why Approximate answers ? Reduce overhead for large DBs and improve response

time Reduce access to the base relation Example of Approximate answers

Initial queries in the data mining which are used to determine what the interesting queries are

Queries requesting numerical answers and full precision of exact answer not needed e.g. total, average

The research in this paper was conducted while developing efficient approximate query answering system, Aqua.

Aqua System

improve response time by avoiding frequent access to original data

maintains smaller sized statistical summaries, called synopses, on warehouse.

sits on the top of the DBMS.

There key components Statistic Collection Query Rewriting Maintenance

Collects all synopses, uses it to answer

queries posed by userparses sql input, rewrite queries for scaling certain operators to fit for synopses

Keep synopses up to date during updating of original data

Problem with Joins

Natural set of synopses for an approximate query includes uniform random samples of each base relations

Non-uniform result samples - For the join to be uniform random sample, probabilities of tuples in join samples must be equal

Small join result

Join of relations R & S on attribute X

aa

bb

a

b

R.X S.X

a2a1

b1

Probabilities of tuples a1 and a2 being selectedshould be same as prob. of tuples a1 and b1selected in join

uniform random sampling

prob. (a1,a2)=(1/r)*(1/r)*(1/r)=1/r3

prob. (a1,b1)=(1/r)*(1/r)*(1/r) *(1/r)=1/r4

To get uniform join samples is very difficult

Join Synopses

Naïve way - execute all possible join queries and collect samples Join synopses - samples are taken from small set of distinguished

joins Can obtain random samples of all possible joins in the schema This is scheme is for foreign key joins Modeled database schema as a

graph vertex - base relation directed edge (u to v) – if u has at

least one attribute which is foreign key in v

Join Synopses

Key result proved - There is 1-1 correspondence between a tuple in relation ‘r’ & a tuple in the output of any foreign key join involving ‘r’ & any of its descendents in the graph.

A sample Sr of a relation ‘r’ can be used to produce another relation J(Sr) called a join synopsis of ‘r’. ( provides random samples).

Join synopses of R is simply a sample of R where as for C it is the join of N, R and sample of C.

Join Synopses

For each node u in database schema G, corresponding to a relation r1, define J(u) to be the output of the maximum foreign key join r1xr2x..xrk with source r1.

Let Su be a uniform random sample of r1.

The join synopsis J(Su) is the output of Suxr2xr3…..xrk.

J(Su) is a uniform random sample of J(u) with |Su| tuples.

Thus we can extract from our synopsis a uniform random sample of the output of any k-way foreign key join.

From 1 join synopsis for a node whose foreign key join has k relations, we can extract a URS of the output of between k-1 & pow(2,k-1)-1 distinct foreign key joins.

Allocation

Allocate space among various join synopses when certain properties of query workload are known.

Identify heuristics for the common case when such properties are not known.

Let ‘S’ be a set of queries with selects, aggregates, group by’s & foreign key joins.

For each relation Ri, find fraction Fi of queries in S for which Ri is the source relation in a foreign key join.

It is known that the error bounds are inversely proportional to sqrt(n).(n- number of tuples in join sample).

Select join synopsis sizes so as to minimize the average relative error.

Allocation

The average relative error bound over the queries is proportional to sum(fi/sqrt(ni))

ni is selected so as to minimize the above equation for the total memory allocated for join synopses

For each relation Ri if si = size of single join synopses tuple then join synopses size is chosen so as

sum (nisi) <= Total memory allocated

In the absence of query work load information heuristic strategies can be used. EqJoin CubeJoin PropJoin

divides space equally amongst the relationsdivides space proportional to cube root of their join synopses tuple sizes

divides space proportional to their join synopses tuple sizes

Maintenance of Join Synopses Need to maintain the join synopses when base

relation is updated (insert or delete) does not require frequent access to base relation If a new tuple is inserted

Let Pu be the probability of newly arrived tuple for relation u in random sample Su

Let uxr2xr3x….xrk be the max foreign key join with source u.

We add ‘T’ (new tuple) to Su with probability Pu.

If ‘T’ is added to Su, we add to J(Su) the tuple Txr2xr3x….rk

Maintenance of Join Synopses

If T is added to Su and Su exceeds its target size, then select uniformly at random a tuple T’ to evict from Su and remove the tuple in J(Su) corresponding to T’.

On delete of a tuple T from u T is in Su delete the tuple from Su and remove the tuple from

J(Su) corresponding to T If sample becomes too small due to many deletions

repopulate by scanning relation u.

This algorithm performs lookups with the base relation with small probability Pu

Experimental Evaluation

Two classes of experiments Accuracy experiments Maintenance experiments

Accuracy Experiments Compares accuracy of techniques based on join synopses and

based on base samples parameters varied - query selectivity and total space allocated to

precomputed summaries (summary size/join synopses size)

Maintenance Experiments Study cost of keeping join synopses up to date in presence of

insertions/deletions to the underlying data.


Ran results on TPC-D decision support benchmark

Query used is an aggregate that is computed on join of Lineitem, Customer, Order, Supplier, Nation and Region.

The query used is

region parameter is set to ‘ASIA’ and selection predicate is on o_orderdate column to the range [1/1/94, 1/1/95]

Query selectivity is varied using these parameters


Accuracy Experiments

Experimental EvaluationMaintenance Experiments

tuples inserted in lineitem table

Conclusion

Provides uniform random sampling for joins in the database having foreign key joins.

Focus on computing approximate answers to aggregates computed on multi-way joins.

Join synopses can be maintained effectively during updates.

Documents

Join Synopses for Approximate Query Answering