Upload
basil-black
View
15
Download
0
Embed Size (px)
DESCRIPTION
Join Synopses for Approximate Query Answering. Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande. Contents. Introduction Need for approximate answers Problem with joins Join synopses Allocation Maintenance of join synopses - PowerPoint PPT Presentation
Citation preview
Join Synopses for Approximate Query Answering
Swarup AchryaPhilip B. Gibbons
Viswanath Poosala Sridhar Ramaswamy
Presented by Bhushan Pachpande
Contents
1. Introduction
2. Need for approximate answers
3. Problem with joins
4. Join synopses
5. Allocation
6. Maintenance of join synopses
7. Experimental Evaluation
Introduction
This paper demonstrates difficulty of providing good approximate
answers to join queries proposes join synopses as the efficient solution for this
problem presents strategy for allocating available space for join
synopses provides efficient algorithm for maintaining join
synopses in presence of updates to the base relations
Why Approximate answers ? Reduce overhead for large DBs and improve response
time Reduce access to the base relation Example of Approximate answers
Initial queries in the data mining which are used to determine what the interesting queries are
Queries requesting numerical answers and full precision of exact answer not needed e.g. total, average
The research in this paper was conducted while developing efficient approximate query answering system, Aqua.
Aqua System
improve response time by avoiding frequent access to original data
maintains smaller sized statistical summaries, called synopses, on warehouse.
sits on the top of the DBMS.
There key components Statistic Collection Query Rewriting Maintenance
Collects all synopses, uses it to answer
queries posed by userparses sql input, rewrite queries for scaling certain operators to fit for synopses
Keep synopses up to date during updating of original data
Problem with Joins
Natural set of synopses for an approximate query includes uniform random samples of each base relations
Non-uniform result samples - For the join to be uniform random sample, probabilities of tuples in join samples must be equal
Small join result
Join of relations R & S on attribute X
aa
bb
a
b
R.X S.X
a2a1
b1
Probabilities of tuples a1 and a2 being selectedshould be same as prob. of tuples a1 and b1selected in join
uniform random sampling
prob. (a1,a2)=(1/r)*(1/r)*(1/r)=1/r3
prob. (a1,b1)=(1/r)*(1/r)*(1/r) *(1/r)=1/r4
To get uniform join samples is very difficult
Join Synopses
Naïve way - execute all possible join queries and collect samples Join synopses - samples are taken from small set of distinguished
joins Can obtain random samples of all possible joins in the schema This is scheme is for foreign key joins Modeled database schema as a
graph vertex - base relation directed edge (u to v) – if u has at
least one attribute which is foreign key in v
Join Synopses
Key result proved - There is 1-1 correspondence between a tuple in relation ‘r’ & a tuple in the output of any foreign key join involving ‘r’ & any of its descendents in the graph.
A sample Sr of a relation ‘r’ can be used to produce another relation J(Sr) called a join synopsis of ‘r’. ( provides random samples).
Join synopses of R is simply a sample of R where as for C it is the join of N, R and sample of C.
Join Synopses
For each node u in database schema G, corresponding to a relation r1, define J(u) to be the output of the maximum foreign key join r1xr2x..xrk with source r1.
Let Su be a uniform random sample of r1.
The join synopsis J(Su) is the output of Suxr2xr3…..xrk.
J(Su) is a uniform random sample of J(u) with |Su| tuples.
Thus we can extract from our synopsis a uniform random sample of the output of any k-way foreign key join.
From 1 join synopsis for a node whose foreign key join has k relations, we can extract a URS of the output of between k-1 & pow(2,k-1)-1 distinct foreign key joins.
Allocation
Allocate space among various join synopses when certain properties of query workload are known.
Identify heuristics for the common case when such properties are not known.
Let ‘S’ be a set of queries with selects, aggregates, group by’s & foreign key joins.
For each relation Ri, find fraction Fi of queries in S for which Ri is the source relation in a foreign key join.
It is known that the error bounds are inversely proportional to sqrt(n).(n- number of tuples in join sample).
Select join synopsis sizes so as to minimize the average relative error.
Allocation
The average relative error bound over the queries is proportional to sum(fi/sqrt(ni))
ni is selected so as to minimize the above equation for the total memory allocated for join synopses
For each relation Ri if si = size of single join synopses tuple then join synopses size is chosen so as
sum (nisi) <= Total memory allocated
In the absence of query work load information heuristic strategies can be used. EqJoin CubeJoin PropJoin
divides space equally amongst the relationsdivides space proportional to cube root of their join synopses tuple sizes
divides space proportional to their join synopses tuple sizes
Maintenance of Join Synopses Need to maintain the join synopses when base
relation is updated (insert or delete) does not require frequent access to base relation If a new tuple is inserted
Let Pu be the probability of newly arrived tuple for relation u in random sample Su
Let uxr2xr3x….xrk be the max foreign key join with source u.
We add ‘T’ (new tuple) to Su with probability Pu.
If ‘T’ is added to Su, we add to J(Su) the tuple Txr2xr3x….rk
Maintenance of Join Synopses
If T is added to Su and Su exceeds its target size, then select uniformly at random a tuple T’ to evict from Su and remove the tuple in J(Su) corresponding to T’.
On delete of a tuple T from u T is in Su delete the tuple from Su and remove the tuple from
J(Su) corresponding to T If sample becomes too small due to many deletions
repopulate by scanning relation u.
This algorithm performs lookups with the base relation with small probability Pu
Experimental Evaluation
Two classes of experiments Accuracy experiments Maintenance experiments
Accuracy Experiments Compares accuracy of techniques based on join synopses and
based on base samples parameters varied - query selectivity and total space allocated to
precomputed summaries (summary size/join synopses size)
Maintenance Experiments Study cost of keeping join synopses up to date in presence of
insertions/deletions to the underlying data.
Experimental Evaluation
Ran results on TPC-D decision support benchmark
Query used is an aggregate that is computed on join of Lineitem, Customer, Order, Supplier, Nation and Region.
The query used is
region parameter is set to ‘ASIA’ and selection predicate is on o_orderdate column to the range [1/1/94, 1/1/95]
Query selectivity is varied using these parameters
Experimental Evaluation
Accuracy Experiments
Experimental EvaluationMaintenance Experiments
tuples inserted in lineitem table
Conclusion
Provides uniform random sampling for joins in the database having foreign key joins.
Focus on computing approximate answers to aggregates computed on multi-way joins.
Join synopses can be maintained effectively during updates.