Written By Surajit Chaudhuri, Gautam Das, Vivek Marasayya (Microsoft Research, Washington) Presented By Melissa J Fernandes

A ROBUST, OPTIMIZATION-BASED APPROACH FOR

APPROXIMATE ANSWERING OF AGGREGATE QUERIES

Written BySurajit Chaudhuri , Gautam Das , Vivek

Marasayya

(Microsoft Research, Washington)

Presented ByMelissa J Fernandes

Goal of the papers we have studied so far is to approximately answer aggregation queries

Accurately Efficiently

Main Idea of This Paper : Tailor the choice of samples to be robust for W similar not necessary identical to give workloads.

Two ways of Data mining for analyzing Large DB.

1)OLAP :- Online analytical processing is an approach to quickly answer multi-dimensional analytical queries using OLAP cubes.

OLAP cube :- is a data structure that allows fast analysis of data.Example Data of Product 1) By city2) By type3) By product type

Drawbacks:

1) Expensive2) Resource intensive

-wiki

2) Pre-computed samples

-This method gives approximate answers very efficiently.

-But may have larger errors, cause finding the samples with large variance is almost impossible.

3) Using Workloads What is a Workload ? Set of Transact-SQL statements that

execute against a DBor databases that have to be tuned. - msdn

Tuned :- Optimize performance of DB

How Sample set is generated from Workload?

- In practical world Queries fall under a particular pattern. Eg : Info of Texas state.

-The queries are run on the entire data base (R).

-A column is added to the records. This column holds the data that states if the record was selected by the queries. (tagging)

-If the record was selected by many queries its probability of being in the sample is higher.

-This way Workload is used to generate Sample set.

RIcicle

R(Q1)R(Q2)R(Q3)

ICICLES : A new class of samples that tune themselves to a dynamic Workload.

Outliers : Identify the tuples with Outlier values and store them in a separate relation.

1) Run query on T12) URSAMP on T23) Estimate the true result4) Using methods in paper

combine results.

Outlier table (T1)

No Outlier values Table (T2)

Drawbacks of Previous studies :-

1)They have intuitive appeal they lack rigorous problem formulation.

2)Do not deal with uncertainty in Workloads.

3)Ignore data variance in the data distribution of the aggregate column.

Architecture for Approximate Query Processing

Offline component for selecting samples from (R)

a) Rewrite the queries to use the sample to answer the query approximately.

b) Report the answers with estimate errors.

Offline component selects samples from R.

1) Each record has Scale Factor Column (or in different relation)

2) Value for the aggregate column of each record is scaled up by multiplying by scale factor and then aggregated.

Error Metrics :

y = correct ans y’ = approximate ans

Relative error :E(Q) = (|y-y’|)/|y|

Squared relative error : SE(Q) = (|y-y’|)2 / |y|2

Group By Query includes g groups

yi = correct ans for ith group

SE(Q) = (1/g) Σi (yi – yi’) 2 / yi 2

A group by query with g groups g select queries with 1/g weight each .

pw = Probability Distribution pw(Q) = probability of query Q is given

Mean Square error is

MSE(pw) = ΣQ pw(Q) * SE(Q)

Root mean squared error :

)()( WW pMSEpRMSE

Fundamental Regions

R = Relation W = Workspace n = no. of records in R

R is divided into min no. of regions R1,R2,…… Rr

Rj = each query in the W selects either all records in Rj or none.

Upper Bound on no. of regions is min (2|w| , n)

Q1 = select no. between 10 and 50

Q2= select values between 40 and 70

Fixed Workload

Problem Statement : (Fixed Samp)

Input : R, W, k

Output: A sample of k records such that MSE(W) is minimized.

Soultion : (3 steps)

1) Identify fundamental regions

2) Picking exactly one record from each imp fundamental region

3) Assign appropriate values to additional columns in the sample records.

Step 1 : Number of fundamental regions (r) is induced by the Workload W. Case 1 : (r <=k) Case 2 : (r >k)

____________________________

Case 1: (r<=k)

Step 2 : Pick one sample from each fundamental region(10,40,60,80)

Step 3: Column RegionCount and AggSum are updated Region Count = { 3,2,2,2,)AggSum = {60,90,130,170)

We can ans COUNT , SUM and AVG queries.

Case 2 :- (r>k)Step 2:1) Sort all regions by importance Importance = fj*nj2 fj = sum of weights of all the queries in W that select

the region j. nj = no. of records in the region j.

(fj) = measures the weights of the queries that are affected by Rj

(nj2) = measures the effect on the error by not including Rj.

2) Pick the top k

Step 3 Assign the values to additional coulmns (Regioncount , AggSum)Not same as pervious (k rec , 2k unknows ({RC1, … RCk} & {AS1, ,ASk} )MSE(W) = quadratic eq. on Differentiation we get Linear eq.

Disadvantage : If queries not identical unpredictable errors.

Lifting Workload

Q an Q’ => similar when records selected by them significantly overlap

R = Relation Q = Query

RQ = rec selected by the query

p{q} (R’) => denotes the probability of occurrence of any query that selects the set of records R’.

δ (½ ≤ δ ≤1) and γ (0 ≤ γ ≤ ½) Define the degree to which W influences Q’s distribution.

δ Probability that an incoming query will select this record

γ Probability that an incoming query will select this record

For any record inside RQ

For any record outside RQ

n1, n2, n3, and n4 are the counts of records in the regions. n2 or n4 large (large overlap), P{Q}(R’) is

high n1 or n3 large (small overlap), P{Q}(R’) is

low

δ = inside RQ &selected by R’

γ = outside RQ &selected by R’

δ => 1 γ => 0 RQ and R’ identical δ => 1 γ => ½ R’ super set of RQ

δ => ½ γ => 0 R’ subset of RQ

δ => ½ γ => ½ R’ is unrestricted

STRATIFIED SAMPLING

PROBLEM : SAMP

Input : R, pw , k (pw probability distribution fun specified by W)

Output : a sample of k records such that MSE(pw) is minimized

Solution STRAT for single-table selection queries with Aggregation

3 Steps :

1.Stratification (a) How many strata to partition R into ? (b) How many records in each strata?

2. Allocation Determine the number of samples required

across each strata

3.Sampling

Count Aggregation

1) Stratification : No of statum = No. of fundamental regions

(Lemma1)

2) Allocation : We want to minimize the error over queries in pw .k1, … kr are unknown variables such that Σkj = k.

From Equation on an earlier slide,

MSE(pW) can be expressed as a weighted sum of the MSE of each query in the workload:Lemma 2: MSE(pW) = Σi wi MSE(p{Q})

q

iQi RpwRpi

1}{W )'()'(

Lemma 3 : For a COUNT query Q in W,let ApproxMSE(p{Q}) =

Then

Expected squared error in estimating the count (RQ union Rj)

Expected squared error in estimation the count of (RQ union R/RQ)

Expected relative squared error in estimating count of RQ

Since we have an (approximate) formula for MSE(p{Q}), we can express MSE(pw) as a function of the kj’s variables.

Corollary 1 : MSE(pw) = Σj(αj / kj), where each αj is a function of n1,…,nr, δ, and γ.

αj captures the “importance” of a region; it is positively correlated with nj as well as the frequency of queries in the workload that access Rj.

Now we can minimize MSE(pw).

Lemma 4: Σj (αj / kj) is minimized subject to Σj kj = k

if kj = k * ( sqrt(αj) / Σi sqrt(αi) )

This provides a closed-form and computationally inexpensive solution to the allocation problem since αj depends only on δ, γ and the number of tuples in each fundamental region.

Solution For Sum Aggregate

1) Stratification -Can not use same stratification as in Count.

-Use Bucketing Technique Fundamental regions are with large variance are divided into finer regions with significantly lower internal variance.

Each finer region treated as a strata.

2) Allocation

-Like COUNT, we express an optimization problem with h*r unknowns k1,…, kh*r.

-Unlike COUNT, the specific values of the aggregate column in each region (as well as the variance of values in each region) influence MSE(p{Q}).

-Let yj(Yj) be the average (sum) of the aggregate column values of all records in region Rj.

-Since the variance = small, so approximate each value in the region to yj.

Thus to express MSE(p{Q}) as a function of the kj’s for a SUM query Q in W:

As with COUNT, MSE(pW) for SUM is functionally of the form Σj(αj / kj), and αj depends on the same parameters n1, …nh*r , δ, and γ (Corollary 1).

Pragmatic Issues

-Identifying Fundamental Regions

-Handling Large Number of Fundamental Regions

-Obtaining Integer Solutions

-Obtaining an Unbiased Estimator

Extensions for more general Workloads

Group Queries : - -Q partitions R into g groups -Lifting model : - Replace Q with g separate selection queries -Tagging step :- append <c,v> (c = column id used in Group by. V = value of Group by in record t.)

Join Query :- Star queries contains -One source relation and a set of dimension relations connected via foreign key joins.-Group by and selection on source and dimension relation-Aggregation over columns of the source relation.

-Approach 1 - identify samples only over source relation. ( source relation is large)-Approach 2 - Identify source relation and precompute its join with all dimension relations.

Experimental Results :

FIXED – solution for FIXEDSAMP, fixed workload, identical queries

STRAT – solution for SAMP, workloads with single-table selection queries with aggregation

PREVIOUS WORKUSAMP – uniform random sampling

WSAMP – weighted sampling

OTLIDX – outlier indexing combined with weighted sampling

CONG – Congressional sampling

Conclusion:

The solutions FIXED and STRAT handle the problems of data variance, heterogeneous mixes of queries, GROUP BY and foreign-key joins.

Documents

Written By Surajit Chaudhuri, Gautam Das, Vivek Marasayya (Microsoft Research, Washington) Presented By Melissa J Fernandes