A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

By: Surajit Chaudhuri, Gautam Das, Vivek Narasayya

Presented by: Sayed Muchallil, September 21st, 2010

CONTENTS

1. INTRODUCTION
2. ARCHITECTURE FOR APPROXIMATE QUERY PROCESSING
3. FIXED WORKLOAD
4. STRATIFIED SAMPLING
5. SOLUTION
6. SUMMARY

Pre-computed samples

Can give approximate answer very efficiently.

A workload is used to make sure that the errors are acceptable.

Previous Studies

Their solutions are difficult to evaluate theoretically.

They do not formally deal with uncertainty in the expected workload.

They ignore the variance in the data distribution.

Sample

Table R

Product ID   Revenue
1            10
2            10
3            10
4            1000

Only 50% of the records in R can be used as a sample.

Query: “SELECT SUM(Revenue) FROM R”

The exact answer is 1030.

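Sample (cont.)

Sample Table S1: two records with Revenue 10 each

Sample Table S2

Product ID   Revenue
1            10
4            1000

The answer for the query on sample table S1 is 40.

The answer for the query on sample table S2 is 2020.

How are these answers obtained? The sum over the sample is scaled up by the inverse of the sampling fraction (here 1 / 0.5 = 2). For S2: (10 + 1000) × 2 = 2020.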

Sample (cont.)

A large variance in the aggregate column can lead to large relative errors.

Relative error = |y - y’| / y, where y is the exact answer and y’ is the approximate answer.

Relative error for S1 = |1030 - 40| / 1030 ≈ 0.96

Relative error for S2 = |1030 - 2020| / 1030 ≈ 0.96
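The arithmetic on these two slides can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the contents of S1 are an assumption (two low-revenue records, which is what the answer of 40 implies for a 50% sample).

```python
# Table R from the slides: product ID -> revenue.
R = {1: 10, 2: 10, 3: 10, 4: 1000}

def estimate_sum(sample_revenues, sampling_fraction):
    """Scale the sample sum up by the inverse of the sampling fraction."""
    return sum(sample_revenues) / sampling_fraction

def relative_error(exact, approx):
    """Relative error |y - y'| / y."""
    return abs(exact - approx) / exact

exact = sum(R.values())          # exact answer of SELECT SUM(Revenue) FROM R
s1 = [10, 10]                    # assumed S1: two low-revenue records
s2 = [R[1], R[4]]                # S2: products 1 and 4
est1 = estimate_sum(s1, 0.5)
est2 = estimate_sum(s2, 0.5)
err1 = relative_error(exact, est1)
err2 = relative_error(exact, est2)
print(est1, est2, round(err1, 3), round(err2, 3))
```

Both samples miss the exact answer by the same absolute amount (990), one by underestimating and one by overestimating, which is why the two relative errors coincide.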

What’s New ?

The goal is to pick a sample that minimizes the error.

If the actual workload is identical to the given (fixed) workload, the error will be smaller.

The approach works for queries that are identical or similar to those in the given workload.

Sampling

• Two ways of selecting samples
– Randomized
– Deterministic

• A workload W is a set of pairs of queries and their weights.
– W = {<Q1, w1>, <Q2, w2>, …, <Qq, wq>}
– Σi wi = 1

Architecture for Approximate Query Processing

Architecture (cont.)

Offline Component: selects a sample of records from relation R.

Online Component: rewrites an incoming query to use the sample. What does “rewrites” mean?

Reports the answer with an error estimate.
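One way to picture the online rewriting step is with an in-memory SQLite database. This is only a sketch under assumptions: the sample table name S and the ScaleFactor column are hypothetical names chosen for illustration, and the sample rows are the S2 records from the earlier example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Base relation R from the slides.
conn.execute("CREATE TABLE R (ProductID INTEGER, Revenue REAL)")
conn.executemany("INSERT INTO R VALUES (?, ?)",
                 [(1, 10), (2, 10), (3, 10), (4, 1000)])

# Offline component: materialize a 50% sample. ScaleFactor is a
# hypothetical additional column holding 1 / sampling fraction.
conn.execute("CREATE TABLE S (ProductID INTEGER, Revenue REAL, ScaleFactor REAL)")
conn.executemany("INSERT INTO S VALUES (?, ?, ?)",
                 [(1, 10, 2.0), (4, 1000, 2.0)])

# Online component: the incoming query is rewritten to read the sample
# and to scale each sampled value.
original = "SELECT SUM(Revenue) FROM R"
rewritten = "SELECT SUM(Revenue * ScaleFactor) FROM S"

exact = conn.execute(original).fetchone()[0]
approx = conn.execute(rewritten).fetchone()[0]
print(exact, approx)
```

The rewritten query touches only the sample table, so its cost depends on the sample size rather than on the size of R.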

Architecture (cont.)

A new method for automatically lifting a given workload.

It is unrealistic to assume that the incoming queries will be identical to the given workload.

The key is the ability to compute a probability distribution pW over incoming queries.

Error Metrics

Relative Error: |y - y’| / y

Squared Error: SE(Q) = (|y - y’| / y)²

Squared Error for a GROUP BY query with g groups: SE(Q) = (1/g) Σi ((yi - yi’) / yi)²

Given a probability distribution of queries pW:

Mean squared error for the distribution:

MSE(pW) = ΣQ pW(Q) * SE(Q)

Root mean squared error:

RMSE(pW) = √MSE(pW)
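The error metrics above translate directly into code. The sketch below is illustrative: the function names and the two-query distribution are assumptions, not values from the paper.

```python
import math

def squared_error(exact, approx):
    """SE(Q) = (|y - y'| / y)^2 for a single-value aggregate."""
    return ((exact - approx) / exact) ** 2

def squared_error_group_by(exact_groups, approx_groups):
    """SE(Q) for a GROUP BY query: mean squared relative error over g groups."""
    g = len(exact_groups)
    return sum(squared_error(y, yp)
               for y, yp in zip(exact_groups, approx_groups)) / g

def mse(distribution):
    """distribution: list of (probability, SE) pairs whose probabilities sum to 1."""
    return sum(p * se for p, se in distribution)

def rmse(distribution):
    return math.sqrt(mse(distribution))

# Hypothetical distribution over two queries with probabilities 0.75 and 0.25.
dist = [(0.75, squared_error(100, 90)), (0.25, squared_error(200, 220))]
print(mse(dist), rmse(dist))
```

In this made-up case both queries are off by 10%, so the MSE is about 0.01 and the RMSE about 0.1 regardless of the weights.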

Fixed Workload

Special case: the incoming queries are identical to the queries in the given workload.

Problem: FIXEDSAMP
Input: R, W, k
Output: A sample of k records (with appropriate additional columns) such that MSE(W) is minimized.

Fundamental Regions

Relation R contains 9 records

W consists of 2 queries:
Q1 = select records with C values between 10 and 50
Q2 = select records with C values between 40 and 70

These queries divide Relation R into 4 fundamental regions.

Fundamental Regions (cont.)

• Fundamental regions are obtained by partitioning the records in R into a minimum number of regions R1, R2, …, Rr such that, for any region Rj, each query in W selects either all records in Rj or none.

• The total number of fundamental regions r is at most min(2^|W|, n).
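The definition suggests a simple way to compute fundamental regions: group records by the set of workload queries that select them. The sketch below uses the two range queries from the example; the nine C values are hypothetical, since the slide does not list them.

```python
from collections import defaultdict

def fundamental_regions(records, queries):
    """Group records by their 'signature': which queries select them.

    Records with the same signature are selected by exactly the same
    queries, so each group is one fundamental region.
    """
    regions = defaultdict(list)
    for rec in records:
        signature = tuple(q(rec) for q in queries)
        regions[signature].append(rec)
    return dict(regions)

# Hypothetical C values for the 9 records of R.
C = [5, 15, 25, 35, 45, 55, 65, 75, 85]
q1 = lambda c: 10 <= c <= 50   # Q1: C between 10 and 50
q2 = lambda c: 40 <= c <= 70   # Q2: C between 40 and 70

regions = fundamental_regions(C, [q1, q2])
for signature, members in regions.items():
    print(signature, members)
```

With two queries there are at most 2^2 = 4 signatures, which matches the four regions on the slide: selected by Q1 only, by Q2 only, by both, and by neither.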

FIXEDSAMP Solution

Step 1. Identify the fundamental regions in R. Two cases arise: r <= k and r > k.

Step 2. Pick sample records.

Step 3. Assign values to additional columns.
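For the case r <= k, the three steps can be sketched as follows. This is an illustration under assumptions rather than the paper's exact procedure: one representative record is kept per fundamental region, and the additional columns (here named region_count and region_sum, both hypothetical names) store per-region totals so that workload queries can be answered from the sample alone.

```python
def build_fixed_sample(records, queries, value):
    """Steps 1-3 for r <= k: one representative per region plus totals."""
    regions = {}
    for rec in records:
        signature = tuple(q(rec) for q in queries)          # Step 1
        row = regions.setdefault(
            signature, {"rep": rec, "region_count": 0, "region_sum": 0}
        )                                                   # Step 2
        row["region_count"] += 1                            # Step 3
        row["region_sum"] += value(rec)
    return list(regions.values())

def answer_sum(sample, query):
    """A workload query selects a whole region or none of it, so testing
    the representative decides for the entire region."""
    return sum(row["region_sum"] for row in sample if query(row["rep"]))

def answer_count(sample, query):
    return sum(row["region_count"] for row in sample if query(row["rep"]))

# Hypothetical records: (C value, revenue).
records = [(5, 1), (15, 2), (25, 3), (35, 4), (45, 5),
           (55, 6), (65, 7), (75, 8), (85, 9)]
q1 = lambda rec: 10 <= rec[0] <= 50
q2 = lambda rec: 40 <= rec[0] <= 70

sample = build_fixed_sample(records, [q1, q2], value=lambda rec: rec[1])
print(answer_sum(sample, q1), answer_count(sample, q1))
print(answer_sum(sample, q2), answer_count(sample, q2))
```

In this setting the workload queries are answered without error, but a query that cuts through a region would not be, which is the limitation that motivates lifting the workload.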

LIFTING WORKLOAD TO QUERY DISTRIBUTION

When an incoming query Q’ is not identical to a workload query, pW(Q’) should be high if Q’ is similar to queries in the workload, and low if it is not.

Q’ and Q are similar if the sets of records they select overlap significantly.

LIFTED WORKLOAD

p{Q}(R’) is the probability of occurrence of any query that selects exactly the set of records R’.

Let RQ be the set of records selected by Q. For any given record inside (resp. outside) RQ, the parameter δ (resp. γ) represents the probability that an incoming query will select this record.

LIFTED WORKLOAD (Cont.)

δ → 1 and γ → 0: implies that incoming queries are identical to workload queries.

δ → 1 and γ → ½: implies that incoming queries are supersets of workload queries.

δ → ½ and γ → 0: implies that incoming queries are subsets of workload queries.

δ → ½ and γ → ½: implies that incoming queries are unrestricted.
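The role of δ and γ can be made concrete by computing p{Q}(R’) for a small relation. The sketch assumes that records are selected independently of each other, so the probability of a record set is a product of per-record terms; the four-record relation and the parameter values are hypothetical.

```python
from itertools import combinations

def lifted_probability(selected, r_q, all_records, delta, gamma):
    """Probability that an incoming query selects exactly `selected`,
    assuming each record is selected independently: with probability
    delta if it lies inside RQ, and gamma if it lies outside."""
    p = 1.0
    for rec in all_records:
        p_select = delta if rec in r_q else gamma
        p *= p_select if rec in selected else 1.0 - p_select
    return p

R = [1, 2, 3, 4]     # hypothetical relation
RQ = {3, 4}          # records selected by the workload query Q

def all_subsets(items):
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)

# Probability of seeing exactly the workload query's record set.
p_same = lifted_probability(RQ, RQ, R, delta=0.9, gamma=0.1)
# The probabilities over all possible record sets sum to 1.
total = sum(lifted_probability(s, RQ, R, 0.9, 0.1) for s in all_subsets(R))
print(p_same, total)
```

Setting delta=1 and gamma=0 puts all the probability on RQ itself, which corresponds to the first case above (incoming queries identical to workload queries).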

RATIONALE FOR STRATIFIED SAMPLING

A population is partitioned into multiple strata, and samples are selected uniformly from each stratum.

STRATIFIED SAMPLING

A stratified sampling scheme partitions R into r strata containing n1, …, nr records (where Σj nj = n), with k1, …, kr records uniformly sampled from each stratum (where Σj kj = k).

Q1 = SELECT COUNT(*) FROM R WHERE ProductID IN(3,4);

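POPQ is the population of query Q. For this COUNT query, each record contributes 1 if it is selected and 0 otherwise.

POPQ1 = {0, 0, 1, 1}, which has non-zero variance.

It can be divided into two strata, {0, 0} and {1, 1}, each with zero variance.

Table R

Product ID   Revenue
1            10
2            10
3            10
4            1000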
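The benefit of stratifying POPQ1 can be checked by enumerating every possible sample. The sketch below is an illustration with hypothetical function names: it compares uniform sampling of 2 records from the whole population with stratified sampling of 1 record from each of the strata {0, 0} and {1, 1}.

```python
from itertools import combinations, product

POP_Q1 = [0, 0, 1, 1]   # 1 if the record is selected by Q1, else 0

def uniform_estimates(pop, k):
    """COUNT estimate for every possible uniform sample of size k."""
    n = len(pop)
    return [n * sum(s) / k for s in combinations(pop, k)]

def stratified_estimates(strata, ks):
    """COUNT estimate for every combination of per-stratum samples.
    Each stratum contributes (stratum size) * (sample mean)."""
    per_stratum = [
        [len(stratum) * sum(s) / kj for s in combinations(stratum, kj)]
        for stratum, kj in zip(strata, ks)
    ]
    return [sum(parts) for parts in product(*per_stratum)]

uni = uniform_estimates(POP_Q1, 2)
strat = stratified_estimates([[0, 0], [1, 1]], [1, 1])
print(sorted(uni))
print(sorted(strat))
```

Uniform samples yield estimates between 0 and 4 around the true count of 2, while every stratified sample yields exactly 2, because each stratum has zero variance.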

SOLUTION FOR SINGLE-TABLE SELECTION QUERIES WITH AGGREGATION

Stratification: how many strata, and how many records in each stratum?

Allocation: determines how to divide k among the strata.

Sampling: forms the final sample of k records.

SOLUTION FOR COUNT AGGREGATE

Stratification (Lemma 1): since r is not known, divide R into fundamental regions and treat them as strata.

Allocation (Lemma 2): MSE(pW) = Σi wi MSE(p{Qi}), i.e. MSE(pW) can be expressed as a weighted sum of the MSE of each query in the workload.

SOLUTION FOR COUNT AGGREGATE (Cont.)

For any Q ∈ W, we express MSE(p{Q}) as a function of the kj’s.

Lemma 3: ApproxMSE(p{Q}) =

Then,


Since we have an (approximate) formula for MSE(p{Q}), we can express MSE(pW) as a function of the variables kj.

Corollary 1: MSE(pW) = Σj (αj / kj), where each αj is a function of n1, …, nr, δ, and γ.

αj captures the “importance” of a region; it is positively correlated with nj as well as with the frequency of queries in the workload that access Rj.

Now we can minimize MSE(pW).


Lemma 4: Σj (αj / kj) is minimized subject to Σj kj = k if kj = k * ( sqrt(αj) / Σi sqrt(αi) ).

This provides a closed-form and computationally inexpensive solution to the allocation problem, since αj depends only on δ, γ and the number of tuples in each fundamental region.
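The closed form in Lemma 4 is easy to evaluate and to sanity-check numerically. In the sketch below the αj values are hypothetical (in the method they would be derived from n1, …, nr, δ and γ), and the allocation is left fractional.

```python
import math

def allocate(alphas, k):
    """Lemma 4: kj = k * sqrt(alpha_j) / sum_i sqrt(alpha_i)."""
    roots = [math.sqrt(a) for a in alphas]
    total = sum(roots)
    return [k * r / total for r in roots]

def objective(alphas, ks):
    """The quantity being minimized: sum_j alpha_j / kj."""
    return sum(a / kj for a, kj in zip(alphas, ks))

alphas = [1.0, 4.0, 9.0]   # hypothetical importance values
k = 60

ks = allocate(alphas, k)
best = objective(alphas, ks)
even = objective(alphas, [k / len(alphas)] * len(alphas))
print(ks, best, even)
```

Here the allocation is proportional to 1 : 2 : 3 (the square roots of the αj), and its objective value is lower than that of an even split of the same budget.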

SOLUTION FOR SUM AGGREGATE

Stratification: bucketing technique.

Divide fundamental regions with large variance into a set of finer regions.

Treat each resulting region as a stratum.

Allocation: yj (resp. Yj) is the average (resp. sum) of the aggregate column values of all records in region Rj.
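To illustrate why bucketing helps, the sketch below splits one high-variance region into finer regions by value range. Equal-width bucketing is an assumption made for this example; the slides do not specify the bucketing rule, and the function name is hypothetical.

```python
from statistics import pvariance

def bucket_region(values, num_buckets):
    """Split a region's aggregate values into equal-width buckets
    over [min, max]; empty buckets are dropped."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [list(values)]
    width = (hi - lo) / num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        # Clamp so that the maximum value falls in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        buckets[idx].append(v)
    return [b for b in buckets if b]

region = [10, 10, 10, 1000]          # Revenue values of table R
finer = bucket_region(region, 2)

print(pvariance(region))
print([(b, pvariance(b)) for b in finer])
```

The original region has a large variance because of the single value 1000, while each finer region has zero variance, so every value in it is well approximated by the region average.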

SOLUTION FOR SUM AGGREGATE (Cont.)

Each value in region Rj can be approximated as yj, since the variance within a region is small after bucketing.

An approximate formula for MSE(p{Q}) for a SUM query Q in W

Pragmatic Issues

Identifying Fundamental Regions

Handling Large Number of Fundamental Regions

Obtaining Integer Solution

Obtaining unbiased error
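For the integer-solution issue, one possible approach is to round the fractional kj values so that they still sum to k. The sketch below uses largest-remainder rounding; this scheme is an assumption chosen for illustration and is not necessarily the one used in the paper.

```python
import math

def round_allocation(ks, k):
    """Round fractional allocations to integers that sum to k.

    Every kj is rounded down first; the leftover records are then
    given to the strata with the largest fractional parts.
    Assumes the fractional kj values sum to k.
    """
    floors = [math.floor(x) for x in ks]
    leftover = k - sum(floors)
    by_remainder = sorted(range(len(ks)),
                          key=lambda j: ks[j] - floors[j],
                          reverse=True)
    for j in by_remainder[:leftover]:
        floors[j] += 1
    return floors

rounded = round_allocation([2.5, 3.3, 4.2], 10)
print(rounded)
```

A practical implementation would also need to decide what to do when a stratum is rounded down to zero records, which this sketch does not handle.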

STRAT ALGORITHM

IMPLEMENTATION AND EXPERIMENTAL RESULT

This experiment compares the STRAT method with other methods:
USAMP – uniform random sampling
WSAMP – weighted sampling
OTLIDX – outlier indexing combined with weighted sampling
CONG – Congressional sampling

[Result charts: COUNT AGGREGATE, SUM AGGREGATE, COUNT AGGREGATE]

THANK YOU
