A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries
By: Surajit Chaudhuri, Gautam Das, Vivek Narasayya
Presented by: Sayed Muchallil, September 21st, 2010



CONTENTS

1. INTRODUCTION
2. ARCHITECTURE FOR APPROXIMATE QUERY PROCESSING
3. FIXED WORKLOAD
4. STRATIFIED SAMPLING
5. SOLUTION
6. SUMMARY


Pre-computed samples

Can give approximate answers very efficiently.

The workload is used to make sure that the errors are acceptable.


Previous Studies

Their solutions are difficult to evaluate theoretically.

They do not formally deal with uncertainty in the expected workload.

They ignore the variance in the data distribution.


Sample

Table R

Product ID    Revenue
1             10
2             10
3             10
4             1000

Only 50% of the records in R can be used as the sample.

Query: “SELECT SUM(Revenue) FROM R”

The exact answer is 1030.


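Sample (cont.)

Sample Table S1 and Sample Table S2 each hold 2 of the 4 records of R. The rows that survive from the slide sum to 1010, which matches S2:

Product ID    Revenue
1             10
4             1000

The answer to the query on sample table S1 is 40.

The answer to the query on sample table S2 is 2020.

How are these answers obtained? The sum over the sample is scaled up by 4/2 = 2. S2 gives (10 + 1000) × 2 = 2020. S1 must hold two records with revenue 10, giving (10 + 10) × 2 = 40.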


Sample (cont.)

A large variance in the aggregate column can lead to large relative errors.

Relative error = |y - y’| / y, where y is the exact answer and y’ is the approximate answer.

Relative error for S1 = |1030 – 40| / 1030 ≈ 0.96

Relative error for S2 = |1030 – 2020| / 1030 ≈ 0.96
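The arithmetic behind these two answers can be checked with a short sketch. The sample contents below are assumptions chosen to match the answers of 40 and 2020 quoted on the slides, and the helper names `estimate_sum` and `relative_error` are illustrative only.

```python
# Table R from the slides: ProductID -> Revenue.
R = {1: 10, 2: 10, 3: 10, 4: 1000}

def estimate_sum(sample_revenues, n):
    """Scale the sample sum by n / k, where k is the sample size."""
    k = len(sample_revenues)
    return sum(sample_revenues) * n / k

def relative_error(exact, approx):
    """|y - y'| / y for exact answer y and approximate answer y'."""
    return abs(exact - approx) / exact

exact = sum(R.values())                 # 1030

# Assumed sample contents, chosen to match the answers on the slides.
s1 = [10, 10]      # two of the revenue-10 records
s2 = [10, 1000]    # one revenue-10 record plus the outlier

est1 = estimate_sum(s1, len(R))         # 40.0
est2 = estimate_sum(s2, len(R))         # 2020.0
print(est1, relative_error(exact, est1))
print(est2, relative_error(exact, est2))
```

Both samples miss the exact answer by 990, so both have a relative error of about 96%: one underestimates because it misses the outlier, the other overestimates because it scales the outlier up.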


What’s New ?

The goal is to pick a sample that minimizes the error.

If the actual workload is identical to the given (fixed) workload, the error will be smaller.

The approach works for queries that are identical or similar to those in the given workload.


Sampling

• Two ways of selecting samples
– Randomized
– Deterministic

• A workload W is a set of pairs of queries and their weights.
– W = {<Q1, w1>, <Q2, w2>, …, <Qq, wq>}
– Σi wi = 1.


Architecture for Approximate Query Processing


Architecture (cont.)

Offline Component: selects a sample of records from relation R.

Online Component: rewrites an incoming query to use the sample. What does “rewrites” mean? The query is changed to run against the sample instead of R.

It then reports the answer with an estimate of the error.
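One way to picture the rewriting step is the sketch below, which uses Python's built-in `sqlite3` module. The sample table name `S`, the `ScaleFactor` column, and the exact form of the rewritten query are assumptions made for illustration; the slides only say that the sample carries "appropriate additional columns".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical sample table S: two records of R plus a ScaleFactor column.
cur.execute("CREATE TABLE S (ProductID INTEGER, Revenue REAL, ScaleFactor REAL)")
cur.executemany(
    "INSERT INTO S VALUES (?, ?, ?)",
    [(1, 10.0, 2.0), (4, 1000.0, 2.0)],
)

# The query as issued against the full relation R.
original = "SELECT SUM(Revenue) FROM R"
# Assumed rewritten form: read from S and weight each record by its scale factor.
rewritten = "SELECT SUM(Revenue * ScaleFactor) FROM S"

(approx,) = cur.execute(rewritten).fetchone()
print(approx)  # 2020.0
conn.close()
```

With a scale factor of 2 on each of the two sampled records, the rewritten query returns 2020, the same approximate answer quoted for sample S2 earlier in the deck.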


Architecture (cont.)

A new method for automatically “lifting” a given workload into a probability distribution over incoming queries.

It is unrealistic to assume that the incoming queries will be identical to those in the given workload.

The key: the ability to compute a probability distribution pW.


Error Metrics

Relative error: |y - y’| / y

Squared error: SE(Q) = (|y - y’| / y)²

Squared error for a GROUP BY query with g groups:

SE(Q) = (1/g) Σi ((yi – yi’) / yi)²

Given a probability distribution of queries pW:

Mean squared error for the distribution:

MSE(pW) = ΣQ pW(Q) * SE(Q)

Root mean squared error:

RMSE(pW) = √MSE(pW)
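The metrics above translate directly into a few small functions. This is a minimal sketch: the function names and the two-query distribution at the bottom are hypothetical, and a distribution is represented simply as a list of (probability, squared error) pairs.

```python
import math

def squared_error(y, y_approx):
    """SE(Q) for a single-answer query: the squared relative error."""
    return ((y - y_approx) / y) ** 2

def squared_error_group_by(ys, ys_approx):
    """SE(Q) for a GROUP BY query: mean squared relative error over g groups."""
    g = len(ys)
    return sum(((y - ya) / y) ** 2 for y, ya in zip(ys, ys_approx)) / g

def mse(dist):
    """dist is a list of (probability, SE) pairs whose probabilities sum to 1."""
    return sum(p * se for p, se in dist)

def rmse(dist):
    return math.sqrt(mse(dist))

# Hypothetical distribution: two equally likely queries with the errors
# of samples S1 and S2 from the earlier example.
dist = [(0.5, squared_error(1030, 40)), (0.5, squared_error(1030, 2020))]
print(rmse(dist))
```

Because SE is already a squared relative error, RMSE is on the same scale as the relative error itself, which makes it easy to read as a percentage.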


Fixed Workload

A special case:

The incoming queries are assumed to be identical to the queries in the given workload.

Problem: FIXEDSAMP
Input: R, W, k
Output: A sample of k records (with appropriate additional columns) such that MSE(W) is minimized.


Fundamental Regions

Relation R contains 9 records

W consists of 2 queries:
Q1 = select records with C values between 10 and 50
Q2 = select records with C values between 40 and 70

These queries divide Relation R into 4 fundamental regions.


Fundamental Regions (cont.)


Fundamental Regions (cont.)

• Fundamental regions are obtained by partitioning the records in R into a minimum number of regions R1, R2, …, Rr such that for any region Rj, each query in W selects either all records in Rj or none.

• Total number of fundamental regions: r ≤ min(2^|W|, n)
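The definition above suggests a direct way to compute the regions: two records fall in the same region exactly when the same set of queries selects them. The sketch below groups records by that "signature". The nine C values are hypothetical (the slides do not list them), and the range bounds are assumed to be inclusive.

```python
from collections import defaultdict

# Hypothetical C values for the 9 records of R (not given on the slides).
records = [5, 15, 25, 35, 42, 48, 55, 65, 90]

# Workload predicates from the slides: Q1 selects 10..50, Q2 selects 40..70.
queries = [
    lambda c: 10 <= c <= 50,
    lambda c: 40 <= c <= 70,
]

def fundamental_regions(records, queries):
    """Group records by the tuple of booleans saying which queries select them."""
    regions = defaultdict(list)
    for rec in records:
        signature = tuple(q(rec) for q in queries)
        regions[signature].append(rec)
    return dict(regions)

regions = fundamental_regions(records, queries)
for signature, members in regions.items():
    print(signature, members)
```

With two queries there are at most 2^2 = 4 signatures, which is where the bound min(2^|W|, n) comes from. Here all four occur: Q1 only, both, Q2 only, and neither.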


FIXEDSAMP Solution

Step 1. Identify the fundamental regions in R. Two cases arise: r <= k and r > k.

Step 2. Pick the sample records.

Step 3. Assign values to the additional columns.


LIFTING WORKLOAD TO QUERY DISTRIBUTION

Even when an incoming query Q’ is not identical to a workload query, pW(Q’) is high if Q’ is similar to queries in the workload, and low if not.

Q’ and Q are similar if the sets of records they select overlap significantly.


LIFTED WORKLOAD

p{Q}(R’) is the probability of occurrence of any query that selects exactly the set of records R’.

Let RQ be the set of records selected by workload query Q. For any given record inside (resp. outside) RQ, the parameter δ (resp. γ) represents the probability that an incoming query will select this record.
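Under the description above, p{Q}(R’) can be computed as a product of per-record probabilities. The sketch below assumes that records are selected independently of one another, which is how the δ and γ parameters are read here; the function name and the record ids are hypothetical.

```python
from itertools import combinations

def lifted_probability(R, RQ, R_prime, delta, gamma):
    """p{Q}(R'): probability that an incoming query selects exactly R'.

    Assumes each record is selected independently: with probability delta
    if it lies inside RQ, and with probability gamma if it lies outside.
    """
    p = 1.0
    for rec in R:
        p_select = delta if rec in RQ else gamma
        p *= p_select if rec in R_prime else 1.0 - p_select
    return p

R = {1, 2, 3, 4}    # hypothetical record ids
RQ = {3, 4}         # records selected by workload query Q

print(lifted_probability(R, RQ, {3, 4}, delta=0.9, gamma=0.1))

# Sanity check: the probabilities over all subsets R' sum to 1.
subsets = [set(c) for r in range(len(R) + 1) for c in combinations(sorted(R), r)]
total = sum(lifted_probability(R, RQ, s, 0.9, 0.1) for s in subsets)
print(total)
```

With δ close to 1 and γ close to 0, almost all of the probability mass sits on R’ = RQ, so queries that overlap heavily with Q are likely and unrelated queries are not.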


LIFTED WORKLOAD (Cont.)


LIFTED WORKLOAD (Cont.)

δ → 1 and γ → 0: implies that incoming queries are identical to workload queries.

δ → 1 and γ → ½: implies that incoming queries are supersets of workload queries.

δ → ½ and γ → 0: implies that incoming queries are subsets of workload queries.

δ → ½ and γ → ½: implies that incoming queries are unrestricted.


RATIONALE FOR STRATIFIED SAMPLING

A population is partitioned into multiple strata, and samples are selected uniformly from each stratum.


STRATIFIED SAMPLING

A stratified sampling scheme partitions R into r strata containing n1, …, nr records (where Σnj = n), with k1, …, kr records uniformly sampled from each stratum (where Σkj = k).

Table R

Product ID    Revenue
1             10
2             10
3             10
4             1000

Q1 = SELECT COUNT(*) FROM R WHERE ProductID IN (3, 4);

POPQ is the population of query Q: for a COUNT query, one value per record of R, 1 if the query selects the record and 0 otherwise.

POPQ1 = {0, 0, 1, 1}, which has non-zero variance.

It can be divided into two strata, {0, 0} and {1, 1}, each with zero variance.
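The benefit of the two zero-variance strata can be seen in a small sketch. It estimates the answer to Q1 by sampling one record from each stratum and scaling each stratum's sample mean by the stratum size. The function name and the fixed random seed are illustrative assumptions.

```python
import random

# Table R from the slides and the COUNT query Q1 (ProductID IN (3, 4)).
product_ids = [1, 2, 3, 4]
population = [1 if pid in (3, 4) else 0 for pid in product_ids]   # POPQ1

def stratified_count(strata, k_per_stratum, rng):
    """Estimate the population sum from uniform samples of each stratum."""
    estimate = 0.0
    for stratum, k_j in zip(strata, k_per_stratum):
        sample = rng.sample(stratum, k_j)
        estimate += len(stratum) * sum(sample) / k_j   # n_j * sample mean
    return estimate

rng = random.Random(0)
strata = [[0, 0], [1, 1]]     # the two zero-variance strata
estimate = stratified_count(strata, [1, 1], rng)
print(estimate)               # 2.0
```

Whichever record is drawn from each stratum, the estimate is exactly 2, the true count. A uniform sample of two records from the unstratified population could instead draw {0, 0} or {1, 1} and report 0 or 4.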


SOLUTION FOR SINGLE-TABLE SELECTION QUERIES WITH AGGREGATION

Stratification: how many strata to use, and how many records belong to each stratum.

Allocation: determines how to divide k among the strata.

Sampling: forms the final sample of k records.


SOLUTION FOR COUNT AGGREGATE

Stratification (Lemma 1): r is not known in advance, so divide R into fundamental regions and treat them as strata.

Allocation (Lemma 2): MSE(pW) = Σi wi MSE(p{Qi}). That is, MSE(pW) can be expressed as a weighted sum of the MSE of each query in the workload.


SOLUTION FOR COUNT AGGREGATE (Cont.)

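For any Q ∈ W, we express MSE(p{Q}) as a function of the kj’s.

Lemma 3: ApproxMSE(p{Q}) = [formula shown as an image on the original slide]

Then, [statement shown as an image on the original slide]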
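The formula itself did not survive on this slide. What follows is a reconstruction, not a copy of the slide: it comes from the model described earlier (each record inside RQ is selected with probability δ, each record outside with probability γ) and from the usual variance of a stratified COUNT estimate. It should be checked against the original paper before being relied on.

```latex
\mathrm{ApproxMSE}(p_{\{Q\}}) =
\frac{\displaystyle
      \sum_{R_j \subseteq R_Q} \frac{n_j^{2}}{k_j}\,\delta(1-\delta)
      \;+\;
      \sum_{R_j \subseteq R \setminus R_Q} \frac{n_j^{2}}{k_j}\,\gamma(1-\gamma)}
     {\displaystyle
      \Bigl(\sum_{R_j \subseteq R_Q} \delta\, n_j
      \;+\;
      \sum_{R_j \subseteq R \setminus R_Q} \gamma\, n_j\Bigr)^{2}}
```

The numerator is the approximate variance of the estimated count, summed stratum by stratum, and the denominator is the square of the expected count, so the ratio is a squared relative error. Each numerator term has the form (constant) / kj, which is consistent with writing MSE(pW) as Σj (αj / kj).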


SOLUTION FOR COUNT AGGREGATE (Cont.)

Since we have an (approximate) formula for MSE(p{Q}), we can express MSE(pW) as a function of the kj variables.

Corollary 1: MSE(pW) = Σj (αj / kj), where each αj is a function of n1, …, nr, δ, and γ.

αj captures the “importance” of a region; it is positively correlated with nj as well as with the frequency of queries in the workload that access Rj.

Now we can minimize MSE(pW).


SOLUTION FOR COUNT AGGREGATE (Cont.)

Lemma 4: Σj (αj / kj) is minimized subject to Σj kj = k if kj = k * ( sqrt(αj) / Σi sqrt(αi) ).

This provides a closed-form and computationally inexpensive solution to the allocation problem, since αj depends only on δ, γ, and the number of tuples in each fundamental region.
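The closed form in Lemma 4 is short enough to sketch directly. The αj values below are hypothetical, and the helper names are illustrative; the sketch returns fractional kj, so turning them into whole record counts is a separate step.

```python
import math

def allocate(alphas, k):
    """Lemma 4: k_j = k * sqrt(alpha_j) / sum_i sqrt(alpha_i)."""
    roots = [math.sqrt(a) for a in alphas]
    total = sum(roots)
    return [k * r / total for r in roots]

def objective(alphas, ks):
    """The quantity being minimized: sum_j alpha_j / k_j."""
    return sum(a / kj for a, kj in zip(alphas, ks))

alphas = [1.0, 4.0, 9.0]     # hypothetical importance values
k = 12
ks = allocate(alphas, k)     # [2.0, 4.0, 6.0]
print(ks, objective(alphas, ks))

# Moving sample budget between two strata cannot lower the objective.
shifted = [ks[0] + 0.5, ks[1] - 0.5, ks[2]]
print(objective(alphas, shifted))
```

Each stratum receives a share of k proportional to sqrt(αj), so a region that is four times as "important" gets twice the samples rather than four times as many.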


SOLUTION FOR SUM AGGREGATE

Stratification: bucketing technique.

Divide fundamental regions with large variance into a set of finer regions.

Treat each resulting region as a stratum.

Allocation: yj (Yj) is the average (sum) of the aggregate column values of all records in region Rj.


SOLUTION FOR SUM AGGREGATE (Cont.)

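Since bucketing keeps the variance within each region small, each value in the region can be approximated as yj.

This gives an approximate formula for MSE(p{Q}) for a SUM query Q in W. [formula shown as an image on the original slide]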


Pragmatic Issues

Identifying Fundamental Regions

Handling Large Number of Fundamental Regions

Obtaining Integer Solutions

Obtaining an Unbiased Estimator


STRAT ALGORITHM


IMPLEMENTATION AND EXPERIMENTAL RESULTS

This experiment compares the STRAT method to other methods:
USAMP – uniform random sampling
WSAMP – weighted sampling
OTLIDX – outlier indexing combined with weighted sampling
CONG – congressional sampling


COUNT AGGREGATE


SUM AGGREGATE


COUNT AGGREGATE


THANK YOU