

Error-bounded Sampling for Analytics on Big Sparse Data

Ying Yan, Microsoft Research, [email protected]
Liang Jeff Chen, Microsoft Research, [email protected]
Zheng Zhang, Microsoft Research, [email protected]

ABSTRACT
Aggregation queries are at the core of business intelligence and data analytics. In the big data era, many scalable shared-nothing systems have been developed to process aggregation queries over massive amounts of data. Microsoft's SCOPE is a well-known instance in this category. Nevertheless, aggregation queries are still expensive, because query processing needs to consume the entire data set, which is often hundreds of terabytes. Data sampling is a technique that samples a small portion of data to process and returns an approximate result with an error bound, thereby reducing the query's execution time. While similar problems have been studied in the database literature, we encountered new challenges that disable most prior efforts: (1) error bounds are dictated by end users and cannot be compromised, and (2) data is sparse, meaning the data has a limited population but a wide value range. For such cases, conventional uniform sampling often yields high sampling rates and thus delivers limited or no performance gains. In this paper, we propose error-bounded stratified sampling to reduce sample size. The technique relies on the insight that the sampling rate can only be reduced with knowledge of the data distribution. The technique has been implemented in Microsoft's internal search query platform. Results show that the proposed approach can reduce the sample size by up to 99% compared with uniform sampling, and its performance is robust against data volume and other key performance metrics.

1. INTRODUCTION
With the rapid growth in data volume, velocity and variety, efficient analysis of massive amounts of data is attracting more and more attention in industry. In an Internet company, extracting and analyzing click streams or search logs helps the company's mission-critical businesses improve service quality, find future revenue growth opportunities, monitor trends and detect root causes of live-site events in a timely fashion. At the core of these analytics tasks are decision support queries that aggregate massive amounts of data. Nowadays,

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 40th International Conference on Very Large Data Bases, September 1st - 5th 2014, Hangzhou, China.
Proceedings of the VLDB Endowment, Vol. 7, No. 13
Copyright 2014 VLDB Endowment 2150-8097/14/08.

aggregation queries over big data are often processed in a shared-nothing distributed system that scales to thousands of machines. MapReduce and Hadoop are two well-known software frameworks. Microsoft's SCOPE is another instance in this category [4, 22]. It combines MapReduce's explicit parallel executions and SQL's declarative query constructs. Today inside Microsoft, SCOPE jobs run on tens of thousands of machines every day.

While a shared-nothing distributed system provides massive parallelism to improve performance, aggregation queries are still expensive to process. The fundamental cause is that query processing by default consumes the entire data set, which is often hundreds or thousands of terabytes. Our investigation in one cluster reveals that 90% of 2,000 data mining jobs are aggregation queries. These queries consume two thousand machine hours on average, and some of them take up to 10 hours. They exhaust computation resources and block other time-sensitive jobs.

One technique to cope with the problem is sampling: queries are evaluated against a small randomly-sampled subset of the data, returning an approximate result with an error bound. An error bound is an interval in which the real value falls with high probability (a.k.a. confidence). Such an approximate result with an error bound is often good enough for analysis tasks. For instance, a study on a product's popularity along the time dimension may not care about the exact sales number at a particular time, but only the trend. An analysis identifying the most promising sales region may only be interested in relative positions. A smaller amount of data to be processed improves not only query response time, but also the overall system throughput.

Data sampling for aggregation queries has been extensively studied in the database literature as approximate query processing [1, 3, 6, 10, 16]. The theoretical foundation lies in statistical sampling [11, 15]: given a uniform sample from a finite or infinite population, what is the estimation's error bound? Existing techniques utilize this theory under different system settings. For example, online aggregation [1] assumes that data is first shuffled and then processed in a pipelined execution, updating estimations and error bounds continuously. Query execution can be terminated whenever the error bound reaches a satisfactory level. Alternatively, a number of techniques use different sampling schemes to build data sketches offline and use them to answer queries at runtime.

When we tried to apply these known techniques in our production environment, however, we encountered a few critical challenges, summarized below:


• Tight error bounds. Aggregation queries of data mining jobs in our environment are not standalone queries. Rather, they are often data providers for later consumers in the analytics pipeline. As such, the adoption of sampling is predicated on tight error bounds that must not be compromised.

• Sparseness of data. From the perspective of sampling, the sparseness of data depends on both the population and the value range. A medium or large population may still be deemed sparse by the theory of statistical sampling, if the values are distributed over a very wide range. In such cases, uniform sampling may require a sample size as large as the raw data size in order to satisfy the error bound requirement. This brings no or negative performance gains, due to sampling overhead. The phenomenon of sparse data is not uncommon in the production environment. Indeed, the 80/20 rule applies in many domains in practice. Figure 1 shows a typical data distribution over which an average needs to be computed; the pattern is long-tail. Uniform sampling with a 20% error bound and 95% confidence needs to consume 99.91% of the data.

The unique combination of error bound requirements and sparse data invalidates most existing techniques. Conventional offline sampling schemes do not provide commitments to error bounds, but only promise to minimize them [1, 16, 3]. Other techniques, such as online aggregation [10] and bootstrapping [12], can provide an error bound guarantee, but they do not target sparse data intentionally. As a result, their executions often cannot stop early and have to process almost all data before the error bound requirement is satisfied.

Intuitively, to deal with the above problems, error-bounded sampling must understand the data distribution, which is the central theme of this paper. This translates into a more expensive way of generating samples. Such cost can be absorbed if samples can be shared among queries. Fortunately, this is the case in our production environment, as shown in Table 1. Among the 2,000 jobs we analyzed, 3 inputs (0.1%) are shared by more than 600 jobs and 37 inputs (1.2%) are shared by over 100 jobs. These jobs often share many commonalities in targeted attributes, if they are not completely the same. This means we can afford higher expenses for building samples of these inputs for the targeted attributes. Moreover, the cost of error-bounded sampling can be further amortized if we can incrementally maintain samples upon updates. As we will discuss later, this is most effective when data distributions change slowly over time.

Table 1: Inputs shared by multiple jobs

No. of Inputs | No. of Jobs Sharing
3             | ≥ 600
37            | ≥ 100
680           | ≥ 10
1700          | ≥ 2
580           | 0

1.1 Our Contribution
In this paper, we propose a new sampling scheme to address the aforementioned challenges. The technique is a variant of stratified sampling [13] and relies on the insight that we

Figure 1: Sparseness of one representative production data set. (A long-tail distribution: population on the y-axis, value on the x-axis; min = 0.1, max = 20777, AVG = 101.)

can only reduce sampling rates and secure error bounds for sparse data if we have prior knowledge of its distribution. Considering the distribution in Figure 1, we can partition the data into two buckets, one covering the head and the other covering the rest. The small value range of the first bucket leads to a small sampling rate. As most of the data fall into the first bucket, the overall saving is substantial, even though (in the extreme case) all data in the second bucket must be included.

Bucket partitioning with bucket-dependent sampling rates is called a sampling scheme. Applying a sampling scheme to raw data generates a sample sketch. The total saving is achieved when we rewrite eligible queries to run against the sketch instead of the original data. Intuitively, deriving the best sampling scheme is a data-dependent optimization problem: given a complete view of the data and a user-specified error bound, find the sampling scheme such that the total sampled data is minimized. Unlike existing work that sets a space constraint, our formalization focuses on minimizing sample size while securing error bounds. Indeed, most application users we have encountered are interested in result quality and response time, rather than space constraints.

The first contribution of this paper is identifying new requirements and challenges of data sampling for aggregation queries in the production environment. We present an algorithm for finding an optimal scheme of error-bounded sampling with complexity O(N^4) (N being the number of data points), and a more practical heuristic algorithm with complexity O(N). The heuristic algorithm has an appealing property: its performance is no worse than uniform sampling.

The most straightforward application of our technique is to build a sample sketch before queries are issued. As such, once data is updated, the sketch must be rebuilt before it can answer new queries. This is a nuisance, as data is continuously updated in the production environment. Our second contribution is a set of mechanisms that incrementally maintain sample sketches when recurring queries have temporal overlaps of inputs between successive invocations, e.g., landmark- or sliding-windowed queries. Incremental maintenance enables us to save computation resources for building sample sketches, without sacrificing error bound guarantees.

Our final contribution is a number of experiments we have run against real-world production jobs in SCOPE, with and without our technique. These experiments have shown promising results: the savings are substantial and across the board compared to uniform sampling. For some production queries, our approach can save up to 99% of the sample size, a thousand-fold reduction. Moreover, our technique is robust against changes in data volume and other performance metrics. The lesson we learned here is also general: optimizations of big data computing demand a tighter coupling with the data itself.

The remainder of the paper is organized as follows. Section 2 introduces preliminaries on data sampling for aggregation queries, and describes when and why existing techniques fail. Section 3 formalizes the problem and presents an algorithm that delivers the optimal solution, as well as a more practical heuristic algorithm. Error-bounded sketch updates are introduced in Section 4. Experimental results are reported in Section 5. Section 6 discusses related work. Section 7 concludes the paper.

2. PRELIMINARIES
We start by introducing sampling principles for aggregation queries. We limit ourselves to queries with a single AVG aggregation in this and the next section, for ease of exposition of sampling principles and our new sampling schemes. We will extend the technique to a wider range of queries in Section 4.1.

An aggregation query Q specifies a set of GROUP-BY attributes and an aggregation attribute A over a source, either a structured table or an unstructured text file (e.g., a log file). Let G = {g_1, ..., g_m} be the set of non-empty groups, and X^g = \{x^g_1, \ldots, x^g_{N_g}\} be A's values in group g. The AVG value for group g, denoted by \bar{X}_g, is computed by aggregating the N_g values in g.

A sample over group g is a subset of n_g (≤ N_g) values. The estimated AVG value for group g based on this sample is:

\hat{X}_g = \frac{x^g_1 + \ldots + x^g_{n_g}}{n_g}

and the estimation's error is:

\varepsilon_g = |\hat{X}_g - \bar{X}_g|

A sample is a probabilistic event. Two independent samples may yield different estimations and errors. We say that a sampling scheme for group g satisfies the error bound ε_0 > 0 under confidence δ ∈ [0, 1], iff among M independent samples produced by the scheme, at least M · δ estimations satisfy ε_g ≤ ε_0.

A sampling scheme for query Q is a union of sampling schemes for all groups G, so that all groups' estimations are close to their real values. It is not uncommon that users may only be interested in groups above/below a certain population. For instance, revenue analysis may focus on either top or bottom contributors, depending on application scenarios. In such cases, we may discard uninteresting groups when designing a sampling scheme.

A common sampling scheme for a group of values is uniform sampling: each value in group g is randomly kept or discarded with equal probability. A simple implementation is to generate a random number for each value to determine whether or not the value is kept. If we want to sample a fixed number of values from g, the reservoir sampling algorithm [18] can be used.
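As a concrete illustration of the fixed-size alternative, here is a minimal Python sketch of reservoir sampling (Algorithm R); the function name and interface are our own, not from the paper:

```python
import random

def reservoir_sample(stream, k, rng=random):
    # Keep the first k items; afterwards replace a random slot with
    # decreasing probability so that, after i items have been seen,
    # every item has been retained with equal probability k/i.
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randint(0, i)  # uniform over [0, i], inclusive
            if j < k:
                sample[j] = x
    return sample
```

A single pass suffices even when the stream length is unknown in advance, which is why the algorithm suits log-style inputs.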

Given a confidence δ and sample size n_g, the Hoeffding equation [11] from statistical sampling defines the estimation's error bound:

\varepsilon_g = H(n_g, \delta) = (b_g - a_g)\sqrt{\frac{1}{2 n_g}\ln\frac{2}{1-\delta}}    (1)

where a_g and b_g are lower and upper bounds of the values in group g. By this equation, given a user-specified error bound ε_0, we may compute how much data needs to be sampled so that the estimation's error bound is at most ε_0:

n_g = \min\left\{\left\lceil \frac{(b_g - a_g)^2 \ln\frac{2}{1-\delta}}{2\varepsilon_0^2} \right\rceil,\; N_g\right\}    (2)

The Hoeffding equation originally targets sampling from an infinite population. Alternatively, it can be compensated by a factor for a finite population [15], yielding the sample size:

n_g = \frac{1}{\dfrac{2\varepsilon_0^2}{(b_g - a_g)^2 \ln\frac{2}{1-\delta}} + \dfrac{1}{N_g}}    (3)

When the error bound ε_0 is large or b_g − a_g is small, Equation 2 is equivalent to Equation 3. In other cases, Equation 3 provides a smaller sample size. Both equations provide theoretical guarantees on the error bound under the given confidence. We will only discuss Equation 1 (or Equation 2) in the following, for simplicity. All the techniques and analysis presented later also apply by replacing Equation 2 with Equation 3.
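To make the guarantee concrete, here is a small simulation sketch (our own, with made-up data) that checks Equation 1's coverage: over repeated uniform samples, the empirical AVG error should fall within the Hoeffding bound in at least a δ fraction of trials:

```python
import math
import random

def hoeffding_bound(value_range, n, delta):
    # Equation 1: error bound for a uniform sample of size n
    # drawn from values spanning value_range = b - a.
    return value_range * math.sqrt(math.log(2 / (1 - delta)) / (2 * n))

random.seed(0)
data = [random.uniform(0, 100) for _ in range(10_000)]
true_avg = sum(data) / len(data)
bound = hoeffding_bound(100, 200, 0.95)

trials = 500
hits = sum(
    abs(sum(s) / len(s) - true_avg) <= bound
    for s in (random.sample(data, 200) for _ in range(trials))
)
coverage = hits / trials  # should be at least 0.95
```

In practice the empirical coverage is well above δ, reflecting how conservative the Hoeffding bound is; this slack is exactly what stratification later exploits.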

An error-bounded uniform sampling scheme based on Equation 2 is straightforward. Its problem, however, is that when the values in group g spread over a wide range, the minimal data required to satisfy the error bound is close to the original data. Specifically, n_g is proportional to (b_g − a_g)^2 in Equation 2. When b_g − a_g is large and N_g is not large enough, n_g/N_g will be close to 1. It means that, in order to satisfy the error bound, sampling can hardly get rid of any data; query execution over the sample would produce limited performance gains, or even an overall penalty due to sampling overhead. In this paper, we refer to data over a wide range (i.e., large b_g − a_g) but with a limited population (i.e., small N_g) as sparse. Even when data volume increases, the sparseness problem persists, especially when the number of GROUP-BY attributes increases.
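The arithmetic behind this sparseness effect is easy to reproduce. The sketch below (function names are ours) evaluates Equation 2 for a narrow-range group and for a wide-range group like the one in Figure 1; Equation 3's finite-population variant is included for comparison and is never larger:

```python
import math

def n_uniform(value_range, N, eps, delta):
    # Equation 2: Hoeffding sample size, capped at the population N.
    raw = value_range ** 2 * math.log(2 / (1 - delta)) / (2 * eps ** 2)
    return min(math.ceil(raw), N)

def n_finite(value_range, N, eps, delta):
    # Equation 3: finite-population variant; approaches Equation 2 as N grows.
    raw = value_range ** 2 * math.log(2 / (1 - delta)) / (2 * eps ** 2)
    return math.ceil(1 / (1 / raw + 1 / N))

# Dense group: narrow range, so a tiny sample satisfies the bound.
dense = n_uniform(value_range=10, N=1_000_000, eps=1.0, delta=0.95)
# Sparse group: a wide range like Figure 1's forces n_g close to N_g.
sparse = n_uniform(value_range=20777, N=1_000_000, eps=20.0, delta=0.95)
```

With ε_0 = 20 (roughly 20% of Figure 1's AVG of 101), the wide-range group demands the entire population, which is the effect behind the 99.91% figure quoted in the introduction; the narrow-range group needs only a few hundred rows.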

3. ERROR-BOUNDED STRATIFIED SAMPLING

Our solution to big sparse data is based on the insight that we may only reduce the sample size and secure the error bound if we have better knowledge of X^g. Specifically, if we partition the range [a_g, b_g] into buckets [a^1_g, b^1_g] ∪ … ∪ [a^k_g, b^k_g], we will be able to reduce the sampling rates of individual buckets due to their smaller (b^i_g − a^i_g) in Equation 1, thereby possibly reducing the total sample size too.

A simple sampling scheme based on the above intuition is to equally partition the data range and apply the Hoeffding equation to sub-ranges using the global error bound ε_0. While this simple scheme may reduce the sample size to some extent, it ignores the data distribution and may not lead to a small sample. First, equal-length partitions mean that every sub-range is sampled in the same way, because b^i_g − a^i_g and ε_0 are the same across all sub-ranges. But sub-ranges do not have to have the same error bound, as long as the global error bound requirement is satisfied. Intuitively, some sub-ranges are more important than others in influencing the global estimation and the error bound. This flexibility allows us to increase the error bounds for less critical sub-ranges and reduce the total sample size. Second, the number of sub-ranges is still a parameter that needs to be determined. More partitions do not necessarily lead to a smaller sample size. As we will discuss later, the sample size of a sub-range must be an integer. Due to integer rounding, the total sample size can only increase when we keep adding more partitions. In Figure 2, we compare equal-length partitioning and variable-length partitioning (using the algorithm introduced later). As we can see, even if we can find the optimal number of partitions, equal-length partitioning still produces a larger sample size than variable-length partitioning does.

Figure 2: Equal-length partitioning vs. variable-length partitioning. (Sample size on the y-axis, number of partitions on the x-axis; the stratified, variable-length scheme reaches 42 partitions with sample size 199.)

We need an algorithm to find a sampling scheme that leads to an optimal sample size while satisfying the global error bound requirement. In this section, we introduce error-bounded stratified sampling, a variant of stratified sampling [13] that aims to minimize the sample size while satisfying the error bound requirement. For ease of exposition, we only consider designing a sampling scheme for a single group of values X = {x_1, ..., x_N} and omit all g's in sub- and super-scripts.

3.1 Problem Definition
Assume X = {x_1, ..., x_N}, x_i ∈ [a, b], are sorted. A stratified sampling scheme specifies two types of parameters:

1. Bucket partitioning. The sampling scheme partitions the range into k buckets, B_i = [a_i, b_i], i = 1, ..., k. Bucket B_i contains N_i values.

2. Buckets' error bounds. Given bucket B_i's boundaries [a_i, b_i] and its error bound ε_i, Equation 2 determines the number of values n_i that must be sampled out of B_i's population N_i. A special case is ε_i = 0, for which n_i = N_i, i.e., all data in the i-th bucket is kept.

Let \hat{X}_i be the estimation using the n_i values in B_i. The total sample size is n = n_1 + ... + n_k. The global estimation \hat{X}, the error bound ε_0 and those from individual buckets satisfy the following equations:

N = N_1 + \ldots + N_k    (4)

\hat{X} = \frac{N_1 \hat{X}_1 + \ldots + N_k \hat{X}_k}{N}    (5)

\varepsilon_0 = \frac{N_1 \varepsilon_1 + \ldots + N_k \varepsilon_k}{N}    (6)
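A toy numeric reading of Equations 5 and 6 (all numbers invented for illustration): with a populous head bucket and a small tail bucket, the head's weight N_1/N dominates both the global estimate and the global error bound, which is what lets the tail tolerate a loose bound:

```python
# Two buckets: a populous head with a tight bound and a small tail
# with a loose bound (illustrative values only).
N1, eps1, avg1 = 900, 0.1, 1.2
N2, eps2, avg2 = 100, 5.0, 150.0
N = N1 + N2                               # Equation 4

global_avg = (N1 * avg1 + N2 * avg2) / N  # Equation 5
global_eps = (N1 * eps1 + N2 * eps2) / N  # Equation 6
```

Even though the tail's bound is 50 times looser than the head's, the population-weighted global bound stays below 0.6.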

Problem 1 (Error-bounded Stratified Sampling). Given N sorted values {x_1, ..., x_N}, x_i ∈ [a, b], an error bound ε_0 and a confidence δ, find a partition of [a, b] and the error bound ε_i for each partition B_i, such that

1. \frac{N_1 \varepsilon_1 + \ldots + N_k \varepsilon_k}{N} \le \varepsilon_0, and

2. n_1 + \ldots + n_k is minimized.

Note that in the problem definition, we do not limit thenumber of buckets k.

3.2 An Optimal Solution
We present a dynamic programming algorithm that gives an optimal solution to Problem 1. At first glance, buckets' ranges and error bounds are the parameters of a stratified sampling scheme that form the search space. An algorithm needs to search the space to find an optimal sampling scheme that yields the minimal sample size. However, error bounds of individual buckets may not be integers, so we cannot enumerate them to search the whole space. Our algorithm is derived from an alternative view: while error bounds are innumerable, sample sizes can be enumerated. The algorithm enumerates the sample size from 1 through N (i.e., all data) and finds the minimal size that satisfies the error bound requirement.

Consider N sorted values X = {x_1, ..., x_N}. We define E_min(x_i, m) to be the minimal error bound that any stratified sampling scheme can achieve when sampling m values from [x_1, x_i] to estimate the AVG value in this range. Note that [x_1, x_i] may consist of more than one bucket. When m = 1 or m = i, E_min(x_i, m) is given by:

E_{\min}(x_i, 1) = H(1, \delta) = (x_i - x_1)\sqrt{\frac{1}{2}\ln\frac{2}{1-\delta}}    (7)

E_{\min}(x_i, i) = 0    (8)

In Equation 7, only one value is sampled. The error bound is given by the Hoeffding equation with the sample size set to 1. In Equation 8, the minimal error bound is 0, because all data are processed and the estimation is exact.

When m is between 1 and i, E_min(x_i, m) is computed from one of two cases: (1) the m values are sampled uniformly from [x_1, x_i], or (2) they are sampled from two or more buckets with different sampling rates. For the first case, E_min(x_i, m) is given by the Hoeffding equation.

For the second case, we decompose the computation of E_min(x_i, m) into two components: let [x_{c+1}, x_i] be the rightmost bucket of [x_1, x_i], and [x_1, x_c] be the remaining sub-range. We designate s values to be sampled from the sub-range, and the remaining m − s values to be uniformly sampled from the rightmost bucket. The global error bound of the m values from [x_1, x_i] is given by Equation 6, combining the two sub-ranges' error bounds:

\frac{E_{\min}(x_c, s) \cdot c + H(m - s, \delta) \cdot (i - c)}{i}

By the definition of E_min(x_i, m), we need to find the c and s that lead to the minimal error bound over all possible rightmost buckets when sampling m values.

Combining the two cases, we have:

E_{\min}(x_i, m) = \min\left\{ E^d_{\min}(x_i, m),\; (x_i - x_1)\sqrt{\frac{1}{2m}\ln\frac{2}{1-\delta}} \right\}    (9)

E^d_{\min}(x_i, m) = \min_{\substack{1 \le c < i \\ 1 \le s \le \min\{c,\, m-1\}}} \frac{E_{\min}(x_c, s) \cdot c + (x_i - x_{c+1})\sqrt{\frac{1}{2(m-s)}\ln\frac{2}{1-\delta}} \cdot (i - c)}{i}

By Equation 9, we can compute E_min(x_N, 1) through E_min(x_N, N). The optimal solution to Problem 1 is the minimal m such that E_min(x_N, m) ≤ ε_0. Backtracking m_min gives us the buckets' ranges and sampling rates.

Theorem 1. The complexity of the algorithm based on Equation 9 is O(N^4).

To compute E_min(x_N, 1) through E_min(x_N, N), there are O(N^2) entries to fill. To compute each entry, the min operator in Equation 9 iterates through c and s, comparing O(N^2) values. The final step of finding m_min is O(N). Altogether, the algorithm's complexity is O(N^4).

3.3 Heuristic Algorithms
The algorithm presented in the last section has a high complexity. Our experience with SCOPE is that any sampling algorithm whose complexity is higher than O(N^2) introduces considerable cost that outweighs sampling's benefits. In this section, we present a heuristic algorithm that compromises optimality but improves efficiency. We show that while the algorithm may not find the minimal sample size, the derived stratified sampling cannot be worse than conventional uniform sampling and often leads to significant improvements.

One cause of the optimal algorithm's high complexity is the variable error bounds of individual buckets. A simplification is to require each bucket's error bound to be the same, i.e., ε_1 = ... = ε_k = ε_0. By Equation 6, the global error bound requirement is then obviously satisfied.

When we set ε_i = ε_0, i ∈ [1, k], an O(N^2) algorithm can find the global optimum: let S(x_i) be the minimal sample size for [x_1, x_i] while each bucket's error bound is ε_0. A dynamic programming algorithm that finds the minimal sample size is given by the following recurrence:

S(x_i) = \min_{1 \le j \le i-1}\left( S(x_{j-1}) + \min\left\{ \left\lceil \frac{(x_i - x_j)^2 \ln\frac{2}{1-\delta}}{2\varepsilon_0^2} \right\rceil,\; N_i - N_j \right\} \right)

Instead of seeking the global optimum, we use a local-optimum search heuristic to further reduce complexity. The pseudo code is shown in Algorithm 1. Given an array of sorted values, the algorithm scans the array from head to tail. For each new value encountered, the algorithm tries to merge it into the current bucket (Line 5). If the merging extends the bucket's value range and increases the number of values to be sampled under the error bound ε_0, a new bucket is created (Lines 10 through 12); otherwise, the value is merged into the bucket and the bucket's range is updated (Line 8). The algorithm's complexity is O(N).

Algorithm 1: A linear algorithm for finding an error-bounded stratified sampling scheme

Input: An error bound ε_0, a sorted array of values X = {x_1, ..., x_N}
Output: A set of buckets B = {B_1, ..., B_k}

 1: B_curr = [x_1, x_1], B_curr.SampleSize = 1
 2: B = {B_curr}
 3: i ← 2
 4: while i ≤ N do
 5:   Let B_t = [B_curr.lower, x_i]
 6:   Compute B_t.SampleSize by Equation 2
 7:   if B_t.SampleSize ≤ B_curr.SampleSize + 1 then
 8:     B_curr = B_t
 9:   else
10:     B_new = [x_i, x_i], B_new.SampleSize = 1
11:     B ← B ∪ {B_new}
12:     B_curr = B_new
13:   end if
14:   i ← i + 1
15: end while
16: return B
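A Python sketch of Algorithm 1 (helper names are ours; we read the Line 7 test as "merge when extending the bucket costs no more than opening a new singleton bucket"):

```python
import math

def sample_size(lo, hi, population, eps, delta):
    # Equation 2, clamped to at least one sample so a bucket of
    # identical values still contributes an (exact) estimate.
    raw = (hi - lo) ** 2 * math.log(2 / (1 - delta)) / (2 * eps ** 2)
    return min(max(1, math.ceil(raw)), population)

def stratify(values, eps, delta):
    """values must be sorted ascending; returns (lo, hi, count, n_i) buckets."""
    lo = hi = values[0]
    count, size = 1, 1
    buckets = []
    for x in values[1:]:
        merged = sample_size(lo, x, count + 1, eps, delta)
        if merged <= size + 1:            # merging is no costlier (Line 7)
            hi, count, size = x, count + 1, merged
        else:                             # open a new singleton bucket (Lines 10-12)
            buckets.append((lo, hi, count, size))
            lo, hi, count, size = x, x, 1, 1
    buckets.append((lo, hi, count, size))
    return buckets
```

On a long-tail array this typically yields one cheap head bucket plus a few small tail buckets, and by Theorem 2 the total sample never exceeds the uniform-sampling size.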

Theorem 2. Given a user-specified error bound ε_0, the stratified sampling scheme given by Algorithm 1 samples no more data than uniform sampling.

Proof. If the algorithm returns one bucket, stratified sampling is equivalent to uniform sampling. Next, we prove that stratified sampling with two buckets samples no more data than uniform sampling.

Let [x1, xN] be the value range, and [x1, xc] and [xc+1, xN] be the two buckets' ranges. From the Hoeffding equation, we have:

ln(2/(1−δ)) · (xN − x1)² / (2ε²) ≥ ln(2/(1−δ)) · [ (xN − xc+1)² / (2ε²) + (xc − x1)² / (2ε²) ]    (10)

By case analysis, we can also prove the following inequality:

min{ x + y, N } ≥ min{ x, N1 } + min{ y, N2 }    (11)

where N = N1 + N2. Putting Inequalities 10 and 11 together, we have:

min{ ln(2/(1−δ)) · (xN − x1)² / (2ε²), N } ≥ min{ ln(2/(1−δ)) · (xN − xc+1)² / (2ε²), N1 } + min{ ln(2/(1−δ)) · (xc − x1)² / (2ε²), N2 }

The left-hand side of the inequality is the sample size when applying uniform sampling. The right-hand side is the total sample size when using uniform sampling within each of the two buckets.


For stratified samplings with more than two buckets, we can prove by induction: let [xj, xN] be the rightmost bucket and [x1, xj−1] be the remaining range. We first show that using two or more buckets for [x1, xj−1] is no worse than using one bucket. Then we show that combining [x1, xj−1] and [xj, xN] is no worse than uniform sampling over [x1, xN].

Even when the data is uniformly distributed, stratified sampling can still drive down the sample size through range partitioning, because Inequalities 10 and 11 are distribution-oblivious. However, they do not always imply a "the more buckets the better" claim. Intuitively, when xN − x1 is small enough, adding more buckets does not help. In particular, the sample size computed by the Hoeffding equation must be rounded to an integer. When adding buckets only changes the decimals, there is no need to do so.

4. OFFLINE SAMPLING

This section develops the necessary steps and extensions to apply the core algorithm to real-world production workloads. We first extend single-AVG queries to a wider scope of queries. We then discuss generating sampling schemes offline, which are then used to build sample sketches to answer queries. Finally, we study the problem of incremental maintenance of sample sketches in the face of changing data, and present algorithms for the two most common update modes.

4.1 Beyond a Single AVG Aggregation

So far we have been focusing on queries with a single AVG aggregation. We extend the proposed sampling technique in two ways: queries with SUM and COUNT aggregations, and queries with multiple aggregations.

A COUNT aggregation calculates the number of records from an input. Unlike the average, this number is proportional to the input size, a parameter that cannot be derived from a sample but only through a full scan. Fortunately, this number is usually computed once, when a sample is built. It is later stored as meta-data and does not need any online computation.

Given an AVG estimation and the data size, a SUM aggregation can be easily computed by multiplying the average and the full data size. The SUM estimation's confidence is the same as that of AVG. Its error bound is εavg · N, where εavg is the error bound of the AVG estimation and N is the data size. In general, any aggregation that is an algebraic combination f of basic aggregations (i.e., SUM, AVG or COUNT) can be estimated by combining their individual estimations: f(e1, . . . , en). The error bound is given by the minimal and the maximal values that f(e1 ± ε1, . . . , en ± εn) can achieve.
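As an illustrative sketch (our own helper, not from the paper), the interval for an algebraic combination f can be bracketed by evaluating f at the corners ei ± εi; this yields the exact extrema when f is monotone in each argument, as it is for SUM = AVG · COUNT:

```python
from itertools import product

def combined_bound(f, estimates, bounds):
    # Evaluate f at every corner of the box [e_i - eps_i, e_i + eps_i]
    # and return the widest interval; exact for componentwise-monotone f.
    corners = [f(*pt)
               for pt in product(*[(e - b, e + b)
                                   for e, b in zip(estimates, bounds)])]
    return min(corners), max(corners)
```

For example, a SUM estimate derived from AVG = 500 ± 25 over N = 10,000 records is combined_bound(lambda a, n: a * n, [500, 10000], [25, 0]), i.e., the interval (4750000, 5250000).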

When a query includes more than one aggregation (over more than one column), we may decompose the original query into multiple sub-queries, and generate one sample for each aggregation. This approach requires efforts to decompose the input query and a subsequent step to join the sub-queries' results. Alternatively, we may produce one sample that serves all aggregations. This can be done by unioning the sample schemes of the individual columns. The combined scheme is essentially a voting machine: a row or record from the input is sampled if any scheme says so. While this mechanism may over-sample, in the sense that a sampled record may not be needed for all aggregations, it ensures that all aggregations' estimations can satisfy the error bound requirements.

4.2 Sampling Scheme and Sample Sketches

The output of a sampling algorithm for a group of values is a sampling scheme. A sampling scheme is a set of buckets B = {B1, . . . , Bk}, each represented as a triple Bi = 〈li, ui, εi〉. εi is Bi's error bound, and determines how many records Bi is expected to receive, as specified by Equation 2. We use εi instead of ε0 or ni to deal with the case when we need to reuse the sampling schemes over changing data, as will be clear later. Also, if εi = 0, Bi's sample size ni is as large as its population Ni. When there are multiple groups g1, . . . , gn, the sampling scheme is the union of the sampling schemes of all groups: {Bg1, . . . , Bgn}.

We can use a sampling scheme to sample raw data and produce a sample sketch. A sample sketch is basically a subset of the raw data. In addition to the raw data's columns, each record in the sketch contains one additional column, ScaleFactor. The value of this column is N_j^{gi}/n_j^{gi}, the inverse of the overall sampling rate of the bucket B_j^{gi} to which the record belongs. This column is maintained to facilitate estimating aggregations that are scale-dependent, e.g., SUM, so that during query answering we can reconstruct the estimation at the real scale. A similar technique is introduced in [1].

To answer queries using sketches, we need to rewrite queries by replacing the input with the sample sketch. Scale-independent aggregation operators, such as AVG, are unchanged; scale-dependent aggregation operators, such as SUM and COUNT, need to incorporate ScaleFactor to reconstruct the estimation at the raw data's scale. Similar to the approach introduced in [1], a SUM aggregation SUM(X) is rewritten to SUM(X ∗ ScaleFactor); a COUNT aggregation COUNT(∗) is rewritten to SUM(ScaleFactor).
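The rewriting amounts to the following estimator over sketch rows (a minimal sketch of ours, assuming each row carries an aggregated value and its ScaleFactor):

```python
def estimate_from_sketch(rows):
    # rows: iterable of (value, scale_factor) pairs from a sample sketch.
    # SUM(X)   -> SUM(X * ScaleFactor)
    # COUNT(*) -> SUM(ScaleFactor)
    # AVG is scale-independent, but can be recovered as SUM / COUNT.
    total = sum(x * sf for x, sf in rows)
    count = sum(sf for _, sf in rows)
    return {"SUM": total, "COUNT": count, "AVG": total / count}
```

With a 50% sampling rate (ScaleFactor = 2.0), two sampled rows with values 10 and 20 reconstruct SUM = 60 and COUNT = 4 at the raw data's scale.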

Given a sampling scheme, a sketch-building processor computes a sample sketch by scanning the raw data. For each record/row scanned, the processor finds the group gi and the bucket B_j^{gi} that this record falls in. The processor then determines whether this record should be kept or discarded, according to its error bound ε_j^{gi}: ε_j^{gi} specifies how many records (i.e., n_j^{gi}) must be kept in bucket B_j^{gi}. This is implemented by inserting records into a reservoir and randomly replacing records in the reservoir after it is full, also known as the reservoir sampling algorithm [18]. After all data is scanned, bucket B_j^{gi}'s population N_j^{gi} is also known, from which ScaleFactor can be computed.
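The per-bucket reservoir step is standard reservoir sampling [18]; a minimal one-pass sketch:

```python
import random

def reservoir(stream, k, rng=random):
    # Keep a uniform random k-subset of a one-pass stream: fill the
    # reservoir first, then replace a random slot j < k with
    # probability k / (i + 1) for the (i + 1)-th element.
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = x
    return sample
```

In the sketch builder, k is the bucket's required sample size n_j^{gi} derived from ε_j^{gi} by Equation 2.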

Overall, building a sample sketch consists of two steps: generating a sampling scheme and building the sample sketch. These two steps need to scan the raw data twice, incurring non-negligible cost. However, the generated sample sketch can be used by multiple queries many times, effectively amortizing the cost to a low level.

4.3 Incremental Maintenance of Sample Sketches

The sampling algorithms proposed in Section 3.3 all have complexities no lower than O(N), which can still be deemed expensive when the data scale is very large. While a sample sketch is only built once and can be re-used by many queries, re-building a new sketch from scratch upon data updates will incur significant overhead, especially when updates are frequent.

A common approach to circumvent this overhead is to incrementally update the sample sketch, so that each time only updated data is processed. This approach fits our applications well, because our production jobs mainly target logs


that continuously receive new data on an hourly or daily basis. We are particularly interested in two update modes that are widely observed in production jobs: landmark windowed queries and sliding windowed queries. Landmark windowed queries target data from a landmark to the present. When new logs are appended, they are patched to the sample sketch. Sliding windowed queries focus on logs in a fixed-length window. As new logs arrive, out-of-interest old logs are removed from the sketch. In either case, logs are immutable, and the queries only see log additions and deletions.

In the following, we first analyze the maintenance guideline and then discuss the detailed maintenance mechanisms of the two modes.

4.3.1 Error-bounded Maintenance

A sampling scheme of a group of values is represented as B = {B1, . . . , Bk} with Bi = 〈li, ui, εi〉. When new data is added or old data is deleted, the samples in individual buckets change too. Let Bi's new sample size and population be n′i and N′i after updates. Bi's new error bound ε′i is computed by Equation 12:

ε′i = (bi − ai) · √( (1/(2n′i)) · ln(2/(1−δ)) )   if n′i < N′i
ε′i = 0                                          if n′i = N′i     (12)

Incremental maintenance algorithms must ensure that the global error bound ε′ is still no bigger than the initial error bound requirement ε0 after the updates, i.e.,

ε′ = (N′1 · ε′1 + . . . + N′k · ε′k) / (N′1 + . . . + N′k) ≤ ε0    (13)

When the above equation cannot be satisfied, the sample sketch cannot be incrementally updated; a new sample sketch must be re-built from scratch.

Remarks. Incrementally updating a sample sketch is equivalent to using the old sampling scheme to sample the new snapshot. While by Equation 13 we can ensure the error bound requirement is still satisfied, the updated sample's size may not be optimal, because the new snapshot may describe a new distribution, demanding a new sampling scheme to achieve optimality. Nevertheless, incremental maintenance is still appealing to our applications. Logs received from various sources often reflect applications' states or users' behaviors. These signals usually change gradually, and so do the underlying data distributions. A sampling system using an old scheme can still keep the sample size at a low level, while benefiting from the low maintenance overhead. From an engineering perspective, we may periodically re-compute the sampling scheme, e.g., once a month, and use the same scheme to incrementally maintain the sample sketch during the time window.
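Equations 12 and 13 reduce to a small check. The helper below is a sketch of ours, assuming each bucket is an (a, b, n′, N′) tuple of boundaries, sample size and population:

```python
import math

def bucket_error(a, b, n, N, delta):
    # Equation 12: a fully sampled bucket has zero error; otherwise the
    # Hoeffding bound for range (b - a) at sample size n.
    if n >= N:
        return 0.0
    return (b - a) * math.sqrt(math.log(2.0 / (1.0 - delta)) / (2.0 * n))

def global_error(buckets, delta):
    # Equation 13: population-weighted average of the per-bucket bounds.
    total = sum(N for (_, _, _, N) in buckets)
    return sum(N * bucket_error(a, b, n, N, delta)
               for (a, b, n, N) in buckets) / total
```

Incremental maintenance is legal as long as global_error(buckets, delta) stays at or below ε0 after the update; otherwise the sketch must be rebuilt.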

4.3.2 Landmark Windowed Queries

A landmark windowed query is over the data from a landmark to the present. Once a sample sketch is built over an old snapshot, newly added data is incrementally patched to the sketch. In other words, the maintenance is insertion-only.

Maintaining a sample sketch upon data insertions is similar to the sample-building procedure. When a new data record xi arrives, the maintenance algorithm inserts it into the sketch with a certain probability. Specifically, the algorithm first looks up the bucket Bi to which this record belongs, and increases

Bi's population to N′i = Ni + 1. With reservoir sampling, this new record randomly replaces an existing record in Bi. Since Bi's population is changed to N′i, the ScaleFactor column of all the records in the bucket is updated as well.

A special case during the maintenance is that a new record xi may fall outside all existing buckets. It means that this new value is out of the boundaries of the prior data distribution. Such new records must all be kept in the sample sketch, with ScaleFactor = 1. Intuitively, we have no knowledge of the data distribution outside the existing boundaries. The only way we are able to secure the error bound is to sample all of them. It is equivalent to adding a new bucket Bxi = 〈xi, xi, ε = 0〉 to the sampling scheme B.

By using the above maintenance algorithm, the error bound in the updated sample sketch is always no bigger than the initial error bound. This is because the error bounds of existing buckets never change and the error bounds of new buckets are always zero. Overall, the error bound is still guaranteed.
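The insertion path can be sketched as follows (hypothetical Python of ours; the dict-based bucket layout is an assumption):

```python
import random

def landmark_insert(scheme, x, rng=random):
    # Insertion-only maintenance: reservoir-replace inside the matching
    # bucket, or add a zero-error singleton bucket for out-of-range values.
    for b in scheme:  # b: {"lo", "hi", "N", "n", "rows"}
        if b["lo"] <= x <= b["hi"]:
            b["N"] += 1                      # ScaleFactor N/n changes too
            j = rng.randrange(b["N"])
            if j < b["n"]:
                b["rows"][j] = x             # replace a reservoir slot
            return
    # out of all boundaries: keep the record (epsilon = 0, ScaleFactor = 1)
    scheme.append({"lo": x, "hi": x, "N": 1, "n": 1, "rows": [x]})
```

Note that an in-range insertion grows the bucket's population but not its reservoir, so its error bound (Equation 12) can only shrink.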

4.3.3 Sliding Windowed Queries

A sliding windowed query targets data in a fixed-length window. As new data arrives, it is patched to the sample sketch. Old data out of interest is removed from the raw data and the sketch. Let B = {B1, . . . , Bk} be a group's sampling scheme. We also use Bi to denote the set of sampled records in this bucket. Let N^d_i be the number of old records in the raw data, and n^d_i be the number of old records in bucket Bi in the sketch. Let N^a_i be the number of new incoming records in the raw data. While all n^d_i old records must be removed from Bi, only a fraction of the N^a_i records may eventually be added to the sketch, because not all new data has to be sampled. We denote this number as n^a_i ≤ N^a_i. After data is removed and added to the sketch, bucket Bi's population becomes

N′i = Ni − N^d_i + N^a_i

The number of sampled records in Bi is

n′i = ni − n^d_i + n^a_i

Given N′i, n′i and Bi's boundaries [ai, bi], Equation 12 gives Bi's new error bound. The new sketch must satisfy the global error bound, as specified by Equation 13.

Among all the above variables, N^d_i, N^a_i and n^d_i are known. A maintenance algorithm needs to find n^a_i, so that each bucket knows its reservoir size. This is an integer programming problem: the algorithm aims to find n^a_1, . . . , n^a_k such that the size of the updated sample sketch is minimized:

minimize   Σ_{i=1}^{k} (ni − n^d_i + n^a_i)

subject to

( Σ_{i=1}^{k} N′i · G(N′i, n′i, ai, bi, δ) ) / ( Σ_{i=1}^{k} N′i ) ≤ ε0
0 ≤ n^a_i ≤ N^a_i
n^a_i ∈ Z

where G(·) is a shorthand for Equation 12. Note that n^a_i has a finite space, from 0 to N^a_i. It is possible that no value in the space satisfies the global error bound requirement. In such a case, the sketch cannot be incrementally maintained, and must be re-built from scratch.

Integer programming is known to be NP-hard. We propose a heuristic algorithm (see Appendix) to find a maintenance scheme for the sketch. The idea is to consider buckets with high populations first. A record added to these buckets


is likely to have more impact in reducing the global error bound than one added to a low-population bucket, because the global error is a weighted sum. The algorithm iterates through all buckets, from the highest population to the lowest one, and uses a local-optimal heuristic in deciding how much data to add to each bucket.
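A greedy rendering of this idea (our own sketch, not the paper's Appendix algorithm): buckets are dicts where N is the post-update population N′, n is the sample count after deletions (ni − n^d_i), and Na caps the admissible new arrivals n^a_i.

```python
import math

def eps_prime(a, b, n, N, delta):
    # Equation 12 for a bucket with boundaries [a, b].
    if n >= N:
        return 0.0
    return (b - a) * math.sqrt(math.log(2.0 / (1.0 - delta)) / (2.0 * n))

def sliding_maintenance(buckets, eps0, delta):
    # Visit buckets from highest population to lowest, growing the number
    # of admitted arrivals na until the weighted bound (Equation 13) holds.
    for bk in buckets:
        bk["na"] = 0

    def global_eps():
        tot = sum(bk["N"] for bk in buckets)
        return sum(bk["N"] * eps_prime(bk["a"], bk["b"],
                                       bk["n"] + bk["na"], bk["N"], delta)
                   for bk in buckets) / tot

    for bk in sorted(buckets, key=lambda b: b["N"], reverse=True):
        while bk["na"] < bk["Na"] and global_eps() > eps0:
            bk["na"] += 1
    return global_eps() <= eps0   # False: the sketch must be rebuilt
```

When the loop exhausts every bucket's arrivals without meeting ε0, the function reports that incremental maintenance is infeasible and a rebuild is required.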

5. EXPERIMENTAL STUDY

In this section, we report an experimental study of error-bounded sampling. We implemented the proposed techniques in SCOPE and evaluated their performance. All experiments were run in a cluster of 200 nodes. To the best of our knowledge, no existing techniques target big sparse data with error bound guarantees. Hence, we compare our approaches against error-bounded uniform sampling, which randomly samples data in each group; the sample size is given by Equation 2 for each group. This is also the approach used in a recent system that promises error bounds [2]. In the following figures, we use Stratified to denote our approach and Uniform to denote the uniform sampling.

We drive the experiments with both synthetic and production workloads. For the latter, we use log streams that consist of various search logs whose sizes are 20 TB on average. We select 50 representative queries out of hundreds of daily jobs. These queries follow the template below:

SELECT C1, C2, ..., Cn,
       AggFunc(A1), AggFunc(A2), ..., AggFunc(Am)
FROM LogStream PARAMS(StartDate, EndDate)
WHERE Predicates
GROUP BY C1, C2, ..., Cn;

The number of GROUP-BY attributes n varies from 2 to 5, and the number of aggregations m ranges from 1 to 3. The aggregation operators are SUM, COUNT and AVG. These queries probe different base logs, ranging from one day to one month. The selectivities of the queries' predicates vary from 0.01% to 80%.

All errors in the experiments are reported as relative errors, to help readers understand the approximation introduced by sampling without revealing application contexts.

5.1 Effects of Data Skewness

Sparse data is the major factor that invalidates most prior efforts. In this subsection, we evaluate the effectiveness of our sampling technique over sparse data. We first use synthetic data sets whose sparseness is quantifiable. We then randomly select 10 production queries to show the real sparseness of production data and examine the effectiveness of our approach.

To quantify data sparseness, we use the Zipf distribution, whose skewness can be controlled by the z-parameter. When z is 0, the data is uniformly distributed; when z is 1.5, the data is highly skewed (and sparse by our definition). We generate a number of single-group data sets whose value ranges vary from 1K to 100K. Each data set has the same size: 10K. For a fair comparison, we fix the mean to 500. The aggregation operator is AVG.

First, we examine the effects of data skewness under different error bound constraints. We fix the value range to 1K and the confidence to 95%, and vary the error bound and data skewness. As shown in Figure 3(a), our approach generates much smaller samples than the uniform sampling under different error bounds. For the uniform sampling, when the

[Figure 3: Effects of data sparseness. (a) Sampling rate vs. relative error bound (1%–50%) under data skewness z = 0, 0.8 and 1.5; (b) sampling rate vs. relative error bound under value ranges 1K, 10K and 100K. Each panel compares Stratified and Uniform.]

error bound is small, even if the data is uniformly distributed (z = 0), the sampling rate is still close to 100%. When the data is more skewed (z = 0.8 and 1.5), uniform sampling shows similar performance. The reason is that uniform sampling is distribution-oblivious; the Hoeffding equation is based on the value range and the error bound, which in this case yield a very high sampling rate. By comparison, our approach is distribution-aware. It creates buckets according to the data distribution. When the data becomes more skewed, some buckets have higher populations and smaller sampling rates. Though we may have to use a high sampling rate for the remaining buckets, their populations are not significant. Therefore, the total sample size is small.

Second, we evaluate sampling performance with different value ranges. We change the value range from 1K to 100K and fix z to 1.5. The sampling rates of both approaches are compared in Figure 3(b). When the value range increases, by Equation 2, the sample size increases. Therefore, the uniform sampling generates larger samples. When the value range increases to 10K, the uniform sampling samples almost 100% of the data even when the error bound reaches 20%. When the value range becomes 100K, it needs to sample almost the whole data set to satisfy even a 50% error bound. By comparison, our stratified approach always generates small samples across different value ranges.

Next, we inspect the impact of confidence by varying it from 50% to 95%, with a fixed value range of 1K and an error bound of 5%. As illustrated in Figure 4, as confidence increases, the sample size of the uniform sampling increases dramatically. However, our approach shows stable performance. When the confidence becomes 95%, the uniform sampling needs to sample 7 times more data than the stratified approach does in order to satisfy a 5% error bound.


From the above experiments, we conclude that our stratified sampling performs much better than the uniform sampling baseline under different data sparseness, error bound and confidence settings.

[Figure 4: Effects of confidence. Sampling rate vs. confidence (50%–95%) under data skewness z = 0, 0.8 and 1.5, comparing Stratified and Uniform.]

Finally, we evaluate both approaches on real-life production workloads. We randomly select 10 production queries and generate the offline samples using the uniform sampling and the stratified sampling. The error bound is 5% and the confidence is 95%. The input data volume of those queries is 8 TB on average. Since some of the queries' predicates are selective, we cannot calculate the sampling rate using the original data size as the base. Instead, we use the size of the data after the predicates' evaluation as the base. The results are illustrated in Table 2. We can see that when using the uniform sampling, the sampling rates range from 57% to 97%. The stratified sampling, on the other hand, is able to reduce the sampling rate to less than 5%. For Query 4, the stratified sampling rate is only 0.06%, a 99% reduction compared with the uniform sampling. This query has 5 groups, most of which are big sparse data, and thus stratified sampling is particularly effective.

5.2 Effects of Data Volume

In this section, we evaluate the performance of both approaches over inputs with different data volumes. The experiments are conducted using production workloads. We fix the error bound to 5% and the confidence to 95%, and build offline sample sketches over log streams whose sizes range from 2 TB to 20 TB. The performance results we show here are the averages of a number of selected queries.

First, we examine the effects of data volume on the sample size. As shown in Figure 5(a), as the input volume increases, both approaches generate larger samples. For the uniform sampling, the sample size grows linearly with respect to the data volume. This is because when the data volume increases, the value range increases too. By Equation 2, the sample size increases as the value range increases. For the stratified approach, the increase of the sample size is not dramatic, because it has resilient performance against different value ranges, as shown in the experiments in Section 5.1. When the input log grows to 20 TB, the uniform sampling produces 1340 times more data than the stratified approach.

In addition to sample sizes, machine time is a measure that sums up the processing time of all participating nodes in SCOPE. It reflects how many computation resources a query consumes. In Figure 5(b), we use the samples built from the two approaches to answer queries and report the machine time of query processing. As we can see, the machine time

[Figure 5: Effects of data volume (2 TB to 20 TB). (a) Sample sketch size (M, log scale) for Stratified vs. Uniform; (b) machine time (s); (c) response time of Stratified normalized by that of Uniform.]

increases linearly with the data size for the uniform sampling. This is due to the increase of the sample size. Our stratified approach consumes much less machine time, because it generates smaller samples for queries to use. A careful reader may notice that while the samples built from the uniform sampling are a few hundred times larger than those built from the stratified sampling, the machine time of using the former samples is only dozens of times bigger. The reason is that there is a relatively large initialization cost in the early stages of the query execution pipeline, a cost that is independent of the data volume.

We also measure query execution time in this experiment. In a parallel environment, query execution time depends on how many machines or nodes are allocated to a query. In SCOPE scheduling, the number of nodes allocated to a query is usually determined by the size of the input. Such scheduling means that it is not completely fair to compare query execution times when we use samples generated by different approaches for query answering, because these samples may have very different sizes. At any rate, we normalize


Table 2: Sampling rate of production workload

Sampling Approach |   Q1   |   Q2   |   Q3   |   Q4   |   Q5   |   Q6   |   Q7   |   Q8   |   Q9   |  Q10
Uniform           | 89.85% | 75.37% | 94.38% | 57.11% | 81.81% | 85.89% | 92.64% | 96.96% | 95.07% | 86.71%
Stratified        |  0.17% |  0.14% |  0.11% |  0.06% |  0.10% |  4.86% |  3.04% |  1.50% |  2.01% |  0.11%

the query execution time of the stratified sampling by that of the uniform sampling and show the results in Figure 5(c). We can see that the stratified sampling takes only 7% to 23% of the execution time of the uniform sampling. The savings in query execution time are not as significant as the savings in machine time, since these queries are scheduled with fewer nodes because their inputs (i.e., the samples) are smaller. This means that our sampling approach saves both execution time and resources.

5.3 Effects of Error Bounds

[Figure 6: Effects of error bound (1%–50%). (a) Sample sketch size (M, log scale) for Stratified vs. Uniform; (b) machine time (s).]

We compare both approaches under different error bounds in this group of experiments, using production workloads. The results are averaged over all selected queries. The confidence is set to 95%. The input stream is 8 TB. The sizes of the samples produced by the two approaches are shown in Figure 6(a). As we can see, when the error bound increases, the sample size of the uniform sampling decreases dramatically. The stratified approach produces small samples under all error bound constraints.

The machine time of using the samples to answer queries is shown in Figure 6(b). For both approaches, machine time drops as the error bound increases. Compared with the uniform sampling, the stratified sampling is much more robust against changes of the error bound. The machine time saved by the stratified sampling compared with the uniform sampling is not as significant as the savings in sample sizes, because of the query initialization cost in SCOPE.

5.4 Incremental Maintenance of Samples

In this group of experiments, we evaluate the performance of the algorithms that incrementally maintain samples for sliding windowed queries, by varying window sizes, sliding steps and error bounds over a one-week production log stream with production workloads. The results are averaged over all selected queries. First, we set the sliding step to 3 hours and the window size to 2 days, and change the error bound. The sample sizes through either incremental maintenance or rebuilding are shown in Figure 7. The figure shows that in most cases incrementally updated samples are not significantly larger than the rebuilt ones, provided maintenance does not violate the error bound guarantee. In the worst-case scenario, the updated sample is 50% larger than the one rebuilt from scratch. This increase is significant. But given that the sample size is itself very small, such an increase is acceptable in practice.

An interesting observation of Figure 7 is that sometimes the sample size after incremental maintenance can even be smaller than that of the rebuilt samples, e.g., as shown in Figure 7(d). The explanation is that the heuristic stratified algorithm assumes that the error bound of each bucket is the same, in order to lower the algorithm's complexity. During incremental maintenance, when old data is deleted and new data is inserted, the maintenance algorithm may adjust the buckets' error bounds such that the global error bound requirement is satisfied. It means that some buckets have error bounds higher than the default one, and therefore produce smaller samples than before the maintenance. This is also the reason why we maintain the error bound instead of the sample size in the sampling scheme.

During incremental maintenance, it is possible that a sample after maintenance cannot satisfy the error bound constraint and therefore must be rebuilt. These cases are highlighted in Figure 7 as circled points. Across the four panels of Figure 7, the number of rebuilding timestamps decreases as the error bound increases. This is understandable: the larger the error bound, the more room the algorithm has to adjust each bucket's error bound and therefore keep the global error bound within the requirement. Overall, the number of rebuilding timestamps is not significant for real-life workloads and data; samples can be incrementally updated in many cases, saving a fair amount of computation resources.

In the second experiment in this group, we set the error bound to 15%, the sliding step to 12 hours, and change the window size from 2 days to 4 days. We report the results in Figure 8. Similar to the last experiment, the sample size after incremental maintenance is close to that of the rebuilt sample, except in a handful of cases when the sample must be rebuilt.

6. RELATED WORK

Data sampling for aggregation queries has been studied in the database literature as approximate query processing


[Figure 7: Incremental maintenance with different error bounds. Four panels plot sample size (M) over one week of update timestamps (3-hour steps, Mon–Sun) for error bounds of (a) 5%, (b) 10%, (c) 15% and (d) 20%, comparing incremental maintenance against always rebuilding; rebuild timestamps are circled.]

[Figure 8: Incremental maintenance with different sliding window sizes. Sample size (M) over one week of update timestamps (12-hour steps, Mon–Sun) for window sizes of 2, 3 and 4 days, comparing maintenance against rebuilding; rebuild timestamps are marked.]

(AQP). Chaudhuri et al. gave a thorough survey of AQP [6]. According to the optimization goals, sampling techniques can be classified into two categories: (1) space-constrained, minimizing the error bound within a limited sample space, and (2) error-bound-constrained, minimizing the sample size while satisfying a pre-defined error bound, which is the problem we address in this paper.

For the first category, the maximum available sampling space (or sample size) is considered a budget constraint, a setting that is common in traditional database management systems. The optimization target is to minimize the error bound. A few papers presented biased sampling approaches for GROUP-BY queries when data distributions across groups have large variances [1, 16]. Chaudhuri et al. proposed to build the outliers of the data into a separate index [5]. This approach, however, is only effective when data skewness is caused by outliers or deviants. Chaudhuri et al. also proposed a stratified sampling plan for a lifted workload to minimize the errors [6]. Ganti et al. proposed a weight-based approach that prefers to sample data records that are asked for by most queries [8]. As a result, this approach delivers poor error estimations for less frequent queries, which is unacceptable in many scenarios. Babcock et al. built a family of precomputed samples to answer queries [3]. For each incoming query, a subset of samples is selected dynamically to improve the accuracy. Overall, the techniques in the first category cannot be applied directly to solve the error-bounded problem that we are addressing.

Techniques in the second category can provide guarantees on the error bound. Online aggregation [10] and its variants [20, 21, 19, 9] process users' aggregation queries in an online fashion: they continuously produce aggregated results together with an error bound and a confidence, and users can stop the execution whenever the error bound meets their requirement. Some efforts have focused on implementing online aggregation in MapReduce environments [7, 14].
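As a concrete illustration of this stopping rule (a hedged sketch, not the implementation of any of the cited systems), the following Python fragment maintains a running AVG over a stream of sampled records and halts as soon as a Hoeffding-style error bound falls below the user's target. The function name, the value range (a, b), and the confidence parameter delta are assumptions introduced here for illustration.

```python
import math

def online_aggregate(stream, value_range, eps_target, delta=0.05):
    """Running AVG with a Hoeffding error bound; stops early once the
    bound meets the user's target. Assumes a non-empty stream of values
    known to lie within value_range = (a, b)."""
    a, b = value_range
    total, n, eps = 0.0, 0, float("inf")
    for v in stream:
        total += v
        n += 1
        # Hoeffding bound: with probability >= 1 - delta,
        # |estimate - true mean| <= (b - a) * sqrt(ln(2/delta) / (2n))
        eps = (b - a) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
        if eps <= eps_target:
            break  # the user's error bound is met: stop early
    return total / n, eps, n
```

As the related work notes, on sparse or skewed data this loop may exhaust nearly the entire input before the bound is met, which is exactly the failure mode the paper's stratified approach targets.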

SciBORQ is a new system that restricts error bounds for analytical queries [17]. Recent attempts use the bootstrapping technique over the MapReduce framework [12]. The idea is to perform multiple sampling iterations until the computed errors are below the user's requirements. Nevertheless, these "on-the-fly" approaches do not target sparse data intentionally. As a result, their executions often cannot stop early and have to process almost all data before the error bound can be satisfied. In the recent work BlinkDB [2], the authors build multiple offline samples with different error bounds. In the online stage, they select the most appropriate sample to answer a query. However, they apply stratified sampling on individual groups, which is not effective when data is sparse inside each group.

The error-bounded stratified sampling technique proposed in this paper not only solves the error-bound-constrained problem, but is also effective when data is sparse, a common setting for in-production workloads.

7. CONCLUSIONS

Sampling as a means to reduce the cost of aggregation queries is a well-received technique. It is believed that sampling would bring more benefits to approximate queries as the volume of data increases. Our experience with production jobs in SCOPE, however, proves that this notion is naive. As a matter of fact, what we have observed (and reported in this paper) is that simple-minded uniform sampling is ineffective: its sample size keeps increasing with data volume and in the end delivers no performance gains. The culprit is sparse data (with respect to the data range), whose skewness in distribution gets more severe as more and more data are included. An effective sampling scheme must understand the nature of the data, divide the data into regions, and sample them appropriately. The increased cost of sampling can be amortized over different queries that share the samples, and over time by using incremental updates. We have developed theoretically optimal as well as practical heuristic variants, and verified our techniques against real-world production jobs with a real implementation. The important lesson is that system optimizations are becoming increasingly more data-dependent.

8. ACKNOWLEDGEMENTS

We would like to thank Haixun Wang for his insightful comments on this work. We thank An Yan, Redmond Duan and Xudong Zheng for providing endless support during the investigation. We would also like to give our acknowledgements to Tao Yang and Xuzhan Sun for their help on the evaluation.


9. REFERENCES

[1] S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. In SIGMOD, 2000.

[2] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: Queries with bounded errors and bounded response times on very large data. In EuroSys, 2013.

[3] B. Babcock, S. Chaudhuri, and G. Das. Dynamic sample selection for approximate query processing. In SIGMOD, 2003.

[4] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 1(2), 2008.

[5] S. Chaudhuri, G. Das, M. Datar, R. Motwani, and V. R. Narasayya. Overcoming limitations of sampling for aggregation queries. In ICDE, 2001.

[6] S. Chaudhuri, G. Das, and V. R. Narasayya. Optimized stratified sampling for approximate query processing. ACM Trans. Database Syst., 32(2), 2007.

[7] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, J. Gerth, J. Talbot, K. Elmeleegy, and R. Sears. Online aggregation and continuous query support in MapReduce. In SIGMOD, 2010.

[8] V. Ganti, M. L. Lee, and R. Ramakrishnan. ICICLES: Self-tuning samples for approximate query answering. In VLDB, pages 176–187, 2000.

[9] P. J. Haas. Large-sample and deterministic confidence intervals for online aggregation. In SSDBM, pages 51–63, 1996.

[10] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, 1997.

[11] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 1963.

[12] N. Laptev, K. Zeng, and C. Zaniolo. Early accurate results for advanced analytics on MapReduce. PVLDB, 5(10), 2012.

[13] S. L. Lohr. Sampling: Design and Analysis. Thomson Brooks/Cole, 2010.

[14] N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large MapReduce jobs. PVLDB, 4(11), 2011.

[15] R. J. Serfling. Probability inequalities for the sum in sampling without replacement. Institute of Mathematical Statistics, 38, 1973.

[16] P. Rosch and W. Lehner. Sample synopses for approximate answering of group-by queries. In EDBT, 2009.

[17] L. Sidirourgos, M. Kersten, and P. Boncz. SciBORQ: Scientific data management with bounds on runtime and quality. In CIDR, pages 296–301, 2011.

[18] J. S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–57, Mar. 1985.

[19] Y. Wang, J. Luo, A. Song, J. Jin, and F. Dong. Improving online aggregation performance for skewed data distribution. In DASFAA, pages 18–32, 2012.

[20] S. Wu, S. Jiang, B. C. Ooi, and K.-L. Tan. Distributed online aggregation. PVLDB, 2(1):443–454, 2009.

[21] S. Wu, B. C. Ooi, and K.-L. Tan. Continuous sampling for online aggregation over multiple queries. In SIGMOD, 2010.

[22] J. Zhou, N. Bruno, M.-C. Wu, P.-A. Larson, R. Chaiken, and D. Shakib. SCOPE: Parallel databases meet MapReduce. VLDB J., 21(5), 2012.

APPENDIX
A. SAMPLE MAINTENANCE ALGORITHM

The pseudocode of the sliding window sample maintenance algorithm is shown in Algorithm 2. First, it computes a set of buckets {B_1, ..., B_q}, q ≤ k, that need to be updated, together with their updated error bounds {ε'_1, ..., ε'_q} (Line 2). It also computes the difference M_i = N_i^a − n_i^d for each updated bucket (Line 3). This information will be used for adjusting insertions in the next stage. The algorithm then checks Equation 13. If it is not satisfied, the algorithm moves to the next stage to compute n_i^a. In the second stage, the algorithm aims to find one or more buckets in B' with M_i ≥ 0 to help fill the gap of the error bound difference ε_0 − ε' (Lines 10 to 20). From Equation 13, the buckets with larger populations can contribute more to the global error bound. It means that, given the same number of insertions, inserting them into buckets with high populations is more effective in reducing the global error bound. To implement this heuristic, the algorithm reorders the buckets with M_i ≥ 0 by their populations, and tries to insert all incoming data into the top-ranked buckets. If the gap ε_0 − ε' cannot be filled after all incoming data has been exhausted, the sketch is under-sampled due to deletions, and must be reconstructed from scratch.

Algorithm 2: Maintenance algorithm for sliding-windowed queries

Input: A list of buckets B: {B_1, ..., B_k}
Output: Buckets B with updated error bounds.
1:  for i = 1 to k do
2:      ε_i = ε'_i  {update the new error bound}
3:      M_i = N_i^a − n_i^d
4:  end for
5:  ε' ← the global error bound by Equation 13
6:  if ε' ≤ ε_0 then
7:      return B
8:  end if
9:  Reorder the buckets whose M_i ≥ 0 according to their populations {q buckets}
10: for i = 1 to q do
11:     ε'_i ← N ε_0 − Σ_{j=1, j≠i}^{k} N_j ε_j  {compute ε'_i by Equation 6}
12:     n'_i ← sample size with ε'_i by the Hoeffding bound (Equation 2)
13:     if n'_i ≤ n_i + N_i^a then
14:         update the error bound in B_i to ε'_i
15:         return B
16:     else
17:         n'_i = n_i + N_i^a
18:         ε_i = ε'_i  {update the error bound}
19:     end if
20: end for
21: return NULL  {needs to be reconstructed from scratch}
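To make the two stages concrete, the following Python sketch mimics the maintenance loop. It rests on several stated assumptions rather than the authors' exact definitions: Equation 2 is taken to be the standard Hoeffding sample size n = range² · ln(2/δ) / (2ε²); Equation 13's global bound is taken to be the population-weighted average ε = Σ N_i ε_i / N (and the per-bucket target on Line 11 is obtained by solving that form, hence the division by N_i); and each bucket is a dict with hypothetical fields N (population), n (sample size), eps (error bound), Na (insertions in the window step), and nd (deleted samples).

```python
import math

DELTA = 0.05  # assumed confidence parameter

def hoeffding_sample_size(eps, value_range=1.0, delta=DELTA):
    # Assumed form of Equation 2: n >= range^2 * ln(2/delta) / (2 * eps^2)
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))

def hoeffding_eps(n, value_range=1.0, delta=DELTA):
    # Error bound achievable with n samples (inverse of the above)
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def maintain(buckets, eps0):
    """Sketch of Algorithm 2. Returns the updated buckets, or None when
    the sketch is under-sampled and must be rebuilt from scratch."""
    N = sum(b["N"] for b in buckets)
    # Stage 1 (Lines 1-4): refresh each bucket's bound after deletions
    for b in buckets:
        b["n"] = max(b["n"] - b["nd"], 1)
        b["eps"] = hoeffding_eps(b["n"])
        b["M"] = b["Na"] - b["nd"]
    # Assumed Equation 13: global bound = population-weighted average
    if sum(b["N"] * b["eps"] for b in buckets) / N <= eps0:
        return buckets
    # Stage 2 (Lines 9-20): spend insertions on high-population buckets first
    for b in sorted((x for x in buckets if x["M"] >= 0),
                    key=lambda x: x["N"], reverse=True):
        rest = sum(x["N"] * x["eps"] for x in buckets if x is not b)
        eps_i = (N * eps0 - rest) / b["N"]  # bound b would need (from Eq. 13)
        if eps_i <= 0:
            continue  # this bucket alone cannot close the gap
        need = hoeffding_sample_size(eps_i)
        if need <= b["n"] + b["Na"]:  # enough arrivals to reach eps_i
            b["n"], b["eps"] = need, eps_i
            return buckets
        b["n"] += b["Na"]  # absorb all arrivals, tighten only partially
        b["eps"] = hoeffding_eps(b["n"])
    return None  # the gap remains: reconstruct the sketch
```

The sort by population implements the paper's heuristic that, for the same number of insertions, high-population buckets pull the weighted global bound down fastest.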