    Uncertainty in Aggregate Estimates from Sampled Distributed Traces

Nate Coehlo, Arif Merchant, Murray Stokely
{natec, aamerchant, mstokely}@google.com

    Google, Inc.

    Abstract

Tracing mechanisms in distributed systems give important insight into system properties and are usually sampled to control overhead. At Google, Dapper [8] is the always-on system for distributed tracing and performance analysis, and it samples fractions of all RPC traffic. Due to difficult implementation, excessive data volume, or a lack of perfect foresight, there are times when system quantities of interest have not been measured directly, and Dapper samples can be aggregated to estimate those quantities in the short or long term. Here we find unbiased variance estimates of linear statistics over RPCs, taking into account all layers of sampling that occur in Dapper, and allowing us to quantify the sampling uncertainty in the aggregate estimates. We apply this methodology to the problem of assigning jobs and data to Google datacenters, using estimates of the resulting cross-datacenter traffic as an optimization criterion, and also to the detection of change points in access patterns to certain data partitions.

    1 Introduction

Estimates of aggregated system metrics are needed in large distributed computing environments for many uses: predicting the effects of configuration changes, capacity planning, performance debugging, change point detection, and simulations of proposed policy changes. For example, in a multi-datacenter environment, it may be desirable to periodically repack users and application data into different datacenters. To evaluate the effects of different rearrangements of data, it is necessary to estimate several aggregates, such as the cross-datacenter traffic created and the aggregate CPU and networking demands of the applications placed in each datacenter. Most organizations deploy telemetry systems [4, 5] to record system metrics of interest but, quite often, we find that some of the metrics required for a particular evaluation were not recorded. Sometimes this occurs because the number of possible metrics is too large for the problem of interest; other times it can be due to difficult implementation and a lack of perfect foresight.

At Google, we additionally deploy a ubiquitous tracing infrastructure called Dapper [8] that is capable of following distributed control paths. Most of Google's inter-process communication is based on RPCs, and Dapper samples a small fraction of those RPCs to limit overhead. These traces often include detailed annotations created by the application developers. While the primary purpose of this infrastructure is to debug problems in the distributed system, it can also be used for other purposes like monitoring the network usage of services and the resource consumption of users in our shared storage systems.

From any system with sampling, such as Dapper, Fay [7], or the Dapper-inspired Zipkin [1], it is straightforward to get an estimate for aggregate quantities. However, assessing the uncertainty of an estimate is more involved, and the contribution of this work is finding covariance estimates for aggregate metrics from Dapper, based on the properties of Dapper's sampling mechanisms. Given these covariance estimates, one can attach confidence intervals to the aggregate metrics and perform the associated hypothesis tests.

In Section 2 we summarize the relevant statistical properties of sampling in Dapper. Section 3 gives the estimation framework and algorithm for covariance estimation, while the proof appears in Appendix A. Section 4 gives two case studies of statistical analysis of distributed systems using this framework, and Section 5 has concluding remarks.

    2 Background on RPC Sampling

Estimating an aggregate quantity from a sampled system is simple; for each measured RPC you have a result and a sampling probability, so summing those results weighted by the inverse of their sampling probability gives an unbiased estimate of the quantity of interest. Calculating the uncertainty (variance) of such an estimate, however, requires knowledge of the joint sampling probability of any two RPCs, so a more detailed understanding of the sampling mechanism is necessary.

RPCs in distributed systems can be grouped in terms of an initiator, and we refer to this grouping as a trace. Each trace at Google is given an identifier (ID), which is an unsigned 64-bit integer, and all RPCs within a trace share that one ID. The ID is selected randomly over the possible values, so collecting RPCs whose ID falls below a threshold yields a uniform random sample of traces.

Figure 1: An example trace. If s > 1/3 then none are returned, and if 1/6 < s ≤ 1/3 then only D is returned. If s ≤ 1/6 and d ≤ 1/2 then all RPCs are returned. If s ≤ 1/6 and d > 1/2 then all except F are returned.

Letting $x_i$ be the quantity of interest for RPC $i$, $s_i$ and $d_i$ its server sampling and downsampling probabilities, and $\mathbf{1}_i$ indicate whether RPC $i$ was included in the sample $S$, we get an unbiased estimate of $\theta = \sum_i x_i$ from

$$\hat{\theta} = \sum_{i \in S} \frac{x_i}{s_i d_i} = \sum_i \frac{x_i \mathbf{1}_i}{s_i d_i}$$

The algorithm GetSigmaHat below produces $\hat{\Sigma}$, an unbiased estimate of $\Sigma = \mathrm{Cov}(\hat{\theta})$.
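As a concrete sketch, the inverse-probability-weighted estimate is a one-liner; the `(x, s, d)` record layout here is illustrative, not Dapper's actual schema:

```python
# Inverse-probability-weighted estimate of theta = sum_i x_i from a sample
# of RPCs. Each sampled RPC carries its value x, its server sampling
# probability s, and its downsampling probability d.

def estimate_total(sampled_rpcs):
    """Unbiased estimate of sum(x) over ALL RPCs, from the sampled ones."""
    return sum(x / (s * d) for (x, s, d) in sampled_rpcs)

# Example: three sampled RPCs, each of which was kept with probability s*d.
sample = [(10.0, 0.5, 1.0), (4.0, 0.25, 0.5), (1.0, 0.1, 1.0)]
theta_hat = estimate_total(sample)  # 10/0.5 + 4/0.125 + 1/0.1 = 62.0
```

Each term up-weights a surviving RPC by the reciprocal of its inclusion probability, which is what makes the estimator unbiased.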

In this paper we will use the normal approximation for inference, $\hat{\theta} \sim N(\theta, \Sigma)$, which is a generalization of the normal approximation to the binomial distribution. However, we believe that considering the variance adds substantial value and protection against false positives to any analysis involving these estimates, even in the case of small sample sizes with highly variable $x_i$, $s_i$, $d_i$ where the normal approximation is not ideal.

The remainder of this section provides an outline for proving $\hat{\Sigma}$ is unbiased, gives a simple algorithm, and finds its complexity. The next section applies these results to the statistical analysis of real distributed systems.

    3.1 Calculation Overview

To find an unbiased estimate of the population covariance matrix $\Sigma$, we first find $\Sigma$, then appropriately weight sample quantities and show the result is unbiased. A detailed calculation is in the Appendix, where it is divided into these three ideas:


1. Within a trace, the boolean random variables $\mathbf{1}_i$ and $\mathbf{1}_j$ must be the same if $s_i = s_j$ and $d_i = d_j$, so we can aggregate the values of $x$ corresponding to the same $(s, d)$ tuple to $y$ in our representation of $\hat{\theta}$ and re-parameterize the problem in terms of the distinct values of $(ID, s, d)$ and the sums $y$. This simplifies proof notation and improves the algorithm performance, as discussed in Section 3.2.

2. Letting $y[j]_i$ denote component $j$ of $y$ for the $(ID, s, d)$ tuple $i$, we have

$$\Sigma_{(j,k)} = \mathrm{Cov}(\hat{\theta}_j, \hat{\theta}_k) = \sum_i \sum_{i'} \frac{y[j]_i \, y[k]_{i'}}{s_i s_{i'} d_i d_{i'}} \, \mathrm{Cov}(\mathbf{1}_i, \mathbf{1}_{i'})$$

so the problem reduces to finding the covariance between sampling any two tuples $(ID_i, s_i, d_i)$ and $(ID_{i'}, s_{i'}, d_{i'})$.

As described in Section 2, the trace ID is mapped to two independent uniform random variables on $(0, 1)$, which we denote by $(U_i, V_i)$ and assume are independent across $i$. Therefore, if $ID_i \neq ID_{i'}$ then the indicators are independent and the covariance is zero. If $ID_i = ID_{i'}$ then we must have¹

$$\mathrm{Cov}(\mathbf{1}_i, \mathbf{1}_{i'}) = E(\mathbf{1}_i \mathbf{1}_{i'}) - E(\mathbf{1}_i)E(\mathbf{1}_{i'}) = (s_i \wedge s_{i'})(d_i \wedge d_{i'}) - s_i s_{i'} d_i d_{i'}$$

3. Finally, since the resulting population covariance matrix depends on cross terms within the same trace, weighting sampled cross-terms by their probability of inclusion, $(s_i \wedge s_{i'})(d_i \wedge d_{i'})$, will give an unbiased estimate.
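The covariance formula in step 2 can be sanity-checked by simulation, using the $(U, V)$ construction above: a tuple $(s, d)$ within a trace is included exactly when $U \le s$ and $V \le d$. A minimal Monte Carlo check (an illustration, not Dapper code):

```python
import random

def simulate_cov(s1, d1, s2, d2, n=200_000, seed=7):
    """Monte Carlo estimate of Cov(1_i, 1_i') for two tuples in one trace."""
    rng = random.Random(seed)
    sum1 = sum2 = sum12 = 0
    for _ in range(n):
        u, v = rng.random(), rng.random()   # shared (U, V) for the trace ID
        i1 = (u <= s1) and (v <= d1)        # is tuple (s1, d1) sampled?
        i2 = (u <= s2) and (v <= d2)        # is tuple (s2, d2) sampled?
        sum1 += i1; sum2 += i2; sum12 += (i1 and i2)
    return sum12 / n - (sum1 / n) * (sum2 / n)

def cov_formula(s1, d1, s2, d2):
    # Cov = (s1 ^ s2)(d1 ^ d2) - s1*s2*d1*d2, with ^ denoting min.
    return min(s1, s2) * min(d1, d2) - s1 * s2 * d1 * d2

# The simulated and closed-form covariances agree up to Monte Carlo error,
# e.g. for tuples (1/6, 1/2) and (1/3, 1), where the formula gives 1/18.
```

The shared $(U, V)$ pair is what couples tuples within a trace; across traces the pairs are independent, so the covariance is zero, as claimed.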

3.2 Covariance Estimation Algorithm and Complexity

Algorithm GetSigmaHat returns $\hat{\Sigma}$, which is the sum of the contributions over each trace ID²:

Algorithm 1 GetSigmaHat
  M ← a J × J matrix of zeros
  for all ID ∈ S do
    M += ProcessSingleTrace(ID)
  end for
  return M

    While there may be a large number of RPCs within atrace, the number of distinct (s, d ) tuples within a trace

¹We use the notation min(a, b) = a ∧ b and max(a, b) = a ∨ b. ²We denote the outer product between two vectors as x ⊗ y.

Algorithm 2 ProcessSingleTrace
  Given a collection of (s_i, d_i, x_i) corresponding to a given ID, aggregate the data over the unique tuples of (s, d) to get (s_k, d_k, y_k), where y_k = Σ_{j | (s_j, d_j) = (s_k, d_k)} x_j, and let K_t be the number of distinct tuples resulting from this aggregation.
  M ← a J × J matrix of zeros
  for all k ∈ 1 : K_t do
    for all k′ ∈ 1 : K_t do
      w = (1 − (s_k ∨ s_{k′})(d_k ∨ d_{k′})) / (s_k s_{k′} d_k d_{k′})
      M += w (y_k ⊗ y_{k′})
    end for
  end for
  return M

is small; across all traces we collected there are fewer than 20 distinct combinations. Given $N_t$ RPCs within a trace and $M_t$ distinct combinations, aggregating in the first step of ProcessSingleTrace before running the loop scales as $N_t \log(M_t) + M_t^2 J^2$ rather than $N_t^2 J^2$. Letting $M = \max M_t$ and $T$ be the number of traces, we sum over traces for the bound

$$\sum_t N_t \log(M_t) + M_t^2 J^2 \le \Big(\sum_t N_t\Big) \log(M) + T M^2 J^2 = N \log(M) + T M^2 J^2$$

Since $M$ is bounded by a small number in practice, we have linear scaling in the number of RPCs and traces. In addition, one could split GetSigmaHat over several machines, sharding by trace ID, with each returning its component of the $J \times J$ covariance estimate.
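The two algorithms translate directly into code. The sketch below uses plain Python lists and an assumed `(s, d, x)` record layout; it illustrates the pseudocode above, not the production implementation:

```python
from collections import defaultdict

def process_single_trace(rpcs, J):
    """Algorithm 2: rpcs is a list of (s, d, x) with x a length-J vector."""
    # Aggregate x over distinct (s, d) tuples to get the sums y_k.
    agg = defaultdict(lambda: [0.0] * J)
    for s, d, x in rpcs:
        y = agg[(s, d)]
        for j in range(J):
            y[j] += x[j]
    # Accumulate the weighted outer products over all tuple pairs.
    M = [[0.0] * J for _ in range(J)]
    tuples = list(agg.items())
    for (s1, d1), y1 in tuples:
        for (s2, d2), y2 in tuples:
            w = (1.0 - max(s1, s2) * max(d1, d2)) / (s1 * s2 * d1 * d2)
            for j in range(J):
                for k in range(J):
                    M[j][k] += w * y1[j] * y2[k]
    return M

def get_sigma_hat(traces, J):
    """Algorithm 1: traces maps trace ID -> list of sampled (s, d, x)."""
    sigma = [[0.0] * J for _ in range(J)]
    for rpcs in traces.values():
        M = process_single_trace(rpcs, J)
        for j in range(J):
            for k in range(J):
                sigma[j][k] += M[j][k]
    return sigma
```

In the scalar case with a single $(s, d)$ tuple per trace, the inner loop reduces to $w\,y^2 = (1 - p)\,y^2 / p^2$ with $p = s d$, which matches the simplification at the end of Appendix A.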

    4 Case Studies

    4.1 Bin Packing and Cross-DatacenterReads

Large scale, distributed computing environments may comprise tens of data centers, tens of thousands of users, and thousands of applications. The configuration of storage in such environments changes rapidly, as hardware becomes obsolete, user requirements change, and applications grow, placing new demands on the hardware. When new storage capacity is added, for example, by adding or replacing disks in existing data centers, or by adding new data centers, we must decide how to rearrange the application services, data, and users to take best advantage of the new hardware.


An optimizer that bin-packs the application data and the users into the data centers will use data from many sources, may forecast growth rates of some parameters, and will try to satisfy various constraints. One component of such an optimization is to control the number of cross-datacenter reads that result from the packing, and simulating this would require a full record of all user/application pairs. However, maintaining a complete record of the traffic for each user/application pair is prohibitively expensive, since there are millions of such pairs, so we can instead use the Dapper sampled traces of RPCs to estimate the cross-datacenter traffic for each scenario.

To illustrate the usefulness of our procedure for comparing policies over historical samples, we consider bin-packing user data in three nearby data centers. There is considerable work on the subject of file and storage allocation in the literature [6, 2, 3]. We do not claim to present optimal or useful bin-packing strategies here, but we do claim to be able to evaluate the comparative advantage in terms of cross-datacenter reads.

Data in a storage system is written by some user, which we call the owner, and is later read by that user, or potentially by many other users. We decide to pack data so each owner is only in one data center, according to two strategies.

Strategy 1: basic
From a snapshot of the three cells, we figure out the total storage and the percentage that goes to each user by adding up their contributions over the three cells. Then we partition the owners by alphabetical order so each datacenter gets 1/3 of the total data.

Strategy 2: crossterm
This simple policy tries to put most cross-user traffic in the first cell. We define the adjacency between two users as the estimated number of cross-user reads divided by their combined storage capacity. We then allocate pairs of owners with the highest adjacency to the first cell until it reaches 1/3 of the total data, then move on to the next cell.
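A minimal sketch of the crossterm heuristic, assuming a precomputed adjacency map; the function name, record layout, and tie-breaking details are hypothetical:

```python
def pack_crossterm(storage, adjacency, n_cells=3):
    """Greedily co-locate high-adjacency owner pairs.

    storage:   owner -> bytes stored
    adjacency: (owner_a, owner_b) -> estimated cross reads / combined bytes
    Returns owner -> cell index; each cell holds ~1/n_cells of total data.
    """
    budget = sum(storage.values()) / n_cells
    placement, used = {}, [0.0] * n_cells
    cell = 0
    # Place the highest-adjacency pairs first, filling cell 0, then cell 1, ...
    for (a, b), _ in sorted(adjacency.items(), key=lambda kv: -kv[1]):
        for owner in (a, b):
            if owner in placement:
                continue
            while cell < n_cells - 1 and used[cell] + storage[owner] > budget:
                cell += 1  # current cell is full; move on to the next
            placement[owner] = cell
            used[cell] += storage[owner]
    # Owners with no adjacency entries go wherever space remains.
    for owner, size in storage.items():
        if owner not in placement:
            c = min(range(n_cells), key=lambda i: used[i])
            placement[owner] = c
            used[c] += size
    return placement
```

The point of the sketch is only the ordering criterion: pairs that generate the most cross-user reads per byte are packed into the same cell first, so their traffic never crosses datacenters.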

    4.1.1 Results

We compare the cross-datacenter traffic for the two strategies above applied in simulation to three datacenters that each store several tens of petabytes of data belonging to over 1000 users. We use data collected from May 6, 2012 through May 12 to train the policy crossterm; then we evaluate the performance over the period from May 25th through June 5th by assuming that a read by user A is initiated in the datacenter that stores data for user A. Our collection period from Dapper has millions of RPCs with a range of sampling probabilities extending down to 5 × 10⁻⁷.

To test whether there is a significant difference between the resulting cross-datacenter reads, we look at the difference between the two estimates and form 95% confidence intervals around that difference, noting that when the interval does not cross zero it corresponds to rejecting the null hypothesis that there is no difference between the strategies³.

For each day, and for each policy, we get an estimate of the resulting cross-datacenter traffic. To decrease our vulnerability to setting policy based on sampling aberrations, we test against the null hypothesis that they both produce the same number of cross-datacenter reads. In Figure 2 we show these intervals, where the y-axis has been scaled by the average for basic over the entire testing period; crossterm is significantly better than basic on every day, and we estimate it does over 20% better.
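With $\hat{\theta} = (\hat{\theta}_1, \hat{\theta}_2)$ and its 2 × 2 covariance estimate $\hat{\Sigma}$ from Section 3, the daily interval is a few lines. This is a sketch of the test, not the analysis code used for Figure 2:

```python
import math

def difference_ci(theta, sigma, z=1.96):
    """95% CI for theta[0] - theta[1], given the 2x2 covariance estimate."""
    diff = theta[0] - theta[1]
    var = sigma[0][0] + sigma[1][1] - 2 * sigma[0][1]  # Var(theta1 - theta2)
    half = z * math.sqrt(var)
    return diff - half, diff + half

# Reject "no difference" at the 5% level when the interval excludes zero.
# Illustrative numbers: theta = (120, 100), Var = 16 + 9 - 2*4 = 17.
lo, hi = difference_ci([120.0, 100.0], [[16.0, 4.0], [4.0, 9.0]])
```

The cross term $-2\hat{\Sigma}_{1,2}$ matters: both strategies are evaluated on the same sampled RPCs, so their estimates are positively correlated and the difference is more precise than the marginals alone suggest.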

Figure 2: Estimated difference in the number of daily cross-datacenter reads, plus or minus two standard errors. The y-axis is normalized by the average number of cross-datacenter reads for strategy basic.

    4.2 Change Point Detection

It is often useful for a monitoring tool to detect sudden changes in system behavior, or a spike in resource usage, so that corrective action can be taken, whether by adding resources or by tracking down what caused the sudden change. In this case, we wanted to monitor the number of disk seeks to data belonging to a certain service, and to detect if the number of cache misses increased suddenly due to a workload change. The system logging available did not break out miss rates per service at the granularity we needed, but we could estimate the

³Here, the function $f: x \mapsto$ a 2-vector of booleans indicating whether that RPC would have caused each strategy to result in a cross-datacenter read. The advantage for the crossterm strategy is estimated as $\hat{\theta}_1 - \hat{\theta}_2$, and the corresponding variance estimate is $\hat{\Sigma}_{1,1} + \hat{\Sigma}_{2,2} - 2\hat{\Sigma}_{1,2}$.


miss rates based on Dapper traces. However, we only want to detect real changes, and hoped to have few false positives induced by sampling uncertainty.

One alternative to alerting based on relative differences is to alert only if the difference is significantly different from zero. The problem with this approach is that some services have higher sampling rates, and given a high sampling rate, small true differences will be flagged as significant. Since we expect there to be some true variation from day to day, we instead flag if we reject the null that the number of seeks increased by less than 10%.

In particular, let $\theta_t$ be the number of seeks on day $t$, and $\hat{\theta}_t$ be our estimated number⁴ of seeks for day $t$, and

$$H_0: \frac{\theta_t}{\theta_{t-1}} \le 1.1$$

$$z = \frac{\hat{\theta}_t - 1.1\,\hat{\theta}_{t-1}}{\left(\hat{\sigma}_t^2 + 1.1^2\,\hat{\sigma}_{t-1}^2\right)^{1/2}}$$

We test against the one-sided null $H_0$ by rejecting when $z > 1.64$, and since this test has level 0.05 when $\theta_t / \theta_{t-1} = 1.1$⁵, it is even more conservative when $\theta_t / \theta_{t-1} < 1.1$.
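The test can be sketched as follows, taking the day-level estimates and their variances (from the Section 3 machinery) as inputs; the function name and the thresholds-as-parameters are our own:

```python
import math

def flags_increase(theta_t, var_t, theta_prev, var_prev,
                   ratio=1.1, z_crit=1.64):
    """Reject H0: theta_t <= ratio * theta_{t-1} at level ~0.05.

    theta_*: estimated daily seek counts; var_*: their variance estimates.
    Returns (flagged, z).
    """
    z = (theta_t - ratio * theta_prev) / math.sqrt(var_t + ratio**2 * var_prev)
    return z > z_crit, z

# A large jump with modest uncertainty is flagged ...
flagged, z = flags_increase(1000.0, 900.0, 10.0, 4.0)
# ... while a small day-over-day increase is not.
flagged2, _ = flags_increase(11.0, 9.0, 10.0, 4.0)
```

The denominator combines both days' sampling variances because the two estimates are independent (traces spanning midnight are ignored, per footnote 5), so the variance of $\hat{\theta}_t - 1.1\,\hat{\theta}_{t-1}$ is the sum shown.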

    4.2.1 Results

Figure 3: Cache misses to a particular partition of data in blue, where the value on the first day is normalized to 1. Red shows the z-scores corresponding to the test that the seeks increased less than 10%, and extending above the red line corresponds to rejecting that hypothesis.

In Figures 3 and 4, we display normalized accesses in blue, and the corresponding z-score for a change from the previous day in red. The horizontal red line corresponds

⁴For a given data partition $D$ and day $t$, $f: x \mapsto$ a binary indicating whether that RPC was a disk seek.

⁵Here we get independence by ignoring the rare traces that span across midnight.

Figure 4: Same as Figure 3, but for a different data partition.

to a level 0.05 normal test, and we define our detection algorithm as extending above that line.

In Figure 3, we see that the cache misses started increasing on June 30th, then ended up 100 times higher on July 1st through 3rd than they were in late June. Our z-score detection algorithm flags this change in behavior on June 30th and July 1st, and this change ended up being a persistent change in behavior of order 100.

A spike much higher than 100 occurred on June 3rd for the data partition in Figure 4, but this estimate had such high uncertainty that the z-score was moderate, and the change was not flagged. Unlike the case in Figure 3, the higher level did not persist. It is possible that data partitions could see real usage spikes on one day that later disappear, and it may be useful to know about those, but after studying the variance, we find that this spike does not present strong evidence of being more than a sampling variation.

    5 Conclusion

Many interesting system quantities can be represented as a sum of a vector-valued function over RPCs, and we present a method to obtain estimates of these quantities and their uncertainty from Dapper. At Google, these sampled distributed traces are ubiquitous, and are often the only data source available for some system questions that arise. Although arbitrary traces may have complex sampling structures, we find an unbiased covariance estimate that works in all cases and can be easily computed. We demonstrate how this methodology can be used to evaluate the effectiveness of different bin-packing strategies on Google data centers when evaluating over an extended period, and also for detecting change points in quantities that are not directly logged.


    A Appendix

Part 1: Notation and Aggregate Representation

We represent all RPCs in our time window of interest as a double subscript $(i, j)$, which represents the $j$th RPC corresponding to trace ID $i$. This allows us to write

$$\theta = \sum_{i=1}^{N} \sum_{j=1}^{J_i} x_{(i,j)}$$

Letting $S$ be the sample returned by Dapper, and $s_{(i,j)}$ and $d_{(i,j)}$ the server sampling and downsampling probabilities for $x_{(i,j)}$, our estimate can be re-written as

$$\hat{\theta} = \sum_{(i,j) \in S} \frac{x_{(i,j)}}{s_{(i,j)} d_{(i,j)}}$$

It is useful to re-parametrize the indices $(i, j)$ in terms of the distinct server sampling and downsampling factors:

$$0 < s_1 < s_2 < \dots < s_M \le 1$$

$$0 < d_1 < d_2 < \dots < d_L \le 1$$

We then define the (possibly empty) index sets as

$$\Omega_{(i,m,l)} = \{(i', j') \mid i' = i,\; s_{(i',j')} = s_m,\; d_{(i',j')} = d_l\}$$

and the (possibly zero) trace-level sums by

$$y_{(i,m,l)} = \sum_{(i,j) \in \Omega_{(i,m,l)}} x_{(i,j)}$$

We define weighted boolean variables

$$W_{(i,m,l)} = \frac{\mathbf{1}_{(i,m,l) \in S}}{s_m d_l}$$

so that

$$T = \sum_{i=1}^{N} \sum_{m=1}^{M} \sum_{l=1}^{L} y_{(i,m,l)} W_{(i,m,l)}$$

Part 2: The Population Covariance Matrix

Before expanding $\mathrm{Cov}(T)$, we note that

$$E(W_{(i,m,l)}) = 1$$

$$\mathrm{Cov}(W_{(i,m,l)}, W_{(i',m',l')}) = 0 \quad \text{if } i \neq i'$$

and for $i = i'$ we define $\gamma_{(m,l,m',l')}$ by

$$\mathrm{Cov}(W_{(i,m,l)}, W_{(i,m',l')}) = \frac{P\big((i,m,l), (i,m',l') \in S\big)}{s_m s_{m'} d_l d_{l'}} - 1 = \frac{(s_m \wedge s_{m'})(d_l \wedge d_{l'})}{s_m s_{m'} d_l d_{l'}} - 1 = \frac{1 - (s_m \vee s_{m'})(d_l \vee d_{l'})}{(s_m \vee s_{m'})(d_l \vee d_{l'})} \equiv \gamma_{(m,l,m',l')}$$

so

$$\mathrm{Cov}\big(y_{(i,m,l)} W_{(i,m,l)},\; y_{(i,m',l')} W_{(i,m',l')}\big) = \big(y_{(i,m,l)} \otimes y_{(i,m',l')}\big)\, \gamma_{(m,l,m',l')}$$

where $\otimes$ is the outer product resulting in a $J \times J$ matrix.

Putting it together, we have

$$\Sigma = \mathrm{Cov}(T) = \sum_i \sum_{(m,m',l,l')} \big(y_{(i,m,l)} \otimes y_{(i,m',l')}\big)\, \gamma_{(m,l,m',l')}$$

Part 3: Unbiased Estimate of $\Sigma$

Since

$$E\left[\frac{\mathbf{1}_{(i,m,l) \in S}\, \mathbf{1}_{(i,m',l') \in S}}{(s_m \wedge s_{m'})(d_l \wedge d_{l'})}\right] = 1$$

it follows that an unbiased estimate for $y_{(i,m,l)} \otimes y_{(i,m',l')}$ is given by

$$\big(y_{(i,m,l)} \otimes y_{(i,m',l')}\big)\, \frac{\mathbf{1}_{(i,m,l) \in S}\, \mathbf{1}_{(i,m',l') \in S}}{(s_m \wedge s_{m'})(d_l \wedge d_{l'})}$$

so an unbiased estimate of $\Sigma$ is given by

$$\hat{\Sigma} = \sum_{i \in S} \sum_{(m,m',l,l') \in S} \frac{\gamma_{(m,l,m',l')}\; y_{(i,m,l)} \otimes y_{(i,m',l')}}{(s_m \wedge s_{m'})(d_l \wedge d_{l'})} = \sum_{(m,m',l,l')} \frac{1 - (s_m \vee s_{m'})(d_l \vee d_{l'})}{s_m s_{m'} d_l d_{l'}} \sum_{i \in S} y_{(i,m,l)} \otimes y_{(i,m',l')}$$

Equivalence to GetSigmaHat follows since ProcessSingleTrace produces the above result for a single trace. A simple case occurs if you are interested in a scalar and each trace shares the same server sampling and downsampling probability. Letting $p_i = s_i d_i$, the result simplifies to

$$\Sigma = \sigma^2 = \sum_i \frac{1 - p_i}{p_i}\, y_i^2$$

$$\hat{\Sigma} = \hat{\sigma}^2 = \sum_{i \in S} \frac{1 - p_i}{p_i^2}\, y_i^2$$
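The scalar case is easy to verify by simulation: sampling each trace independently with probability $p_i$, the average of $\hat{\sigma}^2$ over many replications matches the population $\sigma^2$. A sketch of that check (illustrative, not the paper's code):

```python
import random

def mc_check(p_y_pairs, n=200_000, seed=3):
    """Monte Carlo check that sigma-hat is unbiased for sigma^2
    in the scalar, one-tuple-per-trace case.

    p_y_pairs: list of (p_i, y_i), with p_i = s_i * d_i the inclusion
    probability of trace i and y_i its total contribution.
    """
    rng = random.Random(seed)
    # Population variance: sum_i (1 - p_i) / p_i * y_i^2.
    true_var = sum((1 - p) / p * y * y for p, y in p_y_pairs)
    est_sum = 0.0
    for _ in range(n):
        sampled = [(p, y) for p, y in p_y_pairs if rng.random() <= p]
        # Estimator: sum over SAMPLED traces of (1 - p_i) / p_i^2 * y_i^2.
        est_sum += sum((1 - p) / p**2 * y * y for p, y in sampled)
    return true_var, est_sum / n  # agree up to Monte Carlo error
```

For example, two traces with $(p, y) = (0.5, 2)$ and $(0.25, 1)$ give $\sigma^2 = 4 + 3 = 7$, and the replicated average of $\hat{\sigma}^2$ converges to the same value.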


    References

[1] Twitter Engineering. Distributed systems tracing with Zipkin. http://engineering.twitter.com/2012/06/distributed-systems-tracing-with-zipkin.html (retrieved 2012-07-20).

[2] Alvarez, G. A., Borowsky, E., Go, S., Romer, T. H., Becker-Szendy, R., Golding, R., Merchant, A., Spasojevic, M., Veitch, A., and Wilkes, J. Minerva: An automated resource provisioning tool for large-scale storage systems. ACM Trans. Comput. Syst. 19, 4 (Nov. 2001), 483-518.

[3] Anderson, E., Spence, S., Swaminathan, R., Kallahalla, M., and Wang, Q. Quickly finding near-optimal storage designs. ACM Trans. Comput. Syst. 23, 4 (Nov. 2005), 337-374.

[4] Barth, W. Nagios: System and Network Monitoring. No Starch Press, San Francisco, CA, USA, 2006.

[5] Bertolino, A., Calabro, A., Lonetti, F., and Sabetta, A. Glimpse: a generic and flexible monitoring infrastructure. In Proceedings of the 13th European Workshop on Dependable Computing (New York, NY, USA, 2011), EWDC '11, ACM, pp. 73-78.

[6] Dowdy, L. W., and Foster, D. V. Comparative models of the file assignment problem. ACM Comput. Surv. 14, 2 (June 1982), 287-313.

[7] Erlingsson, U., Peinado, M., Peter, S., and Budiu, M. Fay: extensible distributed tracing from kernels to clusters. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (New York, NY, USA, 2011), SOSP '11, ACM, pp. 311-326.

[8] Sigelman, B. H., Barroso, L. A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D., Jaspan, S., and Shanbhag, C. Dapper, a large-scale distributed systems tracing infrastructure. Tech. rep., Google, Inc., 2010.
