
A User-Friendly Self-Similarity Analysis Tool ∗

Thomas Karagiannis, Michalis Faloutsos, Mart Molle
{tkarag,michalis,mart}@cs.ucr.edu

Department of Computer Science & Engineering
University of California, Riverside

ABSTRACT
The concepts of self-similarity, fractals, and long-range dependence (LRD) have revolutionized network modeling during the last decade. However, despite all the attention these concepts have received, they remain difficult for non-experts to use. This difficulty can be attributed to the relative complexity of the mathematical basis, the absence of a systematic approach to their application, and the absence of publicly available software. In this paper, we introduce SELFIS, a comprehensive tool to facilitate the evaluation of LRD by practitioners. Our goal is to create a stand-alone public tool that can become a reference point for the community. Our tool integrates most of the functionality required for an in-depth LRD analysis, including several LRD estimators. In addition, SELFIS includes a powerful approach to stress-test the existence of LRD. Using our tool, we present evidence that the widely used LRD estimators can provide misleading results. It is worth mentioning that 25 researchers acquired SELFIS within a month of its release, which clearly demonstrates the need for such a tool.

∗This work was supported by the NSF CAREER grant ANIR 9985195, DARPA award NMS N660001-00-1-8936, NSF grant IIS-0208950, TCS Inc., and DIMI matching fund DIM00-10071.

1. INTRODUCTION
Self-similarity, fractals, and long-range dependence (LRD) have emerged as powerful tools for modeling the behavior of real processes and systems. These concepts have been applied to numerous disciplines [25], ranging from molecular biology and genetics [3] [17] to geology [26]. In this work, however, we focus on their application to the measured behavior of computer networks.

Following the seminal work of Leland et al. [16], long-range dependence has become a key concept in analyzing networking traffic data over the last decade. The community has observed an overwhelming manifestation of self-similarity in multiple network aspects, such as traffic load, packet arrival times, and queue sizes. As a result, most researchers expect to identify LRD in their analysis of measurements. Furthermore, realistic simulations require models that exhibit LRD.

Intuitively, the properties of LRD and self-similarity measure the importance of long-term memory in the evolution of a process over time. When applied to networking, we say that a time-dependent process (such as packet arrivals, queue lengths, etc.) is LRD if its current value is strongly correlated with its previous values far into the past. As a result, a sequence of measured values from an LRD process tends to generate similar (rather than independent) values. Thus, as is well known in the case of self-similar behavior, changing time scales through aggregation (i.e., taking the sum or average of a series of high-resolution measurements to produce a single low-resolution measurement) may have little impact on the apparent smoothness of an LRD process.
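The aggregation operation described above can be sketched in a few lines. This is our own illustrative fragment, not code from SELFIS; the `aggregate` helper name is an assumption.

```python
def aggregate(series, m):
    """Block-average a series: each output value is the mean of m inputs."""
    n = len(series) // m  # number of complete blocks; trailing items are dropped
    return [sum(series[k * m:(k + 1) * m]) / m for k in range(n)]

# Aggregating at level m = 2 halves the resolution of the series.
x = [1.0, 3.0, 2.0, 4.0, 5.0, 7.0]
print(aggregate(x, 2))  # [2.0, 3.0, 6.0]
```

For an LRD series, repeating this at increasing m leaves the series visibly bursty; for an uncorrelated series, the aggregated values smooth out quickly.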

Recognizing the presence of LRD is important for practitioners, because it can significantly change the behavior of the network. For example, if the


packet arrival process were LRD, then larger input buffers would be needed to meet a given packet loss-rate specification. Moreover, the LRD property is completely different from ordinary notions of the variance of a random process. High variance only means that individual samples from the process may deviate significantly from the global average value. Nevertheless, if those individual samples are mutually independent (or exhibit only short-range correlations), then the aggregated process quickly converges to a smooth function concentrated around the global average. Conversely, even a process that exhibits low variance could be hiding a significant LRD component, in which case it may continue to exhibit similar variance despite repeated aggregation.

Unfortunately, despite its widespread presence, the evaluation of LRD poses significant difficulties, especially for practitioners. We see several barriers that limit the use of these concepts in the networking community.

• Complexity: Many of the concepts are fairly hard to comprehend, both from an intuitive and from a mathematical perspective. There have been limited efforts to systematize and simplify these concepts, which results in confusion, partial understanding, and misinterpretation of terms.

• Confusion: There is no straightforward, step-by-step approach to quantifying LRD. A measure of the strength of LRD is the Hurst exponent (H), a scalar. However, the Hurst exponent can only be estimated, not calculated in a definitive way, and the several different estimators for it often produce conflicting results [14] [18].

• Lack of support: There is no single source of information or tools. As a result, compiling and digesting the LRD literature and developing tools from scratch is a non-trivial effort.

The overarching goal of this work is to demystify LRD and make it accessible to non-experts. Therefore, we have developed SELFIS, a SELF-similarity analysIS tool, as a first step towards supporting a community-wide reference implementation of the major algorithms used for self-similarity analysis. For this reason, SELFIS is: a) free, b) user-friendly, and c) extensible. The need for such a tool

has been demonstrated by the 200 researchers¹ who downloaded the tool. It is implemented in Java to avoid the need for costly commercial software. Furthermore, its modular open-source design allows for collaborative development that can integrate the community's expertise. More specifically, the design goals for SELFIS include:

• Ease of use and accessibility: Through a straightforward graphical user interface, SELFIS enables non-experts to use self-similarity analysis, making the tool valuable both for research and for educational purposes. The simplicity of the interface allows for effortless use of the capabilities of SELFIS, while the visualization of the output of the LRD test algorithms offers a quick sanity check and an educational aid. Simplicity, however, does not diminish the power of the analysis. Together with our analysis of LRD algorithms (e.g., section 4, [14] [13]), SELFIS is a powerful yet simple and intellectually accessible tool.

• Repeatability and Consistency: Results and observations from different research efforts can be replicated and validated. SELFIS offers a common reference platform for self-similarity analysis, so that researchers are not required to implement sophisticated algorithms from scratch.

• Evaluation: Observations can be made about the performance, capabilities, and limitations of various long-range dependence estimators. In addition, future versions of SELFIS will incorporate algorithms and heuristics to overcome these statistical limitations and allow for robust estimation.

In addition to implementing well-known long-range dependence estimation algorithms, SELFIS also incorporates an intuitive approach to verifying the existence of long-range dependence. We call this methodology randomized buckets. The method allows us to independently control the amount of correlation at different scales in a given dataset.

¹The researchers span multiple disciplines, such as computer science, electrical engineering, mathematics, and psychology. Apart from its academic appeal, SELFIS has attracted the interest of various research labs around the world, such as Telstra Research Laboratories, the Chinese Academy of Sciences, Ericsson Research, USC/ISI, and the Swiss Federal Institute of Technology.

In randomized buckets, the initial time-series is


permuted in a controlled fashion, and the statistical properties of the initial and the randomized series are compared. The approach extends and subsumes the methodology used in [7] in 1996.

We extend the initial method in two ways. First, we enable multiple levels of permutation, in order to separate user-defined "medium" correlations from long and short ones. Second, we enable ways of randomizing the time-series other than permutation: we allow sampling with replacement.

Finally, to demonstrate the value of SELFIS, we use it to evaluate various long-range dependence estimators on a variety of test cases. Our results show that the estimators have significant limitations and that interpreting their results is not trivial. First, several estimators appear to be sensitive to short-range dependence. Using the randomized-buckets methodology, we create series without long-range dependence; however, some of the estimators continue to report the same Hurst exponent values, indicating long-range dependence. Second, even on synthesized long-range dependent series, many of the estimators fail significantly to report the correct Hurst exponent value.

The rest of this paper is structured as follows. Section 2 gives a brief overview of self-similarity and long-range dependence and summarizes previous findings of self-similarity in network traffic. Section 3 presents SELFIS, our self-similarity analysis tool. Section 4 is a case study that presents the functionality of SELFIS: randomized buckets and a study of the reliability of long-range dependence estimators. Section 5 concludes our work.

2. DEFINITIONS - BACKGROUND
Self-similarity describes the phenomenon where certain properties are preserved irrespective of scaling in space or time. Of interest for network traffic processes is second-order self-similarity, which describes the property that the correlation structure of a time-series is preserved irrespective of time aggregation. This correlation is captured by the autocorrelation function (ACF), which measures the similarity between a series X_t and a shifted version of itself, X_{t+k}. Simply put, the autocorrelation function of a second-order self-similar time-series is the same across multiple aggregation levels. For a detailed description of self-similarity, see [20].

If the ACF decays hyperbolically to zero, then the process shows long-range dependence. Long-range dependence measures the memory of a process: intuitively, events distant in time remain correlated. On the contrary, short-range dependence is characterized by quickly decaying correlations (e.g., ARMA processes). The strength of the long-range dependence is quantified by the Hurst exponent (H). A series exhibits LRD when 1/2 < H < 1; furthermore, the closer H is to 1, the stronger the dependence of the process.

More rigorously, a stationary process X_t has long memory, or is long-range dependent [5], if there exists a real number α ∈ (0, 1) and a constant c_p > 0 such that

    lim_{k→∞} ρ(k) / (c_p k^{−α}) = 1,

where ρ(k) is the sample autocorrelation function (ACF):

    ρ(k) = E[(X_t − µ)(X_{t+k} − µ)] / σ²,

where µ and σ are the sample mean and standard deviation, respectively. The definition states that the autocorrelation function of a stationary long-range dependent process decays to zero at a rate of approximately k^{−α}, where H = 1 − α/2 is the Hurst exponent. For traffic-modeling purposes, stationarity implies that the structure of the time-series does not depend on time.
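The sample ACF in the definition above can be computed directly. The sketch below is our own illustration (not SELFIS code), using the sample mean and the biased sample variance:

```python
def sample_acf(x, k):
    """Sample autocorrelation rho(k) = E[(X_t - mu)(X_{t+k} - mu)] / sigma^2."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n  # sigma^2 (biased estimator)
    cov = sum((x[t] - mu) * (x[t + k] - mu) for t in range(n - k)) / n
    return cov / var
```

For a long-range dependent series, one expects sample_acf(x, k) to behave like c_p·k^{−α} for large lags k.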

Another way to characterize long-range dependence is to study the properties of the aggregated process X^{(m)}(k), which is defined as follows:

    X^{(m)}(k) = (1/m) Σ_{i=(k−1)m+1}^{km} X_i,   k = 1, 2, ..., [N/m].

To evaluate long-range dependence, the effect of aggregation on various second-order statistics is evaluated. For example, if there is no correlation in the time-series, then the variance of the aggregated series should decrease as 1/m [5]. A slower-decaying variance implies long-range dependence.
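As a sanity check of the 1/m decay, the following sketch (our own illustration, with assumed helper names) aggregates an i.i.d. Gaussian series and compares variances:

```python
import random

def aggregate(series, m):
    """Block-average a series at aggregation level m."""
    n = len(series) // m
    return [sum(series[k * m:(k + 1) * m]) / m for k in range(n)]

def variance(x):
    """Biased sample variance."""
    mu = sum(x) / len(x)
    return sum((v - mu) ** 2 for v in x) / len(x)

random.seed(0)
x = [random.gauss(0, 1) for _ in range(100_000)]  # uncorrelated: no long memory
ratio = variance(aggregate(x, 10)) / variance(x)
print(ratio)  # close to 1/10; an LRD series would decay more slowly
```

An LRD series with Hurst exponent H would instead show a variance ratio closer to m^{2H−2}, i.e., noticeably larger than 1/m.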

Research dealing with self-similarity and long-range dependence can be classified into two general categories. The first includes studies on the manifestation of such phenomena in networking, their origins, and their effects. The second category involves reports on modeling and estimating long-range dependence, as well as on the complexity and difficulty of its correct use and interpretation.


There has been ample evidence of long-range dependence and scaling phenomena in many different aspects of networking. The first experimental evidence of self-similar characteristics in local area network traffic was presented in [16]. The authors performed a rigorous statistical analysis of Ethernet traffic measurements and were able to establish its self-similar nature. Similar observations were presented for wide-area Internet traffic in [22] and for World Wide Web traffic in [6], where the underlying distributions of file sizes were shown to be the main cause of self-similarity. In [32], the authors discuss the failure of Poisson modeling in the Internet. Scaling phenomena [10] [29] and the factors that contribute to self-similarity and long-range dependence have been studied extensively [19] [33] [31] [9] [8]. Furthermore, [11] [24] [12] examine the relevance and the effects of self-similarity on various metrics of network performance.

The second major aspect of research dealing with self-similarity and long-range dependence is the estimation of the Hurst exponent. An overview of a large number of these estimation methodologies can be found in [27] [5] [28]. Relatively little effort has been devoted to studying the accuracy of the estimation methodologies [27] or to pointing out difficulties in long-range dependence estimation [18] [14]. These authors present pitfalls in estimating the intensity of long-range dependence in the presence of trends, non-stationarity, periodicity, and noise. Furthermore, the limitations of the variance-time estimator have been analyzed in [15].

Of major importance is also the development of models for simulating long-range dependence. Proposed models such as the one in [23], or generators for long-range dependent time-series [21], are hard to evaluate in practice. Thus, there are hardly any studies that assess the various models or compare the different generators. In general, this suggests the need for practical tools and a systematic methodology to estimate, validate, and generate long-range dependent time-series.

3. THE SELFIS TOOL
The SELFIS tool [1] (fig. 1) was developed to provide all the necessary functionality for a complete and systematic analysis. Our goal is to establish SELFIS as a reference point in self-similarity analysis. It is a Java-based, modular, extensible, freely distributed software tool that can automate time-series analysis. We chose to develop an independent platform instead of relying on commercial products. Our purpose was to give the community a ready-to-use tool, without any further obligation to purchase software.

The SELFIS tool is a collection of self-similarity and long-range dependence estimation methodologies and time-series processing algorithms. It currently incorporates all the widely used long-range dependence estimators. SELFIS also offers data-processing methodologies and transforms, such as wavelets, the Fourier transform, stationarity tests, and smoothing algorithms. In addition, SELFIS can synthesize long-range dependent time sequences, as it includes fractional Gaussian noise generators. The following subsections present in detail the different classes of functionality included in SELFIS: a) Hurst exponent estimators, b) randomized buckets, c) transforms, d) data processing, and e) fractional Gaussian noise generators.

3.1 Hurst Estimators
SELFIS includes most of the existing long-range dependence estimators. These estimators can be classified into two general categories. The first contains time-domain methods, such as the R/S plot and the Variance method. The second contains the frequency-based estimators, such as the periodogram, the Whittle, and the Abry-Veitch estimators.

The existence of numerous estimators is justified by the asymptotic nature of the Hurst exponent. Intuitively, since only the limiting behavior of the process can be estimated, statistical errors and uncertainty impede a reliable, definitive calculation of the Hurst exponent. Statistical limitations also arise when applying the mathematical definitions in practice; i.e., the estimators assume stationarity, which is an elusive concept. Furthermore, each estimator examines a different property of a given time-series. Thus, it is common for these methodologies to produce conflicting estimates for the same time-series. This is true not only for "real-life" time-series, where the existence of periodicities, noise, or trends has a substantial effect on the estimation [14], but also for synthesized LRD series with a specific, predetermined Hurst exponent value (see section 4.2 for the limitations of the estimators). Moreover, the estimation methodologies examine specific properties (e.g., variance, power spectrum) at different time scales. At the larger time scales, where the limiting behavior is described, the number of samples decreases significantly, resulting in statistical uncertainty. Applying all the estimators to a time-series provides a more complete overall picture of its possible self-similar nature.

Figure 1: Two screen dumps of the SELFIS tool.

Because of all the aforementioned limitations of the estimation methodologies, SELFIS also reports the statistical significance of each estimate. The correlation coefficient or the confidence intervals, where available, should always be reported together with the Hurst value. Reporting only the Hurst exponent value is rather meaningless if the value is not statistically significant. More specifically, our tool includes the following estimators:

A. Time-domain estimators: These estimation methodologies are based on investigating the power-law relationship between a specific statistic of the time-series and the aggregation block size m.

• Absolute Value method. The log-log plot of the aggregation level versus the absolute first moment of the aggregated series X^{(m)} is a straight line with slope H − 1, if the time-series is long-range dependent (where H is the Hurst exponent).

• Variance method. The method plots, in log-log scale, the sample variance versus the block size of each aggregation level. If the series is long-range dependent, then the plot is a line with slope β greater than −1. The estimate of H is given by H = 1 + β/2.

• R/S method. This method uses the rescaled range statistic (R/S statistic). The R/S statistic is the range of partial sums of deviations of a time-series from its mean, rescaled by its standard deviation. A log-log plot of the R/S statistic versus the number of points of the aggregated series should be a straight line whose slope is an estimate of the Hurst exponent.

• Variance of Residuals. The method uses least squares to fit a line to the partial sums within each block of size m. A log-log plot of the aggregation level versus the average variance of the residuals after the fitting at each level should be a straight line with slope H/2.
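To illustrate the time-domain flavor, here is a sketch of the Variance method (least-squares slope β of log var(X^{(m)}) versus log m, then H = 1 + β/2) together with the basic R/S statistic. This is our own minimal illustration, not the SELFIS implementation; function names and the choice of block sizes are assumptions.

```python
import math
import random

def aggregate(series, m):
    """Block-average a series at aggregation level m."""
    n = len(series) // m
    return [sum(series[k * m:(k + 1) * m]) / m for k in range(n)]

def variance(x):
    """Biased sample variance."""
    mu = sum(x) / len(x)
    return sum((v - mu) ** 2 for v in x) / len(x)

def hurst_variance_method(x, block_sizes=(1, 2, 4, 8, 16, 32)):
    """Least-squares slope beta of log var(X^(m)) vs log m; returns 1 + beta/2."""
    pts = [(math.log(m), math.log(variance(aggregate(x, m)))) for m in block_sizes]
    n = len(pts)
    sx = sum(u for u, _ in pts); sy = sum(v for _, v in pts)
    sxx = sum(u * u for u, _ in pts); sxy = sum(u * v for u, v in pts)
    beta = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return 1 + beta / 2

def rs_statistic(x):
    """Range of mean-adjusted partial sums, rescaled by the standard deviation."""
    n = len(x)
    mu = sum(x) / n
    sd = (sum((v - mu) ** 2 for v in x) / n) ** 0.5
    partial, s = [], 0.0
    for v in x:
        s += v - mu
        partial.append(s)
    return (max(partial) - min(partial)) / sd

# Demo on seeded white noise: the variance method should report H near 0.5.
random.seed(1)
noise = [random.gauss(0, 1) for _ in range(50_000)]
print(round(hurst_variance_method(noise), 2))
```

A full R/S estimator would additionally regress log R/S against log n over sub-series of growing length n, with the slope estimating H.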

B. Frequency-domain/wavelet-domain estimators: These estimators operate in the frequency or the wavelet domain.

• Periodogram method. This method plots the logarithm of the spectral density of a time-series versus the logarithm of the frequencies. The slope provides an estimate of H. The periodogram is given by

    I(ν) = (1 / 2πN) |Σ_{j=1}^{N} X(j) e^{ijν}|²,

where ν is the frequency, N is the length of the time-series, and X is the actual time-series.

• Whittle estimator. The method is based on the minimization of a likelihood function applied to the periodogram of the time-series. It gives an estimate of H and produces a confidence interval, but no graphical output.


Figure 2: Pairs separated by lag k can belong to the same bucket or not, in which case they form an inbucket or an outbucket pair, respectively.

• Abry-Veitch (AV). The Hurst exponent is estimated using the wavelet transform of the series [2]. A least-squares fit on the average of the squares of the wavelet coefficients at different scales yields an estimate of the Hurst exponent. The method produces both a graphical output and a confidence interval.
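The periodogram ordinate defined for the Periodogram method above can be evaluated directly from its formula. The brute-force O(N²) sketch below is for illustration only (a real implementation would use an FFT), and the function name is ours:

```python
import cmath
import math

def periodogram(x, nu):
    """I(nu) = |sum_{j=1..N} X(j) e^{i j nu}|^2 / (2 pi N)."""
    n = len(x)
    s = sum(x[j - 1] * cmath.exp(1j * j * nu) for j in range(1, n + 1))
    return abs(s) ** 2 / (2 * math.pi * n)
```

The H estimate then comes from a least-squares fit on the log-log plot of I(ν) against the low frequencies ν, as described above.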

3.2 Randomized Buckets
In SELFIS, randomized buckets is used as an intuitive method for the detection and validation of long-range dependence. We examine numerous ways of randomizing, such as moving numbers a constant number of positions in the series, or randomizing inside the buckets with replacement. These methodologies will be commented upon further in our future work.

The idea behind randomized buckets is to decouple the short-range from the long-range correlations in a series, to facilitate the study of the effects of long-range dependence. This is achieved by partitioning the time-series into a set of "buckets" of length b. Thus, we define the contents of the u-th bucket to be items X_{u·b}, ..., X_{(u+1)·b−1} from the series, and the home of item X_i to be bucket H(i) ≡ ⌊i/b⌋. Also, we say that two items (X_i, X_j) form an inbucket pair if H(i) = H(j); otherwise, they form an outbucket pair with an offset of |H(i) − H(j)| buckets. Note that this classification depends on the (fixed) locations of the bucket boundaries, and not just on the separation between two items in the time-series. For example, fig. 2 shows that two items separated by lag k could form either an inbucket or an outbucket pair.

Once the series has been partitioned in this way, we can then apply one of the following randomization algorithms to reorder its items:

External Randomization (EX): The order of the buckets is randomized, whereas the content of each bucket remains intact. This can be achieved by labelling each bucket with a bucket-id between 0 and ⌊TimeSeriesLength/b⌋, and randomizing the bucket-ids. External randomization preserves all correlations among the inbucket pairs, while equalizing all correlations among the outbucket pairs with different offsets. Thus, if the series is sufficiently long, the ACF should not exhibit significant correlations beyond the bucket size.

Internal Randomization (IN): The order of the buckets remains unchanged, while the contents of each bucket are randomized. As a result, correlations among the inbucket pairs are equalized, while correlations among the outbucket pairs are preserved, but rounded to a common value for each offset. Thus, if the original signal has long memory, then the ACF of the internally randomized series will still show power-law behavior.

Two-Level Randomization (2L): Each bucket is further subdivided into a series of "atoms" of size a. Thereafter, we apply external randomization to the block of ⌊b/a⌋ atoms within each bucket. As a result, both short-range correlations (within each atom) and long-range correlations (across multiple buckets) are preserved, while medium-range correlations (across multiple atoms within the same bucket) are equalized.
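Under the simplifying assumption that the series length is a multiple of b (trailing items are silently dropped otherwise), external and internal randomization can be sketched as follows. This is our own illustration, not the SELFIS implementation:

```python
import random

def external_randomization(x, b):
    """EX: shuffle the order of the buckets; each bucket's contents stay intact."""
    buckets = [x[u * b:(u + 1) * b] for u in range(len(x) // b)]
    random.shuffle(buckets)
    return [v for bucket in buckets for v in bucket]

def internal_randomization(x, b):
    """IN: keep the bucket order; shuffle the contents within each bucket."""
    out = []
    for u in range(len(x) // b):
        bucket = x[u * b:(u + 1) * b]  # slice copy, safe to shuffle in place
        random.shuffle(bucket)
        out.extend(bucket)
    return out
```

Two-level randomization follows the same pattern: treat each bucket as a sub-series of ⌊b/a⌋ atoms and apply external randomization to those atoms.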

3.3 Transforms
Transformations are usually applied to reveal information that is not apparent in the raw time-series. Fourier and wavelet transforms can be useful for revealing periodicities in the series and, in general, for studying its frequency components. SELFIS includes the following transforms:

• Fourier Transform. The Fourier transform converts a series from the time domain to the frequency domain. Intuitively, the signal is decomposed into a sum of sinusoids of different frequencies.

• Wavelets (Haar and D4). The wavelet transform provides time and frequency information about a time-series simultaneously. The Fourier transform cannot convey information about time; wavelets cover this deficiency by combining the frequency and time domains.

• Power Spectrum. The power spectrum presents the amount of energy that corresponds to each frequency of the Fourier transform.
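One level of the Haar transform, for example, reduces a series to pairwise averages (the coarse trend) and pairwise differences (the detail coefficients). A minimal sketch, with an assumed helper name:

```python
def haar_step(x):
    """One Haar level: pairwise averages (trend) and differences (detail)."""
    avgs = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    diffs = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    return avgs, diffs

a, d = haar_step([4.0, 2.0, 5.0, 7.0])
print(a, d)  # [3.0, 6.0] [1.0, -1.0]
```

Repeating haar_step on the averages yields the full multi-level decomposition; wavelet-based estimators such as Abry-Veitch work with the energies of such detail coefficients across scales.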


3.4 Data Processing
Data processing is an essential element of time-series analysis. Processing reveals the underlying behavior of the series and allows for further analysis. SELFIS currently includes the following data-processing methodologies:

• Smoothing Algorithms. Smoothing can be applied with median, average, or exponential smoothing algorithms. Our tool includes the 4253H smoothing algorithm described in [30], which has been shown to provide good results for different kinds of data. In 4253H smoothing, the signal is smoothed by successively applying median smoothing with windows 4, 2, 5, and 3, followed by a hanning operation. A hanning operation multiplies the values of a window of 3 by 0.25, 0.5, and 0.25, respectively, and sums the results.

• Stationarity tests. Stationarity intuitively means that there is no trend in the series. A number of tests check a series for stationarity; one of the most common is the run test [4]. The test can detect a monotonic trend in the series by evaluating the number of runs, where a run is defined as a sequence of identical observations, i.e., consecutive equal values in a series. For a stationary series, the number of runs is a random variable with mean N/2 + 1 and variance N(N − 2)/(4(N − 1)), where N is the length of the series. The number of runs is evaluated from a series s(i), where

    s(i) = 0, if y(i) < median(y), and
    s(i) = 1, if y(i) ≥ median(y),

where y(i) is the time-series. Thus, a run here is a sequence of consecutive values that are all above or all below the median of the original time-series. Non-stationarity is indicated by a number of runs considerably different from N/2 + 1. Stationarity is important when long-range dependence is studied, since the estimators fail on non-stationary data².
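The run test above translates directly into code; here is a sketch using the stated mean and variance (the function name is ours):

```python
import statistics

def run_test(y):
    """Run test for stationarity: count runs of values above/below the median.

    Returns the run count together with its expected mean N/2 + 1 and
    variance N(N-2)/(4(N-1)) under stationarity.
    """
    med = statistics.median(y)
    s = [1 if v >= med else 0 for v in y]
    runs = 1 + sum(1 for i in range(1, len(s)) if s[i] != s[i - 1])
    n = len(y)
    return runs, n / 2 + 1, n * (n - 2) / (4 * (n - 1))

# A monotonic (trending) series of length 100 yields only 2 runs,
# far below the expected mean of 51: strong evidence of non-stationarity.
print(run_test(list(range(100))))
```

A count many standard deviations away from the mean indicates a trend and hence non-stationarity.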

3.5 Fractional Gaussian Noise Generators
Fractional Gaussian noise (fGn) generators can synthesize series with long-range dependence. Our tool includes two generators. The first is based on the fast Fourier transform [21]. The second produces fGn series using the Durbin-Levinson coefficients.

²If non-stationarity is detected, the time-series must be differenced successively until stationarity is achieved.
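As an illustration of the second approach, here is a sketch of exact fGn synthesis via the Durbin-Levinson recursion. It is our own O(n²) pure-Python illustration (names ours), not the SELFIS generator, and it assumes the standard unit-variance fGn autocovariance γ(k) = ½(|k+1|^{2H} − 2|k|^{2H} + |k−1|^{2H}):

```python
import random

def fgn_autocov(k, hurst):
    """Autocovariance of unit-variance fractional Gaussian noise at lag k."""
    return 0.5 * (abs(k + 1) ** (2 * hurst) - 2 * abs(k) ** (2 * hurst)
                  + abs(k - 1) ** (2 * hurst))

def fgn_durbin_levinson(n, hurst, rng=random):
    """Exact (O(n^2)) fGn sampler via the Durbin-Levinson recursion."""
    gamma = [fgn_autocov(k, hurst) for k in range(n)]
    x = [gamma[0] ** 0.5 * rng.gauss(0, 1)]
    phi, v = [], gamma[0]  # prediction coefficients and innovation variance
    for t in range(1, n):
        # Reflection coefficient phi_{t,t} for step t.
        phi_tt = (gamma[t] - sum(phi[j] * gamma[t - 1 - j]
                                 for j in range(t - 1))) / v
        # Update the remaining coefficients: phi_{t,j} = phi_{t-1,j} - phi_tt * phi_{t-1,t-j}.
        phi = [phi[j] - phi_tt * phi[t - 2 - j] for j in range(t - 1)] + [phi_tt]
        v *= 1 - phi_tt ** 2
        mean = sum(phi[j] * x[t - 1 - j] for j in range(t))
        x.append(mean + v ** 0.5 * rng.gauss(0, 1))
    return x
```

For H = 0.5 the autocovariance vanishes at every nonzero lag, so the recursion reduces to i.i.d. Gaussian noise, which makes a convenient correctness check.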

4. CASE STUDY
This section highlights the capabilities of SELFIS through two case studies. First, we demonstrate how randomized buckets can be used to stress-test long-range dependence and cancel the effect of short-term correlations; to demonstrate the methodology, we use fractional Gaussian noise series generated by one of the generators included in SELFIS. Second, we demonstrate that long-range dependence estimators have limited capabilities.

These case studies also demonstrate and justify the need for different estimation methodologies. Each estimator has different strengths and weaknesses, and thus each is best used in different cases. In addition, understanding the limitations of each estimator allows for sound usage of the SELFIS tool and sound interpretation of its results.

4.1 Randomized Buckets
Randomized buckets (see the previous section) is an intuitive, straightforward methodology that validates the existence of long memory. To show how long-range dependence can be detected using randomized buckets, we synthesized a sample series of fractional Gaussian noise. The series (fig. 3, left plot) has length 65536 and Hurst exponent 0.8, and was synthesized using the generator created by Paxson [21]. The middle plot in fig. 3 shows the sample autocorrelation function (ACF) of the series, which decays hyperbolically to zero, implying long-range dependence. To ensure that long-range dependence really exists, we employ randomized buckets. The right plot in fig. 3 shows the ACF of the fGn series when randomized externally with bucket size 1. This type of randomization removes all correlations by creating a completely random series. As expected, the ACF shows no correlation at any time lag. Fig. 4 shows the ACF after the signal is randomized in three different ways:

• External Randomization (using b = 50) causes the ACF to drop smoothly from the initial value for the unrandomized sequence to zero as the lag increases, reaching zero at exactly the bucket size. The left plot in fig. 4 shows that all correlations are equalized beyond the bucket size (50).


Figure 3: LEFT: fGn series of length 65536 and Hurst 0.8. MIDDLE: Autocorrelation function (ACF) of the series up to lag 300. The ACF shows power-law-like behavior. RIGHT: ACF after external randomization with bucket size 1 (full randomization).

Figure 4: LEFT: External randomization with bucket size 50. After lag 50 all correlations are insignificant. MIDDLE: Internal randomization with bucket size 50. The ACF shows the same power-law behavior as the original series (fig. 3). RIGHT: Two-level randomization with bucket sizes 300 and 30. Medium-range correlations are distorted.

• Internal Randomization (using b = 50) significantly lowers and flattens the ACF at small values of the lag compared to the original (unrandomized) series. However, for large values of the lag, internal randomization has no effect on the ACF: for lags beyond the bucket size, the ACF (fig. 4, middle plot) is similar to the ACF of the original series. Observe that long-range dependence seems to dominate the original series, since equalizing the in-bucket correlations has minimal effect on the ACF.

• Two-Level Randomization (using a = 30, b = 300) exhibits behavior similar to external randomization for small values of the lag (i.e., less than a), along with behavior similar to internal randomization for large values of the lag. These two limiting values also match the ACF of the original (unrandomized) series, but for intermediate values the two-level randomization significantly reduces the correlations. The right plot in fig. 4 demonstrates the distortion of medium-range correlations in the ACF after two-level randomization.
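The two basic randomization schemes can be sketched in a few lines. This is a minimal illustration under our own naming (`external_randomize`, `internal_randomize` are hypothetical helpers); two-level randomization would apply the external scheme within each larger bucket.

```python
import numpy as np

def external_randomize(x, b, seed=None):
    """Shuffle the order of the size-b buckets, keeping each bucket's
    internal order intact: destroys long-range correlations while
    preserving short-range (in-bucket) ones.  b = 1 is a full shuffle."""
    rng = np.random.default_rng(seed)
    n = (len(x) // b) * b                     # drop any trailing remainder
    buckets = np.asarray(x[:n]).reshape(-1, b)
    return buckets[rng.permutation(len(buckets))].ravel()

def internal_randomize(x, b, seed=None):
    """Shuffle the values inside each size-b bucket, keeping the bucket
    order: destroys short-range correlations while preserving
    long-range ones."""
    rng = np.random.default_rng(seed)
    n = (len(x) // b) * b
    buckets = np.asarray(x[:n]).reshape(-1, b).copy()
    for row in buckets:
        rng.shuffle(row)                      # in-place shuffle per bucket
    return buckets.ravel()
```

Either routine is a pure permutation of the input, so the marginal distribution of the series is untouched; only the correlation structure changes.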

To emphasize the effect of the various types of randomization on the correlations of the time series, we plot the autocorrelation function after each type of randomization in log-log scale (fig. 5). The ACF of the initial (unrandomized) fGn series is a straight line, as expected from the definition of long-range dependence (see section 2). The ACF after internal randomization differs from the original ACF only for lags smaller than the bucket size. On the contrary, the ACF after external randomization is similar to the original for small lags only, while the ACF after two-level randomization differs for intermediate lags. Furthermore, in fig. 6 we plot the difference between the ACF after each type of randomization and the initial ACF. The difference is close to zero for
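The sample ACF used throughout these plots is the standard biased estimator, which can be computed directly; `sample_acf` is a hypothetical helper name.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Biased sample autocorrelation function for lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)                      # lag-0 (unnormalized) variance
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

# For an LRD series, log(acf) vs log(lag) is roughly a straight line
# with slope 2H - 2; for white noise the ACF drops to ~0 immediately.
rng = np.random.default_rng(0)
acf = sample_acf(rng.standard_normal(10000), 50)
```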


Figure 5: The ACF for various types of randomization. The vertical dashed line shows the bucket size used.

lags larger than the bucket size for internal and two-level randomization, indicating long-range dependence. On the contrary, the difference in the case of external randomization decays with the lag up to the bucket size; beyond that, it starts to grow back towards zero. At small lags the ACF after external randomization drops faster to zero than the initial ACF; thus, the difference between them grows as the lag approaches the bucket size, where its magnitude is largest. After the bucket size the difference shrinks, as the initial ACF also approaches zero.

4.2 Reliability of LRD Estimators

This section is an evaluation of the Hurst exponent estimators. We examine the accuracy and robustness of the estimators using two types of test cases: a) synthesized LRD series with a known Hurst exponent value, to study the accuracy of the estimators; b) LRD series randomized with randomized buckets, to study the effect of short-range correlations on the estimators.
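To make the discussion concrete, here is a minimal sketch of one classic estimator, rescaled-range (R/S) analysis. The name `rs_hurst` and the parameter choices are our own; the estimators in SELFIS are considerably more careful about block sizes and fitting ranges.

```python
import numpy as np

def rs_hurst(x, min_block=16):
    """Estimate the Hurst exponent by rescaled-range (R/S) analysis:
    for each block size n, average R/S over the blocks, then fit
    log(R/S) = H * log(n) + c by least squares."""
    x = np.asarray(x, dtype=float)
    sizes, rs_vals = [], []
    n = min_block
    while n <= len(x) // 4:
        blocks = x[:(len(x) // n) * n].reshape(-1, n)
        # Cumulative deviations from each block's mean
        dev = np.cumsum(blocks - blocks.mean(axis=1, keepdims=True), axis=1)
        r = dev.max(axis=1) - dev.min(axis=1)   # range of deviations
        s = blocks.std(axis=1)                  # block standard deviation
        good = s > 0
        sizes.append(n)
        rs_vals.append((r[good] / s[good]).mean())
        n *= 2
    slope, _ = np.polyfit(np.log(sizes), np.log(rs_vals), 1)
    return slope

h = rs_hurst(np.random.default_rng(0).standard_normal(8192))
```

For white noise the fitted slope should be near 0.5, though R/S is known to be biased upward on short samples.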

A. Accuracy on Synthesized LRD Series: We show that the estimators seldom agree on the value of the Hurst exponent, and often disagree by a significant margin. Each of the estimators was tested against two different types of synthesized long-memory series: a) Autoregressive Fractional Integrated Moving Average (ARFIMA) processes and b) fractional Gaussian noise (fGn) series. For more details, also see [14].

We generate 100 datasets with different seeds for each Hurst value from 0.5 to 0.9, in steps of 0.1. Fig. 7 summarizes our findings for the Paxson generator and the ARFIMA model. For both plots in fig. 7, the X axis presents the Hurst exponent value of the synthesized series and the Y axis shows the average estimated value for the corresponding methodology. The "Target" line presents what the optimal estimation would be in each case. The 95% confidence intervals are typically within 0.01 of the reported value.
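A miniature version of this evaluation protocol, generating many independent series and averaging the estimates, can be sketched as follows. We use the aggregated-variance (variance-time) estimator on white noise (true H = 0.5) purely as a stand-in; all names are hypothetical.

```python
import numpy as np

def aggregate_variance_hurst(x, sizes=(4, 8, 16, 32, 64, 128)):
    """Variance-time estimator sketch: the variance of the m-aggregated
    series scales as m^(2H - 2), so H = 1 + slope/2 on a log-log fit."""
    x = np.asarray(x, dtype=float)
    log_m, log_v = [], []
    for m in sizes:
        agg = x[:(len(x) // m) * m].reshape(-1, m).mean(axis=1)
        log_m.append(np.log(m))
        log_v.append(np.log(agg.var()))
    slope, _ = np.polyfit(log_m, log_v, 1)
    return 1.0 + slope / 2.0

# Monte Carlo protocol in miniature: estimate over many seeds and average,
# as in the paper's 100-dataset experiments.
rng = np.random.default_rng(0)
estimates = [aggregate_variance_hurst(rng.standard_normal(4096))
             for _ in range(20)]
h_mean = float(np.mean(estimates))   # expected near 0.5 for white noise
```

The spread of `estimates` across seeds is what the paper's confidence intervals summarize.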

Our findings demonstrate the inability of the majority of the estimators to accurately estimate the value of the Hurst exponent. For fGn data, with the exception of the Whittle and Periodogram estimators, all estimators fail to estimate it correctly. We observed similar results for series generated with the ARFIMA model; in that case, the Periodogram, Abry-Veitch and R/S estimators produce values closer to the target.

B. Estimators and Randomized Buckets: How sensitive are the estimators to short-range correlations? To address this question we employed the randomized-buckets methodology. Intuitively, the estimations should not be affected by internal randomization, since internal randomization breaks the short-term correlations while preserving the long-term ones. On the contrary, external randomization should significantly influence the estimations, since long memory is distorted: estimates of the Hurst exponent of externally randomized series should be close to 0.5, since long-range dependence has been canceled.


Figure 6: Differences in autocorrelation coefficients after randomization: the ACF after internal, external, and two-level randomization minus the initial ACF.

Figure 7: The performance of the estimators using Paxson's generator (left) and ARFIMA (right). The "Target" line shows an optimal estimation of the synthesized data. The Whittle and Periodogram estimators fall exactly on the Target line in the fGn case.

We synthesized numerous fGn series for various Hurst values between 0.5 and 1, and randomized these synthesized series with different bucket sizes. Fig. 8 presents our findings. For both plots, the Y axis shows the estimated Hurst value while the X axis presents the bucket size. The initial (unrandomized) time series is represented in the plots by bucket sizes of infinity and one for external and internal randomization, respectively. In more detail, this figure shows the average estimations over 100 fGn series with Hurst 0.8 after being externally and internally randomized with bucket sizes ranging from 10 to 90.

Intuitively, we would expect that, especially for small bucket sizes, all estimations would be close to 0.5 after external randomization (left plot of fig. 8). This is true for all the estimators except the AV, Whittle and Periodogram estimators, which behave counter-intuitively. In particular, the AV and Whittle estimators do not seem to be affected by external randomization, irrespective of the bucket size. The three "frequency-based" estimators, especially Whittle and AV, produce the same estimates as before randomizing (bucket size of infinity in the figure), even though long-range dependence has been eliminated from the series.

Similar counter-intuitive behavior of AV and Whittle holds in the case of internal randomization. Except for AV and Whittle, all estimators report the same Hurst value before and after internal randomization, as intuitively expected (fig. 8, right plot). On the contrary, estimations of the Hurst value from the AV and Whittle estimators drop significantly as the bucket size increases. However, the estimations should be unaffected, since internal randomization only destroys short-range correlations.

Figure 8: External (left) and internal (right) randomization with various bucket sizes (10, 30, 50, 70, 90). The X axis is presented in log scale. The "frequency-based" estimators (AV, Whittle, Periodogram) fail to capture that long-term correlations are distorted by external randomization: their estimations are the same as before randomizing (the bucket size of the initial series can be considered to be infinity in external randomization). In addition, these estimators fail to capture internal randomization: their estimations after internal randomization deviate significantly from the initial estimations (the bucket size of the initial series can be considered to be 1 in internal randomization).

Considering the effect of randomized buckets on long memory, we can claim that these two estimators, AV and Whittle, seem to depend more on the short-term behavior of the time series to derive an estimate of the Hurst exponent.

Note that the estimators that performed best on synthesized long-range dependent series were the ones most affected by short-term correlations.

Summing up our study of the accuracy and robustness of the Hurst exponent estimators, we reach the following main conclusions:

• When the data are generated by fractional Gaussian noise (fGn), the Whittle and Periodogram estimators seem to give the most accurate estimation of the Hurst exponent.

• When the data are synthesized using the ARFIMA model, AV, Periodogram and R/S have the best performance.

• Even though the Whittle estimator is considered the most robust, it proved the most sensitive of the estimators in our experiments.

• The AV and Whittle estimators seem to depend mainly on short-range correlations to derive the Hurst exponent estimate.

In general, there is no definitive estimator that can be consistently used in every case. Each estimator evaluates different statistics of the time series to estimate the Hurst exponent; thus, different processes may affect each estimator differently.

5. CONCLUSIONS

The main contribution of this work is the development of the SELFIS software tool. We believe that SELFIS presents a long-overdue first step towards a widely-used reference platform to facilitate self-similarity analysis, and through its introduction we hope to encourage greater use of these techniques. Despite the wide interest in self-similarity and long-range dependence within the community, a common tool has not yet emerged, which has impeded the use and comparability of results.

In more detail, SELFIS provides the following direct benefits.

• Accessibility: Anyone will be able to use long-range dependence analysis, even non-experts. In this sense, SELFIS will help spread the use of LRD concepts for research and educational purposes.

• User friendliness: The interface of the tool is straightforward, making its use effortless, while visualization offers a fast sanity check of results.

• Robustness: It offers multiple estimators for reliable results and presents their statistical significance.

• Repeatability: Results can be replicated and verified.

• Open-source collaborative development: The community can cooperate on enriching the capabilities of the tool, thereby leveraging each other's work. Different groups are welcome to contribute their expertise, and we actively solicit contributions.

• Free: The use of SELFIS comes at no monetary cost. SELFIS is publicly available and no extra commercial software is needed.

An additional contribution is an implementation of a powerful tool for stress-testing long-range dependence. Randomized buckets can isolate the effect of short-, medium- and long-range correlations on a time series. Although this idea appeared in earlier work, it has been neglected; we would like to revive it and develop it to its full potential.

Our study of the estimation of long-range dependence and our experience with the estimators allow us to highlight a few tips for practitioners. First, reporting the Hurst exponent is meaningful only if it is accompanied by the method that was used, as well as the confidence intervals or correlation coefficient. In addition, researchers should not rely on only one estimator when deciding on the existence of long-range dependence, since several of the estimators can be overly optimistic in identifying it. Furthermore, for efficient characterization, it may be necessary to process and decompose the time series. Finally, visual inspection of the time series can be very useful, providing a qualitative analysis and revealing many of its features, such as periodicity.

SELFIS will be further extended with additional functionality in the future; calculation of fractal dimensions and forecasting models are some of our priorities. In addition, we are very interested in collaborative development, and interested parties are highly encouraged to contribute code.

Finally, the algorithms within SELFIS are not restricted to the domain of networking time-series data. Thus, we hope that our work may be applied to the analysis of long-range dependent data sets in other disciplines, such as computer science, economics, sociology, psychology, etc.

6. REFERENCES

[1] The SELFIS Tool. http://www.cs.ucr.edu/∼tkarag.

[2] P. Abry and D. Veitch. Wavelet Analysis of Long-Range-Dependent Traffic. IEEE Transactions on Information Theory, 1998.

[3] B. Audit, C. Vaillant, A. Arneodo, Y. d'Aubenton-Carafa, and C. Thermes. Long-Range Correlations Between DNA Bending Sites. Journal of Molecular Biology, volume 4, pages 903–918, 2002.

[4] J. S. Bendat and A. G. Piersol. Random Data: Analysis and Measurement Procedures. John Wiley & Sons, NY, 1986.

[5] J. Beran. Statistics for Long-Memory Processes. Chapman and Hall, New York, 1994.

[6] M. E. Crovella and A. Bestavros. Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes. IEEE/ACM Transactions on Networking, 1997.

[7] A. Erramilli, O. Narayan, and W. Willinger. Experimental Queueing Analysis with Long-Range Dependent Packet Traffic. IEEE/ACM Transactions on Networking, 4(2):209–223, 1996.

[8] A. Feldmann, A. C. Gilbert, and W. Willinger. Data Networks as Cascades: Investigating the Multifractal Nature of Internet WAN Traffic. Computer Communications Review, 1998.

[9] A. Feldmann, A. C. Gilbert, P. Huang, and W. Willinger. Dynamics of IP Traffic: A Study of the Role of Variability and the Impact of Control. In SIGCOMM, pages 301–313, 1999.

[10] A. Feldmann, A. C. Gilbert, W. Willinger, and T. G. Kurtz. The Changing Nature of Network Traffic: Scaling Phenomena. ACM Computer Communication Review, volume 28, pages 5–29, 1998.

[11] M. Grossglauser and J. Bolot. On the Relevance of Long-Range Dependence in Network Traffic. IEEE/ACM Transactions on Networking, 1998.

[12] K. Park, G. Kim, and M. E. Crovella. On the Effect of Traffic Self-Similarity on Network Performance. In Proceedings of SPIE International Conference on Performance and Control of Network Systems, 1997.

[13] T. Karagiannis and M. Faloutsos. SELFIS: A Tool For Self-Similarity and Long-Range Dependence Analysis. In 1st Workshop on Fractals and Self-Similarity in Data Mining: Issues and Approaches (in KDD), Edmonton, Canada, July 23, 2002.

[14] T. Karagiannis, M. Faloutsos, and R. Riedi. Long-Range Dependence: Now You See It, Now You Don't! In IEEE GLOBECOM, Global Internet Symposium, 2002.

[15] M. Krunz. On the Limitations of the Variance-Time Test for Inference of Long-Range Dependence. In IEEE INFOCOM, pages 1254–1260, 2001.

[16] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson. On the Self-Similar Nature of Ethernet Traffic. IEEE/ACM Transactions on Networking, 1994.

[17] X. Lu, Z. Sun, H. Chen, and Y. Li. Characterizing Self-Similarity in Bacteria DNA Sequences. Physical Review E, volume 58, pages 3578–3584, 1998.

[18] S. Molnar and T. D. Dang. Pitfalls in Long Range Dependence Testing and Estimation. In GLOBECOM, 2000.

[19] K. Park, G. Kim, and M. E. Crovella. On the Relationship Between File Sizes, Transport Protocols, and Self-Similar Network Traffic. In International Conference on Network Protocols, pages 171–180, Oct 1996.

[20] K. Park and W. Willinger. Self-Similar Network Traffic: An Overview. In Self-Similar Network Traffic and Performance Evaluation. Wiley-Interscience, 2000.

[21] V. Paxson. Fast Approximation of Self-Similar Network Traffic. Technical Report LBL, 1995.

[22] V. Paxson and S. Floyd. Wide Area Traffic: The Failure of Poisson Modeling. IEEE/ACM Transactions on Networking, 1995.

[23] R. H. Riedi, M. S. Crouse, V. J. Ribeiro, and R. G. Baraniuk. A Multifractal Wavelet Model with Application to Network Traffic. IEEE Special Issue on Information Theory, pages 992–1018, 1999.

[24] Z. Sahinoglu and S. Tekinay. On Multimedia Networks: Self-Similar Traffic and Network Performance. IEEE Communications Magazine, volume 37, pages 48–52, 1999.

[25] M. Schroeder. Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. W. H. Freeman & Co., 1992.

[26] R. V. Sole, S. C. Manrubia, M. Benton, and P. Bak. Self-Similarity of Extinction Statistics in the Fossil Record. Nature, volume 388, pages 764–767. Macmillan Publishers Ltd, 1997.

[27] M. S. Taqqu and V. Teverovsky. On Estimating the Intensity of Long-Range Dependence in Finite and Infinite Variance Time Series. In R. J. Adler, R. E. Feldman, and M. S. Taqqu, editors, A Practical Guide to Heavy Tails: Statistical Techniques and Applications, pages 177–217. Birkhauser, Boston, 1998.

[28] V. Teverovsky. http://math.bu.edu/people/murad/methods/.

[29] X. Tian, J. Wu, and C. Ji. A Unified Framework for Understanding Network Traffic Using Independent Wavelet Models. In IEEE INFOCOM, 2002.

[30] P. F. Velleman and D. C. Hoaglin. Applications, Basics, and Computing of Exploratory Data Analysis. Duxbury Press, Boston, MA, 1981.

[31] A. Veres, Z. Kenesi, S. Molnar, and G. Vattay. On the Propagation of Long-Range Dependency in the Internet. In SIGCOMM, 2000.

[32] W. Willinger and V. Paxson. Where Mathematics Meets the Internet. Notices of the AMS, 1998.

[33] W. Willinger, M. S. Taqqu, R. Sherman, and D. V. Wilson. Self-Similarity Through High-Variability: Statistical Analysis of Ethernet LAN Traffic at the Source Level. IEEE/ACM Transactions on Networking, 5(1):71–86, 1997.