
Multi-Resolution Anomaly Detection for the Internet

Lingsong Zhang1 Zhengyuan Zhu1 Kevin Jeffay2

1Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599-3260

[email protected], [email protected], [email protected],

J. S. Marron1,2 F. Donelson Smith2

2Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599-3175

[email protected], [email protected]

Abstract: In the context of Internet traffic anomaly detection, we show that some outliers in a time series can be difficult to detect at one scale while they are easy to find at another scale. In this paper, we develop an outlier detection method for a time series with long range dependence, and conclude that testing for outliers at multiple time scales helps to reveal them. We present a Multi-Resolution Anomaly Detection (MRAD) procedure for detecting network anomalies. We show that the MRAD method is useful, especially when outliers appear as a slight local mean level shift with a rather long duration, e.g., as generated by a port scan. A novel MRAD outlier map is proposed to visualize the location of the outliers, and also to suggest the significance probabilities (p values) for them.

I. INTRODUCTION

Developing effective means of detecting Internet intrusions, such as distributed denial of service (DDoS) attacks, port scans, SYN floods, etc., has become an active research area. Detection methods are often classified by the basic approach used, including signature-based detection (e.g. Bro [19] or Snort [20]), signal processing based methods (e.g. [1]), multivariate data analysis methods (e.g. [11]), and data mining methods (e.g. [14; 21]). One of the common detection approaches is to view intrusions as anomalies in measures of the traffic (e.g. see [17]). Anomaly detection methods classify empirical observations of network traffic as representing either a “normal” (expected) value or an anomalous (unexpected) value. Once observations have been classified as anomalous, it is possible to examine the data more closely to determine whether they in fact arise from network intrusions. This type of separation is similar to outlier detection in the field of Statistics, where anomalies correspond to statistical outliers, and normal traffic corresponds to regular observations.

Outlier detection is a classical topic in Statistics (see [2], [9]). Internet measurements collected at a single location over a fixed interval of time, for example, measurements of packet counts, byte counts, flow starts, etc., in 10 or 100 millisecond intervals, form a time series. Classical methods for detecting outliers in time series data include methods such as intervention analysis [4; 6; 22]. However, classical time series methods usually assume that observations are either independent or short range dependent, that is, that the autocorrelation function of the time series decays exponentially as the lag goes to infinity. In contrast, numerous studies have shown that time series of network traffic measurements exhibit long range dependence (LRD) [15] and self-similarity (SS) [25]. For this reason, conventional statistical methods for detecting outliers in time series may not be suitable, because the artifacts that are naturally generated by LRD may cause these methods to generate more false alarms.

Another characteristic of network traffic data is that different types of network anomalies show statistically abnormal signals at different time scales. That is, the length of the time interval over which observations are measured or aggregated can strongly influence results. For example, intrusions such as port scans, which generate moderate or low packet or byte rates, may not be detected by looking only at observations over short time intervals (for more sophisticated examples of other scale-related anomalies see [1]). Similarly, the LRD and SS properties of Internet traffic time series imply that even normal traffic will have high variability over many different time scales. It is these multi-scale properties of normal and anomalous traffic that motivate us to find detection methods which deal with them effectively. This is in contrast to sampling-based approaches, which deal with issues of scale through sampling at various frequencies [5; 16; 10].

In this paper, we propose a Multi-Resolution Anomaly Detection (MRAD) method for the analysis of time series with long range dependence. From a theoretical viewpoint, there are two important characterizations of the outlier generation process that are used in the development and evaluation of our method: the intensity of a local level-shift in the mean of the observed traffic process and the duration of the level-shift. We also present a corresponding visualization tool, the MRAD outlier map, which can highlight outliers at different time scales.

The major contributions of this paper include: (1) a multi-resolution method for detecting outliers within an LRD time series, (2) a graphical tool to visualize the probability that the observation at each time location (at multiple scales) is an outlier (anomaly), and (3) an evaluation and comparison of MRAD with classical statistical methods and naive methods using simulated time series and semi-experiments on real network traces.

A critical issue for network intrusion detection is that the detection method should be implemented as a real time algorithm, and that the test statistic (detection threshold) can be updated iteratively as the normal traffic evolves. Thus it would be possible to flag (and possibly stop) ongoing attacks in (or close to) real time. One variation of the MRAD procedures, which is based on sliding window aggregation, can be easily adapted to be an online detection approach, as discussed in Section III.

The remainder of this paper is arranged as follows. A motivating example is discussed in Section II to illustrate the idea of the MRAD method. Section III discusses two aggregation methods used to construct time series at larger time scales from empirical observations at a fixed, short, time interval. Section IV describes the rejection region of our MRAD method and discusses an important theoretical property concerning the power of the MRAD method. Section V discusses test thresholds for this MRAD method. Section VI defines the metrics used for evaluation. Section VII uses several simulated data sets derived from actual measurement data, and some semi-experiments, to evaluate the usefulness of the MRAD method. Further analysis of the motivating example and variations of the MRAD outlier maps are discussed in that section as well. Section VIII summarizes our work and gives related discussion.

II. MOTIVATING EXAMPLE AND THE MRAD MAP

In this section, we use a real network trace to illustrate the idea of our MRAD method. Figure 1 shows a time series of byte counts (per 10ms), which was collected at the main link between the UNC campus and the Internet. The data were collected on Wednesday April 10, 2002, starting from 1:00 pm, over a 2 hour duration. The Hurst parameter for this time series is estimated to be greater than 0.95 using different estimation methods [12], i.e., this series does have LRD. A quick look at the time series reveals that there might be network anomalies at a number of locations and time scales. Some have a very short duration (e.g. the single spikes at several locations), while others may last for a few minutes (e.g. the medium bump cluster near $3 \times 10^5$).

As discussed in Section I, the multi-scale properties of “normal” network traces and network anomalies motivate our MRAD method. An MRAD method for a time series (observed at a fixed interval, which is the finest time scale) has the following steps.

1) Form new time series at different time scales by aggregation of the observations.

2) Test whether the observations at each time and scale are outliers or not.

3) Report or visualize the testing results.

Section III describes the methods of forming time series at different scales (levels of aggregation), and the tests based on these methods are discussed in Section IV. Assume that the time series at different scales have been generated, and that the tests are performed at each time location and each scale. Figure 2 shows a novel visualization to report the test results, the MRAD outlier map. For this particular trace, the multi-scale time series are generated by the conventional approach of aggregating observations into successive non-overlapping bins of increasing size (see Section III for details). The horizontal axis corresponds to the time locations, and the vertical axis shows different scales. In this plot, finer scales (short time intervals) are in the top region, while coarser scales are at the bottom. Detailed definitions of these scales are also discussed in Section III. The map visualizes the significance probability (p value) simultaneously over scale and time. Note that under suitable assumptions (see Section III for details), the marginal distribution of each pixel (in the map) is the same when there are no outliers at all in the time series. We use hotter colors (red) to show small p values, which correspond to a higher chance of being outliers, and cooler colors (blue) to display large p values, which correspond to a lower chance of being outliers.

Fig. 1. Original byte count time series. Some anomalies may exist in this time series, appearing as huge spikes or clusters of moderate level shifts.

From this plot, we find several important outlier regions. For example, around the location $3 \times 10^5$, and at scales 5 to 15, there is one red zone, which suggests that there are outliers in the region $(2.8 \times 10^5, 3.1 \times 10^5)$. This group of outliers might be visually detected from the original time series, as discussed earlier, but it is not easy to visually distinguish signal from noise. In addition, around the locations $6.2 \times 10^5$ and $6.8 \times 10^5$ and the scale 15, there are some orange zones, which also suggest outliers in these regions. Other outliers might not be obvious in this map, since the display resolution is too small compared to the number of locations (about $7 \times 10^5$ observations) in the series. A zoomed version of this outlier map or a MATLAB movie of the dynamic sliding local-outlier map can be used to overcome this resolution problem; see Section VII-D for more discussion.

The above outlier map visualizes the significance probability of all time locations at all scales. It is natural to focus on those locations and scales that are more likely to be anomalies. This motivates the following thresholded outlier map, which displays those scales where the p values are smaller than a threshold set to select those more likely to be outliers. Figure 3 shows the thresholded outlier map for the same trace.


Fig. 2. The color MRAD outlier map based on the non-overlapping window aggregation method. The relatively hot colors correspond to small p values, which suggest a high chance of being outliers. The relatively cool colors stand for large p values, which mean a low chance of being outliers.

Here we use red to highlight those locations whose p values are smaller than $2(1 - \Phi(3))$, where $\Phi$ is the standard Normal distribution function (i.e., the absolute values of the normalized observations are larger than 3). Figure 3 highlights the same outlying regions shown in Figure 2, but they are more obvious in this map. A MATLAB program, mradvisual.m, which is available at [26], provides a default asymptotic threshold on p values and allows alternative thresholds to be set by users.
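The cutoff used for the thresholded map is a two-sided standard Normal tail probability. The following Python sketch is only an illustration of that arithmetic; the function names are ours and are not part of mradvisual.m.

```python
import math

def two_sided_p_value(y):
    """Two-sided p value of a normalized observation y under N(0, 1)."""
    # 2 * (1 - Phi(|y|)) is the same quantity as erfc(|y| / sqrt(2)).
    return math.erfc(abs(y) / math.sqrt(2.0))

def is_highlighted(y, cutoff=3.0):
    """Highlight y when its p value is below the p value of the cutoff."""
    return two_sided_p_value(y) < two_sided_p_value(cutoff)

print(round(two_sided_p_value(3.0), 4))  # 0.0027
```

A pixel of the thresholded map would then be colored red exactly when `is_highlighted` returns true for the normalized observation at that time and scale.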

Fig. 3. Thresholded MRAD outlier map for the byte count data. Here we highlight the p values which are smaller than $2(1 - \Phi(3))$ (i.e., $|Y_L(i)| \geq 3$).

Note that the above maps do not rely on the way the multi-scale time series are formed. For any multi-scale outlier detection method, these maps can be viewed as a general way to display the test results. However, the interpretation of these outlier maps may be different for various multi-scale aggregation methods. Another outlier map based on sliding window aggregation is discussed in Section III-D.

III. TWO TYPES OF AGGREGATION

There are many methods to form time series at multiple scales, including wavelet methods and kernel methods. In this section, we discuss two simple aggregation methods used in the MRAD method because of their tractability in theoretical properties and their adaptability to an online algorithm. One important theoretical property of our method is reported in Section IV, while other properties are discussed in [27].

In this paper, we will use a dyadic-like structure to construct multi-scale time series, so that the window sizes with respect to different scales increase exponentially. This treatment is motivated by the Haar wavelet bases (see e.g. [18; 23] for an introduction to the Haar wavelet). The size of the aggregation window for scale $k$ is $2^{k-1}$. In other words, the observation at scale $k$ is a function of $2^{k-1}$ consecutive observations at scale 1.

The classical outlier-detection methods in time series usually use the whole time series to estimate the model and then perform hypothesis tests on the residuals. To test whether the $i$th observation is an outlier, methods such as the intervention models exploit future information after the time $i$. This is common when outlier detection is performed after all the data are available. The following non-overlapping window aggregation is an example of this type. However, a real time detection algorithm cannot use future observations. In this section, we also develop a one-sided sliding window aggregation for use in real-time detection. After selecting appropriate aggregation constants, the aggregation vectors at some particular time locations based on these two aggregation methods are the same. However, the outlier maps from different aggregation methods have different interpretations.

A. Non-overlapping window aggregation

The non-overlapping window aggregation (NOWA) is a natural way to form multiple-scale time series. It is motivated by the usual Haar wavelet method. Let $Y_1(i)$ ($i = 1, \dots, N$) be a time series at the finest scale. If this time series does not contain any outliers, assume that it is a realization of fractional Gaussian noise (fGn) with Hurst parameter $H$. Let $Y_k(i)$, $i = 1, \dots, \lceil N/2^{k-1} \rceil$, be the corresponding scale-$k$ time series, where $\lceil x \rceil$ returns the smallest integer that is greater than or equal to $x$. We aggregate the time series from scale 1 to scale $k$ as

$$Y_k(i) = \sum_{j=1}^{L_k} \frac{1}{(L_k)^H} \, Y_1\big((i-1) 2^{k-1} + j\big),$$

where $L_k = 2^{k-1}$. From the above definition, we find that one observation $Y_1(i)$ on the finest scale is used only once to form the time series at scale $k$. In fact, to form observations at that scale, the finest scale time series is divided into non-overlapping sections, where each section has length (duration, or window size) $2^{k-1}$. The observations at scale $k$ are a function of the observations within each section. That is why we call this type of aggregation non-overlapping window aggregation. When $\{Y_1(i)\}$ does not contain any outliers, [27] showed that the marginal distributions of $\{Y_k(i)\}$ for different $i$ and $k$ are the same. Note that the approximation component of most (discrete) wavelet methods can be viewed as a generalization of this NOWA method.
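As an illustration of the NOWA definition, the following Python sketch builds the scale-$k$ series from a finest-scale series. The function name `nowa` is ours, and dropping a trailing incomplete block is our simplifying assumption (the definition above indexes up to a ceiling).

```python
def nowa(y1, k, hurst):
    """Scale-k series built from non-overlapping windows of length 2**(k-1)."""
    window = 2 ** (k - 1)
    scale = window ** hurst          # the (L_k)^H normalization
    n_blocks = len(y1) // window     # assumption: ignore an incomplete last block
    return [sum(y1[b * window:(b + 1) * window]) / scale
            for b in range(n_blocks)]

# Each finest-scale observation contributes to exactly one block per scale.
print(nowa([1.0, 2.0, 3.0, 4.0], k=2, hurst=1.0))  # [1.5, 3.5]
```

With `hurst=1.0` the normalization reduces to a plain block average, which makes the example easy to check by hand.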

B. Sliding window aggregation

Another important aggregation method is overlapping window aggregation, such as traditional kernel methods (see chapter 5 of [24] for a good introduction to kernel regression). In this paper, we consider a special one-sided sliding window method, called Sliding Window Aggregation (SWA), which is motivated by detection of network anomalies in real time.


Let $\{Y_1(i)\}$ be the time series at the finest scale (scale 1). The observation at time $i$ and scale $k$ is defined as

$$Y_k(i) = \sum_{j=0}^{L_k - 1} \frac{1}{(L_k)^H} \, Y_1(i - j),$$

where $L_k = 2^{k-1}$. These special weights are chosen to make $\{Y_k(i)\}$, for all $i$ and $k$, share the same marginal distribution when there are no outliers in $\{Y_1(i)\}$.
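The one-sided window can be sketched in Python as follows. The function name `swa` and the use of `None` for time locations where the window is not yet full are our conventions, not the paper's.

```python
def swa(y1, k, hurst):
    """Scale-k series from a one-sided sliding window of length 2**(k-1)."""
    window = 2 ** (k - 1)
    scale = window ** hurst              # the (L_k)^H normalization
    out = [None] * len(y1)               # None: window not yet full
    running = 0.0
    for i, value in enumerate(y1):
        running += value                 # newest observation enters the window
        if i >= window:
            running -= y1[i - window]    # oldest observation leaves the window
        if i >= window - 1:
            out[i] = running / scale
    return out

print(swa([1.0, 2.0, 3.0, 4.0], k=2, hurst=1.0))  # [None, 1.5, 2.5, 3.5]
```

Only past observations are used at each time location, which is what makes this form suitable for online detection.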

Note that the SWA defined above can be viewed as a special case of the one-sided kernel method (see [7] for a usage of one-sided kernels). Thus, SWA can be extended to allow use of a general one-sided kernel method for detection (as opposed to the uniform kernel method used here). If the detection is not required to be in real time, this method can also be modified to other forms, such as a symmetric sliding window. Thus, the usual kernel method can be adapted in this framework for outlier detection.

These two aggregation methods have a strong relationship. For example, at the 16th time slot, the observation vector generated by NOWA is the same as the vector from SWA. Thus, the test at each location designed for NOWA is very similar to that for SWA. In fact, the theoretical properties we prove in [27] are mainly based on SWA, while all the visualizations in this paper are based on NOWA, because the NOWA method provides a good interpretation, as discussed in Section III-D.

C. Real-time Algorithm

Note that the SWA method can be easily implemented as an iterative (over time) algorithm, and thus is well suited for anomaly detection in real time. When a new observation at the smallest time interval arrives, the observations of the current time point at all scales can be easily updated. Let $\mathbf{Y}(i) = (Y_1(i), Y_2(i), \dots, Y_k(i))^T$ be the current observation vector over different scales, and $Y_1(i+1)$ be the newly arrived observation at scale 1 (after normalization). For a simple illustration, let us assume the above aggregation is simply the summation of all the observations in the window. The update adds the new observation to every component and subtracts the observation that leaves each window:

$$\begin{pmatrix} Y_1(i+1) \\ Y_2(i+1) \\ \vdots \\ Y_k(i+1) \end{pmatrix} = \begin{pmatrix} Y_1(i) \\ Y_2(i) \\ \vdots \\ Y_k(i) \end{pmatrix} + Y_1(i+1) - \begin{pmatrix} Y_1(i) \\ Y_1(i-1) \\ \vdots \\ Y_1(i - 2^{k-1} + 1) \end{pmatrix}.$$

It is straightforward to see that the complexity of this update depends on the number of scales.
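A minimal Python sketch of this iterative update is given below. The class name `OnlineSWA` is hypothetical; the running sums follow the summation form of the update, and applying the $(L_k)^{-H}$ normalization when reporting values is our choice for consistency with the SWA definition.

```python
from collections import deque

class OnlineSWA:
    """Keeps one running window sum per scale and updates it per observation."""

    def __init__(self, num_scales, hurst):
        self.windows = [2 ** k for k in range(num_scales)]  # 1, 2, 4, ...
        self.hurst = hurst
        # Only the most recent 2**(num_scales - 1) observations are needed.
        self.history = deque(maxlen=self.windows[-1])
        self.sums = [0.0] * num_scales

    def update(self, new_value):
        for idx, window in enumerate(self.windows):
            # Observation leaving this window: Y1(i - window + 1), if it exists.
            leaving = self.history[-window] if len(self.history) >= window else 0.0
            self.sums[idx] += new_value - leaving
        self.history.append(new_value)
        seen = len(self.history)
        # Report None for scales whose window is not yet full.
        return [s / w ** self.hurst if seen >= w else None
                for s, w in zip(self.sums, self.windows)]

detector = OnlineSWA(num_scales=2, hurst=1.0)
for y in [1.0, 2.0, 3.0, 4.0]:
    latest = detector.update(y)
print(latest)  # [4.0, 3.5]
```

Each call touches one running sum per scale, so the per-observation work grows with the number of scales and not with the length of the trace.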

D. MRAD outlier map based on SWA

As discussed earlier, the outlier map defined in Section II visualizes the probability of each time location and scale to be an outlier. The interpretation of the map depends on the multi-scale aggregation method. The MRAD outlier map based on NOWA has been discussed in Section II (Figure 2). In this section, we discuss a similar map based on SWA. Again, hot colors (red) are used to show small p values (i.e., a higher chance of being outliers), and cool colors (blue) display large significance probabilities (i.e., a lower chance of being outliers). Note that the time series at scale $k$ generated by SWA does not have the first $2^{k-1} - 1$ observations, so we cannot perform tests in these regions. We display them in dark blue, the same color as for a significance test result of $p = 1$.

Fig. 4. The color MRAD map for the network byte counts data (sliding window aggregation) on April 10, 2002. It shows outliers at time locations around 3, 6.5 and 6.7 ($\times 10^5$), and scales from 10 to 15, by showing hotter regions.

For SWA, let us assume that the observation at time $i$ and scale $k$ is flagged as an outlier. This flag means that there may exist network anomalies at locations from $i - 2^{k-1} + 1$ to $i$ (i.e., the last $2^{k-1}$ time locations, the window size of aggregation) at the finest scale. For NOWA, the flag at time $i$ and scale $k$ means that the corresponding block (i.e., from $(i-1) \times 2^{k-1} + 1$ to $i \times 2^{k-1}$) at the finest scale contains outliers. Note that NOWA can involve information after time $i$. There is a characteristic pattern in this SWA MRAD outlier map: a single huge spike may also be flagged (i.e., thought of as a possible outlier) at larger scales. Note that this huge spike will first be flagged at a relatively fine scale, and then may be flagged later at a relatively coarse scale. Because the detection method based on SWA is actually iterative in time, we can either remove the detected anomalies or restart a new iterative procedure once one observation is flagged as an outlier. These treatments avoid having a map with the special pattern described above.

Figure 4 shows the MRAD outlier map for the whole trace based on SWA. It also highlights the three regions (locations around 3, 6.2, and 6.8 ($\times 10^5$), and scales around 15) which were discussed in Section II. However, compared to the NOWA map, these regions are moved slightly to the right, which reflects the effect discussed above.

To form time series at different scales, we need to standardize the original series before the aggregation. In our exploration, we use robust estimates of the mean (the median of the whole series) and of the standard deviation (1.4826 times the median absolute deviation [8]) to normalize the series.
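The robust standardization can be sketched in Python as follows; the function name `robust_standardize` is ours, and raising an error for a zero scale estimate is our choice for handling a degenerate series.

```python
import statistics

def robust_standardize(series):
    """Center by the median and scale by 1.4826 times the MAD."""
    center = statistics.median(series)
    mad = statistics.median([abs(x - center) for x in series])
    scale = 1.4826 * mad
    if scale == 0.0:
        raise ValueError("median absolute deviation is zero")
    return [(x - center) / scale for x in series]

# A single huge spike barely affects the median or the MAD.
z = robust_standardize([1.0, 2.0, 3.0, 4.0, 100.0])
print([round(v, 2) for v in z])
```

Because both estimates are medians, a few extreme observations do not inflate the scale the way they would inflate a sample standard deviation.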


IV. TEST METHOD AND STATISTICAL PROPERTIES

Assume that the background time series is an fGn, a common example of an LRD process, without any outliers. [27] shows that the marginal distributions of $\{Y_k(i)\}$ are standard Normal. The MRAD method proposes the rejection region

$$R : \max_k |Y_k(i)| > C^M_\alpha. \qquad (1)$$

At a given scale $k$, because the marginal distribution of each observation is $N(0, 1)$, a naive single-scale rejection region at time $i$ is $R : |Y_k(i)| > C_\alpha$. In [27], we proved that the threshold of the MRAD method, $C^M_\alpha$, is larger (i.e., more conservative) than the threshold based on a single scale, $C_\alpha$. Thus, at one given scale, the MRAD is more conservative than the naive method.
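The two rejection regions can be contrasted with a short Python sketch. The function names and the example vector are ours; 1.645 is the usual two-sided 10% critical value of $N(0,1)$, and 2.2015 is the Table I entry for $\alpha = 0.1$, $H = 0.9$ and 10 scales.

```python
def mrad_flag(scale_values, threshold):
    """Rejection region (1): flag when max_k |Y_k(i)| exceeds the threshold."""
    magnitudes = [abs(v) for v in scale_values if v is not None]
    return bool(magnitudes) and max(magnitudes) > threshold

def naive_flag(value, threshold):
    """Single-scale rejection region: flag when |Y_k(i)| exceeds the threshold."""
    return value is not None and abs(value) > threshold

# Hypothetical normalized observations at one time location over 10 scales.
y_i = [0.3, -1.1, 1.8, 0.9, -0.4, 1.2, 0.1, -0.7, 0.5, 1.0]
print(naive_flag(y_i[2], 1.645))  # True
print(mrad_flag(y_i, 2.2015))     # False
```

The example shows the conservativeness noted above: a value that a single-scale test at scale 3 would flag is not flagged once the threshold accounts for testing all scales at once.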

In [27], it has been shown that the power of a two-scale MRAD procedure (based on either NOWA or SWA) is larger than the average power of the naive outlier detection method based on the two scales, as stated in the following theorem. The two-scale procedure is sufficient to show this feature. We conjecture that this result also holds for MRAD procedures with more scales.

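Theorem 1: Let $\beta_{(1,L)} = P_1(\max_{l=1,L} |Y_l(i)| > C^M_\alpha)$, $\beta_1 = P_1(|Y_1(i)| > C_\alpha)$, and $\beta_L = P_1(|Y_L(i)| > C_\alpha)$. For any $\delta > 0$, there exist $\alpha_\delta > 0$ and $L_\delta > 0$ such that, when $\alpha \in (0, \alpha_\delta)$ and $L > L_\delta$, the following inequality holds:

$$\beta_{(1,L)} \geq \frac{\beta_1 + \beta_L}{2}.$$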

Remark: This theorem shows that for any level shift, when the significance level $\alpha$ is small and $L$ is large, the power of the two-scale MRAD is larger than the average power of the outlier detection method based on a single scale.

V. TEST THRESHOLDS

A critical factor in the effectiveness of MRAD when applied to anomaly detection is the choice of the test threshold, because it directly controls the ability to detect the most true outliers with the fewest false alarms. Theoretical thresholds developed in [27] are asymptotic as $m$, the number of scales, goes to infinity. In practical applications, however, a rather small number of scales (e.g. 15) is used, so the asymptotic thresholds are not precise enough. We developed instead an approximation method to determine the threshold based on three parameters: the significance level, $\alpha$, the Hurst parameter, $H$, and the number of scales, $m$.

When there are no outliers within the time series, [27] shows that the observations at one particular time location, over scales, form a stationary process. The autocovariance functions of these stationary processes, over different time locations, are the same. The marginal distributions of these observations are standard Normal. Thus, given the Hurst parameter, the significance level, and the number of scales, an exact test threshold can be developed based on the multivariate Normal distribution. It is difficult to get the test threshold explicitly, because of the complexity of calculations with the multivariate Normal. We developed a MATLAB function to approximate the exact threshold for MRAD based on simulation.

For example, let the number of scales be 10, the significance level be 0.1, and the Hurst parameter be 0.9; the covariance function of this 10-dimensional Normal random vector is then precisely defined. We simulate this Normal random vector 1000 times. For each vector, we find the maximum of the absolute values. The 90% sample quantile of these 1000 maximum values provides an estimate of the test threshold. In order to reduce the variability of this estimate, we repeat this estimating procedure 1000 times. The average of these quantiles is a better estimate, and their variability provides a measure of the error margin of this estimate.
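The simulation described above can be sketched in Python. This is not mradtestthreshold.m: the function names are ours, and the covariance across scales is our assumption, derived for nested one-sided windows ending at the same time location over standard fGn, namely $\frac{1}{2}(a^{2H} + b^{2H} - |a-b|^{2H})/(ab)^H$ for window sizes $a$ and $b$. The default number of repeats is also much smaller than the 1000 used above, to keep the run short.

```python
import math
import random

def scale_covariance(num_scales, hurst):
    """Assumed covariance of (Y_1(i), ..., Y_m(i)) for nested windows over fGn."""
    windows = [2 ** k for k in range(num_scales)]
    two_h = 2.0 * hurst
    return [[0.5 * (a ** two_h + b ** two_h - abs(a - b) ** two_h) / (a * b) ** hurst
             for b in windows] for a in windows]

def cholesky(matrix):
    """Lower-triangular factor of a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][t] * lower[j][t] for t in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower

def mrad_threshold(num_scales, hurst, alpha, num_draws=1000, num_repeats=20, seed=0):
    """Average over repeats of the (1 - alpha) quantile of max_k |Y_k|."""
    rng = random.Random(seed)
    lower = cholesky(scale_covariance(num_scales, hurst))
    index = min(num_draws - 1, math.ceil((1.0 - alpha) * num_draws) - 1)
    quantiles = []
    for _ in range(num_repeats):
        maxima = []
        for _ in range(num_draws):
            z = [rng.gauss(0.0, 1.0) for _ in range(num_scales)]
            y = [sum(lower[i][j] * z[j] for j in range(i + 1))
                 for i in range(num_scales)]
            maxima.append(max(abs(v) for v in y))
        maxima.sort()
        quantiles.append(maxima[index])
    return sum(quantiles) / num_repeats

print(round(mrad_threshold(10, 0.9, 0.1), 2))
```

If the covariance assumption matches the one used by the authors, the printed value should land close to the corresponding Table I entry; the spread of the per-repeat quantiles could be used in the same loop to report an error margin.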

Table I provides a partial list of the computed thresholds, and their corresponding 95% error margins, under different combinations of parameters: the significance level, the Hurst parameter, and the number of scales in the MRAD method. The MATLAB function, mradtestthreshold.m, available at [26], calculates the thresholds.

| α | H | number of scales | threshold | 95% error margin |
|---|---|---|---|---|
| 0.1 | 0.5 | 10 | 2.4561 | 0.0024 |
| 0.1 | 0.7 | 10 | 2.3899 | 0.0025 |
| 0.1 | 0.9 | 10 | 2.2015 | 0.0027 |
| 0.1 | 0.99 | 10 | 1.8656 | 0.0029 |
| 0.1 | 0.5 | 11 | 2.4915 | 0.0024 |
| 0.1 | 0.7 | 11 | 2.4258 | 0.0024 |
| ... | ... | ... | ... | ... |

TABLE I: THE TEST THRESHOLDS AND THE 95% ERROR MARGIN, UNDER DIFFERENT COMBINATIONS OF SIGNIFICANCE LEVEL, α, HURST PARAMETER, H, AND NUMBER OF SCALES, m.

We compared the relationships between the Hurst parameter, the number of scales and the test threshold. We found that for a given number of scales and a given significance level, the test threshold decreases as the Hurst parameter increases. Meanwhile, when the Hurst parameter and the significance level are fixed, the threshold increases as the number of scales increases. In fact, this threshold will approach the asymptotic threshold as the number of scales tends to infinity. See more discussion in [27].

VI. METRICS FOR EVALUATION

In this paper, we use the following metrics to evaluate and compare our MRAD method with other existing methods: Detected Outlier Rate (DOR), True Discovery Rate (TDR), False Discovery Rate (FDR) and False Negative Rate (FNR). Another metric, True Outlier Proportion (TOP), is also reported. Note that we are using the term “discovery” to mean declared an outlier (positive) and “negative” to mean declared a regular observation. By using the classical FDR table (from [3]) below, we can define these metrics explicitly using mathematical notation.

| | Declared non-outlier | Declared outlier | Total |
|---|---|---|---|
| regular observations | U | V | m0 |
| true outliers | T | S | m − m0 |
| Total | m − R | R | m |

TABLE II: THE CLASSICAL FDR DEFINITION TABLE.

In this table, $U$ is the number of regular observations declared (correctly) as not being outliers (true negatives); $V$ is the number of regular observations declared (incorrectly) as outliers (false positives); $T$ is the number of outliers declared (incorrectly) as regular observations (false negatives); $S$ is the number of outliers declared (correctly) as outliers (true positives); and $R$ is the total number of observations declared as outliers.

Based on this notation, we can formally define the above metrics.

TOP: the true outlier proportion, $(m - m_0)/m$. Note that this is not a random variable.

DOR: the detected outlier rate, $E(R/m)$, i.e., the average proportion of declared outliers among all observations.

TDR: the true discovery rate, $E(S/(m - m_0))$, i.e., the average proportion of declared outliers among the true outliers.

FDR: the average false discovery rate, $E(V/R)$, i.e., among all the detected outliers, the average proportion of those that were regular observations.

FNR: the average false negative rate, $E(T/(m - R))$, i.e., among all the observations declared not to be outliers, the average proportion of those that were true outliers.
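For a single realization, these ratios can be computed from the true labels and the declared flags as in the Python sketch below; averaging the returned values over many realizations would approximate the expectations. The function name and the convention of returning 0 when a denominator is zero are our assumptions.

```python
def evaluation_metrics(is_outlier, is_declared):
    """Per-realization TOP, DOR, TDR, FDR and FNR from true labels and flags."""
    pairs = list(zip(is_outlier, is_declared))
    m = len(pairs)
    s = sum(1 for truth, flag in pairs if truth and flag)      # true positives
    v = sum(1 for truth, flag in pairs if not truth and flag)  # false positives
    t = sum(1 for truth, flag in pairs if truth and not flag)  # false negatives
    r = s + v                                                  # declared outliers
    true_outliers = s + t                                      # m - m0
    return {
        "TOP": true_outliers / m,
        "DOR": r / m,
        "TDR": s / true_outliers if true_outliers else 0.0,
        "FDR": v / r if r else 0.0,
        "FNR": t / (m - r) if m - r else 0.0,
    }

truth = [True, True, False, False, False]
flags = [True, False, True, False, False]
print(evaluation_metrics(truth, flags))
```

In this toy example one of the two true outliers is detected and one of the two declared outliers is a false positive, so TDR and FDR are both 0.5.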

VII. SIMULATIONS AND APPLICATIONS IN NETWORK INTRUSION DETECTION

In this section, we use several synthetic time series (Sections VII-A, VII-B), semi-experiments (Section VII-C), and a real trace (Section VII-D) to further illustrate the usefulness of the MRAD method. All the figures and MATLAB movies in this section are produced from the MATLAB function, MRADvisual.m, which can be downloaded from [26].

A. Fractional Gaussian noise and local mean level shift

We generate several synthetic time series to evaluate the MRAD method. In these time series, the background time series are simulated from fGn with Hurst parameters ranging from 0.70 to 0.95. As noted by the UNC Internet Data Study Group [12], the Hurst parameters for packet-count and byte-count time series are usually close to 0.90, so our results provide a good approximation to those network features.

In the following analysis, we use standard fGn (mean 0, standard deviation 1) as the background series, and the outliers are injected as a local mean level-shift. There are several parameters that need to be set: the Hurst parameter ($H$) of the fGn; the length of the whole series ($N$); and the starting time ($a$), duration ($K$), and intensity (or mean, $\delta$) of the local level-shift. The length of the background time series, $N$, is set to be $2^{15}$ (this roughly approximates the number of observations in a one-hour traffic trace recorded at 100ms intervals), and the Hurst parameter is set to be 0.90. The intensity ($\delta$) of the level shift is chosen as 1, which means that the mean of the level shift is the same as the standard deviation of the background time series. More challenging (i.e. smaller) intensities are investigated in Section VII-B. The starting point of the local level shift, $a$, is randomly simulated from a uniform distribution, $U[0, 2^{14}]$, i.e., the starting point falls into the first half of the trace. The duration of the level shift, $K$, is simulated from an exponential distribution with mean duration 4000 observations (approximating over 6 minutes of time for 100ms recording intervals). In terms of network anomaly detection, these settings correspond to attacks of rather long duration but relatively low intensity, which are quite challenging to detect by other methods. Note that we chose the duration and intensity of the outlier interval as reasonable approximations. The true distribution of traffic intensities and the true distribution of the durations in a particular type of attack are interesting future research topics.

| a | K | δ | TOP (%) | DOR (%) | TDR (%) | FDR (%) | FNR (%) |
|---|---|---|---|---|---|---|---|
| 5644 | 6465 | 1 | 19.73 | 20.80 | 94.64 | 5.36 | 0.08 |

TABLE III: EVALUATION OF THE MRAD METHOD BASED ON SIMULATION. IN THESE TIME SERIES, WE INJECTED OUTLIERS (I.E., A MEAN LEVEL SHIFT) AT 19.73% OF THE LOCATIONS. AMONG THE FLAGGED LOCATIONS, AROUND 5% ARE FALSE POSITIVES, AND WE DETECTED APPROXIMATELY 95% OF THE TRUE OUTLIERS.
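The injection of a local mean level-shift can be sketched in Python as follows. The function name is ours, and the background here is i.i.d. standard Normal noise (fGn with $H = 0.5$) used only as a stand-in, since simulating fGn with $H = 0.9$ requires a dedicated generator that is outside the scope of this sketch.

```python
import random

def inject_level_shift(background, start, duration, intensity):
    """Add a constant shift on [start, start + duration); return series and labels."""
    end = min(start + duration, len(background))
    shifted = list(background)
    for i in range(start, end):
        shifted[i] += intensity
    labels = [start <= i < end for i in range(len(background))]
    return shifted, labels

rng = random.Random(1)
n = 2 ** 15
# Stand-in background: white noise, i.e. fGn with H = 0.5 (assumption).
background = [rng.gauss(0.0, 1.0) for _ in range(n)]
start = rng.randint(0, 2 ** 14)                           # a ~ U[0, 2^14]
duration = max(1, round(rng.expovariate(1.0 / 4000.0)))   # K ~ Exp(mean 4000)
series, labels = inject_level_shift(background, start, duration, intensity=1.0)
print(start, duration, round(100.0 * sum(labels) / n, 2))
```

With the particular setting reported in Table III (start 5644, duration 6465, length $2^{15}$), the labeled proportion is 6465/32768, which reproduces the TOP value of 19.73%.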

We performed a number of simulation experiments generating synthetic time series but show here only one representative example (more simulation results similar to these are discussed in [27]). For each experiment we generated a random starting point and a random duration, and selected an intensity value for the local mean level-shift. We then generated 100 instances of a time series from an fGn process with a selected Hurst parameter. We added the same level-shift to each of the 100 time series and performed the MRAD detection method on all 100 realizations. This experimental design was chosen to evaluate MRAD’s ability to detect a given “attack” in the face of arbitrary “normal” traffic. For example, consider one particular simulated setting: the starting point of the level shift is 5644, the duration is 6465, the intensity is 1, and the Hurst parameter is $H = 0.9$. Figure 5 displays the results of one realization among these 100 simulated traces. The top panel plots the time series containing the injected level-shift. It is rather difficult to conclude from visual inspection whether there are outliers or where they are. The bottom panel shows the MRAD outlier map (based on NOWA). It highlights an area at locations from 5,500 to 12,000 (a relatively hotter zone), and scales from 6 to 10. This NOWA outlier map suggests that the corresponding locations at the finest scale contain anomalies. These identified locations are close to the locations where we injected the level shift.

From all 100 time series, we can calculate the averages of our metrics across the entire set. The results for the above simulation set are listed in Table III.

Table III shows that the true outlier proportion is around 20% of the total observations, and our method flags around 21% (on average). Among all the true outliers, we detected 95% of them (on average). Among all declared outliers, about 5% are false positives. In addition, among all the non-declared observations, fewer than 0.1% are false negatives. These numbers show that our method is good at detecting level shifts.

Fig. 5. The color MRAD outlier map based on NOWA for one simulation trace, which is fractional Gaussian noise plus a given local mean level shift. This map highlights the injected level shift by showing a hotter region around the time interval (5,500, 12,000) and scales from 6 to 10.

B. Comparisons between the MRAD method and single-scale methods

An important issue to investigate is to what extent we could have done just as well at outlier detection by aggregating to only a single scale instead of performing aggregations at 10 or 15 scales. Of course, it would be extremely difficult to select the appropriate scale a priori, since the duration or intensity of the next intrusion generally cannot be anticipated. We used simulation to compare the MRAD method with a naive single-scale method at each scale from the range of scales we used with MRAD. We simulate an fGn series as the background, with different level shifts, to compare the performance of the naive methods and the MRAD method. Note that the detection thresholds of the naive method and the MRAD method are different.
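A naive single-scale detector of the kind compared here can be sketched as follows. The sketch assumes dyadic scales, so that scale j uses a bin size of 2^(j-1) (scale 5 gives 16 and scale 9 gives 256), and uses the fGn property that a sum of m unit-variance terms has standard deviation m^H. The threshold `z_crit = 3.0` and the function names are hypothetical; the thresholds actually used are not reproduced in this section.

```python
import numpy as np

def aggregate(x, bin_size):
    """Sum x over non-overlapping windows of length bin_size (tail dropped)."""
    n_bins = x.size // bin_size
    return x[:n_bins * bin_size].reshape(n_bins, bin_size).sum(axis=1)

def single_scale_flags(x, scale, hurst, z_crit=3.0):
    """Flag the original time points whose bin at one dyadic scale is extreme.

    Assumes x has already been standardized to zero mean and unit variance.
    """
    bin_size = 2 ** (scale - 1)        # scale 1 -> 1, scale 5 -> 16, scale 9 -> 256
    sums = aggregate(x, bin_size)
    z = sums / bin_size ** hurst       # sd of a sum of m fGn terms is m**H
    bin_flags = np.abs(z) > z_crit
    flags = np.zeros(x.size, dtype=bool)
    flags[:bin_flags.size * bin_size] = np.repeat(bin_flags, bin_size)
    return flags

# Noise-free toy series: a level shift of height 2 over 256 points
toy = np.zeros(1024)
toy[256:512] = 2.0
fine = single_scale_flags(toy, scale=1, hurst=0.9)    # bin size 1
coarse = single_scale_flags(toy, scale=9, hurst=0.9)  # bin size 256
```

In this toy case the shift stays below the threshold at the finest scale but exceeds it after aggregation to bins of 256, which is the effect that motivates testing at more than one scale.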

In Tables IV and V, we simulate fGn with Hurst parameter H = 0.90. The duration of the level shift is assumed to be exponentially distributed, with three different means λ (in number of observations): 40, 400, and 4000. The intensity of the level shift, δ, is also simulated with three different values: 0.5, 1, and 2. For each combination of λ and δ, we simulate 100 different realizations. Table IV shows the median FDRs, and Table V provides the median TDRs under the different combinations. The tables show that at some scales the naive method has a smaller FDR (TDR) than the MRAD method, and at other scales it has a larger FDR (TDR). In fact, the best performance in terms of FDR is expected to be at the scale that is close to the mean duration. However, the duration of an attack generally cannot be known ahead of time. Our method is expected to have better average performance than any single-scale detection method, as shown in Tables IV and V. Theorem 1 shows that the power of the MRAD method is larger than the average power over scales, as discussed earlier. These two tables also suggest that when the intensity of the level shift increases, the FDR decreases and the TDR increases.

H     λ      δ     MRAD      Scale 1   Scale 5   Scale 9
                             (1)       (16)      (256)
0.9   40     0.5   100.00%   100.00%   100.00%   100.00%
             1     100.00%    99.59%   100.00%   100.00%
             2      95.76%    96.58%    90.91%   100.00%
      400    0.5   100.00%    99.17%   100.00%   100.00%
             1      97.81%    95.99%    90.91%   100.00%
             2      55.64%    74.42%    50.00%    63.09%
      4000   0.5    64.85%    61.68%    47.62%    33.33%
             1      13.88%    34.95%    17.39%    16.67%
             2       5.83%    11.46%     5.15%     6.54%

TABLE IV. Median FDR comparisons between the single-scale detection methods and the MRAD method. The numbers in parentheses are the aggregation bin sizes at the corresponding scales.

H     λ      δ     MRAD      Scale 1   Scale 5   Scale 9
                             (1)       (16)      (256)
0.9   40     0.5     0.00%     0.00%     0.00%     0.00%
             1       0.00%     1.24%     0.00%     0.00%
             2      38.57%    28.04%    44.16%     0.00%
      400    0.5     0.00%     0.62%     0.00%     0.00%
             1       1.57%     3.88%     5.49%     0.00%
             2      89.93%    32.54%    63.56%    54.77%
      4000   0.5     2.85%     2.55%     3.74%     7.19%
             1      20.59%     7.70%    14.08%    24.33%
             2      97.03%    36.12%    67.68%    93.97%

TABLE V. Median TDR comparisons between the single-scale detection methods and the MRAD method.

C. Semi-experiments

In this subsection, we use several semi-experiments to illustrate the usefulness of the MRAD method. Because labeled data traces are scarce in the field of statistical anomaly detection, semi-experiments are a useful test procedure for an anomaly detection method.

The semi-experiment studied here for anomaly detection evaluation can be summarized as follows:
• Collect a trace from the Internet,
• Simulate some specific types of network anomalies in a laboratory environment and trace them,
• Combine these two traces (by merging the traced anomaly packets into the Internet trace) and then use detection methods to identify the injected anomalies.

Fig. 6. The color MRAD outlier map for the packet count series of the semi-experiment example. It highlights two possible outlying regions: around 100,000 and around 300,000.

In the following analysis, we collected a real trace from the UNC campus Internet link with a duration of three hours, and removed those network flows without either the starting point or the end point within the trace. The remaining flows are treated as the “normal” traffic. We use the center one-hour trace as the background traffic to obtain an approximately stationary process. The network anomalies simulated in the lab include port scans, rose attacks, and TCP SYN flood attacks. In this section, we analyze one example of a port scan as the injected anomaly. More examples are available at [26]. Note that because the “normal” trace contains all packets from the Internet link, the MRAD method might also flag some anomalies already present in the background trace. After combining the anomaly trace with the background trace, we computed the packet, byte, and flow counts per 10 ms time interval (the same as in the example discussed in Section II). As discussed earlier, the background trace used here lasts for 1 hour. The port scan simulated for this example lasted 6 minutes, i.e., 10% of the total trace.
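The combining and counting step can be sketched as follows, assuming each trace is reduced to an array of packet timestamps in seconds measured from a common origin. The function name and the toy timestamps are hypothetical. With a 10 ms bin width, a one-hour trace yields 360,000 bins.

```python
import numpy as np

def packet_counts(timestamps, bin_width, duration):
    """Count packets in consecutive bins of bin_width seconds."""
    n_bins = int(round(duration / bin_width))
    idx = np.floor(np.asarray(timestamps) / bin_width).astype(int)
    idx = idx[(idx >= 0) & (idx < n_bins)]   # drop packets outside the trace
    return np.bincount(idx, minlength=n_bins)

# Toy timestamps in seconds; real traces would hold millions of packets
background = np.array([0.001, 0.004, 0.012, 0.025, 0.031])
anomaly = np.array([0.013, 0.017, 0.026])

# Merge the injected anomaly packets into the background trace
merged = np.sort(np.concatenate([background, anomaly]))
counts = packet_counts(merged, bin_width=0.010, duration=0.040)
```

Byte counts could be obtained in the same way by passing packet sizes through the `weights` argument of `np.bincount`.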

A port scan is typically a series of small packets, which are intended to learn which computer network services, each associated with a port number, the target computer provides. The port scan provides information for the attackers as to where to probe for system weaknesses. Note that a high-frequency port scan will dramatically increase the number of flows observed (i.e., a huge local mean level shift) and also increases the number of packets and bytes to a certain degree. If the probing packets are really small, the increase in byte counts will typically be dominated by the variability of the normal traffic. If the probing frequency is low enough, the increase in packet counts will also be dominated by the variability of the normal traffic. Thus, these anomalies are not easily detectable in the series of packet counts and byte counts. We show here only the results for packet counts. In our analysis, a medium frequency was used to send out the small probing packets. See [13] for more about detection methods and characterization of port scan attacks.

Fig. 7. Two snapshots of the color MRAD zoomed movie for this motivating example. These two panels highlight the starting point (left panel) and the end point (right panel) of a long-duration level shift.

The top panel in Figure 6 shows the time series plot of the packet-count trace. It is hard to tell whether there are network anomalies, and to identify the locations of the anomalies. The estimated Hurst parameter for this trace is 0.95, which shows a strong long range dependent structure. We use this estimate (0.95) of the Hurst parameter, and perform an MRAD procedure based on NOWA. The bottom panel in Figure 6 is the MRAD outlier map based on NOWA. It highlights two hotter zones, around 100,000 and around 300,000. In fact, the zone around 100,000 corresponds to the location where we injected the network anomaly. We conjecture that the anomalies around 300,000 are actual anomalies within the original Internet trace, and we are working on identifying these anomalies.
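The text does not say which estimator produced the Hurst estimate of 0.95. As one illustration only, the sketch below uses the aggregated-variance method, a common but crude choice: for a long range dependent series, the variance of block means of size m decays roughly like m^(2H-2), so H can be read off the slope of a log-log regression. The function name and block sizes are hypothetical.

```python
import numpy as np

def hurst_aggregated_variance(x, block_sizes):
    """Estimate the Hurst parameter from the decay of block-mean variances."""
    x = np.asarray(x, dtype=float)
    log_m, log_var = [], []
    for m in block_sizes:
        n_blocks = x.size // m
        means = x[:n_blocks * m].reshape(n_blocks, m).mean(axis=1)
        log_m.append(np.log(m))
        log_var.append(np.log(means.var(ddof=1)))
    slope = np.polyfit(log_m, log_var, 1)[0]  # slope is approximately 2H - 2
    return 1.0 + slope / 2.0

# Sanity check on white noise, for which H = 0.5
rng = np.random.default_rng(1)
white_noise = rng.standard_normal(2 ** 14)
h_white = hurst_aggregated_variance(white_noise, [2 ** j for j in range(8)])
```

This estimator is not robust to level shifts, so a series containing anomalies may give an inflated estimate.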

D. More analysis of a real trace

In this subsection, we continue to analyze the real network trace studied in Section II, using a zoomed version of the outlier map and a MATLAB movie of the local outlier map.

The MRAD outlier map for the whole trace is in Figure 2. Note that the outlier map has a lower resolution than the number of observations in the original time series. Thus it is hard to find some outliers directly from this map (for example, one single outlier at the finest scale). In this section, we develop zoomed versions of the MRAD outlier map (including the thresholded outlier map) to find additional insights from this data set.

Figure 7 shows two snapshots from the MRAD zoomed outlier movie. When the time series has a large number of observations, such that the resolution of the MRAD outlier map is relatively low, it is natural to view the map locally. The MRAD zoomed outlier movie visualizes the whole map in a horizontally sliding window, from the first group of observations to the last group of observations. Within each group, the zoomed map has a better resolution and reveals more insights about the time series. The MATLAB function mradvisual.m can generate this movie with appropriate options. The interpretation of each frame is the same as that of the whole outlier map, which has been discussed in Section II.
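The sliding-window idea behind the zoomed movie can be sketched in a few lines. This is not the mradvisual.m implementation; it assumes only that the outlier map is stored as a two-dimensional array with one row per scale and one column per time location, and the function name, window width, and step are hypothetical.

```python
import numpy as np

def sliding_windows(outlier_map, width, step):
    """Yield (start, frame) pairs, where frame holds the columns
    start .. start + width - 1 of the map, for all scales."""
    n_cols = outlier_map.shape[1]
    for start in range(0, max(n_cols - width, 0) + 1, step):
        yield start, outlier_map[:, start:start + width]

# Toy map with 3 scales and 10 time locations
demo_map = np.arange(30).reshape(3, 10)
frames = list(sliding_windows(demo_map, width=4, step=3))
```

Each frame could then be drawn with the same color scale as the full map, so that colors remain comparable across frames. If the step does not divide the number of remaining columns, the last few columns are not covered.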

The two panels in Figure 7 highlight the mean level shift at around time 300,000, which has been discussed in Section II. The left panel highlights the starting point of this level shift, which is around 275,000 in the original time series. At some locations, the map shows a relatively cooler color (blue) at finer scales, but hotter colors (yellow or red) at coarser scales. This suggests that some locations are not shown as anomalies at one scale, but will be flagged at another scale. Viewing multiple scales boosts the (low) intensity of the level shift and helps to identify network anomalies. The right panel shows features similar to those in the left panel. Note that these two panels also suggest that this group of network anomalies starts gradually but ends rather rapidly. In the left panel around 275,000, the color changes gradually from cooler regions to hotter regions. In the right panel around 306,000, the color change from hotter to cooler is more like a jump. These changes are also visible in the zoomed time series plots in the top panels.

VIII. CONCLUSION AND DISCUSSION

The above examples and the theoretical properties in [27] show that the MRAD method helps to identify outliers, especially when outliers are in the form of a slight mean level shift, even when the level shift is not obvious in the original time scale. We also showed that the MRAD method uses a more conservative threshold, and has larger power on average than detection methods based on a single scale.

In this research, we assume that the Hurst parameter is known, or can be correctly estimated from the time series, and we use one type of robust procedure to standardize the original time series. It is natural to explore or develop a robust estimator of the Hurst parameter and a better robust standardization procedure for long range dependent time series.

The rejection region, as defined in (1), takes into account multiple comparisons at different scales. However, we did not consider multiple comparisons across time. We plan to incorporate multiple testing methods in the time direction, for example by controlling the FDR [3].
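For reference, the Benjamini-Hochberg step-up procedure of [3] can be sketched as below. This is not part of the MRAD method as described here; it only illustrates the kind of FDR control that could be applied to per-location p values. The procedure is stated for independent (or suitably positively dependent) tests, so applying it to a long range dependent series would need further justification. The function name and the example p values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m   # i * q / m for rank i
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest rank meeting its threshold
        reject[order[:k + 1]] = True           # reject all smaller p values too
    return reject

example_p = [0.039, 0.001, 0.216, 0.008, 0.041]
example_reject = benjamini_hochberg(example_p, q=0.05)
```

In the example, only the two smallest p values (0.001 and 0.008) fall below their rank-based thresholds, so only those two hypotheses are rejected.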

The aggregation method used in this paper is natural and simple. It makes sense to explore other aggregation methods, such as wavelet methods, kernel methods, etc. Other types of outliers should also be explored. It is also natural to consider other types of long range dependent time series as background, such as fractional ARIMA and stable processes. More challenging research remains to be done on this aspect.

ACKNOWLEDGEMENT

We gratefully thank Jeff Terrell for collecting the semi-experiment traces and simulating the anomalies. Thanks also go to Professor Richard Smith for his suggestions in developing theoretical properties of the MRAD method.

REFERENCES

[1] Paul Barford, Jeffery Kline, David Plonka, and Amos Ron. A signal analysis of network traffic anomalies. In Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement, pages 71–82, 2002.

[2] Vic Barnett and Toby Lewis. Outliers in Statistical Data. John Wiley & Sons, 3rd edition, 1994.

[3] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1):289–300, 1995.

[4] G. E. P. Box and G. C. Tiao. Intervention analysis with applications to economic and environmental problems. Journal of the American Statistical Association, 70(349):70–79, 1975.

[5] D. Brauckhoff, B. Tellenbach, A. Wagner, M. May, and A. Lakhina. Impact of packet sampling on anomaly detection metrics. In 6th ACM SIGCOMM Conference on Internet Measurement, Rio de Janeiro, Brazil, October 2006.

[6] Ih Chang, George C. Tiao, and Chung Chen. Estimation of time series parameters in the presence of outliers. Technometrics, 30(2):193–204, 1988.

[7] Irene Gijbels, Peter Hall, and Aloïs Kneip. On the estimation of jump points in smooth curves. Annals of the Institute of Statistical Mathematics, 51:231–251, 1999.

[8] Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, 1986.

[9] D. M. Hawkins. Identification of Outliers. Chapman and Hall, 1980.

[10] K. Ishibashi, R. Kawahara, T. Mori, T. Kondoh, and S. Asano. Effect of sampling rate and monitoring granularity on anomaly detectability. In 10th IEEE Global Internet Symposium 2007, Anchorage, AK, May 2007.

[11] Anukool Lakhina, Mark Crovella, and Christophe Diot. Diagnosing network-wide traffic anomalies. In ACM SIGCOMM Computer Communication Review, volume 34, pages 219–230, 2004.

[12] Long Le and Felix Hernandez-Campos. UNC network data analysis study group: Summary page for LRD project, 2004. Website available at http://www-dirt.cs.unc.edu/net lrd/.

[13] Cynthia Bailey Lee, Chris Roedel, and Elena Silenok. Detection and Characterization of Port Scan Attacks, 2003.

[14] Wenke Lee, Salvatore Stolfo, and Kui Mok. A data mining framework for building intrusion detection models. In Proceedings of the 1999 IEEE Symposium on Security and Privacy, pages 120–132, 1999.

[15] Will E. Leland, Murad S. Taqqu, Walter Willinger, and Daniel V. Wilson. On the self-similar nature of Ethernet traffic (extended version). IEEE/ACM Transactions on Networking, 2(1):1–15, 1994.

[16] Jianning Mai, Chen-Nee Chuah, Ashwin Sridharan, Tao Ye, and Hui Zang. Is sampled data sufficient for anomaly detection? In 6th ACM SIGCOMM Conference on Internet Measurement, Rio de Janeiro, Brazil, October 2006.

[17] John McHugh. Intrusion and intrusion detection. International Journal of Information Security, 1(1):14–35, 2001.

[18] R. T. Ogden. Essential Wavelets for Statistical Applications and Data Analysis. Boston: Birkhäuser, 1997.

[19] V. Paxson. Bro: A system for detecting network intruders in real-time. Computer Networks, 31:2435–2463, 1999.

[20] Martin Roesch. Snort - lightweight intrusion detection for networks. In Proceedings of LISA 99: 13th Systems Administration Conference, pages 229–238, 1999.

[21] Salvatore Stolfo, Wenke Lee, Philip Chan, Wei Fan, and Eleazar Eskin. Data mining-based intrusion detectors: An overview of the Columbia IDS project. In ACM SIGMOD Record, volume 30, pages 5–14, 2001.

[22] Ruey S. Tsay. Outlier, level shifts, and variance changes in time series. Journal of Forecasting, 7:1–20, 1988.

[23] Brani Vidakovic. Statistical Modeling By Wavelets. Wiley, 1999.

[24] M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman and Hall, 1995.

[25] Walter Willinger, Murad S. Taqqu, Robert Sherman, and Daniel V. Wilson. Self-similarity through high-variability: Statistical analysis of Ethernet LAN traffic at the source level. IEEE/ACM Transactions on Networking, 3(1):71–86, 1997.

[26] Lingsong Zhang. MultiResolution Anomaly Detection programs, images and movies, 2006. Available at http://www.unc.edu/~lszhang/research.

[27] Lingsong Zhang. Functional Singular Value Decomposition and MultiResolution Anomaly Detection. PhD thesis, University of North Carolina at Chapel Hill, 2007.
