
Predictive Models for Proactive Network Management: Application to a Production Web Server

Dongxu Shen
Rensselaer Polytechnic Institute
Troy, NY 12180
[email protected]

Joseph L. Hellerstein
IBM T. J. Watson Research Center
Hawthorne, NY 10598
[email protected]

    Abstract

Proactive management holds the promise of taking corrective actions in advance of service disruptions. Achieving this goal requires predictive models so that potential problems can be anticipated. Our approach builds on previous research in which HTTP operations per second are studied in a web server. As in this prior work, we model HTTP operations as two subprocesses: a (deterministic) trend subprocess and a (random but stationary) residual subprocess. Herein, the trend model is enhanced by using a low-pass filter. Further, we employ techniques that reduce the required data history, thereby reducing the impact of changes in the trend process. As in the prior work, an autoregressive model is used for the residual process. We study the limits of the autoregressive model in the prediction of network traffic. Then we demonstrate that long-range dependencies remain in the residual process even after autoregressive components are removed, which impacts our ability to predict future observations. Last, we analyze the validity of the assumptions employed, especially the normality assumption.

    Keywords

Network traffic prediction, traffic modeling, AR model, network management, web server.

    1 Introduction

Network robustness is a central concern for information service providers. Network failures will, at a minimum, cause user inconvenience. Often, failures have


a substantial financial impact, such as loss of business opportunities and customer dissatisfaction. Forecasting service level violations offers the opportunity to take corrective actions in advance of service disruptions. For example, predicting excessive web traffic may prompt network managers to limit access to low priority web sites.

Forecasting is common in areas as diverse as weather prediction [3] and projections of future economic performance [2]. However, not enough has been done on network traffic; herein, we address this area. Moreover, due to the burstiness and non-stationarities of network traffic, there are inherent limits to the accuracy of predictions.

Despite these difficulties, we are strongly motivated to develop predictive models of networked systems, since having such models will enable a new generation of proactive management technologies. There has been considerable work in the area of fault detection [6, 7], and there are some efforts in proactive detection [5, 8].

Our work can be viewed as a sequel to [8]. That paper studies the number of hypertext transfer protocol (HTTP) operations per second. The data are detrended using a model of daily traffic patterns that considers time-of-day, day-of-week, and month, and an autoregressive model is used to characterize the residual process. Here we use the same dataset and variable. Also, as in the prior work, we employ separate models of the non-stationary (trend) and stationary components of the original process. However, we go beyond [8] in the following ways:

1. We provide a different way to model the trend so that less historical data are needed.

2. In obtaining the trend model, we use a low pass filter to extract more structure from the data.

3. The assumptions underlying the residual process are examined in more detail, especially the assumption of a Gaussian distribution.

4. Both the current and prior work use autoregressive (AR) models for forecasting. In the current work, we shed light on the limits of AR-based predictions.

5. We show that long-range dependencies remain in the data even after the trend and AR components are removed. This situation poses challenges for prediction.

The remainder of this paper is organized as follows. Section 2 presents an overview of the problem. Section 3 models the non-stationary behavior. The residual process is modeled in Section 4. Evaluation of the whole modeling scheme is provided in Section 5. Conclusions are given in Section 6.


Figure 1: Plot of HTTP operations for three weeks. X axis: day. Y axis: number of operations.

    2 Problem Overview

The dataset we use is from a production web server at a large computer company, using the collection facility described in [4]. The variable we chose to model is the number of HTTP operations per second performed by the web server. Data are collected at five-minute intervals, for a total of 288 intervals in a day.

Figure 1 plots three weeks of HTTP operations. From the plot, we notice that there is an underlying trend for every week. On each workday morning, the HTTP operations increase as people arrive at work. Soon afterwards, the HTTP operations peak and remain at that level for most of the day. In late afternoon, values return to lower levels. On weekends, usage remains low throughout the day.

Clearly, HTTP operations per second is non-stationary in that the mean changes with time-of-day and day-of-week. However, the figure suggests that the same non-stationary pattern is present from week to week (or that it changes slowly). Thus, we can view the full process as having a stable mean for each time of the day, along with fluctuations that are modeled by a random variable with a zero mean. We treat the means as a separate subprocess, which we refer to as the trend subprocess. By subtracting the trend from the original data, we obtain the residual process, which is assumed to be stationary.

Our prediction scheme is structured as follows. First, the trend process is estimated from the original process. Next, the residual process is constructed by subtracting the trend from the original process. Then, prediction is done on the residual process. Finally, the trend is added back to produce the ultimate prediction. The modeling required to support these steps is divided into two parts: modeling of the non-stationary (trend) subprocess and modeling of the residual subprocess.
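To make the scheme concrete, the following Python sketch implements the four steps, assuming the weekly trend estimate of Section 3 and the AR(2) coefficients of Section 4 are already available. The function name and argument layout are illustrative, not part of the original system.

```python
import numpy as np

def forecast(x, trend, phi1, phi2, n):
    """n-step-ahead forecast: detrend, predict the residual, re-add the trend.

    x     : observations for the current week so far (one per 5 minutes)
    trend : weekly trend estimate Y_t, length 2016 (= 288 * 7)
    phi1, phi2 : AR(2) coefficients fitted to the residual process
    """
    t = len(x) - 1                           # current position within the week
    resid = x - trend[: len(x)]              # residual X^R = X - Y
    r2, r1 = resid[-2], resid[-1]            # two most recent residual values
    for _ in range(n):                       # AR(2) recursion; future noise is
        r2, r1 = r1, phi1 * r1 + phi2 * r2   #   replaced by its mean, zero
    return trend[(t + n) % len(trend)] + r1  # add the trend back
```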


Figure 2: Plot of the autocorrelation function. X axis: day. Y axis: autocorrelation.

    3 Model of Non-stationarity

    3.1 Weekly Trend in the Data

Let X_t denote the process over a week; that is, t ranges over 288 \times 7 = 2016 values. We decompose X_t into a trend subprocess and a residual subprocess, i.e.,

X_t = Y_t + X^R_t    (1)

where Y_t denotes the trend subprocess and X^R_t denotes the residual subprocess. At each time instant t, Y_t is deterministic and X^R_t is a random variable. The index t is computed from i, the day of the week, and j, the time of day, as follows:

t = Ni + j    (2)

where N is the number of samples in a day.

In [8], the trend is modeled on a daily basis. The reason we model the process at a weekly level is indicated by Figure 2, which plots the autocorrelation function of three weeks of data. Note that the strongest correlation occurs at a lag of one week, suggesting a weekly pattern. Also, Figure 3 plots the spectrum of the data, obtained from the Fourier transform of 20 weeks of data; the X axis represents the number of cycles per week. There are two prominent peaks away from the origin, at 1 and 7, which correspond to a weekly cycle and a daily cycle, respectively. The spectrum shows a strong frequency component at the weekly cycle, while the component at the daily level (7 cycles/week) is relatively weaker. This justifies modeling at the weekly level.
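These cycles can be checked directly from the data. The sketch below, under the same 288 samples/day layout, computes the sample autocorrelation at lags of one day and one week, and a spectrum whose frequency axis is in cycles/week; the input file name is a placeholder.

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation of a 1-D series for lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

N = 288                                  # samples per day (5-minute bins)
x = np.loadtxt("http_ops_per_sec.txt")   # placeholder for the collected series

rho = acf(x, 8 * N)
print("autocorrelation at a 1-day lag :", rho[N])
print("autocorrelation at a 1-week lag:", rho[7 * N])

# Spectrum with the frequency axis in cycles/week (2016 samples per week).
spectrum = np.abs(np.fft.rfft(x - x.mean()))
freq = np.fft.rfftfreq(len(x), d=1.0 / (7 * N))
# Peaks near freq = 1 and freq = 7 correspond to weekly and daily cycles.
```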


Figure 3: Spectrum of data. X axis: cycles/week. Y axis: spectral amplitude.

    3.2 Trend Estimation

This section addresses the estimation of the trend component Y_t in Equation (1). Intuitively, Y_t should be a smooth process, much smoother than the original process X_t.

In contrast with [8], our scheme uses only three consecutive weeks of the eight months of data to estimate each Y_t. We restrict ourselves to such subsets for several reasons. First, we reduce the dependence on historical data. This is important since the amount of historical data may be limited due to constraints on disk space, network changes (which make older data obsolete), and other factors. Second, the trend itself may change over time. Numerous factors can affect the trend, such as network upgrades or large scale changes in work assignments within the company, and usually those factors are not predictable. In our data, for example, HTTP operations per second increase over a period of months. Including more than a few weeks of data may therefore adversely affect the trend model, so we restrict ourselves to three weeks of data.

Figure 4 shows the power spectral density of a week of data. Since we view the whole process as the superposition of two subprocesses, the trend can be viewed as mainly containing low frequency components, while the high frequency parts correspond to the residual process. Thus we use a low pass filter to remove the high frequency components, and the filtered data are then averaged to obtain the trend. The filter we use is a fourth order Butterworth digital low pass filter with normalized cutoff frequency 0.2. For background on Butterworth filters, see [10].

The algorithm for estimating Y_t is as follows. Let the impulse response of the low pass filter be h(t). For week i, the data are X^i_t, and the filtered result is W^i_t:

W^i_t = h(t) * X^i_t    (3)

where * denotes convolution.


Figure 4: Power spectrum density plot of a typical weekday. X axis: normalized frequency. Y axis: power spectrum magnitude (dB).

Then Y_t is estimated as

Y_t = \frac{1}{I} \sum_{i=1}^{I} W^i_t    (4)

where I is the total number of weeks; in our work, I = 3. The process X_t can then be detrended by subtracting Y_t, i.e.,

X^R_t = X_t - Y_t    (5)
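A sketch of this trend estimation in Python follows. We use scipy's Butterworth design; note that filtfilt performs zero-phase filtering so the trend is not delayed, whereas Equation (3) as written is a causal convolution, for which scipy.signal.lfilter would be the literal analogue.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_trend(weeks):
    """Estimate the weekly trend Y_t (Equations (3) and (4)).

    weeks : array of shape (I, 2016) holding I consecutive weeks; here I = 3.
    """
    b, a = butter(4, 0.2)              # 4th-order low pass, normalized cutoff 0.2
    low = np.array([filtfilt(b, a, w) for w in weeks])   # W^i_t, Equation (3)
    return low.mean(axis=0)            # average over the I weeks, Equation (4)

# Detrending, Equation (5):
#   trend = estimate_trend(last_three_weeks)
#   residual = current_week - trend
```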

In [8], the trend model incorporates monthly effects. Further, that trend model assumes that the interaction between time-of-day and day-of-week can be expressed in an additive manner. In essence, this model assumes that the patterns for different days in a week should be similar. We refer to this as the daily model of trend.

Our approach, which we refer to as the weekly model, differs from the foregoing. We observe that the interaction of time-of-day and day-of-week effects is not additive. Thus, it makes sense to have a separate interaction term for each combination of time-of-day and day-of-week. Further, the assumption of similar patterns is only acceptable for weekdays; weekends have very different patterns from weekdays and so cannot be modeled similarly. (Actually, the work in [8] only considered weekdays.) Still, the daily model is unable to capture pattern differences between weekdays, except through the mean. While the weekly model more accurately reflects the trend process, it has the disadvantage of introducing more parameters.

The effectiveness of the two modeling strategies is compared in Figure 5 and Figure 6 for five workdays. The residual process is obtained by applying the weekly and daily models to the same length of data (three weeks).


Figure 5: Residual process after subtracting the trend modeled by the weekly model (five days, Monday-Friday; variance = 0.0121). X axis: days. Y axis: residual value (HTTP op/s).

Figure 6: Residual process after subtracting the trend modeled by the daily model (five days, Monday-Friday; variance = 0.0176). X axis: days. Y axis: residual value (HTTP op/s).


Figure 7: Residual after subtracting the trend. X axis: time index in a day. Y axis: HTTP operations after removing the trend.

From the comparison we can see that the weekly model detrends the original process better, and the residual process variance is smaller. This is not surprising given the previous discussion.

    4 Modeling the Residual Process

This section discusses the modeling of X^R_t, the residual process. The residual process is estimated by subtracting the estimate of Y_t from X_t. Figure 7 plots the data of one day after subtracting the trend. Clearly, this plot is more consistent with a stationary process than when the trend is present. However, the data are highly bursty.

We consider predicting the residual process at time t + n, which requires the estimation of its mean and variance. We use an autoregressive (AR) model. The AR model is a simple and effective method in time series modeling, and it has also been adopted in other work analyzing network traffic (e.g., [6, 7]). A second order AR model is used, denoted AR(2). That is,

X^R_t = \phi_1 X^R_{t-1} + \phi_2 X^R_{t-2} + \epsilon_t    (6)

where X^R_t denotes the residual process, \phi_1 and \phi_2 are the two AR(2) parameters, and \epsilon_t is the error term, which is assumed to be an independently and identically distributed (i.i.d.) Gaussian random variable with a mean of zero and a variance of \sigma^2_\epsilon. When X^R_t and X^R_{t-1} are known, X^R_{t+1} is predicted as

\hat{X}^R_{t+1} = \phi_1 X^R_t + \phi_2 X^R_{t-1}    (7)

For an n step prediction,

\hat{X}^R_{t+n} = \phi_1 \hat{X}^R_{t+n-1} + \phi_2 \hat{X}^R_{t+n-2}    (8)


We need to add back the subtracted trend term Y_t to obtain the final prediction:

\hat{X}_{t+n} = Y_{t+n} + \hat{X}^R_{t+n}    (9)
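The paper does not spell out how \phi_1 and \phi_2 are estimated. A standard choice for AR models is the Yule-Walker method, sketched below for the AR(2) case.

```python
import numpy as np

def fit_ar2(resid):
    """Yule-Walker estimates of the AR(2) parameters and the noise variance.

    Solves rho(1) = phi1 + phi2 * rho(1) and rho(2) = phi1 * rho(1) + phi2,
    where rho(k) is the sample autocorrelation of the residual at lag k.
    """
    r = np.asarray(resid, dtype=float) - np.mean(resid)
    c = np.array([np.dot(r[: len(r) - k], r[k:]) / len(r) for k in range(3)])
    rho1, rho2 = c[1] / c[0], c[2] / c[0]
    phi1, phi2 = np.linalg.solve([[1.0, rho1], [rho1, 1.0]], [rho1, rho2])
    sigma2 = c[0] * (1.0 - phi1 * rho1 - phi2 * rho2)  # variance of eps_t
    return phi1, phi2, sigma2
```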

    5 Model Evaluation

This section evaluates the effectiveness of the model developed in the preceding sections. Our focus is X^R_t; indeed, we treat the Y_t process as deterministic. Thus, the evaluation here addresses the AR model. We use the characteristic function of the AR model to assess the limits of AR-based prediction. Following this, we study in detail the error term \epsilon_t in the AR model.

    5.1 Characteristic Function

Here, we study the accuracy of an n step prediction. As we predict further into the future, the variance of the prediction error increases; this can be analyzed in a straightforward manner. As is common in the literature (e.g., [8]), the AR(2) model can be expressed as

X^R_t = \sum_{j=0}^{\infty} \frac{\lambda_1^{j+1} - \lambda_2^{j+1}}{\lambda_1 - \lambda_2} \epsilon_{t-j}    (10)

where \lambda_1 and \lambda_2 are the roots of the equation \lambda^2 - \phi_1 \lambda - \phi_2 = 0. For the AR process to be stable, we require |\lambda_1| < 1 and |\lambda_2| < 1. Denoting the coefficient of \epsilon_{t-j} by

G(j) = \frac{\lambda_1^{j+1} - \lambda_2^{j+1}}{\lambda_1 - \lambda_2}    (11)

Equation (10) can be written compactly as

X^R_t = \sum_{j=0}^{\infty} G(j) \epsilon_{t-j}    (12)

We refer to G(j) as the characteristic function of the AR(2) model; a typical example is plotted in Figure 8.
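The characteristic function is easy to compute numerically from the fitted coefficients; a sketch follows. np.roots returns \lambda_1 and \lambda_2, which may be complex conjugates, in which case the ratio in Equation (11) is still real.

```python
import numpy as np

def char_fn(phi1, phi2, j_max):
    """Characteristic function G(j) of an AR(2) model, Equation (11)."""
    lam1, lam2 = np.roots([1.0, -phi1, -phi2])  # roots of lam^2 - phi1*lam - phi2
    if max(abs(lam1), abs(lam2)) >= 1.0:
        raise ValueError("AR(2) model is not stable")
    j = np.arange(j_max + 1)
    if np.isclose(lam1, lam2):                  # repeated root: limit of Eq. (11)
        g = (j + 1) * lam1 ** j
    else:
        g = (lam1 ** (j + 1) - lam2 ** (j + 1)) / (lam1 - lam2)
    return np.real_if_close(g)
```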


Figure 8: A typical characteristic function of an AR(2) model. X axis: j. Y axis: G(j).

    5.2 Prediction Error

For the purpose of prediction, X^R_{t+n} can be expressed as

X^R_{t+n} = \sum_{j=0}^{n-1} G(j) \epsilon_{t+n-j} + \sum_{j=n}^{\infty} G(j) \epsilon_{t+n-j}    (13)

The first term on the right-hand side is based on future noise values at times t+1 through t+n. The second term is determined by the present and past data at times t, t-1, t-2, .... For an n step prediction, the first term is unknown. We estimate it by the expected value of \epsilon_{t+n-j}, 0 \le j \le n-1, which is 0. Thus, X^R_{t+n} is estimated as

\hat{X}^R_{t+n} = \sum_{j=n}^{\infty} G(j) \epsilon_{t+n-j}    (14)

and the prediction error is

e(n) = \sum_{j=0}^{n-1} G(j) \epsilon_{t+n-j}    (15)

The error term is a linear combination of zero mean i.i.d. Gaussian random variables. Thus, it is also Gaussian, with a mean of zero and variance

\sigma_e^2(n) = \sum_{j=0}^{n-1} G^2(j) \sigma^2_\epsilon    (16)

For an n step prediction, from Equation (16), the error variance is determined by two factors: the error variance of a one step prediction, \sigma^2_\epsilon, and the characteristic function G. For an AR process, the variance of \epsilon_t is fixed, so G(j) determines how the error variance increases with n. For the characteristic function plotted in Figure 8, we see that the influence of past values is negligible after just five steps. That is, when we predict more than five steps ahead, the predicted value is close to zero, and the error variance is large.
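Given the characteristic function values and the one step error variance, Equation (16) is a one-line computation; a sketch:

```python
import numpy as np

def error_variance(g, sigma2, n):
    """Variance of the n-step prediction error, Equation (16).

    g      : characteristic function values G(0), G(1), ... (e.g., from char_fn)
    sigma2 : variance of the one step error term eps_t
    """
    return sigma2 * float(np.sum(np.asarray(g[:n]) ** 2))
```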

To elaborate on the last point, we introduce the notion of predictable steps for an AR model. The intention is to quantify the number of steps into the future for which it is meaningful to do prediction.


Figure 9: Comparison of the autocorrelation functions of the residual process and the fitted AR process. X axis: lag. Y axis: autocorrelation.

Let s^2(t+n) be the energy of the predicted value \hat{X}^R_{t+n}, i.e., s^2(t+n) = (\hat{X}^R_{t+n})^2. The ratio between s^2(t+n) and \sigma_e^2(n) provides insight into the quality of an n step prediction at time t. This ratio is

snr(t, n) = 10 \log \frac{s^2(t+n)}{\sigma_e^2(n)}    (17)

This expression has the form of a signal-to-noise ratio: the signal is the predicted value, and the noise is the prediction variance. From the discussion of the characteristic function, we know that as AR predictions extend further into the future, the predicted value of an AR process decreases and its variance increases. Thus, snr is a decreasing function of n.

The foregoing discussion suggests that snr can be used as a stopping rule for AR-based prediction: when snr drops below a specified threshold, AR-based prediction should not continue. However, for our application, we can still use the trend to give us information about the future. We can set \hat{X}^R_{t+n} to 0, so that \hat{X}_{t+n} = Y_{t+n}. Thus, for larger n, we only consider the trend component when making a prediction.
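A sketch of this stopping rule follows, reusing the error_variance helper above. The dB threshold is a tunable parameter of our own choosing, not a value given in the paper.

```python
import numpy as np

def snr_db(xr_hat, g, sigma2, n):
    """Signal-to-noise ratio of an n step prediction, Equation (17), in dB."""
    return 10.0 * np.log10(xr_hat ** 2 / error_variance(g, sigma2, n))

def predict_with_fallback(trend_val, xr_hat, g, sigma2, n, threshold_db=0.0):
    """Use the AR prediction while the snr is acceptable; else trend only."""
    if snr_db(xr_hat, g, sigma2, n) < threshold_db:
        xr_hat = 0.0                  # beyond the predictable steps: X^R -> 0
    return trend_val + xr_hat         # Equation (9)
```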

We briefly consider a second issue with using an AR model, relating to long-range dependence in the data. Others have noted that network traffic exhibits long-range dependence (e.g., [11]). Figure 9 plots the autocorrelation function of the residual process in our data together with the autocorrelation function of the AR process whose parameters are estimated from the residual process. We see that for large lags, the autocorrelation of the AR process decays exponentially fast, while the correlation of the data decays much more slowly, exhibiting long-range dependence. This failure to capture long-range dependence is a drawback of the AR model and affects its ability to predict the residual process, especially for larger n.
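The AR(2) curve in Figure 9 can be generated from the fitted coefficients via the Yule-Walker recursion for the theoretical autocorrelation; a sketch:

```python
import numpy as np

def ar2_acf(phi1, phi2, max_lag):
    """Theoretical autocorrelation of a stable AR(2) process.

    Uses rho(0) = 1, rho(1) = phi1 / (1 - phi2), and the recursion
    rho(k) = phi1 * rho(k-1) + phi2 * rho(k-2) for k >= 2.
    """
    rho = np.empty(max_lag + 1)
    rho[0] = 1.0
    rho[1] = phi1 / (1.0 - phi2)
    for k in range(2, max_lag + 1):
        rho[k] = phi1 * rho[k - 1] + phi2 * rho[k - 2]
    return rho

# Compare with the sample autocorrelation of the residual (the acf helper
# from Section 3): an AR(2) curve decays geometrically, the data do not.
```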


    5.3 Probability of Violating a Threshold

An important application of predictive models for proactive management is estimating the probability of violating a threshold (e.g., as in [8]). This is accomplished by estimating the mean and variance of a future observation and then using the assumption of Gaussian error terms to estimate the probability of violating a threshold.

To provide more detail, we consider an n step prediction. The random variable X_{t+n} can be expressed as X_{t+n} = \hat{X}_{t+n} + e(n), where \hat{X}_{t+n} is given by

\hat{X}_{t+n} = Y_{t+n} + \hat{X}^R_{t+n}    (18)

which is the sum of the trend and the predicted value of the residual process. The distribution of the random variable X_{t+n} is determined by the distribution of e(n), which is a zero mean Gaussian random variable. Then X_{t+n} is also a Gaussian random variable, with its mean given by Equation (18) and its variance by Equation (16).

We consider a threshold T for which we want to estimate the probability that X_{t+n} > T. As in [8], we transform T into units of X^R_{t+n}, which results in a time-varying threshold whose value is th_{t+n} at time t + n. Thus, it suffices to estimate the probability that X^R_{t+n} > th_{t+n}, given that the mean of X^R_{t+n} is \hat{X}^R_{t+n} and its variance is \sigma_e^2(n). Let P_t(n) denote this probability. Then

P_t(n) = 1 - \Phi\left(\frac{th_{t+n} - \hat{X}^R_{t+n}}{\sigma_e(n)}\right) = \Phi\left(\frac{\hat{X}^R_{t+n} - th_{t+n}}{\sigma_e(n)}\right)    (19)

where \Phi(\cdot) is the cumulative distribution function of the standard Gaussian distribution N(0, 1).
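Equation (19) translates directly into code with the standard normal CDF; a sketch:

```python
from scipy.stats import norm

def violation_prob(xr_hat, th, sigma_e):
    """P(X^R_{t+n} > th_{t+n}) under the Gaussian assumption, Equation (19)."""
    return norm.cdf((xr_hat - th) / sigma_e)

# Example: predicted residual 0.3, threshold 0.5, error standard deviation 0.2
# gives violation_prob(0.3, 0.5, 0.2) = Phi(-1), or about 0.16.
```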

    5.4 Distribution of Error Term

The above analysis is based on the assumption of a Gaussian distribution for the error term. However, the error distribution may not be Gaussian. Figure 10 is a quantile-quantile plot of the one step prediction error distribution against that of a Gaussian. The straight line corresponds to the Gaussian, while the thick line is the error term. We notice considerable differences in the tail behavior of the two distributions: the tail of the prediction error is fatter than that of the Gaussian. That is, when the threshold falls in the tail, the Gaussian assumption underestimates the probability of a threshold violation. Apart from the tails, however, the quantile plot shows that the prediction error distribution agrees well with the Gaussian. In practice, we are usually concerned with cases where the probability of a threshold violation is reasonably large, and tend to neglect small probabilities, so the error in the tail is not a major concern. In that sense, employing a Gaussian assumption is an acceptable approximation for the error distribution.
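A check along the lines of Figure 10 can be produced with scipy's probability plot applied to the one step prediction errors; a sketch, assuming the residual series and fitted coefficients from the earlier steps:

```python
import matplotlib.pyplot as plt
from scipy import stats

def check_error_normality(resid, phi1, phi2):
    """Q-Q plot of the one step prediction errors against a Gaussian."""
    eps = resid[2:] - phi1 * resid[1:-1] - phi2 * resid[:-2]
    stats.probplot(eps, dist="norm", plot=plt.gca())
    plt.show()
```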


Figure 10: Quantile-quantile (normal probability) plot of the prediction error. X axis: data. Y axis: probability.

    6 Conclusions

Proactive management can allow service providers to take action in advance of service disruptions. However, being proactive requires a capability to predict system behavior. This paper investigates issues related to prediction in the context of web server data.

Our focus has been to extend the work in [8] in several ways: we enhance the trend model by employing filtering techniques and by considering the trend over a shorter time span; we examine the assumptions underlying the stationary component, including the assumption of a Gaussian distribution for the error terms in the residual process; we analyze the limits of prediction using AR models; and we show that long-range dependence remains in the data after the trend and stationary components are removed, which poses challenges for prediction.

We see a number of areas that should be investigated more fully. A fundamental issue is the burstiness and non-stationarity of the data. Our AR(2) model is unable to deal with long-range dependence; a model capable of doing so may improve prediction precision. Further, we model the trend as a fixed process. Prediction performed on the residual process is limited by the capability of the AR model and the noisy nature of the residual process. We may improve our predictions by predicting the trend instead, which is much less noisy than the residual process; this requires an adaptive trend model that captures the evolution of the trend. Indeed, predictions of the trend itself may be of more interest to network managers.

    Acknowledgments

We wish to thank Sheng Ma for helpful comments and stimulating discussions.


    References

[1] George E. P. Box, Gwilym M. Jenkins, Time Series Analysis: Forecasting and Control, Prentice Hall, 1976.

[2] M. Dutta, Executive Editor, Economics, Econometrics and the Links, North Holland, 1995.

[3] Andrei S. Monin, Weather Forecasting as a Problem in Physics, MIT Press, Cambridge, MA, 1972.

[4] Adrian Cockcroft, Watching Your Web Servers, SunWorld Online, http://www.sunworld.com/swol-03-1996/swol-03-perf.html.

[5] C. S. Hood, C. Ji, Proactive Network Fault Detection, Proceedings of INFOCOM, Kobe, Japan, 1997.

[6] P. Hoogenboom, J. Lepreau, Computer System Performance Problem Detection Using Time Series Models, Proc. of the Summer USENIX Conference, pp. 15-32, 1993.

[7] Marina Thottan, C. Ji, Adaptive Thresholding for Proactive Network Problem Detection, Third IEEE International Workshop on Systems Management, pp. 108-116, Newport, Rhode Island, April 1998.

[8] Joseph L. Hellerstein, Fan Zhang, P. Shahabuddin, An Approach to Predictive Detection for Service Management, Symposium on Integrated Network Management, 1999.

[9] Thomas Kailath, Kalman Filtering: Theory and Practice, Prentice Hall, 1993.

[10] Bernard Widrow, Samuel D. Stearns, Adaptive Signal Processing, Prentice Hall, 1985.

[11] W. E. Leland, M. S. Taqqu, Walter Willinger, D. V. Wilson, On the Self-Similar Nature of Ethernet Traffic (Extended Version), IEEE/ACM Transactions on Networking, Vol. 2, No. 1, pp. 1-15, Feb. 1994.